floatingsheep: A Quick Look at Global Language Patterns on Twitter

Today's post comes out of some testing we were doing on language within our data; since the results were interesting, we thought we'd share. This is a first step in a longer process of comparing language use at the global scale, so much remains to be done.

Starting from a 10% sample of all global geotagged tweets from the calendar year 2013, we collected tweets that used a variety of non-Latin characters as a proxy for linguistic prevalence (see the map titles below for the list of characters searched). Using composite counts of what we found to be the five most commonly used characters in each of the given languages, we mapped normalized values at the country level in order to understand where these languages are most dominant. In other words, these maps represent the relative level of tweets containing non-Latin characters compared to all tweets; the US has plenty of tweets with Arabic, Chinese and Korean characters but these numbers are small compared to the overall number of tweets within the country.
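As a rough illustration of the normalization described above, here is a minimal Python sketch. It is not the code we actually ran: the function name, the `(country, text)` input format, and the rule that a tweet counts once if it contains any of a language's five markers are all assumptions made for the example. Only the character lists come from the map titles.

```python
from collections import Counter

# Marker lists taken from the map titles in this post.
CHARSETS = {
    "arabic": ["ل", "ن", "م", "ي", "ا"],
    "chinese": ["的", "一", "是", "不", "了"],
    "korean": ["뭐", "그", "안", "근데", "거"],
}


def language_share_by_country(tweets, markers):
    """Return {country: share of tweets containing at least one marker}.

    `tweets` is an iterable of (country_code, text) pairs (hypothetical format).
    Assumption: a tweet is counted once if it contains any marker at all.
    """
    totals = Counter()
    matches = Counter()
    for country, text in tweets:
        totals[country] += 1
        if any(marker in text for marker in markers):
            matches[country] += 1
    # Normalize by each country's total so high-volume countries don't dominate.
    return {country: matches[country] / totals[country] for country in totals}


if __name__ == "__main__":
    sample = [
        ("CN", "我是学生"),
        ("CN", "hello"),
        ("US", "good morning"),
        ("US", "这是一本书"),
        ("US", "lunch time"),
        ("US", "ok"),
    ]
    print(language_share_by_country(sample, CHARSETS["chinese"]))
```

In this toy sample the US and China each have one matching tweet, but the share is higher for China because its total is smaller, which is the same effect the maps are built to show.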

There are some issues with the data we collected. For instance, we relied on non-definitive sources for our list of the most commonly used characters, and constraints in the way we've structured our data (how we treat boolean queries, along with computing constraints) make our data somewhat incomplete. Still, the initial results provide a reasonable snapshot of where Twitter is being used by people tweeting in languages that can't be easily expressed in Latin characters.

Arabic Characters: ل ن م ي ا

The spatial pattern of Arabic-language tweeting is interesting in that it seems to mimic a conventional distance decay effect. Saudi Arabia is the undoubted center of Arabic tweeting; its immediate neighbors have relatively lower amounts, their neighbors have lower concentrations still, and there are practically no discernible differences once you reach Sub-Saharan Africa to the south, India to the east, or Europe to the north and west.

Chinese Characters: 的一是不了

While Japan has the highest absolute number of tweets containing Chinese characters, because written Japanese relies on Chinese characters, the relative measure shows China to be, quite unsurprisingly, the center of Chinese-language tweeting. The territory of Greenland shows up as well, mainly because its total number of tweets is so low that the few tweets with Chinese characters make up a relatively large share. We could, of course, account for this by requiring certain thresholds but for this initial look, we left it in. Given the increasing dominance of China within the global economy, it's somewhat interesting to see that there is very little Chinese-language tweeting happening in other parts of the world.
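The threshold idea mentioned above could look something like the following Python sketch. We did not apply it to these maps; the function name, the dictionary inputs, and the cutoff of 1,000 tweets are hypothetical, and the counts in the usage example are invented for illustration rather than drawn from our data.

```python
def shares_above_threshold(matches, totals, min_total=1000):
    """Return {country: share}, skipping countries with too few tweets overall.

    `matches` and `totals` map country codes to tweet counts (hypothetical format).
    `min_total` is an arbitrary cutoff chosen only for this example.
    """
    return {
        country: matches.get(country, 0) / total
        for country, total in totals.items()
        if total >= min_total
    }


if __name__ == "__main__":
    # Invented counts, purely to show the effect of the cutoff.
    totals = {"GL": 40, "CN": 50000}
    matches = {"GL": 4, "CN": 20000}
    print(shares_above_threshold(matches, totals))
```

With the cutoff in place, a country with only a handful of tweets drops out of the map entirely instead of showing an inflated share.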

Korean Characters: 뭐 그 안 근데 거

The final language we explored was Korean, and while it is not surprising that South Korea has by far the most Korean tweeting, it is interesting to note that North Korea, despite its almost complete disconnection from the global system, also appears on the map. Again, it seems that the scattering of relatively high scores for places such as Greenland and Somalia has more to do with the relatively low level of overall tweeting in these places than with some previously unknown concentration of Korean speakers.

While there's not much definitive here, we believe this to be a useful, if incredibly brief, look at how online spaces such as Twitter remain connected to conventional, offline geographies, such as those of language and culture. And given the recent emergence of domain names in non-Latin characters, these maps might offer clues into the evolving geography of domain names, while also offering some potential for future research using such data.