April 09, 2010

Mapping Wikipedia Biographies

The map below is a visualisation of references to places within 423,846 biography articles in the English version of Wikipedia. The definitions of these terms and the methodology used to obtain these data are discussed in more detail below.


Now compare this map to the map of actual population density below.


The differences are quite astonishing. What one sees is that articles about people in Wikipedia are highly likely to reference particular parts of the world (the US and Western Europe). This is a geography of people that is in no way reflective of the actual distribution of population on our planet.

Of course, because the data only include biography articles in the English version of Wikipedia, they are biased towards English-speaking countries. This fact helps explain the concentration of articles that reference the US and the UK. However, language alone does not explain why countries where English is widely used (e.g. India) have a smaller presence than non-English-speaking countries in Western Europe.

Most importantly, it is clear that Wikipedia has not yet attained its goal of storing "the sum of all human knowledge." Wikipedia guidelines specify that biographies should only be about notable people, and this map suggests that there are more notable people in Europe and North America (at least in the eyes of Wikipedians). Not to knock our home continents, but it seems likely (especially after looking at some of the people deemed "notable") that Wikipedia is simply reflecting its user base, who are disproportionately from these places.

In any case it shows that there are likely still a lot of possibilities out there for new Wikipedia articles (despite claims that Wikipedians are running out of new topics to write about).

And in the big picture it again raises questions about who participates in online discussions and what is discussed and documented in these conversations.

The data used to create these maps were collected by Adrian Popescu and are available here for anyone interested in playing with them. The data were actually collected through a rather complicated process that we'll explain below.

First of all, we need to define biography articles; basically, any article about a person in Wikipedia (e.g. Angela Merkel, Ron Jeremy or Gary Brolsma). A list of biographies was created using data harvested from the list of occupations.

We then geolocated each biography article. This was done by counting the number of references to place names in each person's biography and then mapping only the most mentioned place in each article. Ranking of place names was conducted not only using the English version of the article, but also using the equivalent in up to seven languages (English, German, French, Dutch, Spanish, Italian and Portuguese). The thinking behind this method of ranking is simple: the more article versions mention a given location, the more relevant that location is to the concept. We have, however, also done some analysis with the 2nd, 3rd etc. most mentioned places in each article and will be publishing a post on this work soon (along with analysis of Wikipedia data by century and the geography of specific occupations, e.g. artists, politicians and footballers, within the encyclopaedia).
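
For readers who want to experiment with the idea, the ranking described above might be sketched roughly as follows. This is only an illustration and not the code actually used to build the dataset: the function names, the plain-text article versions and the small gazetteer of place names are all assumptions made for the example, and the tie-breaking rule (total mentions, then alphabetical order) is our own guess. It also matches place names as simple substrings, so it ignores the fact that the same place is often spelled differently across languages.

```python
from collections import Counter


def rank_places(versions, gazetteer):
    """Rank place names for one biography.

    versions:  dict mapping a language code to that version's article text
               (hypothetical input format).
    gazetteer: iterable of place names to look for (hypothetical).

    Places are ordered by how many language versions mention them, then by
    total number of mentions, then alphabetically (assumed tie-breakers).
    """
    version_hits = Counter()    # number of versions mentioning each place
    total_mentions = Counter()  # mentions summed over all versions
    for text in versions.values():
        lowered = text.lower()
        for place in gazetteer:
            # Naive substring count; real place-name matching is harder.
            n = lowered.count(place.lower())
            if n:
                version_hits[place] += 1
                total_mentions[place] += n
    return sorted(
        version_hits,
        key=lambda p: (-version_hits[p], -total_mentions[p], p),
    )


def top_place(versions, gazetteer):
    """Return the single most mentioned place, or None if nothing matched."""
    ranked = rank_places(versions, gazetteer)
    return ranked[0] if ranked else None


if __name__ == "__main__":
    # Made-up article snippets, purely for illustration.
    versions = {
        "en": "Born in Hamburg, she studied in Leipzig and works in Berlin.",
        "de": "Sie arbeitet in Berlin. Berlin ist die Hauptstadt.",
        "fr": "Elle travaille a Berlin, nee a Hamburg.",
    }
    gazetteer = ["Berlin", "Hamburg", "Leipzig", "Paris"]
    print(rank_places(versions, gazetteer))
    print(top_place(versions, gazetteer))
```

Only the first entry of the ranked list would end up on the map; the remaining entries correspond to the 2nd, 3rd etc. most mentioned places.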

It is clear that this method favors European locations at the expense of places in the rest of the world. Japanese and Arabic Wikipedias, for example, probably have a very different geography (something we are also working on mapping). The fact remains though, that the English language Wikipedia offers us a very particular worldview rather than access to "the sum of all human knowledge" (for the time being at least).

Hmmm....that reminds us, we should start up a Floatingsheep page at Wikipedia some time soon.

See also:

Adrian's analysis of Wikipedia: Adrian Popescu and Gregory Grefenstette, "Spatiotemporal Mapping of Wikipedia Concepts," JCDL 2010, June 21-25, Brisbane, Australia

...and some of our previous work on mapping Wikipedia here.

10 comments:

  1. Not that these graphs don't reveal disparity in Wikipedia's entries regarding the developing world, but what it makes me wonder is what this implies about how different countries use the internet. Why is Wikipedia such a fascination to the west but not so much elsewhere? How do countries outside of the west use the internet differently?

    Other ideas - could a map of internet use density (rather than population density) be used to make a better comparison? Obviously that would not address the imbalance of Wikipedia articles available but it might reveal something about the relative use (or lack thereof) of Wikipedia.
    - Is Wikipedia similarly imbalanced when studying non-English language use? Is the experience of Chinese or Indian users similarly locally skewed?

    Either way, great post!

  2. Very Nice Blog
    -
    http://lwiki.blogspot.com/

  3. Could also be a result of Wikipedia's definition of reliable secondary sources, the use of which is required for a biographical entry?

  4. Perhaps the reason why they have biographies mostly in America, UK, and areas around there is because there simply aren't a lot of famous or well-documented people in other areas like Africa and parts of Asia (not that I really know, frankly). Population doesn't really have a lot to do with it. How are they supposed to make a proper biographical entry without proper references on these people? Anyone can SAY a person did this or that, but how would you know that person is speaking the truth? There are too many different aspects to this subject to make a real conclusion from it.

  5. i think 1 of the most important reasons is that wiki doesnt have an official website in another language in those highly populated countries. in those countries they have their own Wiki in their own language, so instead of looking for knowledge on wikipedia, they look for their answers on their own 'Wiki' in a more understandable way. :)

  6. Very interesting, so glad to find your site!

    Karena

    Art by Karena

  7. I think it is an interesting view and way of looking at it. I always use Wikipedia when I work but I never really stopped to think if it was really reliable. I also speak multiple languages and find the analyses interesting.

  8. Interesting post, Mark! When I first looked at the two maps, my immediate reaction was to agree. Yes, the difference between geographical distribution of biographies and population density is quite astonishing.

    However, on second thought, I was wondering where that astonishment actually came from. So what makes this difference a difference -- and so strikingly astonishing?

    Maybe an alternative approach would be to have a careful look at what you actually do to bring the difference into the world. Admittedly, if someone had just told me about geo-distribution of Wikipedia biogs and population density, I would have thought: what on earth has one to do with the other? At least, it would not have been obvious to me.

    So what does the trick then? I think a big part of it is what you call "visualization". Projecting the two issues on the same world map somehow suggests that they /are/ related. "Look at the maps! It's obvious, isn't it?"

    An alternative way to explore "the difference" could be to have a look at how you (and we down here in the comment section!) bring it into being in the first place. What work do the maps do? What assumptions are embedded in this form of presentation and what makes them seem so immune to discussion? What does this form of visualization conceal?

    We rarely talk about these things, although it seems quite interesting to explore. Again, thanks for the stimulating post!

  9. @Jack: Comparing these data to internet usage stats is a very good idea. I'm working on getting the appropriate data at the moment.

    @Malte: Our maps certainly do create and reinforce knowledge and ways of seeing. Yet, at the same time, one of the reasons we create these visualisations is to challenge other ways of seeing and other assumptions about our world.

    I think there is a strong case to be made for comparing Wikipedia bios to population. Biography articles are representations of "notable people." Comparing the density of "notable people" to the density of all people therefore tells us something about the types of knowledge (and people) that we (or Wikipedia editors) prioritise and consider to be important.

    Thanks for all the comments.
