May 13, 2013

FAQ: The Geography of Hate

Dear Readers,

Thanks to everyone (well, almost everyone) for their comments and constructive critiques on our Geography of Hate map. In light of all of the different directions these comments have come from, we wanted to respond to some of the more common questions and misunderstandings all at once. Before commenting or emailing about the map, please keep the following in mind...

1. First, read our original post. Second, read through this FAQ. Third, read the "Details about this map" section included in the interactive map itself. We specifically spent time on these things in order to explain our approach, and they go into some detail about the methods we used. Nearly all of the critiques of our map are already addressed in one of these places. We're happy to engage and confident in our methodology (not that any approach is perfect), but please, use the skills your first teacher gave you and take the time to read.

2. If you are offended by these words, and we sincerely hope that you are, remember that they are the object of a research project. As such, we felt compelled to reproduce the words in full in order to be as clear as possible about our project. While we agree that the use of these slurs can be hurtful to some, especially the groups that they are targeted at, we believe that there is a difference between including them as the object of our study and using them as they are 'meant' to be used.

3. The map is based solely on geocoded data from Twitter, and does not reflect our personal attitudes about a given place. The map represents real tweets sent by real people, and is evidence that the feeling of anonymity provided by Twitter can manifest itself in an ugly way. If you feel that the place you live is more or less racist than somewhere else and this isn't reflected in the map, please start a conversation with your community about these issues.

4. In order to produce this map, we took the number of geotagged hateful tweets, aggregated them to the county level and then normalized this count by the overall number of tweets in that county. This means that the spatial distributions you see for the different variables are decidedly NOT showing population density. As we mentioned above, this is clearly stated in all of the previously written material accompanying the map. And because we are specifically looking at the geographic patterns of Twitter activity, it makes more sense to normalize by overall levels of Twitter activity than by population.

Were that not enough, however, the fact that there is so little activity on the map in California - home to an eighth of the entire US population, including the cities of Los Angeles, San Francisco and San Diego - should be a clue that something else besides population is at work in explaining these distributions. While we share with the infamous xkcd cartoon a distaste for non-normalized data, just because you thought for a second that maybe it was relevant in this case doesn't make it so. There are many possible explanations for some of the distributions that you can see, and we don't pretend to have all of the explanations. But population just isn't one.  
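
The normalization described in point 4 can be sketched in a few lines of Python. This is only an illustration, not the code used to build the map: the function name `normalized_rates`, the county labels and the counts are all made up for the example.

```python
from collections import Counter

def normalized_rates(hateful_tweet_counties, all_tweet_counties):
    """Return {county: hateful tweets / all tweets} for counties with any tweets.

    Each argument is a list with one county label per geotagged tweet.
    """
    hateful = Counter(hateful_tweet_counties)
    total = Counter(all_tweet_counties)
    # Counter returns 0 for counties that had no hateful tweets at all.
    return {county: hateful[county] / n for county, n in total.items() if n > 0}

# Hypothetical data: a busy county "A" and a quiet county "B".
all_tweets = ["A"] * 1000 + ["B"] * 50
hateful_tweets = ["A"] * 10 + ["B"] * 5

rates = normalized_rates(hateful_tweets, all_tweets)
for county, rate in sorted(rates.items()):
    print(county, f"{rate:.1%}")
```

In this toy data, county "A" has twice as many hateful tweets as county "B" in absolute terms, yet "B" ends up with the higher normalized rate, which is the kind of pattern the map is designed to show.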

5. This map includes ALL geotagged tweets for each of these words that were determined as negative. This is not a sample of tweets containing these words, but rather the entire population that meets our criteria. That being said, only around 1.5% of all tweets are geotagged, as it requires opting-in to Twitter's location services. Sure enough, that subset might be biased in a multitude of ways when compared with the entire body of tweets or even with the general population. But that does not mean that the spatial patterns we discover based on geotagged tweets should automatically be discarded - see for example some of our earlier posts on earthquakes and flooding.


6. 150,000 is in no way a "small" number. Yes, it is less than the total population of earth. Yes, it is less than the number of atoms in the universe. But no, it is not a small number, especially as it is the total population of the phenomenon rather than a sample (see #5). And if these 150,000 geotagged hateful tweets are only around 1.5% of the total number of hateful tweets, then extrapolating suggests that the actual number of tweets (both geotagged and not) containing such hateful words is quite a bit larger. Regardless, we think that 150,000 is a sufficiently large number to be quite depressed about the state of bigotry in our country.
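
The extrapolation in point 6 is a one-line calculation. The sketch below assumes, as the text does, that hateful tweets are geotagged at the same rate of roughly 1.5% as tweets in general; that assumption is not tested here, and the variable names are ours.

```python
geotagged_hateful = 150_000  # geotagged tweets coded as hateful (from the post)
geotag_share = 0.015         # "around 1.5%" of all tweets are geotagged (from the post)

# Assumption: hateful tweets are geotagged at the same rate as all tweets.
estimated_total = geotagged_hateful / geotag_share

print(f"Estimated hateful tweets, geotagged or not: {estimated_total:,.0f}")
```

Under that assumption the estimate comes to about 10 million tweets over the study period.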


7. Furthermore, given that each and every geotagged tweet including the words listed was read and manually coded by actual human beings (if you consider undergraduates to be human beings!), rather than automatically by a piece of software, 150,000 isn't an especially small number. For students to read just these 150,000 tweets, it took approximately 150 hours of labor. This isn't insignificant.

8. The original lists of words included were derived from http://en.wikipedia.org/wiki/List_of_ethnic_slurs and http://en.wikipedia.org/wiki/List_of_LGBT_slang and included the following words:

bitch
nigger
fag*
homo*
queer
dyke
Darky OR darkey OR darkie
gook*
gringo
honky OR honkey OR honkie
injun OR indian
monkey
towel head
Wigger OR Whigger OR Wigga
wet back OR wetback
cripple
cracker
honkey
fairy
fudge packer
tranny

A * indicates a list of lexeme variations was used, which accounts for alternate spellings of words. For example, "fag" was not just "fag," but also "fags", "faggot", "faggie", and "fagging", among other things. All geotagged tweets containing these terms were examined. All tweets that were not used in a derogatory manner were discarded during coding, and as a result some words no longer achieved a minimum number to be displayed on the map. For example, honky/honkey/honkie was discarded, as most of the tweets were positive references towards honky-tonk music and not slurs aimed at white people.  

In the end, we were also limited to words that could feasibly be coded by hand. For instance, the 5.5 million tweets with reference to "bitch" were excluded from the list. Students were paid roughly $10 per 1000 coded tweets, and therefore including the word "bitch" alone would have cost roughly $55,000 to manually check for sentiment. Tranny/tranney would have been under $200. While we're obviously interested in including a wider range of hateful terms in our analysis, our research funds, and thus the scope of this project, are extremely limited. It's not like we have billions of dollars in funding lying around. If you feel strongly, feel free to donate to http://humboldt.edu/giving and enter "The Geography of Hate Project" in your comments.
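
The budget arithmetic behind these figures is easy to check. The sketch below uses only numbers quoted in this FAQ (roughly $10 per 1000 coded tweets, 5.5 million tweets for the excluded term, 150,000 coded tweets, about 150 hours of labor); the function and variable names are hypothetical.

```python
RATE_PER_THOUSAND = 10  # dollars, "roughly $10 per 1000 coded tweets"

def coding_cost(num_tweets, rate_per_thousand=RATE_PER_THOUSAND):
    """Approximate cost in dollars of manually coding num_tweets."""
    return num_tweets / 1000 * rate_per_thousand

def tweets_within_budget(budget, rate_per_thousand=RATE_PER_THOUSAND):
    """Approximate number of tweets that can be coded for a given budget."""
    return budget / rate_per_thousand * 1000

excluded_term_cost = coding_cost(5_500_000)  # the term dropped for cost reasons
coded_cost = coding_cost(150_000)            # the tweets that ended up on the map
tweets_per_hour = 150_000 / 150              # coding pace implied by point 7

print(excluded_term_cost, coded_cost, tweets_per_hour, tweets_within_budget(200))
```

Read the other way around, a term costing "under $200" to code corresponds to fewer than about 20,000 tweets, and the pace in point 7 works out to roughly 1,000 tweets per coder-hour.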

9. If you are a disgruntled white male who feels that the persistence of hatred towards minority groups is a license to complain about how discrimination against you is being ignored, just stop. You can refer to all of our previous commentary on this issue from November. Though we have typically refrained from deleting asinine comments to this effect - those who choose to make these comments do more to prove themselves to be fools than we ever could - we fully reserve the right to delete any and all comments we believe to be unnecessary.

May 10, 2013

The Geography of Hate

UPDATE (5/13/13 @ 10:45pm): We have written and published a FAQ to respond to some of the questions and concerns raised in the comments here and elsewhere. Please review our comments there before commenting or emailing.

Following the 2012 US Presidential election, we created a map of tweets that referred to President Obama using a variety of racist slurs. In the wake of that map, we received a number of criticisms - some constructive, others not - about how we were measuring what we determined to be racist sentiments. In that work, we showed that the states with the highest relative amount of racist content referencing President Obama - Mississippi and Alabama - were notable not only for being starkly anti-Obama in their voting patterns, but also for their problematic histories of racism. That is, even a fairly crude and cursory analysis can show how contemporary expressions of racism on social media can be tied to any number of contextual factors which explain their persistence.

The prominence of debates around online bullying and the censorship of hate speech prompted us to examine how social media has become an important conduit for hate speech, and how particular terminology used to degrade a given minority group is expressed geographically. As we’ve documented in a variety of cases, the virtual spaces of social media are intensely tied to particular socio-spatial contexts in the offline world, and as this work shows, the geography of online hate speech is no different.

Rather than focusing just on hate directed towards a single individual at a single point in time, we wanted to analyze a broader swath of discriminatory speech in social media, including the usage of racist, homophobic and ableist slurs.

Using DOLLY to search for all geotagged tweets in North America between June 2012 and April 2013, we discovered 41,306 tweets containing the word ‘nigger’ and 95,123 referencing ‘homo’, among other terms. In order to address one of the earlier criticisms of our map of racism directed at Obama, students at Humboldt State manually read and coded the sentiment of each tweet to determine if the given word was used in a positive, negative or neutral manner. This allowed us to avoid using any algorithmic sentiment analysis or natural language processing, as many algorithms would have simply classified a tweet as ‘negative’ when the word was used in a neutral or positive way. For example, the word ‘dyke’, while often negative when referring to an individual person, was also used in positive ways (e.g. “dykes on bikes #SFPride”). The students were able to discern which uses were negative, neutral, or positive. Only those tweets used in an explicitly negative way are included in the map.

Tweets negatively referring to "Dyke"
Altogether, the students determined over 150,000 geotagged tweets with a hateful slur to be negative. Hateful tweets were aggregated to the county level and then normalized by the total number of tweets in each county. This then shows a comparison of places with disproportionately high amounts of a particular hate word relative to all tweeting activity. For example, Orange County, California has the highest absolute number of tweets mentioning many of the slurs, but because of its significant overall Twitter activity, such hateful tweets make up a smaller share of the total and therefore do not appear as prominently on our map. So when viewing the map at a broad scale, it’s best not to be covered with the blue smog of hate, as even the lower end of the scale includes the presence of hateful tweeting activity.

Even when normalized, many of the slurs included in our analysis display little meaningful spatial distribution. For example, tweets referencing ‘nigger’ are not concentrated in any single place or region in the United States; instead, quite depressingly, there are a number of pockets of concentration that demonstrate heavy usage of the word. In addition to looking at the density of hateful words, we also examined how many unique users were tweeting these words. For example, in the Quad Cities (East Iowa), 31 unique Twitter users tweeted the word “nigger” in a hateful way 41 times. There are two likely reasons for the higher proportion of such slurs in rural areas: demographic differences and differing social practices with regard to the use of Twitter. We will be testing the clusters of hate speech against the demographic composition of an area in a later phase of this project.

Hotspots for "wetback" Tweets
Perhaps the most interesting concentration comes for references to ‘wetback’, a slur meant to degrade Latino immigrants to the US by tying them to ‘illegal’ immigration. Ultimately, this term is used most in different areas of Texas, showing the state’s centrality to debates about immigration in the US. But the areas with significant concentrations aren’t necessarily that close to the border, nor do other border states that feature prominently in debates about immigration contain significant concentrations.

Ultimately, some of the slurs included in our analysis might not have particularly revealing spatial distributions. But, unfortunately, they show the significant persistence of hatred in the United States and the ways that the open platforms of social media have been adopted and appropriated to allow for these ideas to be propagated.

Funding for this map was provided by the University Research and Creative Activities Fellowship at HSU. Geography students Amelia Egle, Miles Ross and Matthew Eiben at Humboldt State University coded tweets and created this map.

The full interactive map is available here: http://users.humboldt.edu/mstephens/hate/hate_map.html

May 06, 2013

Tweeting the AAGs

Now that we've all had a couple of weeks after the AAGs to relax and make fun of certain unnamed party-animals, we thought we would reflect on how the conference itself was reflected in the Twittersphere. With comments abounding that there was more conference-related Twitter activity than ever before, we wanted to see if we couldn't uncover some more specific trends.

Thanks to an enterprising geographer, we have an archive of all 3,154 tweets with the official conference hashtag, #AAG2013. We know from this database that those tweets came from a total of 697 users, of which the top 10 users contributed about 23% of the total number of tweets.

But by cross-referencing the Eventifier database with DOLLY's archive of geotagged tweets with the conference hashtag, we can try to understand how and where some geographers tweet and whether geographers fit the overall profile of Twitter users in terms of geotagging. Do geographers geotag their tweets at a higher rate than the average user because of their heightened awareness of spatial issues? Or do they intentionally avoid geotagging their tweets due to sensitivity to location privacy?

According to DOLLY, there were just 137 geotagged tweets with #AAG2013, coming from just 41 users. So, rather than adhering to the oft-cited rule of ~1.5% of all tweets being geotagged, geographers in Los Angeles for the AAGs actually geotagged more than 4% of their conference-related tweets. Of the 137, 127 actually have exact lat/lon coordinates, so we're able to do some mapping at the urban scale in order to see where geographers were tweeting about the conference.
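
The comparison with the ~1.5% rule of thumb can be reproduced from the counts given above. This is a quick sketch with our own variable names; it assumes the Eventifier archive and DOLLY cover the same set of #AAG2013 tweets and users, which we have not verified.

```python
total_tweets = 3_154     # all #AAG2013 tweets in the archive
geotagged_tweets = 137   # geotagged #AAG2013 tweets found in DOLLY
total_users = 697        # users in the archive
geotagging_users = 41    # users who sent at least one geotagged tweet
baseline_rate = 0.015    # the oft-cited ~1.5% of tweets being geotagged

tweet_rate = geotagged_tweets / total_tweets
user_rate = geotagging_users / total_users
ratio_to_baseline = tweet_rate / baseline_rate

print(f"Geotagged share of tweets: {tweet_rate:.1%}")
print(f"Share of users who geotagged: {user_rate:.1%}")
print(f"Relative to the 1.5% baseline: {ratio_to_baseline:.1f}x")
```

By this rough measure, conference tweets were geotagged at a little under three times the usual rate.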

And because only 8 tweets came before the AAG started on April 9, and only 5 came after it ended on April 13, and these are roughly congruent with the 16 tweets outside of Los Angeles County, we'll focus on the 113 of 127 tweets with exact coordinates which were located in downtown LA. In other words, because most of the AAG-related tweeting happened during the conference and in its general proximity, it isn't too interesting to focus on the other locations from which the hashtag was being used.

AAG-related Tweeting Activity in Downtown Los Angeles
As is evident from this map, the vast majority of the tweets referencing #AAG2013 came from the Westin Bonaventure Hotel, the primary site of the conference. The second highest concentration of tweeting activity came from the Millennium Biltmore Hotel and LARTA, the secondary conference site and location of our IronSheep event, respectively, which were just half-a-block or so apart, and immediately adjacent to Pershing Square. But given the lack of free conference Wi-Fi and general lack of cell phone service in the Biltmore, it's even less surprising that it had quite a bit less geotagged tweeting activity. Other small pockets of tweeting activity around the downtown seem to be located in the general vicinity of bars that were known to be frequented by geographers, such as the Library Bar, which hosted multiple conference related parties over the course of the week.

As is the case with many of our maps, there's nothing too surprising here. Of course it makes sense that people tweet about the conference from the location of the conference. But we'd still be careful about reading too much into these results. More specifically, we shouldn't get the impression that geographers go to the AAGs primarily to sit in stuffy hotel rooms giving paper presentations rather than gallivant around town with old friends. Instead, it seems more plausible that geographers are simply having too great of a time at various drinking establishments to tweet about it, or are too smart to use the official conference hashtag when doing so!

May 02, 2013

DOLLY's Birthday!

We recently added a page outlining in more detail the DOLLY (Data On Local Life and You) project at the University of Kentucky to provide an overview to this ongoing and exciting project to make the massive datasets associated with geosocial media data (such as Twitter) accessible and explorable.  Yesterday we archived the 3 billionth tweet and it seemed worth recognizing DOLLY (along with all her algorithmic stream and process workers, since it was May Day) by declaring it to be her official one year birthday.  And since few of us can carry a tune (even with handles) we thought we'd let Satchmo serenade DOLLY.



We've posted some of our work based on DOLLY here, including an analysis of tweets after the Boston bombing, Premier League fandom in the UK, flooding in the UK, Thanksgiving tweets, earthquakes in Kentucky and racist tweets after the 2012 election.

Now that the Spring semester is winding down we will be stepping up our work and posts here.  We have a couple of really great posts that will be appearing over the next week or so.

We see DOLLY as both a key tool for our own work and a means to break down the technological barrier that is often present for researchers who would like to study big data but do not necessarily possess the required technical skills. So stay tuned.

April 23, 2013

Tracking personal activity at AAG: A cautionary tale of big data and lack of sleep

At FloatingSheep we are always seeking to push the envelope in terms of user-generated data, and so when it came to our attention that someone we know was sporting a Nike Fuelband, we couldn't resist taking a quick look at the data. For those of you unfamiliar with the Fuelband, it is a bracelet one wears to capture activity and exercise and "precisely" measure caloric consumption. Even better, it awards "points" so that you and your cyborg friends can compete for bragging rights. To be honest, we don't quite understand the appeal, but have little doubt everyone will be sporting these things in the near future as we bow down to our digital overlords, or rather, happily greet each new consumer product as it arrives.

In any case, a well-known friend of the sheep (FOTS)[1] was sporting one at the recent annual meetings of the Association of American Geographers two weeks ago and was kind enough (or suffers from some sort of twisted exhibitionism) to share the data with us so that we could share it with you (see below). This FOTS was kind enough to also add yellow ellipses during his/her sleep periods and a handy counter of the daily ration of sleep (in terms of hours).


To provide a bit of a baseline, the days before the conference (which began on Tuesday) are also included. Note, the conference was in LA (Pacific Time) but the data is presented in Eastern time, so the local time of each activity is actually three hours earlier than indicated in the chart. The big takeaway here is that this FOTS had only 13 hours of sleep from Tuesday to Sunday (mostly between 4 am and 8 am) until s/he boarded a plane and collapsed on Sunday. Given the crude nature of the data, other patterns are harder to distinguish but peaks in the late evening or early morning suggest dancing or other activities.
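
For anyone who wants to read the chart in conference-local time, the shift between Eastern and Pacific time can be sketched as follows. The fixed UTC offsets assume daylight saving time (which applies in April), the date is simply a hypothetical conference day, and the 4 am to 8 am window is read here as chart (Eastern) time, which is our assumption.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assuming daylight saving time is in effect (April).
EDT = timezone(timedelta(hours=-4), "EDT")
PDT = timezone(timedelta(hours=-7), "PDT")

def chart_to_local(chart_time):
    """Treat a naive chart timestamp as Eastern time and convert it to Pacific."""
    return chart_time.replace(tzinfo=EDT).astimezone(PDT)

# Hypothetical example: a 4 am to 8 am sleep window as drawn on the chart.
sleep_start = chart_to_local(datetime(2013, 4, 10, 4, 0))
sleep_end = chart_to_local(datetime(2013, 4, 10, 8, 0))

print(sleep_start.strftime("%H:%M %Z"), "to", sleep_end.strftime("%H:%M %Z"))
```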

While just looking at this chart makes us tired (as well as giving us a headache) it does allow for some preliminary observations:
  • There is an important late-night component to the AAG (and academic conferences more generally) that deserves further study...sounds like a good field opportunity for auto-ethnography;
  • A cost saving measure for certain conference attendees (such as this FOTS) would be simply to not get a hotel room and stay up the entire time; and
  • Some people are having a lot more fun (or more precisely, activity) at the AAG than us.
We have no doubt that we'll be seeing more of this individual daily monitoring data in the months/years to come, and are placing bets on how long before it becomes smoothly integrated with GPS (the technology is already there) in order to produce spatial activity maps for everyone [2]. No more bragging about going to the gym (and then hanging out at the refreshment bar) or calling in sick so that you can go skiing. The data will know!

-------------------------
[1] But if you think you know who it is, feel free to leave a comment.  Chances are that you are right.
[2] Think Hagerstrand's space-time prism on steroids. 

April 19, 2013

New Article Published in Cartography and Geographic Information Science


We're happy to report that our article -- Beyond the geotag: situating 'big data' and leveraging the potential of the geoweb -- has been published in Cartography and Geographic Information Science as part of a special issue on "Mapping Cyberspace and Social Media", edited by Ming-Hsing Tsou and Michael Leitner. The article was written collaboratively amongst the five sheep, as well as Jeremy Crampton and Matt Wilson of the University of Kentucky. The abstract and full citation for the paper are below:
This article presents an overview and initial results of a geoweb analysis designed to provide the foundation for a continued discussion of the potential impacts of ‘big data’ for the practice of critical human geography. While Haklay's (2012) observation that social media content is generated by a small number of ‘outliers’ is correct, we explore alternative methods and conceptual frameworks that might allow for one to overcome the limitations of previous analyses of user-generated geographic information. Though more illustrative than explanatory, the results of our analysis suggest a cautious approach toward the use of the geoweb and big data that are as mindful of their shortcomings as their potential.

More specifically, we propose five extensions to the typical practice of mapping georeferenced data that we call going ‘beyond the geotag’: (1) going beyond social media that is explicitly geographic; (2) going beyond spatialities of the ‘here and now’; (3) going beyond the proximate; (4) going beyond the human to data produced by bots and automated systems, and (5) going beyond the geoweb itself, by leveraging these sources against ancillary data, such as news reports and census data. We see these extensions of existing methodologies as providing the potential for overcoming existing limitations on the analysis of the geoweb.

The principal case study focuses on the widely reported riots following the University of Kentucky men's basketball team's victory in the 2012 NCAA championship and its manifestation within the geoweb. Drawing upon a database of archived Twitter activity – including all geotagged tweets since December 2011 – we analyze the geography of tweets that used a specific hashtag (#LexingtonPoliceScanner) in order to demonstrate the potential application of our methodological and conceptual program. By tracking the social, spatial, and temporal diffusion of this hashtag, we show how large databases of such spatially referenced internet content can be used in a more systematic way for critical social and spatial analysis.
Crampton, J.W., M. Graham, A. Poorthuis, T. Shelton, M. Stephens, M.W. Wilson and M. Zook. 2013. Beyond the Geotag: Situating ‘Big Data’ and Leveraging the Potential of the Geoweb. Cartography and Geographic Information Science 40(2): 130-139.

If you'd like the final publication version and don't have institutional access to the article, feel free to email any of us to get a copy.

April 17, 2013

Mapping the Boston Marathon Bombing

The tragedy in Boston this week shook us all. Several of us have strong ties to the area and the randomness and sheer viciousness of the event is stunning.

We noted that many people felt similarly and many took immediately to social media (such as Twitter) to participate in a larger discussion. Some used it to assure loved ones that they were OK while cell phone service was spotty. Some used social media to spread misinformation for personal gain or to make a political point. So too did the Boston Police and Fire Departments rely on social media to get a better idea of what actually happened. But the focus on social media's role in responding to the bombings neglected the intensely geographic element of such user-generated content as individuals and society try to make sense of it all. Thus, in an effort to document the diffusion of spatial awareness of the tragedy, we offer the following analysis.

Using DOLLY, we collected all geotagged tweets in North America referencing "Boston" from March 1, 2013 through April 15, 2013. We've divided the data from the last month and a half into three separate temporal snapshots: from March 1 to March 31, from April 1 to April 15 at 2:45pm and, finally, April 15 from 2:45pm to 11:59pm, roughly the time following the first explosion on Boylston Street near the finish line of the race. While the visual differences in the maps below may be somewhat subtle, the data behind them is anything but.

For the entire month of March 2013, there were a total of 48,622 geotagged tweets with reference to "Boston", of which 44,221 had exact lat/lon coordinates. Of the 48K+ tweets, nearly half (23,895) of them were within Boston's city limits [1]. A fairly similar pattern was evident in tweets in the first half of this month, with 24,991 tweets total (23,151 had lat/lon coordinates attached) and 12,206 in Boston. These general trends were evident in earlier data as well, especially with respect to the pattern of roughly half of the references to the city being located within it.

References to "Boston" in the Continental USA, March 1 to April 15 [2]

But in the time since yesterday's bombing, tweeting activity about Boston has both intensified and dispersed. After 2:45pm EDT on the 15th, there were 52,339 tweets in our dataset -- that is, several thousand more tweets in roughly nine-and-a-quarter hours than there usually are in an entire month, indicating an expected spike in overall activity as a result of the news coverage. But of these, a lower percentage (83.6%, as compared to 90.9% for March and 92.6% for the first half of April) were geotagged with exact lat/lon coordinates. And, perhaps most interestingly, an incredibly small number of these tweets originated within Boston.

Whereas roughly half of the tweets about Boston originated there in the earlier time frames, only 3% of tweets were located within the city following the bombings. All of this remains in stark contrast to the numbers from last year's Boston Marathon, where there were only 775 total mentions of the city in geotagged tweets from North America, with 333 (again, close to half) within the city. So not only was there a considerably smaller amount of geotagged tweeting, but so too did it remain concentrated largely within the city.
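
The shares quoted in the paragraphs above can be recomputed from the raw counts. The sketch below covers only the periods for which the counts themselves are reported; the post-bombing window is reported as percentages (83.6% with exact coordinates, 3% within Boston), so it is left out. The dictionary layout and function name are ours.

```python
# Counts quoted above (geotagged tweets in North America mentioning "Boston").
periods = {
    "March 2013": {"total": 48_622, "exact": 44_221, "in_boston": 23_895},
    "April 1-15, pre-bombing": {"total": 24_991, "exact": 23_151, "in_boston": 12_206},
    # No exact-coordinate count is given for the 2012 marathon.
    "2012 Marathon": {"total": 775, "exact": None, "in_boston": 333},
}

def share(part, whole):
    """Return part/whole as a percentage, or None if the count was not reported."""
    if part is None:
        return None
    return 100 * part / whole

summary = {
    name: {
        "exact_pct": share(counts["exact"], counts["total"]),
        "in_boston_pct": share(counts["in_boston"], counts["total"]),
    }
    for name, counts in periods.items()
}

for name, pcts in summary.items():
    print(name, pcts)
```

The in-Boston share sits in the 40 to 50% range for each of these earlier periods, which is the baseline against which the 3% figure after the bombing stands out.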

References to "Boston" in the Greater Boston Area, March 1 to April 15

In addition to the overall intensification of discussion about Boston in the wake of the bombing, there are a couple of distinct spatial patterns at play here. First, yesterday's tragic events caused discussion of Boston on Twitter to become much more spatially diffuse around the country. This is likely the result of a combination of things: people within the city tweeting less due to concerns for their own safety, people within the city not feeling it necessary to include "Boston" in all of their topically-relevant tweets, and a heightened interest nationwide in what is just the latest in a long string of violence in recent months.

But second, discussion of the city within the city is also more spatially dispersed. While the time frames prior to the bombing demonstrate a massive concentration of tweets in Downtown and the Back Bay -- the areas in closest proximity to the bombings, as well as some of the more densely populated during daytime hours -- tweeting activity after the bombings shows less focus on these areas and a more random spatial distribution throughout the greater Boston area, though these areas maintain the highest concentrations.

This analysis shows how established spatial patterns of place-based social media activity can be disrupted by extraordinary circumstances, such as a terrorist attack, as well as the importance of looking at how such spatial patterns change over time [3]. While there remains more one could do with this data -- including a focus on tweeting activity within particular spaces of the city near the bombing or looking beyond particular keyword searches, or using social network analysis to understand the spatial and temporal diffusion of the tragic news -- these maps and statistics provide an initial look at how tragedies such as these and the outpouring of emotions about them result in shifting geographies of social media activity.

---------
[1] The rest of the greater Boston area -- including Cambridge, Somerville, Brookline, Newton, etc. -- was excluded from these counts for reasons of convenience.
[2] Note that both maps are in reverse chronological order, with the post-bombing time frame shown first in each series.
[3] There are also some important and potentially anomalous patterns relative to some of our earlier findings but this awaits further study.