December 27, 2012

The Bluegrass Basketball Battle

In Kentucky, basketball means everything -- especially college basketball, and especially the intrastate rivalry between the Kentucky Wildcats and the Louisville Cardinals, one of the greatest in all of college sports. Growing up in Louisville, one can't help but choose sides and develop one's debating skills, arguing with classmates, family and friends over whether Patrick Sparks traveled in 2004 or whether Rick Pitino is the modern-day basketball equivalent of Benedict Arnold. But given our connections to the University of Kentucky (and Taylor's fandom), the upcoming game and the tools at our disposal, we thought it might be time to wade in on the age-old debate between the two sides.

A recent public opinion poll of Kentucky by Public Policy Polling piqued our interest, as it found that Kentucky fans outnumber Louisville fans in the state by an overwhelming 66% to 17% margin. But how do the two fanbases stack up on Twitter?

We took to DOLLY to collect references to the two general-purpose hashtags used by fans of each team and promoted by the respective athletics departments -- #BBN (for Big Blue Nation) and #L1C4 (for Louisville First, Cards Forever) -- in geotagged tweets created between June 21, 2012 and December 20, 2012, in order to measure both the absolute numbers and the geographic distribution of UK and UL fans at the national, statewide and local scales as reflected by Twitter.

Number of Tweets referencing #BBN or #L1C4
According to the aforementioned poll's 66-to-17 margin, there are ~3.9x more UK fans than UL fans in Kentucky. This finding is mirrored almost exactly by our measures of tweeting, where the 6,371 geotagged references to #BBN in the state are also 3.9x greater than the 1,628 references to #L1C4. And while the number of tweets for each team is essentially equal within the city of Louisville, UK fandom becomes even more dominant once one moves outside of the Commonwealth, with over 10.5x more #BBN tweets than #L1C4 tweets in the US outside of Kentucky, for a total of 4.9x more UK tweets than UL tweets nationwide. So not only does UK hold an ever-so-slight advantage within Louisville's homebase, it shows increasing popularity as one moves to the larger scales of the state and nation.
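For anyone who wants to check the arithmetic behind the two Kentucky ratios, here is a quick Python sketch. The four figures come straight from the text above; the variable and function names are ours and purely illustrative.

```python
# Figures quoted in the post; names are illustrative only.
poll_uk, poll_ul = 66, 17              # percent of Kentuckians, per the poll
tweets_bbn, tweets_l1c4 = 6371, 1628   # geotagged tweets within Kentucky


def times_larger(a, b):
    """How many times larger a is than b."""
    return a / b


poll_ratio = times_larger(poll_uk, poll_ul)
tweet_ratio = times_larger(tweets_bbn, tweets_l1c4)

print(f"poll:   {poll_ratio:.1f}x")
print(f"tweets: {tweet_ratio:.1f}x")
```

Both ratios round to the same one-decimal figure, which is the sense in which the tweet counts "mirror" the poll.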

#BBN vs. #L1C4 Nationwide
But when we visualize these tweets, we get a better idea of just how geographically concentrated these patterns of fandom are. For instance, 599 of the 3,141 US counties had references to either #BBN or #L1C4. But of these, only 35 counties had a greater number of references to #L1C4, with Butler County, KY holding the dubious honor of being the only county in the Commonwealth with more references to #L1C4. Of the remaining counties, 554 had more references to #BBN, and only 10 counties in the country had an equal number of tweets referencing #BBN and #L1C4.

#BBN and #L1C4 in the Commonwealth of Kentucky
Also interesting is that no county in the US apart from Jefferson County, KY (Louisville and Jefferson County have a merged government, and so are coterminous) has more than 100 tweets with references to #L1C4, highlighting the essentially limited spatial distribution of UL fans. And though Jefferson County does have a few more UK tweets than UL tweets, one doesn't have to go far to find the county with the largest margin of UL-related tweets over UK-related tweets; right across the river from Louisville in Clark County, Indiana there are 20 more #L1C4 than #BBN tweets.

Meanwhile, Kentucky holds a decisive advantage in its hometown of Lexington-Fayette County, with 1,588 more #BBN tweets than #L1C4 tweets. But the county with the second-highest margin favoring UK is all the way south in Broward County, Florida (Ft. Lauderdale) with a +299 margin favoring UK.

#BBN vs. #L1C4 in Louisville
Within Louisville, the absolute numbers of tweets are almost equal, as mentioned previously; but, interestingly enough, the geographies of UK and UL tweeting are quite different. The clustering of #L1C4 tweets tends to be around the UL campus and downtown areas, while UK tweeting tends to be more spatially distributed, with many tweets coming from more suburban, residential areas in the city. So while the vast majority of UL tweets across the country are located in Louisville, a still significant number come from within just a handful of square miles surrounding the UL campus in downtown Louisville, perhaps indicating the limited appeal of a team that's lost four straight games to the defending-champion Wildcats.

#L1C4? More like #L1C4.9xLessPopularThanUK.

UPDATE: See today's article over at ESPN.com, "The Commonwealth's great divide", which discusses some of the same geographic dimensions of UK and UL fandom we are showing here. It includes this interesting passage:
In 2005, the Courier-Journal polled fans on their sports loyalties and 53.7 percent within the city counted themselves as UL fans compared to just 33 percent who identified themselves as Cats fans. And according to the two schools' alumni associations, Louisville understandably has a far greater base in Jefferson County (54,872 living alumni) than Lexington (16,112). 

But here's the catch: There are just 22,160 living Louisville alumni in the rest of the state and other than Fayette County (where Lexington sits), none of Kentucky's 120 counties boasts more UK grads than Jefferson.
While we weren't aware of these figures at the time of our initial post, they not only tend to confirm some of our findings, but indeed only lend even more credence to our assertion that UK fans seem to be more voracious tweeters than their UL counterparts, as the roughly 50-50 split in tweeting in Louisville is significantly askew from the 54-33 numbers from the Courier-Journal's 2005 survey.

December 12, 2012

Does the Pope Tweet in the Woods?

The Guardian posted one of our maps today on the spatial distribution of the Pope's followers on Twitter. However, they decided not to use our suggested tag line "does the pope tweet in the woods?". We can't understand why. But rather than let this rhetorical gem disappear into the mists of the Internet, we're also posting the map here as well as a map on the location of tweets containing the term Pontifex (the name of the Pope's Twitter account). Apparently the pope is very big in Italy... who knew?


Location of Tweets containing the term "pontifex", December 2012
(note, these are raw counts rather than normalized)

December 11, 2012

We're also hiring a researcher in spatial statistics!

In addition to our new position in Internet Geography, we are now also hiring a full-time five-month researcher to study the geographies of user-generated content and participation on Wikipedia. We specifically seek to employ a researcher with experience in quantitative geography or quantitative sociology in order to statistically explain national and sub-national patterns and geographies of Wikipedia articles and editing behaviour.

Across the globe, daily economic, social and political activities increasingly revolve around the use of social content on the Internet. This user-generated content influences our understandings of, and interactions with, our social environment. Despite rapid increase in Internet access, there are indications that many people remain largely absent from websites and services, and many voices are absent from important platforms of information.

We explore this phenomenon through one of the world's most visible and most accessed sources of content: Wikipedia. This project will employ a range of (primarily quantitative) methods to assess, explain, and model the variable levels of access, participation and representation on Wikipedia.

Candidates should have a keen interest in platforms of peer-production and the geographies of online participation. We welcome applications from candidates with a background in statistical methods, a strong record of scholarly research, and a desire to co-author academic publications.

Based at the Oxford Internet Institute, this position is available immediately for five months in the first instance, with the possibility of renewal thereafter, funding permitting.

Applications for this vacancy are to be made online. To apply for this role and for further details, including a job description and selection criteria, please click on the following link: https://www.recruit.ox.ac.uk/pls/hrisliverecruit/erq_jobspec_version_4.jobspec?p_id=105871

Only applications received before 12:00 midday on 14th January 2013 can be considered. Interviews for those short-listed are currently planned to take place in the week commencing 21st January 2013.

Please also feel free to get in touch with any questions about the position. 

December 03, 2012

We're hiring an Internet Geographer!

We are hiring a full-time Internet Geographer at the Oxford Internet Institute!

The position is for a researcher to work with Mark Graham on a project to study and map the Geographies of the Internet. This is an exciting role in which the researcher will both gather and analyse a range of Internet-related data and develop innovative and beautiful ways to visualise them.

It is important to understand who produces and reproduces, who has access to, and who and where are represented by information in our contemporary knowledge economy. Building on existing work conducted at the Oxford Internet Institute, this project proposes a comprehensive mapping of contemporary geographies of the Internet using both primary and secondary data (examples of data that we propose to map include geographies of academic journals, intellectual property imports vs. exports, patents, Wikipedia edits and contributions, networks of Twitter mentions and followers, Facebook users, information takedown requests, contested articles in Wikipedia, content indexed in Google etc.).

This one-year project will be divided into four stages. First, we will bring together and start collecting all necessary data. While some of the data are readily available in existing and open datasets, others require the creation of custom scripts and data collection tools. Second, we will use GIS and statistical packages to comprehensively analyse the data. Third, we will create visually appealing and state-of-the-art graphics and maps that clearly convey the geographies of access, information production, and information representation. Finally, we will broadly disseminate this work in a variety of open and accessible formats including free and interactive ebooks, an interactive website, a printed atlas, and academic journal articles. There may also be opportunities to gain teaching experience in some of the methods classes offered at the OII.

Candidates should have a keen interest in the geographies of the Internet, a passion for visualising and disseminating results, and an exemplary record of creative activity or scholarly research. The successful applicant will demonstrate an ability to collect online data with scripting tools, analyse large datasets with GIS tools, and visualise results in both static and interactive formats. 

We welcome applications from candidates who are additionally keen to design a future research programme in Internet Geographies in order to extend the position. Based at the Oxford Internet Institute, this position is available immediately for 12 months in the first instance, with the possibility of renewal thereafter, funding permitting.

For more details, please check out the application link and related job description and selection criteria:


Please also feel free to get in touch with any questions about the position. 


November 29, 2012

Funding for Graduate work in FloatingSheep Studies

Do you enjoy the maps and research posted at the FloatingSheep blog? Interested in discovering how the Internet, geoweb and social media are changing the way we use and understand places? Would you like the opportunity to use the DOLLY project to explore geo-social media?

If so, the Department of Geography at the University of Kentucky is currently accepting applications for graduate study at the Master's and Ph.D. level in the exciting arenas of online mapping, big data and critical social analysis. We're particularly interested in folks who blend experience in the technical/coding side of things with a desire to think carefully through the big socio-spatial theoretical questions that arise in concert with these technologies. To get a better sense of what this program of study might entail, take a closer look at some of the recent academic publications that have emerged from FloatingSheep such as work on augmented reality, user-generated geographies of religion and disaster relief as well as the virtual economy and economic flows. You can also check a full list of my publications.

In addition, be sure to examine the work of other University of Kentucky Geography professors, namely Jeremy Crampton and Matthew Wilson, who are doing really exciting work in the related areas of critical cartography, online mapping and participatory GIS.

More information on the program and application process is available here. Students admitted to the graduate program receive full tuition waivers and stipends in exchange for working as teaching assistants. Fellowships and other funding are also possible. Applicants should submit their materials by January 15 to ensure a complete review.

If you are interested (or want more information) please email me (zook@uky.edu) directly.

November 28, 2012

Digital Data Trails of the UK Floods

What do data scraped from the Internet tell us about a range of social, economic, political, and even environmental processes and practices? As ever more people take to social media to share and communicate, we are seeing that the data shadows of any particular story or event become increasingly well defined. 

The ongoing UK floods offer a useful example of some of the links between digital data trails and the phenomena they represent. In the graphics below, we mapped every geocoded tweet between Nov 20 and Nov 27, 2012 that mentioned the word "flood" (or variations like "flooded" or "flooding").
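To make the keyword matching concrete, here is a minimal Python sketch of how one might pick out tweets that mention "flood" or its variations. The regular expression, the function name and the sample tweets are our own assumptions for illustration; they are not the actual query used to build these maps, and the dictionary layout is not Twitter's API format.

```python
import re

# Matches "flood" plus the suffix variations "floods", "flooded", "flooding".
# The word boundaries (\b) deliberately exclude compounds such as "floodlight";
# whether to count words like "floodwater" is a judgment call left open here.
FLOOD = re.compile(r"\bflood(s|ed|ing)?\b", re.IGNORECASE)


def mentions_flood(text):
    """True if the text contains 'flood' or one of its listed variations."""
    return FLOOD.search(text) is not None


# Hypothetical geocoded tweets, invented for this example.
tweets = [
    {"text": "Road into town is flooded again", "lat": 51.75, "lon": -1.26},
    {"text": "Lovely dry walk by the river today", "lat": 51.45, "lon": -2.59},
    {"text": "FLOODING on the A40, avoid!", "lat": 51.86, "lon": -2.24},
]

matches = [t for t in tweets if mentions_flood(t["text"])]
print(len(matches), "of", len(tweets), "tweets mention flooding")
```

A stricter or looser pattern would change the counts somewhat, which is one reason to treat maps like these as indicative rather than exact.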


Unlike many maps of online phenomena (relevant XKCD), careful analysis and mapping of Twitter data does NOT simply mirror population densities. Instead, concentrations of Twitter activity (in this case tweets containing the keyword flood) seem to closely reflect the actual locations of floods and flood alerts, even when we simply look at the total counts. This pattern becomes even clearer when we normalise the map: the second map is a location quotient, where every value greater than 1 indicates that there are more tweets related to flooding than one would expect based on normal Twitter usage in that area. Once normalised, the data even more closely mirror the UK Environment Agency's flooding map.

As we demonstrated with our maps of Hurricane Sandy, it is important to approach these sorts of maps with caution. At least in the information-dense Western world, they are often able to reflect the broad contours of large phenomena. But, because we are still necessarily measuring subsets of subsets, our big data shadows start to become quite small and unrepresentative at more local levels. This is particularly an issue when the use of the relevant technology is unevenly distributed across demographic sectors, such as was the case in post-Katrina New Orleans.

Nonetheless, with every new large event, movement, and phenomenon, we are undoubtedly going to see much more research into both the potentials and limitations of mapping and measuring digital data shadows. This is because physical phenomena like hurricanes and floods don't just leave physical trails, but create digital ones as well.

November 23, 2012

Sheepallenge Deadline in One Week!

For those of you taking part in our Sheepallenge competition, we have over 40 teams and people signed up for this challenge and are looking forward to the variety of submissions. A few quick reminders and updates:

1.  Your final visualizations need to be submitted to Monica (monica.stephens@humboldt.edu) by midnight EST on November 30 (one week from today) to be forwarded on to the judges for consideration.

2.  We ask that those of you using Sheepallenge as a class project limit the submissions from your students to the ones you think are award worthy.

3.  Seriously, no bribing the judges with chocolate.  

We've heard rumors of exciting visualizations utilizing this data (from sinful surfaces to glutinous glory) and will post the best results in the coming weeks.

November 22, 2012

Do People Tweet of Mashed Turnips?, and other Thanksgiving Day Mysteries

While trying to avoid the hard work of stuffing the turkey or the pain of listening to relatives who want to rehash the election, we decided to take a look at Thanksgiving-related geocoded tweets across the United States. We're not doing a lot of interpretation of these, but hopefully the maps do a decent job speaking for themselves, though it is important to note that all maps show raw counts without any kind of normalization.

Since turkey tweets are everywhere, we thought it might be fun to take a closer look at some of the more off-beat or regionally-specific Thanksgiving traditions using some new tools being developed to extend the capabilities of the DOLLY project. Some rather off-the-cuff observations:
  • Grits and okra have their strongest regional tweet clusters in the South, and hot dish has its strongest in the Upper Midwest.
  • Very few people are tweeting about mashed turnips (who knew?), but those who are, are doing it in the areas around New York City.
  • Oyster and chestnut stuffing have the strongest concentrations in the Northeast.
  • Texas prefers pecan pie relative to apple or pumpkin pie.
  • People are still tweeting about turducken.
  • NPR listeners really are concentrated in the Northeast (as per the Mama Sternberg Cranberry Relish Twitter index).
Search for Grits

Search for Okra

Search for "hot dish" OR "hotdish"

Search for "mashed turnips"

Search for stuffing

Search for oyster* AND stuffing

Search for chestnut AND stuffing

Search for apple AND pie

Search for pecan AND pie

Search for pumpkin AND pie


Search for "cranberry" AND "sternberg"

HO! HO! HO! Oh, wait... that's something different, isn't it?

November 12, 2012

Mapping the Eastern Kentucky Earthquake

Last week's post on racist tweets in the wake of the US presidential election received much more attention than we ever expected. A number of questions about and critiques of our method were raised, which we attempted to respond to in a special FAQ with the post (first time we had to do that). Nonetheless, we thought it might be useful to demonstrate the utility of our technique on a less controversial subject in order to demonstrate how we can leverage a relatively small number of geocoded tweets in order to understand particular offline phenomena, and maybe even assuage some concerns about such an approach.

The 4.3 magnitude earthquake that occurred on Saturday, November 10th around 12:08pm EST, about eight miles west of Whitesburg, Kentucky, provides just such an example. Given our own connections to Kentucky, and the significant number of our own friends and family who tweeted or updated their statuses about the earthquake, we were naturally interested in what we might be able to bring to such an analysis.

But before showing our own results, it is useful to note that the US Geological Survey also collects user-generated data on earthquakes through their "Did You Feel It?" reporting system, in which individuals contribute their location and experience with the quake. The USGS then aggregates these reports into a crowdsourced map like the one below in order to visualize an approximation of how the earthquake was experienced in different locations.

Rather than use such a direct system of user-generated data collection, we fired up DOLLY in order to gather geocoded tweets referencing the earthquake in its immediate aftermath. We were able to collect 795 geotagged tweets referencing "earthquake" from 12:08pm -- where the first tweet we uncovered near Hyden in Leslie County, KY simply said "EARTHQUAKE HOLY SHAT" -- until around 4:05pm in an area comprising most of central and eastern Kentucky, southern Ohio, West Virginia, southwest Virginia, western North Carolina and east Tennessee (we limited our query based on a bounding box drawn around the epicenter of the quake).
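For readers curious what restricting a query to a bounding box looks like in practice, here is a minimal Python sketch. The BoundingBox class, its field names and the coordinates are illustrative assumptions only; they are not DOLLY's query interface, and they are not the box used for this map (which extended further west, to Louisville).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in decimal degrees (does not handle the antimeridian)."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        return (self.south <= lat <= self.north
                and self.west <= lon <= self.east)


# Illustrative box of roughly two degrees around eastern Kentucky.
# These are NOT the coordinates used in the post.
box = BoundingBox(south=35.1, west=-85.0, north=39.1, east=-80.8)

# Approximate coordinates, for illustration only.
hazard_ky = (37.25, -83.19)
chicago_il = (41.88, -87.63)

print(box.contains(*hazard_ky))   # inside the illustrative box
print(box.contains(*chicago_il))  # outside it
```

Drawing the box is itself a methodological choice: a larger box pulls in more big-city tweets, while a smaller one can cut off places where the quake was genuinely felt.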

This area includes several cities such as Louisville and Lexington in Kentucky and Knoxville, TN, as well as many more rural areas. As much of our earlier work has clearly shown, population centers typically possess a greater level of online activity simply by virtue of population size, so it was important to look beyond just the raw numbers of earthquake-related tweeting. Therefore, in order to normalize the data, we also collected a 1% sample of all geotagged tweets from the month of October within the same area. This totaled 30,699 tweets, which we used to normalize the tweets about the earthquake and construct a location quotient measurement in exactly the same way as with the racist tweet analysis [1]. We again aggregated from individual tweets to a larger areal unit, in this case, counties.
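The normalization step can be sketched in a few lines of Python. This is a minimal illustration of the location quotient from footnote [1], assuming each tweet has already been assigned to a county; the function name and the toy counties A, B and C are ours, not part of the actual analysis.

```python
from collections import Counter


def location_quotients(event_counties, reference_counties):
    """LQ per county = (county's share of event tweets)
                     / (county's share of reference tweets).

    Counties with no reference tweets are skipped, because their
    reference share is zero and the quotient is undefined.
    """
    events = Counter(event_counties)
    reference = Counter(reference_counties)
    total_events = sum(events.values())
    total_reference = sum(reference.values())

    result = {}
    for county, n_ref in reference.items():
        event_share = events.get(county, 0) / total_events
        reference_share = n_ref / total_reference
        result[county] = event_share / reference_share
    return result


# Toy data: one county label per tweet (made up for illustration).
event_counties = ["A"] * 6 + ["B"] * 4
reference_counties = ["A"] * 20 + ["B"] * 70 + ["C"] * 10

lq = location_quotients(event_counties, reference_counties)
for county in sorted(lq):
    print(county, round(lq[county], 2))
```

In the toy data, county A holds 60% of the event tweets but only 20% of the reference tweets, so its quotient is about 3; county C has reference tweets but no event tweets, so its quotient is 0.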


First and foremost, though we did not use an entirely contiguous area, it is easy to notice that our map roughly conforms with the map of crowdsourced reports from the USGS, generally confirming the relevance of a relatively small set of user-generated data to understanding such an event.

Second, by looking at the blue dots representing each individual tweet, we can see concentrations within the counties containing the largest cities in the specified search area. These include Knox Co., TN (Knoxville), Jefferson Co., KY (Louisville), Fayette Co., KY (Lexington), Madison Co., KY (Richmond), and Cabell Co., WV (Huntington). None of these localities is particularly close to the epicenter of the quake in eastern Kentucky; the concentrations are more likely a product of the higher population in these cities (increasing the likelihood that Twitter users would feel the quake and take to Twitter to report it), as well as their importance as regional centers with close social and economic connections to eastern Kentucky.

Third, and interestingly enough, there were only six counties where there were more earthquake tweets than there were tweets within the given 1% sample from October [2]. Leading this group of counties is Letcher County, where the earthquake epicenter was located. Letcher County also has a location quotient of nearly 100, indicating that the earthquake generated far more tweets there than one would expect from its usual share of Twitter activity. Each of the other counties, though possessing many fewer tweets in both the earthquake and reference datasets, is also located in close proximity to Letcher County and the epicenter of the earthquake. These include Bath Co., KY, Leslie Co., KY, Polk Co., TN, Johnson Co., TN and Rockingham Co., VA.

We can also look at patterns of tweets without aggregating to an administrative unit. In this case, we estimate the intensity of the earthquake tweet pattern (again normalized for what would be expected based on a random sample of tweets) in the region using Gaussian kernel smoothing. Interestingly, the 'epicenter' of earthquake tweets is only 6.7 miles away from the real epicenter of the earthquake (indicated by the red star). Not coincidentally, the center of intensity of our tweet map is located in the nearby town of Hazard, KY, which has a higher population density (resulting in more Twitter users) than the more rural town of Whitesburg, near the epicenter as measured by the USGS.
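As a rough illustration of the idea, the Python sketch below smooths event points and reference points with a Gaussian kernel and looks for the grid cell where the ratio of the two is highest. It is a simplified stand-in under our own assumptions (planar coordinates in arbitrary units, a hand-picked bandwidth, a ratio of two kernel estimates, made-up points) and should not be read as the exact procedure behind the map.

```python
import math


def kernel_intensity(points, x, y, bandwidth):
    """Sum of Gaussian kernels centred on each point, evaluated at (x, y)."""
    return sum(
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * bandwidth ** 2))
        for px, py in points
    )


def normalized_peak(events, reference, grid, bandwidth, floor=1e-6):
    """Grid cell where the smoothed event share most exceeds the
    smoothed reference share, together with that ratio."""
    best_cell, best_ratio = None, -1.0
    for x, y in grid:
        e = kernel_intensity(events, x, y, bandwidth) / len(events)
        r = kernel_intensity(reference, x, y, bandwidth) / len(reference)
        ratio = e / max(r, floor)  # floor guards against dividing by ~0
        if ratio > best_ratio:
            best_cell, best_ratio = (x, y), ratio
    return best_cell, best_ratio


# Made-up pattern: a busy "city" at x=0 and a quiet "rural" spot at x=10.
reference = [(0, 0)] * 8 + [(10, 0)] * 2   # everyday tweeting
events = [(0, 0)] * 3 + [(10, 0)] * 3      # event-related tweets
grid = [(x, 0) for x in range(11)]

best_cell, best_ratio = normalized_peak(events, reference, grid, bandwidth=2.0)
print(best_cell, round(best_ratio, 2))
```

In this toy pattern the raw event counts are tied between the two places, but the normalized surface peaks at the rural spot, because the same number of event tweets is far more unusual there. The choice of bandwidth and floor both affect where the peak lands, so they deserve scrutiny in any real analysis.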

Ultimately, these results are not necessarily surprising, as they indicate both the extremely localized nature of a phenomenon like reporting an earthquake as evidenced by the greater location quotient values nearer the epicenter, as well as the essentially networked nature of such phenomena mediated by the internet in the clustering of user-generated internet content in cities quite distant from the earthquake's origin.

From a methodological standpoint, it shows that the fairly simple technique of calculating location quotients, or even the more involved technique of Gaussian kernel smoothing, can provide powerful ways of uncovering the spatial dimensions of online reflections of essentially offline phenomena.

We hope that this example -- which uses about the same number of tweets (particularly relative to the number of administrative units) as our racist tweets map -- will help alleviate some of the methodological concerns raised in our previous post.
---------
[1] The equation used to calculate the location quotient is as follows:

# of tweets referencing "earthquake" per county / total # of tweets referencing "earthquake"
------------------------------------------------
# of reference tweets per county / total # of reference tweets

[2] We should note that this doesn't mean that there were more earthquake-related tweets in the given time period on Saturday than total tweets in the entire month of October. Rather, this simply represents an indicator of how many earthquake-related tweets there were relative to the expected amount of content in that place.
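The point of this footnote can be checked with a short Python sketch: because the location quotient compares shares rather than raw counts, scaling the reference sample up or down leaves it unchanged (setting aside sampling noise, which can matter in counties with very few tweets). The totals 795 and 30,699 come from the post; the county-level counts of 20 and 10 are hypothetical.

```python
def lq(event_in_area, event_total, ref_in_area, ref_total):
    """Location quotient: the area's event share over its reference share."""
    return (event_in_area / event_total) / (ref_in_area / ref_total)


EVENT_TOTAL = 795       # earthquake tweets collected (from the post)
REF_TOTAL = 30699       # tweets in the 1% October sample (from the post)

# Hypothetical county: 20 earthquake tweets, 10 tweets in the 1% sample.
sampled = lq(20, EVENT_TOTAL, 10, REF_TOTAL)

# Pretend we had every October tweet instead: 100x the sample everywhere.
scaled = lq(20, EVENT_TOTAL, 10 * 100, REF_TOTAL * 100)

print(round(sampled, 1), round(scaled, 1))
```

So a county can show more earthquake tweets than sampled reference tweets, and a very high quotient, without that implying anything about its total October tweet volume.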

November 08, 2012

Mapping Racist Tweets in Response to President Obama's Re-election

Note: for questions about the methodology/approach of this post, see the FAQ (added 16:20 EST 11/9/2012).
Note: as of 11:00 EST 11/10/2012, we have disabled commenting on this post.
Note: at 10:00 am EST 11/12/2012 we posted an analysis using the same methodology as this post to locate the epicenter of an earthquake in Eastern Kentucky over the weekend.

During the day after the 2012 presidential election we took note of a spike in hate speech on Twitter referring to President Obama's re-election, as chronicled by Jezebel (thanks to Chris Van Dyke for bringing this to our attention). It is a useful reminder that technology reflects the society in which it is based, both the good and the bad. Information space is not divorced from everyday life; racism extends into the geoweb and helps shape its contours, and in turn, data from the geoweb can be used to reflect the geographies of racist practice back onto the places from which they emerged.

Using DOLLY we collected all the geocoded tweets from the last week (beginning November 1) with racist terms that also reference the election in order to understand how these everyday acts of explicit racism are spatially distributed. Given the nature of these search terms, we've buried the details at the bottom of this post in a footnote [1].

Given our interest in the geography of information, we wanted to see how this type of hate speech overlaid on physical space. To do this we aggregated the 395 hate tweets to the state level and then normalized them by comparing them to the total number of geocoded tweets coming out of that state in the same time period [2]. We used a location quotient inspired measure (LQ) that indicates each state's share of election hate speech tweets relative to its share of all tweets [3]. A score of 1.0 indicates that a state's share of hate speech tweets equals its share of all tweets. Scores above 1.0 indicate that hate speech is overrepresented relative to overall tweeting, suggesting that the state's "twitterspace" contains more racist post-election tweets than the norm.

So, are these tweets relatively evenly distributed? Or do some states have higher specializations in racist tweets? The answer is shown in the map below (also available here in an interactive version), in which the locations of individual tweets (indicated by red dots) [4] are overlaid on color-coded states. Yellow shading indicates states that have a relatively lower amount of post-election hate tweets (compared to their overall tweeting patterns) and all states shaded in green have a higher amount. The darker the green color, the higher the location quotient measure for hate tweets.

Map of the Location Quotients for Post Election Racist Tweets
Click here to access an interactive version of the map at GeoCommons

A couple of findings from this analysis
  • Mississippi and Alabama have the highest LQ measures with scores of 7.4 and 8.1, respectively.
  • Other southern states (Georgia, Louisiana, Tennessee) surrounding these two core states also have very high LQ scores and form a fairly distinctive cluster in the southeast.
  • The prevalence of post-election racist tweets is not strictly a southern phenomenon as North Dakota (3.5), Utah (3.5) and Missouri (3) have very high LQs.  Other states such as West Virginia, Oregon and Minnesota don't score as high but have a relatively higher number of hate tweets than their overall twitter usage would suggest.
  • The Northeast and West coast (with the exception of Oregon) have a relatively lower number of hate tweets.
  • States shaded in grey had no geocoded hate tweets within our database.  Many of these states (Montana, Idaho, Wyoming and South Dakota) have relatively low levels of Twitter use as well.  Rhode Island has much higher numbers of geocoded tweets but had no hate tweets that we could identify.
Keep in mind that we are measuring tweets rather than users, so one individual could be responsible for many tweets. In some cases (most notably in North Dakota, Utah and Minnesota) the number of hate tweets is small and the high LQ is driven by the relatively low number of overall tweets. Nonetheless, these findings support the idea that there is some fairly strong clustering of hate tweets centered in the southeastern U.S., which has a much higher rate than the national average.
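The small-numbers caveat is easier to see when the LQ is rewritten as an observed count divided by an expected count, which is algebraically the same as the formula in footnote [3]. In the Python sketch below, the total of 395 comes from the post, while the two state shares are hypothetical values chosen only to contrast a low-volume state with a high-volume one.

```python
HATE_TWEETS_TOTAL = 395  # from the post


def expected_count(state_share_of_all_tweets):
    """Hate tweets a state would have if they simply followed its
    share of all geocoded tweets."""
    return HATE_TWEETS_TOTAL * state_share_of_all_tweets


def lq_from_count(observed, state_share_of_all_tweets):
    """Same quantity as the LQ formula: observed over expected."""
    return observed / expected_count(state_share_of_all_tweets)


# Hypothetical shares of all geocoded tweets (not real state figures).
small_state = 0.002   # 0.2% of all tweets
large_state = 0.10    # 10% of all tweets

for observed in (1, 2, 3):
    print("small state:", observed, round(lq_from_count(observed, small_state), 2))

for observed in (40, 41):
    print("large state:", observed, round(lq_from_count(observed, large_state), 2))
```

In the hypothetical small state, each additional tweet moves the LQ by more than a full point, whereas in the large state one more tweet barely registers. That is why high scores built on a handful of tweets should be read cautiously.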

But lest anyone elsewhere become too complacent, the unfortunate fact is that most states are not immune from this kind of activity. Racist behavior, particularly directed at African Americans in the U.S., is all too easy to find both offline and in information space.

--------------------- State Level Data ---------------------

The table below outlines the values for the location quotients for post-election hate tweets.

State    LQ of Racist Tweets    Notes
Alabama    8.1
Mississippi    7.4
Georgia    3.6
North Dakota    3.5
Utah    3.5
Louisiana    3.3
Tennessee    3.1
Missouri    3.0
West Virginia    2.8
Minnesota    2.7
Kansas    2.4
Kentucky    1.9
Arkansas    1.9
Wisconsin    1.9
Colorado    1.9
New Mexico    1.6
Maryland    1.6
Illinois    1.5
North Carolina    1.5
Virginia    1.5
Oregon    1.5
District of Columbia    1.5
Ohio    1.4
South Carolina    1.4
Texas    1.3
Florida    1.3
Delaware    1.3
Nebraska    1.1
Washington    1.0
Maine    0.9
New Hampshire    0.8
Pennsylvania    0.7
Michigan    0.6
Massachusetts    0.5
New Jersey    0.5
California    0.5
Oklahoma    0.5
Connecticut    0.5
Nevada    0.5
Iowa    0.4
Indiana    0.3
New York    0.3
Arizona    0.2
Alaska      -   see note 1
Idaho      -   see note 1
South Dakota      -   see note 1
Wyoming      -   see note 1
Montana      -   see note 1
Hawaii      -   see note 1
Vermont      -   see note 1
Rhode Island      -   see note 2


Note 1: no racist tweets, SMALL number of total geocoded tweets
Note 2: no racist tweets, LARGE number of total geocoded tweets

-----------------
[1] Using the examples of tweets chronicled by Jezebel blog post we collected tweets that contained the text "monkey" or "nigger" AND also contain the text "Obama" OR "reelected" OR "won". A quick, and very unsettling, examination of the search results revealed that this indeed was a good match for our target of election-related hate speech. We end up with a total of 395 of some of the nastiest tweets you might possibly imagine.  And given that we're talking about the Internet, that is really saying something.

[2] To be precise, we took a 0.05% sample of all geocoded tweets in November 2012 aggregated to the state level.

[3] The formula for this location quotient is

(# of Hate Tweets in State / # of Hate Tweets in USA) 
------------------------------------------------------------
(# of ALL Tweets in State / # of ALL Tweets in USA)
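
For readers who want to try the calculation themselves, the formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the counts in the example are hypothetical, not the figures behind the table.

```python
def location_quotient(hate_state, hate_usa, all_state, all_usa):
    """State's share of national hate tweets divided by its share of all tweets."""
    if hate_usa == 0 or all_state == 0 or all_usa == 0:
        return None  # the ratio is undefined when a denominator is zero
    return (hate_state / hate_usa) / (all_state / all_usa)

# Hypothetical counts: 20 of 400 hate tweets, 1,000 of 100,000 tweets overall
lq = location_quotient(hate_state=20, hate_usa=400, all_state=1000, all_usa=100000)
print(round(lq, 1))  # 5.0
```

A value above 1 means the state's share of hate tweets is larger than its share of all tweets; a value below 1 means it is smaller.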

[4] We should also note that the precision of the individual tweet locations is variable.  Often the specific location shown on a map is the centroid of an area that is several tens or hundreds of meters across, so while the tweet came from near the point shown, it did not necessarily come from that exact spot on the map.

FAQ: Mapping Racist Tweets in Response to President Obama's Re-election

Note: This FAQ was posted at 4:20 EST on 11/9/12

What about the sample size? 395 doesn’t seem like that many?

The 395 tweets mentioned are the number of geocoded tweets referencing the given keywords from November 1 until November 7 at approximately 4:00 pm EST. This is NOT a sample, but the total population of geocoded tweets that matched our search criteria as outlined in the post. Geocoded tweets make up a tiny fraction of overall Twitter activity (could be as large as 5% or as small as less than 1%), so the actual number of tweets referencing these keywords is likely much, much larger, though we are not sure as to this number.


That said, we don't know what the geographical distribution of non-geocoded tweets is. However, given that many geocoded tweets are the product of GPS-enabled smart phones, it is likely that geocoded tweets tend to come from wealthier locations. All things being equal, this means that the geocoded data likely underrepresents relatively poorer and more rural locations. Should this actually be the case, the location quotients for Mississippi and Alabama would be even higher than our initial study showed, but the exact nature of this phenomenon is unknown.




Note: People concerned about our methodology should also check out our post on 11/12/2012 using geocoded tweets to locate the epicenter of an earthquake in Kentucky. (This paragraph was added at 10:20 am EST on 11/12/2012.)




Why didn’t you map references to hateful comments towards Mitt Romney? 

First, the motivation for this posting was the observations posted on the Jezebel blog linked in our original post, noting the uptick in racist tweets following President Obama’s re-election.  Second, we focus on racist language directed at President Obama not only because racism directed at black Americans is historically more significant, but also because it highlights the persistence of explicitly racist attitudes in what some have (fallaciously) termed ‘post-racial America’. Third, we did check both the number of tweets referencing Mitt Romney that contained racially charged terms and the number of derogatory comments about white people. Depending on the terminology used, the results show that there are 7-15x as many hateful tweets directed towards President Obama as towards Mitt Romney.

Finally, if this is your first response to our map, and not “that’s really f---ed up!”, then we probably have more important issues to deal with than the minutiae of our methodology. Though we do not endorse hatred, discrimination or violence against anyone, we refuse to acknowledge the equivalence of the terms being used to describe President Obama and Mitt Romney.

Did you remove uses of the “N word” that were positive?

No. We didn’t filter the tweets used in this database; however, a quick look at the data reveals that most are derogatory in nature. By leaving the data as is, we are more easily able to compare the number of references to, say, the kinds of comments about Mitt Romney people are clamoring for us to map, without inserting ourselves into an undoubtedly subjective filtering process. Regardless, even if we were to filter tweets, it very well might not change the overall spatial distribution, e.g., a filtered tweet could be from California or Alabama, leaving the map looking essentially the same as it currently does.

A further point is that the term ‘n----r’ is almost universally associated with negative, derogatory intent, as opposed to the more colloquialized (and appropriated by the black community) ‘n---a’, which a quick inspection of the data shows is used more positively. References to ‘n---a’ were not included in the study.

What about multiple tweets by the same individual?

As with our decision not to filter tweets based on their context, we did not filter based on multiple tweets by the same individual. However, a quick look at the map indicates that tweeting activity is not entirely concentrated at any individual point, meaning that barring the remote possibility of a hyper-mobile tweeter fixated on racist slurs or a racist Twitter bot, this is not enough of an issue to undermine our findings.

Moreover, when we returned to the data and looked at users rather than tweets, very little changes in the location quotients, with Alabama’s being even higher. We thus see this as being a moot point.
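
To make the difference between counting tweets and counting users concrete, here is a minimal Python sketch. The record fields (`user`, `state`) and the sample records are hypothetical and stand in for whatever the real dataset contains; this is not the code we ran.

```python
from collections import defaultdict

def count_by_state(tweets, unique_users=False):
    """Count tweets per state, or distinct users per state if unique_users is True."""
    if not unique_users:
        counts = defaultdict(int)
        for t in tweets:
            counts[t["state"]] += 1
        return dict(counts)
    users = defaultdict(set)
    for t in tweets:
        users[t["state"]].add(t["user"])  # a set keeps each user only once
    return {state: len(ids) for state, ids in users.items()}

# Hypothetical records: user "a" tweets twice from the same state
tweets = [
    {"user": "a", "state": "state_1"},
    {"user": "a", "state": "state_1"},
    {"user": "b", "state": "state_1"},
    {"user": "c", "state": "state_2"},
]
print(count_by_state(tweets))                     # {'state_1': 3, 'state_2': 1}
print(count_by_state(tweets, unique_users=True))  # {'state_1': 2, 'state_2': 1}
```

Either set of counts can then be fed into the same location quotient calculation.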

Are you saying I’m racist because I didn’t vote for Obama? Are you saying that everyone in a state that had more racist tweets is racist?

No and no. Nor do we imply such a thing anywhere in our original posts or our reactions to comments. However, we believe that the concentration of racist tweets in the South is indicative of the persistence of racism in the South, which is correlated with, though not necessarily causally-related to, statewide voting for Mitt Romney. Just because you live in Mississippi or Alabama does not make you a terrible person. If, however, you use the “N word” to degrade an individual or group of people, as the tweets we are talking about here do, it’s a different story altogether.

What else do you have to say for yourself?

This map and blog post have received more attention than we could have imagined, most of it positive and thought-provoking. Though racism undoubtedly remains a touchy subject, and one perhaps not best dealt with by fairly simple maps, we hoped to use this exercise to show the persistence of racism in the US, even with the country’s first black president being re-elected to a second term, and the need to address this head on, rather than counter such explicitly racist language and behavior with claims of ‘reverse racism’ as many of the critics of our map have done.

Of course, our map does not encompass the entirety of racism as it is experienced by black Americans, much less members of other groups who are systemically discriminated against, both in explicit language directed at these individuals and groups, as well as structural forms of racism that continually limit the ability of people to live happy, healthy and comfortable lives. As geographers, we like to think of ourselves as especially attuned to such issues. However, as the focus of this blog is dedicated to studying the world through the lens of the geoweb, we limit ourselves in this forum to analyses like those presented in the original post.

November 05, 2012

Can Twitter Predict the US Presidential Election?

Can Twitter predict the outcome of tomorrow's US presidential election? If the results of our preliminary analysis are anything to go by, then Barack Obama will be easily re-elected. The data presented below, including all geocoded tweets referencing Obama or Romney between October 1st and November 1st, out of a sample of about 30 million, give some insight into the visibility of each of the candidates on Twitter.


We see that if the election were decided purely based on Twitter mentions, then Obama would be re-elected quite handily. In fact, the only states in the electoral college that Romney would win are Maine, Massachusetts, New Mexico, Oregon, Pennsylvania, Utah, and Vermont. Romney also wins in the District of Columbia, and we unfortunately didn't collect data on Alaska or Hawaii. Some of the results seem to be interesting reflections of social and political characteristics of particular places. It makes sense that Romney has captured more of the public imagination in Utah, likely due to the state's considerable conservatism and large Mormon population, and Massachusetts, the state that he governed not all that long ago.

However, this drubbing that Romney receives in the Twitter electoral college belies the close nature of the final popular (Twitter) vote, re-raising the issue of whether the electoral college is the most suitable means of deciding the country's political future. There are a total of 132,771 tweets mentioning Obama and 120,637 mentioning Romney, giving Obama only 52.4% of the total and Romney 47.6%, a breakdown that is remarkably similar to current opinion polls, though not reflected when looking at the state-level aggregations in absolute terms. If you want to explore the data in more detail, please play around with the interactive map below:


We can also visualize the data using a sliding scale, so as to see how close the margin of victory is for each candidate in a given state.
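
For the curious, the share and margin behind such a scale are simple arithmetic. The sketch below applies it to the national totals quoted above; the function name is our own invention for illustration, and the same calculation would apply to any state's pair of counts.

```python
def shares_and_margin(count_a, count_b):
    """Return each candidate's percentage share of mentions and the gap between them."""
    total = count_a + count_b
    share_a = 100 * count_a / total
    share_b = 100 * count_b / total
    return share_a, share_b, share_a - share_b

# National totals from this post: 132,771 Obama mentions, 120,637 Romney mentions
obama, romney, margin = shares_and_margin(132771, 120637)
print(f"{obama:.1f}% vs {romney:.1f}%, margin {margin:+.1f} points")
# 52.4% vs 47.6%, margin +4.8 points
```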


Romney's largest margins of victory are in Pennsylvania and Massachusetts, while Obama's largest victories are in California and, strangely, Texas. The cases of Massachusetts and Texas, not to mention large portions of the South and the Plains states, likely point to the fact that many references on Twitter tend to be negative.

It is also worth noting that we compared Twitter mentions of both Vice-Presidential candidates: Biden and Ryan. Ryan, interestingly, wins the head-to-head competition in every single state. This makes for a rather boring map, so we decided to instead compare references to Ryan and Romney in the map below (Romney shaded in grey for his ebullient personality, and Ryan in pink as a result of his staunch support for gay rights).


As might be expected, there are more references to Romney in most states (Kansas, Michigan, North Dakota, Rhode Island, South Dakota, and Vermont being the exceptions here). However, when looking at total references, we again don't see a large gap between the two men. Ryan has 94,707 tweets compared to Romney's 120,637.

What do these data really tell us? Ultimately, I doubt that they will accurately predict the election, as Obama's seeming victory in Texas or Romney's in Massachusetts will almost certainly not come to pass. But they do certainly reveal that many internet users in California, Texas, and much of the rest of the country for that matter, tend to talk more about Obama than Romney. And, of course, in order to truly equate tweets with votes, we would need to employ sentiment analysis or manually read a large number of the election-related tweets in order to figure out whether we are seeing messages of support or more critical posts, as has been done in a couple of interesting projects by Twitter available here and here and another project by Esri available here.

Maybe the most revealing aspect of these data is that the 'popular vote' is split between the two candidates. While the social and political data shadows that we are picking up may not accurately tell us much about the electoral college results, when aggregated across the country they may be a rough indicator of tomorrow's outcome, pointing to the more-or-less equal and evenly divided nature of the American two-party political system. While this work may seem like a contemporary attempt at soothsaying, something we tend to shy away from, the data more appropriately serve as a useful benchmark in order to allow us to analyze what social media data shadows might actually reflect, as no matter the level of participation, they remain distorted mirrors on the offline material world.

November 01, 2012

The seven deadly sins: Sheepallenge 2012

Over the past couple of months/weeks we've been having a lot of fun with the Twitter data we've been pulling in through our DOLLY project.  We've looked at beer vs. church, binders full of women, and even Big Bird.   But why should we have all the fun?  Wouldn't you like to be a sheeple too?

So in that vein and despite the frankenstorm on the East Coast which has reduced Taylor to nibbling on dry Ramen as he makes maps, we're pushing forward with our November Sheepallenge.  Building upon the idea of IronSheep 2012 (in which teams were given the same datasets and tasked with making "tasty maps") we have provided Sheepallenge participants with a set of Twitter-derived data as part of a fantastical, allegorical, mapitorital competition taking place this month.  This is going to be so wicked cool!

After some brainstorming we decided to go with the theme of the Seven Deadly Sins (Envy, Gluttony, Greed, Lust, Pride, Sloth, Wrath) inspired in part by the cool mapping exercise by Mitchel Stimers and others at Kansas State University (here).  After all, the Twitter data from which we were pulling reflects the commentary of daily life.  What better source for uncovering the sins that lurk within the hearts and microblogging activities of Internet users? So we sat down and came up with a range of terms that we thought did a decent job of representing a sin (e.g., the term Big Mac for Gluttony or honor student for Pride) and compiled them into a "sindex" for each of the seven sins.  The sindex can be used as an aggregate measure or divided into its component parts (see metadata below).
(btw, Stimers et al maps are NOT based on tweets but indicators such as crime, income, etc.)
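
To give a feel for how a sindex can serve as both an aggregate and a set of components, here is a minimal Python sketch. It assumes a simple sum of per-term tweet counts, which may not match how the distributed data files are built. Only "big mac" and "honor student" come from this post; the second Gluttony term and all counts are hypothetical.

```python
# Terms per sin. "big mac" and "honor student" are mentioned in the post;
# "all you can eat" is a hypothetical extra term added for illustration.
SIN_TERMS = {
    "Gluttony": ["big mac", "all you can eat"],
    "Pride": ["honor student"],
}

def sindex(term_counts, sin):
    """Return (aggregate, components) for one sin from per-term tweet counts."""
    components = {term: term_counts.get(term, 0) for term in SIN_TERMS[sin]}
    return sum(components.values()), components

# Hypothetical tweet counts for one area
term_counts = {"big mac": 12, "all you can eat": 5, "honor student": 3}
total, parts = sindex(term_counts, "Gluttony")
print(total)  # 17
print(parts)  # {'big mac': 12, 'all you can eat': 5}
```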

The challenge to you is to make your own map(s) of the 7-deadly sins with our data.

Those of you who registered as research participants should have received an email with a link to download the data.  If you are just reading this now and are thinking "Man, I should have signed up," email Monica and ask to be hooked up (monica.stephens@humboldt.edu).  We can probably accommodate more participants, but no guarantees.  Currently we have 33 visualization groups/classes/people signed up from around the world, so we can't wait to see what we end up with.

Whoever creates the most interesting, fun, informative and aesthetically pleasing visualization or data-driven artwork will receive a prize and will have their visualization posted about here on FloatingSheep.org.

Rules for the Sheepallenge 2012
The rules are as follows:
  1. You may not post the raw data on the internet or redistribute it to others.  Please contact us before using the data for any other research purpose.  Commercial use is prohibited.
  2. Maps may be created in a range of formats, from static maps (e.g., choropleths, cartograms or cartoons) to animations or interactive maps.  Maps can be submitted as an attached .jpg or .pdf or can be a link to an interactive or animated map.
  3. Your visualization needs to use at least one of the data files included in the 7-sins data package. Adding additional data from other sources (e.g. census, crime stats) is definitely allowed. You can choose one sin, one aspect of a sin, or all the sins.
  4. For your visualization to be considered by the judges, you must email it to monica.stephens@humboldt.edu by November 30, 2012.   Monica will forward it on to the judges for consideration.
    1. Include a jpg/pdf of the actual map (or series of maps) or a link to an interactive or animated map. 
    2. Include a Word document that has (a) your name (or group), (b) your contact info (c) the specific seven sin dataset(s) you used and (d) a title/name for the map.  Although it is not required, feel free to include a short description/abstract of what you did, especially if you think it is cool/important.
  5. Multiple entries are allowed but submit each map separately as outlined above.
  6. Judges should not be bribed with chocolate.  (It is OK to bribe your local cartographer/GIS expert for help as long as you credit them in your work).
  7. The judges will select winners across a range of categories and map types.
  8. Winners will receive bragging rights, a carefully constructed electronic certificate (suitable for framing) and perhaps some FloatingSheep paraphernalia we have kicking around.
  9. By submitting a map you give the FloatingSheep.org blog permission to post it under the creative commons attribution-noncommercial-sharealike license we use for all our stuff.
  10. Goto rule 1.
Please direct all questions to Monica (monica.stephens@humboldt.edu) or Ate (ate.poorthuis@uky.edu).

May the best map win!

Metadata for Sheepallenge 2012
Much more extensive metadata is available with the data but the basics are:
  • The database is about 70 MB in size
  • Data covers all geotagged tweets made within the United States between June 26 and October 30
  • Keywords used (and associated sampling rates when appropriate) are available online
  • We have also included a range of other terms that may (or may not) fit in well with a particular sin (e.g., does Justin Bieber represent Lust? Or Pride?  Or Envy?).  Some of them (such as a random selection of tweets) will be useful for standardizing purposes.
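
One possible way to use the random selection of tweets for standardizing is to express keyword counts as a rate per 1,000 baseline tweets in each area. The Python sketch below is only a suggestion under that assumption; the function name, area labels and counts are all hypothetical, and the metadata shipped with the data may point to a different approach.

```python
def standardized_rate(keyword_counts, baseline_counts, per=1000):
    """Keyword tweets per `per` baseline (random-sample) tweets, by area."""
    rates = {}
    for area, baseline in baseline_counts.items():
        if baseline == 0:
            continue  # no baseline tweets, so no meaningful rate for this area
        rates[area] = per * keyword_counts.get(area, 0) / baseline
    return rates

# Hypothetical counts: same number of keyword tweets, different baselines
keyword_counts = {"A": 30, "B": 30}
baseline_counts = {"A": 1000, "B": 3000, "C": 0}
print(standardized_rate(keyword_counts, baseline_counts))
# {'A': 30.0, 'B': 10.0}
```

Areas A and B have the same raw count, but the rate shows the keyword is three times as common in A relative to overall tweeting activity.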

October 31, 2012

The Urban Geographies of Hurricane Sandy in New York City

Following our two earlier posts showing how discussions of Hurricane Sandy were reflected on Twitter, we present another representation of tweets, focused specifically on how New York City -- the center of both the storm's effects and the media attention around it -- tweeted about the storm.

The following map includes a broader temporal range of tweets, from October 24th of last week up to approximately 1:22pm on Tuesday, October 30th, as the storm was starting to subside and the damage could be more closely assessed. Tweets included in this dataset contain a direct reference to "Sandy" and include more-or-less precise latitude/longitude coordinates (as opposed to being geocoded to less specific scales such as the city or neighborhood level). This allows a greater level of precision at the cost of a significant number of tweets, though it still leaves us with nearly 16,000 individual observations to work with. In order to show density as opposed to individual points, tweets were then aggregated to the level of census blocks.
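
As a rough sketch of the filtering and aggregation described above, the Python below keeps tweets that mention "Sandy" and carry point coordinates, then counts them per census block. It assumes each tweet has already been assigned a `block_id` by a separate spatial join, which is not shown; the field names and sample records are hypothetical, and this is not the code used to make the map.

```python
from collections import Counter

def sandy_counts_by_block(tweets):
    """Count tweets mentioning 'Sandy' with point coordinates, per census block."""
    counts = Counter()
    for t in tweets:
        has_point = t.get("lat") is not None and t.get("lon") is not None
        if has_point and "sandy" in t["text"].lower():
            counts[t["block_id"]] += 1
    return counts

# Hypothetical records; block_id is assumed to come from a prior spatial join
tweets = [
    {"text": "Sandy is here", "lat": 40.7, "lon": -74.0, "block_id": "b1"},
    {"text": "sandy flooding", "lat": 40.7, "lon": -74.0, "block_id": "b1"},
    {"text": "Sandy!", "lat": None, "lon": None, "block_id": None},
    {"text": "coffee", "lat": 40.8, "lon": -73.9, "block_id": "b2"},
]
print(dict(sandy_counts_by_block(tweets)))  # {'b1': 2}
```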

Although we definitely see some larger clusters, it is remarkable how spatially dispersed the tweeting about Sandy was. The majority of tweets are located in midtown Manhattan, which was not only the location of the last open Starbucks in the city, but was also hit by widespread power outages. The concentration of tweets around the southern tip of Central Park is likely caused by the infamous dangling crane (and subsequent evacuations) at 57th Street.

While some areas that were hit by flooding see a pattern of increased tweet activity -- for example Battery Park, Dumbo, LaGuardia and Hudson River Park -- it is surprising how few tweets we find in areas that were hit especially hard or where significant events happened. In Breezy Point (not included in the map) a fire destroyed more than eighty homes, but only a handful of tweets come from that same location. Similarly, Sandy inflicted very significant damage to large parts of Rockaway and Coney Island with very little mention in these places on Twitter. Other major events covered by the media, such as the evacuation of the NYU Medical Center just north of Stuyvesant Town or the explosion at ConEd's power station on 14th Street, also see only a few tweets in the immediate vicinity, though perhaps owing to the fact that individuals in these locations would be more concerned about safety than tweeting.

It seems that, when zooming in on the urban scale, the location and density of tweets do not necessarily correlate with the areas most affected by Sandy. As the hurricane brought the city to a grinding halt, with businesses and schools closing ahead of the storm, Sandy appears to have been tweeted from the -- relatively -- safe confines of the home, as opposed to the many locations throughout the city which were hard hit, but relatively unrepresented in this virtual representation.

Ultimately, we're left wondering whether Hurricane Sandy represents a case distinct from that of Hurricane Katrina? Though the areas that were the most tweeted from in this case represent both the most densely populated and most well-off, areas such as Harlem don't mirror the experience of Katrina in being devastated by the storm and then wiped off of the virtual representation of the event. Or, as Mark indicated in his earlier post, is it simply difficult to ascertain much from such finely-grained data in cities? Or, as the relative lack of discussion about the devastation Sandy has caused in the Caribbean indicates, has the location of the storm in arguably the world's most important city simply deflected media attention away from other locations?

We don't offer these as definitive conclusions, but instead as provocations, as much deeper analysis needs to be undertaken to more fully understand the relationship between such intensely material events as Hurricane Sandy and virtual representations of them through platforms like Twitter.

For a good reference on areas hard hit by the storm, see this from the New York Times: http://www.nytimes.com/interactive/2012/10/30/nyregion/hurricane-sandys-aftermath.html

Hurricane Sandy and the Geographies of Flooding on Twitter

With the worst of Hurricane Sandy now past, we wanted to build on our initial map of references to "Frankenstorm" and construct a fuller picture of how the storm was represented and discussed on Twitter. The first alternative representation we offer visualizes how Twitter discussed the most obvious impact of the storm, the massive flooding (felt particularly acutely in New York City) that has not only disrupted the everyday functioning of the city, but has also likely had long-lasting impacts on many individual lives and the way we prepare for and attempt to manage such 'natural' disasters.

To begin, we have been collecting tweets containing the terms "flood" and "flooding" in order to examine how Twitter usage might reflect lived experiences of the storm. By examining the digital data shadows of an intensely material event, we can hope to gain some understanding of the intertwining and interfacing of virtual and material spaces, apart from the immediate consequences of this particular event.
An interactive version of this map is available at:

The map reveals a few important findings. First, like the map of references to Frankenstorm, tweets referencing flooding are almost exactly where you would expect them to be; in other words, the vast majority of tweets were located in the path of the hurricane. Nonetheless, it is interesting that so few people elsewhere in the US are tweeting about the unprecedented flooding and resulting damage taking place on the East Coast. In this sense, the geography of data shadows drawn from Twitter appears to be quite effective at reflecting experiences of the storm. The hurricane, in essence, leaves a digital trail.

Second, we are able to see that these data become significantly less useful if we want to draw insights at a scale finer than the county level. Until noon GMT on Tuesday, October 30th, there were only 5,209 geocoded tweets about flooding, a fairly small number over such a broad area. We even initially intended to map references in both English and Spanish to reflect the potential differences in experience between different linguistic groups affected by the storm, but despite the millions of Spanish-speakers undoubtedly affected, we were only able to collect five Spanish-language tweets!

In other words, it is the absences on this map that are almost more interesting than the mapped results. The lack of published content in Spanish means that we are necessarily only including published content from English speakers in these representations. The absences in the rest of the country are also revealing. Why are so few people in Kentucky, Missouri, Wisconsin, etc. tweeting about East Coast flooding? Is it because the act of tweeting about such an event is only really likely to be performed by people in situ, experiencing the storm? Are people outside the direct path of the hurricane interested in other impacts apart from flooding (for instance, the significant snowfall in parts of central Appalachia)? Are they interested at all? Or does the necessarily limited representation offered by Twitter constrain any possible explanations?

October 29, 2012

Mapping the Frankenstorm on Twitter

As some of us hunker down in our fortified bunker in Worcester, Massachusetts awaiting the Frankenstorm, and others hang out in the California sunshine, we thought we'd contribute our collective two cents to the discussion of the ongoing storm via mapping -- from Google's crisis map to the New York Times' map of the hurricane's expected path -- in the form of a visualization of Twitter activity around the storm [1]. 

It didn't take long for the term "Frankenstorm" to catch on. Shortly after the National Oceanic and Atmospheric Administration first used the term this past Thursday, the first geotagged tweet was created by @SStirling, a data journalist for the Star-Ledger newspaper in Newark, NJ, around 11:06am that day.

Since then, well over 7,000 geotagged tweets referencing the Frankenstorm have been created in North America. The dataset used here includes exactly 7,056 geotagged tweets collected from DOLLY, from the very first mentioned above until approximately 12:36pm EST on Monday, October 29, just as the storm was starting to pick up along the east coast [2].

Mapping the Frankenstorm

After aggregating the tweets to the county level, a quick glance reveals some striking, if unsurprising, patterns. Despite being a major national news event, Twitter activity around the storm has been incredibly concentrated along the east coast where the storm is expected to hit the hardest, demonstrating a clear connection to the places in the path of the storm. While itself not surprising, the precise level of concentration is a bit more startling. Indeed, over 40% of the total number of geotagged tweets referencing Frankenstorm in this sample come from just eight counties along the east coast.

And while these counties represent four of the ten largest metropolitan areas in the United States, and the four largest along the east coast, the concentration within these areas demonstrates the extent to which areas which might be just as hard hit -- such as rural Vermont during 2011's Hurricane Irene -- are relatively underrepresented in the virtual reflection of events such as these. But perhaps more interesting than the cluster of references along the east coast is the anomalous concentration of references all the way across the country in southern California.

Frankenstorm Hot Spots

Just 23 counties across the United States had more tweets referencing the Frankenstorm than Los Angeles, which had 46 tweets, breaking up what would otherwise represent a clear effect of distance decay in predicting the number of tweets referencing the Frankenstorm. In contrast to the relatively concentrated pattern discussed above, a cluster of references comparable in significance to those in areas of Maine, Pennsylvania and Virginia pops up over 2,000 miles away from the path of the storm, while areas in between in the American South and Midwest show no such clusters.

Though L.A.'s large population makes this concentration of activity somewhat less surprising, the city's position within the national (and global) urban hierarchy offers a somewhat more interesting (at least to geographers!) explanation. When considering L.A.'s centrality within the global air transportation system and the fact that thousands of flights have been affected by the storm, there emerges a range of alternative explanations emphasizing the relationship between Los Angeles and the cities along the east coast more directly affected by the storm. For instance, at least a handful of tweets, like those below from @robyntomlin and @paulhogarth, specifically reference air travel from Los Angeles to the east coast and into the path of the impending Frankenstorm.

So while the analysis presented here is more a confirmation than a revelation, it clearly shows the persistent connections between space and place in online networks like Twitter, as well as how geotagged Twitter content represents a promising way of demonstrating these connections between the virtual and the material. With the worst of the Frankenstorm still yet to come -- as are thousands more tweets, we're sure -- we hope everyone continues to stay safe and dry in the coming days... and that someone goes ahead and starts working on the next ridiculous name for a major storm so that we can do this again in the future!
-----------
[1] Despite there being a range of possible keywords here -- such as "Hurricane Sandy" itself -- we, in typical Floatingsheep fashion, chose only to map the more ridiculous "Frankenstorm". As such, the analysis here is tempered by this limitation.
[2] No members of the Floatingsheep collective were harmed in the making of this map. Taylor did, however, bravely venture out into the Frankenstorm to make it to the office in order to produce these maps, and his pants were appropriately drenched for this effort.