DOLLY

The DOLLY Project (Digital OnLine Life and You) is a repository of billions of geolocated tweets that allows for real-time research and analysis. Building on top of existing open source technology, the Floating Sheep team (under the technical direction of Ate Poorthuis and with the invaluable assistance of HIVE, the College of Arts and Sciences computing department) has created a back-end that ingests all geotagged tweets (~8 million a day) and performs basic analysis, indexing, and geocoding to allow real-time search across the entire database (3 billion tweets since December 2011). You can get a sense of what the interface is like by watching two short videos demonstrating a search for "grits" in the United States and "Sint Maarten" in Europe.

Building such a massive database presented a number of serious challenges. These have been overcome through a combination of Cassandra as the data store, Elasticsearch as the full-text search engine, and a number of worker nodes that process incoming data from Twitter through a variety of scripts written in Ruby. The entire system runs on a virtual private cloud operated in conjunction with the computer service department of the College of Arts and Sciences at the University of Kentucky. It currently consists of 18 virtual servers, with new servers and capacity added on an as-needed basis. Designed for redundancy and high availability, the system keeps running even in the event of hardware failure; this has proven critical because Twitter data is streamed live and is lost if not processed and stored immediately.
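To make the division of labor concrete, below is a minimal sketch of what one such ingest worker might look like. It is written in Python rather than the project's Ruby, and the host names, keyspace, index, and field mappings are illustrative assumptions, not DOLLY's actual schema: the raw tweet goes to Cassandra for durable storage, while only the searchable fields are indexed in Elasticsearch.

    import json
    from cassandra.cluster import Cluster    # pip install cassandra-driver
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    # Hosts, keyspace, and index names below are illustrative placeholders.
    session = Cluster(["cassandra-node1"]).connect("dolly")
    es = Elasticsearch(["http://es-node1:9200"])

    def process_tweet(raw_json):
        """Store one raw tweet durably, then index its searchable fields."""
        tweet = json.loads(raw_json)
        if tweet.get("coordinates") is None:
            return  # keep only geotagged tweets

        # Durable store: the full, untouched payload keyed by tweet id.
        session.execute(
            "INSERT INTO tweets (id, payload) VALUES (%s, %s)",
            (tweet["id"], raw_json),
        )

        # Search index: only the fields needed for full-text and geo queries.
        lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: lon, lat
        es.index(
            index="tweets",
            id=tweet["id_str"],
            body={
                "text": tweet["text"],
                "created_at": tweet["created_at"],
                "location": {"lat": lat, "lon": lon},  # assumes a geo_point mapping
            },
        )

Splitting storage from indexing this way is what allows the search cluster to be rebuilt or remapped without touching the authoritative copy of the data.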


Programming Schematic Behind DOLLY

With the completion of a robust and stable back-end, the current focus is on developing a user-friendly front-end that allows for easy exploration and analysis of the data. The goal is to create an interface that enables any researcher (with or without a computer science degree) to access, explore, and analyze big geosocial media data. In the current iteration, a researcher can run full-text searches against the database in real time, visualize the results spatially and temporally, and export the results as a .csv file for further analysis offline in dedicated software such as R or ArcGIS. While the data is currently limited to Twitter, the framework in place can easily be leveraged to include other sources as well.
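In terms of the sketch above, such a query is essentially a full-text match plus a time filter, with the hits flattened to rows for export. A minimal sketch, again with illustrative host, index, and field names (and assuming created_at is mapped as a date):

    import csv
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch(["http://es-node1:9200"])  # illustrative host

    # Full-text search for "grits", limited to one month of tweets.
    results = es.search(
        index="tweets",
        body={
            "query": {
                "bool": {
                    "must": {"match": {"text": "grits"}},
                    "filter": {"range": {"created_at": {
                        "gte": "2012-06-01", "lt": "2012-07-01"}}},
                }
            },
            "size": 1000,
        },
    )

    # Flatten the hits to a .csv for offline analysis in R or ArcGIS.
    with open("grits_tweets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "lat", "lon", "text"])
        for hit in results["hits"]["hits"]:
            doc = hit["_source"]
            writer.writerow([hit["_id"], doc["created_at"],
                             doc["location"]["lat"], doc["location"]["lon"],
                             doc["text"]])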

Screenshot of a normalized DOLLY search on "grits"
(Purple areas indicate a lower-than-average frequency of tweets containing the term "grits"; blue to red areas indicate a higher-than-average frequency.)

DOLLY also forms the basis for establishing the Department of Geography at the University of Kentucky as a key center for critical research on big geosocial media data. We see DOLLY both as a key tool for our own work and as a means to break down the technological barrier often faced by researchers who would like to study big data but do not necessarily possess the required technical skills.

The DOLLY Project has been built leveraging pilot grants from the Vice President for Research, the College of Arts and Sciences, and the Department of Geography at the University of Kentucky.

Key DOLLY Personnel 

  • Mark Morcos, Principal Engineer
  • Ate Poorthuis, Architect of DOLLY Data Collection and Storage
  • Matthew Zook, Project Director

Comments

  1. Interesting how only about three-quarters of the South uses the term "grits" in tweets.

  2. Are DOLLY Project resources available to entities outside of the University of Kentucky (e.g., geospatial consulting firms) for research purposes?

  3. We're working on how we can make it more accessible, but at this time DOLLY is limited to the University of Kentucky and associated academic researchers.

    1. Can one become an "associated researcher" via affiliation with another institution, e.g., Penn State?

  4. Can you add "deaf" to the disabled list?

    1. I hope they were able to accommodate your request. You are Jason LAMBerton!

  5. An open-source version of DOLLY would be magnificent (e.g., on GitHub). The nerdy masses are your friends.

  6. Does anybody know how I can download geotagged tweets?

    Thanks!

    1. Sorry for replying a year late, Morad, but I would imagine that one would need to make use of Twitter’s Streaming API, described at:

      https://dev.twitter.com/docs/streaming-apis

    2. Yeah, but very few of them are geotagged. It takes millions of tweets to get a few thousand geotagged ones.

    3. The Streaming API is only a subset, though. They should use Twitter's GNIP, correct?

    4. (link to code example at end) The Twitter Streaming API returns a volume equal to up to 1% of the total number of tweets being sent at any one time. Thus, if you use the filter API to monitor a relatively small or inactive area, where the total number of geotagged tweets never exceeds 1% of tweets globally, you should have complete or nearly complete coverage. (A minimal sketch of this approach appears after the comments.)

      @Jim - It sounds like you are monitoring for keywords and then filtering for geotagged tweets. Perhaps try the opposite: collect geotagged tweets for a given area and then select only the ones containing the keywords you want.

      For anyone getting started with the Twitter API, some example code is freely available at https://github.com/computermacgyver/twitter-python

  7. Can I do a term search on your DB and get tweets? I'm studying hydraulic fracturing and would love all the tweets with geotagged data. GNIP (now Twitter) has allowed us to collect tweets via key terms and other tags, but only a limited number of them carry geotags, and searches are expensive.

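Following up on the bounding-box suggestion in reply 4 above, here is a minimal sketch of such a collector. It assumes the tweepy 3.x streaming client and placeholder credentials; the bounding box (roughly Kentucky) is illustrative.

    import json
    import tweepy  # pip install "tweepy<4"

    # Placeholder credentials obtained from https://dev.twitter.com
    CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
    ACCESS_TOKEN, ACCESS_SECRET = "...", "..."

    class GeoListener(tweepy.StreamListener):
        def on_status(self, status):
            # Keep only tweets carrying an exact coordinate pair.
            if status.coordinates:
                print(json.dumps(status._json))

        def on_error(self, status_code):
            return status_code != 420  # disconnect on rate limiting (420)

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    stream = tweepy.Stream(auth=auth, listener=GeoListener())

    # Bounding box as SW lon, SW lat, NE lon, NE lat.
    stream.filter(locations=[-89.6, 36.5, -81.9, 39.2])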
