DOLLY

The DOLLY Project (Digital OnLine Life and You) is a repository of billions of geolocated tweets that allows for real-time research and analysis. Building on top of existing open source technology, the Floating Sheep team (under the technical direction of Ate Poorthuis and with the invaluable assistance of HIVE, the College of Arts and Sciences computing department) has created a back-end that ingests all geotagged tweets (~8 million a day) and performs basic analysis, indexing, and geocoding to allow real-time search across the entire database (3 billion tweets since December 2011). You can get a sense of what the interface is like by watching two short videos demonstrating a search for "grits" in the United States and "Sint Maarten" in Europe.

Building such a massive database presented a number of serious challenges. These have been overcome through a combination of Cassandra as the data store, Elasticsearch as the full-text search engine, and a number of worker nodes that process incoming data from Twitter through a variety of scripts written in Ruby. The entire system runs on a virtual private cloud operated in conjunction with the computer service department of the College of Arts and Sciences at the University of Kentucky. It currently consists of 18 virtual servers, with new servers and capacity added on an as-needed basis. Designed for redundancy and high availability, the system keeps running even in the event of hardware failure; this has proven critical because Twitter data is streamed live and is lost if not processed and stored immediately.
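To make the division of labor concrete, below is a minimal sketch of what one such ingest worker might look like. It is written in Python rather than the project's Ruby, and the host names, keyspace, index, and field mappings are illustrative assumptions, not DOLLY's actual schema: the raw tweet goes to Cassandra for durable storage, while only the searchable fields are indexed in Elasticsearch.

    import json
    from cassandra.cluster import Cluster    # pip install cassandra-driver
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    # Hosts, keyspace, and index names below are illustrative placeholders.
    session = Cluster(["cassandra-node1"]).connect("dolly")
    es = Elasticsearch(["http://es-node1:9200"])

    def process_tweet(raw_json):
        """Store one raw tweet durably, then index its searchable fields."""
        tweet = json.loads(raw_json)
        if tweet.get("coordinates") is None:
            return  # keep only geotagged tweets

        # Durable store: the full, untouched payload keyed by tweet id.
        session.execute(
            "INSERT INTO tweets (id, payload) VALUES (%s, %s)",
            (tweet["id"], raw_json),
        )

        # Search index: only the fields needed for full-text and geo queries.
        lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: lon, lat
        es.index(
            index="tweets",
            id=tweet["id_str"],
            body={
                "text": tweet["text"],
                "created_at": tweet["created_at"],
                "location": {"lat": lat, "lon": lon},  # assumes a geo_point mapping
            },
        )

Splitting storage from indexing this way is what allows the search cluster to be rebuilt or remapped without touching the authoritative copy of the data.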


Programming Schematic Behind DOLLY

With the completion of a robust and stable back-end, the current focus is on developing a user-friendly front-end that allows for easy exploration and analysis of the data. The goal is to create an interface that enables any researcher (with or without a computer science degree) to access, explore, and analyze big geosocial media data. In the current iteration, a researcher can run full-text searches against the database in real time, visualize the results spatially and temporally, and export the results as a .csv file for further analysis offline in dedicated software such as R or ArcGIS. While the data is currently limited to Twitter, the framework in place can easily be leveraged to include other sources as well.
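In terms of the sketch above, such a query is essentially a full-text match plus a time filter, with the hits flattened to rows for export. A minimal sketch, again with illustrative host, index, and field names (and assuming created_at is mapped as a date):

    import csv
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch(["http://es-node1:9200"])  # illustrative host

    # Full-text search for "grits", limited to one month of tweets.
    results = es.search(
        index="tweets",
        body={
            "query": {
                "bool": {
                    "must": {"match": {"text": "grits"}},
                    "filter": {"range": {"created_at": {
                        "gte": "2012-06-01", "lt": "2012-07-01"}}},
                }
            },
            "size": 1000,
        },
    )

    # Flatten the hits to a .csv for offline analysis in R or ArcGIS.
    with open("grits_tweets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "lat", "lon", "text"])
        for hit in results["hits"]["hits"]:
            doc = hit["_source"]
            writer.writerow([hit["_id"], doc["created_at"],
                             doc["location"]["lat"], doc["location"]["lon"],
                             doc["text"]])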

Screenshot of a normalized DOLLY search on "grits"
(Purple areas indicate a lower-than-average frequency of tweets containing the term "grits"; blue to red areas indicate a higher-than-average frequency.)

DOLLY also forms the basis for establishing the Department of Geography at the University of Kentucky as a key center for critical research on big geosocial media data. We see DOLLY both as a key tool for our own work and as a means to break down the technological barrier often faced by researchers who would like to study big data but do not necessarily possess the required technical skills.

The DOLLY Project has been built leveraging pilot grants from the Vice President for Research, the College of Arts and Sciences, and the Department of Geography at the University of Kentucky.

Key DOLLY Personnel 

  • Mark Morcos, Principal Engineer
  • Ate Poorthuis, Architect of DOLLY Data Collection and Storage
  • Matthew Zook, Project Director

Comments

  1. Interesting how only about three-quarters of the South uses the term "grits" in tweets.

  2. Are DOLLY Project resources available to entities outside of the University of Kentucky (e.g., geospatial consulting firms) for research purposes?

  3. We're working on how we can make it more accessible, but at this time DOLLY is limited to the University of Kentucky and associated academic researchers.

    1. Can one become an "associated researcher" via affiliation with another institution, e.g., Penn State?

  4. Can you add "deaf" to the disabled list?

    1. I hope they were able to accommodate your request. You are Jason LAMBerton!

  5. An open-source version of DOLLY would be magnificent (e.g., on GitHub). The nerdy masses are your friends.

  6. Does anybody know how I can download geotagged tweets?

    Thanks!

    1. Sorry for replying a year late, Morad, but I would imagine that one would need to make use of Twitter’s Streaming API, described at:

      https://dev.twitter.com/docs/streaming-apis

    2. Yeah, but very few of them are geotagged. It takes millions of tweets to get a few thousand geotagged ones.

    3. The Streaming API is only a subset, though. They should use Twitter's GNIP, correct?

    4. (link to code example at end) The Twitter Streaming API returns a volume equal to up to 1% of the total number of tweets being sent at any one time. Thus, if you use the filter API to monitor a relatively small or inactive area, where the total number of geotagged tweets never exceeds 1% of tweets globally, you should have complete or nearly complete coverage. (A minimal sketch of this approach appears after the comments.)

      @Jim - It sounds like you are monitoring for keywords and then filtering for geotagged tweets. Perhaps try the opposite: collect geotagged tweets for a given area and then select only the ones containing the keywords you want.

      For anyone getting started with the Twitter API, some example code is freely available at https://github.com/computermacgyver/twitter-python

  7. Can I do a term search on your DB and get tweets? I'm studying hydraulic fracturing and would love all the tweets with geotagged data. GNIP (now Twitter) has allowed us to collect tweets via key terms and other tags, but only a limited number of them carry geotags, and searches are expensive.

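Following up on the bounding-box suggestion in reply 4 above, here is a minimal sketch of such a collector. It assumes the tweepy 3.x streaming client and placeholder credentials; the bounding box (roughly Kentucky) is illustrative.

    import json
    import tweepy  # pip install "tweepy<4"

    # Placeholder credentials obtained from https://dev.twitter.com
    CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
    ACCESS_TOKEN, ACCESS_SECRET = "...", "..."

    class GeoListener(tweepy.StreamListener):
        def on_status(self, status):
            # Keep only tweets carrying an exact coordinate pair.
            if status.coordinates:
                print(json.dumps(status._json))

        def on_error(self, status_code):
            return status_code != 420  # disconnect on rate limiting (420)

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    stream = tweepy.Stream(auth=auth, listener=GeoListener())

    # Bounding box as SW lon, SW lat, NE lon, NE lat.
    stream.filter(locations=[-89.6, 36.5, -81.9, 39.2])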
