Building such a massive database presented a number of serious challenges. These have been overcome by combining Cassandra as the data store, Elasticsearch as the full-text search engine, and a number of worker nodes that process incoming data from Twitter through a variety of scripts written in Ruby. The entire system runs on a virtual private cloud, operated in conjunction with the computer service department in the College of Arts and Sciences at the University of Kentucky. It currently consists of 18 virtual servers, with new servers and capacity added on an as-needed basis. Designed for redundancy and high availability, the system keeps running even in the event of hardware failure. This has proven critical because Twitter data is streamed live and is lost if not processed and stored immediately.
Programming Schematic Behind DOLLY
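To make the division of labor between worker, data store and search index more concrete, the following is a minimal sketch of what a single worker step might look like. It is written in Python for illustration only (the actual workers are Ruby scripts), and it is not DOLLY's code. The field names follow the classic Twitter JSON format (`id_str`, `text`, `created_at`, `coordinates`), the `InMemoryBackend` class is a hypothetical stand-in for Cassandra and Elasticsearch, and the rule that only tweets with point coordinates are kept is an assumption.

```python
import json
from collections import defaultdict


def parse_tweet(raw_line):
    """Turn one JSON line from the stream into a flat record.

    Returns None for tweets without point coordinates (assumed filter rule).
    """
    tweet = json.loads(raw_line)
    coords = tweet.get("coordinates")
    if not coords:
        return None
    lon, lat = coords["coordinates"]  # GeoJSON order: longitude, then latitude
    return {
        "id": tweet["id_str"],
        "text": tweet["text"],
        "created_at": tweet["created_at"],
        "lat": lat,
        "lon": lon,
    }


class InMemoryBackend:
    """Hypothetical stand-in for the data store plus the full-text index."""

    def __init__(self):
        self.records = {}              # plays the role of the data store
        self.index = defaultdict(set)  # plays the role of the search index

    def store(self, record):
        # Write the record and index its text in one step, so that nothing
        # is searchable without also being stored.
        self.records[record["id"]] = record
        for token in record["text"].lower().split():
            self.index[token].add(record["id"])

    def search(self, term):
        ids = self.index.get(term.lower(), ())
        return [self.records[i] for i in sorted(ids)]


def process_stream(lines, backend):
    """Process each incoming line immediately; return how many were stored."""
    stored = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        record = parse_tweet(line)
        if record is not None:
            backend.store(record)
            stored += 1
    return stored
```

In a real deployment the two dictionaries would be replaced by writes to the database cluster and the search cluster, and several such workers would run in parallel so that one failing node does not interrupt the stream.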
With the completion of a robust and stable back-end, the current focus is on developing a user-friendly front-end that allows for easy exploration and analysis of the data. The goal is to create an interface that enables any researcher, including those without a computer science degree, to access, explore and analyze big geosocial media data. In the current iteration, a researcher can run full-text searches against the database in real time, visualize the results spatially and temporally, and export the results as .csv for further offline analysis in dedicated software such as R or ArcGIS. While the data is currently limited to Twitter, the framework in place can easily be leveraged to include other sources as well.
Screenshot of a Normalized DOLLY search on "grits"
(Purple areas indicate lower than average frequency of tweets containing the term grits; blue to red areas indicate higher than average frequency of tweets with the term grits)
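The idea of a normalized search can be illustrated with a short sketch. The source does not state which formula DOLLY uses, so the version below is an assumption: for each area, the share of tweets containing the term is divided by the share across all areas, giving a ratio above 1 where the term is more frequent than average and below 1 where it is less frequent. The function name and the area labels are hypothetical.

```python
def normalized_frequency(term_counts, total_counts):
    """Compare each area's term rate to the overall term rate.

    term_counts:  {area: number of tweets containing the term}
    total_counts: {area: number of all tweets in that area}
    Returns {area: ratio}; a ratio > 1 means higher than average,
    a ratio < 1 means lower than average.
    """
    overall_total = sum(total_counts.values())
    overall_term = sum(term_counts.get(area, 0) for area in total_counts)
    if overall_total == 0 or overall_term == 0:
        return {}  # nothing to compare against
    overall_rate = overall_term / overall_total

    ratios = {}
    for area, total in total_counts.items():
        if total == 0:
            continue  # no tweets at all, so no meaningful rate
        local_rate = term_counts.get(area, 0) / total
        ratios[area] = local_rate / overall_rate
    return ratios


# Usage with made-up counts for two hypothetical grid cells.
term_counts = {"cell_a": 30, "cell_b": 10}
total_counts = {"cell_a": 100, "cell_b": 300}
for area, ratio in sorted(normalized_frequency(term_counts, total_counts).items()):
    label = "higher" if ratio > 1 else "lower"
    print(f"{area}: {ratio:.2f} ({label} than average)")
```

Normalizing this way keeps densely populated areas, which produce more tweets on every topic, from dominating the map simply because of their volume.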
DOLLY also forms the basis for establishing the Department of Geography at the University of Kentucky as a key center for critical research on big geosocial media data. We see DOLLY both as a key tool for our own work and as a means to break down the technological barrier that often confronts researchers who would like to study big data but do not necessarily possess the required technical skills.
The DOLLY Project (Data On Local Life and You) was built with pilot grants from the Vice President of Research, the College of Arts and Sciences and the Geography Department at the University of Kentucky.
Key DOLLY Personnel
- Mark Morcos, Principal Engineer
- Ate Poorthuis, Architect of DOLLY Data Collection and Storage
- Matthew Zook, Project Director