I have been doing a fair amount of work with Hadoop lately in my professional life, mainly using streaming MapReduce and Pig to extract additional data from web logs. It is a powerful paradigm. Before the election, I wanted to develop a way to look at data generated during the election period. Twitter is a communication tool that is often trivialized, but it is a powerful way to promote ideas and for mass sentiment to be made known.
Twitter has a powerful streaming API, which lets Twitter push data to the client in bulk. PHP is a tool I have often used for rapid development, but it usually lacks a multi-threaded model and libraries that implement features like Twitter’s streaming API. Twitter4J is a good Java library for working with Twitter, and it also runs on Android. It allowed me to capture a significant amount of data for analysis. The code had matured significantly by the time the town hall debate took place, which led to capturing good-quality data. That run used the Query Stream, which allowed me to filter the global data set that Twitter is and limit it to the United States and to topics relating to the debate and the presidential election. Wanting to do more work with Hadoop’s Java libraries and features, I wrote the Hadoop MapReduce jobs in Java and set up a single pseudo-distributed node to process the data. These are the results, imported into Google Spreadsheets.