A tour of birding’s implementation

Python Twitter Client

There are many Python packages for Twitter. The Python Twitter Tools project (pip install twitter) is of interest because:

  1. It has a command-line application for retrieving Twitter activity, which includes a straightforward PIN-based workflow for logging in to Twitter and obtaining OAuth credentials.
  2. It provides Python APIs which bind to Twitter’s public APIs in a dynamic and predictable way: Python attribute and method names translate to URL paths, e.g. twitter.statuses.friends_timeline() retrieves data from http://twitter.com/statuses/friends_timeline.json.
  3. The OAuth credentials saved by the command-line tool can be readily used when making API calls using the package.
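
The attribute-to-URL translation described in item 2 can be illustrated with a small, self-contained sketch. This is not the Python Twitter Tools implementation; PathBinding is a hypothetical class that only builds the URL a call would request, assuming each attribute access adds one path segment and the call appends .json.

```python
class PathBinding:
    """Toy stand-in that turns attribute access into a URL path."""

    def __init__(self, base_url, parts=()):
        self.base_url = base_url
        self.parts = tuple(parts)

    def __getattr__(self, name):
        # Called only for attributes not found normally; each access
        # returns a new binding with one more path segment.
        if name.startswith("_"):
            raise AttributeError(name)
        return PathBinding(self.base_url, self.parts + (name,))

    def url(self):
        return "{}/{}.json".format(self.base_url, "/".join(self.parts))

    def __call__(self, **params):
        # A real client would issue the HTTP request here; this sketch
        # only reports what would be requested.
        return self.url(), params


api = PathBinding("http://twitter.com")
url, params = api.statuses.friends_timeline(count=5)
print(url)  # http://twitter.com/statuses/friends_timeline.json
```

Because the path is derived from the attribute names, no per-endpoint method has to be written by hand, which is what makes such bindings predictable.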

Twitter API

To ease configuration, birding adds a from_oauth_file() method which creates a Twitter binding using the OAuth credential file created by the twitter command-line application. The twitter command need only be run once to create this file, which is saved in the user home directory at ~/.twitter_oauth. Once that file is in place, Twitter API interactions look like this:
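
As a rough sketch of the pattern, an alternate constructor can load the credentials and then build the client. Everything here is illustrative rather than birding’s own code: SketchClient, read_oauth_file and default_oauth_path are hypothetical names, and the file layout (token on the first line, token secret on the second) is an assumption.

```python
import os
from collections import namedtuple

OAuthToken = namedtuple("OAuthToken", ["token", "token_secret"])


def default_oauth_path():
    """Return the path named in the text: ~/.twitter_oauth."""
    return os.path.join(os.path.expanduser("~"), ".twitter_oauth")


def read_oauth_file(filepath=None):
    """Read a token and token secret from the first two lines of a file.

    Assumption: the credential file stores the token on line one and
    the token secret on line two.
    """
    if filepath is None:
        filepath = default_oauth_path()
    with open(filepath) as f:
        token = f.readline().strip()
        token_secret = f.readline().strip()
    return OAuthToken(token, token_secret)


class SketchClient:
    """Hypothetical stand-in for a Twitter binding class."""

    def __init__(self, auth):
        self.auth = auth

    @classmethod
    def from_oauth_file(cls, filepath=None):
        # Alternate constructor: load credentials, then build the client.
        return cls(auth=read_oauth_file(filepath))
```

With a credential file in place, SketchClient.from_oauth_file() needs no arguments, which keeps configuration out of the calling code.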

Search Manager

It is useful to solve the problem itself before being concerned with details about the topology. birding’s TwitterSearchManager wraps the Twitter object in higher-level methods which perform the processing steps needed for the given Problem statement & topology. A full interaction before applying Storm looks like this (see In[2]):
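
The composition idea can be sketched without Storm or network access. The names below (SearchManagerSketch, StubClient, and their methods) are hypothetical and are not TwitterSearchManager’s actual interface; the sketch assumes a search returns a dict with a "statuses" list whose items carry an "id", and that a second call looks up full details by id.

```python
class SearchManagerSketch:
    """Hypothetical manager wrapping a client with task-level methods."""

    def __init__(self, client):
        self.client = client

    def search(self, term, **kwargs):
        # Step 1: search for statuses matching the term.
        return self.client.search(q=term, **kwargs)

    def lookup_search_result(self, result, **kwargs):
        # Step 2: look up full details for each status found.
        ids = [status["id"] for status in result.get("statuses", [])]
        if not ids:
            return []
        return self.client.lookup(ids, **kwargs)


class StubClient:
    """In-memory client so the sketch runs without network access."""

    def __init__(self, statuses):
        self._statuses = {s["id"]: s for s in statuses}

    def search(self, q, **kwargs):
        matches = [{"id": s["id"]} for s in self._statuses.values()
                   if q in s["text"]]
        return {"statuses": matches}

    def lookup(self, ids, **kwargs):
        return [self._statuses[i] for i in ids]


client = StubClient([
    {"id": 1, "text": "robin at feeder"},
    {"id": 2, "text": "storm warning"},
])
manager = SearchManagerSketch(client)
result = manager.search("robin")
details = manager.lookup_search_result(result)
```

Keeping the two steps as separate methods means each can later be placed in its own Storm component without changing the logic.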

Storm Bolts

With APIs in place to do the work, Bolt classes provide Storm components:
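
As an illustration of the shape such a component takes, the sketch below defines a bolt whose process() method receives a tuple and emits a new one. BoltBase is a minimal stand-in written for this example, not streamparse’s Bolt class, and SearchBoltSketch is a hypothetical name; the emitted fields follow the "term", "timestamp", "search_result" fields declared for "search-bolt" in the topology.

```python
from collections import namedtuple

# Stand-in for the tuple object a bolt receives; only `values` is modeled.
Tup = namedtuple("Tup", ["values"])


class BoltBase:
    """Minimal stand-in for a Storm bolt base class (not streamparse)."""

    def __init__(self):
        self.emitted = []

    def emit(self, values):
        # Record the output instead of sending it to Storm.
        self.emitted.append(list(values))


class SearchBoltSketch(BoltBase):
    """Hypothetical bolt: receives [term, timestamp], emits search results."""

    def __init__(self, search):
        super().__init__()
        self.search = search  # any callable taking a term

    def process(self, tup):
        term, timestamp = tup.values
        self.emit([term, timestamp, self.search(term)])


bolt = SearchBoltSketch(lambda term: {"statuses": [], "query": term})
bolt.process(Tup(values=["robin", 1234567890]))
```

The bolt itself holds no search logic; it only unpacks the incoming tuple, delegates to the callable, and emits the result.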

Storm Spouts

Spout classes provide Storm components which take birding’s input and provide the source of streams in the topology:
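
A spout can be sketched in the same spirit. SpoutBase is a stand-in written for this example rather than streamparse’s Spout class, and TermCycleSpoutSketch is a hypothetical name; it assumes the input is a fixed, non-empty list of terms and emits the "term" and "timestamp" fields declared for "term-spout" in the topology.

```python
import itertools
import time


class SpoutBase:
    """Minimal stand-in for a Storm spout base class (not streamparse)."""

    def __init__(self):
        self.emitted = []

    def emit(self, values):
        # Record the output instead of sending it to Storm.
        self.emitted.append(list(values))


class TermCycleSpoutSketch(SpoutBase):
    """Hypothetical spout: emits [term, timestamp], cycling through terms."""

    def __init__(self, terms, clock=time.time):
        super().__init__()
        self._terms = itertools.cycle(terms)  # terms must be non-empty
        self._clock = clock  # injectable so tests can fix the time

    def next_tuple(self):
        # Called repeatedly by the runtime; emit one tuple per call.
        self.emit([next(self._terms), self._clock()])


spout = TermCycleSpoutSketch(["robin", "heron"], clock=lambda: 0)
for _ in range(3):
    spout.next_tuple()
```

Passing the clock as a parameter is a choice made for this sketch so that the emitted timestamps are deterministic under test.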

Storm Topology

With Storm components ready for streamparse, a topology can pull it all together. birding’s topology uses the Clojure DSL; the streamparse discussion of topologies has more detail. In the topology definition below, note the class references "birding.bolt.TwitterSearchBolt", "birding.bolt.TwitterLookupBolt", and "birding.bolt.ResultTopicBolt". These are full Python namespace references to the birding classes. The names given in the DSL can then be used to wire the components together. For example, the definition of "search-bolt" (python-bolt-spec ...) allows "search-bolt" to be used as input in another bolt, "lookup-bolt" (python-bolt-spec ... {"search-bolt" :shuffle} ... ).

(ns birding
  (:use     [streamparse.specs])
  (:gen-class))

(defn birding [options]
  [
    ;; spout configuration
    {"term-spout" (python-spout-spec
          options
          ; Dispatch class based on birding.yml.
          "birding.spout.DispatchSpout"
          ["term" "timestamp"]
          :conf {"topology.max.spout.pending", 8}
          )
    }
    ;; bolt configuration
    {"search-bolt" (python-bolt-spec
          options
          ; Use field grouping on term to support in-memory caching.
          {"term-spout" ["term"]}
          "birding.bolt.TwitterSearchBolt"
          ["term" "timestamp" "search_result"]
          :p 2
          )
     "lookup-bolt" (python-bolt-spec
          options
          {"search-bolt" :shuffle}
          "birding.bolt.TwitterLookupBolt"
          ["term" "timestamp" "lookup_result"]
          :p 2
          )
     "elasticsearch-index-bolt" (python-bolt-spec
          options
          {"lookup-bolt" :shuffle}
          "birding.bolt.ElasticsearchIndexBolt"
          []
          :p 1
          )
     "result-topic-bolt" (python-bolt-spec
          options
          {"lookup-bolt" :shuffle}
          "birding.bolt.ResultTopicBolt"
          []
          :p 1
          )
    }
  ]
)
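
The wiring by name can also be restated as plain data to make the flow of tuples easier to see. The sketch below transcribes the component names and inputs from the topology definition into a Python dict; the helper functions undefined_inputs and consumers are hypothetical and exist only for this illustration, not in birding or streamparse.

```python
# Component name -> names of the components it reads from, transcribed
# from the topology definition above.
INPUTS = {
    "term-spout": [],
    "search-bolt": ["term-spout"],
    "lookup-bolt": ["search-bolt"],
    "elasticsearch-index-bolt": ["lookup-bolt"],
    "result-topic-bolt": ["lookup-bolt"],
}


def undefined_inputs(inputs):
    """Return (component, input) pairs whose input name is not defined."""
    return [(name, src)
            for name, sources in inputs.items()
            for src in sources
            if src not in inputs]


def consumers(inputs, source):
    """Return the components that read from `source`, sorted by name."""
    return sorted(name for name, sources in inputs.items()
                  if source in sources)


print(undefined_inputs(INPUTS))          # []
print(consumers(INPUTS, "lookup-bolt"))
```

Both "elasticsearch-index-bolt" and "result-topic-bolt" read from "lookup-bolt", so each lookup result is delivered to both.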

Next, go to one of: