A tour of birding’s implementation¶
Python Twitter Client¶
There are many Python packages for Twitter. The Python Twitter Tools project (pip install twitter) is of interest because:
- It has a command-line application to get twitter activity, which includes a straightforward PIN-based authentication workflow to log into twitter and get OAuth credentials.
- It provides APIs in Python which bind to twitter’s public APIs in a dynamic and predictable way, where Python attribute and method names translate to URL paths, e.g. twitter.statuses.friends_timeline() retrieves data from http://twitter.com/statuses/friends_timeline.json.
- The OAuth credentials saved by the command-line tool can be readily used when making API calls using the package.
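The attribute-to-URL translation described in the list above can be illustrated with a small toy class. This is a sketch of the general technique only, not Python Twitter Tools' actual code: the name `UrlBinding` is hypothetical, and the call returns the URL instead of issuing an HTTP request.

```python
from urllib.parse import urlencode


class UrlBinding:
    """Toy binding: attribute access extends the URL path; calling builds the URL."""

    def __init__(self, base, parts=()):
        self._base = base
        self._parts = tuple(parts)

    def __getattr__(self, name):
        # Only reached for attributes that are not found normally.
        if name.startswith("_"):
            raise AttributeError(name)
        return UrlBinding(self._base, self._parts + (name,))

    def __call__(self, **params):
        # A real client would perform the request here; this sketch only
        # returns the URL it would have requested.
        url = "{}/{}.json".format(self._base, "/".join(self._parts))
        if params:
            url += "?" + urlencode(sorted(params.items()))
        return url


twitter = UrlBinding("http://twitter.com")
print(twitter.statuses.friends_timeline())
# → http://twitter.com/statuses/friends_timeline.json
```

Because every attribute lookup returns a new binding with a longer path, new API endpoints need no new Python code, which is what makes the mapping predictable.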
Twitter API¶
To ease configuration, birding adds a from_oauth_file() method which creates a Twitter binding using the OAuth credential file created by the twitter command-line application. The twitter command need only be run once to create this file, which is saved in the user home directory at ~/.twitter_oauth. Once that file is in place, twitter API interactions look like this:
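As an offline illustration of that flow, the sketch below shows what a from_oauth_file()-style constructor could do. It assumes the credential file stores the OAuth token and the token secret on two separate lines. The class name `TwitterStub` and its attributes are hypothetical, and nothing here is birding's actual code or contacts Twitter.

```python
import os
import tempfile


class TwitterStub:
    """Hypothetical stand-in for a Twitter binding; it only stores credentials."""

    def __init__(self, token, token_secret):
        self.token = token
        self.token_secret = token_secret

    @classmethod
    def from_oauth_file(cls, filepath=None):
        # Default to the location named in the text above.
        if filepath is None:
            filepath = os.path.join(os.path.expanduser("~"), ".twitter_oauth")
        with open(filepath) as f:
            # Assumed layout: token on line 1, token secret on line 2.
            token = f.readline().strip()
            token_secret = f.readline().strip()
        return cls(token, token_secret)


# Demonstrate with a throwaway file instead of real credentials.
with tempfile.NamedTemporaryFile("w", suffix=".oauth", delete=False) as tmp:
    tmp.write("example-token\nexample-secret\n")
demo_path = tmp.name

twitter = TwitterStub.from_oauth_file(demo_path)
os.remove(demo_path)
print(twitter.token, twitter.token_secret)
```

Keeping the file path optional means the common case needs no arguments, while tests can point the constructor at a temporary file.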
Search Manager¶
It is useful to solve the problem itself before being concerned with details
about the topology. birding’s TwitterSearchManager composes the Twitter object into higher-level method signatures which perform the processing steps needed for the given Problem statement & topology. A full interaction before applying Storm looks like this (see In[2]):
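The composition idea can be sketched without network access. Everything below is an assumption made for illustration: `FakeTwitter`, `SearchManagerSketch`, the method names, and the shape of the results are hypothetical and may differ from birding's TwitterSearchManager.

```python
class FakeTwitter:
    """Hypothetical offline stand-in for the Twitter binding."""

    def search(self, q):
        # Search results are assumed to carry only abbreviated statuses.
        return {"statuses": [{"id_str": "1"}, {"id_str": "2"}], "query": q}

    def lookup(self, ids):
        # Lookup is assumed to return one full tweet per id.
        return [{"id_str": i, "text": "tweet " + i} for i in ids]


class SearchManagerSketch:
    """Composes a Twitter-like object into the two processing steps."""

    def __init__(self, twitter):
        self.twitter = twitter

    def search(self, term):
        return self.twitter.search(q=term)

    def lookup_search_result(self, search_result):
        # Expand the abbreviated search hits into full tweets by id.
        ids = [status["id_str"] for status in search_result["statuses"]]
        return self.twitter.lookup(ids)


manager = SearchManagerSketch(FakeTwitter())
search_result = manager.search("birding")
lookup_result = manager.lookup_search_result(search_result)
print([tweet["text"] for tweet in lookup_result])
# → ['tweet 1', 'tweet 2']
```

Because the manager only depends on the object passed to its constructor, the same steps can be exercised in a notebook or a test before any Storm component is involved.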
Storm Bolts¶
With APIs in place to do the work, Bolt classes provide Storm components:
- TwitterSearchBolt searches the input terms.
- TwitterLookupBolt expands search results into full tweets.
- ElasticsearchIndexBolt indexes the lookup results in elasticsearch.
- ResultTopicBolt publishes the lookup results to Kafka.
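The way tuples flow from the search step to the lookup step can be sketched in plain Python. The base class below is a hypothetical stand-in so the example runs without Storm or streamparse; only the process/emit shape and the field order (term, timestamp, result), which follows the topology definition in this document, are meant to carry over.

```python
from collections import namedtuple

# Stand-in for the tuple object a bolt receives; only `values` is modeled.
Tup = namedtuple("Tup", ["values"])


class SketchBolt:
    """Hypothetical minimal bolt base: collects emitted tuples in a list."""

    def __init__(self):
        self.emitted = []

    def emit(self, values):
        self.emitted.append(Tup(list(values)))


class SearchBoltSketch(SketchBolt):
    def __init__(self, search):
        super().__init__()
        self.search = search  # callable: term -> search result

    def process(self, tup):
        term, timestamp = tup.values
        self.emit([term, timestamp, self.search(term)])


class LookupBoltSketch(SketchBolt):
    def __init__(self, lookup):
        super().__init__()
        self.lookup = lookup  # callable: search result -> full tweets

    def process(self, tup):
        term, timestamp, search_result = tup.values
        self.emit([term, timestamp, self.lookup(search_result)])


search_bolt = SearchBoltSketch(lambda term: ["id-for-" + term])
lookup_bolt = LookupBoltSketch(lambda ids: [{"id": i} for i in ids])

# Feed one spout-style tuple through both bolts by hand.
search_bolt.process(Tup(["birding", "2015-01-01T00:00:00"]))
for out in search_bolt.emitted:
    lookup_bolt.process(out)

print(lookup_bolt.emitted[0].values)
```

Passing the term and timestamp through unchanged lets downstream components relate each result back to the input that produced it.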
Storm Spouts¶
Spout classes provide Storm components which take birding’s input and provide the source of streams in the topology:
- DispatchSpout() dispatches the spout class based on config. See Configuring birding.
- TermCycleSpout cycles through a static list of terms.
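Cycling through a static list of terms can be sketched with itertools.cycle. The class name `TermCycleSketch` is hypothetical, and returning the tuple from next_tuple() (instead of emitting it to Storm) is a simplification so the example runs on its own.

```python
import itertools
from datetime import datetime, timezone


class TermCycleSketch:
    """Cycles through a static list of terms, producing [term, timestamp] pairs."""

    def __init__(self, terms):
        if not terms:
            raise ValueError("at least one term is required")
        self._terms = itertools.cycle(terms)

    def next_tuple(self):
        term = next(self._terms)
        # The timestamp records when the term was handed out.
        timestamp = datetime.now(timezone.utc).isoformat()
        return [term, timestamp]


spout = TermCycleSketch(["owl", "heron"])
terms = [spout.next_tuple()[0] for _ in range(5)]
print(terms)
# → ['owl', 'heron', 'owl', 'heron', 'owl']
```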
Storm Topology¶
With Storm components ready for streamparse, a topology can pull it all
together. birding’s topology uses the Clojure DSL; the streamparse
discussion of topologies has more detail. In the topology definition below,
note the class references "birding.bolt.TwitterSearchBolt", "birding.bolt.TwitterLookupBolt", and "birding.bolt.ResultTopicBolt". These are full Python namespace references to the birding classes. The names given in the DSL can then be used to wire the components together. For example, the definition of "search-bolt" (python-bolt-spec ...) allows "search-bolt" to be used as input in another bolt, "lookup-bolt" (python-bolt-spec ... {"search-bolt" :shuffle} ... ).
(ns birding
  (:use [streamparse.specs])
  (:gen-class))

(defn birding [options]
  [
    ;; spout configuration
    {"term-spout" (python-spout-spec
        options
        ; Dispatch class based on birding.yml.
        "birding.spout.DispatchSpout"
        ["term" "timestamp"]
        :conf {"topology.max.spout.pending", 8}
        )
    }
    ;; bolt configuration
    {"search-bolt" (python-bolt-spec
        options
        ; Use field grouping on term to support in-memory caching.
        {"term-spout" ["term"]}
        "birding.bolt.TwitterSearchBolt"
        ["term" "timestamp" "search_result"]
        :p 2
        )
     "lookup-bolt" (python-bolt-spec
        options
        {"search-bolt" :shuffle}
        "birding.bolt.TwitterLookupBolt"
        ["term" "timestamp" "lookup_result"]
        :p 2
        )
     "elasticsearch-index-bolt" (python-bolt-spec
        options
        {"lookup-bolt" :shuffle}
        "birding.bolt.ElasticsearchIndexBolt"
        []
        :p 1
        )
     "result-topic-bolt" (python-bolt-spec
        options
        {"lookup-bolt" :shuffle}
        "birding.bolt.ResultTopicBolt"
        []
        :p 1
        )
    }
  ]
)
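The comment on "search-bolt" says field grouping on term supports in-memory caching. The Python sketch below illustrates why: when the task is chosen from the value of the grouped field, every tuple for a given term reaches the same task, so that task's cache sees all of them. The function name and the use of crc32 are assumptions for illustration; Storm's own hashing differs.

```python
import zlib


def fields_grouping_task(term, num_tasks):
    """Pick a task index from the grouped field so equal terms share a task."""
    # crc32 is used only because it is stable across runs.
    return zlib.crc32(term.encode("utf-8")) % num_tasks


NUM_TASKS = 2  # mirrors ":p 2" on "search-bolt"
caches = [dict() for _ in range(NUM_TASKS)]  # one in-memory cache per task

for term in ["owl", "heron", "owl", "owl", "heron"]:
    task = fields_grouping_task(term, NUM_TASKS)
    caches[task][term] = caches[task].get(term, 0) + 1

# Each term's count lives in exactly one task's cache.
owners = {
    term: [i for i, cache in enumerate(caches) if term in cache]
    for term in ["owl", "heron"]
}
print(owners)
```

With a shuffle grouping, by contrast, tuples for the same term could land on either task, and each cache would hold only part of the picture.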
Next, go to one of: