Problem statement & topology

Problem Statement

Take as input a sequence of terms and timestamps and produce a “filtered firehose” of Twitter activity using only Twitter’s public APIs, without requiring special API access from Twitter or any third party.


  • Input is in the format (term, timestamp), where term is any string and timestamp is a date/time value in ISO 8601 format, e.g. 2015-06-25T08:00Z.
  • The motivating use-case:
    • provides this input as a Kafka topic
    • prefers that output be sent to a Kafka topic and include full Twitter API results
    • prefers that the solution be implemented in Python

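As a concrete illustration, here is a minimal sketch of parsing one input record. It assumes records arrive JSON-encoded; the wire format and object keys are not specified above, so they are assumptions.

```python
import json
from datetime import datetime, timezone

def parse_record(raw):
    """Parse one JSON-encoded (term, timestamp) input record.

    Assumes records look like
    {"term": "python", "timestamp": "2015-06-25T08:00Z"};
    the exact wire format is an assumption, not part of the
    problem statement.
    """
    obj = json.loads(raw)
    term = obj["term"]
    ts = obj["timestamp"]
    # ISO 8601 allows a bare "Z" suffix for UTC; normalize it so
    # datetime.strptime's %z directive accepts it on all Python 3 versions.
    if ts.endswith("Z"):
        ts = ts[:-1] + "+0000"
    when = datetime.strptime(ts, "%Y-%m-%dT%H:%M%z")
    return term, when
```

The parser handles the example timestamp shown above; inputs with seconds or other ISO 8601 variants would need a more permissive parser.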

Twitter provides GET search/tweets to retrieve Tweets (status updates) matching a specified query. Any detail not included in the search results can be fetched with GET statuses/lookup, which looks up multiple status updates in a single batched request.
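Since search results can yield many tweet IDs and GET statuses/lookup accepts up to 100 IDs per request, a batching helper is useful. A minimal sketch, with the authenticated HTTP call itself left out; `lookup_batches` is a hypothetical name:

```python
def lookup_batches(tweet_ids, batch_size=100):
    """Group tweet IDs into GET statuses/lookup query parameters.

    statuses/lookup accepts up to 100 comma-separated IDs per call,
    so a long ID list must be split into batches. Yields the query
    parameters for each batched request; issuing the authenticated
    HTTP requests is beyond this sketch.
    """
    for i in range(0, len(tweet_ids), batch_size):
        batch = tweet_ids[i:i + batch_size]
        yield {"id": ",".join(str(t) for t in batch)}
```

For example, 250 IDs produce three parameter dicts: two of 100 IDs and one of 50.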

The problem involves potentially unbounded streams of data, which makes Storm a natural fit. Given that the motivating use-case prefers Python with Kafka I/O, streamparse and pykafka are relevant.


Given the problem statement, a streaming solution reads (term, timestamp) tuples from Kafka, queries the Twitter search API for each term, hydrates the resulting tweet IDs via batched lookups, and writes the full tweets back to Kafka.
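That dataflow can be sketched as plain Python generators. In the actual topology each stage would be a streamparse spout or bolt with pykafka on the Kafka ends; all names below are hypothetical, and the API functions are stand-ins for the real authenticated calls.

```python
def term_spout(records):
    """Kafka spout: emit (term, timestamp) tuples from the input topic."""
    for term, timestamp in records:
        yield term, timestamp

def search_bolt(term_stream, search_api):
    """For each term, query GET search/tweets and emit matching tweet IDs."""
    for term, timestamp in term_stream:
        for tweet in search_api(term, timestamp):
            yield tweet["id"]

def lookup_bolt(id_stream, lookup_api, batch_size=100):
    """Hydrate tweet IDs via GET statuses/lookup, up to 100 IDs per call."""
    batch = []
    for tweet_id in id_stream:
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield from lookup_api(batch)
            batch = []
    if batch:
        yield from lookup_api(batch)

def run_topology(records, search_api, lookup_api):
    """Wire the stages: spout -> search bolt -> lookup bolt -> output."""
    ids = search_bolt(term_spout(records), search_api)
    return list(lookup_bolt(ids, lookup_api))
```

The generator wiring mirrors how tuples would flow between Storm components, which makes the stages easy to unit-test with fake API functions before deploying the real topology.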

Other Goals

The solution should:

  • Encode best practices about how to use Storm/streamparse and Kafka/pykafka.
  • Be fully public & open source to serve as an example project, so it should not depend on anything particular to any one company or organization. Depending on the publicly scrutable Twitter API is, of course, okay.
  • Include basic command-line tools for testing the topology with sample data, and a way to configure things like Twitter authentication credentials.
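For the credential-configuration goal, one minimal sketch reads credentials from environment variables. The variable names here are hypothetical; a config file or streamparse's own configuration mechanism would work equally well.

```python
import os

# Hypothetical environment-variable names; any configuration mechanism
# (env vars, a config file, streamparse's config.json) would do.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_twitter_credentials(env=os.environ):
    """Read Twitter OAuth credentials from a mapping, failing loudly."""
    missing = [k for k in REQUIRED_KEYS if k not in env]
    if missing:
        raise RuntimeError("missing credentials: %s" % ", ".join(missing))
    return {k: env[k] for k in REQUIRED_KEYS}
```

Failing loudly at startup keeps a misconfigured topology from silently producing empty output.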

Next, go to one of: