Problem statement & topology

Problem Statement

Take as input a sequence of terms and timestamps and produce a “filtered firehose” of Twitter activity using only Twitter’s public APIs, without requiring special API access from Twitter or any third party.


  • Input is in the format (term, timestamp), where term is any string and timestamp is a date/time value in ISO 8601 format, e.g. 2015-06-25T08:00Z.
  • The motivating use-case:
    • provides this input as a Kafka topic
    • prefers that output be sent to a Kafka topic and include full Twitter API results
    • prefers that the solution be implemented in Python

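As a concrete illustration, here is a minimal sketch of parsing one input record. It assumes records arrive JSON-encoded; the wire format and object keys are not specified above, so they are assumptions.

```python
import json
from datetime import datetime, timezone

def parse_record(raw):
    """Parse one JSON-encoded (term, timestamp) input record.

    Assumes records look like
    {"term": "python", "timestamp": "2015-06-25T08:00Z"};
    the exact wire format is an assumption, not part of the
    problem statement.
    """
    obj = json.loads(raw)
    term = obj["term"]
    ts = obj["timestamp"]
    # ISO 8601 allows a bare "Z" suffix for UTC; normalize it so
    # datetime.strptime's %z directive accepts it on all Python 3 versions.
    if ts.endswith("Z"):
        ts = ts[:-1] + "+0000"
    when = datetime.strptime(ts, "%Y-%m-%dT%H:%M%z")
    return term, when
```

The parser handles the example timestamp shown above; inputs with seconds or other ISO 8601 variants would need a more permissive parser.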

Twitter provides GET search/tweets to retrieve Tweets (status updates) matching a specified query. Any detail not included in the search results can be fetched with GET statuses/lookup, which looks up multiple status updates in a single batched request.
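Since search results can yield many tweet IDs and GET statuses/lookup accepts up to 100 IDs per request, a batching helper is useful. A minimal sketch, with the authenticated HTTP call itself left out; `lookup_batches` is a hypothetical name:

```python
def lookup_batches(tweet_ids, batch_size=100):
    """Group tweet IDs into GET statuses/lookup query parameters.

    statuses/lookup accepts up to 100 comma-separated IDs per call,
    so a long ID list must be split into batches. Yields the query
    parameters for each batched request; issuing the authenticated
    HTTP requests is beyond this sketch.
    """
    for i in range(0, len(tweet_ids), batch_size):
        batch = tweet_ids[i:i + batch_size]
        yield {"id": ",".join(str(t) for t in batch)}
```

For example, 250 IDs produce three parameter dicts: two of 100 IDs and one of 50.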

The problem involves potentially unbounded streams of data, which makes Storm a natural fit. Given that the motivating use-case prefers Python with Kafka I/O, streamparse and pykafka are relevant.


Given the problem statement, a streaming solution reads (term, timestamp) tuples from Kafka, queries the Twitter search API for each term, hydrates the resulting tweet IDs via batched lookups, and writes the full tweets back to Kafka.
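That dataflow can be sketched as plain Python generators. In the actual topology each stage would be a streamparse spout or bolt with pykafka on the Kafka ends; all names below are hypothetical, and the API functions are stand-ins for the real authenticated calls.

```python
def term_spout(records):
    """Kafka spout: emit (term, timestamp) tuples from the input topic."""
    for term, timestamp in records:
        yield term, timestamp

def search_bolt(term_stream, search_api):
    """For each term, query GET search/tweets and emit matching tweet IDs."""
    for term, timestamp in term_stream:
        for tweet in search_api(term, timestamp):
            yield tweet["id"]

def lookup_bolt(id_stream, lookup_api, batch_size=100):
    """Hydrate tweet IDs via GET statuses/lookup, up to 100 IDs per call."""
    batch = []
    for tweet_id in id_stream:
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield from lookup_api(batch)
            batch = []
    if batch:
        yield from lookup_api(batch)

def run_topology(records, search_api, lookup_api):
    """Wire the stages: spout -> search bolt -> lookup bolt -> output."""
    ids = search_bolt(term_spout(records), search_api)
    return list(lookup_bolt(ids, lookup_api))
```

The generator wiring mirrors how tuples would flow between Storm components, which makes the stages easy to unit-test with fake API functions before deploying the real topology.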

Other Goals

The solution should:

  • Encode best practices about how to use Storm/streamparse and Kafka/pykafka.
  • Be fully public & open source to serve as an example project, so it should not depend on anything particular to any one company or organization. Depending on the publicly scrutable Twitter API is, of course, okay.
  • Include basic command-line tools for testing the topology with sample data, and a way to configure things like Twitter authentication credentials.
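For the credential-configuration goal, one minimal sketch reads credentials from environment variables. The variable names here are hypothetical; a config file or streamparse's own configuration mechanism would work equally well.

```python
import os

# Hypothetical environment-variable names; any configuration mechanism
# (env vars, a config file, streamparse's config.json) would do.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_twitter_credentials(env=os.environ):
    """Read Twitter OAuth credentials from a mapping, failing loudly."""
    missing = [k for k in REQUIRED_KEYS if k not in env]
    if missing:
        raise RuntimeError("missing credentials: %s" % ", ".join(missing))
    return {k: env[k] for k in REQUIRED_KEYS}
```

Failing loudly at startup keeps a misconfigured topology from silently producing empty output.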

Next, go to one of: