Problem statement & topology¶
Problem Statement¶
Take as input a sequence of terms and timestamps and produce a “filtered firehose” of twitter activity using only twitter’s public APIs, without requiring special API access to twitter or any third party.
Specifics¶
- Input is in the format of (term, timestamp), where term is any string and
timestamp is a date/time value in an ISO 8601 format,
e.g.
2015-06-25T08:00Z
. - The motivating use-case:
Observations¶
Twitter provides GET search/tweets to get relevant Tweets (status updates) matching a specified query. Any detail not provided in the search results can be accessed with GET statuses/lookup, looking up multiple status updates in a batched request.
The problem has potentially unbounded streams of data, which makes Storm a relevant technology for the solution. Given that the motivating use-case prefers Python with Kafka I/O, streamparse and pykafka are relavant.
Other Goals¶
The solution should:
- Encode best practices about how to use Storm/streamparse and Kafka/pykafka.
- Be fully public & open source to serve as an example project, so it should not depend on anything specific to a specific company/organization. Depending on the publicly scrutable Twitter API is, of course, okay.
- Include basic command-line tools for testing the topology with data and ways to configure things like Twitter authentication credentials.
Next, goto one of: