“Am I going viral?” An AWS Data Engineering project.

With 2020 being a straight-up rubbish year, the need for respite has never been greater. Enter: Nathan Apodaca on TikTok.

One man, skateboarding with some Ocean Spray juice to the sound of Fleetwood Mac’s Dreams.

His video quickly went viral. And what was the effect?

A lot of people became interested in Ocean Spray juice.

Interest in Ocean Spray over time. Source: Google Trends

It’s a sign of the times that someone with a skateboard, singing some Fleetwood Mac, can generate more interest than the marketing department of a multibillion-dollar company.

I thought about this for some time. If interest in a company can come from anywhere, how do you detect it? My question really boiled down to this:

How do I know if my company is going viral?

This was the starting point for my project.

Here’s what I did.

  • Use Tweepy and the Twitter API to stream tweets about a topic (e.g. the name of a company)
  • Publish the tweets to an AWS Kinesis stream
  • Analyse the stream using AWS Kinesis Data Analytics
  • Send the results to AWS DynamoDB, and publish to AWS SNS if the topic is going viral

My first step to scraping Twitter was signing up as a developer to use the Twitter API.

To proceed I needed to create a new project and get four things:

  • Consumer Key
  • Consumer Secret
  • Access Token
  • Access Token Secret

After getting these keys and tokens, it was time to use Tweepy. I chose to use Tweepy because it seemed intuitive and allowed me to get streaming data fast.

Following Tweepy’s documentation, I created a StreamListener and started listening for tweets about a company.
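
A minimal sketch of that listener, assuming Tweepy 3.x and using the credentials from the previous step as placeholder variables:

```python
import tweepy

# Placeholder credentials from the Twitter developer portal
consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # For now, just print each incoming tweet to the console
        print(status.text)

    def on_error(self, status_code):
        # Disconnect if Twitter rate-limits us
        if status_code == 420:
            return False

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

stream = tweepy.Stream(auth=auth, listener=TweetListener())
stream.filter(track=["Ocean Spray"])  # the topic/company to listen for
```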

The next step is to publish this to AWS Kinesis. At the moment we’re only printing the tweets to the console. To send them to AWS, we’ll make a POST request to API Gateway and use a Lambda function to write the data into a Kinesis stream.

We first set up a new API Gateway with a POST method.


We also need a Kinesis Stream for the tweets.

To write the tweet from API Gateway to the Kinesis stream, we’ll use an AWS Lambda function. To make things easier, we’ll use boto3 and write it in Python. We also need to give the Lambda function the appropriate permissions to write to Kinesis.
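
A minimal sketch of that function, assuming the stream is named tweets_stream (for a single-shard demo stream, any constant partition key will do):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def lambda_handler(event, context):
    # API Gateway hands us the POST body via the mapping template
    kinesis.put_record(
        StreamName="tweets_stream",    # assumed stream name
        Data=json.dumps(event),
        PartitionKey="partition_key",  # any constant is fine for one shard
    )
    return {"statusCode": 200, "body": json.dumps("Tweet received")}
```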

In terms of config, we’ll also add an application/json mapping template to our API Gateway method. After this, we’re ready to deploy the API Gateway to a stage (I chose dev). Once deployed, AWS gives us a URL to make POST requests to.

On the client side, all that’s needed is to update the Tweepy listener to make a POST request to our new URL.
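
For example, on_status might become something like this (the URL is a placeholder for the one API Gateway gives you):

```python
import requests

API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/dev"  # placeholder

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Forward each tweet to API Gateway instead of printing it
        requests.post(API_URL, json={
            "tweet": status.text,
            "created_at": str(status.created_at),
        })
```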

Now that we have data being sent to AWS, our next goal is to perform some aggregation and count the number of tweets that are arriving.

To begin, I created an AWS Kinesis Data Analytics application. For this task we are looking to aggregate and count the number of tweets coming in during a period of time. There are three main components to this: a source, the analytics, and a destination. The goal of this step is to perform the aggregation and then send the aggregated result to DynamoDB (via Lambda).

Thinking about streaming data, we quickly run into the concept of windows. With streaming data we need to draw a line somewhere to determine where each aggregation begins and ends. There are a couple of different windows we can use:

  • tumbling — distinct time-boxed windows (say every 30 minutes)
  • sliding — windows that slide across the data according to a specified interval.

For this project, I chose to use a tumbling window with 30 minute windows.

To aggregate and count the number of tweets coming into the stream, I used the real-time analytics SQL editor. I wrote some SQL which outputs the results to a destination.
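
A sketch of that SQL, assuming the default in-application input stream name SOURCE_SQL_STREAM_001 and a 30-minute tumbling window via the STEP function:

```sql
-- Output stream that the destination (our Lambda) will read from
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (tweet_count INTEGER);

-- Pump: continuously count tweets per 30-minute tumbling window
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM COUNT(*) AS tweet_count
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '30' MINUTE);
```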

Our Kinesis Data Analytics application is now outputting the number of tweets about a topic every 30 minutes. This data can now be connected to a destination, like a database. It’s worth persisting this data so it can be explored later.

There are a number of database choices here. This kind of data would be perfect for long-term trend analysis, which makes Redshift the natural choice. For this use-case, though, I’ll be using DynamoDB as it’s straightforward to set up and experiment with.

Our next step is to set up DynamoDB. I created the table “tweets_kinesis_dynamodb” with partition key “row_id” and sort key “row_timestamp”.
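
The console works fine for this, but here’s the equivalent boto3 call for completeness (the attribute types are an assumption; I’m treating both keys as strings):

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="tweets_kinesis_dynamodb",
    KeySchema=[
        {"AttributeName": "row_id", "KeyType": "HASH"},          # partition key
        {"AttributeName": "row_timestamp", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "row_id", "AttributeType": "S"},
        {"AttributeName": "row_timestamp", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",  # no capacity planning needed for a demo
)
```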

To inform a user that their Twitter topic is trending, we use AWS SNS to deliver an email notification. SNS uses a publish/subscribe model where messages are published to different topics and each message is passed on to that topic’s subscribers.

To do this, we need to set up a topic and a new subscription. In the subscription, we specify the email address that should receive the notification.
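
Again, the console is the quickest route, but a boto3 sketch of the same setup looks like this (the topic name and email address are placeholders):

```python
import boto3

sns = boto3.client("sns")

# Create the topic we'll publish "going viral" alerts to
topic = sns.create_topic(Name="going_viral")

# Subscribe an email address; SNS sends a confirmation email first
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="you@example.com",  # placeholder address
)
```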

In Kinesis Data Analytics, we can set up a destination to deliver our aggregated tweet count.

For this, we use a Lambda function that Kinesis Data Analytics invokes. Note: this Lambda needs IAM permissions for accessing DynamoDB and SNS.

Whenever the Lambda is invoked, it puts the number of tweets into the DynamoDB table we set up before.
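
Here’s a sketch of that destination Lambda. Kinesis Data Analytics delivers batches of base64-encoded records and expects each recordId back with a result; the TWEET_COUNT field name and the topic ARN are assumptions matching the SQL and SNS setup above:

```python
import base64
import datetime
import json
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("tweets_kinesis_dynamodb")
sns = boto3.client("sns")

VIRAL_THRESHOLD = 20000  # tweets per 30-minute window
TOPIC_ARN = "arn:aws:sns:<region>:<account-id>:going_viral"  # placeholder ARN

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        tweet_count = int(payload["TWEET_COUNT"])

        # Persist the window's count for later trend analysis
        table.put_item(Item={
            "row_id": str(uuid.uuid4()),
            "row_timestamp": datetime.datetime.utcnow().isoformat(),
            "tweet_count": tweet_count,
        })

        # Alert the subscriber if this window crossed the threshold
        if tweet_count > VIRAL_THRESHOLD:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Message=f"Going viral: {tweet_count} tweets in the last 30 minutes!",
            )

        output.append({"recordId": record["recordId"], "result": "Ok"})

    return {"records": output}
```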

To determine if our Twitter topic is currently going viral, we check whether the number of tweets is above a certain threshold. In the Lambda above, this number is 20,000 (it will of course differ depending on the company).

To test the system is working properly, we can specify a topic in our Python client (step 1) and temporarily lower the threshold in the Lambda we just wrote.

Ta da!

There are a couple of ways this system could be improved. Setting an arbitrary threshold to assess whether you’re going viral isn’t ideal, as the baseline number of tweets will naturally change as a company or topic grows.

There are other ways this could be done. Techniques like Random Cut Forest in Kinesis Data Analytics or CloudWatch Anomaly Detection could also be used to determine whether a topic is going viral on Twitter.

Hope this helped.

Ben

I’m a Software Engineering grad interested in Data Engineering opportunities.