Athena - Part 1 - System Design
Setting up a Serverless Analytics API using AWS Athena and Terraform



Introduction
In this guide I'll show you how to take data from CloudWatch Logs, transform it, and store it as files in an S3 bucket in a format compatible with AWS Athena. I'll then show you how to set up an API to return the data to an end user.
This results in a super cost-effective way to store large amounts of data, query it and return useful analytics.
Background
AWS Athena is an interactive query service that lets you run SQL against a range of data sources, including files stored in S3 (the AWS website has more detail). Recording analytics data means storing a large volume of data that keeps growing over time, so it can be expensive to keep in a database service. AWS S3 is a much cheaper alternative and, thanks to Athena, the data is still easy to query, which makes the combination a good fit for analytics.
Tech Stack
I want to be able to deploy this repeatedly, so I won't use the AWS console. Instead I'll use Terraform to define the infrastructure as code (IaC).
These are the technologies I'll be using:
Architecture
The application architecture is split into 3 separate tiers.
- The actual app - the only change to this was to log the required details to CloudWatch
- A process to run on a regular basis to parse the logs and save the data into S3 files
- An API to surface the analytics data.
The initial architecture looks a little something like this:
- Frontend web app > CloudFront > Lambda > log data using CloudWatch
- CloudWatch Event triggers every 10 mins > read log data > generate and store log files in S3
- Frontend analytics viewer > Lambda > Athena > read data from S3 > return to user
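To make the middle step a little more concrete, here is a minimal sketch of how raw log messages could be turned into a file Athena can read. It is not the implementation from this series: the `ANALYTICS ` marker and the function names are assumptions made up for illustration. What it relies on is that Athena's JSON SerDes expect one JSON object per line.

```python
import json

# Hypothetical marker the app is assumed to put in front of analytics log lines.
MARKER = "ANALYTICS "


def extract_events(messages):
    """Return the parsed JSON payload of every message that carries the marker."""
    events = []
    for message in messages:
        _, found, payload = message.partition(MARKER)
        if not found:
            continue  # ordinary log line, not an analytics record
        try:
            events.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than fail the whole batch
    return events


def to_ndjson(events):
    """Serialise events as one JSON object per line."""
    return "\n".join(json.dumps(event, sort_keys=True) for event in events)
```

In a scheduled job, the string returned by `to_ndjson` would be the body of the object written to S3.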
Initial Considerations
I split the problem into the following categories:
- Analytics Design. What do we need to record in order to be able to provide the data?
- How do we save the data in a cost-effective way?
- How do we serve the data to the user in a timely fashion?
Analytics Design
The application is a B2B2C SaaS product: a business (the client) buys the application and provides it as a tool to one of its own business customers (a reseller), which in turn uses it to serve its public customers (the users). So the data hierarchy is Global > Client > Reseller > User. The data needs to be queryable at each of these levels, which means recording the data at each level.
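One way such a hierarchy could map onto S3 is through Hive-style `key=value` prefixes, which Athena can treat as partitions. The sketch below is only an assumption about a possible layout: the `analytics/` prefix, the partition names `client`, `reseller` and `dt`, and the `build_key` helper are all hypothetical, and the user level is assumed to live inside each record rather than in the path.

```python
from datetime import datetime, timezone


def build_key(client, reseller, when, file_name):
    """Build an S3 object key with one Hive-style partition per hierarchy level.

    Assumed layout: analytics/client=<id>/reseller=<id>/dt=<YYYY-MM-DD>/<file>
    """
    day = when.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"analytics/client={client}/reseller={reseller}/dt={day}/{file_name}"
```

With a layout like this, a query filtered on one client and one day only needs to scan the files under the matching prefix.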
The requirements from the client were to record (amongst other things):
- Number of unique sessions by time period,
- Number of unique requests per session,
- The product details per request per session.
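To pin down what these three metrics mean, here is a small in-memory sketch. It assumes each recorded event is a dict with `timestamp` (ISO 8601 string), `session_id`, `request_id` and `product` fields; those field names are hypothetical, and in the real system the same aggregations would be expressed as Athena queries rather than Python.

```python
from collections import defaultdict


def unique_sessions_by_day(events):
    """Count distinct session ids per calendar day (first 10 chars of the timestamp)."""
    sessions = defaultdict(set)
    for event in events:
        sessions[event["timestamp"][:10]].add(event["session_id"])
    return {day: len(ids) for day, ids in sessions.items()}


def unique_requests_per_session(events):
    """Count distinct request ids within each session."""
    requests = defaultdict(set)
    for event in events:
        requests[event["session_id"]].add(event["request_id"])
    return {session: len(ids) for session, ids in requests.items()}


def products_per_request(events):
    """Group product details by (session id, request id)."""
    products = defaultdict(list)
    for event in events:
        products[(event["session_id"], event["request_id"])].append(event["product"])
    return dict(products)
```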
Continue to part 2 to see how I achieved the logging of the data >>