Scraping Freecycle with AWS - Part 1, boilerplate.
Creating a serverless application to scrape Freecycle and send me a notification when something is posted.



Introduction
I recently started looking for things on Freecycle and I wanted to set up some alerts. The alerts don't seem to work, or I've done it wrong, or something. Obviously, as a developer, I haven't got the patience to read the docs and make sure I've done everything correctly; no, my first response is to think, "ooh, I could do this myself!". So I thought I'd write a scraper that logs in to the site as me every 10 minutes or so and checks for new posts. If there are any new posts, I'll send myself a notification using AWS SNS straight to my phone. Nice. Oh, and I'll use Terraform, cos, well, I like it. I could use SAM and learn that instead, but maybe that's for another day.
Architecture
- Lambda to scrape the site
- Trigger the Lambda every 10 minutes
- Send a notification to SNS
- SNS sends the notification to my phone
Solution
Lambda
Authentication
First thing is to authenticate with the site - or at least check that I am authenticated.
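I haven't worked out the details of the login flow yet, but the shape of the check will be something like the sketch below. The URL, the cookie handling and the "am I logged in?" test are all placeholders until I've poked at the real site:

// auth.ts - a rough sketch of the authentication check
const FREECYCLE_HOME = 'https://www.freecycle.org/' // placeholder URL

const isAuthenticated = async (sessionCookie: string): Promise<boolean> => {
  // Send the stored session cookie and see whether the page thinks we're logged in
  const response = await fetch(FREECYCLE_HOME, {
    headers: { cookie: sessionCookie },
  })
  const body = await response.text()
  // Placeholder check - in practice this would look for something that only
  // appears for a signed-in user, like a "log out" link or my username
  return response.ok && body.includes('Log out')
}

export { isAuthenticated }

(fetch is built into the nodejs18.x runtime; depending on the @types/node version you might need to pull in types for it, or swap in a small HTTP client.)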
Trigger
Let's use CloudWatch Events to trigger the lambda every 10 minutes. I want to vary the trigger by a small amount like you can do in cron, so that the site doesn't realise it's a bot. I'm not sure if this is necessary, but it's a good idea anyway.
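EventBridge rate expressions fire at fixed intervals and don't have a jitter option, so if I do want the variation, the cheapest trick I can think of is a small random delay at the start of the handler itself. A sketch (the maximum delay is an arbitrary number I've picked, and it needs to stay comfortably under the function's timeout):

// jitter.ts - random start-up delay so the scrapes don't all land at exactly the same second
const MAX_JITTER_MS = 10_000 // arbitrary; keep well under the lambda timeout

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

const withJitter = async (): Promise<void> => {
  const delay = Math.floor(Math.random() * MAX_JITTER_MS)
  console.log(`waiting ${delay}ms before scraping`)
  await sleep(delay)
}

export { withJitter }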
Send SNS notification
SNS to my phone
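The notification itself should just be a publish to an SNS topic from inside the lambda. A sketch using the v3 SDK (@aws-sdk/client-sns would need installing, and SNS_TOPIC_ARN is a placeholder environment variable that terraform will set once the topic exists):

// notify.ts - a sketch of publishing the alert to SNS
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns'

const sns = new SNSClient({})

const notify = async (message: string): Promise<void> => {
  const topicArn = process.env['SNS_TOPIC_ARN']
  if (!topicArn) throw new Error('SNS_TOPIC_ARN is not set')

  await sns.send(
    new PublishCommand({
      TopicArn: topicArn,
      Subject: 'New Freecycle post',
      Message: message,
    })
  )
}

export { notify }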
Set up
First I'll set up a repo on GitHub.
mkdir freecycle-scraper
cd freecycle-scraper
git init
npm init -y
touch .gitignore   # put anything we don't want in the repo in here
git add .gitignore
git add .
git commit -m "init"
I've previously been creating the repo on the GitHub site itself, but it gets a bit tedious, so I thought I'd see if there is a way to do it from the command line. Turns out there is, and it's called gh.
I installed it using:
brew install gh
and then I can create the repo using
gh repo create freecycle-scraper --public --confirm
This creates the repo on GitHub and then adds the remote to my local repo. Nice.
Meta note: I'm writing this blog with the help of Copilot and it makes suggestions, one of which was
gh repo create freecycle-scraper --public --confirm
which I'm about to try to see if it works.
Authenticating with gh
Need to auth first:
gh auth login
I created and used a PAT from here https://github.com/settings/tokens?type=beta.
I was then able to run the command:
gh repo create freecycle-scraper --public --confirm
Then I added that as the origin remote:
git remote add origin https://github.com/sjclemmy/freecycle-scraper
Build
I like using esbuild and TypeScript, so I'll set that up now.
npm i -D esbuild   # short form of npm install --save-dev esbuild
Need types for my lambda
npm i -D @types/aws-lambda
Let's create a packages/scraper/src folder and put the index.ts file in there. We'll put some dummy content in to start with:
import type { APIGatewayEvent, Context } from 'aws-lambda'

// Dummy handler for now - the parameters are underscore-prefixed so the
// strict tsconfig's noUnusedParameters check doesn't complain about them yet
const handler = async (_event: APIGatewayEvent, _context: Context) => {
  console.log('hello world')
  return {
    statusCode: 200,
    body: 'hello world',
  }
}

export { handler }
We need a tsconfig file:
// tsconfig.json
{
  "$schema": "https://json.schemastore.org/tsconfig",
  "display": "Node 18 + ESM + Strictest",
  "compilerOptions": {
    "lib": ["es2022", "esnext.asynciterable"],
    "module": "es2022",
    "target": "esnext",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "moduleResolution": "node",
    "allowUnusedLabels": false,
    "allowUnreachableCode": false,
    "exactOptionalPropertyTypes": true,
    "noFallthroughCasesInSwitch": true,
    "noImplicitOverride": true,
    "noImplicitReturns": true,
    "noPropertyAccessFromIndexSignature": true,
    "noUncheckedIndexedAccess": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "importsNotUsedAsValues": "error",
    "resolveJsonModule": true
  }
}
Now we need to build it using esbuild. The command is:
esbuild ./packages/scraper/src/index.ts --bundle --platform=node --outfile=dist/index.js --target=es2019
Let's add that to the scripts section in the package.json file:
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1",
"build": "esbuild ./packages/src/scraper.ts --bundle --platform=node --outfile=dist/index.js --target=es2019"
},
Let's check it works:
npm run build
There should be output that looks like this:
> freecycle-scraper@1.0.0 build
> esbuild ./packages/scraper/src/index.ts --bundle --platform=node --outfile=dist/index.js --target=es2019
dist/index.js 1.1kb
⚡ Done in 16ms
We also need to zip that file up to upload it to AWS. We'll add a script to do that:
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1",
"build": "esbuild ./packages/src/scraper.ts --bundle --platform=node --outfile=dist/index.js --target=es2019",
"package": "npm run build && zip -j dist/index.zip dist/index.js"
},
So we can just run the package script and it will build and zip the file.
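Quick sanity check:
npm run package
ls dist   # should now show index.js and index.zip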
And let's add a test so we've got tests running, I like vitest...
npm i -D vitest
and add this line to the compilerOptions in the tsconfig.json file:
"types": ["vitest/globals"]
Let's update the test script in package.json:
"scripts": {
"test": "vitest",
"build": "esbuild ./packages/src/scraper.ts --bundle --platform=node --outfile=dist/index.js --target=es2019",
"package": "npm run build && zip -j dist/index.zip dist/index.js"
},
Let's add a test:
import type { APIGatewayEvent, Context } from 'aws-lambda'
import { handler } from '../src'

describe('scraper', () => {
  it('should run', async () => {
    await expect(handler({} as APIGatewayEvent, {} as Context)).resolves.toEqual({
      statusCode: 200,
      body: 'hello world',
    })
  })
})
Because we added the globals types to the tsconfig.json file, we don't need to import describe and it; we can just use them directly.
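One gotcha: the types entry only tells TypeScript about the globals; vitest itself also needs them switching on at runtime, otherwise describe won't be defined when the test runs. A minimal vitest.config.ts in the repo root does the job:

// vitest.config.ts
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    // expose describe/it/expect as globals so tests don't need to import them
    globals: true,
  },
})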
Now if we run npm test, vitest will watch the test folders for any changes and run the tests.
The output should look something like this:
➜ freecycle-scraper git:(main) ✗ npm test
> freecycle-scraper@1.0.0 test
> vitest
DEV v0.31.4 ./freecycle-scraper
✓ packages/scraper/__tests__/index.test.ts (1)
Test Files 1 passed (1)
Tests 1 passed (1)
Start at 15:47:29
Duration 433ms (transform 41ms, setup 0ms, collect 19ms, tests 3ms, environment 0ms, prepare 123ms)
PASS Waiting for file changes...
press h to show help, press q to quit
Now we need to deploy that to AWS. So let's set up a top-level folder in our repo called terraform and create some boilerplate in there.
mkdir terraform
cd terraform
touch main.tf
touch config
touch site.tfvars
touch backend.tf
touch provider.tf
touch vars.tf
touch locals.tf
We'll put various things in these files as we go along. Initial contents are:
// backend.tf
terraform {
  backend "s3" {
    encrypt = true
  }
}

// config
region         = "eu-west-2"
bucket         = "569938948469-tfstate"
key            = "services/freecycle-scraper"
dynamodb_table = "569938948469-tfstate-lock"

// locals.tf
locals {
  namespace      = "freecycle-scraper"
  application    = "freecycle scraper"
  lambda_runtime = "nodejs18.x"

  tags = {
    Project     = local.application
    ManagedBy   = "Terraform"
    Application = local.application
    Owner       = "Original Eye"
    Environment = var.environment
  }
}

// provider.tf
provider "aws" {
  region = "eu-west-2"
}

// site.tfvars
aws_region  = "eu-west-2"
aws_profile = "569938948469"
log_level   = "debug"
environment = "prod"

// vars.tf
variable "aws_region" {
  type = string
}

variable "aws_profile" {
  type = string
}

variable "log_level" {
  type = string
}

variable "environment" {
  type = string
}
Terraforming
So there's the barebones of the terraform set up. Now I need to add the lambda. In the main.tf file, I'll add this:
resource "aws_lambda_function" "scraper" {
filename = "../dist/output.zip"
function_name = "${local.namespace}-${var.environment}"
role = aws_iam_role.scraper.arn
handler = "src/index.handler"
description = "Scrapes the freecycle site"
source_code_hash = filebase64sha256("../dist/output.zip")
runtime = local.lambda_runtime
timeout = 15
reserved_concurrent_executions = 5
memory_size = 1024
environment {
variables = {
LOG_LEVEL = var.log_level
}
}
}
which is the lambda itself. Note that filename points at the dist/index.zip our package script produces, and because zip -j flattens the paths inside the archive, the handler is just index.handler.
I also need to add permissions, which can be done like this:
data "aws_iam_policy_document" "lambda_assume_role" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}
resource "aws_iam_role" "scraper" {
name = "${local.namespace}-${var.environment}"
assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}
resource "aws_iam_role_policy_attachment" "scraper_cloudwatch" {
role = aws_iam_role.scraper.name
policy_arn = "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"
}
Now we can cd into the terraform directory and set up terraform:
cd terraform
terraform init -backend-config=config
As long as you've got terraform installed (and the state bucket and DynamoDB lock table named in the config file already exist), this should yield positive results like this:
Initializing the backend...
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
Initializing provider plugins...
- Finding latest version of hashicorp/aws...
- Installing hashicorp/aws v5.1.0...
- Installed hashicorp/aws v5.1.0 (signed by HashiCorp)
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Then we can plan the deployment:
terraform plan -var-file=site.tfvars -out=.tfplan
and we can deploy this with the command:
terraform apply .tfplan
Now let's add the scheduler to the terraform infrastructure so that we can trigger the lambda every 10 minutes.
In the main.tf file, I'll add this:
resource "aws_cloudwatch_event_rule" "scheduler" {
name = "${local.namespace}-scheduler-${var.environment}"
description = "Schedule to trigger the scraper"
schedule_expression = "rate(10 minutes)"
}
resource "aws_cloudwatch_event_target" "scheduler" {
arn = aws_lambda_function.scraper.arn
rule = aws_cloudwatch_event_rule.scheduler.name
}
resource "aws_lambda_permission" "scheduler" {
statement_id = "AllowExecutionFomEventBridge"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.scraper.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.scheduler.arn
}
resource "aws_iam_role" "scheduler" {
name = "${local.namespace}-scheduler-${var.environment}"
assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}
resource "aws_iam_role_policy_attachment" "scheduler_cloudwatch" {
role = aws_iam_role.scheduler.name
policy_arn = "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"
}
Let's rerun the plan:
terraform plan -var-file=site.tfvars -out=.tfplan
And if all is well, let's apply it:
terraform apply .tfplan
We should now have a lambda that is triggered every 10 minutes.
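To check it's actually firing, you can watch the function's logs. Assuming the AWS CLI v2 and the default log group naming, something like:
aws logs tail /aws/lambda/freecycle-scraper-prod --follow
should show a "hello world" line roughly every 10 minutes.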