Scraping Freecycle with AWS - Part 1, boilerplate.

Creating a serverless application to scrape Freecycle and send me a notification when something is posted.

Steve Clements
freecycle

Skip to part 2 - Doing stuff

Introduction

I recently started looking for things on Freecycle and I wanted to set up some alerts. The alerts don't seem to work, or I've done it wrong, or something. Obviously, as a developer, I haven't got the patience to read the docs and make sure I've done everything correctly; no, my first response is to think, "ooh, I could do this myself!". So I thought I'd write a scraper to log in as me every 10 minutes or so and check the site for new posts. If there are any new posts I'll send myself a notification straight to my phone using AWS SNS. Nice. Oh, and I'll use terraform, cos, well, I like it. I mean, I could use SAM and learn that instead, but maybe that's for another day.

Architecture

  • A Lambda to do the scraping
  • A trigger to run the Lambda every 10 minutes
  • The Lambda sends a notification to SNS
  • SNS sends the notification to my phone

Solution

Lambda

Authentication

The first thing is to authenticate with the site - or at least check that I'm already authenticated.
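I'll leave the real implementation for part 2, but roughly speaking the check will look something like this. This is only a sketch - the URL, cookie handling and "logged in" marker below are placeholders I've made up, not the real Freecycle details:

// auth-check sketch - everything here is a placeholder, not the real Freecycle flow
const HOME_URL = 'https://www.freecycle.org/' // hypothetical endpoint
const SESSION_COOKIE = process.env.FREECYCLE_COOKIE ?? '' // hypothetical env var holding my session cookie

const isAuthenticated = async (): Promise<boolean> => {
  // Node 18 has fetch built in, so no extra dependency is needed
  const response = await fetch(HOME_URL, {
    headers: { cookie: SESSION_COOKIE },
    redirect: 'manual', // a redirect to a login page is a good sign we're logged out
  })
  if (response.status >= 300 && response.status < 400) return false
  const html = await response.text()
  // guess: the page only shows a "Log out" link when authenticated
  return html.includes('Log out')
}

export { isAuthenticated }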

Trigger

Let's use CloudWatch Events (EventBridge) to trigger the lambda every 10 minutes. I'd also like to vary the trigger time by a small amount, like you can in cron, so that the site doesn't realise it's a bot. I'm not sure if this is necessary, but it's a good idea anyway.
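EventBridge rate() expressions can't add jitter by themselves, so one simple option (a sketch of what I might do, not a decision yet) is a short random sleep at the top of the handler. The 30-second cap is an arbitrary placeholder:

// random delay so runs don't land on an exact 10 minute boundary
// the Lambda timeout needs to be longer than whatever maximum you pick
const randomDelay = (maxMs = 30_000): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, Math.floor(Math.random() * maxMs)))

// then, at the start of the handler:
// await randomDelay()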

Send SNS notification

SNS to my phone
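The wiring for this comes later, but for reference, publishing from the lambda with the AWS SDK v3 (after an npm i @aws-sdk/client-sns) will look roughly like this. The TOPIC_ARN environment variable is a placeholder I'm assuming we'll pass in from terraform:

import { SNSClient, PublishCommand } from '@aws-sdk/client-sns'

const sns = new SNSClient({ region: 'eu-west-2' })

// publish a message to the topic; SNS takes care of getting it to my phone
const notify = async (message: string): Promise<void> => {
  await sns.send(
    new PublishCommand({
      TopicArn: process.env.TOPIC_ARN, // placeholder env var
      Message: message,
    })
  )
}

export { notify }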

Set up

First I'll set up a repo on github.

mkdir freecycle-scraper
cd freecycle-scraper
git init
npm init -y
touch .gitignore   # put some things in here we don't want to add to the repo (see below)
git add .
git commit -m "init"
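For reference, my .gitignore ends up looking something like this once the build and terraform bits exist (adjust to taste):

# .gitignore
node_modules/
dist/
.terraform/
*.tfstate
*.tfstate.backup
.tfplan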

I've previously been creating the repo on the GitHub site itself, but it gets a bit tedious, so I thought I'd see if there is a way to do it from the command line. Turns out there is, and it's called gh.

I installed it using:

brew install gh

and then I can create the repo using

gh repo create freecycle-scraper --public --confirm

This creates the repo on github and then adds the remote to my local repo. Nice.

Meta note: I'm writing this blog with the help of Copilot and it makes suggestions, one of which was

gh repo create freecycle-scraper --public --confirm

which I'm about to try, to see if it works.

Authenticating with gh

Need to auth first:

gh auth login

I created and used a PAT from here https://github.com/settings/tokens?type=beta.

I was then able to run the command:

gh repo create freecycle-scraper --public --confirm

then I added that as an origin:

git remote add origin https://github.com/sjclemmy/freecycle-scraper
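and then pushed the initial commit up (the default branch here is main):

git push -u origin main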

Build

I like using esbuild and TypeScript, so I'll set that up now.

npm i -D esbuild   # short form of npm install --save-dev esbuild

We need types for the lambda:

npm i -D @types/aws-lambda

Let's create a packages/scraper/src folder and put the index.ts file in there. We'll put some dummy content in to start with:

import type { APIGatewayEvent, Context } from 'aws-lambda'

// Placeholder handler - the real scraping logic comes in part 2.
// Underscore prefixes keep noUnusedParameters in the tsconfig happy.
const handler = async (_event: APIGatewayEvent, _context: Context) => {
  console.log('hello world')
  return {
    statusCode: 200,
    body: 'hello world',
  }
}
export { handler }

We need a tsconfig file:

// tsconfig.json
{
  "$schema": "https://json.schemastore.org/tsconfig",
  "display": "Node 18 + ESM + Strictest",
  "compilerOptions": {
    "lib": ["es2022", "esnext.asynciterable"],
    "module": "es2022",
    "target": "esnext",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "moduleResolution": "node",
    "allowUnusedLabels": false,
    "allowUnreachableCode": false,
    "exactOptionalPropertyTypes": true,
    "noFallthroughCasesInSwitch": true,
    "noImplicitOverride": true,
    "noImplicitReturns": true,
    "noPropertyAccessFromIndexSignature": true,
    "noUncheckedIndexedAccess": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "importsNotUsedAsValues": "error",
    "resolveJsonModule": true
  }
}

Now we need to build it using esbuild. The command is:

esbuild ./packages/scraper/src/index.ts --bundle --platform=node --outfile=dist/index.js --target=es2019

Let's add that to the scripts section in the package.json file:

"scripts": {
  "test": "echo \"Error: no test specified\" && exit 1",
  "build": "esbuild ./packages/src/scraper.ts --bundle --platform=node --outfile=dist/index.js --target=es2019"
},

Let's check it works:

npm run build

There should be output that looks like this:

> freecycle-scraper@1.0.0 build
> esbuild ./packages/scraper/src/index.ts --bundle --platform=node --outfile=dist/index.js --target=es2019


  dist/index.js  1.1kb

⚡ Done in 16ms

We also need to zip that file up to upload it to AWS. We'll add a script to do that:

"scripts": {
  "test": "echo \"Error: no test specified\" && exit 1",
  "build": "esbuild ./packages/src/scraper.ts --bundle --platform=node --outfile=dist/index.js --target=es2019",
  "package": "npm run build && zip -j dist/index.zip dist/index.js"
},

So we can just run the package script and it will build and zip the file.

And let's add a test so we've got tests running. I like vitest...

npm i -D vitest

and add this line to the compilerOptions in tsconfig.json file:

"types": ["vitest/globals"]

Let's update the test script in package.json:

"scripts": {
  "test": "vitest",
  "build": "esbuild ./packages/src/scraper.ts --bundle --platform=node --outfile=dist/index.js --target=es2019",
  "package": "npm run build && zip -j dist/index.zip dist/index.js"
},

Let's add a test:

import type { APIGatewayEvent, Context } from 'aws-lambda'
import { handler } from '../src'

describe('scraper', () => {
  it('should run', async () => {
    // stub event/context - the placeholder handler doesn't use them yet
    const result = handler({} as APIGatewayEvent, {} as Context)
    await expect(result).resolves.toEqual({
      statusCode: 200,
      body: 'hello world',
    })
  })
})

Because we added the globals to the tsconfig.json file, TypeScript lets us use describe and it directly without importing them. Vitest also needs to expose those globals at runtime, which we can do with the --globals flag or with a small config file.
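A minimal vitest.config.ts for that looks like this:

// vitest.config.ts
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    globals: true, // expose describe/it/expect without explicit imports
  },
})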

Now if we run npm test, vitest will watch the test folders for any change and run the tests. The output should look something like this:

➜  freecycle-scraper git:(main) ✗ npm test

> freecycle-scraper@1.0.0 test
> vitest


 DEV  v0.31.4 ./freecycle-scraper

 ✓ packages/scraper/__tests__/index.test.ts (1)

 Test Files  1 passed (1)
      Tests  1 passed (1)
   Start at  15:47:29
   Duration  433ms (transform 41ms, setup 0ms, collect 19ms, tests 3ms, environment 0ms, prepare 123ms)


 PASS  Waiting for file changes...
       press h to show help, press q to quit

Now we need to deploy that to AWS. So let's set up a top-level folder in our repo called terraform and create some boilerplate in there.

mkdir terraform
cd terraform
touch main.tf
touch config
touch site.tfvars
touch backend.tf
touch provider.tf
touch vars.tf
touch locals.tf

We'll put various things in these files as we go along. Initial contents are:

// backend.tf

terraform {
  backend "s3" {
    encrypt = true
  }
}

// config

region = "eu-west-2"
bucket = "569938948469-tfstate"
key = "services/freecycle-scraper"
dynamodb_table = "569938948469-tfstate-lock"


// locals.tf

locals {
  namespace      = "freecycle-scraper"
  application    = "freecycle scraper"
  lambda_runtime = "nodejs18.x"

  tags = {
    Project     = local.application
    ManagedBy   = "Terraform"
    Application = local.application
    Owner       = "Original Eye"
    Environment = var.environment
  }
}

// provider.tf

provider "aws" {
  region = "eu-west-2"
}

// site.tfvars

aws_region  = "eu-west-2"
aws_profile = "569938948469"
log_level   = "debug"
environment = "prod"

// vars.tf

variable "aws_region" {
  type = string
}
variable "aws_profile" {
  type = string
}
variable "log_level" {
  type = string
}
variable "environment" {
  type = string
}

Terraforming

So that's the bare bones of the terraform setup. Now I need to add the lambda. In the main.tf file, I'll add this:

resource "aws_lambda_function" "scraper" {
  filename      = "../dist/index.zip"
  function_name = "${local.namespace}-${var.environment}"
  role          = aws_iam_role.scraper.arn
  handler       = "index.handler" # zip -j puts index.js at the root of the zip
  description   = "Scrapes the freecycle site"

  source_code_hash               = filebase64sha256("../dist/index.zip")
  runtime                        = local.lambda_runtime
  timeout                        = 15
  reserved_concurrent_executions = 5
  memory_size                    = 1024

  environment {
    variables = {
      LOG_LEVEL = var.log_level
    }
  }
}


That's the lambda itself.

I also need to add permissions, which can be done like this:

data "aws_iam_policy_document" "lambda_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "scraper" {
  name               = "${local.namespace}-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}

resource "aws_iam_role_policy_attachment" "scraper_cloudwatch" {
  role       = aws_iam_role.scraper.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"
}

Now we can cd into the terraform directory and set up terraform:

cd terraform
terraform init -backend-config=config

As long as you've got terraform installed, this should yield positive results like this:

Initializing the backend...

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Finding latest version of hashicorp/aws...
- Installing hashicorp/aws v5.1.0...
- Installed hashicorp/aws v5.1.0 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Then we can plan the deployment:

terraform plan -var-file=site.tfvars -out=.tfplan

and we can deploy this with the command:

terraform apply .tfplan

Now let's add the scheduler to the terraform infrastructure so that we can trigger the lambda every 10 minutes. In the main.tf file, I'll add this:

resource "aws_cloudwatch_event_rule" "scheduler" {
  name                = "${local.namespace}-scheduler-${var.environment}"
  description         = "Schedule to trigger the scraper"
  schedule_expression = "rate(10 minutes)"
}

resource "aws_cloudwatch_event_target" "scheduler" {
  arn  = aws_lambda_function.scraper.arn
  rule = aws_cloudwatch_event_rule.scheduler.name
}


resource "aws_lambda_permission" "scheduler" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scraper.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.scheduler.arn
}

resource "aws_iam_role" "scheduler" {
  name               = "${local.namespace}-scheduler-${var.environment}"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}

resource "aws_iam_role_policy_attachment" "scheduler_cloudwatch" {
  role       = aws_iam_role.scheduler.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"
}

Let's rerun the plan:

terraform plan -var-file=site.tfvars -out=.tfplan

And if all is well, let's apply it:

terraform apply .tfplan

We should now have a lambda that is triggered every 10 minutes.
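A quick way to check it's actually firing is to tail the lambda's log group (assuming the AWS CLI v2 and the prod environment from site.tfvars, so the function name is freecycle-scraper-prod):

aws logs tail /aws/lambda/freecycle-scraper-prod --follow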

Continue to part 2 - to see how to actually do stuff