Scraping Freecycle with AWS - Part 3, Parsing

Creating a serverless application to scrape Freecycle and send me a notification when something is posted.

Steve Clements
freecycle

Back to part 2 - Authentication

Part 3

The Data

If you remember from the previous post, the data we got back from Freecycle is an HTML page full of things we don't need. The actual data is way down the page and formatted in a horrible way, so we need to turn it into something we can use.

The unifiedjs ecosystem is really useful for working with structured data and makes it easy to extract the data you need.

Looking at the docs, rehype (the HTML side of unified) looks like the place to start.

npm i rehype

So let's create a new parse module. In the src directory, create a new file called parse.ts.

import { rehype } from 'rehype'

export const parse = (html: string) => {
  const parsed = rehype().parse(html)
  return parsed
}

and let's update our initial function to log out the result...

import { parse } from './parse'
// ...etc...

const handler = async () => {
  const cookie = await getLoginCookie()
  const latestPosts = await getLatestPosts(cookie.split(';')[0] as string)
  console.log(await parse(latestPosts))
  return { statusCode: 200, body: 'processed' }
}

If we run that (npm test), we can see that it has parsed the HTML into a tree structure:

{
  type: 'root',
  children: [
    { type: 'doctype', position: [Object] },
    {
      type: 'element',
      tagName: 'html',
      properties: {},
      children: [Array],
      position: [Object]
    }
  ],
  data: { quirksMode: false },
  position: {
    start: { line: 1, column: 1, offset: 0 },
    end: { line: 794, column: 1, offset: 77312 }
  }
}

This is a good start, but we need to get the data out of it.
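As an aside, the [Object] and [Array] placeholders in that output are there because console.log in Node only prints a couple of levels of nesting by default. If you want to see the whole tree while exploring, one option is JSON.stringify. Here is a minimal sketch using a small hand-built tree (the nestedTree value is invented for illustration, it is not real parser output):

```typescript
// Invented, hand-built tree that is nested deeply enough for console.log to truncate it.
const nestedTree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'html',
      children: [
        {
          type: 'element',
          tagName: 'body',
          children: [{ type: 'text', value: 'hello' }],
        },
      ],
    },
  ],
}

// JSON.stringify serialises every level; the third argument sets the indentation.
const fullDump = JSON.stringify(nestedTree, null, 2)
console.log(fullDump)
```

On a real page the dump will be very long, so it is probably easier to write it to a file than to read it in the terminal.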

We already have the data in a file from the previous post, so we can see what we need. If we search that file for posts, we find this little segment:

<!-- Items list -->
            <div class="item-list-items" style="height: auto;" >
                <div v-masonry transition-duration="0.3s" item-selector=".post-grid-item" class="item-grid-view" ref="post_gridview" v-if="$root.posts.layout === 'grid'" >
                    <div class="grid-sizer"></div>
                    <div class="gutter-sizer"></div>
                    <fc-data v-masonry-tile :data="{&quot;count&quot;:1411,&quot;posts&quot;:[{&quot;id&quot;:93240934,&quot;userId&quot;:28212562,&quot;subject&quot;:&quot;Mini Tube Cutter &quot;,&quot;location&quot;:&quot;Pennsylvania Road, EX4 6DH&quot;,&quot;description&quot;:&quot;A Mi...
                    ...etc...

So it appears that the bit we need can be identified by the fc-data tag. Let's try to get that.
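Conceptually, finding that tag is just a depth-first walk over the tree, checking each element's tagName. The following is a minimal sketch of that idea with no libraries at all. SketchNode, findByTagName and sketchTree are hypothetical names made up for this example, and the tree is hand-built rather than real parser output:

```typescript
// Minimal, hand-rolled stand-in for a tree node (not the real hast types).
type SketchNode = {
  type: string
  tagName?: string
  properties?: Record<string, unknown>
  children?: SketchNode[]
}

// Hypothetical helper: depth-first search for the first element with the given tag name.
const findByTagName = (
  node: SketchNode,
  tagName: string
): SketchNode | undefined => {
  if (node.type === 'element' && node.tagName === tagName) return node
  for (const child of node.children ?? []) {
    const match = findByTagName(child, tagName)
    if (match) return match
  }
  return undefined
}

// Invented sample tree, shaped like the parser output shown earlier.
const sketchTree: SketchNode = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'html',
      properties: {},
      children: [
        { type: 'element', tagName: 'div', properties: {}, children: [] },
        {
          type: 'element',
          tagName: 'fc-data',
          properties: { ':data': '{"count":0,"posts":[]}' },
          children: [],
        },
      ],
    },
  ],
}

const fcDataNode = findByTagName(sketchTree, 'fc-data')
console.log(fcDataNode?.properties)
```

This sketch stops at the first match. It is only meant to show the shape of the problem, not to replace a proper tree-walking utility.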

This is where we need to use some of the utilities of the unifiedjs ecosystem. We can use unist-util-visit (npm i unist-util-visit), which calls a function for every node of a given type in the tree. Update the code to look like this:

import { rehype } from 'rehype'
import { visit } from 'unist-util-visit'

export const parse = (html: string) => {
  const tree = rehype().parse(html)
  let requiredData
  visit(tree, 'element', (node: any) => {
    if (node.tagName === 'fc-data') {
      requiredData = node.properties
    }
  })
  return requiredData
}

That should get us a bit closer:

{
  posts: {
    ':data': `{"count":1406,"posts":[{"id":93253696,"userId":33060400,"subject":"Laser Printer","location":"Exeter ","description":"Hi all,\\r\\n\\r\\nI know this is a big ask but i am looking for a laser printer dont mind if its tatty looking as long as it works, and the toner isnt a r...

So it looks like the data we need is in the :data property, as a JSON string. Let's update the code to parse that.

import { rehype } from 'rehype'
import { visit } from 'unist-util-visit'

export const parse = (html: string) => {
  const tree = rehype().parse(html)
  let requiredData
  visit(tree, 'element', (node: any) => {
    if (node.tagName === 'fc-data') {
      requiredData = JSON.parse(node.properties[':data'])
    }
  })
  return requiredData
}

This should log the parsed data as real objects rather than one long JSON string:

[
  {
    id: 93253696,
    userId: 33060400,
    subject: 'Laser Printer',
    location: 'Exeter ',
    description: 'Hi all,

    ...etc...
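It's worth noting why JSON.parse works here. In the raw HTML the attribute is full of &quot; entities, but the HTML parser decodes those, so by the time we read the :data property it is plain JSON. Here is a minimal sketch of that step on its own. The sample string is invented and shortened, and decodeQuotes is a hypothetical helper that only handles &quot; (a real parser decodes every character reference):

```typescript
// Raw attribute value as it would appear in the HTML source (invented, shortened sample).
const rawAttribute =
  '{&quot;count&quot;:1,&quot;posts&quot;:[{&quot;id&quot;:1,&quot;subject&quot;:&quot;Example item&quot;}]}'

// Hypothetical helper: decodes only &quot;, which is enough for this sample.
const decodeQuotes = (value: string): string => value.split('&quot;').join('"')

// After decoding, the attribute is ordinary JSON.
const decodedAttribute = decodeQuotes(rawAttribute)
const parsedAttribute = JSON.parse(decodedAttribute) as {
  count: number
  posts: { id: number; subject: string }[]
}

console.log(parsedAttribute.posts[0]?.subject)
```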

Now let's add some types and process the data a bit more to get what we really need. The parse file should look like this:


import { rehype } from 'rehype'
import { visit } from 'unist-util-visit'

type FreecycleEntry = {
  type: { name: 'Offer' | 'Wanted' }
  subject: string
  description: string
  date: string
  time: string
  group: {
    name: string
  }
  id: number
}

export const parse = async (htmlFile: string) => {
  const tree = rehype().parse(htmlFile)
  let requiredData
  visit(tree, 'element', (node) => {
    if (
      node.tagName === 'fc-data' &&
      node.properties &&
      node.properties[':data'] &&
      typeof node.properties[':data'] === 'string'
    ) {
      requiredData = JSON.parse(node.properties[':data'])
        .posts.filter(
          ({ type }: FreecycleEntry) => type.name.toLowerCase() === 'offer'
        )
        .map(
          ({
            subject,
            description,
            date,
            time,
            group,
            id,
          }: FreecycleEntry) => ({
            subject,
            description,
            date,
            time,
            location: group.name,
            id,
          })
        )
    }
  })
  return requiredData
}


I've added some types, filtered the data to only include offers, and mapped each post to a new object containing only the fields we need. This gives us a nice array of objects that we can use to send notifications.
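To see the filter and map steps in isolation, here is a small sketch that runs them over a couple of made-up posts. SketchEntry, toOffers and samplePosts are hypothetical names for this example, and every value in the sample data is a placeholder rather than real Freecycle data:

```typescript
// Only the fields used by the pipeline (id is numeric in the sample data shown earlier).
type SketchEntry = {
  type: { name: string }
  subject: string
  description: string
  date: string
  time: string
  group: { name: string }
  id: number
}

// Hypothetical helper mirroring the filter and map steps from the parse function.
const toOffers = (posts: SketchEntry[]) =>
  posts
    // Keep offers only, ignoring case.
    .filter(({ type }) => type.name.toLowerCase() === 'offer')
    // Flatten group.name into location and drop everything else.
    .map(({ subject, description, date, time, group, id }) => ({
      subject,
      description,
      date,
      time,
      location: group.name,
      id,
    }))

// Invented sample posts; the date and time formats are placeholders.
const samplePosts: SketchEntry[] = [
  {
    type: { name: 'Offer' },
    subject: 'Example chair',
    description: 'A placeholder description',
    date: '2023-01-01',
    time: '10:00',
    group: { name: 'Example Group' },
    id: 1,
  },
  {
    type: { name: 'Wanted' },
    subject: 'Example printer',
    description: 'Another placeholder description',
    date: '2023-01-02',
    time: '11:00',
    group: { name: 'Example Group' },
    id: 2,
  },
]

const offers = toOffers(samplePosts)
console.log(offers)
```

Pulling the steps out into a plain function like this also makes them easy to unit test without needing any HTML.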

There's not much code but it's doing a lot thanks to the unifiedjs ecosystem. See the full code at https://github.com/sjclemmy/freecycle-scraper

In the next post I'll show how I built the notification system.