My roommate needed help with building a web scraper. As a result of various constraints, we ended up using Heroku. There was, however, a database row limit that we had to work around. It turned out to be a nice opportunity to play with the new PostgreSQL 9.5 INSERT ... ON CONFLICT DO UPDATE functionality, also known as UPSERT. The goal of this article is to introduce UPSERT to you via a practical example.
Introduction
My roommate, an aspiring programmer, recently asked me for advice on how to build a web scraper. The scraper should run on a machine that is constantly connected to the internet, because it should check the targeted website at least every hour or so. Running the scraper on his laptop is not an option for my roommate, because he frequently takes it with him on the road. Neither is running it on a remote server, because he does not own one and is not planning to spend money on one. We tried out Heroku, a platform-as-a-service provider with a free tier that accommodates all of these constraints.
Choosing the right storage
The Heroku Scheduler add-on enabled us to run a script every 10 minutes (the highest available frequency) that scrapes the targeted website. With that requirement settled, we needed a way to compare the old state to the new state, so we could send ourselves an email as soon as the website changed. The simplest solution we could come up with involved saving the hashed HTML contents into a Heroku config variable. That solves the notification problem, but does not allow us to analyse the data later. For this reason, using a database made sense. PostgreSQL storage is cheap, actually free on Heroku, so we picked that.
Heroku limitations
Normally, I would insert one row per scraping run into the database, containing the scraped contents (or a foreign key to a unique piece of content) and a timestamp. In this case, however, that was not a good idea, because the free Heroku PostgreSQL database offering has certain limitations: you are allowed to have a maximum of 10,000 rows in the database. A scraper that scrapes every 10 minutes inserts 144 rows per day and reaches that limit after roughly 70 days of service. We needed a way to circumvent the limitation, since we were planning on scraping for longer than that.
What if we scaled horizontally instead of vertically? Instead of adding rows, we could grow columns and consequently circumvent the row limit: one row per unique piece of scraped content, with an ever-expanding array of timestamps. PostgreSQL arrays seem to be a good fit for this. One could implement this using two queries: first, a SELECT establishes whether we have seen the content before; second, an INSERT or UPDATE respectively inserts a new piece of content together with the current time, or updates the existing row by appending the current time to the array. Or you can use the new INSERT ... ON CONFLICT DO UPDATE operation instead, as I will show you in this article.
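To make the examples below concrete, here is a minimal sketch of such a table. The seen_at column name appears later in the article; the table name scrapes and the column name body are placeholders I chose for illustration.

    -- One row per unique piece of scraped content, with an ever-growing
    -- array of observation timestamps.
    CREATE TABLE scrapes (
        body    text PRIMARY KEY,       -- the scraped HTML document
        seen_at timestamptz[] NOT NULL  -- every time this exact content was seen
    );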
Methods
In this article, I will only show you how to complete the task using UPSERT. If you would like to see how you could achieve similar behavior with PostgreSQL versions below 9.5, please check out this Stack Overflow answer.
The query
We built the plumbing of the scraper using Python; the source code is available on GitHub. Every time it runs, the Python script executes the following query, which is the interesting part.
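The exact statement lives in the repository; the query below is a sketch reconstructed from the description above, against the hypothetical scrapes table from earlier. The :'body' and :ts placeholders are psql variables, which also makes the query usable in the test that follows.

    -- Insert new content with a one-element timestamp array; if the content
    -- was seen before, append the new timestamp to the existing array instead.
    INSERT INTO scrapes (body, seen_at)
    VALUES (:'body', ARRAY[:ts])
    ON CONFLICT (body)
    DO UPDATE SET seen_at = scrapes.seen_at || EXCLUDED.seen_at;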
The test
We supply the query above to the script below via the $QUERY variable. First, we initialize an empty database and run the query twice with the body variable set to content and ts set to the SQL function now(). After that, we simulate a change on the targeted website by running the query again after changing body from content to changed. In production, the body variable contains an HTML document. The output of the script is printed below it.
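A minimal psql sketch of that flow, assuming the query above is saved in a file called query.sql (a file name I made up for illustration), could look like this:

    -- Run the query twice with the same body, then once with a changed body.
    \set ts now()
    \set body content
    \i query.sql    -- first run: inserts the row ('content', [t1])
    \i query.sql    -- second run: conflict, appends t2 to the same row
    \set body changed
    \i query.sql    -- simulated change: inserts a second row ('changed', [t3])

    -- Inspect the end state: two rows, the first one seen twice.
    SELECT body, array_length(seen_at, 1) AS times_seen FROM scrapes;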
Results
The scraper has been running flawlessly in production for almost two weeks. Let’s see how many rows have been added during that time.
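With the table sketched above, that is a plain row count:

    -- Number of rows accumulated after almost two weeks of scraping.
    SELECT count(*) FROM scrapes;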
After running every 10 minutes for almost two weeks, the scraper inserted just 24 records. Before we inspect the contents of the database, let’s make sure that we really are in compliance with the Heroku PostgreSQL maximum rows limitation:
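One way to make that explicit is to compare the row count against the 10,000-row cap:

    -- How much of Heroku's 10,000-row limit have we used?
    SELECT count(*) AS rows_used,
           10000 - count(*) AS rows_to_spare
    FROM scrapes;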
We have to unnest the seen_at array to obtain the total count of scraper runs.
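Using the same hypothetical table, the unnested count looks like this:

    -- Total number of scraper runs: one timestamp per run, across all rows.
    SELECT count(*)
    FROM scrapes, unnest(seen_at) AS seen;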
Let's break those 2282 scraper runs down by date and aggregate the number of checks and content changes that occurred on each day. Please check out the appendix for the exact query that I used.
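A query along the following lines produces such a breakdown; it is a sketch rather than the exact appendix query, and it counts a change on the day a row's first timestamp was recorded, assuming timestamps are appended in chronological order.

    -- Checks per day: every recorded timestamp counts as one check.
    -- Changes per day: a row counts as a change on the day it was first seen.
    SELECT seen::date AS day,
           count(*) AS checks,
           count(*) FILTER (WHERE seen = seen_at[1]) AS changes
    FROM scrapes, unnest(seen_at) AS seen
    GROUP BY day
    ORDER BY day;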
The sums of checks and changes match the unnested and total counts above, respectively. Note: the number of checks differs between days, probably because of the following:
Scheduler is a best-effort service. There is no guarantee that jobs will execute at their scheduled time, or at all. Scheduler has a known issue whereby scheduled processes are occasionally skipped.
Source: Known issues and alternatives; Heroku Scheduler documentation
Discussion
This scraping method only works when the website is static and its content changes slowly; it breaks down for dynamic websites that produce different HTML output on every request. For example, some websites return a different XSRF token at every visit. In that case, every scraping run inserts a new row into the database, negating the savings of our UPSERT-based horizontal expansion.
Conclusion
The HTML output of our targeted website only changed when the content of interest changed, which explains our high ratio of checks to changes. Using UPSERT for scraping turned out to be a good fit for this website, because it lets us keep scraping for a long time to come while still logging all of the captured data.
Acknowledgements
Thank you to the people who made Org mode and Babel. :-)
Appendix
The source code of this article is available online.