Crawling a JSON API into Google DataStudio
3 steps to integrate external data into your company’s workflow
Recently I’ve been working a lot on gathering knowledge on the web and making it available for myself and others to analyse. In particular, I have been looking at the APIs of top websites and setting up processes for taking that information from an endpoint to a datastore that can be queried on demand.
Using a JSON API, rather than the traditional approach of loading the webpage and extracting the information from the HTML/CSS, has many benefits. In particular, the data arrives in a neat, structured format, which makes it easier to work with and explore, as well as less susceptible to website changes. On top of this, there is often much more data available in the JSON API than is actually rendered on the page, which means you can access ‘secret’ information.
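To make that concrete, here is a minimal sketch of what consuming such an endpoint looks like, assuming a hypothetical URL (https://example.com/api/products) and illustrative field names like `internal_score` — the point is simply that the response is already structured data, no HTML parsing required.

```python
import requests

# Hypothetical JSON endpoint used purely for illustration.
resp = requests.get("https://example.com/api/products")
resp.raise_for_status()
data = resp.json()

# Each record comes back as clean key/value pairs, and often includes
# fields (like "internal_score" here) that never appear on the rendered page.
for product in data.get("products", []):
    print(product.get("name"), product.get("internal_score"))
```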
In this article, I’d like to outline the approach I have used, to show you just how easy it can be. As a quick summary, we will:
- start with an API endpoint
- use Scrapy and Scrapinghub to schedule crawling jobs with Python (a minimal spider is sketched after this list)
- pull and process the data using Google Apps Script
- pass the data on to Google BigQuery, where it will be stored
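As a taste of the crawling step, here is a minimal Scrapy spider against the same kind of hypothetical JSON endpoint; the URL and field names are placeholders rather than a real API, and the real spiders we build later will be more elaborate.

```python
import json

import scrapy


class JsonApiSpider(scrapy.Spider):
    """Sketch of a spider that crawls a JSON endpoint instead of HTML pages."""

    name = "json_api"
    # Hypothetical endpoint used for illustration only.
    start_urls = ["https://example.com/api/products"]

    def parse(self, response):
        # The body is JSON, not HTML, so decode it directly.
        data = json.loads(response.text)
        for product in data.get("products", []):
            # Each yielded dict becomes a scraped item that the
            # downstream steps (Apps Script, BigQuery) can consume.
            yield {
                "name": product.get("name"),
                "price": product.get("price"),
            }
```

Once deployed to Scrapinghub (Scrapy Cloud), a spider like this can be run on a schedule, which is what makes the rest of the pipeline hands-off.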