How to Use GitHub Actions to Automate Data Scraping

Tom Willcocks
Data science at Nesta
7 min read · Mar 21, 2024


Introduction

In this blog, we explore how to automate a data scraping process in the cloud using GitHub Actions.

For this case study, we will demonstrate how to automate a Python script which fetches a paginated bulk dataset from the Gateway to Research API (though the techniques used can be applied to a variety of data processing tasks).

Using GitHub Actions enables this to be done entirely in the cloud and to be configured to run at regular intervals (in this case, every day).

We cover the following in relation to GitHub Actions and data pipelines:

  1. How the Python script works
  2. How GitHub Actions works
  3. Limitations of GitHub Actions
    - The GitHub Actions minutes limit
    - The GitHub Actions memory limit, encountered when scraping large datasets
  4. What’s next (after GitHub Actions)

Case study: Why we needed to automate the scraping of Gateway to Research data

Horizon-scanning is one of the areas of expertise within Nesta’s Discovery Hub. Our horizon-scanning uses a combination of qualitative desk research and data-driven methods (usually data relating to investments, academic research and patents amongst others).

For academic research data, our primary source of information is the UK Research and Innovation Gateway to Research (GtR) API. It details the research projects funded by UKRI and is an early indicator of research and development activity in the UK, both private and public sector.

We use this information to understand trends in research funding and identify innovations relevant to Nesta’s three missions, as well as for other parts of our organisation that could utilise this horizon-scanning capacity.

To achieve this, we needed to regularly download the most up-to-date version of this dataset, and so decided to automate the process.

Creating a Python script for downloading and storing the data

First, we needed to create a Python script to pull the data. Instead of extracting information about each research project by name, we consulted the GtR 2 API documentation and used its bulk endpoints, which provide the data as a bulk resource.

For each bulk endpoint, the dataset is paginated into small chunks. The API has an upper limit on chunk size, and that upper limit is what the script uses. The main scraper function iterates through each page, appending its contents to a Python list that gradually accumulates the full dataset. Once all pages have been scraped and appended, the data is uploaded to AWS S3 in a time-stamped directory.

We’ve open sourced this work — you can view the script here.
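
For a sense of the shape of that script, here is a minimal sketch of the pattern described above. The base URL, the p/s paging parameters, the page-size limit of 100, the response key and the S3 key layout are illustrative assumptions rather than a copy of the open-sourced code:

import json
from datetime import datetime
from typing import Dict, List

import boto3
import requests

GTR_BASE_URL = "https://gtr.ukri.org/gtr/api"  # assumed GtR 2 API base URL
PAGE_SIZE = 100  # assumed upper limit on page ("chunk") size


def fetch_all_pages(endpoint: str) -> List[Dict]:
    """Iterate through every page of a bulk endpoint, accumulating results in a list."""
    results: List[Dict] = []
    page = 1
    while True:
        response = requests.get(
            f"{GTR_BASE_URL}/{endpoint}",
            params={"p": page, "s": PAGE_SIZE},  # assumed page-number and page-size parameters
            headers={"Accept": "application/json"},
            timeout=60,
        )
        response.raise_for_status()
        items = response.json().get("items", [])  # assumed key; the real key varies by endpoint
        if not items:  # an empty page means there is no more data to fetch
            break
        results.extend(items)
        page += 1
    return results


def upload_to_s3(data: List[Dict], endpoint: str, bucket: str) -> None:
    """Upload the accumulated data to a time-stamped directory in S3."""
    timestamp = datetime.now().strftime("%Y-%m-%d")
    key = f"gtr/{timestamp}/{endpoint.replace('/', '_')}.json"  # illustrative key layout
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=json.dumps(data))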

So far so good — but if we wished to run this script every day or every week, to get the most up-to-date information about new research projects, we would need an automated workflow.

Automating the workflow using GitHub Actions

GitHub Actions is enabled when a workflow YAML file is stored in the repository’s .github/workflows/ directory; in our case, the file lives at [repo_root]/.github/workflows/main.yaml.

There are two main ways to create this file: a) via the GitHub website, in the Actions section of your repository, or b) by writing the file manually and pushing it to your GitHub repo. For our purposes, we found that the second approach offered more control, and it was nicer to create the file in our own IDE.

The GitHub Actions interface

In your repo on GitHub, click into “Actions”. If no Action is configured yet, you will see a “Get started with GitHub Actions” menu with various options. If the repository already contains an Action that GitHub recognises, you’ll see a dashboard showing all previous workflow runs. This is very useful, as it lets us watch a workflow run live via its logs, and use those logs to investigate previous runs, the commits that triggered them, and whether the Action was successful.


The main.yaml file

This is the file which the Actions runner interprets to execute your workflow. The elements of the workflow file are as follows:

  1. Name

This names your workflow — you’ll see this name in the GitHub Actions interface.

name: Gateway to Research API Automation

2. On (schedule)

To enable your workflow to execute, you need to specify its triggers.

Our workflow has two triggers: a ‘cron job’ scheduled to execute every day at midnight, and a push trigger that fires whenever the repo is pushed to any branch. The latter is useful when testing the workflow, though there are considerations which we discuss in more detail below.

on:
  schedule:
    - cron: "0 0 * * *" # Run every day at midnight
  push:
    branches-ignore: [] # will run whenever pushed to any branch

What is a cron job?

A cron job is a scheduled task used in Unix-like operating systems that allows users to run scripts, commands, or software programs at specified times, dates, or intervals. Cron syntax is a string of five (or sometimes six) fields separated by spaces, representing different units of time.

In the example above, the first “0” represents the minute and the second the hour. The last three values are, in order, the day of the month, the month and the day of the week. An asterisk means “every” possible value for its time unit, such as “every hour” or “every day”.

Here are some examples:

# Run every day at midnight
- cron: "0 0 * * *"

# Run every week on Monday
- cron: "0 0 * * 1"

# Run every 1st day of the month
- cron: "0 0 1 * *"

3. Jobs (workflows)

These are the workflows that will be executed by the above triggers. There is just one job, “run_script”, which is defined by three components: runs-on, strategy and steps.

runs-on

jobs:
  run_script:
    runs-on: ubuntu-latest # Use the latest version of Ubuntu

This line specifies that the job should be executed on a runner (a virtual machine, or a physical machine configured to run GitHub Actions workflows) using the latest version of Ubuntu available on GitHub Actions. Alternatives to Ubuntu include Windows, macOS and self-hosted runners. Containers, which are lightweight, standalone, executable packages that include everything needed to run a piece of software, can also be defined here to run the job.

strategy

    strategy:
      matrix:
        endpoint: [
            "funds",
            "organisations",
            # "outcomes",
            # "outcomes/keyfindings",
            # "outcomes/impactsummaries",
            # "outcomes/publications",
            # "outcomes/collaborations",
            # "outcomes/intellectualproperties",
            # "outcomes/policyinfluences",
            # "outcomes/products",
            # "outcomes/researchmaterials",
            # "outcomes/spinouts",
            # "outcomes/furtherfundings",
            # "outcomes/disseminations",
            "persons",
            "projects",
          ]

The “matrix” strategy enables the job to be run multiple times in parallel with different settings. In this case, it is used to run the script against different endpoints. We are currently focussing on a smaller number of endpoints, so the ones we are not using are commented out.

steps (part 1)

“steps” contains all of the tasks which are executed for each workflow run (in this case, all of these happen for each endpoint).

Here, we upgrade pip and botocore, check out the repository, set up a specific version of Python, set up Miniconda and activate the environment, install direnv, and install the dependencies from requirements.txt.

    steps:
      - name: Upgrade pip
        run: |
          python -m pip install --upgrade pip
          python -m pip install --upgrade botocore

      - name: Checkout repository
        uses: actions/checkout@v4 # Use the latest version compatible

      - name: Set up Python
        uses: actions/setup-python@v4 # Use the latest version compatible
        with:
          python-version: "3.8"

      - name: Set up Miniconda
        uses: conda-incubator/setup-miniconda@v2
        with:
          activate-environment: discovery_gtr # Check your environment name
          environment-file: environment.yaml # Verify the relative path

      - name: Install direnv
        run: |
          sudo apt-get update
          sudo apt-get install direnv

      - name: Install dependencies
        run: |
          make install
          pip install -r requirements.txt
          conda list # Print installed packages for debugging

steps (part 2)

The final step sets up the environment variables and runs the script.

      - name: Run GtR script
        env:
          AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
          AWS_SECRET_KEY: ${{ secrets.AWS_SECRET_KEY }}
          MY_BUCKET_NAME: ${{ secrets.MY_BUCKET_NAME }}
          DESTINATION_S3_PATH: ${{ secrets.DESTINATION_S3_PATH }}
          ENDPOINT: ${{ matrix.endpoint }}
        run: python discovery_gtr/getters/gtr_to_s3.py

The “env” block sets the environment variables. The first four are the same across all of the parallel jobs, so they are taken from GitHub Secrets. These are securely encrypted in the GitHub repo, via Settings -> Security -> Secrets and variables -> Actions -> Repository secrets.

GitHub Actions Secrets interface

The ENDPOINT variable is set to the value of matrix.endpoint from the matrix strategy defined above. It dynamically assigns one of the predefined endpoint values (“funds”, “organisations”, etc.) to ENDPOINT in each parallel job.

The last line runs the script.
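
Inside the Python script, these environment variables can then be read at runtime. The snippet below is a hypothetical illustration of that pattern (the real gtr_to_s3.py may well structure this differently):

import os

import boto3

# Read the values injected by the "env" block of the workflow step.
# The variable names match the workflow above; the client construction is illustrative.
aws_access_key = os.environ["AWS_ACCESS_KEY"]
aws_secret_key = os.environ["AWS_SECRET_KEY"]
bucket_name = os.environ["MY_BUCKET_NAME"]
destination_path = os.environ["DESTINATION_S3_PATH"]
endpoint = os.environ["ENDPOINT"]  # one of "funds", "organisations", "persons", "projects"

s3_client = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key,
)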

Limitations of GitHub Actions

Limited Minutes

There are usage limits for GitHub Actions workflows. Pushing the workflow to test it in the cloud, plus running the daily download for this and a separate workflow, meant that we needed to be careful not to run out of our organisation’s minutes for the month. If you are planning to rely on GitHub Actions, we would recommend co-ordinating with any other colleagues in your organisation who might be using it.

Downloading large files

Different datasets might come with other challenges. For example, to support Nesta’s data-driven horizon scanning activities and research, we also download venture capital investment data from a proprietary source (Crunchbase). In that case we did not have to use multiple endpoints or iterate through paginated resources, as the API supplies all of the datasets in a single bulk request.

However, this meant that we had to manage how much data was held in memory, due to the restrictions on GitHub-hosted runners, which led us to stream the upload to S3. These limits are specific to GitHub-hosted runners. Alternatives include larger or self-hosted runners, which offer more flexibility in memory and processing capability; however, setting these up takes time and creates additional complexity, such as infrastructure management.
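
The Crunchbase pipeline itself is not reproduced here, but the general streaming pattern looks something like the following sketch, in which the URL, bucket and key names are placeholders. By handing the raw HTTP response straight to boto3’s upload_fileobj, the file is transferred in chunks and never has to fit into the runner’s memory at once:

import os

import boto3
import requests

BULK_URL = "https://example.com/bulk_export.tar.gz"  # placeholder for the proprietary bulk endpoint
BUCKET = os.environ["MY_BUCKET_NAME"]
KEY = "crunchbase/bulk_export.tar.gz"  # illustrative destination key

s3_client = boto3.client("s3")

# Stream the HTTP response and pass the file-like object straight to S3.
# upload_fileobj performs a chunked (multipart) upload, so the full file
# is never held in the runner's memory at once.
with requests.get(BULK_URL, stream=True, timeout=300) as response:
    response.raise_for_status()
    response.raw.decode_content = True  # let urllib3 handle any content-encoding
    s3_client.upload_fileobj(response.raw, BUCKET, KEY)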

What next?

Prototyping the data update pipeline in GitHub Actions was worthwhile: we now have access to the most up-to-date data on research funding.

I’d recommend GitHub Actions for use where you want to automate a simple, task-based workflow in your GitHub repository without many integrations. A major benefit is that it does not require the management of infrastructure and is simple to deploy. It’s important to be mindful of the usage limits, but within these boundaries it’s a useful option in your toolbox.

However, the limitations discussed above proved too constraining for us. Hence, our next step was to re-implement this workflow in Airflow, which is more complex but also more versatile and powerful.
