Skip to content

Latest commit

 

History

History
390 lines (313 loc) · 18.8 KB

README.md

File metadata and controls

390 lines (313 loc) · 18.8 KB

AlphaFold Protein Structure Database

Introduction

The AlphaFold UniProt release (214M predictions) is hosted on Google Cloud Public Datasets, and is available to download at no cost under a CC-BY-4.0 licence. The dataset is in a Cloud Storage bucket, and metadata is available on BigQuery. A Google Cloud account is required for the download, but the data can be freely used under the terms of the CC-BY 4.0 Licence.

This document provides an overview of how to access and download the dataset for different use cases. Please refer to the AlphaFold database FAQ for further information on what proteins are in the database and a changelog of releases.

📒 Note: The full dataset is difficult to manipulate without significant computational resources (the size of the dataset is 23 TiB, 3 * 214M files).

There are also alternatives to downloading the full dataset:

  1. Download a premade subset (covering important species / Swiss-Prot) via our download page.
  2. Download a custom subset of the data. See below.

If you need to download the full dataset then please see the "Bulk download" section. See "Creating a Google Cloud Account" below for more information on how to avoid any surprise costs when using Google Cloud Public Datasets.

Licence

Data is available for academic and commercial use, under a CC-BY-4.0 licence.

EMBL-EBI expects attribution (e.g. in publications, services or products) for any of its online services, databases or software in accordance with good scientific practice.

If you make use of an AlphaFold prediction, please cite the following papers:

AlphaFold Data Copyright (2022) DeepMind Technologies Limited.

Disclaimer

The AlphaFold Data and other information provided on this site is for theoretical modelling only, caution should be exercised in its use. It is provided 'as-is' without any warranty of any kind, whether expressed or implied. For clarity, no warranty is given that use of the information shall not infringe the rights of any third party. The information is not intended to be a substitute for professional medical advice, diagnosis, or treatment, and does not constitute medical or other professional advice.

Format

Dataset file names start with a protein identifier of the form AF-[a UniProt accession]-F[a fragment number].

Three files are provided for each entry:

  • model_v4.cif – contains the atomic coordinates for the predicted protein structure, along with some metadata. Useful references for this file format are the ModelCIF and PDBx/mmCIF project sites.
  • confidence_v4.json – contains a confidence metric output by AlphaFold called pLDDT. This provides a number for each residue, indicating how confident AlphaFold is in the local surrounding structure. pLDDT ranges from 0 to 100, where 100 is most confident. This is also contained in the CIF file.
  • predicted_aligned_error_v4.json – contains a confidence metric output by AlphaFold called PAE. This provides a number for every pair of residues, which is lower when AlphaFold is more confident in the relative position of the two residues. PAE is more suitable than pLDDT for judging confidence in relative domain placements. See here for a description of the format.

Predictions grouped by NCBI taxonomy ID are available as proteomes/proteome-tax_id-[TAX ID]-[SHARD ID]_v4.tar within the same bucket.

There are also two extra files stored in the bucket:

  • accession_ids.csv – This file contains a list of all the UniProt accessions that have predictions in AlphaFold DB. The file is in CSV format and includes the following columns, separated by a comma:
    • UniProt accession, e.g. A8H2R3
    • First residue index (UniProt numbering), e.g. 1
    • Last residue index (UniProt numbering), e.g. 199
    • AlphaFold DB identifier, e.g. AF-A8H2R3-F1
    • Latest version, e.g. 4
  • sequences.fasta – This file contains sequences for all proteins in the current database version in FASTA format. The identifier rows start with ">AFDB", followed by the AlphaFold DB identifier and the name of the protein. The sequence rows contain the corresponding amino acid sequences. Each sequence is on a single line, i.e. there is no wrapping.

Creating a Google Cloud Account

Downloading from the Google Cloud Public Datasets (rather than from AFDB or 3D Beacons) requires a Google Cloud account. See the Google Cloud get started page, and explore the free tier account usage limits.

IMPORTANT: After the trial period has finished (90 days), to continue access, you are required to upgrade to a billing account. While your free tier access (including access to the Public Datasets storage bucket) continues, usage beyond the free tier will incur costs – please familiarise yourself with the pricing for the services that you use to avoid any surprises.

  1. Go to https://cloud.google.com/datasets.
  2. Create an account:
    1. Click "get started for free" in the top right corner.
    2. Agree to all terms of service.
    3. Follow the setup instructions. Note that a payment method is required, but this will not be used unless you enable billing.
    4. Access to the Google Cloud Public Datasets storage bucket is always at no cost and you will have access to the free tier.
  3. Set up a project:
    1. In the top left corner, click the navigation menu (three horizontal bars icon).
    2. Select: "Cloud overview" -> "Dashboard".
    3. In the top left corner there is a project menu bar (likely says "My First Project"). Select this and a "Select a Project" box will appear.
    4. To keep using this project, click "Cancel" at the bottom of the box.
    5. To create a new project, click "New Project" at the top of the box:
      1. Select a project name.
      2. For location, if your organization has a Cloud account then select this, otherwise leave as is.
  4. Install gsutil:
    1. Follow these instructions.

Accessing the dataset

The data is available from:

Bulk download

We don't recommend downloading the full dataset unless required for processing with local computational resources, for example in an academic high performance computing centre.

We estimate that a 1 Gbps internet connection will allow download of the full database in roughly 2.5 days.

While we don’t know the exact nature of your computational infrastructure, below are some suggested approaches for downloading the dataset. Please reach out to alphafold@deepmind.com if you have any questions.

The recommended way of downloading the whole database is by downloading 1,015,797 sharded proteome tar files using the command below. This is significantly faster than downloading all of the individual files because of large constant per-file latency.

gsutil -m cp -r gs://public-datasets-deepmind-alphafold-v4/proteomes/ .

You will then have to un-tar all of the proteomes and un-gzip all of the individual files. Note that after un-taring, there will be about 644M files, so make sure your filesystem can handle this.

Storage Transfer Service

Some users might find the Storage Transfer Service a convenient way to set up the transfer between this bucket and another bucket, or another cloud service. Using this service may incur costs. Please check the pricing page for more detail, particularly for transfers to other cloud services.

Downloading subsets of the data

AlphaFold Database search

For simple queries, for example by protein name, gene name or UniProt accession you can use the main search bar on alphafold.ebi.ac.uk.

3D Beacons

3D-Beacons is an international collaboration of protein structure data providers to create a federated network with unified data access mechanisms. The 3D-Beacons platform allows users to retrieve coordinate files and metadata of experimentally determined and theoretical protein models from data providers such as AlphaFold DB.

More information about how to access AlphaFold predictions using 3D-Beacons is available at 3D-Beacons documentation.

Other premade species subsets

Downloads for some model organism proteomes, global health proteomes and Swiss-Prot are available on the AFDB website. These are generated from reference proteomes. If you want other species, or all proteins for a particular species, please continue reading.

We provide 1,015,797 sharded tar files for all species in gs://public-datasets-deepmind-alphafold-v4/proteomes/. We shard each proteome so that each shard contains at most 10,000 proteins (which corresponds to 30,000 files per shard, since there are 3 files per protein). To download a proteome of your choice, you have to do the following steps:

  1. Find the NCBI taxonomy ID ([TAX_ID]) of the species in question.
  2. Run gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-[TAX ID]-*_v4.tar . to download all shards for this proteome.
  3. Un-tar all of the downloaded files and un-gzip all of the individual files.

File manifests

Pre-made lists of files (manifests) are available at gs://public-datasets-deepmind-alphafold-v4/manifests. Note that these filenames do not include the bucket prefix, but this can be added once the files have been downloaded to your filesystem.

One can also define their own list of files, for example created by BigQuery (see below). gsutil can be used to download these files with

cat [manifest file] | gsutil -m cp -I .

This will be much slower than downloading the tar files (grouped by species) because each file has an associated overhead.

BigQuery

IMPORTANT: The free tier of Google Cloud comes with BigQuery Sandbox with 1 TB of free processed query data each month. Repeated queries within a month could exceed this limit and if you have upgraded to a paid Cloud Billing account you may be charged.

This should be sufficient for running a number of queries on the metadata table, though the usage depends on the size of the columns queried and selected. Please look at the BigQuery pricing page for more information.

This is the user's responsibility so please ensure you keep track of your billing settings and resource usage in the console.

BigQuery provides a serverless and highly scalable analytics tool enabling SQL queries over large datasets. The metadata for the UniProt dataset takes up ​​113 GiB and so can be challenging to process and analyse locally. The table name is:

With BigQuery SQL you can do complex queries, e.g. find all high accuracy predictions for a particular species, or even join on to other datasets, e.g. to an experimental dataset by the uniprotSequence, or to the NCBI taxonomy by taxId.

If you would find additional information in the metadata useful please file a GitHub issue.

Setup

Follow the BigQuery Sandbox set up guide.

Exploring the metadata

The column names and associated data types available can be found using the following query.

SELECT column_name, data_type FROM bigquery-public-data.deepmind_alphafold.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'metadata'
Column name Data type Description
allVersions ARRAY<INT64> An array of AFDB versions this prediction has had
entryId STRING The AFDB entry ID, e.g. "AF-Q1HGU3-F1"
fractionPlddtConfident FLOAT64 Fraction of the residues in the prediction with pLDDT between 70 and 90
fractionPlddtLow FLOAT64 Fraction of the residues in the prediction with pLDDT between 50 and 70
fractionPlddtVeryHigh FLOAT64 Fraction of the residues in the prediction with pLDDT greater than 90
fractionPlddtVeryLow FLOAT64 Fraction of the residues in the prediction with pLDDT less than 50
gene STRING The name of the gene if known, e.g. "COII"
geneSynonyms ARRAY<STRING> Additional synonyms for the gene
globalMetricValue FLOAT64 The mean pLDDT of this prediction
isReferenceProteome BOOL Is this protein part of the reference proteome?
isReviewed BOOL Has this protein been reviewed, i.e. is it part of SwissProt?
latestVersion INT64 The latest AFDB version for this prediction
modelCreatedDate DATE The date of creation for this entry, e.g. "2022-06-01"
organismCommonNames ARRAY<STRING> List of common organism names
organismScientificName STRING The scientific name of the organism
organismSynonyms ARRAY<STRING> List of synonyms for the organism
proteinFullNames ARRAY<STRING> Full names of the protein
proteinShortNames ARRAY<STRING> Short names of the protein
sequenceChecksum STRING CRC64 hash of the sequence. Can be used for cheaper lookups.
sequenceVersionDate DATE Date when the sequence data was last modified in UniProt
taxId INT64 NCBI taxonomy id of the originating species
uniprotAccession STRING Uniprot accession ID
uniprotDescription STRING The name recommended by the UniProt consortium
uniprotEnd INT64 Number of the last residue in the entry relative to the UniProt entry. This is equal to the length of the protein unless we are dealing with protein fragments.
uniprotId STRING The Uniprot EntryName field
uniprotSequence STRING Amino acid sequence for this prediction
uniprotStart INT64 Number of the first residue in the entry relative to the UniProt entry. This is 1 unless we are dealing with protein fragments.

Producing summary statistics

The following query gives the mean of the prediction confidence fractions per species.

SELECT
 organismScientificName AS name,
 SUM(fractionPlddtVeryLow) / COUNT(fractionPlddtVeryLow) AS mean_plddt_very_low,
 SUM(fractionPlddtLow) / COUNT(fractionPlddtLow) AS mean_plddt_low,
 SUM(fractionPlddtConfident) / COUNT(fractionPlddtConfident) AS mean_plddt_confident,
 SUM(fractionPlddtVeryHigh) / COUNT(fractionPlddtVeryHigh) AS mean_plddt_very_high,
 COUNT(organismScientificName) AS num_predictions
FROM bigquery-public-data.deepmind_alphafold.metadata
GROUP by name
ORDER BY num_predictions DESC;

Producing lists of files

We expect that the most important use for the metadata will be to create subsets of proteins according to various criteria, so that users can choose to only copy a subset of the 214M proteins that exist in the dataset. An example query is given below:

with file_rows AS (
  with file_cols AS (
    SELECT
      CONCAT(entryID, '-model_v4.cif') as m,
      CONCAT(entryID, '-predicted_aligned_error_v4.json') as p
    FROM bigquery-public-data.deepmind_alphafold.metadata
    WHERE organismScientificName = "Homo sapiens"
      AND (fractionPlddtVeryHigh + fractionPlddtConfident) > 0.5
  )
  SELECT * FROM file_cols UNPIVOT (files for filetype in (m, p))
)
SELECT CONCAT('gs://public-datasets-deepmind-alphafold-v4/', files) as files
from file_rows

In this case, the list has been filtered to only include proteins from Homo sapiens for which over half the residues are confident or better (>70 pLDDT).

This creates a table with one column "files", where each row is the cloud location of one of the two file types that has been provided for each protein. There is an additional confidence_v4.json file which contains the per-residue pLDDT. This information is already in the CIF file but may be preferred if only this information is required.

This allows users to bulk download the exact proteins they need, without having to download the entire dataset. Other columns may also be used to select subsets of proteins, and we point the user to the BigQuery documentation to understand other ways to filter for their desired protein lists. Likewise, the documentation should be followed to download these file subsets locally, as the most appropriate approach will depend on the filesize. Note that it may be easier to download large files using Colab (e.g. pandas to_csv).

Previous versions

Previous versions of AFDB will remain available at gs://public-datasets-deepmind-alphafold to enable reproducible research. We recommend using the latest version (v4).