# uniparc_xml_parser
Process the UniParc XML file (`uniparc_all.xml.gz`) downloaded from the UniProt website into CSV files that can be loaded into a relational database.

Uncompressed XML data can be piped into `uniparc_xml_parser` to produce the CSV files:
```bash
$ curl -sS ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/uniparc_all.xml.gz \
    | zcat \
    | uniparc_xml_parser
```
The output is a set of CSV (more precisely, TSV) files:

```bash
$ ls -lhS
-rw-r--r-- 1 user group 174G Feb  9 13:52 xref.tsv
-rw-r--r-- 1 user group 149G Feb  9 13:52 domain.tsv
-rw-r--r-- 1 user group 138G Feb  9 13:52 uniparc.tsv
-rw-r--r-- 1 user group 107G Feb  9 13:52 protein_name.tsv
-rw-r--r-- 1 user group  99G Feb  9 13:52 ncbi_taxonomy_id.tsv
-rw-r--r-- 1 user group  74G Feb  9 20:13 uniparc.parquet
-rw-r--r-- 1 user group  64G Feb  9 13:52 gene_name.tsv
-rw-r--r-- 1 user group  39G Feb  9 13:52 component.tsv
-rw-r--r-- 1 user group  32G Feb  9 13:52 proteome_id.tsv
-rw-r--r-- 1 user group  15G Feb  9 13:52 ncbi_gi.tsv
-rw-r--r-- 1 user group  21M Feb  9 13:52 pdb_chain.tsv
-rw-r--r-- 1 user group  12M Feb  9 13:52 uniprot_kb_accession.tsv
-rw-r--r-- 1 user group 656K Feb  9 04:04 uniprot_kb_accession.parquet
```
The generated CSV files conform to the relational schema described in the project repository (https://gitlab.com/ostrokach/uniparc_xml_parser).
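As an illustration of loading the output into a relational database, here is a minimal sketch using the `sqlite3` command-line shell. It assumes the TSV files carry a header row; if they do not, create the tables first so that `.import` does not consume the first data row as column names:

```bash
# Hypothetical sketch: import two of the generated TSV files into SQLite.
$ sqlite3 uniparc.db <<'EOF'
.mode tabs
.import uniparc.tsv uniparc
.import xref.tsv xref
EOF

# Sanity check: count the imported cross-references.
$ sqlite3 uniparc.db 'SELECT COUNT(*) FROM xref;'
```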
Linux binaries are available here: https://gitlab.com/ostrokach/uniparc_xml_parser/-/packages.
Use `cargo` to compile and install `uniparc_xml_parser` for your target platform:

```bash
$ cargo install uniparc_xml_parser
```

Use `conda` to install precompiled binaries:

```bash
$ conda install -c ostrokach-forge uniparc_xml_parser
```
Parquet files containing the processed data are available at the following URL and are updated monthly: http://uniparc.data.proteinsolver.org/.
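As one way of exploring a downloaded file such as `uniparc.parquet` (shown in the listing above), the following sketch queries it with the DuckDB command-line shell, a separate tool not provided by this project:

```bash
# Count the rows in a locally downloaded Parquet file.
$ duckdb -c "SELECT COUNT(*) FROM 'uniparc.parquet';"
```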
The data can also be queried directly using Google BigQuery: https://console.cloud.google.com/bigquery?project=ostrokach-data&p=ostrokach-data&page=dataset&d=uniparc.
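For example, a query along the following lines could be run with the `bq` command-line tool; the table name `uniparc` within the `ostrokach-data.uniparc` dataset is an assumption, so check the dataset for the actual table names:

```bash
# Hypothetical sketch: count rows in a table of the public BigQuery dataset.
$ bq query --use_legacy_sql=false \
    'SELECT COUNT(*) FROM `ostrokach-data.uniparc.uniparc`'
```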
Parsing 10,000 XML entries takes around 30 seconds (the process is mostly IO-bound):
```bash
$ time bash -c "zcat uniparc_top_10k.xml.gz | uniparc_xml_parser >/dev/null"

real    0m33.925s
user    0m36.800s
sys     0m1.892s
```
The actual `uniparc_all.xml.gz` file has around 373,914,570 elements.
Why not split `uniparc_all.xml.gz` into multiple small files and process them in parallel?

Processing `uniparc_all.xml.gz` as a single stream makes it easier to create an incremental unique index column (e.g. `xref.xref_id`): each row is assigned the next value of a single global counter, which would be awkward to coordinate across independently processed chunks.