biorustlings

Crates.io	biorustlings
lib.rs	biorustlings
version	0.0.2
created_at	2017-11-13 05:28:52.824749+00
updated_at	2017-11-13 05:28:52.824749+00
description	Scripts for parsing UniParc XML files downloaded from the Uniprot website into CSV files.
homepage	https://kimlab.gitlab.io/biorustlings
repository	https://gitlab.com/kimlab/biorustlings
id	39195
size	58,237

Alexey Strokach (ostrokach)

documentation

https://kimlab.gitlab.io/biorustlings

Process the UniParc XML file (uniparc_all.xml.gz) downloaded from the UniProt website into CSV files that can be loaded into a relational database.

Parsing 1 million lines takes about 5.5 seconds:

$ mkdir uniparc
$ time bash -c "zcat tests/uniparc_1mil.xml.gz | uniparc_xml_parser >/dev/null"

real    0m5.564s
user    0m5.528s
sys     0m0.132s

The full uniparc_all.xml.gz file is about 5 billion lines long.
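
As a rough guide to what the benchmark implies for the full file, the measured time can be scaled linearly. This is a back-of-the-envelope sketch, not a measurement: it assumes throughput stays constant across the whole file, and the helper name `extrapolate_seconds` is hypothetical.

```rust
/// Linearly extrapolate a measured parse time to a larger input.
/// Assumes constant throughput, which may not hold for the full file.
fn extrapolate_seconds(sample_lines: u64, sample_seconds: f64, total_lines: u64) -> f64 {
    sample_seconds * (total_lines as f64 / sample_lines as f64)
}
```

Calling `extrapolate_seconds(1_000_000, 5.564, 5_000_000_000)` gives 27,820 seconds, or roughly 7.7 hours, for a single pass under that assumption.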

Splitting the file into smaller pieces first would itself require reading the entire file. If we're reading the entire file anyway, why not parse it as we read it?
Having a single process parse uniparc_all.xml.gz also makes it easier to create an incrementing unique index column for each output table (e.g. UniparcXRef.idx, Property.idx).
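
The single-pass indexing idea can be sketched as follows. This is a minimal illustration, not the crate's actual parser: matching on the line prefixes `<dbReference` and `<property` is an assumed stand-in for real XML parsing, and the function name `index_rows` and the CSV column layout are hypothetical.

```rust
use std::io::{self, BufRead, Write};

/// Read the input once, writing one CSV row per matching line.
/// Because a single pass owns both counters, every row gets a unique,
/// consecutive idx without any coordination between workers.
/// Returns the number of xref rows and property rows written.
fn index_rows<R: BufRead, X: Write, P: Write>(
    input: R,
    xrefs: &mut X,
    props: &mut P,
) -> io::Result<(u64, u64)> {
    let mut next_xref_idx: u64 = 0;
    let mut next_prop_idx: u64 = 0;
    // idx of the most recent xref row, used to link properties to it.
    let mut current_xref: Option<u64> = None;

    for (i, line) in input.lines().enumerate() {
        let line = line?;
        let line_no = i + 1; // 1-based source line number
        let tag = line.trim_start();
        if tag.starts_with("<dbReference") {
            // Columns: idx, source line
            writeln!(xrefs, "{},{}", next_xref_idx, line_no)?;
            current_xref = Some(next_xref_idx);
            next_xref_idx += 1;
        } else if tag.starts_with("<property") {
            // Properties seen before any xref are skipped in this sketch.
            if let Some(xref_idx) = current_xref {
                // Columns: idx, parent xref idx, source line
                writeln!(props, "{},{},{}", next_prop_idx, xref_idx, line_no)?;
                next_prop_idx += 1;
            }
        }
    }
    Ok((next_xref_idx, next_prop_idx))
}
```

If the file were split and parsed by several processes instead, each worker would need to know how many rows the earlier chunks produced before it could assign its own idx values, which is the coordination a single pass avoids.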
