| Crates.io | biorustlings |
| lib.rs | biorustlings |
| version | 0.0.2 |
| created_at | 2017-11-13 05:28:52.824749+00 |
| updated_at | 2017-11-13 05:28:52.824749+00 |
| description | Scripts for parsing UniParc XML files downloaded from the Uniprot website into CSV files. |
| homepage | https://kimlab.gitlab.io/biorustlings |
| repository | https://gitlab.com/kimlab/biorustlings |
| id | 39195 |
| size | 58,237 |
Process the UniParc XML file (uniparc_all.xml.gz) downloaded from the UniProt website into CSV files that can be loaded into a relational database.
Parsing 1 million lines takes about 5.5 seconds:
$ mkdir uniparc
$ time bash -c "zcat tests/uniparc_1mil.xml.gz | uniparc_xml_parser >/dev/null"
real 0m5.564s
user 0m5.528s
sys 0m0.132s
The actual uniparc_all.xml.gz file contains about 5 billion lines.
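As a rough back-of-the-envelope check, the benchmark above can be extrapolated to the full file. The sketch below assumes that parse time scales linearly with input size, which the benchmark alone does not establish. The function name `estimate_parse_seconds` is hypothetical and is not part of the crate.

```rust
/// Extrapolate total parse time from a benchmark sample.
/// Assumption: parse time grows linearly with the number of input lines.
fn estimate_parse_seconds(total_lines: f64, sample_lines: f64, sample_seconds: f64) -> f64 {
    // Scale the measured time by the ratio of full size to sample size.
    total_lines / sample_lines * sample_seconds
}
```

With the figures quoted above (5.564 s for 1 million lines, and about 5 billion in the full file), `estimate_parse_seconds(5e9, 1e6, 5.564)` gives roughly 27,820 seconds, or about 7.7 hours, under that linear-scaling assumption.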
Why not split uniparc_all.xml.gz into multiple small files and process them in parallel?
Parsing uniparc_all.xml.gz in a single process makes it easier to create an incremental unique index column (e.g. UniparcXRef.idx, Property.idx, etc.).
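To illustrate why a single sequential pass simplifies index generation, here is a minimal sketch. The `IndexCounter` type, the `assign_indices` helper, and the record values are hypothetical and are not taken from the crate. Only the idea of an incremental index column such as UniparcXRef.idx comes from the text above. With one process, a plain counter per output table yields unique, gap-free values. Parallel workers would instead have to coordinate ID ranges or renumber rows afterwards.

```rust
/// Hypothetical per-table counter that hands out consecutive index values.
struct IndexCounter {
    next: u64,
}

impl IndexCounter {
    fn new() -> Self {
        IndexCounter { next: 0 }
    }

    /// Return the next unused index and advance the counter.
    fn next_idx(&mut self) -> u64 {
        let idx = self.next;
        self.next += 1;
        idx
    }
}

/// Pair each record with an incremental index, in the order the records are read.
fn assign_indices(records: &[&str], counter: &mut IndexCounter) -> Vec<(u64, String)> {
    records
        .iter()
        .map(|r| (counter.next_idx(), r.to_string()))
        .collect()
}
```

Calling `assign_indices` repeatedly with the same counter continues the numbering across batches, so rows written at any point in the stream never collide with earlier ones.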