# UniParc XML parser

[![docs](https://img.shields.io/badge/docs-v0.2.0-blue.svg)](https://ostrokach.gitlab.io/uniparc_xml_parser/v0.2.0/)
[![conda](https://img.shields.io/conda/pn/ostrokach-forge/uniparc_xml_parser)](https://anaconda.org/ostrokach-forge/uniparc_xml_parser/)
[![pipeline status](https://gitlab.com/ostrokach/uniparc_xml_parser/badges/v0.2.0/pipeline.svg)](https://gitlab.com/ostrokach/uniparc_xml_parser/commits/v0.2.0/)

- [Introduction](#introduction)
- [Usage](#usage)
- [Table schema](#table-schema)
- [Installation](#installation)
  - [Binaries](#binaries)
  - [Cargo](#cargo)
  - [Conda](#conda)
- [Output files](#output-files)
  - [Parquet](#parquet)
  - [Google BigQuery](#google-bigquery)
- [Benchmarks](#benchmarks)
- [Roadmap](#roadmap)
- [FAQ (Frequently Asked Questions)](#faq-frequently-asked-questions)
- [FUQ (Frequently Used Queries)](#fuq-frequently-used-queries)

## Introduction

Process the UniParc XML file (`uniparc_all.xml.gz`) downloaded from the UniProt [website](http://www.uniprot.org/downloads) into CSV files that can be loaded into a relational database.
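
As a rough illustration of the "loaded into a relational database" step, the sketch below copies one tab-separated file into a SQLite table using only the Python standard library. It is not part of `uniparc_xml_parser`. The `load_tsv` helper is hypothetical, and the code assumes that the file starts with a header row of column names; if the generated files have no header, the column names would have to be supplied from the table schema instead. All columns are stored as `TEXT` for simplicity.

```python
import csv
import sqlite3


def load_tsv(conn, tsv_path, table):
    """Load a tab-separated file into a SQLite table (hypothetical helper)."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data, split on tabs only.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)  # assumption: the first row holds the column names
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        # Rows are streamed from the reader, so the file is never held in memory.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
```

A call such as `load_tsv(sqlite3.connect("uniparc.db"), "pdb_chain.tsv", "pdb_chain")` would then populate one table (the database file name here is only an example). For the larger files, a server database with a bulk-load command is likely a better fit than row-by-row inserts.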

## Usage

Uncompressed XML data can be piped into `uniparc_xml_parser` for parsing:

```bash
$ curl -sS ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/uniparc_all.xml.gz \
    | zcat \
    | uniparc_xml_parser
```

The output is a set of CSV (or more specifically TSV) files:

```bash
$ ls
-rw-r--r-- 1 user group 174G Feb  9 13:52 xref.tsv
-rw-r--r-- 1 user group 149G Feb  9 13:52 domain.tsv
-rw-r--r-- 1 user group 138G Feb  9 13:52 uniparc.tsv
-rw-r--r-- 1 user group 107G Feb  9 13:52 protein_name.tsv
-rw-r--r-- 1 user group  99G Feb  9 13:52 ncbi_taxonomy_id.tsv
-rw-r--r-- 1 user group  74G Feb  9 20:13 uniparc.parquet
-rw-r--r-- 1 user group  64G Feb  9 13:52 gene_name.tsv
-rw-r--r-- 1 user group  39G Feb  9 13:52 component.tsv
-rw-r--r-- 1 user group  32G Feb  9 13:52 proteome_id.tsv
-rw-r--r-- 1 user group  15G Feb  9 13:52 ncbi_gi.tsv
-rw-r--r-- 1 user group  21M Feb  9 13:52 pdb_chain.tsv
-rw-r--r-- 1 user group  12M Feb  9 13:52 uniprot_kb_accession.tsv
-rw-r--r-- 1 user group 656K Feb  9 04:04 uniprot_kb_accession.parquet
```

## Table schema

The generated CSV files conform to the following schema: