# gfatk
A command line utility to explore, extract, and linearise plant mitochondrial assemblies. The Graphical Fragment Assembly (GFA) files used to refine the code in this repository were almost exclusively generated by the assembly program ``MBG``. See the testing section below for caveats.
## Install
Grab a binary from the releases (Mac & Linux only):
```bash
# for mac
curl -L "https://github.com/tolkit/gfatk/releases/download/0.2.2/gfatk_mac_0.2.2" > gfatk && chmod +x gfatk
# and linux (ubuntu)
curl -L "https://github.com/tolkit/gfatk/releases/download/0.2.2/gfatk_ubuntu_0.2.2" > gfatk && chmod +x gfatk
```
Or build from source.
```bash
# e.g. get rustup!
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# get directly from crates.io
# currently this is the latest available version 0.2.3XX
cargo install gfatk
# or clone this repo!
git clone https://github.com/tolkit/gfatk
# cd!
cd gfatk
# build!
cargo build --release
# or install into your path!
cargo install --path .
```
## Features
The features of the toolkit reflect only their usefulness in debugging, visualising, and linearising GFAs from (especially) plant mitochondrial genome assemblies output by ``MBG``. These genomes are usually pretty small (up to 2Mb), and in many cases have circular or branching paths.
Current help:
```
Explore and linearise (plant organellar) GFA files.

Usage: gfatk [COMMAND]

Commands:
  overlap         Extract overlaps from a GFA.
  extract         Extract subgraph from a GFA, given a segment name.
  linear          Force a linear representation of the graph.
  fasta           Extract a fasta file.
                  Almost as simple as: awk '/^S/{print ">"$2"\n"$3}'.
  stats           Some stats about the input GFA.
  extract-mito    Extract the mitochondria from a GFA.
  extract-chloro  Extract the plastid from a GFA.
  dot             Return the dot representation of a GFA.
  trim            Trim a GFA to remove nodes of degree < 4 (i.e. only has one neighbour).
  path            Supply an input path to evaluate a linear representation of.
                  Input must be a text file of a single comma separated line with node ID's and orientations. E.g. 1+,2-,3+
  rename          Rename the segment ID's of a GFA.
  help            Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
```
To explain each of these briefly:
- `gfatk dot ` - generates a ``DOT language`` representation of the GFA.
- `gfatk extract -s -i ` - extracts the subgraph from the GFA, given one segment name or several (several names must be comma separated, without spaces). The number of iterations may need to be increased for large graphs.
- `gfatk extract-chloro ` - extracts the plastid from the GFA. It has default parameters which seem to work okay.
- `gfatk extract-mito ` - extracts the mitochondria from the GFA. It has default parameters which seem to work okay.
- `gfatk fasta ` - extracts a fasta file from the GFA. This simply prints each of the segments from the GFA. I say it's almost as simple as the `awk` version, but the toolkit does some checks to see if we are actually dealing with a GFA or not.
- `gfatk linear -e -i -n ` - forces the longest linear legal representation of the graph. You can evaluate within subgraphs (`-e`), or include node coverage information (`-i`).
- `gfatk overlap -s ` - extracts the overlaps from the GFA. These are taken from the CIGAR string from each of the links, and optionally extended (e.g. `-s 1000` to 1000bp either side of the overlap).
- `gfatk path (-p path/to/path.txt)` - evaluates a linear representation of the graph, given an input path. The input path can be given on the command line or in a file. It must be a comma-separated list of node IDs and orientations (1+,2-,3+ ... ).
- `gfatk stats -t` - some stats about the input GFA. Can be quite verbose for large, unconnected graphs. `-t` outputs tabular data (TSV).
- `gfatk trim ` - removes segments if they have only a single neighbour. Useful for trimming GFAs which have segments attached at low coverage.
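As a rough illustration of the path format that `gfatk path` expects (`1+,2-,3+`), here is a minimal Rust sketch of parsing such a line. The names `Orientation` and `parse_path` are made up for this example and are not taken from the gfatk source; the actual implementation may well differ.

```rust
/// Orientation of a segment within a path.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Orientation {
    Forward,
    Reverse,
}

/// Parse a path such as `1+,2-,3+` into (segment ID, orientation) pairs.
/// Hypothetical helper for illustration only; not part of gfatk's API.
fn parse_path(path: &str) -> Result<Vec<(String, Orientation)>, String> {
    path.trim() // tolerate a trailing newline when the path comes from a file
        .split(',')
        .map(|step| -> Result<(String, Orientation), String> {
            // The orientation is the final character; the rest is the ID.
            let (id, orientation) = if let Some(id) = step.strip_suffix('+') {
                (id, Orientation::Forward)
            } else if let Some(id) = step.strip_suffix('-') {
                (id, Orientation::Reverse)
            } else {
                return Err(format!("step '{}' has no +/- orientation", step));
            };
            if id.is_empty() {
                return Err(format!("step '{}' has no segment ID", step));
            }
            Ok((id.to_string(), orientation))
        })
        .collect()
}
```

For instance, `parse_path("1+,2-,3+")` gives three steps: segment `1` forward, `2` reverse, and `3` forward. A step without a trailing `+` or `-` is reported as an error rather than guessed at.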
These are not all the options for each subcommand. Run `gfatk help` followed by the subcommand name for more information on each subcommand.
Many of these commands can be chained in a pipeline, e.g. `gfatk extract-chloro in.gfa | gfatk linear > out.fa`.
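To make the `awk` comparison for `gfatk fasta` concrete, the sketch below does roughly the same job in Rust: it writes every segment (`S`) line as a FASTA record and rejects input that has no segments at all. The function name `gfa_segments_to_fasta`, the tab splitting, and the "no segments" check are assumptions for illustration; they are not the checks gfatk actually performs.

```rust
/// Convert the segment (`S`) lines of GFA text into FASTA records.
/// Hypothetical helper for illustration only; not part of gfatk's API.
fn gfa_segments_to_fasta(gfa: &str) -> Result<String, String> {
    let mut fasta = String::new();
    for (idx, line) in gfa.lines().enumerate() {
        let mut fields = line.split('\t');
        // Only segment lines carry sequence; skip H, L, P and anything else.
        if fields.next() != Some("S") {
            continue;
        }
        let name = fields
            .next()
            .ok_or_else(|| format!("line {}: segment has no name", idx + 1))?;
        let seq = fields
            .next()
            .ok_or_else(|| format!("line {}: segment has no sequence", idx + 1))?;
        fasta.push('>');
        fasta.push_str(name);
        fasta.push('\n');
        fasta.push_str(seq);
        fasta.push('\n');
    }
    // Assumed sanity check: input without any segment is treated as "not a GFA".
    if fasta.is_empty() {
        return Err("no segment lines found; is this a GFA?".to_string());
    }
    Ok(fasta)
}
```

Unlike the `awk` one-liner, which splits on any whitespace, this version splits strictly on tabs and returns an error instead of silently printing nothing.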
## Examples and docs
A couple of more detailed examples can be seen in the `examples` directory, where there is a `README.md` file. To view the auto-generated documentation of the binary itself, including details of all underlying functions, see:
API documentation
## Requirements and testing
Some unit tests are now provided in the `tests` directory. To run these (you'll need Rust):
```bash
cargo test --release
```
For full functionality of the toolkit, two tags are required: node coverage (`ll:f:`) and edge coverage (`ec:i:`). Other functionality will fail if the CIGAR string is not purely an overlap, i.e. an integer followed by `M`, such as `4M`. Only GFA version 1 is supported. Only header (`H`), segment (`S`), and link (`L`) lines are required. `P` lines are used in `gfatk path --all`.
```
H VN:Z:1.0
S 11 ACCTT ll:f:30.0 <- this tag indicates node/segment coverage (here it's 30.0)
S 12 TCAAGG ll:f:60.0
S 13 CTTGATT ll:f:30.0
L 11 + 12 - 4M ec:i:1 <- this tag indicates edge coverage (here it's 1)
L 12 - 13 + 5M ec:i:1
L 11 + 13 + 3M ec:i:1
L 12 + 11 - 4M ec:i:1
L 13 - 12 + 5M ec:i:1
L 13 - 11 - 3M ec:i:1 <- simple overlap on the CIGAR string (overlap == 3)
```
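The requirements above can be checked with very little code. The following Rust sketch pulls the `ll:f:` and `ec:i:` tags out of segment and link lines, and accepts a CIGAR string only if it is a pure overlap such as `4M`. It assumes tab-delimited fields, as in the GFA1 specification, and lines without the `<-` annotations used in the example above. The function names (`parse_overlap`, `tag_value`, `node_coverage`, `edge_coverage_and_overlap`) are hypothetical and not gfatk's internals.

```rust
/// Return the overlap length if `cigar` is a pure overlap such as `4M`.
fn parse_overlap(cigar: &str) -> Option<usize> {
    let digits = cigar.strip_suffix('M')?;
    // Reject an empty count and anything containing further CIGAR operations.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse::<usize>().ok()
}

/// Look up the value of an optional tag by its prefix, e.g. `ll:f:` or `ec:i:`.
fn tag_value<'a>(line: &'a str, prefix: &str) -> Option<&'a str> {
    line.split('\t').find_map(|field| field.strip_prefix(prefix))
}

/// Node coverage from the `ll:f:` tag of a segment line.
fn node_coverage(segment_line: &str) -> Option<f64> {
    tag_value(segment_line, "ll:f:")?.parse::<f64>().ok()
}

/// Edge coverage (`ec:i:`) and overlap length from a link line.
/// Returns `None` if the line is not a link, a tag is missing,
/// or the CIGAR string is not a pure overlap.
fn edge_coverage_and_overlap(link_line: &str) -> Option<(u64, usize)> {
    let fields: Vec<&str> = link_line.split('\t').collect();
    if fields.first() != Some(&"L") {
        return None;
    }
    // In GFA1 link lines the overlap CIGAR is the sixth field.
    let overlap = parse_overlap(fields.get(5)?)?;
    let coverage = tag_value(link_line, "ec:i:")?.parse::<u64>().ok()?;
    Some((coverage, overlap))
}
```

Run on the first link of the example, `edge_coverage_and_overlap` would report an edge coverage of 1 and an overlap of 4. A link whose CIGAR is `*` or contains insertions or deletions yields `None`, which mirrors the failure mode described above without claiming to reproduce gfatk's exact behaviour.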
## Thanks
Many thanks to the developers of MBG, and partners in the Tree of Life program, and beyond:
- Marcela Uliano-Silva
- Sergey Nurk
- Alex Twyford
- Lucia Campos
- Chenxi Zhou
- Mark Blaxter