## Example datasets

These example GFA files have been generated from <a href="https://github.com/maickrau/MBG">MBG</a> on readsets generated by the Darwin Tree of Life project.

We will go through an example or two of how `gfatk` can be used to interrogate a GFA file and find out some interesting things, including getting a linearised assembly. We will be using the latest version of `gfatk`, version 0.2.16. Grab a copy, either using the releases, or compile yourself.

```bash
# latest release from crates.io
cargo install gfatk
```

### The Tree of Heaven

*Ailanthus altissima*, whilst not a native plant to Britain and Ireland, has made its way into the DToL datasets. It is a reasonably common weedy tree, at least in the south of England, and is planted widely. MBG (with parameters k = 5001, a = 5, w = 250, and u = 150), was able to produce an assembly GFA. Let's take a look at it.

![alt text](./drAilAlti1_k=5001a=5w=250u=150.png)

This is a nice bandage plot which shows also relative coverage as the colours (low = black, high = red). There are two bigger subgraphs, each with six nodes and one is at much higher coverage than the other. There's also some shrapnel, which we won't consider further. We fed MBG the raw reads, and because of the parameters we used, we have assembled those structures which are at higher coverage than the nuclear genome. So we suspect that the mitochondrial genome and the chloroplast genome are present in this GFA.

Firstly, we can take a look at some statistics from the GFA file. `gfatk stats` can be used for this.

```bash
# run gfatk stats with the tabular output option.
gfatk stats -t ./drAilAlti1_k=5001a=5w=250u=150.gfa
```

And the output of this command is below:

| subgraph_index | gc         | node_count | edge_count | coverage  | segments       | total_seq_len | is_circular |
| -------------- | ---------- | ---------- | ---------- | --------- | -------------- | ------------- | ----------- |
| 2              | 0.43428758 | 6          | 16         | 242.35101 | 3,4,7,14,15,18 | 706976        | true        |
| 1              | 0.38155055 | 6          | 16         | 4610.6597 | 2,5,8,10,20,21 | 190041        | true        |
| 0              | 0.5853151  | 3          | 8          | 1091.1666 | 1,9,16         | 36609         | true        |
| 3              | 0.58213186 | 1          | 2          | 298.333   | 6              | 16577         | true        |
| 7              | 0.5867449  | 1          | 2          | 496.333   | 17             | 16537         | true        |
| 4              | 0.57701683 | 1          | 2          | 773.667   | 11             | 16412         | true        |
| 8              | 0.5879017  | 1          | 2          | 406       | 19             | 16399         | true        |
| 5              | 0.5853971  | 1          | 2          | 613.333   | 12             | 16394         | true        |
| 6              | 0.42857143 | 1          | 2          | 1333      | 13             | 11669         | true        |
| 9              | 0.5870257  | 1          | 0          | 236.5     | 22             | 8170          | false       |

The output is grouped by subgraph, as this is the natural unit in a GFA. The subgraphs are given an integer name starting from zero (this is arbitrary, but it's the order in which they were generated). I've ordered the TSV above by total sequence length, so we can see the largest subgraphs first. Not surprisingly, we see the top row is the statistical output of the largest graph in the bandage plot above. Two things to note are the GC% and the coverage. The GC% is ~0.43, and the coverage is ~242. Plant mitochondria generally have GC% in the range 0.41-0.5, and are of lower coverage than the chloroplast. We can use this information to extract the mitochondria from this larger GFA.

```bash
# extract the mitochondrion from the GFA.
gfatk extract-mito drAilAlti1_k=5001a=5w=250u=150.gfa > ./drAilAlti1_k=5001a=5w=250u=150_extract_mito.gfa
# we can also check the stats directly from an extraction
# by piping the output to `gfatk stats`
gfatk extract-mito drAilAlti1_k=5001a=5w=250u=150.gfa | gfatk stats -t
```

Under the hood, `gfatk extract-mito` uses `gfatk stats` to make a decision on which subgraph is a mitochondrion. Now that we have a putative mitochondrion, we can use `gfatk linear` to get a linearised assembly.

```bash
# `-e` is a flag to evaluate subgraphs (so if we had multiple subgraphs in the same file, the linearisation algorithm would be applied to every subgraph). Redundant in this case, but gives more info in the final fasta headers.
# `-i` includes node coverage in the linearisation algorithm. This means if some nodes were present at relative double coverage, the final path would try and include this node twice.
gfatk linear -e -i drAilAlti1_k=5001a=5w=250u=150_extract_mito.gfa > drAilAlti1_k=5001a=5w=250u=150_extract_mito.fa
```

You will note if you run this, that two lines will be printed to the STDERR.

```txt
[+]     Highest cumulative coverage path = 1142
[+]     Chosen path through graph: 4-,14-,4-,15+,18-,3+,18+,7-
```

If you follow that chosen path through the graph, each of the big loops get included twice, as their coverage (~400x) is roughly double that of the four smaller segments (~200x). Beautiful.

The whole pipeline can be run in a single line:

```bash
gfatk extract-mito drAilAlti1_k=5001a=5w=250u=150.gfa | gfatk linear -e -i > drAilAlti1_k=5001a=5w=250u=150_extract_mito.fa
```

But what about the chloroplast? We also have functionality to extract this too. If we run through the above:

```bash
# yep, extract-chloro
gfatk extract-chloro ./drAilAlti1_k=5001a=5w=250u=150.gfa | gfatk linear -e -i
```

We get a warning!

```
[-]     Recursion depth limit (1000) exceeded in permutation 30. Switching to default path finder.
[+]     Highest cumulative coverage path = 16456
[+]     Chosen path through graph: 20-,21-,10-,5-,8+
```

This is because segment 2 has coverage ~200x, whilst the other segments are ~4000x +. This means segment two would need to appear twenty times in the linearised assembly, which is too many possible paths to compute. We can however, save this by removing the `-i` flag.

```bash
gfatk extract-chloro ./drAilAlti1_k=5001a=5w=250u=150.gfa | gfatk linear -e > drAilAlti1_k=5001a=5w=250u=150_extract_chloro.fa
```

Now we get this printed to the STDERR:

```
[+]     Highest cumulative coverage path = 16456
[+]     Chosen path through graph: 8-,5+,10+,21+,20+
```

And we notice that segment two no longer appears in the longest path. It is instead appended as an extra record in the fasta file at the end. We can check this quickly with `grep`.

```bash
grep "^>" drAilAlti1_k=5001a=5w=250u=150_extract_chloro.fa
```

Returning:

```
>gfatk_linear:path=8-,5+,10+,21+,20+:coverage=16456 subgraph-1:is_circular-true
>2 subgraph-1:is_circular-true
```

And from these fasta headers, we can see that both of these fasta records came from the same subgraph, subgraph 1.

All of this functionality in `gfatk` is the basis of the <a href="https://github.com/tolkit/plant-organellome-assembly">plant organelle assembly pipeline</a>.

### Other `gfatk` functionality

The above illustrates most of the functionality of `gfatk`, however there are a few more tools which can be elaborated on here.

#### `gfatk fasta`

If you want to print the segments out in fasta format, this is the one for you. For example:

```bash
gfatk fasta ./drAilAlti1_k=5001a=5w=250u=150.gfa | grep "^>"
# lists segments 1-22, nothing fancy going on here.

# you could do the same thing with awk
# awk '/^S/{print ">"$2"\n"$3}'
# or gfatools
# ./gfatools gfa2fa ...
```

#### `gfatk path`

Sometimes, a path may need to be manually specified. For example, in the GFA at the beginning, we see the chloroplast. This could not be linearised including node coverage due to an extra segment (segment 2) being present. We could omit this, and specify our own path:

```bash
gfatk path ./drAilAlti1_k=5001a=5w=250u=150_extract_chloro.gfa 20+,21-,10-,5-,8+,21+
```

Of course, we could specify any subpath:

```bash
gfatk path ./drAilAlti1_k=5001a=5w=250u=150_extract_chloro.gfa 20+,21-,10-,5-
```

In any orientation that is legal:

```bash
# e.g.
gfatk path ./drAilAlti1_k=5001a=5w=250u=150_extract_chloro.gfa 5+,10+,21+
```

#### `gfatk extract`

Sometimes the automatic subgraph extractions by `gfatk extract-chloro/mito` will fail. `gfatk extract` provides a way of manually extracting the subgraph we are interested in. We can only really extract subgraphs, extracting individual nodes seemed less useful.

For example:

```bash
# extract the mito
gfatk extract -s 4 ./drAilAlti1_k=5001a=5w=250u=150.gfa
# extract the chloro
gfatk extract -s 20 ./drAilAlti1_k=5001a=5w=250u=150.gfa
# extract both and remove shrapnel
gfatk extract -s 4,20 ./drAilAlti1_k=5001a=5w=250u=150.gfa
# for large graphs, the iterations of searching may need to be
# increased
gfatk extract -i 30 -s 20 ./drAilAlti1_k=5001a=5w=250u=150.gfa 

# gfatools also supports subgraph extraction
# ./gfatools view -l <id> -r 1 <gfa> > out.gfa
# but this deletes the CIGAR overlaps...
```