Crates.io | fusta |
lib.rs | fusta |
version | 1.7.1 |
source | src |
created_at | 2022-08-16 14:17:07.888935 |
updated_at | 2023-01-23 22:43:56.61919 |
description | FUSTA leverages the FUSE interface to transparently manipulate multiFASTA files as independent files |
homepage | https://github.com/delehef/fusta |
repository | https://github.com/delehef/fusta |
id | 646689 |
size | 148,394 |
[[file:fusta.png]]
The virtual files exposed by FUSTA behave like standard flat text files, and provide automatic compatibility with all existing programs. When handling large multiFASTA files, the intrinsic file caching capacities of the OS are leveraged to ensure the best experience to the user.
** Citation
If you use FUSTA, please cite [[https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac091/6851693][FUSTA: leveraging FUSE for manipulation of multiFASTA files at scale]], https://doi.org/10.1093/bioadv/vbac091
** Licensing
FUSTA is distributed under the CeCILL-C (LGPLv3 compatible) license. Please see the LICENSE file for details.
You can now find =fusta= in =$HOME/.cargo/bin/=; you should add this path to your =$PATH= for easier use.
** Fedora/Rocky Linux/Alma Linux
#+begin_src
sudo yum install rust cargo fuse3 fuse3-devel
cargo install --git https://github.com/delehef/fusta
#+end_src
You can now find =fusta= in =$HOME/.cargo/bin/=; you should add this path to your =$PATH= for easier use.
** Debian
#+begin_src bash
sudo apt install curl fuse3 libfuse3-dev
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # Debian cargo is outdated
#+end_src
Then restart your shell to update the =PATH= environment variable.
Finally, install FUSTA:
#+begin_src bash
cargo install --git https://github.com/delehef/fusta
#+end_src
You can now find =fusta= in =$HOME/.cargo/bin/=; you should add this path to your =$PATH= for easier use.
** Scientific Linux
#+begin_src bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo yum install fuse3 fuse3-devel
#+end_src
Then restart your shell to update the =PATH= environment variable.
Finally, install FUSTA:
#+begin_src
cargo install --git https://github.com/delehef/fusta
#+end_src
You can now find =fusta= in =$HOME/.cargo/bin/=; you should add this path to your =$PATH= for easier use.
** macOS
On macOS, you will need to install the build tools if you do not have them yet: =xcode-select --install=
You must then download and install [[https://osxfuse.github.io/][FUSE for macOS]] in order to be able to use FUSTA.
Finally, to build FUSTA, you need to install the [[https://www.rust-lang.org/en-US/install.html][Rust compiler]]. You can then build FUSTA by running =cargo=, the Rust build tool:
#+begin_src
cargo install --git https://github.com/delehef/fusta
#+end_src
** FreeBSD
#+begin_src bash
sudo pkg install rust pkgconf fusefs-libs # install build dependencies
sudo sysctl vfs.usermount=1 # enable FUSE mounting without requiring administrator permissions
sudo kldload fuse # load the FUSE kernel module
#+end_src
Finally, install FUSTA:
#+begin_src
cargo install --git https://github.com/delehef/fusta
#+end_src
You can now find =fusta= in =$HOME/.cargo/bin/=; you should add this path to your =$PATH= for easier use.
** From Sources
You should install FUSE (as well as its =devel= package, if there is one) from your package manager. Note that a reboot might be necessary for the kernel module to be loaded.
To build FUSTA, you need to install the [[https://www.rust-lang.org/en-US/install.html][Rust compiler]]. You can then build FUSTA by running =cargo=, the Rust build tool:
#+begin_src
cargo install --git https://github.com/delehef/fusta
#+end_src
You can now find =fusta= in =$HOME/.cargo/bin/=; you should add this path to your =$PATH= for easier use.
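One way to add Cargo's binary directory to the search path is sketched below. It assumes a POSIX-compatible shell and Cargo's default install root, =$HOME/.cargo/bin=; adjust the path if =CARGO_HOME= or =--root= was customised. To make the change persistent, the same lines can go in a shell startup file such as =~/.profile=.

#+begin_src shell
# Prepend Cargo's default binary directory to PATH, unless it is already there.
case ":$PATH:" in
  *":$HOME/.cargo/bin:"*) ;;                    # already present, nothing to do
  *) export PATH="$HOME/.cargo/bin:$PATH" ;;    # assumed default install location
esac
#+end_src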
#+begin_src
fusta file.fa
tree -h fusta/
fusermount -u fusta
#+end_src
** Description
Once started, =fusta= will expose the content of a FASTA file in a way that makes it usable by any piece of software as if it were a set of independent files, detailed as follows.
For instance, here is the virtual hierarchy created by =fusta= after mounting a FASTA file containing the /A. thaliana/ genome:
#+begin_src
fusta
├── append
├── fasta
│   ├── 1.fa
│   ├── 2.fa
│   ├── 3.fa
│   ├── 4.fa
│   ├── 5.fa
│   ├── Mt.fa
│   └── Pt.fa
├── get
├── infos.csv
├── infos.txt
├── labels.txt
└── seqs
    ├── 1.seq
    ├── 2.seq
    ├── 3.seq
    ├── 4.seq
    ├── 5.seq
    ├── Mt.seq
    └── Pt.seq
#+end_src
FUSTA supports all FASTA files using UNIX-style line endings, including but not restricted to DNA files, protein files, gapped files, and mixed-case files, independently of their inner formatting (line wrapping, line length, /etc./).
*** =infos.csv=
This read-only CSV file lists all the fragments present in the mounted FASTA file. For each of them, it gives the standard =id= and =additional informations= fields, plus a third field containing the length of the sequence.
*** =infos.txt=
This read-only text file provides the same information, but in a more human-readable format.
*** =labels.txt=
This read-only file contains a list of all the sequence headers present in the mounted FASTA file.
*** =fasta=
This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read-only FASTA files.
*** =seqs=
This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read/write files containing only the sequences, without the FASTA headers but with any newline preserved. These files can be read, copied, removed, edited, etc. as normal files, and any alteration will be reflected in the original FASTA file when =fusta= is closed.
*** =append=
This folder should be used to add new sequences to the mounted FASTA file. Any valid FASTA file copied or moved to this folder will be appended to the original FASTA file. Note that the process is completely transparent: the folder will remain empty even though the operation is successful.
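Since only valid FASTA files should be copied here, a quick sanity check beforehand can help. The sketch below is a rough heuristic, not FUSTA's own validation: =looks_like_fasta= is a hypothetical helper that only checks that the first non-blank line is a =>= header and that at least one sequence line follows.

#+begin_src shell
# Hypothetical helper (not part of FUSTA): succeed if FILE starts with a '>'
# header and contains at least one non-header line.
looks_like_fasta() {  # usage: looks_like_fasta FILE
  awk '
    NF == 0    { next }                                   # skip blank lines
    !seen      { if ($0 !~ /^>/) exit 1; seen = 1; next } # first line must be a header
    $0 !~ /^>/ { body = 1 }                               # found a sequence line
    END        { exit !(seen && body) }
  ' "$1"
}

# Possible use, with an illustrative file name:
#   looks_like_fasta more_sequences.fa && cp more_sequences.fa fusta/append
#+end_src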
*** =get=
This folder is used for range access to the sequences in the mounted FASTA file. Although it is empty, any read access to a (non-existing) file following the pattern =SEQID:START-END= will return the corresponding range (1-indexed, fully closed) in the specified sequence. Note that the access skips headers and newlines, so that the =START-END= coordinates map to actual loci in the corresponding sequence and not to bytes in the mounted FASTA file.
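To make the coordinate convention concrete, the sketch below reproduces it on a plain FASTA file with =awk=, independently of FUSTA. =fasta_range= is a hypothetical helper written for illustration only; it assumes the sequence ID is the first whitespace-delimited word of the header, and it loads the whole sequence in memory, so it is only suited to small files.

#+begin_src shell
# Hypothetical helper (not part of FUSTA): print bases START..END
# (1-indexed, both ends included) of sequence SEQID, ignoring the
# header line and any line wrapping.
fasta_range() {  # usage: fasta_range FILE SEQID START END
  awk -v id="$2" -v s="$3" -v e="$4" '
    /^>/ { keep = (substr($1, 2) == id); next }  # header: is this the wanted sequence?
    keep { seq = seq $0 }                        # join wrapped lines into one string
    END  { print substr(seq, s, e - s + 1) }
  ' "$1"
}

# Example on a small sequence wrapped at 4 bases per line
demo=$(mktemp)
printf '>chr1 demo\nACGT\nTTGA\n' > "$demo"
fasta_range "$demo" chr1 3 6   # prints GTTT: bases 3 to 6 of ACGTTTGA
rm -f "$demo"
#+end_src

The newline between =ACGT= and =TTGA= does not count as a position, which is the same convention as the one described for =get=.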
** Examples
All the following examples assume that a FASTA file has been mounted (/e.g./ =fusta -D genome.fa=), and is unmounted after manipulation (/e.g./ =fusermount -u fusta=).
*** Get an overview of the file content
#+begin_src shell
cat fusta/infos.txt
#+end_src
*** Extract individual sequences as FASTA files
#+begin_src shell
cat fusta/fasta/chr{X,Y}.fa > ~/sex_chrs.fa
#+end_src
*** Extract a range of chromosome 12
#+begin_src shell
cat fusta/get/chr12:12000000-12002000
#+end_src
*** Remove sequences from the original file
#+begin_src shell
rm fusta/seqs/chr{3,5}.seq
#+end_src
*** Add a new sequence
#+begin_src shell
cp more_sequences.fa fusta/append
#+end_src
*** Upcasing a sequence
#+begin_src shell
sed 's/[a-z]/\U&/g' fusta/seqs/chr21.seq | sponge fusta/seqs/chr21.seq
#+end_src
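=sponge= (from moreutils) is used above because a plain =>= redirection to the input file would truncate it before =sed= reads it. Where =sponge= is not installed, a temporary file can play the same role. The sketch below is shown on an ordinary file and assumes the target can be overwritten through a plain redirection; =upcase_in_place= is a hypothetical helper.

#+begin_src shell
# Hypothetical helper: upper-case FILE in place without sponge.
upcase_in_place() {  # usage: upcase_in_place FILE
  tmp=$(mktemp) || return 1
  tr 'a-z' 'A-Z' < "$1" > "$tmp" &&   # write the converted copy first
    cat "$tmp" > "$1"                 # then overwrite the original
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Possible use: upcase_in_place fusta/seqs/chr21.seq
#+end_src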
*** Edit the mitochondrial genome
#+begin_src shell
nano fusta/seqs/chrMT.seq
#+end_src
*** Batch-rename chromosomes
#+begin_src shell
cd fusta/seqs; for i in *; do mv "${i}" "chr${i}"; done
#+end_src
*** Use independent sequences in external programs
#+begin_src shell
blastn -db mydb.db -query fusta/fasta/seq25.fa
asgart fusta/fasta/chrX.fa fusta/fasta/chrY.fa --out result.json
#+end_src
** Compressed FASTA files
FUSTA only works with uncompressed (multi)FASTA files. If you wish to use FUSTA on compressed (multi)FASTA files, we recommend using [[https://github.com/yhoogstrate/fastafs][FASTAFS]] as an intermediary to expose a compressed (multi)FASTA file to FUSTA without having to fully decompress it.
** Runtime options
#+begin_src
USAGE:
fusta [OPTIONS]
ARGS:
OPTIONS:
-C, --max-cache
#+end_src
*** =--cache=
The cache option is key in adapting FUSTA to your use, and for files of non-trivial size, a correct choice is the difference between a memory overflow and a smooth run: