Crates.io | htsget-search |
lib.rs | htsget-search |
version | 0.11.3 |
created_at | 2023-01-06 05:35:29.153232+00 |
updated_at | 2025-08-21 00:38:09.695413+00 |
description | The primary mechanism by which htsget-rs interacts with, and processes bioinformatics files. It does this by using noodles to query files and their indices. |
homepage | https://github.com/umccr/htsget-rs/blob/main/htsget-search/README.md |
repository | https://github.com/umccr/htsget-rs |
max_upload_size | |
id | 752092 |
size | 262,074 |
Creates URL tickets for htsget-rs by processing bioinformatics files.
This crate is the primary mechanism by which htsget-rs interacts with and processes bioinformatics files. It does this by using noodles to query files and their indices. The crate contains abstractions that factor out commonalities between file formats. Together with file-format-specific code, these abstractions define an interface that handles the core logic of a htsget request.
This crate is responsible for handling bioinformatics file data. It supports BAM, CRAM, VCF and BCF files. For htsget-rs to function, files need to be organised in the following way:
- BAM: file must end with .bam; paired with BAI index, which must end with .bam.bai
- CRAM: file must end with .cram; paired with CRAI index, which must end with .cram.crai
- VCF: file must end with .vcf.gz; paired with TBI index, which must end with .vcf.gz.tbi
- BCF: file must end with .bcf; paired with CSI index, which must end with .bcf.csi
- GZI: index must end with .gzi
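The naming rules above can be sketched as a small lookup. This is an illustrative helper only: `index_file_name` is a hypothetical name and is not part of the htsget-search API. It encodes just the four file and index extension pairs listed above.

```rust
/// Hypothetical helper (not part of htsget-search): given a data file name,
/// returns the index file name expected by the naming rules listed above.
fn index_file_name(file_name: &str) -> Option<String> {
    // (data file suffix, index file suffix) pairs from the list above.
    const PAIRS: [(&str, &str); 4] = [
        (".bam", ".bam.bai"),
        (".cram", ".cram.crai"),
        (".vcf.gz", ".vcf.gz.tbi"),
        (".bcf", ".bcf.csi"),
    ];

    PAIRS.iter().find_map(|&(data, index)| {
        // Strip the data suffix and append the full index suffix to the stem.
        file_name
            .strip_suffix(data)
            .map(|stem| format!("{stem}{index}"))
    })
}
```

For example, `index_file_name("sample.vcf.gz")` yields `Some("sample.vcf.gz.tbi")`, while a name with an unsupported extension yields `None`.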
This crate has the following features:
- The HtsGet trait represents an entity that can resolve queries according to the htsget spec.
- The htsget trait comes with a basic model to represent components needed to perform a search: Query, Format, Class, Tags, Headers, Url, Response.
- HtsGetFromStorage is the struct which is used to process requests.
This crate has the following feature flags:
- aws: used to enable S3 location functionality and any other AWS features.
- url: used to enable Url location functionality.
- experimental: used to enable experimental features that aren't necessarily part of the htsget spec, such as Crypt4GH support through C4GHStorage.
One challenge involved with implementing htsget is minimising the size of byte ranges returned in response tickets. Since htsget is used to reduce the amount of data a client needs to fetch by querying specific parts of a file, the data returned by htsget should ideally be as minimal as possible. This is done by reading the index file or the underlying target file, to determine the required byte ranges.
For BGZF files, GZI files are supported, which enable the smallest possible byte ranges.
For BGZF compressed files, htsget-rs needs to return compressed byte positions. Also, after concatenating data from URL tickets, the resulting file must be valid. This means that byte ranges must start and finish on BGZF blocks, otherwise the concatenation would not result in a valid file. Index files (BAI, TBI, CSI) do not contain all the information required to produce minimal byte ranges. For example, consider this file:
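- The corresponding index file gives 14 BGZF block start positions: 4668, 256721, 499249, 555224, 627987, 824361, 977196, 1065952, 1350270, 1454565, 1590681, 1912645, 2060795 and 2112141.
- Using only these positions, the query referenceName=11, start=5015000, and end=5050000 produces the following byte ranges:
  - bytes=0-4667
  - bytes=256721-1065951
- An equally valid response with smaller byte ranges is:
  - bytes=0-4667
  - bytes=256721-647345
  - bytes=824361-842100
  - bytes=977196-996014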
To produce the smallest byte ranges, htsget-rs can search through GZI files and regular index files. It does not read data from the underlying target file.
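The block-boundary requirement can be illustrated with a minimal sketch. It assumes a sorted list of known BGZF block start positions and widens a compressed byte span so that it starts and finishes on those boundaries. The names `block_aligned_range` and `format_range`, and the `file_len` parameter, are hypothetical and are not taken from htsget-rs. The sketch shows the general technique rather than the crate's implementation.

```rust
/// Block start positions from the example above.
const EXAMPLE_BLOCK_STARTS: [u64; 14] = [
    4668, 256721, 499249, 555224, 627987, 824361, 977196, 1065952, 1350270,
    1454565, 1590681, 1912645, 2060795, 2112141,
];

/// Hypothetical helper: widens the compressed span `start..end` (end exclusive)
/// so that it begins and ends on known block boundaries.
/// `block_starts` must be sorted ascending; `file_len` is the total file length.
fn block_aligned_range(block_starts: &[u64], start: u64, end: u64, file_len: u64) -> (u64, u64) {
    // Round the start down to the last known block start at or before `start`.
    let first_after_start = block_starts.partition_point(|&b| b <= start);
    let aligned_start = if first_after_start == 0 {
        0
    } else {
        block_starts[first_after_start - 1]
    };

    // Round the end up to the first known block start at or after `end`,
    // falling back to the end of the file when no such block is known.
    let first_at_or_after_end = block_starts.partition_point(|&b| b < end);
    let aligned_end = block_starts
        .get(first_at_or_after_end)
        .copied()
        .unwrap_or(file_len);

    (aligned_start, aligned_end)
}

/// Formats an end-exclusive span as an inclusive `bytes=` range.
/// Assumes `end_exclusive` is greater than zero.
fn format_range(start: u64, end_exclusive: u64) -> String {
    format!("bytes={}-{}", start, end_exclusive - 1)
}
```

With the block positions above, a span covering the hypothetical compressed offsets 300000 to 1000000 widens to bytes=256721-1065951. The more block boundaries are known, for example from a GZI file, the less a span has to be widened.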
Since this crate is used to query file data, it is the most performance-critical component of htsget-rs. Benchmarks written with Criterion.rs test its performance. Run the benchmarks by executing:
cargo bench -p htsget-search --all-features
Alternatively, if you are using cargo-criterion and want machine-readable JSON output, run:
cargo criterion --bench search-benchmarks --message-format=json -- LIGHT 1> search-benchmarks.json
This project is licensed under the MIT license.