fetch-data ========== [![github](https://img.shields.io/badge/github-fetch--data-8da0cb?style=flat&labelColor=555555&logo=github)](https://github.com/CarlKCarlK/fetch-data) [![crates.io](https://img.shields.io/crates/v/fetch-data.svg?style=flat&color=fc8d62&logo=rust)](https://crates.io/crates/fetch-data) [![docs.rs](https://img.shields.io/badge/docs.rs-fetch--data-66c2a5?style=flat&labelColor=555555&logoColor=white&logo=)](https://docs.rs/fetch-data) [![CI](https://github.com/CarlKCarlK/anyinput/actions/workflows/ci.yml/badge.svg)](https://github.com/CarlKCarlK/fetch-data/actions/workflows/ci.yml) Fetch data files from a URL, but only if needed. Verify contents via SHA256. Some Python Pooch compatibility. `Fetch-Data` checks a local data directory and then downloads needed files. It always verifies the local files and downloaded files via a hash. `Fetch-Data` makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now local file. ```rust use fetch_data::sample_file; let path = sample_file("small.fam")?; println!("{}", std::fs::metadata(path)?.len()); // Prints 85 # Ok::<(), anyhow::Error>(()) ``` Features -------- * Thread-safe -- allowing it to be used with Rust's multithreaded testing framework. * Inspired by Python's popular [Pooch](https://pypi.org/project/pooch/) and our PySnpTools [filecache module](https://fastlmm.github.io/PySnpTools/#module-pysnptools.util.filecache). * Avoids run-times such as Tokio (by using [`ureq`](https://crates.io/crates/ureq) to download files via blocking I/O). Suggested Usage --------------- You can set up [`FetchData`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html) many ways. Here are the steps -- followed by sample code -- for one set up. * Create a `registry.txt` file containing a whitespace-delimited list of files and their hashes. (This is the same format as [Pooch](https://pypi.org/project/pooch/). See section [Registry Creation](#registry-creation) for tips on creating this file.) * As shown below, create a global static [`FetchData`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html#method.new) instance that reads your `registry.txt` file. Give it: * the URL root from which to download the files * an environment variable telling the local data directory in which to store the files * a `qualifier`, `organization`, and `application` -- Used to create a local data directory when the environment variable is not set. See crate [ProjectsDir](https://docs.rs/directories/latest/directories/struct.ProjectDirs.html#method.from_path) for details. * As shown below, define a public `sample_file` function that takes a file name and returns a `Result` containing the path to the downloaded file. ```rust use fetch_data::{ctor, FetchData, FetchDataError}; use std::path::{Path, PathBuf}; #[ctor] static STATIC_FETCH_DATA: FetchData = FetchData::new( include_str!("../registry.txt"), "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/", "BAR_APP_DATA_DIR", // env_key "com", // qualifier "Foo Corp", // organization "Bar App", // application ); /// Download a data file. pub fn sample_file>(path: P) -> Result> { STATIC_FETCH_DATA.fetch_file(path) } ``` You can now use your `sample_file` function to download your files as needed. Registry Creation ------------------ You can create your `registry.txt` file many ways. Here are the steps -- followed by sample code -- for one way to create it. * Upload your data files to the Internet. * For example, `Fetch-Data` puts its sample data files in `tests/data`, so they upload to [this GitHub folder](https://github.com/CarlKCarlK/fetch-data/tree/main/tests/data). In GitHub, by looking at the [raw view of a data file](https://github.com/CarlKCarlK/fetch-data/blob/main/tests/data/small.fam), we see the root URL for these files. In `cargo.toml`, we keep these data files out of our crate via `exclude = ["tests/data/*"]` * As shown below, write code that * Creates a [`FetchData`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html#method.new) instance without registry contents. * Lists the files in your data directory. * Calls the [`gen_registry_contents`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html#method.gen_registry_contents) method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes. * Print this string, then manually paste it into a file called `registry.txt`. ```rust use fetch_data::{FetchData, dir_to_file_list}; let fetch_data = FetchData::new( "", // registry_contents ignored "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/", "BAR_APP_DATA_DIR", // env_key "com", // qualifier "Foo Corp", // organization "Bar App", // application ); let file_list = dir_to_file_list("tests/data")?; let registry_contents = fetch_data.gen_registry_contents(file_list)?; println!("{registry_contents}"); # use fetch_data::FetchDataError; // '#' needed for doctest # Ok::<(), Box>(()) ``` Notes ----- * Feature requests and contributions are welcome. * Don't use our sample `sample_file`. Define your own `sample_file` that knows where to find *your* data files. * The [`FetchData`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html) instance need not be global and static. See [`FetchData::new`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html#method.new) for an example of a non-global instance. * Additional [`methods on the FetchData`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html#implementations) instance can fetch multiples files and can give the path to the local data directory. * You need not use a `registry.txt` file and [`FetchData`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html) instance. You can instead use the stand-alone function [`fetch`](https://docs.rs/fetch-data/latest/fetch_data/fn.fetch.html) to retrieve a single file with known URL, hash, and local path. * Additional [stand-alone functions](https://docs.rs/fetch-data/latest/fetch_data/#functions) can download files and hash files. * `Fetch-Data` always does binary downloads to maintain consistent line endings across OSs. * The [Bed-Reader](https://github.com/fastlmm/bed-reader/tree/fetch-hash) genomics crate uses `Fetch-Data`. * To make `FetchData` work well as a static global, [`FetchData::new`](https://docs.rs/fetch-data/latest/fetch_data/struct.FetchData.html#method) never fails. Instead, `FetchData` stores any error and returns it when the first call to `fetch_file`, etc., is made. * Debugging this crate under Windows can cause a "Oops! The debug adapter has terminated abnormally" exception. This is some kind of [LLVM, Windows, NVIDIA(?) problem](https://github.com/vadimcn/vscode-lldb/issues/410) via ureq. * This crate follows [Nine Rules for Elegant Rust Library APIs](https://towardsdatascience.com/nine-rules-for-elegant-rust-library-apis-9b986a465247) from *Towards Data Science*. Project Links ----- * [**Installation**](https://crates.io/crates/fetch-data) * [**Documentation**](https://docs.rs/fetch-data/) * [**Source code**](https://github.com/CarlKCarlK/fetch-data) * [**Discussion**](https://github.com/CarlKCarlK/fetch-data/discussions/) * [**Bug Reports and Feature Requests**](https://github.com/CarlKCarlK/fetch-data/issues)