fetch-data

Crates.io: fetch-data
lib.rs: fetch-data
version: 0.2.0
source: src
created_at: 2022-06-29 14:34:33.007567
updated_at: 2024-07-29 16:37:51.99802
description: Fetch data files from a URL, but only if needed. Verify contents via SHA256. Some Python Pooch compatibility.
homepage: https://github.com/CarlKCarlK/fetch-data
repository: https://github.com/CarlKCarlK/fetch-data
max_upload_size:
id: 615469
size: 41,882
Carl Kadie (CarlKCarlK)

documentation

https://docs.rs/fetch-data/latest/fetch-data/

README

fetch-data


Fetch data files from a URL, but only if needed. Verify contents via SHA256. Some Python Pooch compatibility.

Fetch-Data checks a local data directory and then downloads any needed files. It always verifies both local and downloaded files via a SHA256 hash.
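The check-then-download behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `fetch_if_needed`, `Outcome`, and the two closure parameters are hypothetical names, and the hasher and downloader are injected so that the sketch needs no SHA256 or HTTP dependency. It also assumes that a local file with the wrong hash is reported as an error rather than downloaded again.

```rust
use std::io;
use std::path::Path;

/// What the sketch had to do to satisfy a request (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Outcome {
    AlreadyLocal,
    Downloaded,
}

/// Download `local_path` only if it is missing, then verify its hash.
/// `hash_file` and `download` are stand-ins supplied by the caller.
pub fn fetch_if_needed<H, D>(
    local_path: &Path,
    expected_hash: &str,
    hash_file: H,
    download: D,
) -> io::Result<Outcome>
where
    H: Fn(&Path) -> io::Result<String>,
    D: FnOnce(&Path) -> io::Result<()>,
{
    // Step 1: skip the download when the file is already present.
    let outcome = if local_path.exists() {
        Outcome::AlreadyLocal
    } else {
        download(local_path)?;
        Outcome::Downloaded
    };

    // Step 2: verify the file in both cases (assumption: a mismatch is an error).
    let actual = hash_file(local_path)?;
    if actual.eq_ignore_ascii_case(expected_hash) {
        Ok(outcome)
    } else {
        Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!(
                "hash mismatch for {}: expected {expected_hash}, found {actual}",
                local_path.display()
            ),
        ))
    }
}
```

A real caller would pass a SHA256 hasher and an HTTP download as the two closures. The first call for a missing file returns `Outcome::Downloaded`; later calls return `Outcome::AlreadyLocal` without touching the network.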

Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now local file.

use fetch_data::sample_file;

let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85

# Ok::<(), anyhow::Error>(())

Features

  • Thread-safe -- allowing it to be used with Rust's multithreaded testing framework.
  • Inspired by Python's popular Pooch and our PySnpTools filecache module.
  • Avoids async runtimes such as Tokio by using ureq to download files via blocking I/O.
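To illustrate why thread safety matters when tests run in parallel, here is a small sketch of one way to ensure a file is fetched once even when several threads ask for it at the same moment. It is an assumption about the general technique, not a description of the crate's internals: `SharedFetcher` and `fetch_from_many_threads` are hypothetical, and the download is simulated by a counter.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical fetcher that records which files it has already "downloaded".
#[derive(Default)]
pub struct SharedFetcher {
    fetched: Mutex<HashSet<String>>,
    downloads: AtomicUsize,
}

impl SharedFetcher {
    /// Fetch `name` at most once, no matter how many threads call this.
    pub fn fetch(&self, name: &str) {
        // The lock is held for the whole check-and-download step, so two
        // threads can never both decide that the file is missing.
        let mut fetched = self.fetched.lock().unwrap();
        if fetched.insert(name.to_string()) {
            // Simulated download: a real implementation would do I/O here.
            self.downloads.fetch_add(1, Ordering::SeqCst);
        }
    }

    /// Number of simulated downloads performed so far.
    pub fn download_count(&self) -> usize {
        self.downloads.load(Ordering::SeqCst)
    }
}

/// Request the same file from `threads` threads at once.
pub fn fetch_from_many_threads(fetcher: &SharedFetcher, name: &str, threads: usize) {
    std::thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| fetcher.fetch(name));
        }
    });
}
```

Holding one lock across the whole operation is the simplest choice. It serializes downloads, which is usually acceptable for test fixtures.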

Suggested Usage

You can set up FetchData in many ways. Here are the steps -- followed by sample code -- for one setup.

  • Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See section Registry Creation for tips on creating this file.)

  • As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:

    • the URL root from which to download the files
    • the name of an environment variable that specifies the local data directory in which to store the files
    • a qualifier, organization, and application -- used to create a local data directory when the environment variable is not set. See ProjectDirs in the directories crate for details.
  • As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.

use fetch_data::{ctor, FetchData, FetchDataError};
use std::path::{Path, PathBuf};

#[ctor]
static STATIC_FETCH_DATA: FetchData = FetchData::new(
    include_str!("../registry.txt"),
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);

/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, Box<FetchDataError>> {
    STATIC_FETCH_DATA.fetch_file(path)
}

You can now use your sample_file function to download your files as needed.
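For readers unfamiliar with the registry format mentioned in the steps above, the following sketch parses a whitespace-delimited list of file names and hashes into a map. It is an illustration only, not the crate's parser: `parse_registry` is a hypothetical name, and the rules it applies (one entry per line, blank lines skipped, exactly two fields, duplicates rejected, hashes lowercased) are assumptions.

```rust
use std::collections::HashMap;

/// Parse registry text of the form `<file name> <hash>` per line.
/// Returns a map from file name to lowercase hash, or a message naming the bad line.
pub fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut registry = HashMap::new();
    for (index, line) in contents.lines().enumerate() {
        let mut fields = line.split_whitespace();
        match (fields.next(), fields.next(), fields.next()) {
            // Blank line: nothing to record.
            (None, _, _) => continue,
            // Exactly two fields: a file name and its hash.
            (Some(name), Some(hash), None) => {
                let previous = registry.insert(name.to_string(), hash.to_ascii_lowercase());
                if previous.is_some() {
                    return Err(format!("line {}: duplicate entry for {name}", index + 1));
                }
            }
            // One field, or three or more: assumed malformed in this sketch.
            _ => return Err(format!("line {}: expected `<file> <hash>`", index + 1)),
        }
    }
    Ok(registry)
}
```

Lowercasing the hash makes later comparisons case-insensitive, which suits hexadecimal digests.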

Registry Creation

You can create your registry.txt file in many ways. Here are the steps -- followed by sample code -- for one way to create it.

  • Upload your data files to the Internet.
    • For example, Fetch-Data puts its sample data files in tests/data, so they are uploaded to the repository's tests/data folder on GitHub. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"]
  • As shown below, write code that
    • Creates a FetchData instance without registry contents.
    • Lists the files in your data directory.
    • Calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
  • Print this string, then manually paste it into a file called registry.txt.

use fetch_data::{FetchData, dir_to_file_list};

let fetch_data = FetchData::new(
    "", // registry_contents ignored
    "https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
    "BAR_APP_DATA_DIR", // env_key
    "com",              // qualifier
    "Foo Corp",         // organization
    "Bar App",          // application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");

# use fetch_data::FetchDataError; // '#' needed for doctest
# Ok::<(), Box<FetchDataError>>(())
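To make the shape of the generated string concrete, here is a std-only sketch that builds registry text from a local directory. It differs from gen_registry_contents as described above in that it hashes files already on disk rather than downloading them. `registry_from_dir` is a hypothetical helper, and the hasher is passed in by the caller, so any function from bytes to a hex string can stand in for SHA256.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Build registry text (`<file name> <hash>` per line) for the files in `dir`.
/// `hash_bytes` is supplied by the caller; a real registry would use SHA256.
pub fn registry_from_dir<H>(dir: &Path, hash_bytes: H) -> io::Result<String>
where
    H: Fn(&[u8]) -> String,
{
    // Collect plain files only; subdirectories are ignored in this sketch.
    let mut names = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            names.push(entry.file_name().to_string_lossy().into_owned());
        }
    }
    // Sort so the output is stable from run to run.
    names.sort();

    // Note: a file name containing whitespace would not survive a
    // whitespace-delimited format; this sketch does not guard against that.
    let mut contents = String::new();
    for name in names {
        let bytes = fs::read(dir.join(&name))?;
        contents.push_str(&format!("{name} {}\n", hash_bytes(&bytes)));
    }
    Ok(contents)
}
```

Sorting the names keeps the pasted registry.txt stable, so later regenerations produce small diffs.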

Notes

  • Feature requests and contributions are welcome.

  • Don't use our sample sample_file function in your own project. Define your own sample_file that knows where to find your data files.

  • The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.

  • Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.

  • You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with known URL, hash, and local path.

  • Additional stand-alone functions can download files and hash files.

  • Fetch-Data always does binary downloads to maintain consistent line endings across OSs.

  • The Bed-Reader genomics crate uses Fetch-Data.

  • To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.

  • Debugging this crate under Windows can cause an "Oops! The debug adapter has terminated abnormally" exception. This appears to be some kind of LLVM, Windows, NVIDIA(?) problem triggered via ureq.

  • This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.
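The note above about FetchData::new never failing describes a pattern that can be sketched in a few lines: the constructor stores any setup error, and the first method that needs the data returns it. The type below is a hypothetical stand-in written with the standard library only. It is not the crate's code, and its two-fields-per-line rule is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical registry holder whose constructor cannot fail.
pub struct DeferredRegistry {
    // Either the parsed registry or the error found while parsing it.
    state: Result<HashMap<String, String>, String>,
}

impl DeferredRegistry {
    /// Always returns a value; a malformed registry is remembered, not reported.
    pub fn new(registry_contents: &str) -> Self {
        let state = registry_contents
            .lines()
            .filter(|line| !line.trim().is_empty())
            .map(|line| {
                let fields: Vec<&str> = line.split_whitespace().collect();
                match fields.as_slice() {
                    [name, hash] => Ok((name.to_string(), hash.to_string())),
                    _ => Err(format!("malformed registry line: {line:?}")),
                }
            })
            .collect::<Result<HashMap<_, _>, _>>();
        Self { state }
    }

    /// Look up a hash; any stored construction error surfaces here.
    pub fn hash_for(&self, name: &str) -> Result<&str, String> {
        let registry = self.state.as_ref().map_err(|error| error.clone())?;
        registry
            .get(name)
            .map(String::as_str)
            .ok_or_else(|| format!("{name} is not in the registry"))
    }
}
```

Deferring the error keeps initialization infallible, which is convenient for a global static, at the cost of reporting a bad registry later than the point where it was supplied.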
