Crates.io | fetch-data |
lib.rs | fetch-data |
version | 0.2.0 |
source | src |
created_at | 2022-06-29 14:34:33.007567 |
updated_at | 2024-07-29 16:37:51.99802 |
description | Fetch data files from a URL, but only if needed. Verify contents via SHA256. Some Python Pooch compatibility. |
homepage | https://github.com/CarlKCarlK/fetch-data |
repository | https://github.com/CarlKCarlK/fetch-data |
max_upload_size | |
id | 615469 |
size | 41,882 |
Fetch-Data checks a local data directory and then downloads needed files. It always verifies the local files and downloaded files via a hash.
Fetch-Data makes it easy to download large and small sample files. For example, here we download a genomics file from GitHub (if it has not already been downloaded). We then print the size of the now-local file.
use fetch_data::sample_file;
let path = sample_file("small.fam")?;
println!("{}", std::fs::metadata(path)?.len()); // Prints 85
# Ok::<(), anyhow::Error>(())
Fetch-Data uses ureq to download files via blocking I/O.
You can set up FetchData many ways. Here are the steps, followed by sample code, for one setup.
1. Create a registry.txt file containing a whitespace-delimited list of files and their hashes. (This is the same format as Pooch. See section Registry Creation for tips on creating this file.)
2. As shown below, create a global static FetchData instance that reads your registry.txt file. Give it:
   - the root URL from which to download the files,
   - the name of an environment variable (env_key) that, when set, gives the local data directory, and
   - a qualifier, organization, and application, used to create a local data directory when the environment variable is not set. See ProjectDirs in the directories crate for details.
3. As shown below, define a public sample_file function that takes a file name and returns a Result containing the path to the downloaded file.
use fetch_data::{ctor, FetchData, FetchDataError};
use std::path::{Path, PathBuf};
#[ctor]
static STATIC_FETCH_DATA: FetchData = FetchData::new(
include_str!("../registry.txt"),
"https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
"BAR_APP_DATA_DIR", // env_key
"com", // qualifier
"Foo Corp", // organization
"Bar App", // application
);
/// Download a data file.
pub fn sample_file<P: AsRef<Path>>(path: P) -> Result<PathBuf, Box<FetchDataError>> {
STATIC_FETCH_DATA.fetch_file(path)
}
You can now use your sample_file function to download your files as needed.
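The setup steps describe a local data directory that comes from an environment variable when it is set and from a default location otherwise. The following is only a sketch of that decision using the standard library; it is not Fetch-Data's implementation. The names choose_data_dir, resolve_data_dir, and fallback are hypothetical, and treating an empty value as unset is a choice made for this sketch.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Decide on a data directory from an optional environment value.
/// Hypothetical helper: a non-empty value wins, otherwise `fallback` is used.
fn choose_data_dir(env_value: Option<OsString>, fallback: PathBuf) -> PathBuf {
    match env_value {
        Some(value) if !value.is_empty() => PathBuf::from(value),
        _ => fallback,
    }
}

/// Look up `env_key` in the process environment and apply the rule above.
/// In a real application the fallback would be a per-user data directory.
fn resolve_data_dir(env_key: &str, fallback: PathBuf) -> PathBuf {
    choose_data_dir(env::var_os(env_key), fallback)
}
```

Separating the lookup from the decision keeps the rule easy to test without changing the process environment.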
You can create your registry.txt file many ways. Here are the steps, followed by sample code, for one way to create it.
1. Make your data files available at a URL. For example, Fetch-Data puts its sample data files in tests/data, so they are uploaded to the project's GitHub repository. In GitHub, by looking at the raw view of a data file, we see the root URL for these files. In Cargo.toml, we keep these data files out of our crate via exclude = ["tests/data/*"].
2. As shown below, write code that:
   - creates a FetchData instance without registry contents,
   - lists the files in your data directory, and
   - calls the gen_registry_contents method on your list of files. This method will download the files, compute their hashes, and create a string of file names and hashes.
3. Print this string and save it as registry.txt.
use fetch_data::{FetchData, dir_to_file_list};
let fetch_data = FetchData::new(
"", // registry_contents ignored
"https://raw.githubusercontent.com/CarlKCarlK/fetch-data/main/tests/data/",
"BAR_APP_DATA_DIR", // env_key
"com", // qualifier
"Foo Corp", // organization
"Bar App", // application
);
let file_list = dir_to_file_list("tests/data")?;
let registry_contents = fetch_data.gen_registry_contents(file_list)?;
println!("{registry_contents}");
# use fetch_data::FetchDataError; // '#' needed for doctest
# Ok::<(), Box<FetchDataError>>(())
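To make the registry format concrete, here is a minimal sketch of how whitespace-delimited registry text could be read into a map from file name to hash. It uses only the standard library and is not Fetch-Data's own parser. The function name parse_registry and its String error type are hypothetical, and ignoring any fields after the hash is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Parse registry text in which each non-blank line holds a file name
/// followed by its hash, separated by whitespace.
/// Hypothetical helper: fields after the hash are ignored.
fn parse_registry(contents: &str) -> Result<HashMap<String, String>, String> {
    let mut registry = HashMap::new();
    for (index, line) in contents.lines().enumerate() {
        let mut parts = line.split_whitespace();
        let (name, hash) = match (parts.next(), parts.next()) {
            (None, _) => continue, // blank line
            (Some(name), Some(hash)) => (name, hash),
            (Some(_), None) => {
                return Err(format!("line {}: missing hash", index + 1));
            }
        };
        registry.insert(name.to_string(), hash.to_string());
    }
    Ok(registry)
}
```

Because the format is whitespace-delimited, a file name containing spaces cannot be represented in this sketch.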
Feature requests and contributions are welcome.
Don't use our sample sample_file. Define your own sample_file that knows where to find your data files.
The FetchData instance need not be global and static. See FetchData::new for an example of a non-global instance.
Additional methods on the FetchData instance can fetch multiple files and can give the path to the local data directory.
You need not use a registry.txt file and FetchData instance. You can instead use the stand-alone function fetch to retrieve a single file with a known URL, hash, and local path.
Additional stand-alone functions can download files and hash files.
Fetch-Data always does binary downloads to maintain consistent line endings across OSs.
The Bed-Reader genomics crate uses Fetch-Data.
To make FetchData work well as a static global, FetchData::new never fails. Instead, FetchData stores any error and returns it when the first call to fetch_file, etc., is made.
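The pattern of a constructor that never fails can be sketched as follows. This is an illustration using only the standard library, not Fetch-Data's actual code. DeferredRegistry and hash_for are hypothetical names. The constructor stores either the parsed registry or an error message, and the error surfaces only when a method is called.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a type whose constructor never fails.
struct DeferredRegistry {
    // Either the parsed registry or the error found during construction.
    state: Result<HashMap<String, String>, String>,
}

impl DeferredRegistry {
    /// Never fails: a malformed line is recorded instead of returned.
    fn new(registry_contents: &str) -> Self {
        let state = registry_contents
            .lines()
            .filter(|line| !line.trim().is_empty())
            .map(|line| {
                let mut parts = line.split_whitespace();
                match (parts.next(), parts.next()) {
                    (Some(name), Some(hash)) => Ok((name.to_string(), hash.to_string())),
                    _ => Err(format!("malformed registry line: {line:?}")),
                }
            })
            .collect::<Result<HashMap<_, _>, String>>();
        Self { state }
    }

    /// Returns the stored construction error, if any, before doing a lookup.
    fn hash_for(&self, name: &str) -> Result<&str, String> {
        let registry = self.state.as_ref().map_err(|error| error.clone())?;
        registry
            .get(name)
            .map(String::as_str)
            .ok_or_else(|| format!("{name} is not in the registry"))
    }
}
```

This design suits a static global because initialization cannot panic or return early; callers handle the error at the point of use, where a Result is already expected.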
Debugging this crate under Windows can cause an "Oops! The debug adapter has terminated abnormally" exception. This appears to be some kind of LLVM, Windows, NVIDIA(?) problem via ureq.
This crate follows Nine Rules for Elegant Rust Library APIs from Towards Data Science.