| Crates.io | flowrider |
| lib.rs | flowrider |
| version | 0.1.1 |
| created_at | 2025-06-30 21:31:19.222473+00 |
| updated_at | 2025-06-30 21:31:19.222473+00 |
| description | High-performance PyTorch-compatible streaming dataset with distributed caching for on-the-fly remote dataset fetching |
| homepage | https://github.com/fpgaminer/flowrider |
| repository | https://github.com/fpgaminer/flowrider |
| id | 1732402 |
| size | 202,833 |
Inspired by MosaicML's streaming library (https://github.com/mosaicml/streaming), this library provides a PyTorch IterableDataset implementation that streams data from cloud storage. It is compatible with distributed training and can cache fetched data on local disk.
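The general idea of rank-aware streaming with a disk cache can be sketched in plain Python. This is only an illustration and not flowrider's API or its actual algorithm: `shards_for_rank`, `fetch_cached`, `stream`, and the `download` callback are hypothetical names, the strided shard assignment is an assumed scheme, and a real dataset would wrap the generator in a PyTorch `IterableDataset`.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List


def shards_for_rank(shards: List[str], rank: int, world_size: int) -> List[str]:
    """Give each rank a disjoint, strided slice of the shard list (assumed scheme)."""
    return shards[rank::world_size]


def cached_path(cache_dir: Path, remote_name: str) -> Path:
    """Map a remote shard name to a stable local file name."""
    digest = hashlib.sha256(remote_name.encode("utf-8")).hexdigest()
    return cache_dir / digest


def fetch_cached(cache_dir: Path, remote_name: str,
                 download: Callable[[str], bytes]) -> bytes:
    """Return shard bytes, calling `download` only on a cache miss."""
    path = cached_path(cache_dir, remote_name)
    if path.exists():
        return path.read_bytes()
    data = download(remote_name)
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(data)
    tmp.replace(path)  # rename into place once the file is fully written
    return data


def stream(shards: List[str], rank: int, world_size: int, cache_dir: Path,
           download: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield this rank's shards in order, going through the disk cache."""
    for name in shards_for_rank(shards, rank, world_size):
        yield fetch_cached(cache_dir, name, download)


if __name__ == "__main__":
    # Stand-in for a real cloud storage client (hypothetical).
    def fake_download(name: str) -> bytes:
        return f"contents of {name}".encode("utf-8")

    with tempfile.TemporaryDirectory() as tmp_dir:
        names = ["shard-0", "shard-1", "shard-2", "shard-3"]
        for blob in stream(names, rank=1, world_size=2,
                           cache_dir=Path(tmp_dir), download=fake_download):
            print(blob)
```

A second pass over the same shards reads from `cache_dir` instead of calling `download` again, which is the property that makes repeated epochs cheap.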
Run the Rust tests with `cargo test --no-default-features --features auto-initialize`.
Logging has to go through `env_logger`, even though there are ways to forward logs to Python's logger. Forwarding to Python's logger requires holding the GIL. Since a background thread does work (and may log), that creates a minefield: either deadlocks, or background threads that are blocked from making progress.
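Because the logs do not pass through Python's `logging` module, verbosity would be controlled from the environment rather than from Python handlers. The sketch below assumes the library initializes `env_logger` with its default settings, in which case the filter is read from the `RUST_LOG` environment variable; that initialization detail is an assumption, and `configure_rust_logging` is a hypothetical helper, not part of flowrider.

```python
import os
from typing import MutableMapping


def configure_rust_logging(level: str = "info",
                           env: MutableMapping[str, str] = os.environ) -> str:
    """Set RUST_LOG unless the user already set it; return the effective value."""
    env.setdefault("RUST_LOG", level)
    return env["RUST_LOG"]


if __name__ == "__main__":
    # Call this before importing the extension module, on the assumption that
    # the Rust logger reads the variable when it is initialized.
    print(configure_rust_logging("debug"))
```

Using `setdefault` keeps any value the user exported in their shell, so the helper only supplies a fallback.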