| Field | Value |
| --- | --- |
| Crates.io | get_chunk |
| lib.rs | get_chunk |
| version | 1.2.2 |
| source | src |
| created_at | 2023-08-19 19:28:13.789541 |
| updated_at | 2024-07-07 13:41:56.695347 |
| description | File iterator or stream with auto or manual chunk size selection |
| homepage | |
| repository | https://github.com/m62624/get_chunk |
| max_upload_size | |
| id | 948866 |
| size | 77,183 |
get_chunk is a library for creating file iterators or streams (asynchronous iterators), specialized in efficient file chunking. Its main task is to retrieve chunk data efficiently, especially from large files.
Key Features:
⚠️ Important Notice:
The algorithm adjusts chunk sizes for optimal performance after each `next` call, taking available RAM into account. Crucially, this adjustment happens only after the current chunk has been handed out and before the subsequent `next` call.
Note a potential scenario: suppose a chunk is 15 GB and there are initially 16 GB of free RAM. If, between the current and the next `next` call, 2 GB of RAM become unexpectedly occupied, the current 15 GB chunk will still be processed. This introduces a risk: the system might reclaim resources (resulting in an `io::Error`) or the code might crash.
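If this risk matters for your workload, one mitigation is to opt out of automatic sizing and pin the chunk size yourself. Below is a minimal sketch using the fixed `ChunkSize::Bytes` mode shown later in this README (and assuming, as in the examples below, that the iterator yields `io::Result` items):

```rust
use get_chunk::{iterator::FileIter, ChunkSize};

fn main() -> std::io::Result<()> {
    // Pin chunks to 64 MiB so a sudden drop in free RAM between `next`
    // calls cannot leave us holding an oversized chunk.
    let file_iter = FileIter::new("file.txt")?.set_mode(ChunkSize::Bytes(64 * 1024 * 1024));
    for chunk in file_iter {
        // Propagate read errors (e.g. reclaimed resources) instead of ignoring them.
        let data = chunk?;
        let _ = data; // process the chunk here
    }
    Ok(())
}
```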
Iterators created by get_chunk don't store the entire file in memory; they fetch data from the file in chunks, which keeps memory usage bounded even for large datasets.
Key Points:
```rust
// Note: requires the `size_format` Cargo feature.
use get_chunk::data_size_format::iec::IECUnit;
use get_chunk::iterator::FileIter;
// use std::fs::File; // needed for the `try_from` variant below

fn main() -> std::io::Result<()> {
    let file_iter = FileIter::new("file.txt")?;
    // or
    // let file_iter = FileIter::try_from(File::open("file.txt")?)?;
    for chunk in file_iter {
        match chunk {
            Ok(data) => {
                // some calculations with the chunk
                println!("{}", IECUnit::auto(data.len() as f64));
            }
            Err(_) => break,
        }
    }
    Ok(())
}
```
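The `size_format` and `stream` capabilities referenced in the notes throughout these examples are Cargo features. A minimal manifest enabling them might look like this (feature names are taken from the notes in the examples; consult the crate documentation for the authoritative list):

```toml
[dependencies]
get_chunk = { version = "1.2.2", features = ["size_format", "stream"] }
```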
```rust
// Note: requires the `size_format` Cargo feature.
use get_chunk::{
    data_size_format::iec::{IECSize, IECUnit},
    iterator::FileIter,
    ChunkSize,
};

fn main() -> std::io::Result<()> {
    let file_iter = FileIter::new("file.txt")?
        .include_available_swap()
        .set_mode(ChunkSize::Bytes(40000))
        .set_start_position_bytes(IECUnit::new(432.0, IECSize::Mebibyte).into());
    // ... iterate over `file_iter` as in the previous example
    Ok(())
}
```
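Here, `set_mode(ChunkSize::Bytes(40000))` opts out of automatic sizing in favor of fixed 40,000-byte chunks, and `set_start_position_bytes(...)` begins reading 432 MiB into the file (the `IECUnit` value converts to a byte count via `Into`). `include_available_swap()`, as its name suggests, lets the memory cap account for available swap in addition to RAM (see the sizing rules below, which cap chunks against available memory).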
```rust
// Note: requires the `size_format` and `stream` Cargo features.
use get_chunk::data_size_format::iec::IECUnit;
use get_chunk::stream::{FileStream, StreamExt};
// use tokio::fs::File; // needed for the `try_from_data` variant below

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut file_stream = FileStream::new("file.txt").await?;
    // or
    // let mut file_stream = FileStream::try_from_data(File::open("file.txt").await?)?;
    while let Ok(chunk) = file_stream.try_next().await {
        match chunk {
            Some(chunk) => {
                // some calculations with the chunk
                println!("{}", IECUnit::auto(chunk.len() as f64));
            }
            None => break,
        }
    }
    Ok(())
}
```
The `calculate_chunk` function in the `ChunkSize` enum determines the optimal chunk size based on various parameters. Here's a breakdown of how the size is calculated.

The variables `prev` and `now` represent the previous and current read times, respectively:

- `prev`: the time taken to read a chunk of data in the previous iteration.
- `now`: the time taken to read the chunk in the current iteration.
Auto Mode:
- If the previous read time (`prev`) is greater than zero:
  - If the current read time (`now`) is also greater than zero:
    - If `now` is less than `prev`, decrease the chunk size using the `decrease_chunk` method.
    - If `now` is greater than or equal to `prev`, increase the chunk size using the `increase_chunk` method.
  - If `now` is zero or negative, maintain the previous chunk size (`prev`).

Percent Mode:
- The chunk size is calculated as a percentage of the file size using the `percentage_chunk` method. The percentage is capped between 0.1% and 100%.

Bytes Mode:
- The chunk size is derived from the specified byte count using the `bytes_chunk` method. The size is capped by the file size and available RAM.

The corresponding formulas:

- `increase_chunk`: `(prev * (1.0 + ((now - prev) / prev).min(0.15))).min(ram_available * 0.85).min(f64::MAX)`
- `decrease_chunk`: `(prev * (1.0 - ((prev - now) / prev).min(0.45))).min(ram_available * 0.85).min(f64::MAX)`
- default chunk size (0.1% of the file): `(file_size * (0.1 / 100.0)).min(ram_available * 0.85).min(f64::MAX)`
- `percentage_chunk`: `(file_size * (percentage.min(100.0).max(0.1) / 100.0)).min(ram_available * 0.85)`
- `bytes_chunk`: `(bytes as f64).min(file_size).min(ram_available * 0.85)`
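For illustration, here is a minimal, self-contained sketch of these formulas as free functions, with a few worked values. The real implementations are methods on `ChunkSize`; the trailing `.min(f64::MAX)` is omitted here since it is a no-op for finite values.

```rust
const RAM_FRACTION: f64 = 0.85; // chunks are capped at 85% of available memory

fn increase_chunk(prev: f64, now: f64, ram_available: f64) -> f64 {
    // Growth factor is capped at +15%.
    (prev * (1.0 + ((now - prev) / prev).min(0.15))).min(ram_available * RAM_FRACTION)
}

fn decrease_chunk(prev: f64, now: f64, ram_available: f64) -> f64 {
    // Shrink factor is capped at -45%.
    (prev * (1.0 - ((prev - now) / prev).min(0.45))).min(ram_available * RAM_FRACTION)
}

fn percentage_chunk(file_size: f64, percentage: f64, ram_available: f64) -> f64 {
    // Percentage is clamped to the 0.1%..=100% range.
    (file_size * (percentage.min(100.0).max(0.1) / 100.0)).min(ram_available * RAM_FRACTION)
}

fn bytes_chunk(file_size: f64, bytes: usize, ram_available: f64) -> f64 {
    // A fixed byte count is still capped by the file size and available memory.
    (bytes as f64).min(file_size).min(ram_available * RAM_FRACTION)
}

fn main() {
    let ram = 8.0 * 1024.0 * 1024.0 * 1024.0; // assume 8 GiB available

    // (140 - 100) / 100 = 0.4 is capped at 0.15, so the result is 100 * 1.15 = 115.
    println!("{}", increase_chunk(100.0, 140.0, ram));
    // (100 - 10) / 100 = 0.9 is capped at 0.45, so the result is 100 * 0.55 = 55.
    println!("{}", decrease_chunk(100.0, 10.0, ram));
    // 5% of a 1_000_000-byte file: 50000.
    println!("{}", percentage_chunk(1_000_000.0, 5.0, ram));
    // 40_000 bytes, already below both caps: 40000.
    println!("{}", bytes_chunk(1_000_000.0, 40_000, ram));
}
```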