Crates.io | chunk_norris |
lib.rs | chunk_norris |
version | |
source | src |
created_at | 2025-01-23 11:25:10.637465+00 |
updated_at | 2025-01-24 10:26:32.58054+00 |
description | A Rust library for splitting large text into smaller batches for LLM input. |
homepage | |
repository | https://github.com/valeriouberti/ChunkNorris |
max_upload_size | |
id | 1527722 |
A simple and efficient Rust library for splitting large text into smaller batches based on different strategies. This is particularly useful when working with large language models (LLMs) that have input size limitations.
CharCountBatcher: Splits text into batches of a specified maximum character length.
SentenceBatcher: Splits text into batches based on complete sentences, respecting a minimum batch size.
Add chunk_norris to your Cargo.toml:
[dependencies]
chunk_norris = "0.1.0" # Replace with the latest version
use chunk_norris::{BatchingStrategy, CharCountBatcher, TextBatch};

fn main() {
    let text = "This is an example text. It will be split into smaller batches.";

    // Create a batcher with a maximum of 25 characters per batch
    let batcher = CharCountBatcher::new(25);

    // Generate the batches
    let batches: Vec<TextBatch> = batcher.create_batches(text);

    // Print the batches
    for (i, batch) in batches.iter().enumerate() {
        println!("Batch {}: {}", i + 1, batch.content);
    }
}
Batch 1: This is an example text.
Batch 2: It will be split into
Batch 3: smaller batches.
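To illustrate the idea behind character-count batching, here is a minimal, self-contained sketch that greedily packs whole words into batches of at most a given number of characters. It is not the crate's implementation: the function name `split_by_char_count` is hypothetical, and the choice to break only at whitespace is an assumption inferred from the output shown above.

```rust
/// Greedily packs whitespace-separated words into batches of at most
/// `max_chars` characters (counted as Unicode scalar values).
/// A single word longer than `max_chars` is kept whole in its own batch.
/// NOTE: hypothetical helper, not part of chunk_norris.
fn split_by_char_count(text: &str, max_chars: usize) -> Vec<String> {
    let mut batches = Vec::new();
    let mut current = String::new();
    let mut current_len = 0; // length of `current` in chars

    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        // The +1 accounts for the space that would join the word to the batch.
        if !current.is_empty() && current_len + 1 + word_len > max_chars {
            batches.push(std::mem::take(&mut current));
            current_len = 0;
        }
        if !current.is_empty() {
            current.push(' ');
            current_len += 1;
        }
        current.push_str(word);
        current_len += word_len;
    }
    // Flush whatever is left after the last word.
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

Calling `split_by_char_count(text, 25)` on the example text yields the same three batches as listed above. Counting `chars()` rather than bytes keeps the limit meaningful for non-ASCII text.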
use chunk_norris::{BatchingStrategy, SentenceBatcher, TextBatch};

fn main() {
    let text = "This is a sentence. This is another. And a third one!";

    // Create a batcher with a minimum batch size of 10 characters
    let batcher = SentenceBatcher::new(10);

    // Generate the batches
    let batches: Vec<TextBatch> = batcher.create_batches(text);

    // Print the batches
    for (i, batch) in batches.iter().enumerate() {
        println!("Batch {}: {}", i + 1, batch.content);
    }
}
Batch 1: This is a sentence.
Batch 2: This is another.
Batch 3: And a third one!
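The following sketch shows one way sentence-based batching with a minimum size could work: sentences are accumulated until the batch reaches the minimum length, then the batch is emitted. This is an assumption about the behavior, not the crate's actual code; `split_by_sentences` is a hypothetical name, and the sentence detection (splitting after `.`, `!` or `?`) is deliberately naive.

```rust
/// Splits `text` into sentences (ending in '.', '!' or '?') and groups them
/// so that every batch except possibly the last has at least `min_chars`
/// characters.
/// NOTE: hypothetical helper, not part of chunk_norris.
fn split_by_sentences(text: &str, min_chars: usize) -> Vec<String> {
    let mut batches = Vec::new();
    let mut current = String::new();

    // `split_inclusive` keeps the terminator attached to its sentence.
    for sentence in text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?')) {
        let sentence = sentence.trim();
        if sentence.is_empty() {
            continue;
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(sentence);
        // Emit the batch once it has reached the minimum size.
        if current.chars().count() >= min_chars {
            batches.push(std::mem::take(&mut current));
        }
    }
    // Any trailing text shorter than the minimum becomes a final batch.
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

With a minimum of 10 characters, every sentence in the example is long enough to form its own batch, which matches the output above. With a larger minimum, such as 30, the first two sentences would be merged into one batch. Abbreviations such as "e.g." would be split incorrectly by this simple rule.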
The library is designed to be extensible. Beyond the built-in CharCountBatcher and SentenceBatcher, you can implement the BatchingStrategy trait to create custom batching logic:
use chunk_norris::{BatchingStrategy, TextBatch};

// Example: a hypothetical custom batcher that limits the number of sentences per batch.
// It is named SentenceCountBatcher to avoid clashing with the library's SentenceBatcher.
struct SentenceCountBatcher {
    max_sentences: usize,
}

impl BatchingStrategy for SentenceCountBatcher {
    fn create_batches(&self, text: &str) -> Vec<TextBatch> {
        // ... your implementation to split text into batches by sentences ...
        todo!()
    }
}
You could then use your custom batcher similarly to the CharCountBatcher.
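As a fuller illustration of a custom strategy, the sketch below fills in a sentence-count batcher. To stay self-contained it defines local stand-ins for `TextBatch` and `BatchingStrategy`. Their shapes are assumptions based on the examples above (a `content` field and a `create_batches(&self, &str)` method), and the crate's real types may differ. `SentenceCountBatcher` is a hypothetical name.

```rust
// Local stand-ins so the sketch compiles on its own. In a real project these
// would come from `use chunk_norris::{BatchingStrategy, TextBatch};`, and the
// crate's actual `TextBatch` may carry more fields than `content`.
#[derive(Debug, PartialEq)]
struct TextBatch {
    content: String,
}

trait BatchingStrategy {
    fn create_batches(&self, text: &str) -> Vec<TextBatch>;
}

/// Hypothetical custom strategy: at most `max_sentences` sentences per batch.
struct SentenceCountBatcher {
    max_sentences: usize,
}

impl BatchingStrategy for SentenceCountBatcher {
    fn create_batches(&self, text: &str) -> Vec<TextBatch> {
        // Naive sentence split: keep '.', '!' or '?' attached to the sentence.
        let sentences: Vec<&str> = text
            .split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .collect();

        sentences
            .chunks(self.max_sentences.max(1)) // `chunks(0)` would panic
            .map(|group| TextBatch { content: group.join(" ") })
            .collect()
    }
}
```

Because the batcher is used only through the trait, it can be stored as a `Box<dyn BatchingStrategy>` and swapped for another strategy without changing the calling code.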
Contributions are welcome! If you'd like to add new batching strategies, improve the existing code, or fix any issues, please feel free to open an issue or submit a pull request.
This project is licensed under either the MIT License or the Apache License, Version 2.0 at your option.