| Crates.io | niblits |
| lib.rs | niblits |
| version | 0.3.5 |
| created_at | 2026-01-11 05:57:01.158375+00 |
| updated_at | 2026-01-18 18:00:48.477447+00 |
| description | Token-aware, multi-format text chunking library with language-aware semantic splitting |
| homepage | https://github.com/casualjim/niblits |
| repository | https://github.com/casualjim/niblits |
| max_upload_size | |
| id | 2035259 |
| size | 440,285 |
A powerful, token-aware text chunking library for processing multiple file formats with language-aware semantic splitting.
This library provides streaming, async-first text chunking capabilities designed for ingestion pipelines and search systems. It handles diverse document types while maintaining semantic boundaries and offering configurable tokenization strategies.
Add to your Cargo.toml:
[dependencies]
niblits = "0.3.0"
tokio = { version = "1", features = ["rt", "macros"] }
futures = "0.3"
use niblits::{chunk_stream, ChunkerConfig, Tokenizer};
use futures::StreamExt;
use std::io::Cursor;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure chunking
    let config = ChunkerConfig {
        max_chunk_size: 1000,
        overlap_percentage: 0.2,
        tokenizer: Tokenizer::Tiktoken("cl100k_base".to_string()),
    };

    // Process a file
    let content = r#"fn main() {
    println!("Hello, world!");
}

fn helper() {
    println!("This is a helper function");
}"#;
    let reader = Cursor::new(content.as_bytes());

    let mut stream = chunk_stream("main.rs", reader, config).await;
    while let Some(result) = stream.next().await {
        let project_chunk = result?;
        println!("File: {}", project_chunk.file_path);
        match project_chunk.chunk {
            niblits::Chunk::Semantic(chunk) => {
                println!("Semantic chunk: {} bytes", chunk.text.len());
            }
            niblits::Chunk::Text(chunk) => {
                println!("Text chunk: {} bytes", chunk.text.len());
            }
            niblits::Chunk::EndOfFile { expected_chunks, .. } => {
                println!("File complete. Expected {} chunks", expected_chunks);
            }
            // Remaining variants (e.g. Chunk::Delete) are not needed here
            _ => {}
        }
    }

    Ok(())
}
pub struct ChunkerConfig {
    /// Percentage of tokens to reserve for overlap (0.0 - 1.0)
    pub overlap_percentage: f32,
    /// Maximum size of each chunk (in tokens/characters)
    pub max_chunk_size: usize,
    /// Tokenizer strategy for size calculation
    pub tokenizer: Tokenizer,
}
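The two sizing fields interact: the overlap carved out of each chunk comes from the same max_chunk_size budget. A minimal sketch of that arithmetic, assuming overlap is simply max_chunk_size * overlap_percentage rounded down; overlap_tokens is a hypothetical helper, not part of the crate's API:

// Hypothetical helper: tokens shared between consecutive chunks,
// assuming overlap = floor(max_chunk_size * overlap_percentage).
fn overlap_tokens(max_chunk_size: usize, overlap_percentage: f32) -> usize {
    (max_chunk_size as f32 * overlap_percentage) as usize
}

// With max_chunk_size = 1000 and overlap_percentage = 0.2,
// consecutive chunks would share roughly 200 tokens.
assert_eq!(overlap_tokens(1000, 0.2), 200);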
pub enum Tokenizer {
    /// Simple character-based tokenization
    Characters,
    /// OpenAI tiktoken with encoding name
    Tiktoken(String), // "cl100k_base", "p50k_base", etc.
    /// HuggingFace tokenizer with model ID
    HuggingFace(String), // "bert-base-uncased", etc.
    // Preloaded variants (internal use)
    PreloadedTiktoken(Arc<CoreBPE>),
    PreloadedHuggingFace(Arc<Tokenizer>),
}
Check supported programming languages:
use niblits::{supported_languages, is_language_supported};
// Get all supported languages
let languages = supported_languages();
println!("Supported languages: {:?}", languages);
// Check specific language
assert!(is_language_supported("rust"));
assert!(is_language_supported("python"));
Commonly supported languages include: Rust, Python, JavaScript, TypeScript, Go, Java, C++, C#, Ruby, PHP, Swift, Kotlin, and many more.
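A pipeline can use these checks to decide up front which files will get tree-sitter-based semantic splitting. A minimal sketch; the extension-to-language mapping below is hypothetical (the crate's own detection in languages.rs is authoritative), and the plain-text fallback is an assumption:

use niblits::is_language_supported;

// Hypothetical mapping from file extension to language name.
let lang = match "rs" {
    "rs" => "rust",
    "py" => "python",
    _ => "text",
};

if is_language_supported(lang) {
    println!("{lang} gets language-aware semantic splitting");
} else {
    println!("{lang} falls back to plain text chunking");
}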
- chunk_stream(path, reader, config) - Process a file stream and yield chunks
- walk_project(path, options) - Recursively walk a directory and stream chunks
- walk_files(files, project_root, options) - Chunk a stream of file paths with ignore rules
- walker_includes_path(project_root, path, max_file_size) - Check if a path would be included
- supported_languages() - Get list of supported programming languages
- is_language_supported(name) - Check if a language is supported
- Chunk - Represents different chunk types (Semantic, Text, EndOfFile, Delete)
- SemanticChunk - Contains text, tokens, and byte offset information
- ProjectChunk - File path, chunk data, and file size
- ChunkError - Error types for parsing, IO, and unsupported formats

// Markdown file
let reader = Cursor::new("# Header\n\nSome content\n\n## Subheader".as_bytes());
let stream = chunk_stream("doc.md", reader, ChunkerConfig::default()).await;

// PDF file
let file = tokio::fs::File::open("document.pdf").await?;
let stream = chunk_stream("document.pdf", file, ChunkerConfig::default()).await;

// Code file
// `python_file` is any reader, e.g. an open tokio::fs::File
let code_stream = chunk_stream("script.py", python_file, ChunkerConfig::default()).await;
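Every stream item is a Result (as the quick-start loop's result? shows), so downstream pipelines can buffer chunks with futures::TryStreamExt. A minimal sketch, assuming the stream yields Result<ProjectChunk, ChunkError>:

use futures::TryStreamExt;
use std::io::Cursor;

// Collect every chunk for batch indexing; the first error aborts the collect.
let reader = Cursor::new("# Title\n\nBody text".as_bytes());
let stream = chunk_stream("notes.md", reader, ChunkerConfig::default()).await;
let chunks: Vec<niblits::ProjectChunk> = stream.try_collect().await?;
println!("collected {} chunks", chunks.len());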
use niblits::{walk_project, WalkOptions};
use futures::StreamExt;

let mut stream = walk_project(
    "./my-project",
    WalkOptions {
        max_chunk_size: 1000,
        overlap_percentage: 0.2,
        ..Default::default()
    },
);

while let Some(result) = stream.next().await {
    let chunk = result?;
    println!("{} -> {:?}", chunk.file_path, chunk.chunk);
}
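walker_includes_path from the API reference above can pre-filter candidate paths, for example before feeding them to walk_files. A minimal sketch, assuming it takes (project_root, path, max_file_size) as listed and returns a bool:

use niblits::walker_includes_path;

// Skip anything the walker would ignore anyway (ignore rules, oversized files).
// The 1 MiB limit here is an illustrative value, not a crate default.
let max_file_size = 1024 * 1024;
if walker_includes_path("./my-project", "./my-project/src/lib.rs", max_file_size) {
    println!("src/lib.rs would be chunked");
}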
// Using HuggingFace tokenizer
let config = ChunkerConfig {
    tokenizer: Tokenizer::HuggingFace("bert-base-uncased".to_string()),
    ..Default::default()
};

// Using characters for simple cases
let config = ChunkerConfig {
    tokenizer: Tokenizer::Characters,
    max_chunk_size: 500,
    overlap_percentage: 0.1,
};
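To sanity-check emitted chunks against the token budget, you can count tokens with tiktoken-rs directly, using the same encoding name the config refers to. A minimal sketch; it assumes the chunker counts tokens the same way, and the sample text and 1000-token budget are illustrative:

use tiktoken_rs::cl100k_base;

// Count tokens with the cl100k_base encoding named in the config above.
let bpe = cl100k_base()?;
let text = "fn main() { println!(\"Hello, world!\"); }";
let token_count = bpe.encode_with_special_tokens(text).len();
assert!(token_count <= 1000, "chunk exceeds the max_chunk_size budget");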
src/
├── lib.rs # Public API and main exports
├── types.rs # Core data structures and error types
├── chunker/ # Format-specific chunkers
│ ├── code.rs # Language-aware code chunking
│ ├── text.rs # Plain text chunking
│ ├── markdown.rs # Markdown-aware chunking
│ ├── html.rs # HTML-aware chunking
│ ├── pdf.rs # PDF processing
│ └── docx.rs # Word document processing
├── languages.rs # Language support utilities
├── grammars.rs # Tree-sitter grammar management
└── grammar_loader.rs # Dynamic grammar loading
mise build # Build the workspace
mise build:rust # Rust-only build
mise test # All tests
mise test:rust # Crate tests only
Key dependencies:
- text-splitter: Core splitting logic with tokenization support
- tree-sitter: Code parsing for semantic chunking
- tiktoken-rs: OpenAI tokenizer implementation
- tokenizers: HuggingFace tokenizer support
- oxidize-pdf: PDF text extraction
- docx-parser: Word document parsing
- htmd: HTML processing
- palate: Language detection