| Crates.io | wg-ragsmith |
| lib.rs | wg-ragsmith |
| version | 0.1.2 |
| created_at | 2025-10-14 21:08:02.852667+00 |
| updated_at | 2025-11-16 19:13:20.889966+00 |
| description | Semantic chunking and RAG utilities for document processing and retrieval-augmented generation. |
| homepage | https://github.com/Idleness76/weavegraph |
| repository | https://github.com/Idleness76/weavegraph |
| max_upload_size | |
| id | 1883195 |
| size | 305,678 |
Semantic chunking and RAG utilities for document processing and retrieval-augmented generation.
wg-ragsmith provides high-performance semantic chunking algorithms and vector storage utilities designed for building RAG (Retrieval-Augmented Generation) applications. It supports multiple document formats (HTML, JSON, plain text) and integrates with popular embedding providers.
⚠️ EARLY BETA WARNING
This crate is in early development (v0.1.x). APIs are unstable and will change between minor versions.
Breaking changes may arrive without fanfare. Pin exact versions in production, and check release notes carefully before upgrading.
That said, the core algorithms work; just expect some assembly required.
Add wg-ragsmith to your Cargo.toml:
[dependencies]
wg-ragsmith = "0.1"
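Given the pre-1.0 instability noted above, you may prefer to pin the exact version; the quick start below also assumes tokio as the async runtime:

[dependencies]
wg-ragsmith = "=0.1.2"
tokio = { version = "1", features = ["full"] }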
Quick start, using the built-in mock embedding provider:

use wg_ragsmith::semantic_chunking::service::{SemanticChunkingService, ChunkDocumentRequest, ChunkSource};
use wg_ragsmith::semantic_chunking::embeddings::MockEmbeddingProvider;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a chunking service with mock embeddings
    let service = SemanticChunkingService::builder()
        .with_embedding_provider(MockEmbeddingProvider)
        .build()?;

    // Chunk an HTML document
    let html_content = r#"
        <html><body>
          <h1>Introduction</h1>
          <p>This is a sample document for chunking.</p>
          <h2>Section 1</h2>
          <p>More content here with detailed information.</p>
        </body></html>
    "#;

    let request = ChunkDocumentRequest::new(ChunkSource::Html(html_content.to_string()));
    let response = service.chunk_document(request).await?;

    println!("Created {} chunks", response.outcome.chunks.len());
    for chunk in &response.outcome.chunks {
        // Print the first 50 characters of each chunk and its token count
        println!("Chunk: {} ({} tokens)", chunk.content.chars().take(50).collect::<String>(), chunk.tokens);
    }

    Ok(())
}
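JSON and plain-text inputs should follow the same request/response pattern. A minimal sketch that would slot into the quick start's main; only ChunkSource::Html appears in this README, so the Json variant name here is an assumption:

// Chunk a JSON document the same way (variant name assumed, not confirmed by this README)
let json_request = ChunkDocumentRequest::new(ChunkSource::Json(r#"{"title": "Introduction", "body": "Sample"}"#.to_string()));
let json_response = service.chunk_document(json_request).await?;
println!("Created {} chunks from JSON", json_response.outcome.chunks.len());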
Chunks can then be persisted and queried through the SQLite-backed vector store:

use wg_ragsmith::stores::sqlite::SqliteChunkStore;
use wg_ragsmith::ingestion::chunk_response_to_ingestion;
use std::sync::Arc;

// Set up vector store
let store = Arc::new(SqliteChunkStore::new("chunks.db").await?);

// Store chunks (`response` comes from the previous example;
// `url` is the source URL of the chunked document)
let url = "https://example.com/doc".to_string();
let ingestion = chunk_response_to_ingestion(&url, response)?;
store.store_batch(&ingestion.batch).await?;

// Search for similar content
let query_embedding = vec![0.1, 0.2, 0.3]; // Your query embedding
let results = store.search_similar(&query_embedding, 5).await?;
for result in results {
    println!("Found: {} (score: {})", result.content, result.score);
}
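In practice the query vector should come from the same embedding provider used at chunking time, not a hand-written literal. A sketch using the EmbeddingProvider trait shown later in this README, assuming `provider` implements it and that its error type converts into your function's error:

// Embed the query text with the same provider used for chunking
let query = "detailed information".to_string();
let query_embedding = provider.embed(&[query]).await?
    .into_iter()
    .next()
    .expect("provider returned no embedding");
let results = store.search_similar(&query_embedding, 5).await?;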
Feature flags:

- semantic-chunking-tiktoken (default): Enable OpenAI tiktoken-based tokenization
- semantic-chunking-rust-bert: Enable Rust BERT integration for advanced NLP
- semantic-chunking-segtok: Enable segtok sentence segmentation

Key types:

- SemanticChunkingService: Main entry point for document processing
- HtmlSemanticChunker: HTML-specific chunking with DOM awareness
- JsonSemanticChunker: JSON document processing with structural preservation
- SqliteChunkStore: Vector storage with SQLite backend

To use a real embedding provider, wire one in through the builder, e.g. OpenAI via rig:

use rig::providers::openai::Client as OpenAIClient;
use wg_ragsmith::semantic_chunking::service::SemanticChunkingService;
let openai_client = OpenAIClient::new("your-api-key");
let service = SemanticChunkingService::builder()
    .with_embedding_provider(openai_client.embedding_model("text-embedding-ada-002"))
    .build()?;
use wg_ragsmith::semantic_chunking::embeddings::{EmbeddingError, EmbeddingProvider, SharedEmbeddingProvider};

struct MyEmbeddingProvider;

#[async_trait::async_trait]
impl EmbeddingProvider for MyEmbeddingProvider {
    async fn embed(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError> {
        // Your custom embedding logic
        todo!()
    }

    fn identify(&self) -> &'static str {
        "my-provider"
    }
}
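The custom provider then drops into the same builder used in the earlier examples:

// Build a service around the custom provider (mirrors the mock and OpenAI examples above)
let service = SemanticChunkingService::builder()
    .with_embedding_provider(MyEmbeddingProvider)
    .build()?;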
Extensive configuration options for tuning chunking behavior:
use wg_ragsmith::semantic_chunking::config::{ChunkingConfig, BreakpointStrategy};
let config = ChunkingConfig {
    strategy: BreakpointStrategy::Percentile { threshold: 0.9 },
    max_tokens: 512,
    min_tokens: 32,
    batch_size: 16,
    // ... more options; fill the rest from defaults (assuming ChunkingConfig implements Default)
    ..Default::default()
};
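Presumably the config is handed to the service builder; the method name below is a hypothetical stand-in, so check the crate docs for the actual builder API:

// Hypothetical hookup; `with_config` is an assumed method name, not confirmed by this README
let service = SemanticChunkingService::builder()
    .with_embedding_provider(MockEmbeddingProvider)
    .with_config(config)
    .build()?;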
Contributions welcome! Please see the main Weavegraph repository for contribution guidelines.
Licensed under the MIT License. See LICENSE for details.