| Crates.io | langextract-rust |
| lib.rs | langextract-rust |
| version | 0.4.3 |
| created_at | 2025-08-15 07:32:53.68695+00 |
| updated_at | 2025-08-31 17:04:42.13128+00 |
| description | A Rust library for extracting structured and grounded information from text using LLMs |
| homepage | |
| repository | https://github.com/modularflow/langextract-rust |
| max_upload_size | |
| id | 1796353 |
| size | 982,208 |
A powerful Rust library for extracting structured and grounded information from text using Large Language Models (LLMs).
LangExtract processes unstructured text and extracts specific information with precise character-level alignment, making it perfect for document analysis, research paper processing, product catalogs, and more.
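For illustration, each grounded extraction pairs a class and the extracted text with the character span it came from. A single result might look like this (the char_interval field names here are an assumption, shown only for illustration):
{
  "extraction_class": "person",
  "extraction_text": "John Doe",
  "char_interval": { "start_pos": 0, "end_pos": 8 }
}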
Linux/macOS (Auto-detect best method):
curl -fsSL https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.sh | bash
Windows (PowerShell):
iwr -useb https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.ps1 | iex
From crates.io (requires Rust):
cargo install langextract-rust --features cli
Pre-built binaries (no Rust required):
# Download from GitHub releases
curl -fsSL https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.sh | bash -s -- --prebuilt
Homebrew (macOS/Linux - coming soon):
brew install modularflow/tap/lx-rs
From source:
git clone https://github.com/modularflow/langextract-rust
cd langextract-rust
cargo install --path . --features cli
# Initialize configuration (provider required)
lx-rs init --provider ollama
# Extract from text (provider required)
lx-rs extract "John Doe is 30 years old" --prompt "Extract names and ages" --provider ollama
# Test your setup
lx-rs test --provider ollama
# Process files
lx-rs extract document.txt --examples examples.json --export html --provider ollama
# Check available providers
lx-rs providers
Add this to your Cargo.toml:
[dependencies]
langextract-rust = "0.4.3"
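The quick-start example below also uses tokio as the async runtime and serde_json to pass provider parameters; if they are not already in your project, add them as well (versions here are indicative):
tokio = { version = "1", features = ["full"] }
serde_json = "1"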
use langextract::{
extract, ExtractConfig, FormatType,
data::{ExampleData, Extraction},
providers::ProviderConfig,
};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Set up examples to guide extraction
let examples = vec![
ExampleData::new(
"John Doe is 30 years old and works as a doctor".to_string(),
vec![
Extraction::new("person".to_string(), "John Doe".to_string()),
Extraction::new("age".to_string(), "30".to_string()),
Extraction::new("profession".to_string(), "doctor".to_string()),
],
)
];
// Configure for Ollama
let provider_config = ProviderConfig::ollama("mistral", None);
let config = ExtractConfig {
model_id: "mistral".to_string(),
format_type: FormatType::Json,
max_char_buffer: 8000,
max_workers: 6,
batch_length: 4,
temperature: 0.3,
model_url: Some("http://localhost:11434".to_string()),
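// Pass the provider configuration through to the model via language_model_params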
language_model_params: {
let mut params = std::collections::HashMap::new();
params.insert("provider_config".to_string(), serde_json::to_value(&provider_config)?);
params
},
debug: true,
..Default::default()
};
// Extract information
let result = extract(
"Alice Smith is 25 years old and works as a doctor. Bob Johnson is 35 and is an engineer.",
Some("Extract person names, ages, and professions from the text"),
&examples,
config,
).await?;
println!("โ
Extracted {} items", result.extraction_count());
// Show extractions with character positions
if let Some(extractions) = &result.extractions {
for extraction in extractions {
println!("โข [{}] '{}' at {:?}",
extraction.extraction_class,
extraction.extraction_text,
extraction.char_interval
);
}
}
Ok(())
}
The CLI provides a powerful interface for text extraction without writing code.
# Linux/macOS
curl -fsSL https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.sh | bash
# Windows PowerShell
iwr -useb https://raw.githubusercontent.com/modularflow/langextract-rust/main/install.ps1 | iex
# From crates.io with CLI features
cargo install langextract-rust --features cli
# Or clone and build
git clone https://github.com/modularflow/langextract-rust
cd langextract-rust
cargo install --path . --features cli
Extract structured information from text, files, or URLs:
# Basic extraction
lx-rs extract "Alice Smith is 25 years old" --prompt "Extract names and ages" --provider ollama
# From file with custom examples
lx-rs extract document.txt \
--examples my_examples.json \
--output results.json \
--export html \
--provider ollama
# With specific provider and model
lx-rs extract text.txt \
--provider ollama \
--model mistral \
--workers 8 \
--multipass
# From URL
lx-rs extract "https://example.com/article.html" \
--prompt "Extract key facts" \
--format yaml \
--provider openai
# Advanced options
lx-rs extract large_document.txt \
--examples patterns.json \
--provider openai \
--model gpt-4o \
--max-chars 12000 \
--workers 10 \
--batch-size 6 \
--temperature 0.1 \
--multipass \
--passes 3 \
--export html \
--show-intervals \
--verbose
# Initialize configuration files (provider required)
lx-rs init --provider ollama
# Initialize for OpenAI provider
lx-rs init --provider openai
# Force overwrite existing configs
lx-rs init --provider ollama --force
# Test provider connectivity (provider required)
lx-rs test --provider ollama
lx-rs test --provider ollama --model mistral
lx-rs test --provider openai --api-key your_key
# List available providers and models
lx-rs providers
# Show example configurations
lx-rs examples
# Get help
lx-rs --help
lx-rs extract --help
# Convert between formats
lx-rs convert results.json --output report.html --format html
lx-rs convert data.json --output summary.csv --format csv
The CLI supports configuration files for easier management:
[
{
"text": "Dr. Sarah Johnson works at Mayo Clinic in Rochester, MN",
"extractions": [
{"extraction_class": "person", "extraction_text": "Dr. Sarah Johnson"},
{"extraction_class": "organization", "extraction_text": "Mayo Clinic"},
{"extraction_class": "location", "extraction_text": "Rochester, MN"}
]
}
]
# Provider API keys
OPENAI_API_KEY=your_openai_key_here
GEMINI_API_KEY=your_gemini_key_here
# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434
# Default configuration
model: "mistral"
provider: "ollama"
model_url: "http://localhost:11434"
temperature: 0.3
max_char_buffer: 8000
max_workers: 6
batch_length: 4
multipass: false
extraction_passes: 1
# Academic papers
lx-rs extract research_paper.pdf \
--prompt "Extract authors, institutions, key findings, and methodology" \
--examples academic_examples.json \
--export html \
--show-intervals
# Legal documents
lx-rs extract contract.txt \
--prompt "Extract parties, dates, obligations, and key terms" \
--provider openai \
--model gpt-4o \
--temperature 0.1
# Product catalogs
lx-rs extract catalog.txt \
--prompt "Extract product names, prices, descriptions, and specs" \
--multipass \
--passes 2 \
--export csv
# Contact information
lx-rs extract directory.txt \
--prompt "Extract names, emails, phone numbers, and addresses" \
--format yaml \
--show-intervals
# Process multiple files
for file in documents/*.txt; do
lx-rs extract "$file" \
--examples patterns.json \
--output "results/$(basename "$file" .txt).json"
done
# URL processing
lx-rs extract "https://news.site.com/article" \
--prompt "Extract headline, author, date, and key points" \
--export html
# Install and start Ollama
ollama serve
ollama pull mistral
# Test connection
lx-rs test --provider ollama --model mistral
# Set API key
export OPENAI_API_KEY="your-key-here"
# Test connection
lx-rs test --provider openai --model gpt-4o-mini
# Set API key
export GEMINI_API_KEY="your-key-here"
# Test connection
lx-rs test --provider gemini --model gemini-2.5-flash
# High-performance extraction: more parallel workers, larger batches,
# an optimal chunk size, local inference, and a low temperature for consistent results
langextract-rust extract large_file.txt \
--workers 12 \
--batch-size 8 \
--max-chars 10000 \
--provider ollama \
--temperature 0.2
# Memory-efficient processing: smaller chunks, fewer workers, smaller batches
langextract-rust extract huge_file.txt \
--max-chars 6000 \
--workers 4 \
--batch-size 2
# Verbose output for debugging
langextract-rust extract text.txt --verbose --debug
# Test specific provider
langextract-rust test --provider ollama --verbose
# Check installation
langextract-rust --version
langextract-rust providers
# Reset configuration
langextract-rust init --force
use langextract::{ValidationConfig, ValidationResult};
// Enable advanced validation
let validation_config = ValidationConfig {
enable_schema_validation: true,
enable_type_coercion: true,
save_raw_output: true,
validate_required_fields: true,
raw_output_dir: Some("./raw_outputs".to_string()),
..Default::default()
};
// Automatic type coercion handles:
// - Currencies: "$1,234.56" → 1234.56
// - Percentages: "95.5%" → 0.955
// - Booleans: "true", "yes", "1" → true
// - Numbers: "42" → 42, "3.14" → 3.14
// - Emails, phones, URLs, dates
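As a standalone illustration of these coercion rules (a sketch only, not the library's internal implementation), the mappings above could be expressed like this:
fn coerce(raw: &str) -> serde_json::Value {
    let s = raw.trim();
    // Currencies: "$1,234.56" → 1234.56
    if let Some(rest) = s.strip_prefix('$') {
        if let Ok(n) = rest.replace(',', "").parse::<f64>() {
            return serde_json::json!(n);
        }
    }
    // Percentages: "95.5%" → 0.955
    if let Some(rest) = s.strip_suffix('%') {
        if let Ok(n) = rest.parse::<f64>() {
            return serde_json::json!(n / 100.0);
        }
    }
    // Booleans: "true", "yes", "1" → true (checked before plain numbers)
    match s.to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" => return serde_json::json!(true),
        "false" | "no" | "0" => return serde_json::json!(false),
        _ => {}
    }
    // Numbers: "42" → 42, "3.14" → 3.14
    if let Ok(i) = s.parse::<i64>() {
        return serde_json::json!(i);
    }
    if let Ok(f) = s.parse::<f64>() {
        return serde_json::json!(f);
    }
    // Emails, phones, URLs, and dates would need dedicated parsing;
    // everything else is kept as a string
    serde_json::json!(s)
}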
use langextract::visualization::{export_document, ExportConfig, ExportFormat};
// Export to interactive HTML
let html_config = ExportConfig {
format: ExportFormat::Html,
title: Some("Document Analysis".to_string()),
highlight_extractions: true,
show_char_intervals: true,
include_statistics: true,
..Default::default()
};
let html_output = export_document(&annotated_doc, &html_config)?;
std::fs::write("analysis.html", html_output)?;
// Also supports Markdown, JSON, and CSV exports
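For example, assuming the other ExportFormat variants follow the same naming as Html above (Csv here is an assumption, not confirmed from the API), a CSV export would look like:
let csv_config = ExportConfig {
    format: ExportFormat::Csv,
    ..Default::default()
};
let csv_output = export_document(&annotated_doc, &csv_config)?;
std::fs::write("analysis.csv", csv_output)?;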
use langextract::providers::ProviderConfig;
// OpenAI configuration
let openai_config = ProviderConfig::openai("gpt-4o-mini", Some(api_key));
// Ollama configuration
let ollama_config = ProviderConfig::ollama("mistral", Some("http://localhost:11434".to_string()));
// Custom HTTP API
let custom_config = ProviderConfig::custom("https://my-api.com/v1", "my-model");
# Extract product information from catalogs
./test_product_extraction.sh
# Extract research information from papers
./test_academic_extraction.sh
# Test with multiple LLM providers
./test_providers.sh
| Provider | Models | Features | Use Case |
|---|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo | High accuracy, JSON mode | Production applications |
| Ollama | mistral, llama2, codellama, qwen | Local, privacy-first | Development, sensitive data |
| Custom | Any OpenAI-compatible API | Flexible integration | Custom deployments |
# For OpenAI
export OPENAI_API_KEY="your-openai-key"
# For Ollama (local)
ollama serve
ollama pull mistral
# For custom providers
export CUSTOM_API_KEY="your-key"
The ExtractConfig struct provides fine-grained control over extraction performance:
let config = ExtractConfig {
model_id: "mistral".to_string(),
temperature: 0.3, // Lower = more consistent
max_char_buffer: 8000, // Chunk size for large documents
batch_length: 6, // Chunks per batch
max_workers: 8, // Parallel workers (key for speed!)
extraction_passes: 1, // Multiple passes for better recall
enable_multipass: false, // Advanced multi-pass extraction
multipass_min_extractions: 5, // Minimum extractions to avoid re-processing
multipass_quality_threshold: 0.8, // Quality threshold for keeping extractions
debug: true, // Enable debug information
..Default::default()
};
See PERFORMANCE_TUNING.md for a detailed optimization guide.
Perfect for processing contracts, research papers, or reports:
let examples = vec![
ExampleData::new(
"Dr. Sarah Johnson (contact: s.johnson@mayo.edu) works at Mayo Clinic in Rochester, MN since 2019".to_string(),
vec![
Extraction::new("person".to_string(), "Dr. Sarah Johnson".to_string()),
Extraction::new("email".to_string(), "s.johnson@mayo.edu".to_string()),
Extraction::new("institution".to_string(), "Mayo Clinic".to_string()),
Extraction::new("location".to_string(), "Rochester, MN".to_string()),
Extraction::new("year".to_string(), "2019".to_string()),
],
)
];
The library handles large documents automatically with intelligent chunking:
// Configure for academic papers or catalogs
let config = ExtractConfig {
max_char_buffer: 8000, // Optimal chunk size
max_workers: 8, // High parallelism
batch_length: 6, // Process multiple chunks per batch
enable_multipass: true, // Multiple extraction rounds
multipass_min_extractions: 3,
multipass_quality_threshold: 0.8,
debug: true, // See processing details
..Default::default()
};
The library provides comprehensive error types:
use langextract::LangExtractError;
match extract(/* ... */).await {
Ok(result) => println!("Success: {} extractions", result.extraction_count()),
Err(LangExtractError::ConfigurationError(msg)) => {
eprintln!("Configuration issue: {}", msg);
}
Err(LangExtractError::InferenceError { message, provider, .. }) => {
eprintln!("Inference failed ({}): {}", provider.unwrap_or("unknown"), message);
}
Err(LangExtractError::NetworkError(e)) => {
eprintln!("Network error: {}", e);
}
Err(e) => eprintln!("Other error: {}", e),
}
This Rust implementation provides a complete, production-ready text extraction system.
Run the included test scripts to explore LangExtract capabilities:
# Test with product catalogs
./test_product_extraction.sh
# Test with academic papers
./test_academic_extraction.sh
# Test multiple LLM providers
./test_providers.sh
Each test generates interactive HTML reports, structured JSON data, and CSV exports for analysis.
We welcome contributions!
Licensed under the Apache License, Version 2.0. See LICENSE for details. For health-related applications, use of LangExtract is also subject to the Health AI Developer Foundations Terms of Use.
This work builds upon research and implementations from the broader NLP and information extraction community:
@misc{langextract,
title={langextract},
author={Google Research Team},
year={2024},
publisher={GitHub},
url={https://github.com/google/langextract}
}
Acknowledgments: