| Crates.io | anno |
| lib.rs | anno |
| version | 0.2.0 |
| created_at | 2025-11-26 21:50:24.368844+00 |
| updated_at | 2025-12-02 14:07:48.503255+00 |
| description | Unified API for NER, coreference resolution, and evaluation in Rust. Zero-dependency baselines, ML backends (BERT/GLiNER), and comprehensive evaluation framework. |
| homepage | https://docs.rs/anno |
| repository | https://github.com/arclabs561/anno |
| max_upload_size | |
| id | 1952424 |
| size | 5,938,176 |
Information extraction for Rust: NER, coreference resolution, and evaluation.
Unified API for named entity recognition, coreference resolution, and evaluation. Swap between regex patterns (~400ns), transformer models (~50-150ms), and zero-shot NER without changing your code.
Key features: zero-dependency baselines (`RegexNER`, `HeuristicNER`) for fast iteration, optional ML backends (BERT/GLiNER), and a comprehensive evaluation framework. Dual-licensed under MIT or Apache-2.0.
```sh
cargo add anno
```
Extract entity spans from text. Each entity includes the matched text, its type, and character offsets:
```rust
use anno::{Model, RegexNER};

let ner = RegexNER::new();
let entities = ner.extract_entities("Contact alice@acme.com by Jan 15", None)?;
for e in &entities {
    println!("{}: \"{}\" [{}, {})", e.entity_type.as_label(), e.text, e.start, e.end);
}
// Output:
// EMAIL: "alice@acme.com" [8, 22)
// DATE: "Jan 15" [26, 32)
```
Note: All examples use ? for error handling. In production, handle Result types appropriately.
RegexNER detects structured entities via regex: dates, times, money, percentages, emails, URLs, phone numbers. It won't find "John Smith" or "Apple Inc." — those require context, not patterns.
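The `[start, end)` offsets in the output above are half-open. A quick stdlib check of the convention (assuming byte offsets, which coincide with character offsets for ASCII text):

```rust
// Illustrative only: verifies the half-open [start, end) spans shown above
// by slicing the original input (byte offsets assumed; chars == bytes on ASCII).
fn slice_span(text: &str, start: usize, end: usize) -> &str {
    &text[start..end]
}

fn main() {
    let text = "Contact alice@acme.com by Jan 15";
    assert_eq!(slice_span(text, 8, 22), "alice@acme.com");
    assert_eq!(slice_span(text, 26, 32), "Jan 15");
}
```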
For person names, organizations, and locations, use StackedNER, which combines patterns with heuristics:
```rust
use anno::{Model, StackedNER};

let ner = StackedNER::default();
let entities = ner.extract_entities("Sarah Chen joined Microsoft in Seattle", None)?;
```

This prints:

```text
PER: "Sarah Chen" [0, 10)
ORG: "Microsoft" [18, 27)
LOC: "Seattle" [31, 38)
```
This requires no model downloads and runs in ~100μs, but accuracy varies by domain.
StackedNER is composable: you can add ML backends on top of the default pattern + heuristic layers for better accuracy while keeping fast structured-entity extraction:
```rust
#[cfg(feature = "onnx")]
use anno::{GLiNEROnnx, HeuristicNER, RegexNER, StackedNER};

// ML-first: GLiNER runs first, then patterns fill gaps
let ner = StackedNER::with_ml_first(
    Box::new(GLiNEROnnx::new("onnx-community/gliner_small-v2.1")?)
);

// Or ML-fallback: patterns/heuristics first, ML as fallback
let ner = StackedNER::with_ml_fallback(
    Box::new(GLiNEROnnx::new("onnx-community/gliner_small-v2.1")?)
);

// Or custom stack with builder
let ner = StackedNER::builder()
    .layer(RegexNER::new())      // High-precision structured entities
    .layer(HeuristicNER::new())  // Quick named entities
    .layer_boxed(Box::new(GLiNEROnnx::new("onnx-community/gliner_small-v2.1")?)) // ML fallback
    .build();
```
For standalone ML backends, enable the onnx feature:
```rust
#[cfg(feature = "onnx")]
use anno::{Model, BertNEROnnx};

#[cfg(feature = "onnx")]
let ner = BertNEROnnx::new(anno::DEFAULT_BERT_ONNX_MODEL)?;
#[cfg(feature = "onnx")]
let entities = ner.extract_entities("Marie Curie discovered radium in 1898", None)?;
```
Note: ML backends (BERT, GLiNER, etc.) download models on first run; models are cached locally after download.
To download models ahead of time:
```sh
# Download all models (ONNX + Candle)
cargo run --example download_models --features "onnx,candle"

# Download only ONNX models
cargo run --example download_models --features onnx
```
This pre-warms the cache so models are ready for offline use or faster first runs.
Supervised NER models only recognize entity types seen during training. GLiNER uses a bi-encoder architecture that lets you specify entity types at inference time:
```rust
#[cfg(feature = "onnx")]
use anno::GLiNEROnnx;

#[cfg(feature = "onnx")]
let ner = GLiNEROnnx::new("onnx-community/gliner_small-v2.1")?;

// Extract domain-specific entities without retraining
#[cfg(feature = "onnx")]
let entities = ner.extract(
    "Patient presents with diabetes, prescribed metformin 500mg",
    &["disease", "medication", "dosage"],
    0.5, // confidence threshold
)?;
```
This is slower (~100ms) but supports arbitrary entity schemas.
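The threshold argument controls how aggressively candidate spans are pruned. A minimal sketch of that filtering step, using a hypothetical `Candidate` type (not anno's actual entity struct):

```rust
// Hypothetical sketch of confidence-threshold filtering as applied by a
// zero-shot backend: candidate spans scoring below the threshold are dropped.
struct Candidate {
    text: &'static str,
    label: &'static str,
    score: f32,
}

fn filter_by_threshold(cands: Vec<Candidate>, threshold: f32) -> Vec<Candidate> {
    cands.into_iter().filter(|c| c.score >= threshold).collect()
}

fn main() {
    let cands = vec![
        Candidate { text: "diabetes", label: "disease", score: 0.91 },
        Candidate { text: "metformin", label: "medication", score: 0.84 },
        Candidate { text: "presents", label: "medication", score: 0.22 }, // spurious
    ];
    let kept = filter_by_threshold(cands, 0.5);
    assert_eq!(kept.len(), 2);
    assert_eq!(kept[0].label, "disease");
}
```

Raising the threshold trades recall for precision; 0.5 is a common starting point.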
GLiNER2 extends GLiNER with multi-task capabilities. Extract entities, classify text, and extract hierarchical structures in a single forward pass:
```rust
#[cfg(feature = "onnx")]
use anno::backends::gliner2::{GLiNER2Onnx, TaskSchema};

#[cfg(feature = "onnx")]
let model = GLiNER2Onnx::from_pretrained(anno::DEFAULT_GLINER2_MODEL)?;
// DEFAULT_GLINER2_MODEL is "onnx-community/gliner-multitask-large-v0.5"
// Alternative: "fastino/gliner2-base-v1" (if available)

#[cfg(feature = "onnx")]
let schema = TaskSchema::new()
    .with_entities(&["person", "organization", "product"])
    .with_classification("sentiment", &["positive", "negative", "neutral"], false); // false = single-label

#[cfg(feature = "onnx")]
let result = model.extract("Apple announced iPhone 15", &schema)?;
// result.entities: [Apple/organization, iPhone 15/product]
// result.classifications["sentiment"].labels: ["positive"]
```
GLiNER2 supports zero-shot NER, text classification, and structured extraction. See the GLiNER2 paper (arxiv:2507.18546) for details.
Extract entities and relations, then export to knowledge graphs for RAG applications:
```rust
use anno::{Model, StackedNER};
use anno::backends::inference::RelationExtractor;
use anno::backends::tplinker::TPLinker;
use anno::entity::Relation;
use anno::graph::GraphDocument;

let text = "Steve Jobs founded Apple in 1976. The company is headquartered in Cupertino.";

// Extract entities
let ner = StackedNER::default();
let entities = ner.extract_entities(text, None)?;

// Extract relations between entities.
// Note: TPLinker is currently a placeholder implementation using heuristics.
// For production, consider GLiNER2, which supports relation extraction via ONNX.
let rel_extractor = TPLinker::new()?;
let result = rel_extractor.extract_with_relations(
    text,
    &["person", "organization", "location", "date"],
    &["founded", "headquartered_in", "founded_in"],
    0.5,
)?;

// Convert relations to graph format
let relations: Vec<Relation> = result.relations.iter().map(|r| {
    let head = &result.entities[r.head_idx];
    let tail = &result.entities[r.tail_idx];
    Relation::new(head.clone(), tail.clone(), r.relation_type.clone(), r.confidence)
}).collect();

// Build graph document (deduplicates via coreference if provided)
let graph = GraphDocument::from_extraction(&result.entities, &relations, None);

// Export to Neo4j Cypher: creates nodes for entities and edges for relations
println!("{}", graph.to_cypher());

// Or NetworkX JSON for Python
println!("{}", graph.to_networkx_json());
```
This creates a knowledge graph with nodes for entities and edges for relations, ready to load into Neo4j or NetworkX for RAG applications.
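To make the export concrete, here is a sketch of turning (head, relation, tail) triples into Neo4j Cypher `MERGE` statements. This illustrates the shape of the output; anno's actual `to_cypher` format may differ:

```rust
// Illustrative sketch: (head, relation, tail) triples rendered as Cypher
// MERGE statements, so re-running the import is idempotent.
fn triples_to_cypher(triples: &[(&str, &str, &str)]) -> String {
    let mut out = String::new();
    for (head, rel, tail) in triples {
        out.push_str(&format!(
            "MERGE (h:Entity {{name: \"{}\"}})\nMERGE (t:Entity {{name: \"{}\"}})\nMERGE (h)-[:{}]->(t)\n",
            head, tail, rel.to_uppercase()
        ));
    }
    out
}

fn main() {
    let cypher = triples_to_cypher(&[("Steve Jobs", "founded", "Apple")]);
    assert!(cypher.contains("MERGE (h)-[:FOUNDED]->(t)"));
    println!("{cypher}");
}
```

`MERGE` (rather than `CREATE`) avoids duplicate nodes when the same entity appears in multiple relations.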
The grounded module provides a hierarchy for entity representation that unifies text NER and visual detection:
```rust
use anno::grounded::{GroundedDocument, Signal, Track, Identity, Location};

// Create a document with the Signal → Track → Identity hierarchy
let mut doc = GroundedDocument::new("doc1", "Marie Curie won the Nobel Prize. She was a physicist.");

// Level 1: Signals (raw detections)
let s1 = doc.add_signal(Signal::new(0, Location::text(0, 11), "Marie Curie", "Person", 0.95));
let s2 = doc.add_signal(Signal::new(1, Location::text(33, 36), "She", "Person", 0.88));

// Level 2: Tracks (within-document coreference)
let mut track = Track::new(0, "Marie Curie");
track.add_signal(s1, 0);
track.add_signal(s2, 1);
let track_id = doc.add_track(track);

// Level 3: Identities (knowledge base linking)
let identity = Identity::from_kb(0, "Marie Curie", "wikidata", "Q7186");
let identity_id = doc.add_identity(identity);
doc.link_track_to_identity(track_id, identity_id);

// Traverse the hierarchy
for signal in doc.signals() {
    if let Some(identity) = doc.identity_for_signal(signal.id) {
        println!("{} → {}", signal.surface, identity.canonical_name);
    }
}
```
The same Location type works for text spans, bounding boxes, and other modalities. See examples/grounded.rs for a complete walkthrough.
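The three-level lookup can be modeled with plain maps. This is a deliberately simplified sketch of the traversal (the real anno types also carry locations, confidences, and modalities):

```rust
use std::collections::HashMap;

// Simplified model of the Signal → Track → Identity resolution described
// above: a signal resolves to an identity via its track, if both links exist.
fn identity_for_signal(
    signal_to_track: &HashMap<u32, u32>,
    track_to_identity: &HashMap<u32, u32>,
    identities: &HashMap<u32, &'static str>,
    signal: u32,
) -> Option<&'static str> {
    let track = signal_to_track.get(&signal)?;
    let identity = track_to_identity.get(track)?;
    identities.get(identity).copied()
}

fn main() {
    // Signals 0 ("Marie Curie") and 1 ("She") both belong to track 0,
    // which is linked to identity 0 (wikidata:Q7186).
    let signal_to_track: HashMap<u32, u32> = HashMap::from([(0, 0), (1, 0)]);
    let track_to_identity: HashMap<u32, u32> = HashMap::from([(0, 0)]);
    let identities: HashMap<u32, &'static str> = HashMap::from([(0, "Marie Curie (wikidata:Q7186)")]);

    assert_eq!(
        identity_for_signal(&signal_to_track, &track_to_identity, &identities, 1),
        Some("Marie Curie (wikidata:Q7186)")
    );
}
```

Because each level is an explicit link, a signal with no track (or a track with no identity) simply resolves to `None` rather than a wrong answer.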
| Backend | Use Case | Latency | Accuracy | Feature | When to Use |
|---|---|---|---|---|---|
| `RegexNER` | Structured entities (dates, money, emails) | ~400ns | ~95%* | always | Fast structured data extraction |
| `HeuristicNER` | Person/Org/Location via heuristics | ~50μs | ~65% | always | Quick baseline, no dependencies |
| `StackedNER` | Composable layered extraction | ~100μs | varies | always | Combine patterns + heuristics + ML backends |
| `BertNEROnnx` | High-quality NER (fixed types) | ~50ms | ~86% | onnx | Standard 4-type NER (PER/ORG/LOC/MISC) |
| `GLiNEROnnx` | Zero-shot NER (custom types) | ~100ms | ~92% | onnx | Recommended: custom entity types without retraining |
| `NuNER` | Zero-shot NER (token-based) | ~100ms | ~86% | onnx | Alternative zero-shot approach |
| `W2NER` | Nested/discontinuous NER | ~150ms | ~85% | onnx | Overlapping or non-contiguous entities |
| `CandleNER` | Pure Rust BERT NER | varies | ~86% | candle | Rust-native, no ONNX dependency |
| `GLiNERCandle` | Pure Rust zero-shot NER | varies | ~90% | candle | Rust-native zero-shot (requires model conversion) |
| `GLiNER2` | Multi-task (NER + classification) | ~130ms | ~92% | onnx/candle | Joint NER + text classification |
*Pattern accuracy on structured entities only
Quick selection guide:
- Structured entities: `RegexNER`; general use: `StackedNER`
- Zero-shot: `GLiNEROnnx`; fixed types: `BertNEROnnx`
- Custom entity types without retraining: `GLiNEROnnx`
- No model downloads: `StackedNER` (patterns + heuristics)
- Best of both: `StackedNER::with_ml_first()` or `with_ml_fallback()` to combine ML accuracy with pattern speed

Known limitations:
- The default W2NER model (`ljynlp/w2ner-bert-base`) requires HuggingFace authentication. You may need to authenticate with `huggingface-cli login` or use an alternative model.
- `GLiNERCandle` requires converting checkpoints from Python formats (`torch`, `safetensors`). Prefer `GLiNEROnnx` for production use.

This library includes an evaluation framework for measuring precision, recall, and F1 with different matching semantics (strict, partial, type-only). It also implements coreference metrics (MUC, B³, CEAF, LEA) for systems that resolve mentions to entities.
```rust
use anno::{Model, RegexNER};
use anno::eval::report::ReportBuilder;

let model = RegexNER::new();
let report = ReportBuilder::new("RegexNER")
    .with_core_metrics(true)
    .with_error_analysis(true)
    .build(&model);
println!("{}", report.summary());
```
See docs/EVALUATION.md for details on evaluation modes, bias analysis, and dataset support.
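The core metrics are standard precision/recall/F1 over matched spans. A worked example of the arithmetic under strict matching (span and type must both agree; the counts are illustrative, not from a real run):

```rust
// Precision = TP / (TP + FP), Recall = TP / (TP + FN),
// F1 = harmonic mean of the two.
fn precision_recall_f1(tp: f64, fp: f64, fn_: f64) -> (f64, f64, f64) {
    let p = tp / (tp + fp);
    let r = tp / (tp + fn_);
    let f1 = 2.0 * p * r / (p + r);
    (p, r, f1)
}

fn main() {
    // 8 correctly matched spans, 2 spurious predictions, 2 missed gold entities
    let (p, r, f1) = precision_recall_f1(8.0, 2.0, 2.0);
    assert!((p - 0.8).abs() < 1e-9);
    assert!((r - 0.8).abs() < 1e-9);
    assert!((f1 - 0.8).abs() < 1e-9);
}
```

Partial and type-only modes relax what counts as a TP, so the same predictions can score differently across modes.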
What makes anno different:
- Zero-dependency defaults: `RegexNER` and `StackedNER` work out of the box

| Feature | What it enables |
|---|---|
| (default) | RegexNER, HeuristicNER, StackedNER, GraphDocument, SchemaMapper |
| `onnx` | BERT, GLiNER, GLiNER2, NuNER, W2NER via ONNX Runtime |
| `candle` | Pure Rust inference (CandleNER, GLiNERCandle, GLiNER2Candle) with optional Metal/CUDA |
| `eval` | Core metrics (P/R/F1), datasets, evaluation framework |
| `eval-bias` | Gender, demographic, temporal, length bias analysis |
| `eval-advanced` | Calibration, robustness, OOD detection, dataset download |
| `discourse` | Event extraction, shell nouns, abstract anaphora |
| `full` | Everything |
This crate's minimum supported rustc version is 1.75.0.
MIT OR Apache-2.0