| Crates.io | pii |
| lib.rs | pii |
| version | 0.1.0 |
| created_at | 2026-01-11 13:58:31.258416+00 |
| updated_at | 2026-01-11 13:58:31.258416+00 |
| description | PII detection and anonymization with deterministic, capability-aware NLP pipelines. |
| homepage | https://github.com/worka-ai/pii |
| repository | https://github.com/worka-ai/pii |
| size | 240,275 |
Worka PII is a Rust library for detecting and anonymizing personally identifiable information (PII). It provides deterministic, capability-aware NLP pipelines designed to run in CPU-only environments, with explicit auditability and controlled degradation when language features are unavailable.
This crate was extracted from the Worka internal monorepo to become a standalone, reusable component. The APIs and the RFCs are maintained here to support independent development and external adoption.
Optional NER support is enabled through the candle-ner feature.
Run the bundled examples with:
cargo run --example redact
cargo run --example extract
use pii::anonymize::{AnonymizeConfig, Anonymizer};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;
let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);
let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();
let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), pii::anonymize::Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), pii::anonymize::Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;
let redacted = Anonymizer::anonymize(text, &result.entities, &config).unwrap();
assert!(redacted.text.contains("<EMAIL>"));
This example keeps the input text intact and uses the detected spans directly.
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);
let text = "Reach me at jane@example.com from 10.0.0.5.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();
// Slice the untouched input with each detection's offsets.
for detection in &result.entities {
    let span = &text[detection.start..detection.end];
    println!(
        "type={} start={} end={} value={}",
        detection.entity_type.as_str(),
        detection.start,
        detection.end,
        span
    );
}
This example applies per-entity operators and emits a simple audit log that records the original value alongside the replacement.
use pii::anonymize::{AnonymizeConfig, Anonymizer, Operator};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;
let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);
let text = "Email jane@example.com or call +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();
let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;
let anonymized = Anonymizer::anonymize(text, &result.entities, &config).unwrap();
// Audit log: pair each original value with the text that replaced it.
for item in &anonymized.items {
    let original = &text[item.entity.start..item.entity.end];
    println!(
        "type={} value={} replacement={}",
        item.entity.entity_type.as_str(),
        original,
        item.replacement
    );
}
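To make the span-based replacement step concrete, here is a minimal, std-only sketch of how replacements can be applied to detected byte spans without invalidating offsets. It is not the crate's implementation: the `Replacement` type, the `apply_replacements` function, and the `<PHONE>` placeholder are hypothetical names chosen for illustration, and the sketch assumes the spans do not overlap and fall on character boundaries.

```rust
/// A detected span plus the text that should replace it.
/// Hypothetical type for this sketch; not part of the pii crate.
struct Replacement {
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
    with: String,
}

/// Apply non-overlapping replacements from the end of the text backwards,
/// so earlier byte offsets stay valid while later spans change length.
fn apply_replacements(text: &str, mut items: Vec<Replacement>) -> String {
    // Sort by start offset, descending.
    items.sort_by(|a, b| b.start.cmp(&a.start));
    let mut out = text.to_string();
    for item in &items {
        out.replace_range(item.start..item.end, &item.with);
    }
    out
}

fn main() {
    let text = "Email jane@example.com or call +1 415-555-1212.";
    let email = "jane@example.com";
    let phone = "+1 415-555-1212";
    // Locate the spans by substring search; a real pipeline would take them
    // from the analyzer's detections instead.
    let email_start = text.find(email).unwrap();
    let phone_start = text.find(phone).unwrap();
    let items = vec![
        Replacement {
            start: email_start,
            end: email_start + email.len(),
            with: "<EMAIL>".into(),
        },
        Replacement {
            start: phone_start,
            end: phone_start + phone.len(),
            with: "<PHONE>".into(),
        },
    ];
    println!("{}", apply_replacements(text, items));
}
```

Working right to left avoids having to track a running offset delta; a left-to-right pass that builds a new string would work equally well.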
The following entity types are supported out of the box via built-in recognizers:
The following types are supported when a NER engine is enabled:
You can add custom entities and recognizers to the pipeline.
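use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::recognizers::regex::RegexRecognizer;
use pii::types::EntityType;
use pii::{Analyzer, PolicyConfig};
// Start from the built-in recognizers and append a custom one.
let mut recognizers = default_recognizers();
let employee_id = RegexRecognizer::new(
    "regex_employee_id",
    EntityType::Custom("EmployeeId".to_string()),
    r"\bEMP-\d{4}\b",
    0.7,
    "employee_id",
).unwrap();
recognizers.push(Box::new(employee_id));
// Build the analyzer with the extended recognizer list.
let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    recognizers,
    Vec::new(),
    PolicyConfig::default(),
);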
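To show which inputs the pattern `\bEMP-\d{4}\b` is meant to accept, the following std-only sketch hand-rolls an equivalent check. It is an ASCII-only approximation written for illustration (the function name `find_employee_ids` is hypothetical), whereas a regex engine may apply Unicode-aware rules for `\b` and `\d`.

```rust
/// Return byte spans shaped like `EMP-` followed by exactly four ASCII digits,
/// bounded on both sides by a non-word character or the edge of the text.
/// ASCII-only stand-in for the regex `\bEMP-\d{4}\b`.
fn find_employee_ids(text: &str) -> Vec<(usize, usize)> {
    const LEN: usize = 8; // "EMP-" plus four digits
    let bytes = text.as_bytes();
    let is_word = |b: u8| b.is_ascii_alphanumeric() || b == b'_';
    let mut spans = Vec::new();
    let mut i = 0;
    while i + LEN <= bytes.len() {
        let candidate = &bytes[i..i + LEN];
        let shape_ok = candidate.starts_with(b"EMP-")
            && candidate[4..].iter().all(|b| b.is_ascii_digit());
        // Word-boundary checks on each side of the candidate.
        let left_ok = i == 0 || !is_word(bytes[i - 1]);
        let right_ok = i + LEN == bytes.len() || !is_word(bytes[i + LEN]);
        if shape_ok && left_ok && right_ok {
            spans.push((i, i + LEN));
            i += LEN;
        } else {
            i += 1;
        }
    }
    spans
}

fn main() {
    let text = "Badge EMP-0042 belongs to Jane; EMP-12345 is malformed.";
    for (start, end) in find_employee_ids(text) {
        println!("start={} end={} value={}", start, end, &text[start..end]);
    }
}
```

Note that `EMP-12345` is rejected: the fifth digit is a word character, so there is no boundary after the fourth digit.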
The pipeline is fully customizable: you can supply your own NLP engine, recognizers, and context enhancers.
- Implement NlpEngine if you want custom tokenization, lemma/POS, or NER.

The default SimpleNlpEngine is language-agnostic and provides tokenization plus sentence splitting for any language tag. For EN/DE/ES, you can provide richer language profiles and context terms to improve recall.
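As an illustration of what language-agnostic tokenization with offsets can look like, here is a minimal std-only sketch that splits on Unicode whitespace and records byte spans. It is not the SimpleNlpEngine implementation, and the `Token` type and `tokenize` function are hypothetical names; it only shows why such a tokenizer needs no per-language data.

```rust
/// A token with byte offsets into the original text.
/// Hypothetical type for this sketch; not part of the pii crate.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    text: &'a str,
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
}

/// Split on Unicode whitespace and record each token's byte span.
fn tokenize(text: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // Whitespace closes the token in progress, if any.
            if let Some(s) = start.take() {
                tokens.push(Token { text: &text[s..i], start: s, end: i });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a token that runs to the end of the input.
    if let Some(s) = start {
        tokens.push(Token { text: &text[s..], start: s, end: text.len() });
    }
    tokens
}

fn main() {
    for token in tokenize("Reach me at jane@example.com from 10.0.0.5.") {
        println!("{}..{} {}", token.start, token.end, token.text);
    }
}
```

Offsets are byte offsets because Rust string slicing is byte-indexed; using `char_indices` keeps every span on a character boundary even for non-ASCII input.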
For unsupported languages:
- Richer features (lemmas, POS tags, NER) are available only if your NlpEngine provides them.

To add a new language with higher fidelity:
- Implement an NlpEngine that can emit token offsets, lemmas, POS tags, and/or NER.
- Add a LanguageProfile with context terms for that language.

The full specification is in docs/rfc-1200-pii.md and defines the data model, pipeline behavior, capability reporting, and conformance requirements.
cargo test
cargo bench
Candle NER tests are ignored by default and require --features candle-ner plus a model:
PII_CANDLE_MODEL_DIR=/path/to/model \
cargo test --features candle-ner --test candle_ner -- --ignored
You can also set PII_CANDLE_MODEL_ID to download a model via hf-hub.
Licensed under either of: