| Crates.io | text_analysis |
| lib.rs | text_analysis |
| version | 0.4.8 |
| created_at | 2020-06-29 16:03:44.325588+00 |
| updated_at | 2025-08-25 15:05:40.747389+00 |
| description | A robust multilingual text analysis CLI with context, N-grams, named entities, and CSV/JSON export. |
| homepage | https://crates.io/crates/text_analysis |
| repository | https://github.com/LazyEmpiricist/text_analysis |
| max_upload_size | |
| id | 259456 |
| size | 1,647,107 |
A fast, pragmatic CLI & library for multi-language text analysis across .txt, .pdf, .docx, and .odt files.
With cargo:

```bash
cargo install text_analysis
```

Alternatively:
- Download a binary from Releases
- Clone the repository and build from source
```bash
# Default TXT summary (one file)
text_analysis <path>

# CSV exports (multiple files: ngrams, wordfreq, context, neighbors, pmi, namedentities)
text_analysis <path> --export-format csv

# Combine all files into one corpus (Map-Reduce) and export as JSON
text_analysis <path> --combine --export-format json
```
Path can be a file or a directory (recursively scanned). Supported: .txt, .pdf, .docx, .odt.
```bash
text_analysis <path> [--stopwords <FILE>] [--ngram N] [--context N]
                     [--export-format {txt|csv|tsv|json}] [--entities-only]
                     [--combine]
                     [--stem] [--stem-lang <CODE>] [--stem-strict]
```
- `--stopwords <FILE>` – optional stopword list, one token per line (a loading sketch for library users follows this list).
- `--ngram N` – n-gram size (default: 2).
- `--context N` – context window size for context & PMI (default: 5).
- `--export-format` – `txt` (default), `csv`, `tsv`, `json`.
- `--entities-only` – only export Named Entities (skips other tables).
- `--combine` – analyze all files as one corpus (Map-Reduce) and write a single set of outputs.
- `--stem` – enable stemming with auto language detection.
- `--stem-lang <CODE>` – force stemming language (e.g., en, de, fr, es, it, pt, nl, ru, sv, fi, no, ro, hu, da, tr).
- `--stem-strict` – in auto mode, require a detectable & supported language.
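As referenced above, here is a minimal sketch of loading a stopword file in that format for library use (`stopwords.txt` is a hypothetical file name; the CLI reads the file for you when you pass `--stopwords`):

```rust
use std::collections::HashSet;
use std::fs;

/// Load a stopword list in the --stopwords format: one token per line.
/// Blank lines and surrounding whitespace are ignored.
fn load_stopwords(path: &str) -> std::io::Result<HashSet<String>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(|line| line.trim().to_string())
        .filter(|line| !line.is_empty())
        .collect())
}

fn main() -> std::io::Result<()> {
    let stop = load_stopwords("stopwords.txt")?; // hypothetical example file
    println!("{} stopwords loaded", stop.len());
    Ok(())
}
```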
When the CLI finishes, it prints a concise summary to stdout. The order is tuned for usefulness: phrases and salient collocations are surfaced before common function words.
TXT (the default) writes a single summary file per run:

- `<stem>_<timestamp>_summary.txt`

CSV/TSV/JSON write multiple files per run (one per analysis):

- `<stem>_<timestamp>_ngrams.<ext>`
- `<stem>_<timestamp>_wordfreq.<ext>`
- `<stem>_<timestamp>_context.<ext>`
- `<stem>_<timestamp>_neighbors.<ext>`
- `<stem>_<timestamp>_pmi.<ext>`
- `<stem>_<timestamp>_namedentities.<ext>`

| File suffix | Contents | Notes |
|---|---|---|
| `_ngrams.<ext>` | List of all observed n-grams and their counts | Sorted by count ↓, then lexicographically ↑ |
| `_wordfreq.<ext>` | Word frequency table (unigrams only) | Sorted by count ↓, then lexicographically ↑ |
| `_context.<ext>` | Directed co-occurrence counts for all tokens in a ±N window around each center token | Window size set by `--context` (default 5); includes all words except the center word |
| `_neighbors.<ext>` | Directed co-occurrence counts for immediate left/right neighbors (±1 distance) | Always exactly one left and one right position per center token |
| `_pmi.<ext>` | Word pairs within the context window with their counts, distances, and Pointwise Mutual Information | Pairs are unordered in storage; sorted by count ↓, PMI ↓ in export |
| `_namedentities.<ext>` | Named entities detected via capitalization heuristic and their counts | Case-sensitive; ignores acronyms and common articles/determiners |
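The PMI column follows the standard definition, `PMI(x, y) = log2(p(x, y) / (p(x) · p(y)))`, estimated from pair counts inside the context window. A minimal sketch of that pair counting and PMI estimation, assuming whitespace tokenization (illustrative, not the crate's internal code; the real `_context`/`_neighbors` tables also track direction and distance, which this sketch omits):

```rust
use std::collections::HashMap;

/// Count token pairs within a ±window distance and estimate PMI for each
/// unordered pair from those counts.
fn window_pmi(tokens: &[&str], window: usize) -> HashMap<(String, String), f64> {
    if tokens.is_empty() {
        return HashMap::new();
    }
    let mut word: HashMap<&str, f64> = HashMap::new();
    let mut pair: HashMap<(String, String), f64> = HashMap::new();
    let mut pair_total = 0.0;
    for (i, center) in tokens.iter().enumerate() {
        *word.entry(*center).or_insert(0.0) += 1.0;
        let lo = i.saturating_sub(window);
        let hi = (i + window).min(tokens.len() - 1);
        for j in lo..=hi {
            if j == i {
                continue;
            }
            // Canonical (sorted) key: pairs are unordered in storage.
            let (a, b) = if tokens[i] <= tokens[j] {
                (tokens[i], tokens[j])
            } else {
                (tokens[j], tokens[i])
            };
            *pair.entry((a.to_string(), b.to_string())).or_insert(0.0) += 1.0;
            pair_total += 1.0;
        }
    }
    let word_total: f64 = word.values().sum();
    pair.into_iter()
        .map(|((a, b), count)| {
            let pxy = count / pair_total;
            let px = word[a.as_str()] / word_total;
            let py = word[b.as_str()] / word_total;
            ((a, b), (pxy / (px * py)).log2())
        })
        .collect()
}

fn main() {
    let toks: Vec<&str> =
        "the quick brown fox jumps over the lazy dog".split_whitespace().collect();
    for ((a, b), pmi) in window_pmi(&toks, 5) {
        println!("{a} {b}: PMI {pmi:.3}");
    }
}
```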
Sorting rules applied to all tabular exports: count descending first, then lexicographically ascending (the `_pmi` export breaks count ties by PMI descending instead).
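In Rust terms, that ordering is a two-key comparison; a minimal sketch:

```rust
fn main() {
    // Export ordering: count descending, then key ascending.
    let mut rows: Vec<(&str, u64)> = vec![("fox", 3), ("dog", 3), ("the", 7)];
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    assert_eq!(rows, vec![("the", 7), ("dog", 3), ("fox", 3)]);
}
```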
With `--combine`, all inputs are processed as one corpus and exported once with the stem `combined`:

- `combined_<timestamp>_wordfreq.<ext>`, `combined_<timestamp>_ngrams.<ext>`, …

`<stem>` is collision-safe: it is derived from the file name plus a short path hash. In per-file mode each input gets its own stem; in combined mode the stem is literally `combined`.
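One way such a collision-safe stem can be derived (illustrative; the crate's exact hashing scheme is not specified here):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

/// File name plus a short hash of the full path, so "a/report.txt" and
/// "b/report.txt" get distinct stems.
fn output_stem(path: &Path) -> String {
    let name = path.file_stem().and_then(|s| s.to_str()).unwrap_or("output");
    let mut h = DefaultHasher::new();
    path.hash(&mut h);
    format!("{}_{:04x}", name, h.finish() & 0xffff)
}

fn main() {
    println!("{}", output_stem(Path::new("a/report.txt")));
    println!("{}", output_stem(Path::new("b/report.txt")));
}
```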
Add to `Cargo.toml`:

```toml
[dependencies]
text_analysis = "0.4.8"
```
Basic example:
```rust
use std::collections::HashSet;
use text_analysis::*;

fn main() -> Result<(), String> {
    let text = "The quick brown fox jumps over the lazy dog.";

    // Bigrams, ±5 context window, JSON export, no stemming.
    let opts = AnalysisOptions {
        ngram: 2,
        context: 5,
        export_format: ExportFormat::Json,
        entities_only: false,
        combine: false,
        stem_mode: StemMode::Off,
        stem_require_detected: false,
    };

    // Empty stopword set: keep every token.
    let stop = HashSet::new();
    let result = analyze_text_with(text, &stop, &opts);
    println!("Top words: {:?}", result.wordfreq);
    Ok(())
}
```
Counts are case‑sensitive and computed on original tokens (not stemmed).
- `StemMode::Off` – no stemming
- `StemMode::Auto` – language detected via whatlang; stem if supported
- `StemMode::Force(lang)` – use a specific stemmer

`stem_require_detected` controls strictness in auto mode (see CLI).
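A variant of the basic example with strict auto-stemming, assuming the same `AnalysisOptions` shape as above (the German sample text is only there to give the detector something non-English):

```rust
use std::collections::HashSet;
use text_analysis::*;

fn main() -> Result<(), String> {
    let opts = AnalysisOptions {
        ngram: 2,
        context: 5,
        export_format: ExportFormat::Json,
        entities_only: false,
        combine: false,
        stem_mode: StemMode::Auto,   // detect language via whatlang
        stem_require_detected: true, // strict: require a detectable, supported language
    };
    let stop = HashSet::new();
    let result = analyze_text_with("Der schnelle braune Fuchs springt.", &stop, &opts);
    println!("{:?}", result.wordfreq);
    Ok(())
}
```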
PDF parsing uses `pdf-extract`. Files that fail to parse are listed in the warnings and don't abort the run.
DOCX files are read by unzipping `word/document.xml` and extracting the text content; ODT files by unzipping `content.xml` and extracting the text content.
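A minimal sketch of that unzip-and-extract idea using the `zip` crate (an assumption; the crate's actual implementation and XML handling may differ):

```rust
use std::fs::File;
use std::io::Read;
use zip::ZipArchive;

/// Pull the XML body out of a .docx ("word/document.xml") or
/// .odt ("content.xml") container and drop the markup. Illustrative only.
fn extract_xml_text(path: &str, inner: &str) -> Result<String, String> {
    let file = File::open(path).map_err(|e| e.to_string())?;
    let mut archive = ZipArchive::new(file).map_err(|e| e.to_string())?;
    let mut entry = archive.by_name(inner).map_err(|e| e.to_string())?;
    let mut xml = String::new();
    entry.read_to_string(&mut xml).map_err(|e| e.to_string())?;

    // Strip tags with a tiny state machine; real XML parsing is more robust.
    let mut out = String::new();
    let mut in_tag = false;
    for c in xml.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    Ok(out)
}

fn main() {
    // e.g. extract_xml_text("report.docx", "word/document.xml")
    //      extract_xml_text("notes.odt", "content.xml")
}
```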
Notes:

- Use `--export-format csv` (or `tsv`/`json`) for downstream analysis in pandas/R/Excel.
- Start with `--ngram 2` or `--ngram 3` and check PMI first.
- Use `--stem-strict` to avoid inconsistent stemming.

License: MIT
If you open exports in Excel/LibreOffice, cells that begin with `=`, `+`, `-`, or `@` can be interpreted as formulas. The recommended approach is:

- Use a proper CSV writer (e.g., `csv::Writer`) for escaping.
- Prefix a `'` for any text cell that starts with one of those characters.

This prevents spreadsheet software from executing user-provided content.
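A minimal sketch of that mitigation with the `csv` crate (`sanitize_cell` is a hypothetical helper, not part of this crate's API):

```rust
use csv::Writer;

/// Prefix a single quote so spreadsheet apps treat the cell as text,
/// not a formula.
fn sanitize_cell(s: &str) -> String {
    match s.chars().next() {
        Some('=' | '+' | '-' | '@') => format!("'{s}"),
        _ => s.to_string(),
    }
}

fn main() -> Result<(), csv::Error> {
    let mut w = Writer::from_writer(std::io::stdout());
    // csv::Writer handles quoting/escaping; sanitize_cell neutralizes formulas.
    w.write_record(&[sanitize_cell("=1+2"), sanitize_cell("plain")])?;
    w.flush()?;
    Ok(())
}
```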