| Crates.io | seams |
| lib.rs | seams |
| version | 0.1.1 |
| created_at | 2025-07-18 00:52:17.689427+00 |
| updated_at | 2025-07-18 23:54:26.730251+00 |
| description | High-throughput sentence extractor for Project Gutenberg texts with dialog-aware detection |
| homepage | https://github.com/knowseams/knowseams |
| repository | https://github.com/knowseams/knowseams |
| max_upload_size | |
| id | 1758434 |
| size | 1,571,670 |
Splits large English text corpora into meaningful sentences while preserving narrative flow and dialog structure.
SEAMS excels at preserving narrative structure where other tools break dialog incorrectly:
Input text:
"Well, you can see him easily enough," said Mr. Hoad. "He's staying in
your village, I believe. He's a nephew of Squire Broderick's."
"What! Captain Forrester?" cried I.
SEAMS output (3 sentences):
"Well, you can see him easily enough," said Mr. Hoad."He's staying in your village, I believe. He's a nephew of Squire Broderick's.""What! Captain Forrester?" cried I.Other tools break this into 6+ fragments:
"He's staying in + your village, I believe.)"What!" + Captain Forrester?" + cried I.)See examples for more cases demonstrating:
Found a mis-split in English narrative? Show us!
No currently known mis-splits on 20K Project Gutenberg English texts. Python scripts in exploration/ help search for potential examples.
If you discover a counter-example:
Dialog-heavy text is where other sentence splitters fail - show us where SEAMS does too.
Test Corpus: 20,440 Project Gutenberg files (7.4 billion characters, 56 million sentences)
Test System: Intel i9-13900KF (16 cores, 32 threads) running Linux 5.15 WSL2, 32GB RAM
| Benchmark (version) | Cores | End-to-end time | Speed-up vs nupunkt | Sentences / s | Sentence detection throughput | Total e2e throughput | Note |
|---|---|---|---|---|---|---|---|
| seams | 32 | 6 s | 59 × | 8.6 M | 105.4 MB/s | 1176.2 MB/s | line offsets included |
| seams-single-cpu | 1 | 1 m 31 s | 4 × | 611 k | 450.6 MB/s | 90.4 MB/s | single-CPU baseline |
| nupunkt (0.5.1) | 1 | 6 m 23 s | 1 × | 179 k | 19.7 MB/s | 19.3 MB/s | pure-Python |
Additional Results: See benchmarks/performance-results.md for results across different systems including macOS ARM64.
For complete benchmark methodology and comparison tools, see benchmarks/ and run python run_comparison.py.
From crates.io:
cargo install seams
From source:
git clone https://github.com/KnowSeams/KnowSeams.git
cd KnowSeams
cargo install --path .
Process all Project Gutenberg texts in a directory:
seams /path/to/gutenberg_texts
The tool will:
*-0.txt files recursively*_seams2.txt files alongside originalsrun_stats.jsonProcess a Project Gutenberg mirror:
seams ~/gutenberg_mirror
Reprocess all files (ignore existing _seams2.txt outputs):
seams --overwrite-all ~/gutenberg_texts
Run benchmark comparison:
cd benchmarks
uv venv # One-time setup
uv sync # Install dependencies
source .venv/bin/activate # Per session
python run_analysis.py ~/gutenberg_texts # Assumes location of a (likely partial) gutenberg mirror
Debug sentence detection with state transitions:
seams --debug-text 'He said "Hello world!" and left. She replied "Goodbye!" quickly.'
Output shows internal state machine transitions:
0 He said "Hello world!" and left. (1,1,1,34) Narrative DialogDoubleQuote Continue " IndependentDialog[0] He said "Hello worl
0 He said "Hello world!" and left. (1,1,1,34) DialogDoubleQuote Narrative Continue !" a DialogSoftEnd llo world!" and left. S
1 She replied "Goodbye!" quickly. (1,35,1,67) Narrative Narrative Split . S NarrativeSentenceBoundary " and left. She replied
1 She replied "Goodbye!" quickly. (1,35,1,67) Narrative DialogDoubleQuote Continue " IndependentDialog[0] he replied "Goodbye!"
1 She replied "Goodbye!" quickly. (1,35,1,67) DialogDoubleQuote Narrative Continue !" q DialogSoftEnd "Goodbye!" quickly.
For each input file book-0.txt, seams creates book-0_seams2.txt with:
1 This is the first sentence. (1,1,1,32)
2 Here is the second sentence. (1,33,2,15)
Format: index<TAB>sentence<TAB>(start_line,start_col,end_line,end_col)
seams [OPTIONS] [PATH]
Arguments:
[PATH] Directory to scan recursively for *-0.txt files, or single *-0.txt file to process
Options:
--overwrite-all Reprocess all files, even those with complete _seams.txt files
--fail-fast Stop processing immediately on first I/O, UTF-8, or detection error
--no-progress Disable progress bars (useful for automation/CI)
-q, --quiet Suppress all non-error output (implies --no-progress)
--stats-out <FILE> Write performance statistics to JSON file [default: run_stats.json]
--clear-restart-log Clear the restart log and reprocess all files
--max-cpus <MAX_CPUS> Limit processing to specified number of CPUs/threads
--sentence-length-stats Calculate and display sentence length statistics
--debug-seams Generate debug TSV files with state transition details
--debug-text <DEBUG_TEXT> Debug sentence detection on provided text string
--debug-stdin Debug sentence detection on text from stdin
-h, --help Print help
-V, --version Print version
Performance scales with available CPU cores and I/O bandwidth. Actual throughput varies by:
Hardware: CPU cores, storage speed, memory bandwidth
File characteristics: Size distribution, text complexity
Workload: Complete pipeline vs. raw boundary detection only
Algorithm: DFA-based boundary detection using regex-automata with narrative-aware heuristics for dialog coalescing.
Performance: 23× faster than nupunkt single-threaded (451 MB/s vs 20 MB/s sentence detection). Multi-threaded end-to-end throughput reaches 1176 MB/s on test system.
Architecture: Two-stage pipeline with bounded parallelism (file enumeration + sentence splitting), async I/O with memory-mapped files.
For detailed design documentation, see SEAMS-Design.md.
MIT License - see LICENSE file for details.
Thanks to Project Gutenberg for providing the freely available corpus used for testing and benchmarking.