Crates.io | segtok |
lib.rs | segtok |
version | 0.1.5 |
created_at | 2025-01-10 21:10:53.807115+00 |
updated_at | 2025-02-18 21:24:18.292711+00 |
description | Sentence segmentation and word tokenization tools |
homepage | |
repository | https://github.com/xamgore/segtok |
max_upload_size | |
id | 1511734 |
size | 129,936 |
A rule-based sentence segmenter (splitter) and a word tokenizer that rely on orthographic features. Ported from the Python segtok package (no longer maintained), with the contractions bug fixed.
use segtok::{segmenter::*, tokenizer::*};

fn main() {
    let input = include_str!("../tests/test_google.txt");

    // Split the input into sentences, tokenize each sentence with the
    // web-aware tokenizer, and split contractions into separate tokens.
    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();
}
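
As a minimal follow-up sketch (not part of the crate's documentation), the example above can be extended to report how many sentences and tokens were produced; it only inspects the lengths of the collected Vecs, so it does not depend on the concrete token type:

use segtok::{segmenter::*, tokenizer::*};

fn main() {
    let input = include_str!("../tests/test_google.txt");

    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();

    // Summarize the result: one inner Vec per sentence, one element per token.
    println!("{} sentences", sentences.len());
    for (idx, tokens) in sentences.iter().enumerate() {
        println!("sentence {}: {} tokens", idx + 1, tokens.len());
    }
}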