segtok

version: 0.1.5
created_at: 2025-01-10 21:10:53.807115+00
updated_at: 2025-02-18 21:24:18.292711+00
description: Sentence segmentation and word tokenization tools
repository: https://github.com/xamgore/segtok
id: 1511734
size: 129,936
Igor Strebz (xamgore)

README

segtok

A rule-based sentence segmenter (splitter) and word tokenizer that relies on orthographic features. It is a port of the Python segtok package, which is no longer maintained, and fixes its contractions bug.

use segtok::{segmenter::*, tokenizer::*};

fn main() {
    let input = include_str!("../tests/test_google.txt");

    let sentences: Vec<Vec<_>> = split_multi(input, SegmentConfig::default())
        .into_iter()
        .map(|span| split_contractions(web_tokenizer(&span)).collect())
        .collect();
}
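
To give a feel for what "rule-based, using orthographic features" means, here is a minimal sketch of a single such rule. It is not segtok's implementation: `naive_split` is a hypothetical helper written only for illustration, and it assumes just one cue, namely that a sentence ends at `.`, `!` or `?` when whitespace and then an uppercase letter follow.

```rust
/// Illustrative only: NOT part of segtok's API and far simpler than its rules.
/// Splits after '.', '!' or '?' when the terminal is followed by whitespace
/// and the next visible character is uppercase (an orthographic cue).
fn naive_split(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;

    for (i, c) in text.char_indices() {
        if !matches!(c, '.' | '!' | '?') {
            continue;
        }
        let end = i + c.len_utf8();
        let rest = &text[end..];
        let trimmed = rest.trim_start();
        // There must be a gap after the terminal, so "3.50" is left alone.
        let has_gap = trimmed.len() < rest.len();
        // The next sentence is expected to start with an uppercase letter.
        let next_upper = trimmed.chars().next().map_or(false, char::is_uppercase);
        if has_gap && next_upper {
            sentences.push(text[start..end].trim());
            start = end;
        }
    }

    // Whatever remains after the last split point is the final sentence.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```

Calling `naive_split("It costs 3.50 dollars. Is that ok? yes it is.")` yields two sentences, because the decimal point has no following space and the lowercase "yes" does not look like a sentence start. The same rule would wrongly split "Dr. Smith", which is the kind of case a fuller rule set has to handle.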
Commit count: 28