cutters

Crates.io: cutters
lib.rs: cutters
version: 0.1.4
source: src
created_at: 2022-02-10 13:22:48.801523
updated_at: 2023-07-17 09:51:32.779246
description: Rule based sentence segmentation library.
repository: https://github.com/cyanic-selkie/cutters
id: 530271
size: 25,726
(cyanic-selkie)


README

cutters

A rule-based sentence segmentation library.


🚧 This library is experimental. 🚧

Features

  • Full UTF-8 support.
  • Robust parsing.
  • Language-specific rules (each defined by its own PEG).
  • Fast and memory-efficient parsing via the pest library.
  • Sentences can contain quotes which can contain subsentences.

Bindings

Besides native Rust, bindings for the following programming languages are available:

Supported languages

  • Croatian (standard)
  • English (standard)

There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode. It is intended for benchmarking.
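To make the contrast with the rule-based languages concrete, here is a minimal sketch of what a terminal-only splitter could look like. It is not the crate's implementation: the function name baseline_split is hypothetical, and the terminal set is a small hand-picked subset rather than the full Unicode list.

```rust
/// Hypothetical baseline splitter: cuts after every run of sentence terminals.
/// The terminal set is an illustrative subset, not the full Unicode
/// Sentence_Terminal property and not the crate's actual list.
fn baseline_split(text: &str) -> Vec<&str> {
    const TERMINALS: [char; 4] = ['.', '!', '?', '…'];
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();

    while let Some((idx, ch)) = chars.next() {
        if !TERMINALS.contains(&ch) {
            continue;
        }
        // Keep consecutive terminals ("?!", "...") in the same sentence.
        let next_is_terminal = chars
            .peek()
            .map_or(false, |&(_, next)| TERMINALS.contains(&next));
        if next_is_terminal {
            continue;
        }
        // Byte offset just past the terminal, so slicing stays UTF-8 safe.
        let end = idx + ch.len_utf8();
        let sentence = text[start..end].trim();
        if !sentence.is_empty() {
            sentences.push(sentence);
        }
        start = end;
    }

    // Whatever follows the last terminal is kept as a final fragment.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}

fn main() {
    let text = "To je prof.dr.sc. Ivan Horvat. Volim rock!";
    for sentence in baseline_split(text) {
        println!("{sentence}");
    }
}
```

A splitter like this breaks "prof.dr.sc." into three fragments, which is the kind of error the language-specific rules are meant to avoid.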

Example

After adding the cutters dependency to your Cargo.toml file, usage is simple.

fn main() {
    // Croatian sample text with abbreviations, ordinal numbers, and a nested quote.
    let text = r#"Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.""#;

    // Segment the text using the Croatian rule set.
    let sentences = cutters::cut(text, cutters::Language::Croatian);

    // Pretty-print the resulting sentences via their Debug output.
    println!("{:#?}", sentences);
}

This produces the following output (note that the str fields of both structs are of type &str).

[
    Sentence {
        str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
        quotes: [],
    },
    Sentence {
        str: "St. Louis 9LX je događaj u svijetu šaha.",
        quotes: [],
    },
    Sentence {
        str: "To je prof.dr.sc. Ivan Horvat.",
        quotes: [],
    },
    Sentence {
        str: "Volim rock, punk, funk, pop itd.",
        quotes: [],
    },
    Sentence {
        str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
        quotes: [
            Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: [
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            },
        ],
    },
]
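The nested shape of this output (sentences that contain quotes, which in turn contain subsentences) can be walked with two loops. The sketch below uses local stand-in structs that mirror the field names shown in the output. These definitions and the flatten helper are assumptions for illustration, and the crate's own types may differ in detail.

```rust
// Local stand-ins mirroring the field names in the Debug output.
// They are assumptions for illustration, not the crate's definitions.
#[derive(Debug)]
struct Quote<'a> {
    str: &'a str,
    sentences: Vec<&'a str>,
}

#[derive(Debug)]
struct Sentence<'a> {
    str: &'a str,
    quotes: Vec<Quote<'a>>,
}

/// Hypothetical helper: returns every top-level sentence followed by the
/// subsentences of its quotes, with the top-level text trimmed.
fn flatten<'a>(sentences: &[Sentence<'a>]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for sentence in sentences {
        out.push(sentence.str.trim());
        for quote in &sentence.quotes {
            out.extend(quote.sentences.iter().copied());
        }
    }
    out
}

fn main() {
    let sentences = vec![
        Sentence {
            str: "Volim rock, punk, funk, pop itd.",
            quotes: vec![],
        },
        Sentence {
            str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
            quotes: vec![Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: vec![
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            }],
        },
    ];

    // Print the full text of each quote, then the flattened sentence list.
    for sentence in &sentences {
        for quote in &sentence.quotes {
            println!("quote: {}", quote.str);
        }
    }
    for (i, s) in flatten(&sentences).iter().enumerate() {
        println!("{i}: {s}");
    }
}
```

Since every field is a string slice, a helper like this only copies references and never the text itself.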