Crates.io | icu_segmenter |
lib.rs | icu_segmenter |
version | 2.0.0-beta1 |
source | src |
created_at | 2021-04-29 19:41:03.987822 |
updated_at | 2024-11-23 02:17:37.810027 |
description | Unicode line breaking and text segmentation algorithms for text boundaries analysis |
homepage | https://icu4x.unicode.org |
repository | https://github.com/unicode-org/icu4x |
max_upload_size | |
id | 391231 |
size | 4,035,461 |
Segment strings by lines, graphemes, words, and sentences.
This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.
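This module contains segmenter implementations for the following rules:
- A line segmenter compatible with Unicode Standard Annex #14 (Unicode Line Breaking Algorithm), with options to tailor line-breaking behavior for the CSS line-break and word-break properties.
- Grapheme cluster, word, and sentence segmenters compatible with Unicode Standard Annex #29 (Unicode Text Segmentation).
Find line break opportunities: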
use icu::segmenter::LineSegmenter;

let segmenter = LineSegmenter::new_auto();

let breakpoints: Vec<usize> = segmenter
    .segment_str("Hello World. Xin chào thế giới!")
    .collect();
assert_eq!(&breakpoints, &[0, 6, 13, 17, 23, 29, 36]);
See [LineSegmenter] for more examples.
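The breakpoints are byte offsets into the input string, so consecutive pairs delimit the individual segments. As a minimal sketch using only the standard library (the helper name `split_at_breakpoints` is hypothetical and not part of icu_segmenter), the offsets can be turned back into string slices like this:

```rust
/// Splits `text` into the segments delimited by consecutive breakpoints.
///
/// Assumption: `breakpoints` is sorted and every offset falls on a UTF-8
/// character boundary, as is the case for the values in the example above.
/// This helper is illustrative only; it is not an icu_segmenter API.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2) // each pair of neighbours is (start, end) of one segment
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

Called with the example string and the breakpoints `[0, 6, 13, 17, 23, 29, 36]`, this yields six segments, each ending at a line break opportunity: "Hello ", "World. ", "Xin ", "chào ", "thế ", and "giới!".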
Find all grapheme cluster boundaries:
use icu::segmenter::GraphemeClusterSegmenter;

let segmenter = GraphemeClusterSegmenter::new();

let breakpoints: Vec<usize> = segmenter
    .segment_str("Hello World. Xin chào thế giới!")
    .collect();
assert_eq!(
    &breakpoints,
    &[
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 21, 22, 23, 24, 25, 28, 29, 30, 31, 34, 35, 36
    ]
);
See [GraphemeClusterSegmenter] for more examples.
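The gaps in the list above (19 to 21, 25 to 28, 31 to 34) appear because the offsets count UTF-8 bytes, and "à", "ế", and "ớ" occupy two or three bytes each. One practical consequence is that a string should be cut at a reported boundary rather than at an arbitrary byte index. The following is a sketch under that assumption; `truncate_graphemes` is a hypothetical helper, not a function provided by the crate:

```rust
/// Returns the longest prefix of `text` that holds at most `max_graphemes`
/// grapheme clusters, given the cluster boundaries as sorted byte offsets.
///
/// Illustrative helper only; it is not an icu_segmenter API.
fn truncate_graphemes<'a>(
    text: &'a str,
    breakpoints: &[usize],
    max_graphemes: usize,
) -> &'a str {
    // n + 1 boundaries delimit n clusters, so the boundary at index `k`
    // ends the k-th cluster. Cap the index at the final boundary.
    let last = breakpoints.len().saturating_sub(1);
    match breakpoints.get(max_graphemes.min(last)) {
        Some(&end) => &text[..end],
        // No boundaries were supplied: leave the text unchanged.
        None => text,
    }
}
```

With the 32 breakpoints from the example, asking for 20 clusters cuts at byte offset 21 and returns "Hello World. Xin chà", whereas slicing the string directly at byte 20 would land inside "à" and panic.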
Find all word boundaries:
use icu::segmenter::WordSegmenter;

let segmenter = WordSegmenter::new_auto();

let breakpoints: Vec<usize> = segmenter
    .segment_str("Hello World. Xin chào thế giới!")
    .collect();
assert_eq!(
    &breakpoints,
    &[0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36]
);
See [WordSegmenter] for more examples.
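Note that the word boundaries also delimit spaces and punctuation: the 14 offsets above describe 13 segments, of which only 6 are words in the everyday sense. As a rough standard-library approximation (the helper `word_like_segments` and its alphanumeric test are assumptions made for illustration, not the segmenter's own classification), the non-word segments can be filtered out as follows:

```rust
/// Returns the segments between consecutive breakpoints that contain at
/// least one alphanumeric character.
///
/// Assumption: "word-like" is approximated with `char::is_alphanumeric`.
/// This is a simplification for illustration, not an icu_segmenter API.
fn word_like_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2) // neighbouring offsets are the start and end of a segment
        .map(|pair| &text[pair[0]..pair[1]])
        // Drop segments made up solely of whitespace or punctuation.
        .filter(|segment| segment.chars().any(char::is_alphanumeric))
        .collect()
}
```

For the example input this keeps "Hello", "World", "Xin", "chào", "thế", and "giới". The WordSegmenter documentation referenced above is the place to check for a more precise, rule-based way to distinguish word-like segments.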
Segment the string into sentences:
use icu::segmenter::SentenceSegmenter;

let segmenter = SentenceSegmenter::new();

let breakpoints: Vec<usize> = segmenter
    .segment_str("Hello World. Xin chào thế giới!")
    .collect();
assert_eq!(&breakpoints, &[0, 13, 36]);
See [SentenceSegmenter] for more examples.
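Because the breakpoints come back sorted, they can also answer the reverse question: which sentence contains a given byte offset (a cursor position, for example). The sketch below uses a binary search from the standard library; `segment_index_at` is a hypothetical helper name and not part of the crate:

```rust
/// Returns the zero-based index of the segment that contains `offset`,
/// or `None` if `offset` lies outside the segmented range.
///
/// Assumption: `breakpoints` is sorted in ascending order, as in the
/// example above. Illustrative helper only; not an icu_segmenter API.
fn segment_index_at(breakpoints: &[usize], offset: usize) -> Option<usize> {
    let (&first, &last) = (breakpoints.first()?, breakpoints.last()?);
    if offset < first || offset >= last {
        return None;
    }
    // Count the boundaries at or before `offset`; the segment that contains
    // `offset` starts at the last of them.
    Some(breakpoints.partition_point(|&b| b <= offset) - 1)
}
```

With the sentence breakpoints `[0, 13, 36]`, offset 12 (the space after "World.") maps to sentence 0, offset 13 (the "X" of "Xin") maps to sentence 1, and offset 36 (the end of the string) maps to no sentence.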
For more information on development, authorship, contributing, etc., please visit the ICU4X home page.