unobtanium-segmenter

Crates.io: unobtanium-segmenter
lib.rs: unobtanium-segmenter
version: 0.2.1
created_at: 2025-06-26 00:20:48.724286+00
updated_at: 2025-07-20 11:39:26.888442+00
description: A text segmentation toolbox for search applications inspired by charabia and tantivy.
repository: https://codeberg.org/unobtanium/unobtanium-segmenter
id: 1726716
size: 102,617
owner: Slatian (slatian)

README

Unobtanium Segmenter

This is a Rust crate that helps with segmenting text into word-like tokens.

What it does

Unobtanium Segmenter is based on chaining iterators together: the first iterator takes one or more strings, and each following iterator subdivides them further into tokens until one arrives at the desired level of splitting (words, sentences, graphemes, a mix of them, or something in between).

There are different kinds of processing:

  • Subdividing to split a token into one or more tokens
  • Augmenting for language and script detection
  • Filtering for removing tokens
  • Normalizing for altering tokens

It currently works well for European and other "space"-separated languages, but it still needs some work for non-space-separated languages (mainly Asian languages). Help with these is welcome; if unsure, ask via an issue.

Overview

Segmentation

Segmentation splits bigger chunks of text into smaller ones.

DecompositionAhoCorasick : Decompose compound words into their parts found in a given dictionary. Useful for small or on-the-fly generated dictionaries.

DecompositionFst : Decompose compound words into their parts found in a given dictionary. Useful for compressed, memory mapped dictionaries.
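
To illustrate what decomposition does, here is a simplified greedy longest-match version using only std. The real segmenters use aho-corasick and fst matchers rather than this naive scan, and may handle overlaps differently; this is only a sketch of the idea:

```rust
use std::collections::HashSet;

/// Greedy longest-prefix decomposition of a compound word against a
/// dictionary. Simplified illustration only: a greedy scan can fail
/// where a backtracking matcher would succeed.
fn decompose(word: &str, dict: &HashSet<&str>) -> Option<Vec<String>> {
    let mut parts = Vec::new();
    let mut rest = word;
    while !rest.is_empty() {
        // Find the longest dictionary entry that prefixes the remainder.
        let hit = (1..=rest.len())
            .rev()
            .filter(|&i| rest.is_char_boundary(i))
            .map(|i| &rest[..i])
            .find(|p| dict.contains(p))?;
        parts.push(hit.to_string());
        rest = &rest[hit.len()..];
    }
    Some(parts)
}

fn main() {
    let dict: HashSet<&str> = ["book", "shelf", "key", "board"].into_iter().collect();
    assert_eq!(
        decompose("bookshelf", &dict),
        Some(vec!["book".to_string(), "shelf".to_string()])
    );
    // Words that cannot be fully covered by the dictionary stay whole.
    assert_eq!(decompose("unknown", &dict), None);
}
```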

UnicodeSentenceSplitter : Split large chunks of text into sentences according to the Unicode definition of a sentence.

UnicodeWordSplitter : Split text into words according to the Unicode definition of what a word is. While not perfect, it should work well enough as an easy starting point.

Augmentation

Augmentation adds metadata to segments.

AugmentationDetectLanguage : Runs language and script detection using whatlang. Use before splitting sentences into words.

AugmentationDetectScript : Runs script detection using whatlang. This one is cheaper than full language detection and works on smaller chunks of text to detect which script they're written in.

AugmentationClassify : Sets the token kind to one of AlphaNumeric, Separator, Symbol or None depending on which characters are in use.
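
A plausible version of that classification, by character content alone, looks like the following. The crate's exact rules may differ; this std-only sketch just shows the idea of mapping a token's characters to one of the listed kinds:

```rust
/// Token kinds mirroring the categories named in the README.
#[derive(Debug, PartialEq)]
enum TokenKind {
    AlphaNumeric,
    Separator,
    Symbol,
    None,
}

/// Classify a token by which characters it contains. Illustrative only;
/// not the crate's actual logic.
fn classify(s: &str) -> TokenKind {
    if s.is_empty() {
        TokenKind::None
    } else if s.chars().all(char::is_alphanumeric) {
        TokenKind::AlphaNumeric
    } else if s.chars().all(char::is_whitespace) {
        TokenKind::Separator
    } else {
        TokenKind::Symbol
    }
}

fn main() {
    assert_eq!(classify("Wort42"), TokenKind::AlphaNumeric);
    assert_eq!(classify("  "), TokenKind::Separator);
    assert_eq!(classify("+-"), TokenKind::Symbol);
    assert_eq!(classify(""), TokenKind::None);
}
```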

Normalization

Takes away details you don't need.

NormalizationLowercase : Will lowercase anything that can be lowercased using the rust builtin lowercasing methods.
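
The built-in methods referred to here are `str::to_lowercase` and friends, which perform full Unicode case mapping, so the output can differ in length from the input:

```rust
fn main() {
    // str::to_lowercase performs full Unicode lowercasing.
    assert_eq!("Grüße".to_lowercase(), "grüße");
    // Some mappings change the string's length: uppercase ẞ maps to ß,
    // and Turkish İ expands to "i" plus a combining dot above.
    assert_eq!("ẞ".to_lowercase(), "ß");
    assert_eq!("İ".to_lowercase(), "i\u{307}");
}
```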

NormalizationRustStemmers : Will run stemming using the language tagged onto the token, if a stemming algorithm is available for it. This uses the rust_stemmers crate. Apply it after a language detection step, otherwise it won't do anything.

License

The Unobtanium Segmenter is licensed as LGPL-3.0-only.

It includes code that was taken and adapted from other libraries:

  • From Charabia (Copyright (c) 2020-2025 Meili SAS) licensed under the MIT license.
  • From Tantivy (Copyright (c) 2018 by the project authors, as listed in the AUTHORS file).