sakurs-core

Crates.io: sakurs-core
lib.rs: sakurs-core
version: 0.1.1
created_at: 2025-07-27 14:01:27.131576+00
updated_at: 2025-07-27 15:29:09.451502+00
description: High-performance sentence boundary detection using Delta-Stack Monoid algorithm
homepage: https://github.com/sog4be/sakurs
repository: https://github.com/sog4be/sakurs
id: 1770105
size: 515,477
(sog4be)

documentation

https://docs.rs/sakurs-core

README

sakurs-core

High-performance sentence boundary detection library using the Delta-Stack Monoid algorithm.

⚠️ API Stability Warning: This crate is in pre-release (v0.1.0). APIs may change significantly before v1.0.0. We recommend pinning to exact versions:

sakurs-core = "=0.1.0"

Features

  • Parallel Processing: Scales across multiple cores using the Delta-Stack Monoid algorithm
  • Language Support: Configurable rules for English and Japanese via TOML-based configuration
  • Mathematically Sound: Based on monoid algebra, ensuring correct results in parallel execution
  • Complex Text Support: Handles nested quotes, abbreviations, and cross-chunk boundaries correctly

Quick Start

use sakurs_core::api::{Input, SentenceProcessor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a processor with the default English rules
    let processor = SentenceProcessor::with_language("en")?;

    // Process text
    let text = "Hello world. This is a test.";
    let output = processor.process(Input::from_text(text))?;

    // Use the boundaries
    for boundary in &output.boundaries {
        println!("Sentence ends at byte offset: {}", boundary.offset);
    }

    Ok(())
}
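
Because boundaries are reported as byte offsets, a common next step is slicing the original text into sentences. The sketch below is standalone and does not call sakurs-core; the helper name `split_at_offsets` is hypothetical, and it assumes each offset is the exclusive end of a sentence, sorted in ascending order. The offsets in `main` are hand-written for illustration, not actual library output.

```rust
/// Slices `text` into sentences, given byte offsets marking where each sentence ends.
/// Assumption: offsets are ascending, exclusive end positions on UTF-8 char boundaries.
fn split_at_offsets<'a>(text: &'a str, offsets: &[usize]) -> Vec<&'a str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    for &end in offsets {
        // `str::get` returns None (instead of panicking) for out-of-range,
        // reversed, or non-char-boundary ranges; such offsets are skipped.
        if let Some(slice) = text.get(start..end) {
            let trimmed = slice.trim();
            if !trimmed.is_empty() {
                sentences.push(trimmed);
            }
            start = end;
        }
    }
    // Keep any text that follows the last boundary.
    if let Some(rest) = text.get(start..) {
        let trimmed = rest.trim();
        if !trimmed.is_empty() {
            sentences.push(trimmed);
        }
    }
    sentences
}

fn main() {
    let text = "Hello world. This is a test.";
    // Hand-written offsets for illustration only.
    let offsets = [12, 28];
    for sentence in split_at_offsets(text, &offsets) {
        println!("{sentence}");
    }
}
```

Using `str::get` rather than direct indexing keeps the helper from panicking if an offset does not fall on a character boundary, which matters for multi-byte text such as Japanese.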

Advanced Usage

Custom Configuration

use sakurs_core::api::{Config, SentenceProcessor};

let config = Config::builder()
    .language("ja")?           // Japanese language rules
    .threads(Some(4))          // Use 4 threads
    .chunk_size_kb(Some(512))  // 512KB chunks
    .build()?;

let processor = SentenceProcessor::with_config(config)?;

Processing Files

use sakurs_core::api::{Input, SentenceProcessor};

let processor = SentenceProcessor::new();
let output = processor.process(Input::from_file("document.txt"))?;

println!("Found {} sentences", output.boundaries.len());
println!("Processing took {:?}", output.metadata.processing_time);

Streaming Large Files

use sakurs_core::api::{Config, Input, SentenceProcessor};

// Use streaming configuration for memory-efficient processing
let config = Config::streaming()
    .language("en")?
    .build()?;

let processor = SentenceProcessor::with_config(config)?;
let output = processor.process(Input::from_file("large_document.txt"))?;

Language Support

Currently supported:

  • English (en)
  • Japanese (ja)

Language rules are configured via TOML files. See the main repository for documentation on adding new languages.

Algorithm

This library implements the Delta-Stack Monoid algorithm, which represents parsing state as a monoid: chunk states are combined with an associative operation that has an identity element. This property enables:

  1. Splitting text into chunks
  2. Processing chunks in parallel
  3. Combining partial results in any grouping, as long as chunk order is preserved
  4. Getting identical results to sequential processing
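
To make the steps above concrete, here is a toy monoid in the same spirit. It is not the Delta-Stack Monoid itself and does not reflect this crate's internals: it only tracks parenthesis depth, and the names `DepthSummary`, `from_chunk`, `combine`, and `summarize_parallel` are hypothetical. Each chunk is summarized independently on its own thread, and the summaries are then merged left to right.

```rust
use std::thread;

/// Toy per-chunk summary: net change in parenthesis depth, and the lowest
/// depth reached relative to the start of the chunk.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct DepthSummary {
    net: i64,
    min: i64,
}

impl DepthSummary {
    /// Identity element: the summary of an empty chunk.
    const IDENTITY: DepthSummary = DepthSummary { net: 0, min: 0 };

    /// Summarizes one chunk without knowing anything about its neighbors.
    fn from_chunk(chunk: &str) -> Self {
        let mut summary = Self::IDENTITY;
        for c in chunk.chars() {
            match c {
                '(' => summary.net += 1,
                ')' => summary.net -= 1,
                _ => {}
            }
            summary.min = summary.min.min(summary.net);
        }
        summary
    }

    /// Associative merge: `rhs` is shifted by the depth that `self` ends at.
    fn combine(self, rhs: Self) -> Self {
        DepthSummary {
            net: self.net + rhs.net,
            min: self.min.min(self.net + rhs.min),
        }
    }
}

/// Summarizes every chunk on its own scoped thread, then folds in chunk order.
fn summarize_parallel(chunks: &[&str]) -> DepthSummary {
    let parts: Vec<DepthSummary> = thread::scope(|scope| {
        // Spawn all workers first so they run concurrently, then join them.
        let handles: Vec<_> = chunks
            .iter()
            .map(|&chunk| scope.spawn(move || DepthSummary::from_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });
    parts
        .into_iter()
        .fold(DepthSummary::IDENTITY, DepthSummary::combine)
}

fn main() {
    let text = "He said (see (a) and (b)). Done.";
    // Chunk boundaries deliberately cut through the nested parentheses.
    let chunks = ["He said (see (a", ") and (b", ")). Done."];

    let sequential = DepthSummary::from_chunk(text);
    let parallel = summarize_parallel(&chunks);
    assert_eq!(sequential, parallel);
    println!("sequential = {sequential:?}, parallel = {parallel:?}");
}
```

The merge is associative but not commutative: chunks can be grouped in any way during reduction, but their left-to-right order must be kept.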

For detailed algorithm documentation, see the main repository.

License

MIT License. See LICENSE for details.
