udpipe-rs

Crates.io: udpipe-rs
lib.rs: udpipe-rs
version: 0.2.0
created_at: 2025-12-25 07:18:30.739919+00
updated_at: 2026-01-25 05:32:57.96857+00
description: Rust bindings for UDPipe - a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
repository: https://github.com/ccostello97/udpipe-rs
id: 2004289
size: 4,872,147
author: Christopher Costello (ccostello97)

documentation

https://docs.rs/udpipe-rs

README

udpipe-rs


Rust bindings for UDPipe — a trainable pipeline for tokenization, tagging, lemmatization, and dependency parsing using Universal Dependencies.

Features

  • Full parsing pipeline: Tokenization, POS tagging, lemmatization, and dependency parsing
  • Universal Dependencies: Output follows the UD annotation scheme
  • Model download utility: Easy download of pre-trained models for 65+ languages (optional)
  • Thread-friendly: Models are Send (can be moved between threads)

Installation

Add to your Cargo.toml:

[dependencies]
udpipe-rs = "0.2"

Or install via command line:

cargo add udpipe-rs

Usage

Download and load a model

use udpipe_rs::{download_model, Model};

fn main() {
    // Download model by language (saved to current directory)
    let model_path = download_model("english-ewt", ".")
        .expect("Failed to download model");

    // Load and parse
    let model = Model::load(&model_path).expect("Failed to load model");
    let words = model.parse("The quick brown fox jumps over the lazy dog.")
        .expect("Failed to parse");

    for word in words {
        println!("{:<4} {:<10} {:<6} {:<10} {:>2} <- {}",
            word.id,
            word.form,
            word.upostag,
            word.lemma,
            word.head,
            word.deprel
        );
    }
}

Output:

1    The        DET    the         2 <- det
2    quick      ADJ    quick       5 <- amod
3    brown      ADJ    brown       5 <- amod
4    fox        NOUN   fox         5 <- nsubj
5    jumps      VERB   jump        0 <- root
6    over       ADP    over        9 <- case
7    the        DET    the         9 <- det
8    lazy       ADJ    lazy        9 <- amod
9    dog        NOUN   dog         5 <- obl
10   .          PUNCT  .           5 <- punct
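
The columns printed above are a subset of the CoNLL-U fields. As a rough sketch of how the documented word fields map onto a full 10-column CoNLL-U line, the code below uses a hypothetical `StandInWord` struct (not the crate's own type) and a hypothetical `to_conllu_line` helper. It assumes the DEPS column is unavailable and writes `_` for it, as for any empty field.

```rust
/// Hypothetical stand-in that mirrors the word fields documented in this README.
/// It is not the crate's `Word` type.
#[derive(Debug, Clone)]
pub struct StandInWord {
    pub id: i32,
    pub form: String,
    pub lemma: String,
    pub upostag: String,
    pub xpostag: String,
    pub feats: String,
    pub head: i32,
    pub deprel: String,
    pub misc: String,
}

/// CoNLL-U writes `_` for an empty field.
fn or_underscore(s: &str) -> &str {
    if s.is_empty() { "_" } else { s }
}

/// Format one word as a tab-separated CoNLL-U line:
/// ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
/// DEPS is always `_` because this sketch has no such field.
pub fn to_conllu_line(w: &StandInWord) -> String {
    format!(
        "{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t_\t{}",
        w.id,
        or_underscore(&w.form),
        or_underscore(&w.lemma),
        or_underscore(&w.upostag),
        or_underscore(&w.xpostag),
        or_underscore(&w.feats),
        w.head,
        or_underscore(&w.deprel),
        or_underscore(&w.misc),
    )
}
```

With the real crate, the same formatting could presumably be applied to each parsed word by reading the fields of the same names.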

Available languages

Pre-trained models are available for 65+ languages. Use udpipe_rs::AVAILABLE_MODELS to see the full list:

// Some examples:
// "english-ewt", "english-gum", "english-lines", "english-partut"
// "german-gsd", "german-hdt"
// "french-gsd", "french-sequoia", "french-spoken"
// "spanish-ancora", "spanish-gsd"
// "dutch-alpino", "dutch-lassysmall"
// "chinese-gsd", "japanese-gsd", "korean-gsd"
// ... and many more

for lang in udpipe_rs::AVAILABLE_MODELS {
    println!("{}", lang);
}
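
Model names follow a `language-treebank` pattern, so picking the treebanks for one language is a prefix filter. The sketch below takes a plain slice of names rather than `AVAILABLE_MODELS` itself, because this README does not state that constant's exact type; `models_for_language` is a hypothetical helper, not part of the crate.

```rust
/// Return the model names that belong to one language,
/// e.g. "english" matches "english-ewt" and "english-gum".
pub fn models_for_language<'a>(models: &[&'a str], language: &str) -> Vec<&'a str> {
    // Include the hyphen so that "english" does not match a longer language name.
    let prefix = format!("{}-", language);
    models
        .iter()
        .copied()
        .filter(|name| name.starts_with(&prefix))
        .collect()
}
```

For the names listed above, filtering on "french" would keep "french-gsd", "french-sequoia", and "french-spoken".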

Working with morphological features

use udpipe_rs::Model;

fn main() {
    let model = Model::load("english-ewt-ud-2.5-191206.udpipe").expect("Failed to load");
    let words = model.parse("Run quickly!").expect("Failed to parse");

    for word in &words {
        // Check for imperative mood
        if word.is_verb() && word.has_feature("Mood", "Imp") {
            println!("Found imperative: {}", word.form);
        }

        // Get specific features
        if let Some(tense) = word.get_feature("Tense") {
            println!("{} has tense: {}", word.form, tense);
        }
    }
}
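
Morphological features are stored as a single string of `Key=Value` pairs joined by `|`, for example "Mood=Imp|VerbForm=Fin". The sketch below shows one way such a string could be split and queried. `parse_feats` and this free-standing `get_feature` are illustrative helpers written for this README, not the crate's implementation.

```rust
use std::collections::BTreeMap;

/// Split a feature string such as "Mood=Imp|VerbForm=Fin" into key/value pairs.
/// An empty string or "_" (the CoNLL-U placeholder) yields an empty map.
pub fn parse_feats(feats: &str) -> BTreeMap<&str, &str> {
    feats
        .split('|')
        .filter_map(|pair| pair.split_once('='))
        .collect()
}

/// Look up one feature by key; returns `None` when the key is absent.
pub fn get_feature<'a>(feats: &'a str, key: &str) -> Option<&'a str> {
    parse_feats(feats).get(key).copied()
}
```

Looking up "Mood" in "Mood=Imp|VerbForm=Fin" yields `Some("Imp")`, while "Tense" yields `None`.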

Working with sentence structure

use udpipe_rs::Model;

fn main() {
    let model = Model::load("english-ewt-ud-2.5-191206.udpipe").expect("Failed to load");
    let words = model.parse("Hello world. Goodbye world.").expect("Failed to parse");

    // Group words by sentence
    let mut current_sentence = -1;
    for word in &words {
        if word.sentence_id != current_sentence {
            println!("\n--- Sentence {} ---", word.sentence_id + 1);
            current_sentence = word.sentence_id;
        }
        println!("  {}: {} ({})", word.id, word.form, word.upostag);
    }
}
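
If you want the sentences as separate collections rather than printing them as you go, the same loop can build a nested vector. This is a minimal sketch with a hypothetical `group_by_sentence` helper; it is generic so it can run without the crate, and it assumes words arrive in document order, as in the loop above.

```rust
/// Group items into sentences using a caller-supplied `sentence_id` accessor.
/// Assumes items with the same sentence id are contiguous.
pub fn group_by_sentence<T: Clone>(
    words: &[T],
    sentence_id: impl Fn(&T) -> i32,
) -> Vec<Vec<T>> {
    let mut sentences: Vec<Vec<T>> = Vec::new();
    let mut current: Option<i32> = None;
    for word in words {
        let id = sentence_id(word);
        // Start a new group whenever the sentence id changes.
        if current != Some(id) {
            sentences.push(Vec::new());
            current = Some(id);
        }
        sentences
            .last_mut()
            .expect("a sentence was just pushed")
            .push(word.clone());
    }
    sentences
}
```

With parsed words, the accessor would presumably be a closure such as `|w| w.sentence_id`, provided the word type implements `Clone` (an assumption; this README does not say).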

Download from custom URL

If you need to download from a different source:

use udpipe_rs::download_model_from_url;

fn main() {
    download_model_from_url(
        "https://example.com/custom-model.udpipe",
        "custom-model.udpipe",
    ).expect("Failed to download");
}

Thread Safety

Model is Send but not Sync. This means:

  • You can move a model to another thread (ownership transfer)
  • You cannot share &Model across threads simultaneously

For concurrent access, either:

Option 1: Wrap in Mutex (shared model, serialized access)

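use std::sync::{Arc, Mutex};
use udpipe_rs::Model;

fn main() {
    let model = Model::load("model.udpipe").expect("Failed to load model");
    let model = Arc::new(Mutex::new(model));

    // Clone the Arc for each thread
    let model_clone = Arc::clone(&model);
    let handle = std::thread::spawn(move || {
        // Only one thread can hold the lock, so parsing is serialized
        let guard = model_clone.lock().unwrap();
        let words = guard.parse("Hello world").unwrap();
        for word in &words {
            println!("{}", word.form);
        }
    });

    // Wait for the worker before main exits
    handle.join().unwrap();
}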

Option 2: Separate models per thread (parallel access, higher memory)

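use udpipe_rs::Model;

fn main() {
    let handle = std::thread::spawn(|| {
        // Each thread loads and owns its own model
        let model = Model::load("model.udpipe").unwrap();
        let words = model.parse("Hello world").unwrap();
    });

    // Wait for the worker before main exits
    handle.join().unwrap();
}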
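
A further pattern that works for any type that is Send but not Sync is to move the value into one dedicated worker thread and talk to it over channels. The sketch below is not taken from the crate. `FakeModel` is a hypothetical stand-in that splits on whitespace instead of parsing, and `spawn_parser` and `parse_via` are illustrative names.

```rust
use std::cell::Cell;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a model that is `Send` but not `Sync`
/// (the `Cell` field makes it `!Sync`). It "parses" by splitting on whitespace.
pub struct FakeModel {
    calls: Cell<usize>,
}

impl FakeModel {
    pub fn new() -> Self {
        FakeModel { calls: Cell::new(0) }
    }

    pub fn parse(&self, text: &str) -> Vec<String> {
        self.calls.set(self.calls.get() + 1);
        text.split_whitespace().map(str::to_owned).collect()
    }
}

/// A request carries the text plus a channel on which to send the reply.
pub type Request = (String, mpsc::Sender<Vec<String>>);

/// Move the model into one worker thread and return a sender for requests.
pub fn spawn_parser(model: FakeModel) -> (mpsc::Sender<Request>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let handle = thread::spawn(move || {
        // The loop ends once every request sender has been dropped.
        for (text, reply) in rx {
            // Ignore the error if the requester has already gone away.
            let _ = reply.send(model.parse(&text));
        }
    });
    (tx, handle)
}

/// Send one text to the worker and block until its result arrives.
pub fn parse_via(tx: &mpsc::Sender<Request>, text: &str) -> Vec<String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send((text.to_owned(), reply_tx)).expect("worker has stopped");
    reply_rx.recv().expect("worker dropped the reply channel")
}
```

Each client thread clones the request sender. Like the Mutex option, parsing is serialized, but callers never touch the model directly and only one copy is held in memory.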

API Reference

Word struct

Each parsed word contains:

Field        Type    Description
form         String  The surface form (actual text)
lemma        String  The lemma (dictionary form)
upostag      String  Universal POS tag (NOUN, VERB, ADJ, etc.)
xpostag      String  Language-specific POS tag
feats        String  Morphological features (e.g., "Mood=Imp|VerbForm=Fin")
deprel       String  Dependency relation (root, nsubj, obj, etc.)
misc         String  Miscellaneous annotations (e.g., "SpaceAfter=No")
id           i32     1-based index of this word within its sentence
head         i32     Index of head word (0 = root of sentence)
sentence_id  i32     0-based index of the sentence this word belongs to
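
Because id is 1-based within a sentence and head refers to another word's id (with 0 marking the root), the dependency tree can be walked using those two fields alone. The sketch below works on plain `(id, head)` pairs for one sentence so it runs without the crate; `children_of` and `depth` are hypothetical helpers.

```rust
/// Ids of the words whose head is `head_id`, given `(id, head)` pairs for one sentence.
pub fn children_of(arcs: &[(i32, i32)], head_id: i32) -> Vec<i32> {
    arcs.iter()
        .filter(|&&(_, head)| head == head_id)
        .map(|&(id, _)| id)
        .collect()
}

/// Number of arcs between `id` and the sentence root (the root has depth 0).
/// Returns `None` if `id` is unknown or the arcs contain a cycle.
pub fn depth(arcs: &[(i32, i32)], id: i32) -> Option<usize> {
    let mut current = id;
    let mut steps = 0;
    // A well-formed tree needs fewer than `arcs.len()` hops to reach the root.
    while steps < arcs.len() {
        let &(_, head) = arcs.iter().find(|&&(i, _)| i == current)?;
        if head == 0 {
            return Some(steps);
        }
        current = head;
        steps += 1;
    }
    None
}
```

Using the id and head columns from the fox example in the Usage section, `children_of(&arcs, 0)` returns the root word's id, 5, and `depth(&arcs, 6)` is 2 because "over" attaches to "dog", which attaches to "jumps".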

Helper methods on Word

  • has_feature(key, value) — Check if a morphological feature is present
  • get_feature(key) — Get the value of a morphological feature
  • is_verb() — Returns true for VERB or AUX tags
  • is_noun() — Returns true for NOUN or PROPN tags
  • is_adjective() — Returns true for ADJ tag
  • is_punct() — Returns true for PUNCT tag
  • is_root() — Returns true if this word is the sentence root
  • has_space_after() — Returns true if there's a space after this word (default)
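
The space-after information is what lets you rebuild the original text from tokens. As a rough sketch, the function below joins `(form, misc)` pairs and omits the space after any token whose misc field contains "SpaceAfter=No". `detokenize` is a hypothetical helper that handles only that one annotation, and it does not show how the crate implements has_space_after().

```rust
/// Rebuild surface text from `(form, misc)` pairs.
/// A space follows each token unless its misc field contains `SpaceAfter=No`
/// or it is the last token.
pub fn detokenize(tokens: &[(&str, &str)]) -> String {
    let mut text = String::new();
    for (i, (form, misc)) in tokens.iter().enumerate() {
        text.push_str(form);
        // The misc field may hold several `|`-separated annotations.
        let no_space = misc.split('|').any(|item| item == "SpaceAfter=No");
        let is_last = i + 1 == tokens.len();
        if !no_space && !is_last {
            text.push(' ');
        }
    }
    text
}
```

With parsed words, the same loop could presumably call `word.has_space_after()` instead of inspecting the misc string by hand.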

Examples

# Download a model
cargo run --example download_model
cargo run --example download_model -- german-gsd ./models

# Parse text
cargo run --example parse_text
cargo run --example parse_text -- "Your text here."

Models

Pre-trained models for 100+ treebanks are available from the LINDAT/CLARIAH-CZ repository. The download_model function fetches from this repository automatically.

Requirements

For users: A C++ compiler with C++11 support. The build script compiles UDPipe as a static library automatically.

For contributors: Just Docker. See CONTRIBUTING.md for details.

License

This crate is dual-licensed under MIT OR Apache-2.0.

UDPipe itself is licensed under the Mozilla Public License 2.0.
