| field | value |
|---|---|
| Crates.io | rust_tokenizers |
| lib.rs | rust_tokenizers |
| version | 8.1.1 |
| source | src |
| created_at | 2020-02-15 10:24:06.801927 |
| updated_at | 2023-10-01 09:32:29.392146 |
| description | High performance tokenizers for Rust |
| homepage | |
| repository | https://github.com/guillaume-be/rust-tokenizers |
| max_upload_size | |
| id | 209420 |
| size | 1,165,029 |
Rust-tokenizer is a drop-in replacement for the tokenization methods from the Transformers library. It includes a broad range of tokenizers for state-of-the-art transformer architectures, including:

- Sentence Piece (unigram model)
- Sentence Piece (BPE model)
- BERT
- ALBERT
- DistilBERT
- RoBERTa
- GPT
- GPT2
- ProphetNet
- CTRL
- Pegasus
- MBart50
- M2M100
- FNet
- DeBERTa
- DeBERTa (v2)
The WordPiece-based tokenizers include both single-threaded and multi-threaded processing (a batch-encoding sketch follows the usage example below). The Byte-Pair-Encoding tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers. Using the tokenizers requires manually downloading the files they rely on (vocabulary or merges files), which can be found in the Transformers library.
The following example loads a BERT vocabulary from disk and encodes a single sentence:

```rust
use std::path::PathBuf;

use rust_tokenizers::adapters::Example;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let lowercase: bool = true;
    let strip_accents: bool = true;

    // Load a BERT vocabulary file downloaded beforehand (see above).
    let vocab_path: PathBuf = PathBuf::from("path/to/vocab");
    let vocab: BertVocab = BertVocab::from_file(&vocab_path)?;

    let test_sentence: Example = Example::new_from_string("This is a sample sentence to be tokenized");
    let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab, lowercase, strip_accents);

    // Encode with no second sentence, a 128-token limit, truncating the
    // longest sequence first, and no stride between overflowing windows.
    println!(
        "{:?}",
        bert_tokenizer.encode(&test_sentence.sentence_1, None, 128, &TruncationStrategy::LongestFirst, 0)
    );
    Ok(())
}
```
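The `encode` call returns a `TokenizedInput`, which carries the token ids and segment ids (among other fields) ready to be fed to a BERT-style model.

For batch workloads, WordPiece tokenizers such as `BertTokenizer` can encode a list of texts across threads through the `MultiThreadedTokenizer` trait. The sketch below assumes the trait's `encode_list` method takes a slice of string-like values plus the same length, truncation, and stride parameters as `encode`; check the crate documentation for the exact signature in your version:

```rust
use std::path::PathBuf;

use rust_tokenizers::tokenizer::{BertTokenizer, MultiThreadedTokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same setup as the single-sentence example above.
    let vocab: BertVocab = BertVocab::from_file(&PathBuf::from("path/to/vocab"))?;
    let tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab, true, true);

    // Fully qualified call: the `Tokenizer` trait also exposes a single-threaded
    // `encode_list`, so naming the trait makes the parallel variant explicit.
    let sentences = ["First sample sentence", "Second sample sentence"];
    let encoded =
        MultiThreadedTokenizer::encode_list(&tokenizer, &sentences, 128, &TruncationStrategy::LongestFirst, 0);
    println!("{:?}", encoded);
    Ok(())
}
```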