Crates.io | tokenizers-enfer |
lib.rs | tokenizers-enfer |
version | |
source | src |
created_at | 2024-12-06 04:16:30.768627 |
updated_at | 2024-12-06 04:26:31.601531 |
description | Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. |
homepage | https://github.com/sssemil/tokenizers-enfer |
repository | https://github.com/sssemil/tokenizers-enfers |
max_upload_size | |
id | 1473909 |
Cargo.toml error | TOML parse error at line 29, column 1 (`autolib = false`): unknown field `autolib`, expected one of `name`, `version`, `edition`, `authors`, `description`, `readme`, `license`, `repository`, `homepage`, `documentation`, `build`, `resolver`, `links`, `default-run`, `rust-version`, `license-file`, `forced-target`, `autobins`, `autotests`, `autoexamples`, `autobenches`, `publish`, `metadata`, `keywords`, `categories`, `exclude`, `include` |
size | 0 |
The core of tokenizers, written in Rust.
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
A Tokenizer works as a pipeline: it processes some raw text as input and outputs an Encoding.
The various steps of the pipeline are:
- Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as NFD or NFKC. More details about how to use the Normalizers are available on the Hugging Face blog; a small sketch follows this list.
- PreTokenizer: in charge of creating the initial word splits in the text. The most common way of splitting text is simply on whitespace.
- Model: in charge of doing the actual tokenization. An example of a Model would be BPE or WordPiece.
- PostProcessor: in charge of post-processing the Encoding to add anything relevant that, for example, a language model would need, such as special tokens.
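To illustrate the first step in isolation, here is a minimal sketch applying NFKC normalization to a string. It assumes the Normalizer trait and the NormalizedString type are re-exported at the crate root, as in upstream tokenizers:

use tokenizers::normalizers::unicode::NFKC;
use tokenizers::{NormalizedString, Normalizer, Result};

fn main() -> Result<()> {
    // NFKC folds compatibility characters: the single ligature codepoint
    // "ﬁ" should come out as the two letters "fi".
    let mut normalized = NormalizedString::from("ﬁne");
    NFKC.normalize(&mut normalized)?;
    println!("{}", normalized.get());
    Ok(())
}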
A quick example, loading a pretrained tokenizer from the Hugging Face Hub (requires the http feature):

use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;
    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
Deserialization and tokenization example, building a Tokenizer from a BPE model loaded from local files:

use tokenizers::tokenizer::{Result, Tokenizer};
use tokenizers::models::bpe::BPE;

fn main() -> Result<()> {
    // Load a BPE model from local vocabulary and merges files.
    let bpe_builder = BPE::from_file("./path/to/vocab.json", "./path/to/merges.txt");
    let bpe = bpe_builder
        .dropout(0.1)
        .unk_token("[UNK]".into())
        .build()?;

    let mut tokenizer = Tokenizer::new(bpe);

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
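Going the other way, token ids can be mapped back to text with decode. A minimal sketch (requires the http feature; note that in recent versions decode takes a slice of ids plus a flag for skipping special tokens):

use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;
    let encoding = tokenizer.encode("Hey there!", false)?;
    // Round-trip: decode the ids, skipping special tokens in the output.
    let decoded = tokenizer.decode(encoding.get_ids(), true)?;
    println!("{}", decoded);
    Ok(())
}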
Training and serialization example, building a brand-new BPE tokenizer from scratch:

use tokenizers::decoders::DecoderWrapper;
use tokenizers::models::bpe::{BpeTrainerBuilder, BPE};
use tokenizers::normalizers::{strip::Strip, unicode::NFC, utils::Sequence, NormalizerWrapper};
use tokenizers::pre_tokenizers::byte_level::ByteLevel;
use tokenizers::pre_tokenizers::PreTokenizerWrapper;
use tokenizers::processors::PostProcessorWrapper;
use tokenizers::{AddedToken, Model, Result, TokenizerBuilder};

fn main() -> Result<()> {
    let vocab_size: usize = 100;

    // Configure the trainer: target vocabulary size and special tokens.
    let mut trainer = BpeTrainerBuilder::new()
        .show_progress(true)
        .vocab_size(vocab_size)
        .min_frequency(0)
        .special_tokens(vec![
            AddedToken::from(String::from("<s>"), true),
            AddedToken::from(String::from("<pad>"), true),
            AddedToken::from(String::from("</s>"), true),
            AddedToken::from(String::from("<unk>"), true),
            AddedToken::from(String::from("<mask>"), true),
        ])
        .build();

    // Assemble the full pipeline: normalizer, pre-tokenizer, model,
    // post-processor, and decoder.
    let mut tokenizer = TokenizerBuilder::new()
        .with_model(BPE::default())
        .with_normalizer(Some(Sequence::new(vec![
            Strip::new(true, true).into(),
            NFC.into(),
        ])))
        .with_pre_tokenizer(Some(ByteLevel::default()))
        .with_post_processor(Some(ByteLevel::default()))
        .with_decoder(Some(ByteLevel::default()))
        .build()?;

    let pretty = false;
    tokenizer
        .train_from_files(
            &mut trainer,
            vec!["path/to/vocab.txt".to_string()],
        )?
        .save("tokenizer.json", pretty)?;

    Ok(())
}
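Once saved, the serialized tokenizer can be reloaded with Tokenizer::from_file. A minimal sketch:

use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Reload the tokenizer.json produced by the training example above.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}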
The level of parallelism during batch encoding and training is determined by the number of cores/threads the machine has, but it can be tuned with the RAYON_RS_NUM_THREADS environment variable. As an example, setting RAYON_RS_NUM_THREADS=4 will allocate a maximum of 4 threads. Please note this behavior may evolve in the future.

Crate features:
- progressbar: the progress bar visualization is enabled by default. It might be disabled if compilation for certain targets is not supported by the termios dependency of the indicatif progress bar.
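For illustration, a hedged sketch of capping the thread pool from inside the program. Setting the variable in the shell before launch (RAYON_RS_NUM_THREADS=4 ./my-program) is the more reliable route, since Rayon reads it only when its global pool is first initialized:

use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Must run before the first parallel call initializes Rayon's pool.
    std::env::set_var("RAYON_RS_NUM_THREADS", "4");
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;
    // encode_batch processes the inputs in parallel across the allotted threads.
    let encodings = tokenizer.encode_batch(vec!["Hey there!", "How are you?"], false)?;
    println!("{} encodings", encodings.len());
    Ok(())
}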