token-string

Crates.iotoken-string
lib.rstoken-string
version
sourcesrc
created_at2024-12-01 23:14:52.797985
updated_at2024-12-03 12:10:12.817171
descriptionShort (up to 65,535 bytes) immutable strings to e.g. parse tokens, implemented in Rust. These are sometimes called 'German Strings', because Germans have written the paper mentioning them.
homepage
repositoryhttps://codeberg.org/Release-Candidate/token-string
max_upload_size
id1467959
Cargo.toml error:TOML parse error at line 25, column 1 | 25 | autolib = false | ^^^^^^^ unknown field `autolib`, expected one of `name`, `version`, `edition`, `authors`, `description`, `readme`, `license`, `repository`, `homepage`, `documentation`, `build`, `resolver`, `links`, `default-run`, `default_dash_run`, `rust-version`, `rust_dash_version`, `rust_version`, `license-file`, `license_dash_file`, `license_file`, `licenseFile`, `license_capital_file`, `forced-target`, `forced_dash_target`, `autobins`, `autotests`, `autoexamples`, `autobenches`, `publish`, `metadata`, `keywords`, `categories`, `exclude`, `include`
size0
(Release-Candidate)

documentation

README

token-string

Crates.io Version docs.rs MPL v2.0 license Built with cargo-make

Short (up to 65,535 bytes) immutable strings to e.g. parse tokens, implemented in Rust. These are sometimes called "German Strings", because Germans have written the paper mentioning them, see The Paper for details.

Why?

When working with single words or even just single letters, which are Unicode grapheme clusters and not necessarily single bytes, we most of the time can get by with a bit[what a pun!] more than 10 bytes of UTF-8. English has an average letter count of about 5 and needs exactly one byte per letter, Finnish has about 8 with most of them needing at most 2 bytes (the Euro sign € needs 3 bytes (0xE2 0x82 0xAC), for example). Chinese symbols take 3 to 4 bytes, but most1 words are up to two symbols long. As a pointer on 64 bit architectures2 are already 8 bytes we could store almost the whole string in the pointer itself without needing further indirection or heap allocation. This library is an attempt to use the string struct TokenString which has a size of 16 bytes (128 bits) to store such small strings itself.

And immutable strings, because they make life easier -- except when concatenating -- and in many use cases the strings don't change after being generated anyways. E.g. after tokenization the scanned strings don't change any more.

The Compromise

I have chosen the maximal size of such a "small" string to use 2 bytes -- 16 bits, so the maximum string length is 2^16 - 1 = 65,535 bytes. Which leaves the maximum size of strings saved in the string struct itself at 14 bytes. In other words, there is space for up to 14 ASCII (single byte) letters or three 4 byte Unicode scalar points, so e.g. 3 Chinese symbols.

Japanese Hiragana or Katakana needs 3 bytes per symbol, so there is space for 4 of them. Which is not enough to store "most" words. By using one more byte of the size, we could save up to 15 bytes or 5 Katakana symbols, but the maximum size of a string would be only 255 bytes.

How?

Strings which are at most 14 bytes long are short, or "small", TokenStrings. Such small strings are directly contained in the TokenString struct, so no memory is allocated on the heap and no indirection is needed to access the string's data. Every string which is greater than 14 bytes -- up to the maximum of 65,535 bytes -- is allocated on the heap, but a copy of it's first 6 bytes, the so called "prefix", is stored in the struct itself and directly accessible without additional indirection through a pointer.

So, the first 6 bytes of every TokenString, the "prefix", is used to speed up the comparison of strings and similar tasks which may not need to access the whole string in all cases. E.g. comparing two strings of the same length for equality can fail fast if the prefixes of both strings differ, without the need to de-reference a pointer to a string's data.

Memory Layout of a TokenString

A TokenString is guaranteed to be aligned to 64 bits (8 bytes) and has a size of 16 bytes (128 bits).

Below is a diagram depicting the memory used by TokenStrings and small TokenStrings.

Memory layout diagram

The first two bytes of the struct always hold the length of the TokenString and the next 6 bytes the prefix, which is the first 6 bytes of the string. A short TokenString contains the bytes after the prefix in the following 8 bytes, up to the maximum size of 14 bytes for a small TokenString. A bigger TokenString contains a pointer to the whole string data (including the prefix) in the 8 bytes after the prefix.

Usage of the Library

The crate at crates.io: crates.io: token_string.

Full documentation at Docs.rs: token_string

Installation

Run the following Cargo command in your project directory:

cargo add token-string

Or add the following line to your Cargo.toml:

token-string = "0.8.1"

Examples

A few short examples of the library, just to get a feel on how it works. Detailed documentation is available HERE.

These are immutable strings, so don't use mut.

use token_string::TokenString;

// The string we passed may be too big for a `TokenString` and an error returned.
let s1 = TokenString::try_from("Hello, world!").unwrap();

// We can also create a `TokenString` from bytes:
let s2 = TokenString::try_from(b"Hello, world!".as_slice()).unwrap();

// Or a String.
let s3_string = "Hello, world!".to_owned();
let s3 = TokenString::try_from(&s3_string).unwrap();

// We can print `TokenString`s:
println!("{}", &s1);

// We can compare `TokenString`s:
assert!(s1 == s2);

// and compare them to `&str`, `&[u8]` or `&string`:
assert!(&s1 == "Hello, world!");
assert!(&s1 == b"Hello, world!".as_slice());
assert!(s1 == s3_string);

// We also can convert them to `&str`, `&[u8]`, `String` or a vector of `char`s:
let s1_str: &str = s1.as_str();
let s1_bytes: &[u8] = s1.as_bytes();
let s1_string: String = s1.as_string();
let s1_chars: Vec<char> = s1.as_chars();

// We can iterate through the bytes of the string.
// CAUTION: these are bytes, not Unicode scalar values or grapheme clusters, so
// not every iterator "is" some kind of a character.
for i in &s1 {
  println!("Byte: {i}");
}

String Concatenation

Because TokenStrings are immutable, we cannot concatenate two or more strings by directly appending to the first string, as this would mutate this string. To not have to allocate temporary strings when concatenating more than two, all strings to concatenate are added to an array of TokenStrings to concatenate on the stack. Only when calling the Builder's method to actually concatenate all strings, a new TokenString is generated and memory allocated if necessary.

Examples 2

use token_string::{Builder, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world!").unwrap();

// The number of strings to concatenate must be given to the `Builder` as a
// const generic value. We are concatenating 2 strings, so this is 2.
// The first parameter can be `'_`, to let the Rust compiler infer the lifetime.
let mut builder = Builder::<'_, 2>::new(&s1);

// Append `s2` to `s1`, "world!" to "Hello, ".
// An error is returned if the concatenated string is too big.
builder.concat_checked(&s2).unwrap();

// Create the new `TokenString`. Again, an error may occur.
let s1_s2 = builder.collect_checked().unwrap();

// Check, if the result is actually "Hello, world!".
assert!(&s1_s2 == "Hello, world!");

As checking every step like above is quite cumbersome, the two traits Concat and Collect make this easier by returning early if a step returns an error. Instead of using concat_checked and collect_checked, we use the traits methods Concat::concat and Collect::collect.

use token_string::{Builder, Collect as _, Concat as _, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world").unwrap();
let s3 = TokenString::try_from("!").unwrap();

// The number of strings to concatenate must be given to the `Builder` as a
// const generic value. We are concatenating 3 strings, so this is 3.
// The first parameter can be `'_`, to let the Rust compiler infer the lifetime.
let s1_s2_s3 = Builder::<'_, 3>::new(&s1)
    .concat(&s2)
    .concat(&s3)
    .collect()
    .unwrap();

// Check, if the result is actually "Hello, world!".
assert!(&s1_s2_s3 == "Hello, world!");

This is better, but having to manually set the number of strings to concatenate is tedious and error-prone. So let's use a macro to do that all for us -- token_string::concat! macro internally does the same as the example above. Just be sure to use token_string::concat! and not another concat! macro.

use token_string::{Builder, Collect as _, Concat as _, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world").unwrap();
let s3 = TokenString::try_from("!").unwrap();

// Fully qualify `token_string::concat!` to get the right `concat!` macro.
// Again, an error may be returned if the concatenated string would be too big.
let s1_s2_s3 = token_string::concat!(&s1, &s2, &s3).unwrap();

// Check, if the result is actually "Hello, world!".
assert!(&s1_s2_s3 == "Hello, world!");

Performance

You have to test for yourself! It depends on your actual usage, if this string representation is a measurable -- much less significant enough to be noticeable -- optimization.

Building and Testing the Source

This uses cargo-make to not have to input long cargo command lines. The file ./Makefile.toml contains its configuration.

But there is no need to use cargo-make, all commands can be input manually too -- after having installed the needed programs.

The grouped list of cargo-make's build targets and the cargo command line it invokes:

  • build: cargo build - Build the library with debug information.
  • doc: cargo doc --all-features - Generate API documentation using rustdoc.
  • example: cargo run --example example - Run the example binary, source: ./examples/main.rs.
  • clean: cargo clean - Delete files generated by cargo.
  • lint: cargo clippy - Run Clippy, the Rust linter, on the sources. Needs Clippy installed.

Tests:

  • test: cargo nextest run --all-features - Run all tests except for doc-tests. Needs cargo-nextest.
  • doc-test: cargo test --doc - Run all doc-tests.
  • cov: cargo llvm-cov nextest --all-features --lcov --output-path lcov.info - Run tests with coverage information, generate LCOV output at ./lcov.info. Needs cargo-llvm-cov.
  • mutation: cargo mutants --test-tool=nextest -- --all-features - Run mutation tests using cargo-mutants.
  • mutation-iterate: cargo mutants --test-tool=nextest --iterate -- --all-features - Run mutation tests with cargo-mutants, using the data from previous runs to not run passed tests again.
  • miri: PROPTEST_DISABLE_FAILURE_PERSISTENCE=true MIRIFLAGS='-Zmiri-env-forward=PROPTEST_DISABLE_FAILURE_PERSISTENCE -Zmiri-backtrace=full -Zmiri-tree-borrows' cargo miri nextest run -j 10 - Run Miri on all tests. Needs Miri installed. Warning: this takes more than 10 hours if the test concat_many in ./tests/test_builder.rs is run, compared to about 1 hour without that test.
  • miri-example: MIRIFLAGS='-Zmiri-backtrace=full -Zmiri-tree-borrows' cargo miri run --example example - Run Miri on the example executable. Needs Miri installed.

The Paper

Neumann, Freitag: Umbra: A Disk-Based System with In-Memory Performance mentions comparable small strings in chapter 3.1 String Handling at page 5.

The main difference to the variant I've implemented is that I use only 2 bytes (16 bits) for the string length instead of 4 bytes (32 bits) in the paper. So the maximum string length of TokenStrings is 65KB bytes compared to 4GB in the paper. But we gain 2 bytes of length for the short strings contained in the struct itself, 14 bytes vs. 12 bytes in the paper's version.

License

The source in this repository is licensed under the Mozilla Public License version 2.0, see file ./LICENSE.

Footnotes

  1. Source for all of these average word lengths: "the internet"

  2. This does not work with 32 bit (or 128 bit) pointer types.

Commit count: 0

cargo fmt