Short (up to 65,535 bytes) immutable strings, e.g. for parsing tokens, implemented in Rust. These are sometimes called "German Strings" because the paper mentioning them was written by Germans; see The Paper below for details.
When working with single words or even single letters (which are Unicode grapheme clusters, not necessarily single bytes), we can most of the time get by with a bit (what a pun!) more than 10 bytes of UTF-8. English words average about 5 letters, each needing exactly one byte; Finnish words average about 8 letters, most of which need at most 2 bytes (the Euro sign € needs 3 bytes, 0xE2 0x82 0xAC, for example). Chinese characters take 3 to 4 bytes, but most words are at most two characters long. Since a pointer on 64-bit architectures is already 8 bytes, we could store almost the whole string in the pointer itself without needing further indirection or heap allocation. This library is an attempt to use the string struct `TokenString`, which has a size of 16 bytes (128 bits), to store such small strings inline.
And immutable strings, because they make life easier -- except when concatenating -- and in many use cases the strings don't change after being generated anyway. E.g. after tokenization, the scanned strings never change again.
I have chosen to use 2 bytes (16 bits) for the length of such a "small" string, so the maximum string length is 2^16 - 1 = 65,535 bytes. That leaves 14 bytes for string data stored in the string struct itself. In other words, there is space for up to 14 ASCII (single-byte) letters or three 4-byte Unicode scalar values, e.g. 3 Chinese characters. Japanese Hiragana or Katakana needs 3 bytes per symbol, so there is space for 4 of them, which is not enough to store "most" words. By using one byte less for the length, we would gain one more byte for the data -- up to 15 bytes, or 5 Katakana symbols -- but the maximum size of a string would then be only 255 bytes.
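To make the arithmetic concrete, here is a small sanity check of the UTF-8 byte counts used above (plain Rust, independent of this crate):

```rust
fn main() {
    // `str::len` returns the length in UTF-8 bytes, not in characters.
    assert_eq!("Hello".len(), 5); // ASCII: 1 byte per letter
    assert_eq!("€".len(), 3); // Euro sign: 0xE2 0x82 0xAC
    assert_eq!("中文".len(), 6); // two Chinese characters, 3 bytes each
    assert_eq!("カタカナ".len(), 12); // four Katakana symbols, 3 bytes each
    // 14 bytes of inline space fit 4 Katakana symbols, but not 5 (15 bytes).
    assert!(4 * 3 <= 14 && 5 * 3 > 14);
}
```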
Strings which are at most 14 bytes long are short, or "small", `TokenString`s.
Such small strings are directly contained in the `TokenString` struct, so no memory is allocated on the heap and no indirection is needed to access the string's data. Every string which is greater than 14 bytes -- up to the maximum of 65,535 bytes -- is allocated on the heap, but a copy of its first 6 bytes, the so-called "prefix", is stored in the struct itself and is directly accessible without additional indirection through a pointer.
So the first 6 bytes of every `TokenString`, the "prefix", are used to speed up the comparison of strings and similar tasks which may not need to access the whole string in all cases. E.g. comparing two strings of the same length for equality can fail fast if the prefixes of both strings differ, without the need to dereference a pointer to a string's data.
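A minimal sketch of that fast-fail idea, with hypothetical function and parameter names (this is not the crate's actual implementation):

```rust
/// Hypothetical view of the inline metadata of two strings: their lengths
/// and their 6-byte prefixes, both available without touching the heap.
fn definitely_not_equal(
    len_a: u16,
    prefix_a: &[u8; 6],
    len_b: u16,
    prefix_b: &[u8; 6],
) -> bool {
    // Different lengths or different prefixes: the strings cannot be equal,
    // so the (possibly heap-allocated) remainder never has to be read.
    len_a != len_b || prefix_a != prefix_b
}
```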
A `TokenString` is guaranteed to be aligned to 64 bits (8 bytes) and has a size of 16 bytes (128 bits). The memory layout of `TokenString`s and small `TokenString`s is described below.
The first two bytes of the struct always hold the length of the `TokenString`, and the next 6 bytes the prefix, which is the first 6 bytes of the string. A short `TokenString` contains the bytes after the prefix in the following 8 bytes, up to the maximum size of 14 bytes for a small `TokenString`. A bigger `TokenString` contains, in the 8 bytes after the prefix, a pointer to the whole string data (including the prefix).
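As a sketch, the layout described above could be written like this in Rust; the field and type names are illustrative, not the crate's actual definitions:

```rust
#[repr(C, align(8))]
struct TokenStringLayout {
    len: u16,        // bytes 0-1: length of the string (up to 65,535)
    prefix: [u8; 6], // bytes 2-7: first 6 bytes of the string
    tail: Tail,      // bytes 8-15: inline remainder or pointer
}

#[repr(C)]
union Tail {
    inline: [u8; 8], // small string (<= 14 bytes): the bytes after the prefix
    heap: *const u8, // bigger string: pointer to the full data (incl. prefix)
}
```

Both variants add up to the guaranteed 16 bytes with 8-byte alignment.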
The crate at crates.io: token-string. Full documentation at Docs.rs: token-string. Source code at Codeberg: https://codeberg.org/Release-Candidate/token-string
Run the following Cargo command in your project directory:

```sh
cargo add token-string
```

Or add the following line to your Cargo.toml:

```toml
token-string = "0.8.1"
```
A few short examples of the library, just to get a feel for how it works. Detailed documentation is available at Docs.rs.
These are immutable strings, so there is no `mut`.
```rust
use token_string::TokenString;

// The string we pass may be too big for a `TokenString`, in which case an
// error is returned.
let s1 = TokenString::try_from("Hello, world!").unwrap();
// We can also create a `TokenString` from bytes:
let s2 = TokenString::try_from(b"Hello, world!".as_slice()).unwrap();
// Or from a `String`:
let s3_string = "Hello, world!".to_owned();
let s3 = TokenString::try_from(&s3_string).unwrap();
// We can print `TokenString`s:
println!("{}", &s1);
// We can compare `TokenString`s:
assert!(s1 == s2);
// and compare them to `&str`, `&[u8]` or `&String`:
assert!(&s1 == "Hello, world!");
assert!(&s1 == b"Hello, world!".as_slice());
assert!(s1 == s3_string);
// We can also convert them to `&str`, `&[u8]`, `String` or a vector of `char`s:
let s1_str: &str = s1.as_str();
let s1_bytes: &[u8] = s1.as_bytes();
let s1_string: String = s1.as_string();
let s1_chars: Vec<char> = s1.as_chars();
// We can iterate through the bytes of the string.
// CAUTION: these are bytes, not Unicode scalar values or grapheme clusters,
// so not every item is some kind of character.
for i in &s1 {
    println!("Byte: {i}");
}
```
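For completeness, a sketch of the failure case: a string longer than 65,535 bytes cannot be represented, so `try_from` should return an error (the concrete error type is omitted here):

```rust
use token_string::TokenString;

// 65,536 bytes is one byte over the maximum length of 2^16 - 1.
let too_big = "x".repeat(65_536);
assert!(TokenString::try_from(too_big.as_str()).is_err());
```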
Because `TokenString`s are immutable, we cannot concatenate two or more strings by directly appending to the first string, as this would mutate that string. To avoid allocating temporary strings when concatenating more than two, all strings to concatenate are added to an array of `TokenString`s on the stack. Only when calling the `Builder`'s method to actually concatenate all strings is a new `TokenString` generated, and memory allocated if necessary.
```rust
use token_string::{Builder, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world!").unwrap();
// The number of strings to concatenate must be given to the `Builder` as a
// const generic value. We are concatenating 2 strings, so this is 2.
// The first parameter can be `'_`, to let the Rust compiler infer the lifetime.
let mut builder = Builder::<'_, 2>::new(&s1);
// Append `s2` to `s1`, "world!" to "Hello, ".
// An error is returned if the concatenated string would be too big.
builder.concat_checked(&s2).unwrap();
// Create the new `TokenString`. Again, an error may occur.
let s1_s2 = builder.collect_checked().unwrap();
// Check that the result is actually "Hello, world!".
assert!(&s1_s2 == "Hello, world!");
```
As checking every step like above is quite cumbersome, the two traits `Concat` and `Collect` make this easier by returning early if a step returns an error. Instead of using `concat_checked` and `collect_checked`, we use the trait methods `Concat::concat` and `Collect::collect`.
```rust
use token_string::{Builder, Collect as _, Concat as _, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world").unwrap();
let s3 = TokenString::try_from("!").unwrap();
// The number of strings to concatenate must be given to the `Builder` as a
// const generic value. We are concatenating 3 strings, so this is 3.
// The first parameter can be `'_`, to let the Rust compiler infer the lifetime.
let s1_s2_s3 = Builder::<'_, 3>::new(&s1)
    .concat(&s2)
    .concat(&s3)
    .collect()
    .unwrap();
// Check that the result is actually "Hello, world!".
assert!(&s1_s2_s3 == "Hello, world!");
```
This is better, but having to manually set the number of strings to concatenate is tedious and error-prone. So let's use a macro to do all of that for us -- the `token_string::concat!` macro internally does the same as the example above. Just be sure to use `token_string::concat!` and not another `concat!` macro.
```rust
use token_string::{Builder, Collect as _, Concat as _, TokenString};

// If the string given to `try_from` is too big, an error is returned.
let s1 = TokenString::try_from("Hello, ").unwrap();
let s2 = TokenString::try_from("world").unwrap();
let s3 = TokenString::try_from("!").unwrap();
// Fully qualify `token_string::concat!` to get the right `concat!` macro.
// Again, an error may be returned if the concatenated string would be too big.
let s1_s2_s3 = token_string::concat!(&s1, &s2, &s3).unwrap();
// Check that the result is actually "Hello, world!".
assert!(&s1_s2_s3 == "Hello, world!");
```
You have to test that for yourself! Whether this string representation is a measurable optimization -- much less one significant enough to be noticeable -- depends on your actual usage.
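As a starting point, here is a rough micro-benchmark sketch; the workload and the timing method are illustrative only, and a real comparison should use a proper harness such as criterion:

```rust
use std::time::Instant;

use token_string::TokenString;

fn main() {
    // Tokenize a repeated input and time it; compare against the equivalent
    // loop using `String` on your own workload.
    let input = "the quick brown fox jumps over the lazy dog ".repeat(1_000);
    let start = Instant::now();
    let tokens: Vec<TokenString> = input
        .split_ascii_whitespace()
        .map(|word| TokenString::try_from(word).unwrap())
        .collect();
    println!("tokenized {} words in {:?}", tokens.len(), start.elapsed());
}
```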
This uses cargo-make to avoid having to type long cargo command lines. The file ./Makefile.toml contains its configuration. But there is no need to use `cargo-make`; all commands can be entered manually too, after installing the needed programs.
The grouped list of `cargo-make`'s build targets and the cargo command lines they invoke:
- `build`: `cargo build` - Build the library with debug information.
- `doc`: `cargo doc --all-features` - Generate API documentation using rustdoc.
- `example`: `cargo run --example example` - Run the example binary, source: ./examples/main.rs.
- `clean`: `cargo clean` - Delete files generated by cargo.
- `lint`: `cargo clippy` - Run Clippy, the Rust linter, on the sources. Needs Clippy installed.

Tests:

- `test`: `cargo nextest run --all-features` - Run all tests except for doc-tests. Needs cargo-nextest.
- `doc-test`: `cargo test --doc` - Run all doc-tests.
- `cov`: `cargo llvm-cov nextest --all-features --lcov --output-path lcov.info` - Run tests with coverage information, generating LCOV output at ./lcov.info. Needs cargo-llvm-cov.
- `mutation`: `cargo mutants --test-tool=nextest -- --all-features` - Run mutation tests using cargo-mutants.
- `mutation-iterate`: `cargo mutants --test-tool=nextest --iterate -- --all-features` - Run mutation tests with cargo-mutants, using the data from previous runs to not rerun tests that already passed.
- `miri`: `PROPTEST_DISABLE_FAILURE_PERSISTENCE=true MIRIFLAGS='-Zmiri-env-forward=PROPTEST_DISABLE_FAILURE_PERSISTENCE -Zmiri-backtrace=full -Zmiri-tree-borrows' cargo miri nextest run -j 10` - Run Miri on all tests. Needs Miri installed. Warning: this takes more than 10 hours if the test `concat_many` in ./tests/test_builder.rs is run, compared to about 1 hour without that test.
- `miri-example`: `MIRIFLAGS='-Zmiri-backtrace=full -Zmiri-tree-borrows' cargo miri run --example example` - Run Miri on the example executable. Needs Miri installed.

Neumann, Freitag: Umbra: A Disk-Based System with In-Memory Performance mentions comparable small strings in chapter 3.1 String Handling on page 5.
The main difference to the variant I've implemented is that I use only 2 bytes (16 bits) for the string length instead of the 4 bytes (32 bits) in the paper. So the maximum string length of `TokenString`s is 65,535 bytes, compared to 4 GB in the paper. But we gain 2 bytes of length for the short strings contained in the struct itself: 14 bytes vs. 12 bytes in the paper's version.
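The guaranteed size and alignment from above can be checked directly (a plain sanity check, assuming only the documented 16-byte size and 8-byte alignment):

```rust
use std::mem::{align_of, size_of};

use token_string::TokenString;

fn main() {
    // 2 bytes length + 6 bytes prefix + 8 bytes inline data or pointer.
    assert_eq!(size_of::<TokenString>(), 16);
    assert_eq!(align_of::<TokenString>(), 8);
}
```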
The source in this repository is licensed under the Mozilla Public License version 2.0, see file ./LICENSE.