| Crates.io | compressed-intvec |
| lib.rs | compressed-intvec |
| version | 0.5.2 |
| created_at | 2025-02-18 17:02:34.041856+00 |
| updated_at | 2025-09-25 11:01:34.450243+00 |
| description | Space-efficient integer vectors for Rust. Offers a fixed-width implementation for O(1) mutable and atomic access, and a variable-width implementation that uses instantaneous codes and sampling for high compression ratios on non-uniform data. |
| homepage | |
| repository | https://github.com/lukefleed/compressed-intvec |
| max_upload_size | |
| id | 1560156 |
| size | 684,285 |
A Rust library that provides space-efficient, in-memory representations for integer vectors. It offers two complementary data structures: FixedVec, which uses fixed-width encoding for O(1) mutable and atomic access, and IntVec, which uses variable-length instantaneous codes for high-ratio compression and amortized O(1) random access.
The library is designed to reduce the memory footprint of standard std::vec::Vec collections of integers while retaining performant access patterns.
## FixedVec and IntVec

The library provides two distinct vector types, each based on a different encoding principle. Choosing the right one depends on your specific use case, performance requirements, and data characteristics.

### FixedVec: Fixed-Width Encoding

Implements a vector where every integer occupies the same, predetermined number of bits.

- O(1) random access; `FixedVec` can be 2-3x faster than `std::vec::Vec` for random access.
- Offers a mutable API similar to `std::vec::Vec` (e.g., `push`, `set`, `pop`).
- Includes `AtomicFixedVec`, a thread-safe variant that supports atomic read-modify-write operations for concurrent environments.

### IntVec: Variable-Width Encoding

Implements a vector using variable-length instantaneous codes (e.g., Gamma, Delta, Rice, Zeta) to represent each integer.

- Achieves high compression ratios on non-uniform data.
- Provides amortized O(1) random access through a sampling structure (controlled by the parameter `k`).

The following examples show some use cases for `FixedVec` and `IntVec`. All common types and traits are available through the prelude.
### FixedVec for Uniform Data and Mutable Access

```rust
use compressed_intvec::prelude::*;

// Data where values are within a relatively small, uniform range.
let data: &[u32] = &[10, 20, 30, 40, 50];

// The builder automatically infers the minimal bit width required.
let mut vec: UFixedVec<u32> = FixedVec::builder()
    .build(data)
    .unwrap();

assert_eq!(vec.get(2), Some(30));

// `FixedVec` supports in-place mutation.
vec.set(2, 35);
assert_eq!(vec.get(2), Some(35));
```
### IntVec for High-Ratio Compression

```rust
use compressed_intvec::prelude::*;

// Skewed data with a mix of small and large values.
let skewed_data: &[u64] = &[5, 8, 13, 1000, 7, 6, 10_000, 10, 2, 3];

// The builder can automatically select the best compression codec.
let intvec: LEIntVec = IntVec::builder()
    .codec(VariableCodecSpec::Auto)
    .build(skewed_data)
    .unwrap();

assert_eq!(intvec.len(), skewed_data.len());
assert_eq!(intvec.get(3), Some(1000));
```
## IntVec

IntVec is the optimal choice when data is not uniformly distributed and minimizing memory usage is the priority. It uses variable-length codes to represent integers.
The compression strategy is controlled by the VariableCodecSpec enum, passed to the builder.
For most use cases, the recommended strategy is VariableCodecSpec::Auto, which analyzes the data to select the most space-efficient codec. However, you can also specify a codec explicitly based on your data characteristics.
| `VariableCodecSpec` Variant | Description & Encoding Strategy | Optimal Data Distribution |
|---|---|---|
| `Auto` | **Recommended default.** Analyzes the data to choose the best variable-length code, balancing build time and compression ratio. | Agnostic; adapts to the input data. |
| `Gamma` (γ) | A universal, parameter-free code. Encodes n using the unary code of log₂(n+1), followed by the remaining bits of n+1. | Implied distribution is ≈ 1/(2x²). Optimal for data skewed towards small non-negative integers. |
| `Delta` (δ) | A universal, parameter-free code. Encodes n using the γ code of log₂(n+1), making it more efficient than γ for larger values. | Implied distribution is ≈ 1/(2x(log x)²). |
| `Rice` | A fast, tunable version of Golomb codes where the parameter b must be a power of two. Encodes n by splitting it into a quotient (stored in unary) and a remainder (stored in binary). | Geometric distributions. |
| `Golomb` | A tunable code, more general than Rice. Encodes n by splitting it into a quotient (stored in unary) and a remainder (stored using a minimal binary code). | Geometric distributions. Implied distribution is ≈ 1/rˣ. |
| `Zeta` (ζ) | A tunable code for power-law data. Encodes n based on log₂(n+1)/k in unary, followed by a minimal binary code for the remainder. | Power-law distributions (e.g., word frequencies, node degrees). Implied distribution is ≈ 1/x^(1+1/k). |
| `VByteLe`/`VByteBe` | A byte-aligned code that uses a continuation bit to store integers in a variable number of bytes. Fast to decode. The big-endian variant is lexicographical. | Implied distribution is ≈ 1/x^(8/7). Good for general-purpose integer data. |
| `Omega` (ω) | A universal, parameter-free code that recursively encodes the length of the binary representation of n. | Implied distribution is ≈ 1/x. Compact for very large numbers. |
| `Unary` | The simplest code. Encodes n as n zero-bits followed by a one-bit. | Geometric distributions with a very high probability of small values (e.g., boolean flags). |
| `Explicit` | An escape hatch to use any code from the `dsi-bitstream::codes::Codes` enum. | Advanced use cases requiring specific, unlisted codes. |
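To make the encoding strategy in the table concrete, here is a minimal, self-contained sketch of Elias gamma encoding as described above. `gamma_len` and `encode_gamma` are illustrative helpers, not part of the compressed-intvec API:

```rust
/// Bit length of the gamma code of `n`: 2*floor(log2(n+1)) + 1.
fn gamma_len(n: u64) -> u32 {
    let l = 64 - (n + 1).leading_zeros() - 1; // floor(log2(n+1))
    2 * l + 1
}

/// Encode `n` as a string of '0'/'1' bits: floor(log2(n+1)) zeros
/// (the unary length prefix) followed by the binary digits of n+1.
fn encode_gamma(n: u64) -> String {
    let m = n + 1;
    let l = 64 - m.leading_zeros() - 1;
    let mut s = "0".repeat(l as usize);
    for i in (0..=l).rev() {
        s.push(if (m >> i) & 1 == 1 { '1' } else { '0' });
    }
    s
}

fn main() {
    assert_eq!(encode_gamma(0), "1");     // n+1 = 1 -> "1"
    assert_eq!(encode_gamma(3), "00100"); // n+1 = 4 -> "00" + "100"
    // Length grows as roughly 2*log2(n), which is why Delta
    // (which gamma-codes the length) wins for larger values.
    assert_eq!(gamma_len(1000), 19);
}
```

This also shows why gamma is "optimal for data skewed towards small values": a value of 0 costs a single bit, while 1000 already costs 19.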
### VariableCodecSpec::Auto

The Auto strategy removes the guesswork from codec selection. During the build phase, it analyzes the input data and selects the codec that offers the best compression ratio. This introduces a non-negligible one-time cost at construction time. Use the Auto codec when you want to create an IntVec once and read it many times, as the amortized cost of the analysis is negligible compared to the space savings and the performance of subsequent reads.

If you need to create many IntVec instances at run-time, consider specifying a codec that matches your data distribution to avoid the overhead of repeated analysis.
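The sampling mechanism behind IntVec's amortized O(1) random access can be illustrated with a toy, self-contained sketch. It uses a unary code for simplicity, and all names (`ToyIntVec`, `build`, `get`) are hypothetical, not the crate's internals:

```rust
// Values are concatenated as variable-length codes into a bitstream, and the
// bit offset of every k-th element is stored. Random access jumps to the
// nearest sample and decodes at most k-1 codes forward.
struct ToyIntVec {
    bits: Vec<bool>,     // unary stream: n zeros then a one, per value
    samples: Vec<usize>, // bit offset of every k-th element
    k: usize,
}

impl ToyIntVec {
    fn build(data: &[u64], k: usize) -> Self {
        let mut bits = Vec::new();
        let mut samples = Vec::new();
        for (i, &v) in data.iter().enumerate() {
            if i % k == 0 {
                samples.push(bits.len());
            }
            bits.extend(std::iter::repeat(false).take(v as usize));
            bits.push(true);
        }
        Self { bits, samples, k }
    }

    /// Amortized O(1): seek to the nearest sample, skip at most k-1
    /// codes, then decode the requested value.
    fn get(&self, index: usize) -> u64 {
        let mut pos = self.samples[index / self.k];
        for _ in 0..(index % self.k) {
            while !self.bits[pos] { pos += 1; } // skip one unary code
            pos += 1;
        }
        let mut n = 0;
        while !self.bits[pos] { n += 1; pos += 1; }
        n
    }
}

fn main() {
    let data = [3, 0, 7, 2, 5, 1];
    let vec = ToyIntVec::build(&data, 2);
    for (i, &v) in data.iter().enumerate() {
        assert_eq!(vec.get(i), v);
    }
}
```

A smaller `k` means denser samples: faster random access at the cost of extra space, which is the trade-off the builder's `k(...)` parameter controls.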
## IntVec Access Patterns

The access strategy for a compressed IntVec has significant performance implications. The library provides several methods, each optimized for a specific access pattern. Using the appropriate method is key to achieving high throughput.
### get_many: Batch Access from a Slice

Retrieves a batch of elements from a slice of indices. This should be your preferred method for any batch lookup when all indices are known and can be stored in a slice.
```rust
use compressed_intvec::prelude::*;
use rand::Rng; // For random number generation

let data: Vec<u64> = (0..10_000).collect();
let intvec: LEIntVec = IntVec::builder()
    .codec(VariableCodecSpec::Delta)
    .k(32)
    .build(&data)
    .unwrap();

// Indices can be in any order.
let indices_to_get: Vec<usize> = (0..100).map(|_| rand::rng().random_range(0..10_000)).collect();

// `get_many` retrieves all values in one optimized pass.
let values = intvec.get_many(&indices_to_get).unwrap();
assert_eq!(values, indices_to_get.iter().map(|&i| data[i]).collect::<Vec<_>>());
```
### get_many_from_iter: Access from an Iterator

Retrieves elements from a streaming iterator of indices.

- Uses an `IntVecSeqReader` internally, which is optimized for streams with sequential locality.
- Use it when indices cannot be collected into a slice, for example due to memory constraints.
```rust
use compressed_intvec::prelude::*;

let data: Vec<u64> = (0..10_000).collect();
let intvec: LEIntVec = IntVec::from_slice(&data).unwrap();

// Process indices from a streaming source, such as a range iterator.
let values: Vec<u64> = intvec.get_many_from_iter(500..505).unwrap();
assert_eq!(values, vec![500, 501, 502, 503, 504]);
```
## IntVecReader and IntVecSeqReader

There are interactive scenarios where lookup indices are not known in advance. The library provides two reader types to handle such cases.

### IntVecReader: Stateless Random Access

A stateless reader for efficient, repeated random lookups.

- Each `get` operation performs an independent seek from the nearest sample point.
- Optimal for sparse, unpredictable access patterns where there is no locality between consecutive lookups.
```rust
use compressed_intvec::prelude::*;

let data: Vec<u64> = (0..10_000).collect();
let intvec: LEIntVec = IntVec::from_slice(&data).unwrap();

// Create a stateless reader for random access.
let mut reader = intvec.reader();
assert_eq!(reader.get(500).unwrap(), Some(500));
assert_eq!(reader.get(10).unwrap(), Some(10));
```
### IntVecSeqReader: Stateful Sequential Access

A stateful reader optimized for access patterns with sequential locality. It is optimal for iterating through sorted or clustered indices where consecutive lookups are near each other.
```rust
use compressed_intvec::prelude::*;

let data: Vec<u64> = (0..10_000).collect();
let intvec: LEIntVec = IntVec::from_slice(&data).unwrap();

// Create a stateful reader for sequential access.
let mut seq_reader = intvec.seq_reader();
assert_eq!(seq_reader.get(500).unwrap(), Some(500));

// This second call is faster as it decodes forward from index 500.
assert_eq!(seq_reader.get(505).unwrap(), Some(505));
```
## AtomicFixedVec

For concurrent applications, the library provides AtomicFixedVec, a thread-safe variant of FixedVec. It supports atomic read-modify-write (RMW) operations, enabling safe and efficient manipulation of shared integer data across multiple threads. The atomicity guarantees depend on the element's bit width and its alignment within the underlying u64 storage words.

- When an element fits within a single `u64` word (guaranteed for power-of-two bit widths), operations are performed using lock-free atomic instructions. Performance here is optimal as no locks are involved.
- When an element spans two `u64` words (common for non-power-of-two bit widths), operations are protected by a fine-grained mutex from a striped lock pool. This ensures atomicity for the two-word update without resorting to a global lock.

AtomicFixedVec is ideal for shared counters, parallel data processing, and any scenario requiring multiple threads to read from and write to the same integer vector concurrently. For write-heavy workloads, configuring the bit width to a power of two (e.g., 8, 16, 32) is recommended to ensure all operations remain on the lock-free path.
```rust
use compressed_intvec::prelude::*;
use std::sync::atomic::Ordering;
use std::sync::Arc;
use std::thread;

// A vector with a single counter, initialized to 0.
// A bit width of 17 is sufficient to hold the final count (80,000).
let vec = Arc::new(
    UAtomicFixedVec::<u32>::builder()
        .bit_width(BitWidth::Explicit(17))
        .build(&[0])
        .unwrap(),
);

const NUM_THREADS: u32 = 8;
const INCREMENTS_PER_THREAD: u32 = 10_000;

let mut handles = vec![];
for _ in 0..NUM_THREADS {
    let vec_clone = Arc::clone(&vec);
    handles.push(thread::spawn(move || {
        for _ in 0..INCREMENTS_PER_THREAD {
            // Atomically increment the counter.
            vec_clone.fetch_add(0, 1, Ordering::SeqCst);
        }
    }));
}

for handle in handles {
    handle.join().unwrap();
}

let final_value = vec.get(0).unwrap();
assert_eq!(final_value, NUM_THREADS * INCREMENTS_PER_THREAD);
```
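The lock-free vs. striped-lock distinction above comes down to whether an element's bits cross a `u64` word boundary. This standalone check (not crate code) shows why a power-of-two width such as 16 always stays on the lock-free path, while the width of 17 used in the example does not:

```rust
/// Returns true if element `i` at bit width `w` spans two u64 words,
/// assuming elements are packed back-to-back starting at bit i*w.
fn straddles(i: usize, w: usize) -> bool {
    let start = i * w;
    let end = start + w - 1; // inclusive index of the last bit
    start / 64 != end / 64
}

fn main() {
    // A power-of-two width divides 64 evenly: no element ever straddles.
    assert!((0..1000).all(|i| !straddles(i, 16)));

    // Width 17 straddles for some indices (e.g. element 3 occupies bits
    // 51..=67, crossing from word 0 into word 1), forcing the striped-mutex
    // fallback for those elements.
    assert!(straddles(3, 17));
    assert!((0..1000).any(|i| straddles(i, 17)));
}
```

The single-element vector in the counter example never straddles (its bits start at offset 0), so the width of 17 is harmless there; the power-of-two recommendation matters for longer vectors under write-heavy load.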
The library integrates mem-dbg to provide memory usage statistics, allowing you to compare the size of different encoding strategies easily. This is particularly useful for understanding the trade-offs between FixedVec and IntVec in terms of memory efficiency.
```rust
use compressed_intvec::prelude::*;
use mem_dbg::{DbgFlags, MemDbg};
use rand::{rngs::SmallRng, Rng, SeedableRng};

// Generates a vector with uniformly random values.
fn generate_random_vec(size: usize, max: u64) -> Vec<u64> {
    let mut rng = SmallRng::seed_from_u64(42);
    (0..size).map(|_| rng.random_range(0..max)).collect()
}

fn main() {
    let data = generate_random_vec(1_000_000, 1 << 20);

    println!("Size of the uncompressed Vec<u64>:");
    data.mem_dbg(DbgFlags::HUMANIZE | DbgFlags::PERCENTAGE | DbgFlags::RUST_LAYOUT);

    // Create an IntVec with Gamma encoding.
    let gamma_intvec = LEIntVec::builder()
        .codec(VariableCodecSpec::Gamma)
        .build(&data)
        .unwrap();
    println!("\nSize of the IntVec with gamma encoding:");
    gamma_intvec.mem_dbg(DbgFlags::HUMANIZE | DbgFlags::PERCENTAGE | DbgFlags::RUST_LAYOUT);

    // Let the library analyze the data and choose the best codec.
    let auto_intvec = LEIntVec::builder()
        .codec(VariableCodecSpec::Auto)
        .build(&data)
        .unwrap();
    println!("\nSize of the IntVec with Auto encoding:");
    auto_intvec.mem_dbg(DbgFlags::HUMANIZE | DbgFlags::PERCENTAGE | DbgFlags::RUST_LAYOUT);
    println!("\nCodec selected by Auto: {:?}", auto_intvec.encoding());

    // Create a FixedVec with minimal bit width (20 bits for this data).
    let fixed_intvec = LEFixedVec::builder()
        .bit_width(BitWidth::Minimal)
        .build(&data)
        .unwrap();
    println!("\nSize of the FixedVec with minimal bit width:");
    fixed_intvec.mem_dbg(DbgFlags::HUMANIZE | DbgFlags::PERCENTAGE | DbgFlags::RUST_LAYOUT);
}
```
The output displays the memory breakdown for a 1,000,000-element vector of u64 integers, uniformly distributed between 0 and 2²⁰:

- `Vec<u64>`: 8.00 MB (8 bytes per element)
- `IntVec` with Gamma encoding: 4.73 MB (41% reduction)
- `IntVec` with Auto codec selection: 2.85 MB (65% reduction, selected Zeta with k=10)
- `FixedVec` with minimal bit width: 2.50 MB (69% reduction, 20 bits per element)

The memory analysis also shows the internal structure of each data type, including storage overhead for metadata, sampling structures, and encoding parameters.
```text
Size of the uncompressed Vec<u64>:
8.000 MB 100.00% ⏺

Size of the IntVec with gamma encoding:
4.727 MB 100.00% ⏺
4.625 MB  97.85% ├╴data
101.6 kB   2.15% ├╴samples
101.6 kB   2.15% │ ├╴bits
     8 B   0.00% │ ├╴bit_width
     8 B   0.00% │ ├╴mask
     8 B   0.00% │ ├╴len
     0 B   0.00% │ ╰╴_phantom
     8 B   0.00% ├╴k
     8 B   0.00% ├╴len
    16 B   0.00% ├╴encoding
     0 B   0.00% ╰╴_markers

Size of the IntVec with Auto encoding:
2.846 MB 100.00% ⏺
2.749 MB  96.57% ├╴data
97.72 kB   3.43% ├╴samples
97.70 kB   3.43% │ ├╴bits
     8 B   0.00% │ ├╴bit_width
     8 B   0.00% │ ├╴mask
     8 B   0.00% │ ├╴len
     0 B   0.00% │ ╰╴_phantom
     8 B   0.00% ├╴k
     8 B   0.00% ├╴len
    16 B   0.00% ├╴encoding
     0 B   0.00% ╰╴_markers

Codec selected by Auto: Zeta { k: 10 }

Size of the FixedVec with minimal bit width:
2.500 MB 100.00% ⏺
2.500 MB 100.00% ├╴bits
     8 B   0.00% ├╴bit_width
     8 B   0.00% ├╴mask
     8 B   0.00% ├╴len
     0 B   0.00% ╰╴_phantom
```
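The fixed-width figure above can be reproduced with a quick back-of-the-envelope calculation. `packed_bytes` is a hypothetical helper sketching the packing arithmetic, not the crate's allocator:

```rust
/// Bytes needed to pack `n_elems` values of `bit_width` bits each
/// into u64 words: ceil(n*w / 64) words, 8 bytes per word.
fn packed_bytes(n_elems: usize, bit_width: usize) -> usize {
    (n_elems * bit_width).div_ceil(64) * 8
}

fn main() {
    let n = 1_000_000;

    // Values in 0..2^20 need 20 bits each: 20 Mbit = 2.50 MB,
    // matching the `bits` line of the FixedVec output.
    assert_eq!(packed_bytes(n, 20), 2_500_000);

    // A plain Vec<u64> spends 8 bytes per element: 8.00 MB.
    assert_eq!(n * 8, 8_000_000);
}
```

The variable-width IntVec figures cannot be derived this simply, since each element's cost depends on its value and the chosen codec; the sampling structure adds the ~2-3% `samples` overhead visible in the output.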
The library includes benchmarks for both FixedVec and IntVec. It also compares performance against two other libraries' fixed-width vector implementations: sux::BitFieldVec and succinct::IntVector. These benchmarks are in the benches directory and can be run via:

```shell
cargo bench
```

The benchmarks measure the performance of random access, batch access, iterator access, and memory usage for various data distributions and vector sizes. For a visual representation of the random-access performance, run the following commands:

```shell
pip3 install -r python/requirements.txt
python3 python/plot.py --random-access
```

The resulting SVG files will be saved in the images directory.
## usize and isize

By default, variable::IntVec only supports integer types with a fixed size (e.g., u32, i64). This guarantees that compressed data is portable across different machine architectures (e.g., from a 64-bit server to a 32-bit embedded device).

The arch-dependent-storable feature flag enables Storable implementations for usize and isize. When activated, you can create an IntVec<usize> directly.

**Warning**: This feature breaks data portability. An IntVec<usize> created on a 64-bit system containing values larger than u32::MAX will cause a panic if deserialized or read on a 32-bit system. Only enable this feature if you can guarantee that your application and its data will only ever run on a single target architecture (e.g., x86_64).

Enable it in your Cargo.toml:

```toml
compressed-intvec = { version = "0.5.1", features = ["arch-dependent-storable"] }
```
## epsilon-serde