| Crates.io | jdb_xorf |
| lib.rs | jdb_xorf |
| version | 0.13.11 |
| created_at | 2026-01-09 04:27:56.495705+00 |
| updated_at | 2026-01-16 19:40:38.759391+00 |
| description | Ultra-fast Xor and Binary Fuse filters for Rust / 极致性能的 Rust Xor 与 Binary Fuse 过滤器 |
| homepage | https://github.com/js0-site/rust/tree/main/jdb_xorf |
| repository | https://github.com/js0-site/rust |
| max_upload_size | |
| id | 2031572 |
| size | 234,790 |
jdb_xorf is a high-performance Binary Fuse filter implementation for Rust. Compared to Bloom or Cuckoo filters, this probabilistic data structure offers faster query speeds and smaller memory footprint. Binary Fuse filters represent the state-of-the-art in static set membership testing.
Binary Fuse is the pinnacle of the Xor Filter family and is currently the most efficient static membership test structure known. It has significant advantages over other filters in all key metrics:
Bf8 requires only about 8.64 bits/entry to achieve a false positive rate of about 0.39% (space overhead is only 1.08x the theoretical lower limit).The construction algorithm for Binary Fuse filters (Bf8, Bf16, Bf32) has a strict prerequisite: the input data structure must not contain duplicate keys. If the input u64 hash values contain duplicates, the construction process will almost certainly fail. If you use the raw filter directly, you must remove duplicates yourself before construction.
It is recommended to use the Bf wrapper to handle arbitrary types (such as String, &[u8], etc.). Bf internally automatically handles all hash calculation, sorting, and deduplication work to ensure construction success rate. You only need to pass in the data, and leave the rest to it.
use jdb_xorf::{Filter, Bf8};
let keys = vec![1u64, 2, 3];
let filter = Bf8::from(&keys);
assert!(filter.has(&1));
assert!(!filter.has(&4));
Bf supports borrow-based querying similar to HashMap. For example, after referencing a String filter, you can query it directly using &str.
use jdb_xorf::{Filter, Bf, Bf8};
let fruits = vec!["apple".to_string(), "banana".to_string()];
// Bf automatically handles hashing and deduplication.
// Uses RapidHasher by default for extremely high performance.
let filter: Bf<String, Bf8> = Bf::from(&fruits);
// Query with &str directly on a String filter
assert!(filter.has("apple"));
Bf<str>If you want the filter to semantically represent a "Filter of strings" (rather than a "Filter of string references") at the type level, you can use .into() to convert Bf<&str> to Bf<str>. This is particularly useful for Unsized Types.
use jdb_xorf::{Bf, Bf8, Filter};
let keys = vec!["apple", "banana"];
// 1. Build a normal Bf<&str> first
let temp_filter: Bf<&str, Bf8> = Bf::from(&keys);
// 2. Convert into Bf<str> (Unsized)
let str_filter: Bf<str, Bf8> = temp_filter.into();
// 3. Query
assert!(str_filter.has("apple"));
Bf<[u8]>Similarly, you can create filters for byte slices.
use jdb_xorf::{Bf, Bf8, Filter};
let data: Vec<&[u8]> = vec![b"hello", b"world"];
let temp_filter: Bf<&[u8], Bf8> = Bf::from(&data);
// Convert into Bf<[u8]>
let bytes_filter: Bf<[u8], Bf8> = temp_filter.into();
assert!(bytes_filter.has(b"hello".as_slice()));
After enabling the bitcode feature, you can use bitcode::encode / bitcode::decode for serialization directly.
use jdb_xorf::{Bf, Bf8};
// 1. Serialize (Encode)
let keys = vec!["apple", "banana"];
// No need to convert to String, use &str directly
let filter: Bf<&str, Bf8> = Bf::from(&keys);
// Use bitcode library function directly
let bytes = bitcode::encode(&filter);
// 2. Deserialize (Decode)
// Can fully restore type information, including generic parameters
let loaded: Bf<&str, Bf8> = bitcode::decode(&bytes).expect("Decode failed");
assert!(loaded.has("apple"));
Thanks to bitcode support, you can easily embed Bf into your own structs, such as an SSTable structure in a database.
use bitcode::{Decode, Encode};
use jdb_xorf::{Bf, Bf8};
#[derive(Encode, Decode)]
pub struct Sst {
// Filter for binary data
pub xorf: Bf<[u8], Bf8>,
}
let keys: Vec<&[u8]> = vec![b"key1", b"key2"];
let filter: Bf<&[u8], Bf8> = Bf::from(&keys);
let sst = Sst {
xorf: filter.into(),
};
// Serialize the entire struct
let bytes = bitcode::encode(&sst);
Bf adapter, supports non-u64 types and automatic deduplication.no_std, suitable for embedded environments.bitcode serialization.graph TD
Start["Start Construction"] --> Init["Calculate Parameters: seg_len, capacity"]
Init --> SeedIter["Try Next Seed"]
SeedIter --> Mapping["Map Key: Calculate 3 slots h0, h1, h2"]
Mapping --> Bucketing["Update Bucket State: t2count++ / t2hash XOR= hash"]
Bucketing --> FindAlone["Scan Buckets: Find alone buckets with count == 1"]
FindAlone --> Queue["Add to alone queue"]
Queue --> PeelLoop{"Is Queue Empty?"}
PeelLoop -- "No" --> Pop["Pop bucket index, Push key to reverse_order stack"]
Pop --> Update["Update 2 adjacent buckets: Decrease count and XOR hash sum"]
Update --> NewAlone{"New alone bucket produced?"}
NewAlone -- "Yes" --> Queue
NewAlone -- "No" --> PeelLoop
PeelLoop -- "Yes" --> Success{"All keys processed?"}
Success -- "No" --> SeedIter
Success -- "Yes" --> Done["Enter Solver Phase"]
graph TD
SStart["Start Solver"] --> SInit["Initialize fingerprints array"]
SInit --> PopStack["Pop key and slot info from reverse_order stack top"]
PopStack --> ReadOther["Read other 2 determined or initial fingerprints"]
ReadOther --> Assign["Calculate current fingerprint: fp = target_f XOR fp_other1 XOR fp_other2"]
Assign --> Next{"Is Stack Empty?"}
Next -- "No" --> PopStack
Next -- "Yes" --> SDone["BinaryFuse Construction Successful"]
graph TD
QKey["Input Query Key"] --> QHash["mix64 Hash Mixing"]
QHash --> QSlots["Determine 3 slots: h0, h1, h2"]
QSlots --> QRead["Atomic Read: fp0, fp1, fp2"]
QRead --> QXor["XOR Operation: res = fp0 XOR fp1 XOR fp2"]
QXor --> QMatch{"res == (hash as Fingerprint)?"}
QMatch -- "Yes" --> QPres["Probably Present"]
QMatch -- "No" --> QNot["Definitely Not Present"]
rapidhash crate), high-quality mixing function mix64.src/: Core implementation.
base/: Generic Binary Fuse algorithm implementation (construction, query, tools).bf.rs: Generic wrapper Bf implementation (arbitrary types & deduplication).hash.rs: Hasher and high-quality mixing function implementation.benches/: Performance benchmark suite.analysis/: Uniformity and zero-distribution analysis tools.Filter<T>: Core trait for membership testing.
has(&self, key: &T) -> boollen(&self) -> usizeBf8, Bf16, Bf32: Managed memory filters (Aliases to Base<u8>, Base<u16>, Base<u32>).Bf<T, F, H = RapidHasher>: Generic wrapper, uses hasher H and filter F to handle arbitrary type T, with automatic deduplication.| Filter | Memory Footprint | Query Speed | Construction Speed | Cache Friendliness | Scenario |
|---|---|---|---|---|---|
| Binary Fuse | Very Low (≈1.08x theoretical limit) | Very Fast (3 accesses) | Very Fast (partition optimized) | Excellent | Best choice for static massive data |
| Xor Filter | Low | Fast | Slow | Poor | Previous generation solution |
| Bloom Filter | Medium | Slow (multiple hashes) | Fast | Poor | Dynamic data/Simple scenarios |
| Cuckoo Filter | Low | Medium (random probing) | Slow | Poor | Deletion support needed |
The theoretical probability of Binary Fuse filter construction failure is extremely low. This library automatically retries 1000 times during construction (using a different random seed each time).
According to Mueller & Lemire's paper [2], the lower bound for single construction success rate is 90%. Based on measured data from this library (1000 rounds of testing on 100,000 random keys):
This means the failure rate for a single construction is only .
Therefore, the probability of failure for 1000 consecutive constructions is:
is a value that can be considered absolutely 0.
It is much smaller than the reciprocal of the total number of atoms in the universe, and hundreds of orders of magnitude lower than the probability of uncorrectable errors in modern computer hardware (about
).
Google's research on large-scale data centers shows that about 1.3% of machines experience at least one Uncorrectable Error per year. Assuming a construction process takes 0.1 seconds, the probability of a hardware error occurring during this period is on the order of .
| Event Type | Approximate Probability | Risk Qualification |
|---|---|---|
| Hardware Bit Flip During Construction | Real existent extremely low risk | |
| Binary Fuse Construction Failure | Physically "Absolutely Impossible" |
Therefore, the library's design principle is: Treat construction failure as an unrecoverable fatal error (panic), rather than a runtime error (Result/TryFrom).
If you encounter a panic during construction, it is more likely due to:
For ease of use, we prioritize using the From trait because it matches the psychological expectation in most scenarios: the construction process is always successful.
The technical evolution of probabilistic filters began with Bloom filters (1970) and went through improvements with Cuckoo filters (2014). The emergence of Xor filters in 2020 brought a paradigm shift, achieving better performance through perfect XOR summation. In 2022, Mueller and Lemire further proposed Binary Fuse filters, which approached theoretical limits in space and time efficiency through graph partitioning technology.
This project is an open-source component of js0.site ⋅ Refactoring the Internet Plan.
We are redefining the development paradigm of the Internet in a componentized way. Welcome to follow us:
| Library | Filter | Bf Ops | Query Ops | Memory | Speedup |
|---|---|---|---|---|---|
| jdb | Bf8 | 4959.87 | 86727.74 | 116.04 KB | 1.65x |
| xorf | BinaryFuse8 | 3971.65 | 52493.29 | 116.04 KB | - |
| jdb | Bf16 | 4809.60 | 76974.42 | 232.04 KB | 1.56x |
| xorf | BinaryFuse16 | 4015.96 | 49222.88 | 232.04 KB | - |
| jdb | Bf32 | 4676.60 | 67297.71 | 464.04 KB | 1.39x |
| xorf | BinaryFuse32 | 4016.81 | 48393.79 | 464.04 KB | - |
| Library | Filter | False Positive Rate | False Negative Rate |
|---|---|---|---|
| jdb | Bf8 | 0.39196% | 0 |
| xorf | BinaryFuse8 | 0.38768% | 0 |
| jdb | Bf16 | 0.00132% | 0 |
| xorf | BinaryFuse16 | 0.00165% | 0 |
| jdb | Bf32 | 0.00000% | 0 |
| xorf | BinaryFuse32 | 0.00000% | 0 |
This project is an open-source component of js0.site ⋅ Refactoring the Internet Plan.
We are redefining the development paradigm of the Internet in a componentized way. Welcome to follow us:
jdb_xorf 是针对 Rust 开发的高性能 Binary Fuse 过滤器实现。此类概率型数据结构相较于 Bloom 或 Cuckoo 过滤器,具备更快的查询速度与更小的内存占用。Binary Fuse 过滤器代表了目前静态集合成员检测技术的最高水平。
Binary Fuse 是 Xor Filter 系列的巅峰之作,也是目前已知最高效的静态成员检测结构。相比于其他过滤器,它在所有关键指标上都具有显著优势:
Bf8 仅需约 8.64 bits/entry 即可达到约 0.39% 的误报率(空间开销仅为理论下限的 1.08x)。Binary Fuse 过滤器 (Bf8, Bf16, Bf32) 的构建算法有一个严格的前提条件:输入的数据结构中不得包含重复的键 (duplicate keys)。如果输入的 u64 哈希值存在重复,构建过程几乎肯定会失败。如果您直接使用原始过滤器,必须在构建前自行去除重复项。
推荐使用 Bf 包装器来处理任意类型(如 String, &[u8] 等)。Bf 会在内部自动处理所有的哈希计算、排序和去重工作,确保构建成功率。您只需传入数据,剩下的交给它处理。
use jdb_xorf::{Filter, Bf8};
let keys = vec![1u64, 2, 3];
let filter = Bf8::from(&keys);
assert!(filter.has(&1));
assert!(!filter.has(&4));
Bf 支持类似 HashMap 的借用查询。例如,构建 String 过滤器后,可以直接使用 &str 查询。
use jdb_xorf::{Filter, Bf, Bf8};
let fruits = vec!["apple".to_string(), "banana".to_string()];
// Bf 会自动处理哈希和去重。
// 默认使用 RapidHasher 以获得极高性能。
let filter: Bf<String, Bf8> = Bf::from(&fruits);
// 直接使用 &str 查询 String 类型的过滤器
assert!(filter.has("apple"));
Bf<str>如果你希望过滤器在类型层面上就代表 "字符串过滤器" (而非 "字符串引用过滤器"),可以使用 into() 方法将 Bf<&str> 转换为 Bf<str>。
use jdb_xorf::{Bf, Bf8, Filter};
let keys = vec!["apple", "banana"];
// 1. 先构建常规的 Bf<&str>
let temp_filter: Bf<&str, Bf8> = Bf::from(&keys);
// 2. 转换为 Bf<str> (Unsized)
let str_filter: Bf<str, Bf8> = temp_filter.into();
// 3. 查询
assert!(str_filter.has("apple"));
Bf<[u8]>同样地,你可以创建针对字节切片的过滤器。
use jdb_xorf::{Bf, Bf8, Filter};
let data: Vec<&[u8]> = vec![b"hello", b"world"];
let temp_filter: Bf<&[u8], Bf8> = Bf::from(&data);
// 转换为 Bf<[u8]>
let bytes_filter: Bf<[u8], Bf8> = temp_filter.into();
assert!(bytes_filter.has(b"hello".as_slice()));
开启 bitcode 特性后,可直接使用 bitcode::encode / bitcode::decode 进行序列化。
use jdb_xorf::{Bf, Bf8};
// 1. 序列化 (Encode)
let keys = vec!["apple", "banana"];
// 无需转换为 String,直接使用 &str
let filter: Bf<&str, Bf8> = Bf::from(&keys);
// 直接使用 bitcode 库函数
let bytes = bitcode::encode(&filter);
// 2. 反序列化 (Decode)
// 能够完全还原类型信息,包括泛型参数
let loaded: Bf<&str, Bf8> = bitcode::decode(&bytes).expect("Decode failed");
assert!(loaded.has("apple"));
得益于 bitcode 支持,你可以轻松地将 Bf 嵌入到自己的结构体中,甚至是数据库的 SSTable 结构。
use bitcode::{Decode, Encode};
use jdb_xorf::{Bf, Bf8};
#[derive(Encode, Decode)]
pub struct Sst {
// 针对二进制数据的过滤器
pub xorf: Bf<[u8], Bf8>,
}
let keys: Vec<&[u8]> = vec![b"key1", b"key2"];
let filter: Bf<&[u8], Bf8> = Bf::from(&keys);
let sst = Sst {
xorf: filter.into(),
};
// 序列化整个结构体
let bytes = bitcode::encode(&sst);
Bf 适配器,支持非 u64 类型并自动去重。no_std,适用于嵌入式环境。bitcode 或 DMA 零拷贝加载。graph TD
Start["开始构造"] --> Init["计算参数: seg_len, capacity"]
Init --> SeedIter["尝试下一种子 (Seed)"]
SeedIter --> Mapping["映射键: 计算 3 个槽位 h0, h1, h2"]
Mapping --> Bucketing["更新桶状态: t2count++ / t2hash XOR= hash"]
Bucketing --> FindAlone["扫描桶: 寻找 count == 1 的孤立桶"]
FindAlone --> Queue["加入 alone 队列"]
Queue --> PeelLoop{"队列是否为空?"}
PeelLoop -- "否" --> Pop["弹出桶索引, 将键压入 reverse_order 栈"]
Pop --> Update["更新相邻 2 个桶: 减少计数并异或哈希和"]
Update --> NewAlone{"产生新孤立桶?"}
NewAlone -- "是" --> Queue
NewAlone -- "否" --> PeelLoop
PeelLoop -- "是" --> Success{"是否处理了所有键?"}
Success -- "否" --> SeedIter
Success -- "是" --> Done["进入求解阶段"]
graph TD
SStart["开始求解"] --> SInit["初始化指纹数组 fingerprints"]
SInit --> PopStack["从 reverse_order 栈顶弹出键与槽位信息"]
PopStack --> ReadOther["读取另外 2 个已确定或初始的指纹"]
ReadOther --> Assign["计算当前指纹: fp = target_f XOR fp_other1 XOR fp_other2"]
Assign --> Next{"栈是否为空?"}
Next -- "否" --> PopStack
Next -- "是" --> SDone["BinaryFuse 构建成功"]
graph TD
QKey["输入查询键"] --> QHash["mix64 哈希混淆"]
QHash --> QSlots["确定 3 个槽位: h0, h1, h2"]
QSlots --> QRead["原子读取: fp0, fp1, fp2"]
QRead --> QXor["异或运算: res = fp0 XOR fp1 XOR fp2"]
QXor --> QMatch{"res == (hash as Fingerprint)?"}
QMatch -- "是" --> QPres["可能存在 (Probably)"]
QMatch -- "否" --> QNot["绝对不存在 (Definitely)"]
rapidhash crate), 高质量混淆函数 mix64。src/: 核心实现。
base/: 泛型 Binary Fuse 算法实现 (构建、查询、工具)。bf.rs: 通用包装器 Bf 实现 (支持任意类型与去重)。hash.rs: 哈希器与高质量混淆函数 implementation。benches/: 性能基准测试集。analysis/: 均匀性与零分布分析工具。Filter<T>: 成员检测核心 trait。
has(&self, key: &T) -> boollen(&self) -> usizeBf8, Bf16, Bf32: 托管内存的过滤器 (别名 Base<u8>, Base<u16>, Base<u32>)。Bf<T, F, H = RapidHasher>: 通用包装器,使用哈希器 H 与过滤器 F 处理任意类型 T,具备自动去重功能。| 过滤器 | 内存占用 | 查询速度 | 构建速度 | 缓存友好度 | 场景 |
|---|---|---|---|---|---|
| Binary Fuse | 极低 (≈1.08x 理论下限) | 极快 (3 次访问) | 极快 (分区优化) | 优秀 | 静态海量数据最佳选择 |
| Xor Filter | 低 | 快 | 慢 | 差 | 旧一代方案 |
| Bloom Filter | 中 | 慢 (多次哈希) | 快 | 差 | 动态数据/简单场景 |
| Cuckoo Filter | 低 | 中 (随机探测) | 慢 | 差 | 需要支持删除 |
Binary Fuse 过滤器构建失败的理论概率极低。本库在构建时会自动进行 1000 次重试(每次使用不同的随机种子)。
根据 Mueller & Lemire 的论文 [2],单次构建的成功率下界为 90%。而基于本库的实测数据(针对 100,000 个随机键进行 1000 轮测试):
这意味着单次构建的失败率仅为 。
因此,连续 1000 次构建均失败的概率为:
是一个完全可以视为 0 的数值。
它比宇宙中的原子总数倒数还要小得多,更比现代计算机硬件发生不可纠正错误的概率(约为
) 低几百个数量级。
Google 在大规模数据中心的研究表明,每年约有 1.3% 的机器经历过至少一次不可纠正的内存错误 (Uncorrectable Error)。假设一次构建过程耗时 0.1 秒,那么在此期间发生硬件错误的概率约为 量级。
| 事件类型 | 近似概率 | 风险定性 |
|---|---|---|
| 构建期间硬件位翻转 | 真实存在的极低风险 | |
| Binary Fuse 构建失败 | 物理上的“绝对不可能” |
因此,库的设计原则是:将构建失败视为不可恢复的致命错误(panic),而非运行时错误(Result/TryFrom)。
如果您在构建过程中遇到了 panic,更有可能是因为:
出于易用性考虑,我们优先使用 From trait,因为它符合绝大多数场景下的心理预期:构建过程总是成功的。
概率过滤器的技术演进从 Bloom 过滤器 (1970) 开始,历经 Cuckoo 过滤器 (2014) 的改进。2020 年 Xor 过滤器的出现带来了范式转移,通过完美的 XOR 求和实现更优性能。2022 年,Mueller 与 Lemire 进一步提出 Binary Fuse 过滤器,通过图分区技术使其在空间和时间效率上逼近了理论极限。
本项目是 js0.site ⋅ 重构互联网计划 的开源组件。
我们正在以组件化的方式重新定义互联网的开发范式。欢迎关注我们:
| 库 | 过滤器 | 构建(万ops/s) | 查询(万ops/s) | 内存占用 | 对比 |
|---|---|---|---|---|---|
| jdb | Bf8 | 4959.87 | 86727.74 | 116.04 KB | 1.65x |
| xorf | BinaryFuse8 | 3971.65 | 52493.29 | 116.04 KB | - |
| jdb | Bf16 | 4809.60 | 76974.42 | 232.04 KB | 1.56x |
| xorf | BinaryFuse16 | 4015.96 | 49222.88 | 232.04 KB | - |
| jdb | Bf32 | 4676.60 | 67297.71 | 464.04 KB | 1.39x |
| xorf | BinaryFuse32 | 4016.81 | 48393.79 | 464.04 KB | - |
| 库 | 过滤器 | 假阳率 | 假阴率 |
|---|---|---|---|
| jdb | Bf8 | 0.39196% | 0 |
| xorf | BinaryFuse8 | 0.38768% | 0 |
| jdb | Bf16 | 0.00132% | 0 |
| xorf | BinaryFuse16 | 0.00165% | 0 |
| jdb | Bf32 | 0.00000% | 0 |
| xorf | BinaryFuse32 | 0.00000% | 0 |
本项目为 js0.site ⋅ 重构互联网计划 的开源组件。
我们正在以组件化的方式重新定义互联网的开发范式,欢迎关注: