Generic hash table implementation with focus on being minimally invasive on existing items to be indexed.

The key is stored arbitrarily in the referenced item. A custom match function `HT_MATCH` provides the necessary abstraction. Items are NOT allocated by the hash table. Removed items are replaced with a sentinel value (1) to preserve chaining. See the example implementations `hash_set.h`, `hash_item_table.h`, and `hash_test.c`.

The hash function can also be customized; see the default below. In all cases the key is assumed to be a char string that is not (necessarily) zero terminated; the length is given separately. Keys can therefore be arbitrary binary values of arbitrary length.

Instead of initializing the hash table, it may be zeroed. In that case the count defaults to 4 upon the first insert, meaning the table can hold up to 4 items before resizing, or fewer depending on the load factor. By zeroing memory, hash tables use no memory until actually used.

For increased portability we do not rely upon `stdint.h` outside the default hash function.

Build
-----

There are no special build requirements. `CMakeLists.txt` simply links the appropriate hash function with the test files, but CMake is not required, for example:

    cc load_test.c ptr_set.c cmetrohash64.c -O4 -DNDEBUG -o load_test

There are several significant flags that can be set; see `CMakeLists.txt`, `hash_test.c`, and `load_test.c`.

`initbuild.sh` is an easy way to create a CMake Ninja build for platforms that support it.

Usage
-----

The hash table is implemented in a generic form with a static (private) interface. The macro `HASH_TABLE_HEADER(name, item)` defines the public prototype for the specialized type, and `HASH_TABLE_API(name)` defines non-static wrapper functions to access the generic implementation. This avoids creating all the code as macros, which are painful to develop and debug. See `token_map.h` and `token_map.c`, which are used in `hash_test.c`.

If the datatype is only needed in one file, the implementation such as `token_map.c` can be included after defining `HT_PRIVATE`. This gives the compiler better optimization opportunities and hides the interface from other compilation units.

The basic datatype `hash_table_t` is a small struct that can be embedded anywhere and used as the instance of any hash table derived type.

Note on algorithmic choice
--------------------------

We use linear or quadratic probing hash tables because they allow for many small hash tables. We overallocate the hash table by a factor of 2 (default) but only store a single pointer per item. Probing does not by itself allow for dense tables, but because the hash table only stores a single pointer per bucket, we can afford a larger table.

Advanced hashing such as hopscotch can pack much more densely, but hopscotch needs to store a bitmask, thus already doubling the size of the table. Hopscotch is probably good, but it is more complex and depends on optimized bit scan instructions; furthermore, when the use case is many small tables such as symbol table scopes, cache locality is less relevant. Chained hashing with a 50% load factor is a good choice, but it requires intrusive links and cannot trivially hash string sets without extra allocation.

There is some evidence that linear probing may be faster than quadratic probing due to cache effects, as long as we do not pack too densely. However, traditional quadratic probing, (k + i * i) modulo prime, does not cover all buckets. We use (k + i * (i + 1) / 2) modulo a power of 2, which covers all buckets, so without experimentation it is unclear whether linear or quadratic probing is best.
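The coverage claim is easy to verify directly. The following standalone sketch (not part of the library) walks the triangular probe sequence over a small power-of-2 table and checks that every bucket is visited exactly once:

    #include <stdio.h>

    /* Standalone check: the probe sequence (h + i * (i + 1) / 2) mod 2^n
     * visits every bucket of a power-of-2 table exactly once. */
    int main(void)
    {
        unsigned n = 16;          /* table size, must be a power of 2 */
        unsigned h = 5;           /* arbitrary home bucket */
        char seen[16] = { 0 };
        unsigned i, b;

        for (i = 0; i < n; ++i) {
            b = (h + i * (i + 1) / 2) & (n - 1);
            if (seen[b]) {
                printf("bucket %u hit twice at step %u\n", b, i);
                return 1;
            }
            seen[b] = 1;
        }
        printf("all %u buckets visited exactly once\n", n);
        return 0;
    }

Changing the step to `i * i` makes the check fail, since plain quadratic probing only reaches a subset of the buckets of a power-of-2 table.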
The use of open addressing leads to more key comparisons than chained hashing. The fact that we store the keys indirectly in the stored item is also not ideal, except when the item is also directly the key. If we use larger hash tables from the saved space, we suspect this will still perform well, also considering external factors such as not having to allocate and copy a key from e.g. a text buffer being parsed.

It is generally understood that linear probing degrades significantly with a load factor above 0.7. In this light, it is interesting to note that Emmanuel Goossaert tested hopscotch hashing and found that bucket swaps only take place in significant numbers above a load factor of 0.7. A commenter on Goossaert's blog also found that neighbourhoods rarely exceed 64 even when allowed to grow on demand. Without deep analysis, it would appear that linear probing and hopscotch are pretty similar at a load factor of 0.5, especially if tombstones are not present. Because hopscotch requires extra data (e.g. the hash key, a bitmap, or a linked list), this confirms our intuition that lower load factors and smaller buckets are a better fit here than more advanced algorithms. Furthermore, hopscotch insertion degrades badly when it needs to search for empty buckets at high load factors. Of course, for on-disk storage it is a different matter, and this is why Goossaert is interested in caching hash keys in buckets.

Robin Hood hashing is mostly interesting when there are many deletions to clean up and when the load factor increases. In our implementation we try to keep the per-bucket size small: a pointer and an 8-bit offset, or just a pointer for the linear and quadratic probing implementations. This makes it affordable with a lower load factor. This Robin Hood variation stores the offset from the hashed bucket to where the first entry is stored. This means we can avoid sampling any bucket not indexed by the current hash key, and it also means that we avoid having to store or calculate the hash key when updating.

A sorted Robin Hood hashing implementation was also made, but it proved to be error prone with many special cases and slower than regular Robin Hood hashing. It would conceivably protect against hash collision attacks through exponential search, but insertions and deletions would still need to move memory in linear time, making this point moot. Therefore the sorted Robin Hood variant has been removed.

Source file references
----------------------

As of July 2015, for 64-bit hashes, the C port of the 64-bit metro hash is a good trade-off between speed and simplicity. For a 32-bit C hash function, the ported MurmurHash3 is safe and easy to use in this environment, but xxHash32 may also be worth considering.
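To illustrate the key convention used throughout (a pointer plus a separate length, so keys may be binary and need not be zero terminated), here is a minimal 64-bit FNV-1a hash in that shape. This is only a sketch: it is not the library's default (the defaults are the metro hash and MurmurHash3 ports above), and the function name and signature are assumptions for the example, not the library's actual customization hook.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative 64-bit FNV-1a over a length-delimited binary key.
     * Not the library default; name and signature assumed for the example. */
    uint64_t example_hash(const void *key, size_t len)
    {
        const unsigned char *p = (const unsigned char *)key;
        uint64_t h = UINT64_C(0xcbf29ce484222325);  /* FNV offset basis */
        size_t i;

        for (i = 0; i < len; ++i) {
            h ^= p[i];
            h *= UINT64_C(0x100000001b3);           /* FNV prime */
        }
        return h;
    }

For example, `example_hash("key", 3)` hashes the three bytes of `"key"` without relying on zero termination.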