Considerations
- Scheme: How does the hash table handle collisions? Any optimizations to the chosen strategy?
- Growth Rate: When the hash table resizes, how much does it grow by?
- Load Factor: How full (as a fraction of
num_keys/slots
) does the hash table need to be before resizing is triggered?
Optimization
Size of bucket
The size of the bucket will be power of 2, since it guarantee to be all 1 in binary. What we only need to do is a Bitwise AND operation and that’s the module trick.
E.g. 16 is a power of 2. If you subtract 1 from it, you get 15, whose binary representation is 1111. Now, if you do a bitwise AND of any number with 1111, you’re going to get the last 4 bits of the number which, in other words, is equivalent to the modulo of the number by 16 (Division operation is usually an expensive operation. Hence, bitwise operation is usually preferred over division). These last 4 bits will evaluate to any number from 0 to 15 which are the indexes of your underlying array.
Hashing Function
Use prime number as mod to make the distribution even.
Hashing Scheme
Static Hashing
The size of the hash table is fixed, rebuild a larger hash table from scratch if it runs out of storage space.
Linear Probe
- Open Addressing
- Deletion with tombstone
- Can store metadata in a separate array (store empty slot/tombstone information in a packed bitmap as a part of the page header or in a separate hash table, which would help us avoid looking up deleted keys.)
Cuckoo Hashing
The essence of cuckoo hashing is that multiple hash functions map a key to different slots. In practice, cuckoo hashing is implemented with multiple hash functions that map a key to different slots in a single hash table.
When we insert, we check every table and choose one that has a free slot (if multiple have one, we can compare things like load factor, or more commonly, just choose a random table). If no table has a free slot, we choose (typically a random one) and evict the old entry. We then rehash the old entry into a different table. In rare cases, we may end up in a cycle. If this happens, we can rebuild all the hash tables with new hash function seeds (less common) or rebuild the hash tables using larger tables (more common).
Cuckoo hashing guarantees O (1) lookups and deletions, but insertions may be more expensive.
Dynamic Hashing Schemes
Dynamic hashing schemes are able to resize the hash table on demand without needing to rebuild the entire table.
Chained Hashing
- a linked list of buckets for each slot in the hash table.
- To look up an element, we hash to its bucket and then scan for it. Use Bloom Filter
Extendible Hashing
Linear Hashing
Instead of immediately splitting a bucket when it overflows, this scheme maintains a split pointer that keeps track of the next bucket to split.