Overview and Difference with Checksum
- A checksum (such as CRC32) is meant to detect accidental changes. If one byte changes, the checksum changes. A checksum is not safe against malicious changes: it is fairly easy to create a file with a particular checksum.
- A hash function maps data of arbitrary size to a (usually fixed-size) value. It is often used to speed up comparisons or to build a hash table. Not all hash functions are secure, and the hash does not necessarily change when the data changes.
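- A cryptographic hash function (such as SHA1 or SHA-256) is a checksum that is secure against malicious changes. It is very hard to create a file with a specific cryptographic hash. Note that practical collisions have been found for SHA1, so it is no longer recommended where collision resistance matters.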
- To make things more complicated, cryptographic hash functions are sometimes simply referred to as hash functions.
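To make the three cases above concrete, here is a minimal Python sketch. It assumes `zlib.crc32` as the checksum and `hashlib.sha256` as the cryptographic hash (chosen for illustration, the notes only name CRC32 and SHA1). `length_hash` is a hypothetical toy hash invented for this example to show a hash that does not change when the data changes.

```python
import hashlib
import zlib

original = b"transfer 100 to alice"
tampered = b"transfer 900 to alice"  # one byte differs

# Checksum: a 32-bit CRC. Good at catching accidental corruption,
# but not designed to resist deliberate forgery.
crc_original = zlib.crc32(original)
crc_tampered = zlib.crc32(tampered)

# Cryptographic hash: a 256-bit digest, printed as 64 hex characters.
sha_original = hashlib.sha256(original).hexdigest()
sha_tampered = hashlib.sha256(tampered).hexdigest()

def length_hash(data: bytes) -> int:
    # Hypothetical toy hash (assumption for illustration): a valid hash
    # function, but it ignores the content entirely.
    return len(data)

print(crc_original != crc_tampered)                       # checksum changed
print(sha_original != sha_tampered)                       # digest changed
print(length_hash(original) == length_hash(tampered))     # toy hash did not
```

The toy hash is still usable for bucketing in a hash table, which is why "hash function" alone says nothing about security.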
Hashing in DBMS
Split your large file up into smaller groups by hash values
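A rough sketch of this idea in Python, under some assumptions: records are `(key, value)` pairs held in memory (a real DBMS would stream them from disk and write each partition to its own file), and `partition_by_hash` is a hypothetical helper name. `zlib.crc32` is used as the hash because Python's built-in `hash()` for strings is randomized per process.

```python
import zlib
from collections import defaultdict

def partition_by_hash(records, num_partitions):
    """Group (key, value) records so equal keys land in the same partition."""
    partitions = defaultdict(list)
    for key, value in records:
        # Hash the key, then reduce it to a partition number with mod.
        bucket = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[bucket].append((key, value))
    return partitions

records = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
parts = partition_by_hash(records, num_partitions=3)

for bucket, rows in sorted(parts.items()):
    print(bucket, rows)
```

Because every record with the same key ends up in the same group, each group can then be processed independently (for example, joined or aggregated) without looking at the others.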
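We cannot control the output of the hash function. If the hash function is biased and tends to return multiples of n, then after taking the result mod n everything goes to the same bucket, which is not good. This is the usual argument for choosing a prime n: a prime shares no common factor with the hash values unless they happen to be multiples of that prime itself.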
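The effect is easy to reproduce with a deliberately bad hash. In the sketch below, `biased_hash` is a hypothetical function (an assumption for illustration) that only returns multiples of 8. With n = 8 every key collapses into one bucket, while with the prime n = 7 the same keys spread over all buckets.

```python
from collections import Counter

def biased_hash(x: int) -> int:
    # Hypothetical biased hash: every output is a multiple of 8.
    return 8 * x

def bucket_counts(num_buckets: int, keys) -> Counter:
    """Count how many keys fall into each bucket under hash mod num_buckets."""
    return Counter(biased_hash(k) % num_buckets for k in keys)

keys = range(1000)
composite = bucket_counts(8, keys)  # n = 8 divides every hash value
prime = bucket_counts(7, keys)      # n = 7 shares no factor with 8

print(len(composite), "bucket(s) used with n = 8")
print(len(prime), "bucket(s) used with n = 7")
```

A prime modulus does not fix a bad hash function in general; it only avoids the case where the bias and the modulus share a common factor.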
- language agnostic - Why should hash functions use a prime number modulus? - Stack Overflow
- data structures - Why is it best to use a prime number as a mod in a hashing function? - Computer Science Stack Exchange