Turbo's Thoughts

Fuzzy File Deduplication via Partial Content Hashing

When working on large datasets consisting of many thousands of files of various sizes, deduplication can greatly improve the speed of extraction pipelines. However, hashing all files in their entirety, even once, has a huge performance impact. If deduplication doesn’t have to be perfect, this algorithm can help reduce the overhead.

Algorithm

This is the essence of the algorithm I’ve used for this purpose:

\[f_\text{hash}= \left\{\begin{array}{lll} h\left(\left[\begin{array}{c} \text{read}(0,\, f_\text{size}) \end{array}\right], f_\text{size}\right), & \text{if } f_\text{size}\leq B & (1)\\[4pt] h\left(\left[\begin{array}{c} \text{read}(f_\text{size} - B,\, B) \end{array}\right], f_\text{size}\right), & \text{if } B < f_\text{size}\leq 2B & (2)\\[4pt] h\left(\left[\begin{array}{c} \text{read}(0,\, B) \\ \text{read}\!\left(\left\lfloor\frac{f_\text{size}}{2}\right\rfloor,\, B\right) \\ \text{read}(f_\text{size} - B,\, B) \end{array}\right], f_\text{size}\right), & \text{otherwise} & (3) \end{array}\right.\]

Figure 1: Variable Block Count Content Hash

where the notation \[\left[\begin{array}{c} \text{read}(a_\text{offset}, \text{length}) \\ \text{read}(b_\text{offset}, \text{length}) \\ \text{…} \end{array}\right]\] denotes the byte ranges (blocks) to be read. These are appended, in order, to a single buffer and then hashed by \(h\), with \(f_\text{size}\) passed as a salt (see Salting below).
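For a concrete sense of the savings (numbers chosen purely for illustration): with \(B = 64\,\text{KiB}\) and a 1 GiB file, case (3) applies and the ranges read are \(\text{read}(0, B)\), \(\text{read}(512\,\text{MiB}, B)\) and \(\text{read}(1\,\text{GiB} - B, B)\), so 192 KiB are read and hashed instead of the full gigabyte, roughly 0.02 % of the file.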

Blocks

First, a block size \(B\) needs to be set. This size directly affects how long hashing each file takes, since it determines the number of bytes in each block. At most three blocks are read from a file, so the worst-case number of bytes read from any file is \(3B\). Once \(B\) is set, a number of blocks is read from the file; where they are read from depends on the file's size. There are three cases to consider (see Figure 1 and the sketch after this list):

  1. If the file is smaller than or exactly as large as the block size, the entire file is hashed.
  2. If the file is at most twice the block size, a single block is read from the end of the file. The reasoning is that for many file types (e.g. HTML), two files may share the same header but will rarely share the same footer. If that doesn't hold for your dataset, move the block offset to wherever it minimizes deduplication error.
  3. In all other cases (the file is larger than two blocks), three blocks are read: the header, the footer and a block starting at half the file size. Notice that the middle and footer blocks can overlap while \(2B < f_\text{size} < 4B\). In my application, handling this special case did not reduce the error rate, and leaving it unhandled did not hurt performance. YMMV.
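
Here is the sketch referenced above, a minimal C# rendering of the case analysis; the helper name GetBlockRanges and its tuple return type are my own assumptions rather than anything from the original pipeline:

static (long Offset, int Length)[] GetBlockRanges(long fileSize, int blockSize)
{
  // Case (1): the whole file fits into a single block, so hash it all.
  if (fileSize <= blockSize)
    return new[] { ((long)0, (int)fileSize) };

  // Case (2): at most two blocks, so hash only the footer block.
  if (fileSize <= 2L * blockSize)
    return new[] { (fileSize - blockSize, blockSize) };

  // Case (3): header, middle and footer; the middle and footer ranges
  // can overlap while fileSize < 4 * blockSize.
  return new[]
  {
    (0L, blockSize),
    (fileSize / 2, blockSize),
    (fileSize - blockSize, blockSize),
  };
}

The helper only selects offsets and lengths; reading and hashing are shown in the driver sketch at the end of the post.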

Salting

If there are data points that can help differentiate two files which would otherwise end up with the same hash, they are passed to the hash function as “salts”: their bytes are hashed after the initial buffer constructed from reading the file blocks, as if they were part of that same buffer. In this example, \(f_\text{size}\) is passed as a salt, so two files of different sizes cannot end up with the same hash merely because their sampled blocks match.
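
Put differently (in my own notation, not the post's), the byte sequence fed to \(h\) is the concatenation \[ \text{block}_1 \,\|\, \text{block}_2 \,\|\, \dots \,\|\, \text{block}_n \,\|\, \text{bytes}(f_\text{size}) \] with any further salts appended in the same way.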

Hashing

In my case, I have chosen an FNV-1a variant, because its iterations consist of only two operations (an XOR and a multiply), both well suited for compiler optimization. For the small buffers hashed here, FNV has been orders of magnitude faster for me than self-proclaimed fast-hash algorithms like xxHash, but it is also more collision-prone.

static ulong HashBlocks(byte[] blockPool, int realBufferLength, ulong salt)
{
  // FNV-1a 64-bit offset basis and prime.
  const ulong HashInit = 0xcbf29ce484222325;
  const ulong HashStep = 0x100000001b3;

  var hashFnvVal = HashInit;

  // Mix in every byte that was actually read into the block buffer.
  for (var i = 0; i < realBufferLength; i++)
  {
    unchecked
    {
      hashFnvVal ^= blockPool[i];
      hashFnvVal *= HashStep;
    }
  }

  // Fold the salt (here: the file size) into the same running hash.
  foreach (var b in BitConverter.GetBytes(salt))
  {
    unchecked
    {
      hashFnvVal ^= b;
      hashFnvVal *= HashStep;
    }
  }

  return hashFnvVal;
}

The final output is a ulong value, which is convenient in that a single 64-bit integer can be stored and indexed efficiently in most data stores.
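
Putting the pieces together, a hypothetical driver could look like the sketch below. The names FuzzyHash and BlockSize, the 64 KiB block size and the use of a FileStream are assumptions of mine; GetBlockRanges is the sketch from the Blocks section, HashBlocks is the FNV-1a routine above, and using System.IO is assumed to be in scope.

const int BlockSize = 64 * 1024; // B; illustrative value, tune for your dataset

static ulong FuzzyHash(string path)
{
  var fileSize = new FileInfo(path).Length;
  var buffer = new byte[3 * BlockSize];   // worst case: three full blocks
  var written = 0;

  using var fs = File.OpenRead(path);
  foreach (var (offset, length) in GetBlockRanges(fileSize, BlockSize))
  {
    fs.Seek(offset, SeekOrigin.Begin);
    written += fs.Read(buffer, written, length);
  }

  return HashBlocks(buffer, written, salt: (ulong)fileSize);
}

Deduplication then reduces to a set-membership test: keep a HashSet<ulong> of hashes already seen and treat a file as a duplicate whenever seen.Add(FuzzyHash(path)) returns false.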

Error Rate Comparison

To be done.

Licenses (unless indicated otherwise) are CC-BY-NC-4.0 for prose, Apache-2.0 for code.