Monday, March 2, 2015

Hashing - Summarization at its Extreme

Summarization is often the most boring part of a language course, as it basically requires you to sift through dozens, or maybe hundreds or even thousands, of pages to come up with a concise representation of the whole thing, often constrained by the number of words.

Interestingly, an analogous concept, hashing, has become a cornerstone of modern secure information systems and an essential ingredient of data integrity, and hence of security itself.

Hashing is essentially an attempt to summarize an arbitrarily large amount of data into a fixed-length sequence of bits. While human summarization strives to preserve essential information, hashing strives to preserve the uniqueness of data.

The fundamental property of a hash (i.e. result of hashing) is:

For two identical pieces of data, their hashes under a given hashing technique are identical.

The "hashing technique" usually boils down to a hash function. Any function that accepts arbitrarily long content as input and produces a fixed-length output while adhering to the above rule can be considered a hash function.

For example, a function f, which adds up the alphabet positions of the characters in its input text (a = 1, b = 2, and so on) and takes the sum modulo 1000 (so that the output is bounded to the integers from 0 to 999), is a simple hash function:

f(a) = 1
f(b) = 2
f(aa) = 1 + 1 = 2
f(ab) = 1 + 2 = 3
f(cryptography) = 3 + 18 + 25 + 16 + 20 + 15 + 7 + 
                  18 + 1 + 16 + 8 + 25 = 172
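The toy function above can be sketched in a few lines of Python (the name f and the helper logic are illustrative, not a standard library function):

```python
# A sketch of the toy hash function f described above: sum the
# alphabet positions of the characters (a = 1, b = 2, ...) and
# take the result modulo 1000 so the output stays in 0..999.
def f(text):
    return sum(ord(c) - ord('a') + 1 for c in text.lower()) % 1000

print(f("a"))             # 1
print(f("ab"))            # 3
print(f("cryptography"))  # 172
```

Note that f("aa") and f("b") both evaluate to 2, the collision discussed next.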

You may have noticed that aa and b result in the same value under f; this is not a problem, because the contract of a hash function never rules out two different inputs mapping to the same hash value (a collision).

Still, a hash function is highly encouraged to produce unique hashes for distinct inputs as far as possible, to minimize the risk of false positives (two unequal inputs being judged equal based on their hash values). Although perfect uniqueness is theoretically impossible (an infinite input space cannot be mapped one-to-one into a finite output space), well-designed hash functions make collisions vanishingly unlikely in practice.
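As a point of comparison with the toy function, a production-grade hash such as SHA-256 (available in Python's standard hashlib module, used here purely for illustration) maps any input to a fixed 256-bit digest, and even a one-character change in the input produces a completely different result:

```python
import hashlib

# SHA-256 always yields a 256-bit (64 hex character) digest.
# Changing a single character of the input scrambles the whole digest,
# so two unequal inputs almost never collide in practice.
d1 = hashlib.sha256(b"hello world").hexdigest()
d2 = hashlib.sha256(b"hello worle").hexdigest()

print(d1)
print(d2)
print(d1 == d2)  # False
```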

Fair enough, but how does hashing help security at all?

We can use the fact that a good hash concisely and almost uniquely represents an arbitrarily long input to generate a small "digest" of a large volume of data (say, a disk image). Now, if we share this file with someone else, the recipient can compute the hash of the file themselves and compare it with the published digest.

If the digests do not match, the data must have been modified, either accidentally or maliciously, during transfer; had the data arrived intact, the hashes would have been identical, thanks to the fundamental rule of hashing.
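This integrity check can be sketched as follows; a minimal example assuming SHA-256 as the hash, with illustrative data standing in for a real file:

```python
import hashlib

def digest(data: bytes) -> str:
    """Compute the SHA-256 digest of some data (e.g. a file's contents)."""
    return hashlib.sha256(data).hexdigest()

# The sender publishes this digest alongside the data.
original = b"contents of a large disk image"
published = digest(original)

# The recipient recomputes the digest over what they received.
received = b"contents of a large disk image"  # unmodified in transit

if digest(received) == published:
    print("digests match: no evidence of tampering")
else:
    print("digests differ: data was modified in transit")
```

Tools like sha256sum follow exactly this pattern: compute the digest locally and compare it against the one the publisher provides.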

But what if the digests match? Well, we cannot say anything for sure. Maybe some ingenious attacker has changed the data in a way that leaves the hash unchanged, or maybe the data is still in its pristine form. The unlikeliness of the first situation (and hence our confidence in the second) is what determines the "strength" of the hash function.

Stay tuned for more on hashing!
