With the recent news of weaknesses in some common security algorithms (MD4, MD5, SHA-0), many are wondering exactly what these things are: They form the underpinning of much of our electronic infrastructure, and in this Guide we'll try to give an overview of what they are and how to understand them in the context of the recent developments.
But note: though we're fairly strong on security issues, we are not crypto experts. We've done our best to assemble (digest?) the best available information into this Guide, but we welcome being pointed to the errors of our ways.
A "hash" (also called a "digest", and informally a "checksum") is a kind of "signature" for a stream of data that represents the contents. The closest real-life analog we can think is "a tamper-evident seal on a software package": if you open the box (change the file), it's detected.
Let's first see some examples of hashes at work.
Many Unix and Linux systems provide the md5sum program, which reads a stream of data and produces a fixed, 128-bit number that summarizes that stream using the popular "MD5" method. Here, the "streams of data" are "files": two of them we see directly, plus an empty file and one that's too large to display.
$ cat smallfile
This is a very small file with a few characters

$ cat bigfile
This is a larger file that contains more characters.
This demonstrates that no matter how big the input
stream is, the generated hash is the same size (but
of course, not the same value). If two files have
a different hash, they surely contain different data.

$ ls -l empty-file smallfile bigfile linux-kernel
-rw-rw-r--  1 steve steve       0 2004-08-20 08:58 empty-file
-rw-rw-r--  1 steve steve      48 2004-08-20 08:48 smallfile
-rw-rw-r--  1 steve steve     260 2004-08-20 08:48 bigfile
-rw-r--r--  1 root  root  1122363 2003-02-27 07:12 linux-kernel

$ md5sum empty-file smallfile bigfile linux-kernel
d41d8cd98f00b204e9800998ecf8427e  empty-file
75cdbfeb70a06d42210938da88c42991  smallfile
6e0b7a1676ec0279139b3f39bd65e41a  bigfile
c74c812e4d2839fa9acf0aa0c915e022  linux-kernel
This shows that all input streams yield hashes of the same length. To experiment, try changing just one character of a small test file: you'll find that even very small changes to the input yield sweeping changes in the value of the hash, and this is known as the avalanche effect.
The avalanche effect is best seen by hashing two files with nearly identical content. Here we've changed the first character of a file from T to t, and looking at the binary values for these ASCII characters, we see that they differ by just one bit:
T -> 0x54 -> 0 1 0 1 0 1 0 0
t -> 0x74 -> 0 1 1 1 0 1 0 0
This single bit of change in the input produces a very large change in the output:
$ cat file1
This is a very small file with a few characters

$ cat file2
this is a very small file with a few characters

$ md5sum file?
75cdbfeb70a06d42210938da88c42991  file1
6fbe37f1eea0f802bd792ea885cd03e2  file2
Expanding the hex to binary (in 32-bit chunks to avoid really long lines), we can see that our one bit of input change produced 50 different output bits, about 39%:
First 32 bits
 0111 0101 1100 1101 1011 1111 1110 1011   75cdbfeb   file1
 0110 1111 1011 1110 0011 0111 1111 0001   6fbe37f1   file2
 ...X X.X. .XXX ..XX X... X... ...X X.X.   13 bits different

Second 32 bits
 0111 0000 1010 0000 0110 1101 0100 0010   70a06d42   file1
 1110 1110 1010 0000 1111 1000 0000 0010   eea0f802   file2
 X..X XXX. .... .... X..X .X.X .X.. ....   10 bits different

Third 32 bits
 0010 0001 0000 1001 0011 1000 1101 1010   210938da   file1
 1011 1101 0111 1001 0010 1110 1010 1000   bd792ea8   file2
 X..X XX.. .XXX .... ...X .XX. .XXX ..X.   14 bits different

Fourth 32 bits
 1000 1000 1100 0100 0010 1001 1001 0001   88c42991   file1
 1000 0101 1100 1101 0000 0011 1110 0010   85cd03e2   file2
 .... XX.X .... X..X ..X. X.X. .XXX ..XX   13 bits different
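If you'd rather not tally the differing bits by hand, a few lines of Perl with the core Digest::MD5 module will do it. This is just a sketch that hashes the two strings above (the trailing newlines matter: the 48-byte files end with one) and counts how many of the 128 output bits changed; it should report the same 50 bits counted above.

#!/usr/bin/perl -w
# Count how many MD5 output bits change when one input bit changes
use strict;
use Digest::MD5 qw(md5);

my $one = "This is a very small file with a few characters\n";   # contents of file1
my $two = "this is a very small file with a few characters\n";   # contents of file2

my $xor  = md5($one) ^ md5($two);       # XOR the two 16-byte binary digests
my $bits = unpack("%32b*", $xor);       # count the 1-bits left over
printf "%d of 128 output bits differ\n", $bits;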
But even though the value of the hash changes, its size does not.
Hashing is commonly confused with encryption, especially because both fall under the general heading of "cryptography", but it's important to understand the difference.
Encryption transforms data from a cleartext to ciphertext and back (given the right keys), and the two texts should roughly correspond to each other in size: big cleartext yields big ciphertext, and so on. "Encryption" is a two-way operation.
Hashes, on the other hand, compile a stream of data into a small digest (a summarized form: think "Reader's Digest"), and it's strictly a one-way operation. All hashes of the same type - this example shows the "MD5" variety - have the same size no matter how big the inputs are.
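Here's a small sketch, again using the core Digest::MD5 module, that drives the point home: whether the input is empty, a few bytes, or a million bytes, the digest is always the same 128 bits (32 hex characters).

#!/usr/bin/perl -w
# The MD5 digest is always 128 bits (32 hex characters), no matter the input size
use strict;
use Digest::MD5 qw(md5_hex);

foreach my $input ("", "short", "x" x 1_000_000) {    # empty, tiny, and million-byte inputs
    my $digest = md5_hex($input);
    printf "%7d input bytes -> %s (%d hex chars)\n",
           length($input), $digest, length($digest);
}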
"Encryption" is an obvious target for attack (e.g., "try to read the encrypted text without the key"), but even the one-way nature of hashes admits of more subtle attacks. We'll cover them shortly, but first we must see for what purposes hashes are commonly used.
We'll note here that though hashes and digests are often informally called "checksums", they really aren't. True checksums, such as a Cyclic Redundancy Check (CRC), are designed to catch accidental data-transmission errors, not deliberate attempts at tampering with the data. Aside from their small output space (usually 32 bits), they are not designed with the same properties in mind. We won't mention true checksums again.
There are quite a few uses for hashes, and we'll mention them here so we have something to build on when we try to subvert them for evil purposes, and to see the ramifications of the recent developments.
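One of the most familiar uses is verifying that a downloaded file matches the digest published on the project's web site. This sketch (the "published" value is just the smallfile digest from our earlier example) shows that the check is nothing more than an exact string comparison:

#!/usr/bin/perl -w
# Verify a file against a published MD5 digest - the comparison is exact
use strict;
use Digest::MD5;

my ($file, $published) = ("smallfile", "75cdbfeb70a06d42210938da88c42991");

open(my $fh, '<', $file) or die "can't open $file: $!";
binmode($fh);                                         # hash the raw bytes
my $actual = Digest::MD5->new->addfile($fh)->hexdigest;
close($fh);

if ($actual eq $published) {
    print "$file matches the published digest\n";
} else {
    print "$file does NOT match (got $actual)\n";     # even one differing bit lands here
}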
Important!
When considering hash values, close does not count! If the hashes being compared differ in any way, even by just a single bit, the data being digested is not the same. There is no equivalent of "roundoff error" or "almost" in cryptographic hashes.
The thoughtful reader may wonder how this could work: if it were possible to uniquely represent every possible stream of data in 128 bits - that's just 16 bytes - then why would one ever need a file larger than that? Since that clearly can't be the case, it follows that many different input streams must be able to produce any given hash.
When different chunks of data produce the same hash value, this is known as a collision, and it follows from the previous paragraph that collisions inherently must exist.
If so, this seems to undermine the whole premise of cryptographic hashes until one learns that for industrial-strength hashes like MD5, nobody has found a collision yet (well, almost nobody, but we're getting to that).
This astonishing fact is due to the astonishingly large number of possible hashes available: a 128-bit hash can have 3.4 x 10^38 possible values, which is:
340,282,366,920,938,463,463,374,607,431,768,211,456 possible hashes
If the hash algorithm is properly designed and distributes the hashes uniformly over the output space, "finding a hash collision" by random guessing is exceedingly unlikely (it's more likely that a million people will correctly guess all the California Lottery numbers every day for a billion trillion years).
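To get a feel for why collisions must exist yet are so hard to stumble upon, here's a sketch that throws away all but the first 16 bits of an MD5 hash. With only 65,536 possible values in that toy output space, a collision turns up almost immediately; aim the same loop at the full 128 bits and it would still be running long after we're all gone.

#!/usr/bin/perl -w
# Brute-force a collision in a toy 16-bit truncation of MD5
use strict;
use Digest::MD5 qw(md5_hex);

my %seen;                                           # toy hash value => input that produced it
for (my $i = 0; ; $i++) {
    my $input = "message number $i";
    my $toy   = substr(md5_hex($input), 0, 4);      # keep just the first 16 bits (4 hex chars)
    if (exists $seen{$toy}) {
        print "collision: '$seen{$toy}' and '$input' both hash to $toy\n";
        last;
    }
    $seen{$toy} = $input;
}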
Other hashes have even more bits: the SHA-1 algorithm generates 160 bits, whose output space is roughly four billion times larger than that produced by MD5's 128 bits.
So how does the hashing itself actually work? The first answer is "it depends on the kind of hash", and the second usually starts with "a lot of math". A colloquial explanation is that all the input bits are poured into a pot and stirred briskly, and that's about as technical as we care to get here.
There are plenty of resources that show the internal workings of a hash algorithm, almost all of which involve lots of shifting and rotating through multiple "rounds".
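To give just a flavor of that stirring - and this is emphatically a toy, not a real cryptographic hash - here's a sketch of the kind of rotate/add/XOR rounds these algorithms run, over and over, for every byte of input. Real algorithms do vastly more of this, with carefully chosen constants and many rounds per block.

#!/usr/bin/perl -w
# A toy 32-bit "mixer" showing the flavor of rotate/add/XOR rounds (NOT a real hash!)
use strict;

sub toy_mix {
    my ($data) = @_;
    my $h = 0x6a09e667;                                     # arbitrary starting state
    foreach my $byte (unpack("C*", $data)) {
        $h ^= $byte;                                        # fold in one input byte
        $h = ((($h & 0x01FFFFFF) << 7) | ($h >> 25));       # rotate left by 7 bits
        $h = ($h + 0x9e3779b9) & 0xFFFFFFFF;                # add a constant, keep 32 bits
        $h ^= ($h >> 13);                                   # shift-XOR to smear bits around
    }
    return sprintf("%08x", $h);
}

print toy_mix("This is a very small file with a few characters\n"), "\n";
print toy_mix("this is a very small file with a few characters\n"), "\n";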
A good explanation of the SHA-1 algorithm (a popular one) can be found at Wikipedia.

[Figure, courtesy of Wikipedia: one iteration within the SHA-1 compression function. A, B, C, D, and E are 32-bit words of the state; F is a nonlinear function that varies from round to round; <<< denotes a left circular shift; Kt is a round constant.]
Though we're not going to elaborate on the internals, it does seem appropriate to mention some of the popular hash algorithms - the ones that come up throughout this Guide are MD4, MD5, SHA-0, SHA-1 and its longer relatives SHA-224/256/384/512, along with RIPEMD and HAVAL-128.
Each has its own advantages in terms of performance, resistance to the various kinds of collisions, how well its security has been professionally studied, and so on. "Picking a good hash algorithm" is beyond the scope of this Tech Tip.
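To make the size differences concrete, this sketch hashes one string with several of the algorithms named above (it assumes the Digest::MD5 and Digest::SHA Perl modules are available; the latter may need to be installed from CPAN on older systems):

#!/usr/bin/perl -w
# Same input, different algorithms: compare the digest widths
use strict;
use Digest::MD5 qw(md5_hex);
use Digest::SHA qw(sha1_hex sha256_hex sha512_hex);

my $input = "This is a very small file with a few characters\n";

printf "MD5     (%3d bits)  %s\n", 128, md5_hex($input);
printf "SHA-1   (%3d bits)  %s\n", 160, sha1_hex($input);
printf "SHA-256 (%3d bits)  %s\n", 256, sha256_hex($input);
printf "SHA-512 (%3d bits)  %s\n", 512, sha512_hex($input);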
As we've mentioned several times, "collisions" play a central role in the usefulness of a cryptographic hash: the easier it is to find a collision, the less useful the hash is. Some algorithms are better than others at avoiding collisions, and this is measured by three related attributes:

Collision resistance - it should be impractical to find any two different inputs that produce the same hash (here the attacker controls both inputs).

Preimage resistance - given an existing hash value, it should be impractical to find any input that produces that hash.

Second preimage resistance - given an existing input, it should be impractical to find a different input that produces the same hash.
Some very bright researchers in China presented a paper, Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD, at the Crypto 2004 conference in August 2004, and it's shaken up the security world considerably. This was some outstanding cryptography research.
They have found ways to reliably generate collisions in four hash functions much faster than brute-force time, and in one case (MD4, which is admittedly obsolete), with a hand calculation. This has been a stunning development.
These are all of the "we control both inputs" type - the first of our three kinds of collisions - and this type holds the most promise for compromising digital signatures, where the bad guy creates two contradictory documents, gets one of them signed, and pulls a switcheroo later.
We've heard there are already tools emerging that can reliably generate collision-pairs in "reasonable" amounts of CPU time, and with just a bit of searching we found one example for MD5. It can be checked with this small perl program, posted on Edward Felten's weblog by Greg Buchholz (origin unknown):
#!/usr/bin/perl -w
use strict;

# Two 128-byte blocks that differ in only a few bytes
# yet produce exactly the same MD5 hash.
my $v1=<<END_V1;
d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c
2f ca b5 87 12 46 7e ab 40 04 58 3e b8 fb 7f 89
55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 71 41 5a
08 51 25 e8 f7 cd c9 9f d9 1d bd f2 80 37 3c 5b
d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6
dd 53 e2 b4 87 da 03 fd 02 39 63 06 d2 48 cd a0
e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 a8 0d 1e
c6 98 21 bc b6 a8 83 93 96 f9 65 2b 6f f7 2a 70
END_V1

my $v2=<<END_V2;
d1 31 dd 02 c5 e6 ee c4 69 3d 9a 06 98 af f9 5c
2f ca b5 07 12 46 7e ab 40 04 58 3e b8 fb 7f 89
55 ad 34 06 09 f4 b3 02 83 e4 88 83 25 f1 41 5a
08 51 25 e8 f7 cd c9 9f d9 1d bd 72 80 37 3c 5b
d8 82 3e 31 56 34 8f 5b ae 6d ac d4 36 c9 19 c6
dd 53 e2 34 87 da 03 fd 02 39 63 06 d2 48 cd a0
e9 9f 33 42 0f 57 7e e8 ce 54 b6 70 80 28 0d 1e
c6 98 21 bc b6 a8 83 93 96 f9 65 ab 6f f7 2a 70
END_V2

# Convert the hex dumps to raw binary strings
my $p=join("",map {chr(hex($_))} split /\s+/, $v1);
my $q=join("",map {chr(hex($_))} split /\s+/, $v2);

# Pipe each block through md5sum; both lines print the same digest
print `echo -n \'$p\'|md5sum`;
print `echo -n \'$q\'|md5sum`;
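The trip through the shell's echo is the fragile part of that script (quoting raw binary bytes is at the mercy of the shell). If you prefer, the last two lines can be swapped for a check done entirely inside Perl with the core Digest::MD5 module - this is a small fragment meant to be dropped into the script above, where $p and $q are already defined:

# Replacement for the two `print ...md5sum` lines above:
use Digest::MD5 qw(md5_hex);

printf "%s  block 1\n", md5_hex($p);        # these two digests come out identical...
printf "%s  block 2\n", md5_hex($q);        # ...even though the inputs differ
print $p eq $q ? "inputs are identical\n"   # sanity check: the inputs really are different
               : "inputs differ\n";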
There is also a buzz about weaknesses in the SHA-0 and SHA-1 algorithms, but this is much more preliminary.
A researcher has managed to find a collision in SHA-0 (which is not in widespread use), and another has found a collision in a "reduced" SHA-1. A "reduced" algorithm eliminates some of the steps in the hashing process (40 rounds rather than 80), and even though finding a collision in what amounts to half of SHA-1 has no immediate practical impact, it's like a neon sign to researchers saying "look here".
There is no evidence that preimage resistance or second preimage resistance is at risk in any of these algorithms (e.g., given an existing hash, finding an input that produces that hash), but many of the cryptorati believe that blood is in the water and the sharks are on their way.
Once a weakness in an algorithm has been identified - even if it's theoretical or only on a reduced version - it's often the start of the whole thing unraveling.
In the short term, this will have only a limited impact on computer security. The bad guys can't suddenly start tampering with software in a way that fools published checksums, and they can't suddenly start cracking hashed passwords. Previously-signed digital signatures are just as secure as they were before, because the known attack requires creating both members of a colliding pair up front: one can't retroactively generate a second document that matches a signature already issued.
What it does mean, though, is that we've got to start migrating to better hash functions. Even though SHA-1 has long been thought to be secure, NIST (the National Institute of Standards and Technology) has a standard for even longer hash functions which are named for the number of bits in their output: SHA-224, SHA-256, SHA-384, and SHA-512.
Five hundred twelve bits of hash holds 1.34 x 10^154 possible values, which is far, far more than the number of hydrogen atoms in the universe [ref]. This is likely to be safe from brute-force attacks for quite a while.
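That 1.34 x 10^154 figure is easy to double-check; here's a short sketch using Perl's bundled Math::BigInt module to print 2^512 exactly:

#!/usr/bin/perl -w
# Print the exact size of a 512-bit output space: 2**512
use strict;
use Math::BigInt;

my $space = Math::BigInt->new(2)->bpow(512);     # 2 raised to the 512th power, exactly
print "2^512 = $space\n";
print "(", length($space->bstr), " digits long)\n";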
As we conclude this Tech Tip, we remind the reader that we are not crypto experts, and have merely collected information from far and wide in a manner that we believe covers the topic with clarity and fidelity. But we could be wrong, and in the case of predicting the future, probably are.
Those wishing the authoritative word on this are encouraged to consult legendary security expert Bruce Schneier, who has written an analysis that appeared in Computerworld:
Opinion: Cryptanalysis of MD5 and SHA: Time for a new standard
Crypto researchers report weaknesses in common hash functions
As his was an opinion piece and not a Tech Tip, he doesn't cover the background, but his big-picture analysis is certainly much more likely to be correct than ours.
We also recommend Bruce's monumental work Applied Cryptography, which has been the gold standard of crypto texts for some time. We refer to this constantly when researching security issues and cannot recommend this book highly enough.
Professor Edward Felten wrote about this in his Freedom to Tinker weblog, and he is also an authoritative voice on the subject.
Update: (2005/02/15) - Bruce Schneier's weblog: SHA-1 Broken
Another excellent resource is Doxpara Research, with thoughtful papers on the future of these hashes.
---
Feedback and/or corrections to this paper are gladly accepted.
Last modified: Mon May 9 09:15:29 PDT 2005