Hash

51FEC3B6FCB1E7D5465575BED5DCDC1B8897AE5A

Computer hashing

Computer hashing is a one-way mathematical algorithm that forms the foundation of e-discovery. Strictly speaking it is not encryption, because a hash value cannot be reversed to recover the original data. Hashing generates an effectively unique alphanumeric value to identify a particular computer file, group of files, or even an entire hard drive. As an example, the hash of the animated GIF file is shown above. The alphanumeric value of a computer file is called its “hash value.” Hash is also known in mathematical parlance as the “condensed representation” or “message digest” of the original message. It is more popularly known today as a “digital fingerprint.” Hash is the bedrock of e-discovery because the digital fingerprint verifies the authenticity of data and exposes any alteration, whether negligent or intentional. Hash also allows for the identification of particular files, and the easy filtration of duplicate documents, a process called “deduplication” that is essential to all e-discovery document processing.

Hash is my favorite e-discovery technology. I became fascinated by its great potential as a safeguard for electronic evidence in the future, and ended up reading and experimenting with this algorithm in depth. Ultimately I wrote a forty-four page law review article on the subject. HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007). Here I discuss hash at length and review just about every case that mentions it. The article has 174 footnotes to provide reference to almost everything on the subject that might be of interest to a lawyer or others in the e-discovery field. As the title suggests, I make a specific proposal in the article for the adoption of an e-discovery file naming protocol based on hash to replace the paper oriented Bates stamp. For more background on this law review article see my prior blog about it. For information on how the use of hash, instead of Bates stamps, is much more efficient and saves money in e-discovery processing, see my other blog The Days of the Bates Stamp Are Numbered.

Technically, hashing is based on the substitution and transposition of data by various mathematical formulas. Thus the process is called “hashing,” in the linguistic sense of “to chop and mix.” The hash value is commonly represented as a short string of random-looking letters and numbers, which are actually binary data written in hexadecimal notation. Hash is commonly called a file’s “fingerprint” because, for all practical purposes, the value is unique to that file’s exact contents.
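To make the hexadecimal point concrete, here is a minimal sketch using Python's standard hashlib module. The sample text is a placeholder of my own, not a file discussed in this post. It shows that an MD5 value is 16 bytes of binary data and a SHA-1 value is 20 bytes, and that each byte is written as two hexadecimal characters.

```python
import hashlib

# Placeholder content for illustration only.
data = b"The quick brown fox"

# The raw hash values are binary data.
md5_digest = hashlib.md5(data).digest()    # 16 bytes (128 bits)
sha1_digest = hashlib.sha1(data).digest()  # 20 bytes (160 bits)

# Each byte is written as two hexadecimal characters (0-9, A-F).
md5_hex = md5_digest.hex().upper()
sha1_hex = sha1_digest.hex().upper()

print(len(md5_digest), len(md5_hex))    # 16 32
print(len(sha1_digest), len(sha1_hex))  # 20 40
```

This is why an MD5 value always appears as 32 characters and a SHA-1 value as 40 characters, no matter how large the hashed file is.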

If two computer files are identical, then they will have the same hash value. Even if the files have different names, they will have the same hash so long as their contents are exactly the same. This allows for easy identification and elimination of redundant documents, the deduplication process mentioned above. But if you so much as change a single comma in a thousand page text, it will have a completely different hash value than the original. Similarities between files do not produce similarities between their hash values. That is how the math in all hashing works.
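The deduplication idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the file names and contents are invented, the helper names sha1_hex and group_duplicates are hypothetical, and a real e-discovery platform would read the files from disk rather than from a dictionary.

```python
import hashlib
from collections import defaultdict

def sha1_hex(content: bytes) -> str:
    """Return the SHA-1 hash value of the content as uppercase hexadecimal."""
    return hashlib.sha1(content).hexdigest().upper()

def group_duplicates(files: dict) -> dict:
    """Group file names by the hash of their contents (names are ignored)."""
    groups = defaultdict(list)
    for name, content in files.items():
        groups[sha1_hex(content)].append(name)
    return dict(groups)

# Hypothetical files: two identical copies and one with a comma removed.
files = {
    "contract.txt": b"Payment is due in thirty days, without exception.",
    "copy_of_contract.txt": b"Payment is due in thirty days, without exception.",
    "edited_contract.txt": b"Payment is due in thirty days without exception.",
}

groups = group_duplicates(files)
duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)  # [['contract.txt', 'copy_of_contract.txt']]
```

The two copies land in the same group despite their different names, while the version missing one comma gets a hash value of its own.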

Many kinds of effective hash formulas have been invented, but two are in wide use today: the SHA-1 and MD5 algorithms. Both are very effective for this purpose, in that it is considered “computationally infeasible” for two different files to produce the same hash value by chance. That is why hashing is commonly employed in data transmissions to verify that the integrity of a file has been maintained in transmission. If you hash the file received, and it does not produce the same hash value, then it has been corrupted, and at least one bit is not the same as the original. It is a highly reliable way of verifying the integrity of an electronic file.
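The transmission check described above can be sketched as follows. This is a minimal example, assuming the sender has published the expected hash value alongside the file. The function name verify_integrity and the sample content are my own inventions for illustration.

```python
import hashlib

def verify_integrity(received: bytes, expected_hex: str, algorithm: str = "sha1") -> bool:
    """Hash the received data and compare it with the published hash value."""
    actual_hex = hashlib.new(algorithm, received).hexdigest()
    # Hexadecimal is case-insensitive, so compare in lowercase.
    return actual_hex.lower() == expected_hex.strip().lower()

# Hypothetical file content and the hash value its sender would publish.
original = b"Exhibit A: signed agreement"
published_hash = hashlib.sha1(original).hexdigest()

intact = verify_integrity(original, published_hash)
corrupted = verify_integrity(b"Exhibit A: signed agreemenT", published_hash)
print(intact, corrupted)
```

A match means the received copy is the same as the original; a mismatch means something changed in transit, though the hash alone does not say what or where.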

Software to run both the SHA-1 and MD5 hash analysis of files is widely available, easy to use and free. I use the HashTab shell extension for Windows, available for free at http://www.beeblebrox.org/software.php. The hash value of any file can be determined almost instantly, regardless of the type of electronic file, including graphics. For instance, the hash values of a Word document I am working on now are:
MD5: 588BCBD1845342C10D9BBD1C23294459
SHA-1: C24AE3125BFDBCE01A27FDDA21B3A7E83FAFF69E
If I only change one comma in this multipage document, all else remaining the same, the hash values are now:
MD5: 5F0266C4C326B9A1EF9E39CB78C352DC
SHA-1: 4C37FC6257556E954E90755DEE5DB8CDA8D76710
Although the two files have only this trivial difference, there are no similarities in these hash values, proving that hashing will detect even the slightest file alteration.
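For readers who prefer a script to a shell extension, the same experiment can be approximated in Python. This is a sketch under my own assumptions, not how HashTab works internally: the function name hash_file is hypothetical, and two small temporary text files stand in for the Word document, which is not reproduced here.

```python
import hashlib
import os
import tempfile

def hash_file(path, chunk_size=65536):
    """Return (MD5, SHA-1) of a file, reading it in chunks to limit memory use."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest().upper(), sha1.hexdigest().upper()

# Stand-in documents: identical except for one comma in the first line.
line = b"First, the parties agree to the terms below.\n"
original_text = line * 1000
edited_text = line.replace(b",", b"", 1) + line * 999

with tempfile.TemporaryDirectory() as folder:
    original_path = os.path.join(folder, "original.txt")
    edited_path = os.path.join(folder, "edited.txt")
    with open(original_path, "wb") as f:
        f.write(original_text)
    with open(edited_path, "wb") as f:
        f.write(edited_text)
    original_hashes = hash_file(original_path)
    edited_hashes = hash_file(edited_path)

print("Original MD5:  ", original_hashes[0])
print("Original SHA-1:", original_hashes[1])
print("Edited MD5:    ", edited_hashes[0])
print("Edited SHA-1:  ", edited_hashes[1])
```

Reading in chunks means the same function can handle very large files without loading them into memory all at once.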

Hashing can also be used to determine when fields or segments within files are identical, even though the files as a whole might be quite different. This requires special software, which is commonly available from many e-discovery vendors, for a price. This software allows you to hash only portions of a file. Thus, for instance, you can hash only the body of an email, the actual message, to determine whether it is identical with another email, even when the “reference” or the “to” and “from” fields are different. This allows for an important filtering process called “near de-duplication.”
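Vendor tools differ in exactly which fields they hash, so the following is only a rough sketch of the idea using Python's standard email and hashlib modules. The messages are invented, the function name body_hash is hypothetical, and the choice to collapse whitespace before hashing is my own assumption rather than any vendor's documented behavior.

```python
import hashlib
from email import message_from_string

def body_hash(raw_email: str) -> str:
    """Hash only the message body, ignoring all header fields."""
    message = message_from_string(raw_email)
    body = message.get_payload()  # a string for simple, non-multipart messages
    # Assumption: collapse whitespace so trivial line-wrap differences are ignored.
    normalized = " ".join(body.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

# Hypothetical emails: a and b share a body but differ in their headers.
email_a = "From: alice@example.com\nTo: bob@example.com\nSubject: Lunch\n\nSee you at noon.\n"
email_b = "From: alice@example.com\nTo: carol@example.com\nSubject: Lunch\n\nSee you at noon.\n"
email_c = "From: alice@example.com\nTo: bob@example.com\nSubject: Lunch\n\nSee you at one.\n"

print(body_hash(email_a) == body_hash(email_b))  # True
print(body_hash(email_a) == body_hash(email_c))  # False
```

Hashing the two complete messages would give different values because the “to” fields differ, which is why hashing the body alone is what lets near de-duplication catch them.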

9 Responses to “Hash”

  1. Jr Says:

    Nice BLOG!

  2. Sherlock Holmes in the Twenty-First Century: Definitions and Limits of Computer Forensics, Forensic Copies and Forensic Examinations « e-Discovery Team Says:

    [...] Hash [...]

  3. AS Says:

    Has anyone found lots of issues trying to decide which e-mail message properties to be used to create MD5 hash? Not to mention the timezone issues, where communication happened internationally…

  4. Venue Analysis Transformed by e-Discovery and the Digitization of Society « e-Discovery Team Says:

    [...] Hash [...]

  5. Trade Secrets Case Uses MD5 Hash and Keyword Search to Protect Defendants’ Rights - Magistrate’s Privilege Waiver Order Is Reversed « e-Discovery Team Says:

    [...] Hash [...]

  6. New Case Denies Both Production Under Rule 26(b)(2)(B) and Sanctions for Spoliation Under Unspoken Rule 37 « e-Discovery Team Says:

    [...] Hash [...]

  7. Jeff Says:

    Why are they so trusting of MD5/SHA1? Both are circumventable. There are now ways to make the hashes match even if the files are TOTALLY different. REF:

    http://www.schneier.com/blog/archives/2005/06/more_md5_collis.html

    There are also several tools now to modify MD5/SHA1.

    http://www.stachliu.com/collisions.html

  8. Ralph Losey Says:

    Thanks for the comment. The articles you reference are interesting, but the false collisions engineered by experts which are discussed at these sites, and elsewhere, are not a cause for any real concern in the e-discovery arena. Here we primarily use hash to verify that ESI has not been altered, and to determine if two files stored on systems are identical. e-Discovery is not using hash for encryption purposes. These studies do, however, explain why the spy agencies are moving to new hash formulas.

  9. The Days of the Bates Stamp Are Numbered « e-Discovery Team Says:

    [...] Hash [...]
