This paper examines the role of hash values and cryptographic hash functions in digital forensic investigation. It outlines the mathematical properties that define a valid hash function — including fixed-length output, one-way computation, and collision resistance — and explains how these properties underpin two primary forensic applications: digital content integrity checking and file identification and classification. The paper discusses how investigators use hash values to detect unauthorized modifications to digital evidence, and how databases of known hash values enable efficient identification of files such as those containing child pornographic content. It concludes by noting that tools employing SHA and MD5 algorithms remain standard in forensic practice for ensuring evidence reliability.
Hash values are condensed representations of the digitized or binary content of digital material; they convey nothing about that content that a human reader could interpret. A hash function is an algorithm that converts variable-sized input into fixed-sized outputs called hash values. Hash functions built for security purposes, known as cryptographic hash functions, facilitate the development of digital signatures, short textual condensations, and hash tables for the purpose of analysis (Fang et al., 2011; Kumar et al., 2012). This paper addresses hash functions and their significance in digital forensic practice.
H (hash function) represents a transformation taking variable-sized input m and returning a fixed-sized string h (i.e., h = H(m)) (Kumar et al., 2012). Hash functions possessing only this property can be applied to various broad computational uses; however, when applied to cryptography, they normally possess several additional properties.
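The fixed-size property h = H(m) can be illustrated with a minimal Python sketch. It assumes SHA-256 from the standard hashlib module as the example H; the sample inputs are arbitrary illustrations and are not taken from the cited sources.

```python
import hashlib

def H(m: bytes) -> str:
    """Return the SHA-256 digest of m as a hexadecimal string."""
    return hashlib.sha256(m).hexdigest()

# Inputs of very different sizes (arbitrary illustrative values).
messages = [b"", b"abc", b"x" * 1_000_000]

for m in messages:
    h = H(m)
    # Every digest is 64 hex characters (32 bytes), whatever the input size.
    print(len(m), len(h), h[:16])
```

Whatever the length of m, the printed digest length stays at 64 hexadecimal characters.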
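The fundamental prerequisites for any cryptographic hash function (H) include the following: its input may be of any length, its output has a fixed length, it is one-way, and it is collision-free.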
A one-way hash function cannot be easily inverted: given any hash value h, it is not computationally feasible to find an input x such that H(x) = h. Further, if, given an input x, it is computationally infeasible to find y ≠ x such that H(x) = H(y), then H is a weakly collision-free hash function (Kumar et al., 2012; Rasjid et al., 2017). A strongly collision-free H, by contrast, is a hash function for which finding any two distinct messages x and y such that H(x) = H(y) is not computationally feasible.
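The difference between the weak and strong collision properties can be made concrete with a deliberately weakened hash. The sketch below is an illustration only: it assumes a hypothetical tiny_hash that keeps just the first 2 bytes of a SHA-256 digest so that both searches finish quickly, and the input strings are arbitrary. No real forensic hash is this short.

```python
import hashlib
from itertools import count

def tiny_hash(m: bytes) -> bytes:
    """SHA-256 truncated to 2 bytes (16 bits); weak on purpose, for illustration."""
    return hashlib.sha256(m).digest()[:2]

def find_any_collision():
    """Strong-collision search: any two distinct inputs with equal digests."""
    seen = {}
    for i in count():
        m = str(i).encode()
        h = tiny_hash(m)
        if h in seen:
            return seen[h], m, i + 1  # the two inputs and the number of tries
        seen[h] = m

def find_second_preimage(x: bytes):
    """Weak-collision search: some y != x with tiny_hash(y) == tiny_hash(x)."""
    target = tiny_hash(x)
    for i in count():
        y = b"y" + str(i).encode()
        if y != x and tiny_hash(y) == target:
            return y, i + 1  # the matching input and the number of tries

a, b, tries_any = find_any_collision()
y, tries_second = find_second_preimage(b"evidence")
print("any collision found after", tries_any, "tries")
print("second preimage found after", tries_second, "tries")
```

For a 16-bit digest, finding any colliding pair typically takes a few hundred attempts, whereas matching one fixed input typically takes tens of thousands. This is why strong collision resistance is the harder property for an algorithm to retain.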
Hash values are a concise representation of the longer document or message from which they were calculated; a single message digest may be considered a larger document's "digital fingerprint." Possibly the key application of cryptographic hash functions is providing digital signatures. Because hash functions often operate more quickly than digital signature algorithms, digital signatures are typically computed for documents by working out the signature of the document's hash value — which is smaller than the actual document — rather than signing the document itself (Kumar et al., 2012). In addition, digests may be made publicly available without revealing the content of the actual document from which they are derived. This proves crucial within the context of digital time-stamping, as the application of hash functions here allows a document to be time-stamped without revealing its contents to the service provider.
So long as the target collision-resistance and one-way resistance properties of hash algorithms remain intact, hash values computed for forensic purposes may be applied to: (1) digital content integrity checks, and (2) effective file grouping and identification.
From a forensic science perspective, it does not matter whether the random collision resistance of a hash algorithm has been compromised. An attack on the remaining properties starts from a given hash value, supplied either directly or as digital content from which the hash value can be computed, and other digital content with an identical hash value must then be found. In a random collision attack, by contrast, no hash value is given beforehand; it suffices to construct or identify any two distinct pieces of digital content having identical hash values (Netherlands Forensic Institute, 2018a). The latter situation does not arise within the forensic domain, as digital content is always received or provided with its computed hash value already attached.
The chief goal of digital content integrity checks is to detect unintentional modifications in copies of digital content. The integrity check can also identify certain kinds of intentional manipulation.
Altering digital information, whether accidentally or deliberately, is fairly straightforward. By employing hash values, individuals may inform one another of what digital content they have worked on, and can ascertain whether they worked on the same content as a fellow user or analyst (Rasjid et al., 2017). For example, a digital detective reports a digital content hash value to a Biometrics and Digital analyst. The analyst re-computes the hash value of the supplied copy (Netherlands Forensic Institute, 2018a) and compares the result with the previously reported value. If the two values are not identical, the file was modified at some point in the intervening time, although hash values cannot reveal where or how the file differs from the original. If the two values are identical, it is highly likely that no change has been made to the digital content since the original hash value was calculated.
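The report-and-recompute procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions: SHA-256 is used as the hash algorithm, the helper names file_sha256 and verify_copy are hypothetical, and a temporary file stands in for the evidence copy.

```python
import hashlib
import os
import tempfile

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large evidence files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(path: str, reported_hash: str) -> bool:
    """Return True if the recomputed hash equals the reported hash."""
    return file_sha256(path) == reported_hash.strip().lower()

# A throwaway file stands in for the evidence copy (illustrative data only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"original evidence bytes")
    evidence_path = tmp.name

reported = file_sha256(evidence_path)      # value the investigator reports
unchanged_ok = verify_copy(evidence_path, reported)

with open(evidence_path, "ab") as f:       # simulate a one-byte modification
    f.write(b"!")
modified_ok = verify_copy(evidence_path, reported)

os.remove(evidence_path)
print(unchanged_ok, modified_ok)  # prints: True False
```

A mismatch only signals that the content differs from what was originally hashed; locating the change requires comparing the contents themselves.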
Generally, using hash values for classifying and identifying files is efficient, as hash values are quite small. Files today are often very large, running to several gigabytes (GB), where 1 GB equals approximately 1 billion bytes. By contrast, the longest hash value that Biometrics and Digital investigators currently use is a mere 32 bytes (64 hexadecimal characters), making it far quicker and simpler to compare file hash values than to compare actual file contents. Moreover, communicating file hash values is far quicker and simpler than transmitting the file contents themselves.
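Identification against a set of known hash values can be sketched as a simple set lookup. The example below is a hedged illustration: the reference set, file names, and contents are hypothetical stand-ins, SHA-256 is assumed as the algorithm, and a real hash database would be far larger and maintained by a trusted party.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the 64-character hexadecimal SHA-256 digest of data."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical reference set of hashes of previously identified files.
known_contents = [b"known file A", b"known file B"]
known_hashes = {sha256_hex(c) for c in known_contents}

# Hypothetical files found on a seized device (name -> contents).
seized = {
    "a.bin": b"known file A",
    "notes.txt": b"unrelated notes",
    "copy_of_b.dat": b"known file B",
}

def classify(files: dict, reference: set):
    """Split file names into those whose hash is in the reference set and the rest."""
    hits, unknown = [], []
    for name, data in files.items():
        (hits if sha256_hex(data) in reference else unknown).append(name)
    return hits, unknown

hits, unknown = classify(seized, known_hashes)
print(hits)     # ['a.bin', 'copy_of_b.dat']
print(unknown)  # ['notes.txt']
```

Only the 32-byte digests are compared, so a renamed file is still recognized while the contents of the reference files never need to be transmitted or opened.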
"Using hash databases to classify and filter forensic files"
Kumar, K., Sofat, S., Jain, S. K., & Aggarwal, N. (2012). Significance of hash value generation in digital forensic: A case study. International Journal of Engineering Research and Development, 2(5), 64–70.
Netherlands Forensic Institute. (2018a). Technical supplement forensic use of hash values and associated hash algorithms. Ministry of Justice and Security.
Rasjid, Z. E., Soewito, B., Witjaksono, G., & Abdurachman, E. (2017). A review of collisions in cryptographic hash function used in digital forensic tools. Procedia Computer Science, 116, 381–392.