How does a hashing algorithm actually work? And what makes it secure? In this article, we will look at an implementation of SHA-1 (Secure Hashing Algorithm 1) in Python and discuss each step used to create a hash digest. While there is plenty of information about how to use hashing algorithms generally, and about the security properties of specific hashing algorithms, this article is for anyone who would like to dive deeper into understanding exactly how a hashing algorithm maps an arbitrary array of bits to a unique hash digest.
We will look at Python for simplicity and legibility (we are not focused on performance) and will walk step by step through the process of producing a SHA-1 hash digest from an array of bits. We are using SHA-1 because of its relatively short digest. Note that SHA-1 has been deprecated because of known issues with uniqueness and has been superseded by newer hashing algorithms, like SHA-256. SHA-1 is still a great example of the construction of a secure hashing algorithm and has continued to be used (for now) in applications like git.
Note: A “hash digest” is the name for the output of a hashing algorithm.
1. What is a hashing algorithm?
First things first: what is a hashing algorithm? At its core, a hashing algorithm is a function that takes any input in bits and transforms it into a standard length representation. As an example, SHA-1 takes the input “foo” and transforms it to:
SHA1(“foo”) = “0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33”
SHA-1 is a cryptographic hashing algorithm. This means that, in addition to the basic transformation operation, SHA-1 also guarantees certain security properties. Cryptographic hashing algorithms have two core characteristics:
(1) Hash digests are unique
(2) Hashes cannot be reversed
(1) A cryptographic hashing algorithm must map unique inputs to unique digests. Because these algorithms transform all inputs (of arbitrary length) into a standard length representation, it would not be useful if they returned the same hash digest for different input data. When distinct inputs are mapped to the same digest, it is called a “collision”. SHA-1 actually has this problem and has therefore been deprecated (researchers found two different PDF documents that produced the same digest).
(2) It must not be possible to reverse a hash digest. This means that, given some digest, it must be impossible to work backwards to identify the input data that created that digest. This is important for cryptographic hashing algorithms to ensure that once some data is hashed, the original data cannot be retrieved. Think of it like one-way, irreversible encryption. To achieve irreversibility, hashing algorithms repeatedly perform one-way mathematical operations. An easy way to intuit one-way operations is to imagine that you have the result of an addition. Although you know the result, it would be impossible to identify the operands that created that result. In the case of hashing algorithms, it is computationally infeasible to calculate the original input used to create a specific hash digest.
2. A crash course in computer internals
To understand exactly how a hashing algorithm works, it is important to have an understanding of the internal representation of data in a computer: bits. Hashing algorithms operate by performing bitwise operations on sections of input data, so it is also helpful to understand the ways that arrays of bits can be transformed and rearranged.
In our example of hashing the string “foo” — the string is actually an array of characters and these characters are uniquely identified by an integer (each integer is represented by an array of bits). SHA-1 rearranges and compresses the bits in the input message to produce the hash digest. The hash digest is simply a collection of 160 bits, represented by 40 hexadecimal characters.
3. Implementing SHA-1 in Python
To understand SHA-1, we will use an existing implementation in Python by TheAlgorithms. See the full implementation here. This program accepts either a string or a file path and hashes the corresponding bits. Readers are encouraged to clone the file and run it from a directory using the command line:
python sha1.py -string “foo”
This will print the hash digest to your command line. You can check that the resulting hash is the same as the standard SHA-1 implementation from hashlib:
4. Hashing a message
So, what is happening internally when we produce a SHA-1 digest? As we have said, a hash function is concerned with the rearrangement of bits to compress and transform any input into a unique, standard length representation. We do this using a series of operations on our input:
(4) Transformations with constants
(1) The first step for transforming an array of bits to a hash value is to ensure that the incoming data has a standard size. For SHA-1, the input length of our message must be a multiple of 512 (64 bytes). The message is padded with a one and zeros, and is followed by 64-bits describing the original message length. The number of zeros we choose to use ensures that the input message has an appropriate size.
(2) Next, we divide the input bits into constant length sections of 64 bytes. We do this by splitting the message into appropriate length “blocks” of data. We will iterate through these blocks, with each block contributing to our final hash digest.
(3) In order to rearrange the bits themselves, we need to represent them as integers. We “unpack” the 64 bytes into 16 integers, and append 64 zeros, giving us a total of 80 integers. We then iterate through each integer, performing operations defined in the SHA-1 specification on the entire array of integers. These operations transform the array into an “unpacked” array of 80 integers that we will use to create our final hash digest.
(4) After preprocessing the input message to constant width “blocks” of 80 integers, we are now ready to produce the hash digest. We do this by performing operations on the data using a series of constants. Which operations and which constants we use depends on the index of the integer we are processing (0–19, 20–39, 40–59, 60–79; see here §6.1.2). The constants we use form the initial hash value and each integer of each block of the message continuously transforms these constants. Once we have iterated through each integer in every message block, we concatenate the transformed constants to create our final hash digest.
Note: The constants we use are called “Nothing-up-my-sleeve numbers” and are used to initialise a hash function. These constants are unique to specific hash functions and are thoroughly researched to ensure that they produce appropriate hash digests for all (in practice, very many) inputs. SHA-1 uses the first 32 bits of the square roots of 2, 3, 5, and 10 (represented in binary).
The most intuitive way to think about a hash function is to imagine chopping an input message into constant length sections and iteratively updating a hash value based on the content of each message section. Each section of the message makes an “impression” on the final hash digest, resulting in a digest that is totally unique to the original message.
SHA-1 and newer hash functions in the “Secure Hashing Algorithm” family are carefully researched to guarantee their security properties. It is never a good idea to write your own cryptographic hash functions, but it is still fascinating to dig in and understand how they work. Hopefully you are better able to appreciate the security properties and performance of the cryptographic hash function of your choice.