Hashing
What is hashing?
Hashing transforms data of any size to an alphanumeric string of fixed and predetermined length. A hash function is irreversible: It’s not possible to determine the original input data based on the result of the hash function. This makes hashing well suited to storing sensitive data, such as passwords, in a protected form. Hashing can convert any type or volume of data, such as the title of a book, the entire text of a book, or the illustration file for the cover artwork. Each of these data items can be hashed to strings of the same fixed length. A well-designed hash function also aims to make it very unlikely that two different inputs produce the same hash value.
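As a minimal sketch of the fixed-length property, the snippet below hashes a short title and a much longer text. The choice of SHA-256 (via Python’s standard hashlib module), the helper name hash_value, and the sample inputs are assumptions made for illustration; the article doesn’t name a specific hash function.

```python
import hashlib

def hash_value(data: bytes) -> str:
    """Return the SHA-256 hash of the input as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

# Two inputs of very different sizes (illustrative stand-ins).
title = "Moby-Dick".encode("utf-8")
full_text = b"Call me Ishmael. " * 10_000

for label, data in [("title", title), ("full text", full_text)]:
    digest = hash_value(data)
    print(f"{label}: {len(data)} bytes in, {len(digest)} hex characters out")
```

Both inputs produce a hash value of 64 hexadecimal characters, even though one input is thousands of times larger than the other.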
Input data is often called the “key,” while a hash function is the set of steps (the algorithm) performed on the key. The results of the hash function go by several different names: hash values, hash codes, or just hashes. For purposes of this article, we’ll use the terms input data, hash function, and hash value. Hash values are used for different purposes, including efficient database management, data integrity, and security.
What is hashing used for?
Data storage and retrieval
One of the first uses of hashing was for efficient data storage and retrieval systems. A large data set can be time-consuming to search through. Performing a search on the hash value of a search term instead of the input term shortens search response times and thus improves user experience. Let’s take the example of a library database of books that divides all of the entries into smaller groups based on the hash value of the titles. When a user searches for a book title, the hash value of the requested title will quickly point to the correct group. Then, a quick search is done for the original exact title within the smaller group. In other words, the time it takes to run the hash function and identify the correct data subset, added to the time required to find the title in the right subset, is still less than the time needed to find the book title in the full data set.
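The library example can be sketched roughly as follows. The number of groups, the function names, and the sample catalog are all hypothetical, and SHA-256 is used only because it is readily available; a real database would typically use a faster, non-cryptographic hash function for this job.

```python
import hashlib

NUM_GROUPS = 16  # hypothetical number of groups the catalog is divided into

def group_for(title: str) -> int:
    """Map a title to a group number derived from its hash value."""
    digest = hashlib.sha256(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_GROUPS

def build_index(titles):
    """Place every title into the group selected by its hash value."""
    groups = [[] for _ in range(NUM_GROUPS)]
    for title in titles:
        groups[group_for(title)].append(title)
    return groups

def find(groups, title: str) -> bool:
    """Search only the one group that the title's hash value points to."""
    return title in groups[group_for(title)]

catalog = ["Moby-Dick", "Middlemarch", "Dracula", "Emma", "Ulysses"]
groups = build_index(catalog)
print(find(groups, "Dracula"), find(groups, "Hamlet"))
```

The search for a title touches a single group rather than the whole catalog, which is where the time saving comes from.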
Password security
Storing the hash value of a password is more secure than storing the actual password. Because no actual password is stored (encrypted or unencrypted), the individual’s account is much more secure in the event of a data breach.
Data integrity
Comparing two files by using their hash values can determine if the files are identical. Comparing two short hash values is quicker than comparing the documents character by character, especially when one hash value was computed in advance or the files are stored in different places. If two files are supposed to be the same but are different, different hash values will indicate that one of the files was altered. This test can indicate changes and updates to files, or can expose whether a file has been corrupted by malware.
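A file comparison along these lines might look like the sketch below. The helper names (file_hash, same_content, write_temp) and the temporary sample files are assumptions for illustration, and SHA-256 is an arbitrary choice of hash function.

```python
import hashlib
import os
import tempfile

def file_hash(path: str, chunk_size: int = 65536) -> str:
    """Hash a file's contents, reading it in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a: str, path_b: str) -> bool:
    """Treat two files as identical when their hash values match."""
    return file_hash(path_a) == file_hash(path_b)

def write_temp(data: bytes) -> str:
    """Write data to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

original = write_temp(b"chapter one\n")
copy = write_temp(b"chapter one\n")
altered = write_temp(b"chapter 0ne\n")  # one character changed

copy_matches = same_content(original, copy)
altered_matches = same_content(original, altered)
print(copy_matches, altered_matches)

for path in (original, copy, altered):
    os.remove(path)
```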
Comparing hash values of documents happens in places like message authentication and blockchain. Message authentication, sometimes paired with digital signatures, uses both hashing and encryption. In this situation, encryption is used to protect the message content during transmission and hashing is used to verify that the content wasn’t tampered with while en route.
Blockchain uses hashing to independently verify that the data in the blockchain hasn’t been altered. A blockchain is built on layering transaction data and associated hash values, with the hash values acting as confirmation that the preceding data in the chain hasn’t been tampered with.
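The chaining idea can be illustrated with a heavily simplified sketch. It is not how any particular blockchain is implemented: the block layout, the function names, the all-zero starting value, and the sample transactions are all hypothetical.

```python
import hashlib

GENESIS = "0" * 64  # hypothetical placeholder "previous hash" for the first block

def block_hash(previous_hash: str, data: str) -> str:
    """Hash a block's data together with the hash of the block before it."""
    return hashlib.sha256((previous_hash + data).encode("utf-8")).hexdigest()

def build_chain(records):
    chain, previous = [], GENESIS
    for data in records:
        current = block_hash(previous, data)
        chain.append({"data": data, "prev": previous, "hash": current})
        previous = current
    return chain

def is_valid(chain) -> bool:
    """Recompute every hash and check that each block links to the one before."""
    previous = GENESIS
    for block in chain:
        if block["prev"] != previous:
            return False
        if block["hash"] != block_hash(previous, block["data"]):
            return False
        previous = block["hash"]
    return True

chain = build_chain(["A pays B 5", "B pays C 2", "C pays A 1"])
valid_before = is_valid(chain)
chain[1]["data"] = "B pays C 200"  # tamper with an earlier block
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

Because each hash value depends on the data and on the previous hash value, changing one block makes the recomputed hash disagree with the stored one.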
How does hashing work?
A hash function uses a variety of operations. They might be arithmetic, involve conversions or transformations, or be procedures that manipulate the actual bits (basic units of computer data) of the input file. While the input data can be of any size, a given hash function will always return a hash value of the same size, often from 32 to 64 characters depending on the set of characters used. Ideally, the hash function runs quickly and results in an even distribution of all possible hash values.
With large input data (like our example of hashing a complete document), it might take several passes through the hash function to get to a final hash value that represents the entire input data. The hash function is first run on a small block of the input data. This resulting preliminary hash value is then combined with another small block of original input data and the result is run through the hash function. Combining and running continue until all the original input data is processed.
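The block-by-block process described above can be sketched as a toy construction. This is an illustration of the chaining idea only, not the internal design of any real hash function: the starting value, the block size, and the use of SHA-256 as the per-block step are assumptions.

```python
import hashlib

def chained_hash(data: bytes, block_size: int = 64) -> str:
    """Toy chained hash: combine the running hash with each block in turn."""
    state = b"\x00" * 32  # fixed starting value (an assumption for this sketch)
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # Combine the preliminary hash with the next block, then hash again.
        state = hashlib.sha256(state + block).digest()
    return state.hex()

document = b"It was the best of times, it was the worst of times. " * 200
print(chained_hash(document))
```

In everyday use there is no need to write this loop by hand: Python’s hashlib objects accept data piece by piece through their update() method and handle the block processing internally.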
A hash function needs to be reproducible and repeatable. If given the same input data, a hash function must return the same hash value every time. This criterion affects not only the operations performed as part of the hash function, but also the order in which the data is processed. If a large input were divided into small blocks differently before being hashed, the resulting hash value would be different, so the function must always split and process the data in the same way. A well-designed hash function is also highly sensitive to its input: changing even a single character produces a very different hash value, something known as the avalanche effect.
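Both properties can be checked with a short experiment. The sketch below counts how many of the 256 bits in a SHA-256 hash value differ between two inputs; the function name and the sample strings are illustrative assumptions.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the SHA-256 hashes of a and b differ."""
    hash_a = int.from_bytes(hashlib.sha256(a).digest(), "big")
    hash_b = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(hash_a ^ hash_b).count("1")

# Same input twice: the hash values are identical, so no bits differ.
same = differing_bits(b"hashing", b"hashing")

# One character changed: a large share of the 256 bits is expected to flip.
changed = differing_bits(b"hashing", b"hashinh")

print(same, changed)
```

For a function with a strong avalanche effect, a one-character change flips roughly half of the output bits on average.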
A hash function is irreversible: you can’t directly solve for the original input from the hash value. This makes storing a hash value of a password more secure than storing an encrypted version, which can be decrypted by anyone who obtains the encryption key. Although the original password can’t be determined from the stored hash value, the system can confirm the individual’s login by hashing the submitted password and comparing the result to the stored hash value. A match indicates that the submitted password is the same as the password that created the stored hash value.
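The login check can be sketched as below. This is deliberately simplified to show only the hash-and-compare step: the function names are hypothetical, and a plain SHA-256 hash is an assumption made for brevity. Real systems generally add a salt and use a hash function designed specifically for passwords.

```python
import hashlib
import hmac

def hash_password(password: str) -> str:
    """Return the hash value that would be stored instead of the password."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

def check_login(submitted: str, stored_hash: str) -> bool:
    """Hash the submitted password and compare it with the stored hash value."""
    # compare_digest compares in constant time to avoid leaking timing hints.
    return hmac.compare_digest(hash_password(submitted), stored_hash)

stored = hash_password("JSmith123")  # only this value is kept, never the password
print(check_login("JSmith123", stored), check_login("jsmith123", stored))
```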
What are collisions in hashing?
A rule of hash functions is that all hash values must be of the same fixed length. This means there’s a maximum number of hash values that can be generated by a given hash function. Having shorter fixed lengths for hash values will result in fewer possible hashing outcomes. If the number of possible hash values is large enough, you can expect with near certainty, though never an absolute guarantee, that each input will convert to a unique hash value.
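To put rough numbers on this, the sketch below counts the possible hash values for a given length in bits and estimates the chance of at least one collision using the standard birthday-problem approximation. It assumes an idealized hash function whose values are spread evenly; the function names are illustrative.

```python
import math

def possible_values(bits: int) -> int:
    """Number of distinct hash values for a hash that is `bits` bits long."""
    return 2 ** bits

def collision_probability(num_inputs: int, bits: int) -> float:
    """Approximate chance that at least two of num_inputs inputs collide.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2N)), which assumes
    hash values are uniformly distributed (an idealization).
    """
    n = num_inputs
    exponent = n * (n - 1) / (2 * possible_values(bits))
    return -math.expm1(-exponent)  # equals 1 - exp(-exponent), but more precise

for bits in (32, 256):
    print(bits, collision_probability(1_000_000_000, bits))
```

With a billion inputs, a 32-bit hash is all but certain to produce a collision, while for a 256-bit hash the estimated probability is vanishingly small.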
A “collision” occurs when two different inputs run through a hash function and return the same hash value. Some applications of hashing are okay with collisions, such as in the library database example—in this case hashing is used to divide a database into smaller groups for improved search response. The hashing of the database groups the data by collisions.
Other applications, such as storing secure versions of sensitive information like passwords, need hash functions designed to make collisions so unlikely that they are practically impossible to find. Collisions can’t be eliminated entirely, because the number of possible inputs is unlimited while the number of hash values is finite. These hash functions have longer fixed-length results, thus creating more possible hash value outcomes and reducing the chance of collisions. If the hash function used to hash passwords had only a small number of possible outcomes, collisions would be more likely. This means there would be a greater chance that a wrong password could generate a hash value that matches the stored hash value.
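The risk of a small output space is easy to demonstrate. The sketch below uses a deliberately weak, hypothetical tiny_hash that keeps only the first byte of a SHA-256 hash, leaving just 256 possible values, and then searches for two different inputs that collide.

```python
import hashlib

def tiny_hash(text: str) -> int:
    """A deliberately weak hash with only 256 possible values (demo only)."""
    return hashlib.sha256(text.encode("utf-8")).digest()[0]

def find_collision():
    """Return two different inputs that share the same tiny_hash value."""
    seen = {}
    # With 257 distinct inputs and 256 possible values, a collision must occur.
    for i in range(257):
        candidate = f"password{i}"
        value = tiny_hash(candidate)
        if value in seen:
            return seen[value], candidate
        seen[value] = candidate
    raise AssertionError("unreachable: 257 inputs cannot fit 256 values")

first, second = find_collision()
print(first, second, tiny_hash(first))
```

If tiny_hash protected a password, either of the two inputs found would be accepted as a match for the other.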
What is salting a hash?
Salting a hash refers to adding extra data (called “salt”) to the actual input data before running the hash function. There are many ways to salt input data. One example is to put a string of random characters at the beginning of the actual input data. Salting prevents identical data originating from different sources from producing identical hash values. Each data source is assigned its own unique salt, which is stored and reused any time that source’s input data is run through the hash function.
Any data can be salted, but the most common use is to salt a password to add an extra layer of complexity and security. If two people happen to have the same password (say, JSmith123), without salting they’ll have the same hash value. If a hacker cracks one of these accounts, they automatically know that everyone else in the database with the same hash value also has a password of JSmith123. But salting every JSmith123 with a unique salt makes it overwhelmingly likely that each hash value will be unique, so one cracked password won’t lead to multiple cracked passwords.
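A minimal sketch of salting, following the example of putting random data in front of the password, is shown below. The function names, the 16-byte salt length, and the use of plain SHA-256 are assumptions for illustration; production systems usually rely on a dedicated password-hashing function that handles the salt itself.

```python
import hashlib
import hmac
import os

def hash_with_salt(password: str, salt: bytes) -> str:
    """Put the salt in front of the password, then hash the combination."""
    return hashlib.sha256(salt + password.encode("utf-8")).hexdigest()

def create_record(password: str):
    """Generate a fresh random salt and return (salt, hash) for storage."""
    salt = os.urandom(16)  # 16 random bytes; the length is an arbitrary choice
    return salt, hash_with_salt(password, salt)

def verify(password: str, salt: bytes, stored_hash: str) -> bool:
    """Re-hash the submitted password with the stored salt and compare."""
    return hmac.compare_digest(hash_with_salt(password, salt), stored_hash)

# Two people choose the same password but receive different salts.
salt_a, hash_a = create_record("JSmith123")
salt_b, hash_b = create_record("JSmith123")
print(hash_a == hash_b, verify("JSmith123", salt_a, hash_a))
```

The salt is stored alongside the hash value so that the same salt can be applied when the person logs in again.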
Is hashing the same as encryption?
Hashing and encryption can seem similar—both convert data into an unreadable state, and protect data from being used in some way it shouldn’t be used. However, they’re different processes that are used for different situations. The key difference between hashing and encryption is that hashing is irreversible, while encryption must be reversible.
Encryption is often used to securely transmit and store data that ultimately will be read and used again. This is often referred to as maintaining the “confidentiality” of the data. Common places to find encrypted data include email transmission and sensitive data stored in databases.
With hashing, there’s no intention of ever reading the hashed data again. The purpose of a stored hash value is to act as test data to match against other hash values. When this matching is done to make sure the data hasn’t been tampered with, it’s called maintaining the “integrity” of the data. A hash value has little value by itself—its value lies in its ability to aid in handling and verifying data securely or efficiently.
Message authentication and digital signatures use both hashing and encryption. The original message is hashed. The hash value and the original message are both encrypted and sent separately. At the destination, the encrypted hash value is decrypted, and the received message is decrypted and hashed. These two hash values are then compared. If they match, the message is considered unaltered. A digital signature can also verify the sender’s identity by requiring and confirming that the sender used an encryption key specific to them.
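Python’s standard library doesn’t include general-purpose encryption, so the sketch below swaps in a related technique, HMAC, rather than the encrypt-the-hash flow described above. HMAC mixes a secret key shared by sender and receiver into the hash, so only someone holding the key can produce a matching value. The key, the message, and the function names are hypothetical, and the message itself is not encrypted here.

```python
import hashlib
import hmac

def make_tag(key: bytes, message: bytes) -> str:
    """Sender side: compute a keyed hash (HMAC-SHA256) of the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def is_authentic(key: bytes, message: bytes, tag: str) -> bool:
    """Receiver side: recompute the keyed hash and compare it with the tag."""
    return hmac.compare_digest(make_tag(key, message), tag)

shared_key = b"hypothetical-shared-secret"
message = b"Transfer 10 to account 42"
tag = make_tag(shared_key, message)

received_ok = is_authentic(shared_key, message, tag)
received_tampered = is_authentic(shared_key, b"Transfer 1000 to account 42", tag)
print(received_ok, received_tampered)
```

As in the description above, the receiver compares a freshly computed hash value with the one that accompanied the message, and a mismatch signals that the content changed in transit.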