Imagine being able to transform a non-trivial data object like an item in a video game into a trivialized number that is able to uniquely (or tries to uniquely) identify that specific instance of the object. You may be thinking, "what good would that do?" As it turns out, there is a lot that can be done with this kind of data. One application is in a BST that is composed of non-trivial values. By deriving a number from an object, it is possible to compare them and store them in an efficient manner. What if, instead, we had an array that was indexed by these uniquely identifiable numbers; if we knew how to quickly derive that number from an object, it would make finding the object in an array nearly instantaneous! The best part is, this kind of practice does exist, and it's called hashing.

Assumptions

Proficiency in linked lists
Knowledge of Big-O notation

What is Hashing?

Hashing, with respect to data structures, is the process of deriving a numeric value - known as a Hash Code - from an object, most commonly a string. Hashing has applications ranging from data access to security tokenization and cryptography.

The hash function accepts the object as input and gives back the hash code as output. Theoretically speaking, the ideal hashing function maps input to output in a one-to-one relation; for every unique item input, there is a unique hash code returned.

In practice, however, it is very difficult to achieve this. Every now and then, two or more distinct objects may generate matching hash codes in a many-to-one relation. The occurrence of identical hashes being generated is known as a collision.

Collisions can be problem points, however with an effective hashing function the programmer should be highly confident that there is a very small probability of having collisions.

How to Hash (Briefly)

Though the method will not be gone over in great detail as it can vary for each implementation, it is worth learning the basics of how a hashing function operates under the hood, though many modern IDE's such as Eclipse offer built in support for generating hashCode, and equals methods. > Also, it should be known that if objectA "equals" objectB, that is, all their values equal each other, Object A and Object B should both generate the same hash code.

For this example, we will generate a hash code of a string type object. When breaking down the definition of a string, we see that it is an array of characters. Luckily, characters in programming languages can be represented by an ascii value. By performing some combination of operations on the character numbers, we can come up with a numeric value that can be used to identify a string. From here we could come up with a very simple version of a HashCode function that works like so:

/* pseudo-code */
function notGreatHashCode(){
    int hash = 0;
    for each ( character c in string ) {
        // In this example, the hash is the sum of the characters.
        hash = hash + c;
    }
    return hash;
}

While this example is very simple, it is still easy for collisions to occur; because it is only a summation of the character ascii values, two strings will collide if they have the same characters. This means "hello says world", "world says hello", and even "olleh syas dlrow" will collide! There must be some way that we can make this better.

The best thing that can be done here is to take care of the case that was just given. A simpler case that would also collide is "AB" and "BA". By introducing a multiplication into the equation as well as the current addition, this case can be eliminated. Many implementations of hashCode will compute in a similar fashion for a string:

/* pseudo-code */
function betterHashCode(){
    int hash = 0;
    // Important!
    int constant = 31;
    for each character c in string {
        hash = (hash * constant) + c;
    }
    return hash;
}

Lets break this down:

Like before, each character is iterated on to contribute to the overall hash value.
This time, instead of simply adding the current character value to the running hash, the hash is first multiplied by a constant value, and then the character value is added onto that new hash value.

Why 31?

In that example, 31 seems like a random number to pick. However, 31 is actually a very well-thought out number to place there. 31 is a prime number, meaning that it does not behave the same way that nice, power-of-two numbers might such as 32.

The result of multiplying by 31 in practice makes for not only much more diverse hash codes with low chances of colliding, but also has been shown in practice to produce fairly evenly-distributed hash codes.

This last sentence might have raised a question: if the only concern with generating hash codes is to make sure they are not the same, why should we care if they are evenly distributed or not?

Calling back to one of the earlier ideas mentioned in this article, uniform distribution means that if these hash code values were to be placed into an array S, the elements would be able to be inserted still at a low chance of colliding with one another, even if the array was small in comparison to the hash code value. This idea is the basic premise of a well-used data structure called a Hash Table.

Basic Hash Table Implementation

A Hash Table is a data structure that is made up of an array of size S, where the placement of each element is based on it's own hash code. The indices of the hash table array are often referred to as buckets. The data within the hash table is represented by key value pairs, which are pairs of values comprised of a key which is used to identify an object, and a value which is the relevant data that is associated with that key.

For instance, the name of a book is a key; a value that acts as an identifier for a specific book. The value that is associated with that key is the book contents, which can be represented as a string of text.

Conceptualization

In this example, suppose the hash table has 16 "buckets" (an array of size 16).

Each bucket contains an object representing a book with the book's title acting as the key, and a string of the book's context as the value. To find out which bucket to place a book into, we can perform a hash code function on the key of the object, in this case being the book title.

The first book being inserted into the hash table is "Sherlock Holmes". To find out which bucket to place this into, we'll first obtain the hash code of the title.

For the sake of example, the hash code output from the title is 547. Now that we have that number identifying the object title, we need to insert it into a bucket in the hash table. We only have 16 buckets, so we can perform a modulus operation to find a number between 0 and 15:

<div style="width:100%;text-align:center;font-size:120%;">547 % 16 = 3 </div> From this operation we have derived that a new object will be placed into bucket 3 with a key of "Sherlock Holmes", and a value of the book's contents.

Let's insert another book now. This time, "Treasure Island" is inserted. Once more for example purposes, say the hash code of the book title is 732. Performing another modulus operation:

<div style="width:100%;text-align:center;font-size:120%;">732 % 16 = 12</div> We insert a new object into bucket 12 with a key of "Treasure Island" and a value of the book's contents.

After performing the operations, the two objects are placed into bucket's 3 and 12 respectively. If we wanted to retrieve an item from a hash table given the book's title, all we need to do is this same operation, and we can retrieve the item from the array, on average, in O(1) time!

> The worst case scenario found when a poorly implemented hash function is used can turn into O(n) time on average, and it can bee seen why next.

What about collisions?

You might have reasoned that if there are only 16 buckets in the hash table example, there's bound to be at least one collision, especially if a 17th book is inserted into the hash table. When implementing a hash table, you would want to have more than 16 buckets, however that still does not address the question properly. In reality, collisions happen, and there is a way to handle them: using linked lists!

Let's modify the hash table we've defined up to this point. Instead of each index being a simple array of a key value pair, let's make it an array of hash table nodes. To summarize, our hash table will effectively be an array of linked lists. To put this in code, let's observe:

/* pseudo-code */
// The hash Node structure
class HashNode {
    // The node key
    String key;
    // The value associated with the key
    String value;
    // The next node in that chain.
    HashNode next;
}

// Using the hash node to form a Hash Table structure.
class SimpleHashTable{
    // The number of buckets in the table. 
    int numBuckets;
    // The actual Hash Table, an array of Hash Nodes. 
    HashNode[] buckets;
}

Now back to the big question: utilizing linked lists, we can insert the colliding node as the next node in a bucket's sequence when two objects get assigned to the same bucket.

To help visualize this, let's go back to the example. Say a third book is inserted into the hash table with the title "Meditations" as the key. Once more, suppose the hash code works out to 643. Performing a modulus operation on this hash code results in the book being placed into bucket 3. But wait! We've already placed "Sherlock Holmes" into bucket 3!

Utilizing our new HashNode structure, we can resolve this conflict by inserting "Meditations" as the next book in the sequence of HashNodes in bucket 3. Let's put this into a visual to get a better view:

Now that we've seen this visually, let's see what this looks like in code:

/* pseudo-code */
int numBuckets = 16;
// Our hash table representation
HashNode[] buckets = new HashNode[numBuckets];

// insert a key value pair into the buckets.
function put( String key, String value ){
    // (1)
    int bucket = key.hashCode() % numBuckets;
    // (2)
    if ( buckets[bucket] == null ){
        Buckets[bucket] = new HashNode(key, value);
    }
    // (3)
    else {
        HashNode tmp = buckets[bucket];
        while ( tmp.next != null ){
            // (4)
            if (tmp.key.equals(key)) {
                tmp.value = value;
                return;
            }
            tmp = tmp.next;
        }
        // (5)
        tmp.next = new HashNode(key, value);
    }
}

Key points from the code:

Here, we find the bucket that the new item should be put into by calculating its hash code and dividing it into the number of buckets.
If the designated bucket doesn't have any HashNode in it then we insert a new node at that index.
If there is already a Hash Node present in that bucket, as was the case from the example, then we begin iterating the linked list in that bucket until we are at the end of the linked list
While iterating the linked list in a specific bucket, we check if the key being inserted is already present in the linked list. If it is, then we update the value associated with that key and call it a day.
Once the end of the linked list is found, if the key was not already present in the list, then we insert it onto the end.

> To touch back on a point, if our hash code method always returned 1, then all the nodes would be inserted into the same bucket, resulting in an average get case of O(n), which is why an effective hash code method is important!

Now that we've seen how to insert into a Hash Table, we can use the same structure to search, get, and remove elements!

When to use Hashing

Hash Tables are only one of the many use cases of hashing algorithms. Overall the use cases of hashing include

Storing objects in Hash Tables and other Hash-based structures.
A loose, efficient method for deriving object equivalence, with the caveat of wanting to check further that it is actually equal, but not just a collision.
Creating bloom filters in post processing stacks that can be applied to photos and in video games and other digital media.
Certain hashing algorithms are well suited for cryptography to protect data that can only be feasibly decrypted using a secret key.