EECS 311: Hashtables

## Terminology

Terms related to hashtables you should be familiar with:

• hash functions
• keys
• collisions and collision resolution
• synonyms
• linear probing
• double hashing

## The Basic Idea

The basic idea is simple. To store data under a given key (usually an integer or string),

• use a simple one-dimensional array for the data, A
• to store a record r containing a key k:

A[hash(k)] = r

hash is a function that takes any key and returns a valid index into A.
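In Python, the idea looks something like this (a toy sketch that ignores collisions entirely; `TABLE_SIZE`, `hash_key`, `store`, and `retrieve` are illustrative names):

```python
TABLE_SIZE = 11                       # a small prime, for illustration

def hash_key(k):
    """Map a string key to a valid index into A (a toy hash)."""
    return sum(ord(c) for c in k) % TABLE_SIZE

A = [None] * TABLE_SIZE               # the one-dimensional array

def store(k, record):
    A[hash_key(k)] = record           # A[hash(k)] = r  (collisions ignored!)

def retrieve(k):
    return A[hash_key(k)]

store("cat", {"key": "cat", "legs": 4})
```

Whatever hash function we use, storing and retrieving each cost one hash plus one array access.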

Any function will work, but to get the desired features of hashtables, hash must

• be cheap to calculate
• effectively make a hash of any "clumpiness" in the distribution of the keys

If both of these properties hold, and the table is not too full (at least 20% empty), then

• storage and retrieval are constant time operations

This obviously is better than the best binary search tree algorithms we had (AVL and red-black), which were O(log N) time operations.

There is no free lunch, of course. The price we pay is:

• in the worst case, O(N) time, but this is highly unlikely as long as the table is kept from filling
• wasted space
• no traversal algorithm to get the stored data out in sorted order

Hashtables aren't really constant time. As the textbook clearly points out, access time is a function of density, i.e., how full the table is. This means that we can get as close to a constant time retrieval as we want, if we're willing to make the table big enough.

## Making It Work

There are two key problems to solve in making hashtables work well:

• finding a cheap hash function that generates few collisions
• handling cheaply the collisions that do occur

### Minimizing Collisions

Getting cheap hashing functions is easy. Getting low-collision functions is trickier.

There are two basic approaches:

• mess with the key, e.g., shift folding or reverse shift folding
• use a random number generator

In the first case, we move characters in the key around and add parts together, not unlike shuffling a deck of cards to get rid of patterns caused by the previous round of play.

In the second case, we use the calculation formula that converts an internal seed into a random number in a pseudo-random number generator, using the key (converted to an integer in some simple way if need be) as the seed.

In either case, the resulting number has to be turned into a valid index with

value % size(A)

and it's best to have size(A) be a prime number.
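Sketches of both approaches (the chunk size, the LCG constants, and the table size are illustrative choices, not prescribed by the notes):

```python
TABLE_SIZE = 101                         # a prime, as recommended above

def fold_hash(key):
    """Shift folding: break the key into 4-character chunks, treat each
    chunk as a base-256 number, add the chunks up, reduce mod table size."""
    total = 0
    for i in range(0, len(key), 4):
        value = 0
        for c in key[i:i + 4]:
            value = value * 256 + ord(c)
        total += value
    return total % TABLE_SIZE

def prng_hash(key_as_int):
    """One step of a linear congruential generator (glibc's constants),
    using the key -- already converted to an integer -- as the seed."""
    return ((1103515245 * key_as_int + 12345) % 2**31) % TABLE_SIZE
```

Reverse shift folding would additionally reverse every other chunk before adding, to break up patterns in keys that share prefixes or suffixes.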

### Handling Collisions

It's impossible to avoid collisions unless we have a perfect hashing function, and we can only get one of those if we know exactly what keys we're going to get in advance and have a table big enough to hold them all.

Therefore, we need collision resolution strategies, which answer the question: when hash(k1) = hash(k2), where should the record with key k2 go? Keys k1 and k2 are called synonyms.

#### Hash to Lists

One simple approach is to keep lists in A:

• To store a record, you hash the key to the appropriate list, and insert it
• To retrieve a record, you hash the key to the appropriate list, and scan the list for the record

As long as there are few collisions, there will usually be only 1 or 2 items in any list, and so retrieval will still be effectively constant time.

This approach is simple and often used. It is wasteful of space, however, compared to the following approaches, and there is no bound on how big the hashtable may get.
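A minimal sketch of hash-to-lists (often called separate chaining; the class and method names are my own, and Python's built-in `hash` stands in for the hash function):

```python
class ChainedHashTable:
    """Each array slot holds a list of (key, record) pairs."""

    def __init__(self, size=13):                 # 13: a small prime
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % self.size

    def store(self, key, record):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                         # key already present:
                bucket[i] = (key, record)        # replace its record
                return
        bucket.append((key, record))

    def retrieve(self, key):
        for k, record in self.buckets[self._index(key)]:
            if k == key:
                return record
        return None                              # not found
```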

#### Linear Probing

Linear probing is a straw man approach. You never want to use it, but it's a good starting point for understanding the other approaches.

The idea is simple:

• To store a record,
• hash the key to the appropriate index i
• while A[i] is not empty, increment i (wrap around if need be)
• store the record in A[i]
• To retrieve a record,
• hash the key k to the appropriate index i
• while A[i] is not empty and contains a record whose key is not k, increment i (wrap around if need be)
• if an empty A[i] is found, return "not found," otherwise return A[i]
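The steps above can be sketched in Python (function names and the toy hash are illustrative; deletion, which complicates probing schemes, is ignored):

```python
def lp_hash(key, size):
    """Toy hash: sum of character codes, mod table size."""
    return sum(ord(c) for c in key) % size

def lp_store(A, key, record):
    """Linear-probing store; assumes the table is never completely full."""
    i = lp_hash(key, len(A))
    while A[i] is not None and A[i][0] != key:
        i = (i + 1) % len(A)             # wrap around if need be
    A[i] = (key, record)

def lp_retrieve(A, key):
    i = lp_hash(key, len(A))
    while A[i] is not None:
        if A[i][0] == key:
            return A[i][1]
        i = (i + 1) % len(A)             # wrap around if need be
    return None                          # empty slot reached: not found
```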

The problem is that collisions pile up and form clusters or chains of adjacent synonyms. Furthermore, these clusters can merge with other clusters and make even larger clusters.

We can avoid the merging of clusters if we "hopscotch" around more looking for places to put collisions.

Quadratic probing modifies linear probing as follows:

• Let skip = 0
• While A[(i + skip*skip) mod size(A)] is not empty,
• skip = skip + 1
• Store the record at A[(i + skip*skip) mod size(A)]

This will try i, i + 1, i + 4, i + 9, and so on (all taken mod the table size).

Problems with this approach are:

• you might miss some of the empty spots in the table (though you can prove that if the table size is prime and the table is no more than half full, you'll always find an empty spot)
• synonyms will go to the same places -- you avoid cluster merging but not clusters
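The probe sequence is easy to see with a small helper (`quadratic_probes` is a hypothetical name, used only for illustration):

```python
def quadratic_probes(i, size, count):
    """Return the first `count` quadratic-probe indices from home
    position i: i, i + 1, i + 4, i + 9, ... all taken mod size."""
    return [(i + skip * skip) % size for skip in range(count)]

# Starting at index 3 in a table of prime size 11:
print(quadratic_probes(3, 11, 5))    # [3, 4, 7, 1, 8]
```

Note that two synonyms starting at the same home position generate exactly the same sequence, which is the clustering problem mentioned above.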

#### Double Hashing

To stop synonyms from chasing each other, you can use a secondary hash function to calculate the skip amount. First you hash the key k as usual. If you get a collision, you modify linear probing thus:

• Let skip = h'(k)
• While A[i] is not empty, i = (i + skip) mod size(A)

Assuming you've picked an independent secondary hash function, the odds that two keys share the same values for both h and h' are small enough to make clustering a non-issue. If the table size is prime (and the skip is never zero), the probe sequence visits every slot, so we must eventually find an empty one, if any exists.
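A sketch of the insert loop for integer keys (the names and the secondary hash are illustrative; h'(k) = R - (k mod R), with R a prime smaller than the table size, is a common choice because it never yields zero):

```python
def dh_insert(A, key):
    """Double-hashing insert; assumes len(A) is prime and the table
    has at least one empty slot."""
    size = len(A)
    R = 5                              # a prime smaller than the table size
    i = key % size                     # primary hash h
    skip = R - (key % R)               # secondary hash h' -- never zero
    while A[i] is not None:
        i = (i + skip) % size          # probe by this key's own skip amount
    A[i] = key
    return i

A = [None] * 7
dh_insert(A, 3)      # lands at index 3
dh_insert(A, 10)     # 10 mod 7 = 3 collides; skip = 5, so it lands at index 1
```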

## Applications of Hashtables

### Keyed Collections/Associative Tables

A number of programming languages, particularly scripting languages like Hypertalk, JavaScript, awk, and Perl, provide what are called associative tables or keyed collections.

This is a very simple and useful abstract data type. A keyed collection C lets us store and retrieve data using keys -- especially strings -- for indices, the way arrays let us use integers for indices:

• Storage: C[key] = data
• Retrieval: C[key]

For example, to count all the words in a file, assuming we have a function that reads the next word as a string, we'd write something like this:

while not eof(stream) do
  word = readWord(stream)
  C[word] = C[word] + 1

Hard to beat that for simplicity. (Here readWord is the word-reading function we assumed, and missing entries default to 0, as they do in awk and Perl.)

A keyed collection would also be ideal for storing the course and section data for our course scheduler for easy retrieval by course name.

Compilers make heavy use of keyed collections to store data about variable and function names.

### Rabin-Karp String Search

An interesting use of hashing that doesn't involve storing data is the Rabin-Karp probabilistic string search algorithm.

The key idea is this. Normal string search in the worst case is O(N * M), where N is the length of the search string and M is the length of the text being searched. This is because you have to compare up to N characters at each position; if the comparison fails, you shift right one character in the text and start again.

Rabin-Karp's algorithm is expected O(N + M) -- big difference. The algorithm is (roughly)

Let k = h(search-string)
Let m = length(search-string)
Let i = 0
while ( h(substring(text, i, m)) != k ) i++

That is, we keep hashing M-long segments of the text to see if they are a synonym of the search string. Notice that we never store anything in a table, we're just using the hashing idea. If we find a synonym, we check to see if it really is a match.

For this to work, we need a hash function that is very cheap and has a low collision rate. The hash function used in Rabin-Karp takes advantage of the fact that each substring differs from the previous one by only one character, so the next hash can be computed from the previous one in constant time (a "rolling hash").
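A sketch of the whole search with a rolling hash (the base and modulus are illustrative constants; any modulus works, with larger primes giving fewer spurious synonyms):

```python
def rabin_karp(text, pattern, base=256, mod=101):
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    high = pow(base, m - 1, mod)          # weight of the leading character
    ph = th = 0
    for j in range(m):                    # hash the pattern and first window
        ph = (ph * base + ord(pattern[j])) % mod
        th = (th * base + ord(text[j])) % mod
    for i in range(n - m + 1):
        if th == ph and text[i:i + m] == pattern:  # verify real match
            return i
        if i < n - m:                     # roll: drop text[i], add text[i+m]
            th = ((th - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
```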

### Spell Checkers

Ever wonder how spell checkers can read thousands of words and quickly tell if they're in a dictionary with tens of thousands of words? There are lots of pieces to the trick but one element is the hashtable.

The idea, similar to the Rabin-Karp algorithm, is this:

• In advance, hash every word in the dictionary into N hashtables (say 10), each with a different hashing function.
• Now, given a word from the text, hash it into each table. If it's in every one of them, it's probably a word, because the odds of N simultaneous collisions are infinitesimal.

Obviously there's more to this, including handling word morphology and efficiently compacting the N hashtables into something much smaller. The clever compaction comes from noting that each table need only contain bits, and that we only care whether all N bits are true.
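The compacted scheme is essentially what is now called a Bloom filter. A sketch (deriving the N hash functions from salted MD5 digests is my illustrative choice, not from the notes):

```python
import hashlib

N_HASHES, BITS = 5, 1 << 16            # illustrative sizes

def hash_values(word, n=N_HASHES, size=BITS):
    """n different hash values for word, one per salted MD5 digest."""
    return [int(hashlib.md5((str(i) + word).encode()).hexdigest(), 16) % size
            for i in range(n)]

table = [False] * BITS                 # one shared bit table

for w in ["hash", "table", "probe"]:   # the "dictionary", in advance
    for h in hash_values(w):
        table[h] = True

def probably_a_word(w):
    """True if every one of w's N bits is set -- almost certainly a word."""
    return all(table[h] for h in hash_values(w))
```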

But wait! If these spell checkers don't have a dictionary of words available, how do they find words to suggest as corrections? Simple. They apply a set of transformation rules to the misspelled word (transpose every adjacent pair of letters, for every letter c from A - Z, insert c once in each position in the word, drop one letter, etc.). Then they return those transformations that hash successfully into the tables.