EECS 311: Hashtables |

Terms related to hashtables you should be familiar with:

- hash functions
- keys
- collisions and collision resolution
- synonyms
- linear probing
- quadratic probing
- double hashing

The basic idea is simple. To store data under a given key (usually an integer or string),

- use a simple one-dimensional array **A** for the data,
- to store a record **r** containing a key **k**: **A[hash(k)] = r**

**hash** is a function that takes any key and returns a valid
index into **A**.

Any function will work, but to get the desired features of
hashtables, **hash** must

- be cheap to calculate
- effectively make a hash of any "clumpiness" in the distribution of the keys

If both of these properties are true and the table is not too full (at least 20% empty), then

- storage and retrieval are **constant**-time operations

This obviously is better than the best binary search tree algorithms we had (AVL and red-black), which were O(log N) time operations.

There is no free lunch, of course. The price we pay is:

- in the worst case, O(N) time, but this is highly unlikely as long as the table is kept from filling
- wasted space
- no traversal algorithm to get the stored data out in sorted order

Hashtables aren't really constant time. As the textbook clearly
points out, access time is a function of **density**, i.e., how
full the table is. This means that we can get as close to a constant
time retrieval as we want, if we're willing to make the table big
enough.

There are two key problems to solve in making hashtables work well:

- finding a cheap hash function that generates few collisions
- handling cheaply the collisions that do occur

Getting cheap hashing functions is easy. Getting low-collision functions is trickier.

There are two basic approaches:

- mess with the key, e.g., shift folding or reverse shift folding
- use a random number generator

In the first case, we move characters in the key around and add parts together, not unlike shuffling a deck of cards to get rid of patterns caused by the previous round of play.

In the second case, we use the calculation formula that converts an internal seed into a random number in a pseudo-random number generator, using the key (converted to an integer in some simple way if need be) as the seed.

In either case, the resulting number has to be turned into a valid index with

value % size(A)

and it's best to have **size(A)** be a prime number.
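As an illustration, a shift-folding hash for strings might be sketched in Python as below; the 4-character chunk width and the prime table size 101 are assumptions chosen for the example, not values fixed by these notes:

```python
def shift_fold_hash(key: str, table_size: int = 101) -> int:
    """Shift folding: split the key into 4-character chunks,
    pack each chunk into an integer, and sum the parts.
    table_size should be prime, per the note above."""
    total = 0
    for start in range(0, len(key), 4):
        part = 0
        for ch in key[start:start + 4]:
            part = part * 256 + ord(ch)  # pack chunk bytes into one integer
        total += part
    return total % table_size            # turn the sum into a valid index
```

Summing the chunks mixes high and low parts of the key, which helps break up "clumpiness" among similar keys.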

It's impossible to avoid collisions unless we have a perfect hashing function, and we can only get one of those if we know exactly what keys we're going to get in advance and have a table big enough to hold them all.

Therefore, we need collision resolution strategies, which answer
the question "when **hash(k1) = hash(k2)** where should the record
with **k2** go?" **k1** and **k2** are called synonyms.

One simple approach is to keep lists in **A**:

- To store a record, you hash the key to the appropriate list, and insert it
- To retrieve a record, you hash the key to the appropriate list, and scan the list for the record

As long as there are few collisions, there will usually be only 1 or 2 items in any list, and so retrieval will still be effectively constant time.

This approach is simple and often used. It is wasteful of space, however, compared to the following approaches, and there is no bound on how big the hashtable may get.
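A minimal sketch of the lists-in-**A** (separate chaining) scheme in Python, using Python's built-in `hash` in place of the hash functions above; the class and method names are ours:

```python
class ChainedHashTable:
    """Hashtable with separate chaining: each slot of A holds a
    list of (key, record) pairs. Sketch only; Python's built-in
    dict is the production-quality equivalent."""

    def __init__(self, size: int = 101):   # prime size, per the note above
        self.slots = [[] for _ in range(size)]

    def _index(self, key) -> int:
        return hash(key) % len(self.slots)

    def store(self, key, record):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                   # key already present: overwrite
                chain[i] = (key, record)
                return
        chain.append((key, record))

    def retrieve(self, key):
        for k, r in self.slots[self._index(key)]:
            if k == key:
                return r
        return None                        # "not found"
```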

Linear probing is a straw man approach. You never want to use it, but it's a good starting point for understanding the other approaches.

The idea is simple:

- To store a record,
  - hash the key to the appropriate index **i**
  - while **A[i]** is not empty, increment **i** (wrap around if need be)
  - store the record in **A[i]**
- To retrieve a record,
  - hash the key **k** to the appropriate index **i**
  - while **A[i]** is not empty and contains a record whose key is not **k**, increment **i** (wrap around if need be)
  - if an empty **A[i]** is found, return "not found"; otherwise return **A[i]**

The problem is that collisions pile up and form clusters or chains of adjacent synonyms. Furthermore, these clusters can merge with other clusters and make even larger clusters.
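The store and retrieve loops for linear probing can be sketched in Python as follows; the code assumes the table is never completely full, since otherwise the probe loop would never terminate (the function names are ours):

```python
def linear_probe_store(A, key, record, hash_fn):
    """Linear probing insert: start at hash_fn(key) and walk
    forward (wrapping around) until an empty slot is found.
    A is a fixed-size list using None for empty slots."""
    i = hash_fn(key) % len(A)
    while A[i] is not None:
        i = (i + 1) % len(A)          # wrap around if need be
    A[i] = (key, record)

def linear_probe_retrieve(A, key, hash_fn):
    """Probe forward from the home slot until the key or an
    empty slot is found."""
    i = hash_fn(key) % len(A)
    while A[i] is not None:
        if A[i][0] == key:
            return A[i][1]
        i = (i + 1) % len(A)
    return None                       # hit an empty slot: not found
```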

We can avoid the merging of clusters if we "hopscotch" around more looking for places to put collisions.

Quadratic probing modifies linear probing as follows:

- Let **skip** = 0
- While **A[i + skip*skip]** is not empty, **skip = skip + 1**

This will try **i**, **i + 1**, **i + 4**, **i + 9**, and so on.

Problems with this approach are:

- you might miss some of the empty spots in the table (but note that you can prove that if the table size is prime, you'll always eventually find an empty spot if there is one)
- synonyms will go to the same places -- you avoid cluster merging but not clusters
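The probe sequence, and the fact that a prime table size keeps the early probes distinct, is easy to check with a short sketch (the helper name is ours, not from the notes):

```python
def quadratic_probe_indices(i, table_size, limit=10):
    """The slots quadratic probing tries from home index i:
    i, i + 1, i + 4, i + 9, ... (mod table_size)."""
    return [(i + skip * skip) % table_size for skip in range(limit)]
```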

To stop synonyms from chasing each other, you can use a secondary
hash function to calculate the **skip** amount. First you hash the
key **k** as usual. If you get a collision, you modify linear
probing thus:

- Let **skip** = **h'(k)**
- While **A[i]** is not empty, **i = i + skip**

Assuming you've picked an independent secondary hash function, the
odds of two keys having the same values for both **h** and **h'**
are small enough to make clustering a non-issue. If the table size is
a prime, we also know that eventually we have to find an empty spot,
if any exists.
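A sketch of the double-hashing store loop; the **h** and **h'** used below are stand-in lambdas chosen to force a collision, and note that **h'** must never return 0 or the probe would not move:

```python
def double_hash_store(A, key, record, h, h2):
    """Double hashing: on collision, step by a key-dependent
    skip = h2(key) instead of 1. len(A) should be prime so
    every slot is eventually reachable."""
    i = h(key) % len(A)
    skip = h2(key)                    # secondary hash; must be nonzero
    while A[i] is not None:
        i = (i + skip) % len(A)       # wrap around if need be
    A[i] = (key, record)
```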

A number of programming languages, particularly scripting languages like HyperTalk, JavaScript, awk, and Perl, provide what are called associative tables or keyed collections.

This is a very simple and useful abstract data type. A keyed
collection **C** lets us store and retrieve data using keys --
especially strings -- for indices, the way arrays let us use integers
for indices:

- Storage: **C[key] = data**
- Retrieval: **C[key]**

For example, to count all the words in a file, assuming we have a function that reads the next word as a string, we'd write something like this:

while not eof(stream) do counts[readWord(stream)]++

Hard to beat that for simplicity.
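In Python, whose dictionaries are hashtables, the same word count looks almost identical; `stream` is assumed here to be any iterable of word strings rather than a file handle:

```python
from collections import defaultdict

def count_words(stream):
    """Word counting with a keyed collection, mirroring the
    one-liner above. defaultdict(int) makes unseen keys
    start at 0, so counts[word] += 1 just works."""
    counts = defaultdict(int)     # Python dicts are hashtables
    for word in stream:
        counts[word] += 1
    return counts
```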

A keyed collection would also be ideal for storing the course and section data for our course scheduler for easy retrieval by course name.

Compilers make heavy use of keyed collections to store data about variable and function names.

An interesting use of hashing that doesn't involve storing data is the Rabin-Karp probabilistic string search algorithm.

The key idea is this. Normal string search in the worst case is O(N * M), where N is the length of the search string and M is the length of the text being searched. This is because you have to compare up to N characters at each position; then, if the comparison fails, you shift right one character in the text and start again.

Rabin-Karp's algorithm is O(N + M) -- big difference. The algorithm is (roughly)

Let k = h(search-string)
Let n = length(search-string)
Let i = 0
while ( h(substring(text, i, n)) != k ) i++

That is, we keep hashing N-long segments of the text to see if they are a synonym of the search string. Notice that we never store anything in a table; we're just using the hashing idea. If we find a synonym, we check to see if it really is a match.

For this to work, we need a hash function that is very cheap and has a low collision rate. The hash function used in Rabin-Karp takes advantage of the fact that each substring is only one character different from the previous one, so that recalculating the next hash is a cheap operation.
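A sketch of the algorithm with such a rolling hash in Python; the base 256 and prime 101 are illustrative parameters, and the explicit string comparison handles the case where a synonym is not an actual match:

```python
def rabin_karp(text: str, pattern: str, base=256, prime=101):
    """Rabin-Karp sketch: hash the pattern, then roll an n-long
    window across the text. Updating the window hash drops the
    leading character and appends the new one in O(1)."""
    n, m = len(pattern), len(text)
    if n > m:
        return -1
    high = pow(base, n - 1, prime)        # weight of the leading character
    ph = th = 0
    for i in range(n):                    # initial hashes of pattern, window
        ph = (ph * base + ord(pattern[i])) % prime
        th = (th * base + ord(text[i])) % prime
    for i in range(m - n + 1):
        if th == ph and text[i:i + n] == pattern:  # verify synonyms
            return i
        if i < m - n:                     # roll the window one character
            th = ((th - ord(text[i]) * high) * base
                  + ord(text[i + n])) % prime
    return -1
```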

Ever wonder how spell checkers can read thousands of words and quickly tell if they're in a dictionary with tens of thousands of words? There are lots of pieces to the trick but one element is the hashtable.

The idea, similar to the Rabin-Karp algorithm, is this:

- In advance, hash every word in the dictionary into N hashtables (say 10), each with a different hashing function.
- Now, given a word from a text, hash it into each table. If it's in every one of them, it's probably a word, because the odds of N collisions are infinitesimal.
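This N-tables-of-bits scheme is what is now called a Bloom filter. A minimal sketch, with salted uses of Python's built-in `hash` standing in for the N independent hash functions (names and sizes are ours):

```python
class SpellFilter:
    """Bloom-filter-style membership test: N hash functions each
    set one bit; a word is 'probably in the dictionary' only if
    all N of its bits are set."""

    def __init__(self, n_hashes=10, n_bits=1 << 16):
        self.n_hashes = n_hashes
        self.bits = [False] * n_bits

    def _indices(self, word):
        # Salting the built-in hash simulates N independent functions.
        return [hash((salt, word)) % len(self.bits)
                for salt in range(self.n_hashes)]

    def add(self, word):
        for i in self._indices(word):
            self.bits[i] = True

    def probably_contains(self, word):
        return all(self.bits[i] for i in self._indices(word))
```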

Obviously there's more to this, including handling word morphology and efficiently compacting N hashtables into something much smaller. The clever compaction comes from noting that the tables contain only bits and that we only really care about entries that have all N bits true.

But wait! If these spell checkers don't have a dictionary of words
available, how do they find words to suggest as corrections? Simple.
They apply a set of transformation rules to the misspelled word
(transpose every adjacent pair of letters; for every letter **c** from A-Z, insert **c** once in each position in the word; drop one letter; etc.). Then they return those transformations that hash successfully into the tables.
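The transformation rules might be generated like this (the function name and lowercase alphabet are ours; real checkers also handle case, punctuation, and morphology):

```python
def candidate_edits(word: str, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Generate correction candidates: transpose each adjacent
    pair, insert each letter at each position, and drop each
    single letter. Candidates would then be filtered by hashing
    into the dictionary tables."""
    transposes = {word[:i] + word[i + 1] + word[i] + word[i + 2:]
                  for i in range(len(word) - 1)}
    inserts = {word[:i] + c + word[i:]
               for i in range(len(word) + 1) for c in alphabet}
    deletes = {word[:i] + word[i + 1:] for i in range(len(word))}
    return transposes | inserts | deletes
```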

*Comments? Send mail to* c-riesbeck@northwestern.edu.