Wiki: VERC: Hash Tables

Welcome to the wonderful world of hash tables. Curiosly enough, hash has nothing to do with this article at all, and I have no idea why they call it a hash table anyway. Ironically enough, Webster's Dictionary defines hash as a synonym for muddle or "to confuse"; hodgepodge was the exact wording if I recall. But don't let that discourage you. Let me get back on topic.

Hash tables, hash tables... just what the heck are they? There are many different organization styles for a hash table, but this article will focus on one: chaining. It's a hybrid between an array and a linked list, and its efficiency over the latter two depends on the programmer's ability to create a good hashing function - don't worry, I'll explain everything.

A hash table is an array of pointers, each which points to the beginning of a linked list. That's it! The efficiency of the system comes from the fact that array lookups are performed at godly speeds compared to linked lists iterations, but if the first array lookup doesn't satisfy the job, the linked list at that location is iterated until the requested information is found! If your array is designed well, you mimimize the number of lookups. Basically, it tackles the issue of speed associated with linked lists and the issue of memory usage associated with arrays. Here's a picture for you:

I mentioned earlier that the programmer plays a big role in determining the efficiency of the hash table. Hash tables are commonly used to store records and other database related things. It follows that a hash table must have some way of quickly finding and storing data. But how? This is where the hashing function comes in. It's sole job is to translate raw data into an index for the hash array. So you can see that this function must be well designed in order to be fast, and have an algorithm that produces varied indexes - remember that we fall back on linked lists if there is more than one record at a certain index.

The hashing function need not be overly complicated - just enough to quickly generate a nice range of indexes. Let's look at an example. Suppose we have a dictionary program that stores the definitions of words. One record consists of a word string and a definition string. The most logical organization of this array would be alphabetacally; thus our initial hash table will have 26 pointers, one for each letter of the alphabet. The following function will take a word (a.k.a "raw data") and convert it into an index:

int HashIt(const char *word)
{
    return ((*word) - 'a');
}

This makes the assumption that the word is in lowercase, or at least the first letter, but you get the idea - the proper index is returned. Of course, you might want to do some error checking for characters that would return negative indexes or those larger than the array, but it should be kept outside our hashing function.

At any rate, once you have the index, you jump to that point in the array (lookup #1) and iterate through the linked list at that position until you find the word you are looking for.

That last example wasn't that practical. For example, I can think of hundreds (ok, maybe only a couple dozen) of words off the top of my head that start with the letter 's', and they'll all be linked at the same index. The chances that the desired word will be near the beginning of the list get slim, and it would degenerate into your everyday linked list iteration. But one thing's for sure. It beats a regular linked list for things such as word associations (or whatever the term I should use is).

Hash table indexes don't really need to rely upon a logical relationship with the data it's "hashing". As long as it does its job, go ahead and use it. Lastly, there's the issue of how large to initialize the hash array to (hey, stop staring at my dangling participle!!). For something as simple as a dictionary, 26 works for basic alphabetizing. But let's suppose you want to store these words based on the first two letters. That would make our array 26 2 elements long, or 676. It works out that this array would require about 2.5 K more in memory for the pointers. That's nothing, yet we would optimize the speed of the hash table by reducing the number of words per index. Sweet. And if we go for three letters? 17576 entries, 66 K more - again, additional optimization with an unoticable increase in pointer memory.

Well there you have it - hash tables. Not as notorious as linked lists or arrays (because they require a bit more in terms of implementation), but they serve their purpose well. Remember, they can be used for more than just character data. However, don't use them too much. They should be used primarily when record-style information (which is also referred to as "associated data" in the context of a hash table) is being stored and high-speed access in required. Like everything else in C++, and programming in general, they aren't the new solution to all our data storage needs. However, when used appropiatly, you can really whip up some nice applications.

Recommended Reading: Hash Tables

VERC: Hash Tables Last edited 2 years ago2022-09-29 07:54:53 UTC

Comments