Title: What is a Hash Table?

Author:

Date: 03 October 2002 13:15:55 +01:00 or Thu, 03 October 2002 13:15:55 +01:00

Summary:

Body:

A hash table is a data structure used for associating keys with values, with the goal of efficient storage space and fast access. Each item has a key, which the table's hash function will use to calculate the "hash value" for that item. The hash value indicates where in the table the item should be. The use of the hash function means that indexing into the table can be performed in constant time, regardless of the number of items in the table. A key can be of any type, but a hash value will always be an integer, as it represents the item's position in the table.

A hash table where each key results in a unique hash value is called directly accessible. This is the "ideal" hash table, because searching for items will always take O(1) time. However, in most cases this type of hash table is not practical, as it tends to be wasteful of memory. More common is the chained hash table, where many keys will result in the same hash value.

A chained hash table can be organised as an array of linked lists. The array positions correspond to the hash values. The linked lists are called buckets, and store items with keys that all translate to the same hash value. When a bucket contains more than one item, it is known as a collision.

As an example, a chained hash table is used below to store city information. A structure is defined to hold the city's name as a string, and the population in thousands as an integer. The pointer to another city structure is used for chaining cities in the one bucket. A very simple hash function has been chosen, where the key is the city's name, and the hash value is the length of the name. The buckets are not sorted, so each item is simply added at the start of the linked list.

The print_table function prints the name and population of every city in the table, one bucket at a time. Five cities are added to the table, and then print_table is called. The output of the program is:

Hash Table Contents:
Bucket 0: End.
Bucket 1: End.
Bucket 2: End.
Bucket 3: End.
Bucket 4: End.
Bucket 5: End.
Bucket 6: Dundee, population 147000; End.
Bucket 7: Glasgow, population 607000; End.
Bucket 8: Aberdeen, population 190000; End.
Bucket 9: Inverness, population 41000; Edinburgh, population 402000; End.
Bucket 10: End.
Bucket 11: End.
Bucket 12: End.
Bucket 13: End.
Bucket 14: End.

This is a perfectly valid hash table, but it is not a very good one. Even if more cities are added, the bucket lengths will not be balanced, with most city names colliding in buckets 6 to 9. In addition, buckets 0 to 2 are not likely to contain any cities! This sort of behaviour is poor for two reasons: buckets 0 to 2 are a waste of memory; and searching will be closer to O(n) than the O(1) ideal, because the concentration of items in 4 buckets makes the searching behaviour more like that of a linked list than a hash table.

This problem can be solved by choosing a hash function that will give a more even spread of items between buckets. In addition, it may be advisable to change the number of buckets. If a good hash function is selected that results in a relatively even distribution of items in buckets, then the number of items expected in each bucket will be roughly n/m, where n is the number of items in the table, and m is the number of buckets. If this is a realtively large number, increasing the number of buckets is advisable. In order to help achieve an even spread, m should not be a power of 2. A prime number is an ideal value for m.

As this example only contains five cities, a good value for m is 5. With a good hash function, the expected distribution will be one city in each bucket. However, it is tolerable to have one bucket with two cities, and one with none, as a hash function will never (well, rarely) be perfect.

The most difficult aspect of creating a hash table is deciding on the hash function. Very often the simplest functions will produce extremely uneven results, as in the example above. If the key is not an integer, it is a good idea to massage it in some way, so it can be treated as an integer. This technique was used in the previous example, where the key was a string, but the length of the string was used to produce the hash value. However, to produce better results something more random than the string length must be used to calculate hash values.

A relatively simple method is to add up the ASCII values of every character in the city's name, dividing by a prime number after each addition. This algorithm may not produce good results if the keys are very similar. For example, hashing "malloc", "calloc", and "realloc" with the new function produces very poor results, with all three hashing to the same value. However, there is no obvious pattern in the city names, and the function produces reasonable results, with only one collision.

One final stage must be performed before the integer is a valid hash value. As there are five buckets, the value must be taken modulo 5. This ensures that the hash value is a bucket number. The new hash function looks like:

int hash_func(struct city *item) { 
  int sum, i; 
  for( i = 0, sum = 0; i < strlen (item->name); ++i) 
    sum = (sum + item->name[i]) / 3; 
  return (sum % NUM_BUCKETS); 
}

The new table looks like:

Hash Table Contents:
Bucket 0: Dundee, population 147000; End.
Bucket 1: Inverness, population 41000; End.
Bucket 2: Edinburgh, population 402000; Glasgow, population 607000; End.
Bucket 3: Aberdeen, population 190000; End.
Bucket 4: End.

And, of course, NUM_BUCKETS is now defined as 5, instead of 15.

It should be noted that the example hash table is efficient in terms of memory and indexing speed, but it is not very scalable. If the number of cities were to increase to 25, the time to find an item would be relatively poor. The average number of items in a bucket would be 5, and consequently the search time will tend more towards O(n) than O(1).

Generally, if it is known roughly how many items the table will accommodate, the number of buckets should be tailored to maximise efficiency, as in the given example. However, if the number of items is not known, or may vary greatly, a trade-off should be considered between indexing time and memory usage. More buckets can result in wasted memory, but faster indexing time.

Increasing the number of cities to 10 produces the following results:

Hash Table Contents:
Bucket 0: Havana, population 2241000; Paris, population 9469000; End.
Bucket 1: Bombay, population 15093000; Tokyo, population 26836000; End.
Bucket 2: Moscow, population 9233000; Beijing, population 12362000; End.
Bucket 3: Johannesburg, population 1849000; Perth, population 1220000; End.
Bucket 4: San Francisco, population 3866000; Lima, population 7452000; End.

This is the optimal number of items per bucket. Increasing the number of buckets to 9 gives good results, also:

Hash Table Contents:
Bucket 0: San Francisco, population 3866000; End.
Bucket 1: Paris, population 9469000; End.
Bucket 2: Bombay, population 15093000; Tokyo, population 26836000; End.
Bucket 3: Moscow, population 9233000; End.
Bucket 4: Lima, population 7452000; End.
Bucket 5: Havana, population 2241000; End.
Bucket 6: End.
Bucket 7: Beijing, population 12362000; End.
Bucket 8: Johannesburg, population 1849000; Perth, population 1220000; End.

This shows that the hash function chosen appears to be scalable, whether the number of buckets is increased or not. It can therefore be said that it is a good hash function for these data sets.

Another type of hash table is the Open Addressed hash table. This stores all items in the table array itself; items are never chained. When the hash function results in a collision, an algorithm is used to suggest another bucket. This algorithm is used repeatedly until an empty bucket is found, or all possible buckets have been inspected. This process of trying different buckets is known as probing the table.

The simplest open addressed hash table uses linear probing. This involves probing buckets in numeric order until an empty one is found. The bucket indicated by the hash function is the starting point, and each bucket is inspected in order until the collision is resolved.

Another probing technique is called double hashing. This involves two hash functions: the first to determine the starting bucket; and the second to determine the "stride". The stride is the value added to the bucket number to determine the next bucket to inspect when a collision occurs.[algorithm]

A disadvantage of this sort of table is that the maximum number of items to be inserted must be known in advance. While the buckets of the chained hash table can be increased in size to accommodate items until there is no more memory available, the open addressed hash table's size is always fixed.

An advantage of the open addressed hash table is that the worst case search time is better than that of the chained hash table. In the first table, the worst performance occurs when the hash table is full, but the required item is not in the table. In this situation, searching takes O(m) time, where m is the number of buckets in the table. In the chained table, the worst performance occurs when all items are in one bucket, but the required item is not in the table. In this case, searching takes O(n) time. If n>m, then the open addressed table has the best performance.

Hash tables may seem rather daunting and complex, but this is only because of the many decisions that must be made. Once the type of table, number of buckets, and hash function have been chosen, implementation is quite straight forward. Next time a data structure is required, consider whether a hash table would be effective; it is surprising how often they can be used.

/* hash.c by Victoria Catterson */ 
#include <stdio.h> 
#include <string.h> 
#define MAX_NAME_LEN 15 
#define NUM_BUCKETS 15 

struct city { 
  char name[MAX_NAME_LEN]; 
  int population; /* in thousands */ 
  struct city *next;
}; 
/* This is the array of linked lists; the hash table itself. */ 
struct city buckets[NUM_BUCKETS]; 

int hash_func(struct city *item) { 
  return (strlen (item->name)); 
} 
void add_item(struct city *head, struct city *item) { 
  item->next = head->next; 
  head->next = item; 
}
void print_table(void) { 
  int i; 
  struct city *current; 
  printf("Hash Table Contents:\n"); 
  for(i = 0; i < NUM_BUCKETS; ++i) { 
    current = buckets[i].next; 
    printf("Bucket %d: ", i); 
    while(current != NULL) { 
      printf("%s, population %d000;", current->name, current->population); 
      current = current->next; 
    }
    printf("End.\n"); 
  } 
  fflush(stdout); 
} 

int main (void) { 
  struct city glasgow, edinburgh, aberdeen, dundee, inverness; 
  int hash_val; 

  strncpy(glasgow.name, "Glasgow", MAX_NAME_LEN); 
  glasgow.population = 607; 
  hash_val = hash_func(&glasgow); 
  add_item(&buckets[hash_val], &glasgow); 

  strncpy(edinburgh.name, "Edinburgh", MAX_NAME_LEN); 
  edinburgh.population = 402; 
  hash_val = hash_func(&edinburgh); 
  add_item(&buckets[hash_val], &edinburgh); 

  strncpy(aberdeen.name, "Aberdeen", MAX_NAME_LEN); 
  aberdeen.population = 190; 
  hash_val = hash_func(&aberdeen); 
  add_item(&buckets[hash_val], &aberdeen); 

  strncpy(dundee.name, "Dundee", MAX_NAME_LEN); 
  dundee.population = 147; 
  hash_val = hash_func(&dundee); 
  add_item(&buckets[hash_val], &dundee); 

  strncpy(inverness.name, "Inverness", MAX_NAME_LEN); 
  inverness.population = 41; 
  hash_val = hash_func(&inverness); 
  add_item(&buckets[hash_val], &inverness); 

  print_table();
  return 0; 
}

References

[algorithm] For more information on double hashing and other probing algorithms see http://www.cise.ufl.edu/~sahni/dsaaj/enrich/cll/overflow.htm

[scottish-city-population] Scottish city population statistics: 1991 Census. http://www.siliconglen.com/scotfaq/1_6.htm

[world-city-population] World city population statistics: The Times Atlas of the World, 2001

Notes:

More fields may be available via dynamicdata ..