    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: What is a Hash Table?</title>
        <link>https://members.accu.org/index.php/articles/1202</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + CVu Journal Vol 14, #5 - Oct 2002</span></div>

<table border="0" cellpadding="1" cellspacing="0">
    <tbody>
    <tr>
        <td valign="top">
            Browse in :
       </td>
       <td valign="top">

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c13/">Topics</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c65/">Programming</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c76/">Journals</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c77/">CVu</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c112/">145</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/c65-112/">Any of these categories</a>

                    -                        <a href="https://members.accu.org/index.php/articles/c65+112/">All of these categories</a>
<br />
</td>
   </tr>
   </tbody>
</table>




<div class="xar-error">
   <p>
 <strong>Note:</strong> when you create a new publication type,
the articles module will automatically use the templates
<em>user-display-[publicationtype].xt</em>
and <em>user-summary-[publicationtype].xt</em>.
If those templates do not exist when you try to preview or display a new article,
you'll get this warning :-)  Please place your own templates in themes/<em>yourtheme</em>/modules/articles . The templates will get the extension .xt there. </p>
</div>
<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;What is a Hash Table?</h1>
<p><strong>Author:</strong>&nbsp;</p>
<p>
<strong>Date:</strong> 03 October 2002 13:15:55 +01:00 or Thu, 03 October 2002 13:15:55 +01:00</p>
<p><strong>Summary:</strong>&nbsp;</p>
<p><strong>Body:</strong>&nbsp;<p>A hash table is a data structure used for associating keys with
values, with the goal of efficient storage space and fast access.
Each item has a key, which the table's hash function will use to
calculate the &quot;hash value&quot; for that item. The hash value indicates
where in the table the item should be. The use of the hash function
means that indexing into the table can be performed in constant
time, regardless of the number of items in the table. A key can be
of any type, but a hash value will always be an integer, as it
represents the item's position in the table.</p>
<p>A hash table where each key results in a unique hash value is
called directly accessible. This is the &quot;ideal&quot; hash table, because
searching for items will always take O(1) time. However, in most
cases this type of hash table is not practical, as it tends to be
wasteful of memory. More common is the chained hash table, where
many keys will result in the same hash value.</p>
<p>A chained hash table can be organised as an array of linked
lists. The array positions correspond to the hash values. The
linked lists are called buckets, and store items with keys that all
translate to the same hash value. When a bucket contains more than
one item, it is known as a collision.</p>
<p>As an example, a chained hash table is used below to store city
information. A structure is defined to hold the city's name as a
string, and the population in thousands as an integer. The pointer
to another city structure is used for chaining cities in the one
bucket. A very simple hash function has been chosen, where the key
is the city's name, and the hash value is the length of the name.
The buckets are not sorted, so each item is simply added at the
start of the linked list.</p>
<p>The <tt class="literal">print_table</tt> function prints the
name and population of every city in the table, one bucket at a
time. Five cities are added to the table, and then <tt class=
"literal">print_table</tt> is called. The output of the program
is:</p>
<div class="literallayout">
<p>Hash Table Contents:<br>
  Bucket 0: End.<br>
  Bucket 1: End.<br>
  Bucket 2: End.<br>
  Bucket 3: End.<br>
  Bucket 4: End.<br>
  Bucket 5: End.<br>
  Bucket 6: Dundee, population 147000; End.<br>
  Bucket 7: Glasgow, population 607000; End.<br>
  Bucket 8: Aberdeen, population 190000; End.<br>
  Bucket 9: Inverness, population 41000; Edinburgh, population
402000; End.<br>
  Bucket 10: End.<br>
  Bucket 11: End.<br>
  Bucket 12: End.<br>
  Bucket 13: End.<br>
  Bucket 14: End.</p>
</div>
<p>This is a perfectly valid hash table, but it is not a very good
one. Even if more cities are added, the bucket lengths will not be
balanced, with most city names colliding in buckets 6 to 9. In
addition, buckets 0 to 2 are not likely to contain any cities! This
sort of behaviour is poor for two reasons: buckets 0 to 2 are a
waste of memory; and searching will be closer to O(n) than the O(1)
ideal, because the concentration of items in 4 buckets makes the
searching behaviour more like that of a linked list than a hash
table.</p>
<p>This problem can be solved by choosing a hash function that will
give a more even spread of items between buckets. In addition, it
may be advisable to change the number of buckets. If a good hash
function is selected that results in a relatively even distribution
of items in buckets, then the number of items expected in each
bucket will be roughly n/m, where n is the number of items in the
table, and m is the number of buckets. If this is a realtively
large number, increasing the number of buckets is advisable. In
order to help achieve an even spread, m should not be a power of 2.
A prime number is an ideal value for m.</p>
<p>As this example only contains five cities, a good value for m is
5. With a good hash function, the expected distribution will be one
city in each bucket. However, it is tolerable to have one bucket
with two cities, and one with none, as a hash function will never
(well, rarely) be perfect.</p>
<p>The most difficult aspect of creating a hash table is deciding
on the hash function. Very often the simplest functions will
produce extremely uneven results, as in the example above. If the
key is not an integer, it is a good idea to massage it in some way,
so it can be treated as an integer. This technique was used in the
previous example, where the key was a string, but the length of the
string was used to produce the hash value. However, to produce
better results something more random than the string length must be
used to calculate hash values.</p>
<p>A relatively simple method is to add up the ASCII values of
every character in the city's name, dividing by a prime number
after each addition. This algorithm may not produce good results if
the keys are very similar. For example, hashing &quot;malloc&quot;, &quot;calloc&quot;,
and &quot;realloc&quot; with the new function produces very poor results,
with all three hashing to the same value. However, there is no
obvious pattern in the city names, and the function produces
reasonable results, with only one collision.</p>
<p>One final stage must be performed before the integer is a valid
hash value. As there are five buckets, the value must be taken
modulo 5. This ensures that the hash value is a bucket number. The
new hash function looks like:</p>
<pre class="programlisting">
int hash_func(struct city *item) { 
  int sum, i; 
  for( i = 0, sum = 0; i &lt; strlen (item-&gt;name); ++i) 
    sum = (sum + item-&gt;name[i]) / 3; 
  return (sum % NUM_BUCKETS); 
}
</pre>
<p>The new table looks like:</p>
<div class="literallayout">
<p>Hash Table Contents:<br>
  Bucket 0: Dundee, population 147000; End.<br>
  Bucket 1: Inverness, population 41000; End.<br>
  Bucket 2: Edinburgh, population 402000; Glasgow, population
607000; End.<br>
  Bucket 3: Aberdeen, population 190000; End.<br>
  Bucket 4: End.</p>
</div>
<p>And, of course, <tt class="literal">NUM_BUCKETS</tt> is now
defined as 5, instead of 15.</p>
<p>It should be noted that the example hash table is efficient in
terms of memory and indexing speed, but it is not very scalable. If
the number of cities were to increase to 25, the time to find an
item would be relatively poor. The average number of items in a
bucket would be 5, and consequently the search time will tend more
towards O(n) than O(1).</p>
<p>Generally, if it is known roughly how many items the table will
accommodate, the number of buckets should be tailored to maximise
efficiency, as in the given example. However, if the number of
items is not known, or may vary greatly, a trade-off should be
considered between indexing time and memory usage. More buckets can
result in wasted memory, but faster indexing time.</p>
<p>Increasing the number of cities to 10 produces the following
results:</p>
<div class="literallayout">
<p>Hash Table Contents:<br>
  Bucket 0: Havana, population 2241000; Paris, population 9469000;
End.<br>
  Bucket 1: Bombay, population 15093000; Tokyo, population
26836000; End.<br>
  Bucket 2: Moscow, population 9233000; Beijing, population
12362000; End.<br>
  Bucket 3: Johannesburg, population 1849000; Perth, population
1220000; End.<br>
  Bucket 4: San Francisco, population 3866000; Lima, population
7452000; End.</p>
</div>
<p>This is the optimal number of items per bucket. Increasing the
number of buckets to 9 gives good results, also:</p>
<div class="literallayout">
<p>Hash Table Contents:<br>
  Bucket 0: San Francisco, population 3866000; End.<br>
  Bucket 1: Paris, population 9469000; End.<br>
  Bucket 2: Bombay, population 15093000; Tokyo, population
26836000; End.<br>
  Bucket 3: Moscow, population 9233000; End.<br>
  Bucket 4: Lima, population 7452000; End.<br>
  Bucket 5: Havana, population 2241000; End.<br>
  Bucket 6: End.<br>
  Bucket 7: Beijing, population 12362000; End.<br>
  Bucket 8: Johannesburg, population 1849000; Perth, population
1220000; End.</p>
</div>
<p>This shows that the hash function chosen appears to be scalable,
whether the number of buckets is increased or not. It can therefore
be said that it is a good hash function for these data sets.</p>
<p>Another type of hash table is the Open Addressed hash table.
This stores all items in the table array itself; items are never
chained. When the hash function results in a collision, an
algorithm is used to suggest another bucket. This algorithm is used
repeatedly until an empty bucket is found, or all possible buckets
have been inspected. This process of trying different buckets is
known as probing the table.</p>
<p>The simplest open addressed hash table uses linear probing. This
involves probing buckets in numeric order until an empty one is
found. The bucket indicated by the hash function is the starting
point, and each bucket is inspected in order until the collision is
resolved.</p>
<p>Another probing technique is called double hashing. This
involves two hash functions: the first to determine the starting
bucket; and the second to determine the &quot;stride&quot;. The stride is the
value added to the bucket number to determine the next bucket to
inspect when a collision occurs.[<a href=
"#algorithm">algorithm</a>]</p>
<p>A disadvantage of this sort of table is that the maximum number
of items to be inserted must be known in advance. While the buckets
of the chained hash table can be increased in size to accommodate
items until there is no more memory available, the open addressed
hash table's size is always fixed.</p>
<p>An advantage of the open addressed hash table is that the worst
case search time is better than that of the chained hash table. In
the first table, the worst performance occurs when the hash table
is full, but the required item is not in the table. In this
situation, searching takes O(m) time, where m is the number of
buckets in the table. In the chained table, the worst performance
occurs when all items are in one bucket, but the required item is
not in the table. In this case, searching takes O(n) time. If
n&gt;m, then the open addressed table has the best performance.</p>
<p>Hash tables may seem rather daunting and complex, but this is
only because of the many decisions that must be made. Once the type
of table, number of buckets, and hash function have been chosen,
implementation is quite straight forward. Next time a data
structure is required, consider whether a hash table would be
effective; it is surprising how often they can be used.</p>
<pre class="programlisting">
/* hash.c by Victoria Catterson */ 
#include &lt;stdio.h&gt; 
#include &lt;string.h&gt; 
#define MAX_NAME_LEN 15 
#define NUM_BUCKETS 15 

struct city { 
  char name[MAX_NAME_LEN]; 
  int population; /* in thousands */ 
  struct city *next;
}; 
/* This is the array of linked lists; the hash table itself. */ 
struct city buckets[NUM_BUCKETS]; 

int hash_func(struct city *item) { 
  return (strlen (item-&gt;name)); 
} 
void add_item(struct city *head, struct city *item) { 
  item-&gt;next = head-&gt;next; 
  head-&gt;next = item; 
}
void print_table(void) { 
  int i; 
  struct city *current; 
  printf(&quot;Hash Table Contents:\n&quot;); 
  for(i = 0; i &lt; NUM_BUCKETS; ++i) { 
    current = buckets[i].next; 
    printf(&quot;Bucket %d: &quot;, i); 
    while(current != NULL) { 
      printf(&quot;%s, population %d000;&quot;, current-&gt;name, current-&gt;population); 
      current = current-&gt;next; 
    }
    printf(&quot;End.\n&quot;); 
  } 
  fflush(stdout); 
} 

int main (void) { 
  struct city glasgow, edinburgh, aberdeen, dundee, inverness; 
  int hash_val; 

  strncpy(glasgow.name, &quot;Glasgow&quot;, MAX_NAME_LEN); 
  glasgow.population = 607; 
  hash_val = hash_func(&amp;glasgow); 
  add_item(&amp;buckets[hash_val], &amp;glasgow); 

  strncpy(edinburgh.name, &quot;Edinburgh&quot;, MAX_NAME_LEN); 
  edinburgh.population = 402; 
  hash_val = hash_func(&amp;edinburgh); 
  add_item(&amp;buckets[hash_val], &amp;edinburgh); 

  strncpy(aberdeen.name, &quot;Aberdeen&quot;, MAX_NAME_LEN); 
  aberdeen.population = 190; 
  hash_val = hash_func(&amp;aberdeen); 
  add_item(&amp;buckets[hash_val], &amp;aberdeen); 

  strncpy(dundee.name, &quot;Dundee&quot;, MAX_NAME_LEN); 
  dundee.population = 147; 
  hash_val = hash_func(&amp;dundee); 
  add_item(&amp;buckets[hash_val], &amp;dundee); 

  strncpy(inverness.name, &quot;Inverness&quot;, MAX_NAME_LEN); 
  inverness.population = 41; 
  hash_val = hash_func(&amp;inverness); 
  add_item(&amp;buckets[hash_val], &amp;inverness); 

  print_table();
  return 0; 
}
</pre>
<div class="bibliography">
<div class="titlepage">
<h2><a name="d0e92" id="d0e92"></a>References</h2>
</div>
<div class="bibliomixed"><a name="algorithm" id="algorithm"></a>
<p class="bibliomixed">[algorithm] For more information on double
hashing and other probing algorithms see <span class=
"bibliomisc"><a href=
"http://www.cise.ufl.edu/~sahni/dsaaj/enrich/cll/overflow.htm"
target=
"_top">http://www.cise.ufl.edu/~sahni/dsaaj/enrich/cll/overflow.htm</a></span></p>
</div>
<div class="bibliomixed"><a name="scottish-city-population" id=
"scottish-city-population"></a>
<p class="bibliomixed">[scottish-city-population] Scottish city
population statistics: 1991 Census. <span class=
"bibliomisc"><a href="http://www.siliconglen.com/scotfaq/1_6.htm"
target=
"_top">http://www.siliconglen.com/scotfaq/1_6.htm</a></span></p>
</div>
<div class="bibliomixed"><a name="world-city-population" id=
"world-city-population"></a>
<p class="bibliomixed">[world-city-population] World city
population statistics: The Times Atlas of the World, 2001</p>
</div>
</div>
</p>
<p><strong>Notes:</strong>&nbsp;</p>
<p><em>More fields may be available via dynamicdata ..</em></p>
</div>
</channel>
</rss>
