home

hashmaps

Table of Contents

Hashmaps, dictionaries, maps, associative arrays, whatever you want to call them they are a fantastic data structures. They let you lookup data in an instant. Want to store something in a database and instantly look it up based on an ID? You can use them to map IDs to data locations. Want to cache function calls? You can use hashmaps to map input values to the output values. Or how about a symbol table for your compiler? You can map variable names to their values. Hashmaps can be used almost anywhere you need to store data and be able to look it up based on a key.

Let's make our own hashmap. We'll start with a simple case and work our way up to a more robust (and basic) hashmap.

1. Overview

We're going to write a program that finds the word count for a given list of words from STDIN. To do this we'll write our own hashmap. A hashmap allows us to store values with a key and then look up those values given the key. For our program our hashmap will map each word (the key) to the number of times that word is in the text (the value). Our program will look like the following:

////////////////////////////////////////////////////////////////////////////////
// Type Aliases
////////////////////////////////////////////////////////////////////////////////
typedef char* str;
typedef uint32_t frequency_t;

////////////////////////////////////////////////////////////////////////////////
// Key Value Pairs
////////////////////////////////////////////////////////////////////////////////
typedef struct kv_pair {
  str         key;
  frequency_t val;
} kv_pair_t;

////////////////////////////////////////////////////////////////////////////////
// Hash Table
////////////////////////////////////////////////////////////////////////////////
#define TABLE_SIZE 1024

typedef struct ht {
  kv_pair_t buffer[TABLE_SIZE];
  size_t    len;
} hash_table_t;

static void
ht_store(hash_table_t *ht, str key, frequency_t freq);

static frequency_t
ht_lookup(hash_table_t *ht, str key);

static void
ht_print(hash_table_t *ht);

////////////////////////////////////////////////////////////////////////////////
// Main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, str argv[]) {
  hash_table_t ht = {{0}, 0};

  for (int i = 0; i < argc; i++) {
    str word = argv[i];
    frequency prev_freq = ht_lookup(&ht, word);
    ht_store(&ht, word, prev_freq+1);
  }

  ht_print(ht);
  return 0;
}

2. A naive approach

Let's start with a simple approach: we'll store key value pairs in a continuous chunk of memory. When we want to store a new value we'll create a pair for it and add it to the list. If we want to lookup a value, we iterate over each pair until we find the pair with a matching key and return it's value.

#include <stdint.h>
#include <stdio.h>

////////////////////////////////////////////////////////////////////////////////
// Type Aliases
////////////////////////////////////////////////////////////////////////////////
typedef char* str;
typedef uint32_t frequency_t;

////////////////////////////////////////////////////////////////////////////////
// Key Value Pairs
////////////////////////////////////////////////////////////////////////////////
typedef struct kv_pair {
  str         key;
  frequency_t freq;
} kv_pair_t;

////////////////////////////////////////////////////////////////////////////////
// Hash Table
////////////////////////////////////////////////////////////////////////////////
#define TABLE_SIZE 1024

typedef struct hash_table {
  kv_pair_t buffer[TABLE_SIZE];
  size_t    len;
} hash_table_t;

static void
map_store(hash_table_t *ht, str key, frequency_t freq)
{
  int i = 0;
  for (; i < ht->len; i++) {
    if (ht->buffer[i].key == key) {
      ht->buffer[i].freq = freq;
      return;
    }
  }
  ht->buffer[i].key = key;
  ht->buffer[i].freq = freq;
  ht->len = (ht->len + 1) % TABLE_SIZE;
}

static frequency_t
map_lookup(hash_table_t *ht, str key)
{
  for (int i = 0; i < ht->len; i++) {
    if (ht->buffer[i].key == key) {
      return ht->buffer[i].freq;
    }
  }
  return 0;
}

////////////////////////////////////////////////////////////////////////////////
// Main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char *argv[]) {
  hash_table_t ht = { {0}, 0};
  map_store(&ht, "foo", 3);
  printf("foo=%d\n", map_lookup(&ht, "foo"));

  return 0;
}

It works! But can we make it better?

3. Finding a needle in a haystack

"But that implementation is slow. It doesn't even use hashing so technically it's not even a hash table…" I know but this naive approach will give us insight into what is fundamentally going on with hash tables so stick with me. Obviously this implementation is slow but why is it slow?

It's slow because of the worst case scenario. In this scenario the key we are interested in is stored in the back of the data. We need to search through all of the data before finding it. "Why is it stored in the back of our data?" Because the location of our keys is based on the order that they are stored. The first key we store is the first in the buffer, the second stored is the second, etc… This is a problem because we don't know the order the keys were stored when we want to look up a key. The only information we have is the key itself.

But, if we did know the order we inserting all of our data we could look it up instantly because we would be using the same method to lookup our data that we use to store our data. This is the key insight:

If you want to find your data quickly you need to search for it using the same method you used to store it. It's just like loosing your keys. Don't look all over your house for them, go back to where you were before you lost them and repeat the same process. 

Let's try storing the key value pairs using the same approach as we use to look them up. So how do we store them?

4. Memory Constraints - A (less) naive approach

Since we want to be able to lookup values based on keys that means we will need to store them based on the key. Remember, the keys are the words we are counting the frequencies of. Well, why not use the word we want to store as the location for where we store it?

Words are data, data can represent numbers and numbers can be used to index into our array…

Word Number
a 1
b 2
 
altitude 43300921411

It would take a lot of memory to create an array with room for every possible word! This brings us to our first constraint:

  1. Often (but not always) it isn't feasible to store all possible keys in your hash table.

5. Collisions - A (similarly) naive approach

Maybe we can't store all words in our hashtable. But what if we try using less information? For example, what if we use only the first letter of the words as the storage location?

Word First Letter Number
a a 1
b b 2
   
altitude a 1

We run into the issue that there are many words that start with the same letter. This brings us to our second constraint.

  1. If you aren't able to store all keys in your hash table then some of your keys will want to be stored in the same location.

6. Probing - A (slightly better) naive approach

But we can work around this! If we try to store a key and something is already in the location we would like to store it in we can just place it in the next spot (or the next one if thats full, or the next next, etc…). This is called linear probing.

#define TABLE_SIZE 26 // 26 letters in the alphabet.

static int
  ht_store(hash_table_t *ht, str key, frequency_t freq)
  {
    /** Search for open space in the buffer. Base the first
        location on the first letter of the key and look at the
        next spot until we:
          a) Find an open spot. Then store the key.
          b) Can't find a spot. Return. */
    int i = key[0] - 'a';
    int final = (i - 1) % TABLE_SIZE;
    while (i != final) { 
      if (ht->buffer[i].key == key || !ht->buffer[i].key) {
        ht->buffer[i].key = key;
        ht->buffer[i].freq = freq;
        return 1;
      }
      i = (i + 1) % TABLE_SIZE;
    }
    return 0;
  }

  static frequency_t
  ht_lookup(hash_table_t *ht, str key)
  {
    /** Use the same strategy as our lookup. If we find
     a matching key then return the freq. If not, return 0.*/
    int i = key[0] - 'a';
    int final = (i - 1) % TABLE_SIZE;
    while (i != final) { 
      if (ht->buffer[i].key == key) {
        return ht->buffer[i].freq;
      }
      i = (i + 1) % TABLE_SIZE;
    }
    return 0;
  }

Probing looks familiar because it's what we were doing in our original example! We always tried to place a key value pair in the front of the buffer. If something was there we tried the next spot, then the next, etc…

7. Double Hashing - A (decent) naive approach

Can we do better than linear probing? Well yes, we already learned that we can base the location of our data on the data itself. So if the spot that we want to place our data is full why not generate a new spot based on the data and the fact that the first spot is taken? This is called double hashing.

For example, a naive but simple approach is to try the location associated with the second letter in the word. If thats taken we can try the third, etc…

Word 1st Letter Taken? 2nd Letter Taken? Number
a a F       1
b b F       2
           
altitude a T l F   12

8. Breaking the naive approach

We've done it! We have a naive approach. Let's see if we can break it.

We can see that the table works just fine if there aren't collisions. But if there is, we need to search for a new spot. What if we fill up the hash table in such a way that a collision will always lead to a new collision? Here is an example

This is a bit of an extreme example. In most cases a hash table won't run into these issues. A lot of thought has gone into ensuring this won't happen. But, even if it's garaunteed that data will always be able to find a spot lots of collisions will lead to bad performance (ie our first approach).

This is where hashing algorithms come in. A lot of research has been put into them and there are many different algorithms that do well in different situations. Here is a general (and easy to code) hashing algorithm (FNV-1) taken from a skeeto article:

int hash(str key) {
  uint_fast64_t h = 0x100;
  for (int i = 0; key[i]; i++) {
    h ^= key[i] & 255;    
    h *= 1111111111111111111;
  }
  return h;
}

9. Final Example

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

////////////////////////////////////////////////////////////////////////////////
// Key Value Pairs
////////////////////////////////////////////////////////////////////////////////
typedef struct kv_pair {
  char     *key;
  uint32_t freq;
} kv_pair_t;

////////////////////////////////////////////////////////////////////////////////
// Hash Table
////////////////////////////////////////////////////////////////////////////////
#define TABLE_SIZE 1024

typedef struct hash_table {
  kv_pair_t buffer[TABLE_SIZE];
} hash_table_t;

uint64_t hash(char *key) {
  uint_fast64_t h = 0x100;
  for (int i = 0; key[i]; i++) {
    h ^= key[i] & 255;    
    h *= 1111111111111111111;
  }
  return h;
}

static int
  ht_store(hash_table_t *ht, char *key, uint32_t freq)
  {
    int i = hash(key) % TABLE_SIZE;
    int final = (i - 1) % TABLE_SIZE;
    while (i != final) {
      if ((ht->buffer[i].key && strcmp(ht->buffer[i].key, key) == 0)
          || !ht->buffer[i].key) {
        ht->buffer[i].key = key;
        ht->buffer[i].freq = freq;
        return 1;
      }
      i = (i + 1) % TABLE_SIZE;
    }
    return 0;
  }

  static uint32_t
  ht_lookup(hash_table_t *ht, char *key)
  {
    /** Use the same strategy as our lookup. If we find
     a matching key then return the freq. If not, return 0.*/
    int i = hash(key) % TABLE_SIZE;
    int final = (i - 1) % TABLE_SIZE;
    while (i != final) {
      if ((ht->buffer[i].key && strcmp(ht->buffer[i].key, key) == 0)
          || !ht->buffer[i].key) {
        return ht->buffer[i].freq;
      }
      i = (i + 1) % TABLE_SIZE;
    }
    return 0;
  }

////////////////////////////////////////////////////////////////////////////////
// Helpers
////////////////////////////////////////////////////////////////////////////////
static void
ht_inc(hash_table_t *ht, char *key)
{
  int n = ht_lookup(ht, key);
  ht_store(ht, key, n + 1);
}

static int
read_words(int fd, char *buffer) {
  int nbytes = 100;
  int n = 0;
  while (1) {
    int bytes_read = read(fd, buffer + n, nbytes);
    n += bytes_read;
    if (bytes_read != nbytes) break;
  }
  buffer[n] = '\0';

  int num_words = 0;
  for (int i = 0; i < n; i++) {
    if (buffer[i] == ' ' || buffer[i] == '\n') {
      buffer[i] = '\0';
      num_words++;
    }
  }
  return num_words;
}


////////////////////////////////////////////////////////////////////////////////
// Main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char *argv[]) {
  // Read words from STDIN.
  char words[1<<20];
  int num_words = read_words(STDIN_FILENO, words);

  // Iterate over each word and increment it's count in the
  // hash table.
  hash_table_t ht = {{0}};
  int i = 0;
  while (words[i] != '\0') {
    ht_inc(&ht, &words[i]);
    i += strlen(&words[i]) + 1;
  }

  // For ever word passed into the command print it's count.
  for (i = 1; i < argc; i++) {
    char* key = argv[i];
    int freq = ht_lookup(&ht, key);
    printf("%s %d\n", key, freq);
  }

  return 0;
}

An example of running the program:

$ echo "foo bar baz foo" | word_count foo bar
foo 2
bar 1

Date: 01.02.24

Author: Zach Dingels

license