CS50 PSet 5: Speller

A guide to the ‘Speller’ problem in CS50 Week 5.

Goal: To implement a spell checker program in C. The program is partially written, but we must write functions that both load and unload the dictionary into memory, as well as checking if each word in the given text is in the dictionary and thus spelled correctly.

The dictionary should be loaded into a hash table, which can be looked up when checking if each word is spelt correctly. The aim is to minimise this lookup time by using an appropriate hash function.

The functions to be defined within dictionary.c are as follows:

load() must take the dictionary and load it into a hash table using an appropriate hash function.

hash() is the aforementioned hash function that determines the hash code for each entry to the hash table.

size() must return the number of words in the dictionary, if it is successfully loaded.

check() must check if a word is in the dictionary and thus spelt correctly.

unload() remove the dictionary data structure from memory once the entire document has been spell checked.

As usual, the first step is to import the libraries we will be using.

// Implements a dictionary's functionality#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>
#include <ctype.h>
#include "dictionary.h"

Next we want to define the node struct and initiate any variables we will be using. Here node has two attributes: a word and a pointer to the next node in the linked list.

We also define the number of ‘buckets’ in the hash table here. This is the number of separate linked lists stored within the table. This value can be whatever you want and should depend on the hash function used. For example, if your hash function just used the first letter of each word you would set N=26, as there are 26 different possibilities for each word. Here I have used an arbitrary value of 100,000 but I encourage you to find your own hash function and modify the number of buckets to optimise the performance of the spell checker.

Finally we must initiate the hash table, which has N number of nodes, and an integer for the size of the dictionary to be loaded.

// Represents a node in a hash table
typedef struct node
{
char word[LENGTH + 1];
struct node *next;
}
node;
// Number of buckets in hash table
const unsigned int N = 100000;
// Hash table
node *table[N];
int dict_size = 0;

load()

The purpose of this function is to take a dictionary file, which is the sole argument, open it, and add each word into the hash table at an appropriate index using the hash() function.

The first step is to open the dictionary file using fopen() and assign it to a FILE variable, printing an error message if this does not work. A blank char array is then defined to store each word before it gets added to the hash table.

To scan through the file, fscanf() can be used, which takes three arguments:

  1. The file to be read from.
  2. The format of each item to be read. Here %s is used to tell it to read one string at a time.
  3. Where each item read is to be stored, the next_word array in this case.

By using a while loop until the end of file, each word from the dictionary will be read one at a time for loading into the hash table.

For each word, a new node n should be defined and memory allocated using malloc(). If there isn’t enough memory this will equal NULL, so this error should be handled.

Otherwise, strcopy() can then be used to copy the value of next_word into the word field of n.

We can then get the hash value for the word using the hash() function, which will be defined later. This hash value is then used to add n to the linked list at the corresponding index in the hash table. Recall that when adding a node to a linked list, we first set the pointer of the node to be inserted to the current head of the list, then set the node as the new head of the list.

Finally increment the count of words in the dictionary and when the loop finishes, close the file and return true to signify that a dictionary has been successfully loaded.

// Loads dictionary into memory, returning true if successful else false
bool load(const char *dictionary)
{
// Open dictionary file
FILE *dict_pointer = fopen(dictionary, "r");
// Check if null
if (dictionary == NULL)
{
printf("Unable to open %s\n", dictionary);
return false;
}
// Initialise word array
char next_word[LENGTH + 1];
// Read strings from file one at a time
while (fscanf(dict_pointer, "%s", next_word) != EOF)
{
// Create new node for each word
node *n = malloc(sizeof(node));
if (n == NULL)
{
return false;
}
// copy word into node using strcopy
strcpy(n->word, next_word);
// Hash word to obtain hash value
int hash_value = hash(next_word);
// Insert node into hash table at that location
n->next = table[hash_value];
table[hash_value] = n;
dict_size++;
}
// Close file
fclose(dict_pointer);
return true;
}

hash()

Here we define our hash function. There are infinite possibilities for this and you are encouraged to research online to find your own. I have opted for a fairly simple one that adds the ASCII values of each character in the word together. Remember to modulus divide the value outputted by the number of buckets in your hash table to get an integer that is within the bounds of the hash table.

// Hashes word to a number
unsigned int hash(const char *word)
{
// Function should take a string and return an index
// This hash function adds the ASCII values of all characters in the word together
long sum = 0;
for (int i = 0; i < strlen(word); i++)
{
sum += tolower(word[i]);
}
return sum % N;
}

size()

This function must return the size of the dictionary. By tracking the number of words as we were loading them in under dict_size, this one becomes very easy!

// Returns number of words in dictionary if loaded else 0 if not yet loaded
unsigned int size(void)
{
return dict_size;
}

check()

This function must check if a given word in the text being checked is in the dictionary and thus spelled correctly. It takes one argument, the word being checked.

The first step is to get the hash value for the word being checked, and access the linked list at the corresponding index in the hash table. The strcasecomp() function can then be used to traverse the linked list and check if the word is present. If it is present, return true, otherwise check the next node until the end of the list is reached and we can return false.

// Returns true if word is in dictionary else false
bool check(const char *word)
{
// Hash word to obtain hash value
int hash_value = hash(word);
// Access linked list at that index in hash table
node *n = table[hash_value];
// Traverse linked list, looking for word (strcasecomp)
while (n != NULL)
{
if (strcasecmp(word, n->word) == 0)
{
return true;
}
n = n->next;
}
return false;
}

unload()

The final step, and an important one when using a language like C that requires manual memory management, is to remove the dictionary from memory.

The entire hash table must be iterated through to unload each linked list one by one, so the iteration must go up to the value of N. For each index in the hash table, we define a new node n pointing to the head of the linked list. Then traverse the linked list and remove each element from memory one by one. This is done by setting a tmp variable equal to n, setting n equal to the next node in the list, and finally using free(tmp) to release it from memory. The steps must be followed in that order to prevent a memory leak.

When we have reached the last bucket and the final linked list has been emptied (n == NULL), we can return true.

// Unloads dictionary from memory, returning true if successful else false
bool unload(void)
{
// Iterate over hash table and free nodes in each linked list
for (int i = 0; i < N; i++)
{
// Assign cursor
node *n = table[i];
// Loop through linked list
while (n != NULL)
{
// Make temp equal cursor;
node *tmp = n;
// Point cursor to next element
n = n->next;
// free temp
free(tmp);
}
if (n == NULL && i == N - 1)
{
return true;
}
}return false;
}

And that’s it for speller and for the C section of the course. By this stage we are dealing with some fairly advanced concepts relative to a few weeks ago, such as linked lists and hash tables. These are data structures that come up time and time again in interview questions so it’s important to have a good understanding of their use.