EECS 311: DATA STRUCTURES

Assignment 2

Goal

The goals of this assignment are:

learn how to design and implement an efficient tree-based algorithm to replace an inefficient brute-force approach
work with recursive algorithms on tree structures
analyze and empirically verify the space and time performance of the algorithm
manage dynamically allocated memory in tree structures

The problem is the same as in Assignment 1, but the algorithm for finding answers is very different. Instead of

for every word in wordList
  if word in grid
    collect word

the algorithm is

for every cell in grid
  collect every word that starts at the cell

The trick is to make finding every word that starts at a cell as cheap as possible. The brute force approach would be

for every cell in grid
  for every legal sequence of letters from cell
    if sequence is in wordList
       collect sequence

Legal sequences are every path from the cell, 3 letters or longer, that don't visit a cell twice. Consider doing that with this small grid:

M	M	G
E	U	S
K	D	E

For the first cell, this try things like MMG, MMGS, MMGSU, MMGSUE, ... But there are no words that start with MM. None of these sequences can work. On the other hand, even though the sequence DUK is not a word, adding an E makes the word DUKE.

This suggests that you want a data structure that quickly tells you "are there any words starting with ..." for any given sequence of letters. The standard data structure for this is called a trie, pronounced "tree" as in "reTRIEval." The book has a section on tries, but I think the opening of the the Wikipedia entry is clearer. That article is the source of this image.

If you build a trie to hold the words in the word list, then the Boggle solver can quickly tell when there's no point to going any further along a path in the grid.

Tasks

Go slowly. Be sure the first subtask is working before going to the next. Get code for the first subtask approved in the Code Critic before submitting the code for the next subtask.

Create a new project directory called boggle2. Copy all of the old Boggle code from Assignment 1 into it. The Boggle API is not going to change very much, but internally you're going to change it to use a trie for the word list instead of a vector.

PA2 Task 1: add branches to tries

A trie is a tree, though not a binary search tree. Like most tree data structures, the tree node is the heart of the data structure. In fact, you won't need anything else for this problem.

A trie node needs to let you:

add words to the trie, automatically extending the tree as necessary, without creating duplicate branches
follow a sequence of letters from node to node in a trie

Here's an API to do that. I've made it as minimal as possible, so that you can focus on the core algorithms.

Trie() -- constructs an empty trie node
Trie *Trie::next(char ch) const -- returns a pointer to the child node for ch, or null if there is none
Trie *Trie::addBranch(char ch) -- adds a branch to a child node for ch; does nothing if such a branch already exists; returns a pointer to the child node (never null)

Define these in the files trie.h and trie.cpp.

Add the following tests (and others, if you want) to your tests.cpp file. While they won't catch every bug, these tests will verify that child branches get added in the right place, and that a branch is not created for a letter if one already exists. Notice how the tests use chains of pointers to go down the tree.

TEST(addBranch)
{
  Trie * trie = new Trie();
  // verify that branches do and do not exist as expected
  // when adding the sequence a + b + c
  CHECK( !trie->next('a') );
  CHECK( trie->addBranch('a') );
  CHECK( trie->next('a') );

  CHECK( !trie->next('a')->next('b') );
  CHECK( trie->next('a')->addBranch('b') );
  CHECK( trie->next('a')->next('b') );
  
  CHECK( !trie->next('a')->next('b')->next('b') );
  CHECK( trie->next('a')->next('b')->addBranch('b') );
  CHECK( trie->next('a')->next('b')->next('b') );

  // make sure b was not added to the root
  CHECK( !trie->next('b') );

  // adding a branch twice shouldn't affect anything
  // test by adding a again and the result should be
  // the previous node that had a child for b
  CHECK( trie->addBranch('a')->next('b') );
}

Submit to Code Critic: your trie.h and trie.cpp code, clearly marked which is which. Don't send test code, but if you added any new tests that you think are important, send just them.

PA2 Task 2: add words to tries

Trie::addBranch() is useful for testing the trie code, but what you actually want to use is a function that can add and store whole words. Add the following to your Trie class:

std::string *Trie::word -- a public data member that points to either a stored word or null.
void Trie::addWord(std::string word) -- adds branches to the trie, as necessary, and stores word in the leaf node.

Public data members are bad. They make it to easy for someone to corrupt the data structure, e.g., by changing a stored word to something inconsistent with the trie. I do it here just to keep you focused on the algorithmic issues.

Add the following to your test code. Make sure you understand what these tests are checking for. Post a question if you don't.

TEST(addWord)
{
  Trie * trie = new Trie();
  trie->addWord( "and" );
  trie->addWord( "ant" );
  trie->addWord( "bee" );
  trie->addWord( "an" );

  CHECK( trie->next('a') );
  CHECK( !trie->next('a')->word );

  CHECK( trie->next('a')->next('n') );
  CHECK_EQUAL( "an", *(trie->next('a')->next('n')->word) );

  CHECK( trie->next('a')->next('n')->next('d') );
  CHECK_EQUAL( "and", *(trie->next('a')->next('n')->next('d')->word) );

  CHECK( trie->next('a')->next('n')->next('t') );
  CHECK_EQUAL( "ant", *(trie->next('a')->next('n')->next('t')->word) );

  CHECK( !( trie->next('a')->next('n')->next('x') ) );

  CHECK( trie->next('b') );
  CHECK( !trie->next('b')->word );

  CHECK( !trie->next('c') );
}

Submit to Code Critic: whatever header or implementation code you added or changed. Don't send the entire files or the test code. Be sure to mark the header code versus the implementation code.

PA2 Task 3: load word list

Change your Boggle class to load the words in crosswd.txt into a trie, instead of a vector.

You'll need to make just two small changes to your public Boggle API and the test cases:

delete Boggle::contains(), the functions that implement it, and the code that tests it. This method is not needed in this approach to solving problems.
change Boggle::solve() to take a set instead of a vector. The new algorithm will generate solutions in non-alphabetical order and may generate duplicates. Passing a set as the container will automatically and efficiently sort the results and ignore duplicates.

Test your code using the same Boggle tests you had before. You should get exactly the same answers.

Submit to Code Critic: whatever header or implementation code you added or changed. Don't send the entire files or the test code. Be sure to mark the header code versus the implementation code.

PA2 Task 4: analyze performance

Do a formal analysis of the complexity of your new algorithm. Compare your analysis to the observed performance. Also compare the runtime to the runtime of your previous version. Use the same notational conventions you used in Assignment 1.

PA2 Task 5: fix memory leak

This task is not algorithm-related. It is not required in languages like Lisp, Java, Javascript or C# that have garbage collectors. It is critical though in C and C++ where you have to make sure your program doesn't "leak memory." A program leaks memory if it fails to give back memory that was added using the new operator, once the program is done using the memory.

Chances are, unless you know more C++ than is expected for this class, your Trie class leaks memory like a sieve. To see if this is the case, change your trie class thus:

add a public static data member nodeCount of type std::size_t initialized to 0.
add ++nodeCount; to your Trie constructor
define Trie::~Trie() to call --nodeCount;

nodeCount will then contain how many trie nodes have been allocated and not given back.

See this page at IBM on how to initialize static data members. It's not at all obvious. The code to set nodeCount to 0 should be in your trie.cpp file.

Add the following test (and others if you want) to your tests.cpp:

TEST(memory)
{
  std::size_t base = Trie::nodeCount;
  // many nodes in the global Boggle trie
  CHECK( base > 0 );
  Trie *trie = new Trie();
  CHECK_EQUAL( base + 1, Trie::nodeCount );
  trie->addWord( "and" );
  CHECK_EQUAL( base + 4, Trie::nodeCount );
  trie->addWord( "ant" );
  CHECK_EQUAL( base + 5, Trie::nodeCount );
  delete trie;
  CHECK_EQUAL( base, Trie::nodeCount );
}

This creates a couple of nodes, then deletes the tree. Your code should pass all the tests except the last one. If it fails an earlier test, you're not updating nodeCount correctly. If it fails, the last test, you have a memory leak.

Now fix the memory. Change your trie destructor to delete all storage that has been dynamically allocated in that node and any child nodes.

PA2 Task 6: fix copy and assignment

You've fixed the memory leak, but now add the following tests to tests.cpp:

TEST(copyTrie)
{
  Trie trie;
  trie.addWord( "and" );
  trie.addWord( "ant" );
  CHECK_EQUAL( "and", *(trie.next('a')->next('n')->next('d')->word) );
  CHECK_EQUAL( "ant", *(trie.next('a')->next('n')->next('t')->word) );

  Trie copy( trie );
  CHECK_EQUAL( "and", *(copy.next('a')->next('n')->next('d')->word) );
  CHECK_EQUAL( "ant", *(copy.next('a')->next('n')->next('t')->word) );

  copy.addWord( "bee" );
  trie.addWord( "cat" );
  CHECK_EQUAL( "bee", *(copy.next('b')->next('e')->next('e')->word) );
  CHECK_EQUAL( "cat", *(trie.next('c')->next('a')->next('t')->word) );
  CHECK( !trie.next('b') );
  CHECK( !copy.next('c') );
}

TEST(assignTrie)
{
  Trie trie;
  trie.addWord( "and" );
  trie.addWord( "ant" );
  CHECK_EQUAL( "and", *(trie.next('a')->next('n')->next('d')->word) );
  CHECK_EQUAL( "ant", *(trie.next('a')->next('n')->next('t')->word) );
  Trie copy;
  CHECK( !copy.next('a') );

  copy = trie;
  CHECK_EQUAL( "and", *(copy.next('a')->next('n')->next('d')->word) );
  CHECK_EQUAL( "ant", *(copy.next('a')->next('n')->next('t')->word) );

  copy.addWord( "bee" );
  trie.addWord( "cat" );
  CHECK_EQUAL( "bee", *(copy.next('b')->next('e')->next('e')->word) );
  CHECK_EQUAL( "cat", *(trie.next('c')->next('a')->next('t')->word) );
  CHECK( !trie.next('b') );
  CHECK( !copy.next('c') );
  
  trie = trie;
}

Compile and run. Your program should crash. It may crash with an exception, it may freeze up, you may have to kill it manually with control-C, and/or find and kill the boggle2.exe process. All in all, quite a mess.

What happened? Section 1.5.5 of the textbook gives some details, but that section doesn't make it clear just how bad things get if you forget the Rule of three: if your class needs a non-default destructor, copy constructor or assignment operator, it needs non-default versions of all three.

For tips on defining copy constructors and assignment operators, see the Rule of Three discussion on the C++ topics page.

Note that just like your destructor, your copying code will need to copy the internal tree recursively. Refactor! If you write the same code in two places, define a private subfunction that both places call.

Submit to Code Critic: whatever header or implementation code you added or changed. Don't send the entire files or the test code. Be sure to mark the header code versus the implementation code.

Wrap Up

When all of the above has been completed, and approved as necessary via the Code Critic, check over your code and send it in.

Make sure your program runs standalone, with no user input of any kind. That includes not stopping at the end with a "hit any key to continue" message. If I can't run your program automatically from a script, it will be returned.

Be sure your test code file has an opening header comment that identifies the authorship of the code, as follows:

// EECS311 PA2 solution: Trie approach to solving Boggle
// Date: October 10, 2009
// Authors: T.H.E.Hacker
// 
// Except for code provided by the instructor or textbook, this code was written
// solely by the authors listed above, and not copied from another student
// or external source.

Run make handin to generate a Zip archive of all your source code. By design, the Magic Makefile does not include the word list file. Do not include the word list file.

Email your analysis document and the Zip archive produced by the Magic Makefile to me. Use the Subject EECS 311 PA 2 Boggle Trie.