# EECS 311: Trees

## Terminology

There's a lot of terminology associated with trees. You should be familiar with the following basic terms:

• tree
• node
• root or root node
• leaf or leaf node
• internal node
• height vs. depth
• level (level 0 is the root)
• child nodes of a parent node
• sibling nodes
• edge or branch
• ancestors and descendants
• subtree (a very important concept!), e.g., left and right subtrees
• path from leaf to root
• traversals: preorder, postorder, and inorder
• balanced trees

There are also many kinds of trees, including:

• n-ary trees (at most n branches from a node, at most n children)
• binary trees (at most 2 branches or children)
• heaps
• binary search trees or ordered trees
• parse trees (expression trees)

## Implementing Binary Trees

### Linear or Array Representation

This method is easy to understand and implement. It's very useful for certain kinds of tree applications, such as heaps, and fairly useless for others. It's typically used on dense binary trees.

The idea is simple:

• Take a complete binary tree and number its nodes from top to bottom, left to right.
• The root is 0, its left child 1, its right child 2, the left child of the left child 3, and so on.
• Put the data for node I of this tree in the Ith element of an array.
• If you have a partial (incomplete) binary tree and node I is absent, put some value that represents "no data" in the Ith position of the array.

Three simple formulae allow you to go from the index of the parent to the index of its children and vice versa:

• if index(parent) = N, index(left child) = 2*N+1
• if index(parent) = N, index(right child) = 2*N+2
• if index(child) = N, index(parent) = (N-1)/2 (integer division with truncation)
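
These formulae translate directly into code. Here's a minimal sketch (the function names are mine, not from the course):

```cpp
#include <cassert>

// Index arithmetic for the array representation, using the 0-based
// numbering described above. Integer division truncates, which is
// exactly what the parent formula needs.
int leftChild(int n)  { return 2 * n + 1; }
int rightChild(int n) { return 2 * n + 2; }
int parent(int n)     { return (n - 1) / 2; }
```

For example, node 3 (the left child of the left child of the root) has `parent(3) == 1` and `leftChild(3) == 7`.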

The advantage of the linear representation is this easy traversal up and down, and efficient use of space if the tree is complete. The disadvantage is inefficient use of space if the tree is sparse.

### Linked Representation

Again, the idea is simple. A node in the tree has

• a data field
• a left child field with a pointer to another tree node
• a right child field with a pointer to another tree node
• optionally, a parent field with a pointer to the parent node

The most important thing to remember about the linked representation is this:

A tree is represented by the pointer to the root node, not a node.

The empty tree is simply the NULL pointer, not an empty node.
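
A node along these lines could be declared as follows (the field names are illustrative, and the data field is an `int` only for simplicity):

```cpp
#include <cstddef>  // for NULL

// One possible node layout for the linked representation.
struct BtNode {
    int info;             // the data field
    BtNode* leftChild;    // pointer to the left subtree, or NULL
    BtNode* rightChild;   // pointer to the right subtree, or NULL
};

// A tree is a pointer to its root node; the empty tree is NULL.
BtNode* emptyTree = NULL;
```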

## Traversal and Recursion

Traversing a binary tree in any of the three orders, even in a linked representation without parent links, is trivial with recursion.

For example, here's the pseudo-code for inorder traversal:

```
inorderTraversal(bt, fn):

    if (bt != NULL)
        inorderTraversal(bt->leftChild, fn)
        fn(bt->info)
        inorderTraversal(bt->rightChild, fn)
```

Here,

• `bt` is a binary tree, i.e., a pointer to a binary tree node
• `fn` is a function that operates on the kind of data stored in the tree
• `bt->leftChild`, `bt->rightChild`, and `bt->info` access the left child pointer, right child pointer, and data field, respectively

It should be obvious how to code the other two traversals.
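
In real C++, the inorder pseudo-code above might look like this sketch, where the traversal collects values into a vector instead of taking an arbitrary function (names are illustrative):

```cpp
#include <cstddef>
#include <vector>

struct BtNode {           // same node layout as before, repeated so
    int info;             // this sketch stands on its own
    BtNode* leftChild;
    BtNode* rightChild;
};

// Inorder traversal: left subtree, then this node, then right subtree.
void inorderTraversal(const BtNode* bt, std::vector<int>& out) {
    if (bt != NULL) {
        inorderTraversal(bt->leftChild, out);
        out.push_back(bt->info);
        inorderTraversal(bt->rightChild, out);
    }
}
```

On a binary search tree, this visits the data in sorted order.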

## Binary Search Trees

Binary search trees have the property that

• all data in the left subtree of every node are less than the data in the node
• all data in the right subtree of every node are greater than the data in the node
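
This property is what makes searching fast: each comparison discards an entire subtree. A minimal search sketch (field and function names are mine, not the course's):

```cpp
#include <cstddef>

struct BtNode {
    int info;
    BtNode* left;
    BtNode* right;
};

// Return true if x occurs somewhere in the tree rooted at bt.
// Each comparison rules out one whole subtree.
bool contains(const BtNode* bt, int x) {
    if (bt == NULL) return false;          // fell off the tree: not there
    if (x < bt->info) return contains(bt->left, x);
    if (x > bt->info) return contains(bt->right, x);
    return true;                           // x == bt->info
}
```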

### What's the Point?

Consider the computational complexity of the data structures we've seen so far. We give best- and worst-case Big O values:

| Data structure | Adding data | Retrieving data |
| --- | --- | --- |
| Array (unordered) | constant (just add to the end) | O(N) (linear search) |
| Array (ordered) | O(log2N) compares, O(N) data transfers (shift data) | O(log2N) (binary search) |
| Linked list (unordered) | constant (assuming an end pointer) | O(N) (linear search) |
| Linked list (ordered) | O(log2N) compares, constant data transfers (change a few pointers), but O(N) nodes visited sequentially | O(log2N) compares (binary search), but O(N) nodes visited |
| Binary search tree (best case) | O(log2N) compares, constant data transfers, O(log2N) nodes visited | O(log2N) |
| Binary search tree (worst case) | O(N) compares, constant data transfers, O(N) nodes visited | O(N) |

On the average, then, a binary search tree will be

• as fast as a sorted array with less work to add data,
• faster than a linked list, because it will pass through far fewer nodes

Binary search tree algorithms for adding and retrieving are also very simple.

In the worst case, however, a binary search tree will be as bad as a linked list. Many of the variations of binary search trees that we'll see will be attempts to get the best of both worlds: fast access and fast storage, albeit using more complex algorithms.

It's easy to add new data and "grow" the tree:

```
addData(&bt, x):

    if (bt == NULL)
        bt = new BtNode
        bt->info = x
        bt->left = bt->right = NULL
    else if (x < bt->info)
        addData(bt->left, x)
    else if (x > bt->info)
        addData(bt->right, x)
```
Adding a new node takes O(log2n) steps, where n is the number of nodes in the tree. This is best- and average-case behavior. If the tree is very unbalanced, e.g., we passed it a sorted list of numbers, then adding a new node will take O(n) steps.

### Deleting Data

Deleting data is complicated when the data being removed is not in a leaf node. We can't just delete the node, because then our tree would "fall apart." We have to promote one of the children to become the new parent. The child has to be

• bigger than all the other children in the left tree, and
• smaller than all the other children in the right tree

There are at most two possible candidates:

• the rightmost child of the left subtree
• the leftmost child of the right subtree

It doesn't matter which one we pick. If neither subtree exists, we have a leaf node, which can simply be deleted and its pointer removed from the parent node.

The following algorithm deletes a node in a binary search tree:

1. if there's a left subtree, use the rightmost child of the left subtree
2. otherwise, if there's a right subtree, use the leftmost child of the right subtree
3. otherwise, this is a leaf node, just delete it from its parent

Note that

• no data has to be moved, only links -- the tricky part is doing this in the right order so as not to lose a link you need for a later step
• for choice 1, if the rightmost child has a left subtree, that subtree has to be moved (relinked) to replace the child that was promoted
• for choice 2, if the leftmost child has a right subtree, that subtree has to be moved (relinked) to replace the child that was promoted
• for choice 2, there's less pointer updating because there's no left subtree to relink

Consider removing TALBOT from the tree in Figure 7.18. If we pick SELIGER, then we have to move SEFTON to where SELIGER was.
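
The link-only deletion described above can be sketched in C++. This version uses choice 1 (the rightmost child of the left subtree) when a left subtree exists; when there's only a right subtree, the whole right subtree simply moves up, an equivalent shortcut for choice 2. Names are illustrative, and nodes are assumed to be allocated with `new`:

```cpp
#include <cstddef>

struct BtNode {
    int info;
    BtNode* left;
    BtNode* right;
};

// Delete x from the tree rooted at bt. The pointer is passed by
// reference so the parent's link can be rewritten; only links move,
// never data.
void removeData(BtNode*& bt, int x) {
    if (bt == NULL) return;                            // x isn't in the tree
    if (x < bt->info) { removeData(bt->left, x); return; }
    if (x > bt->info) { removeData(bt->right, x); return; }

    BtNode* doomed = bt;
    if (doomed->left != NULL) {
        // Choice 1: promote the rightmost child of the left subtree.
        BtNode** link = &doomed->left;
        while ((*link)->right != NULL) link = &(*link)->right;
        BtNode* promoted = *link;
        *link = promoted->left;      // relink the promoted node's left subtree
        promoted->left = doomed->left;
        promoted->right = doomed->right;
        bt = promoted;
    } else if (doomed->right != NULL) {
        bt = doomed->right;          // no left subtree: right subtree moves up
    } else {
        bt = NULL;                   // leaf: just clear the parent's link
    }
    delete doomed;
}
```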

## Heaps

A heap is a binary tree (not a binary search tree) with the following properties:

• for every node, all nodes in all subtrees have data smaller than the data in that node
• the tree is full and dense

Full and dense means that only the bottom level of the tree has gaps, and the gaps are all on the right.

Being dense means the linear array representation is the most appropriate. A linked representation would waste space.

### What's the Point?

A heap is a great data structure for implementing a priority queue, because

• The next item to do (the item with the highest priority) is right at the top of the heap, i.e., at the root
• Adding a new item takes at most log2N swaps and compares (as shown below), which is better than the O(N) data transfers a sorted array would require.
• Removing an item and updating the tree takes at most log2N steps (as shown below).

It also turns out we can use the routines for adding and removing items to create an N log2N sort algorithm called heap sort.

The algorithm for adding data to a heap is simple, with one unintuitive aspect: we add the item at the bottom first, then move it upwards if necessary:

• add a new data item to the next empty gap in the bottom of the tree
• "walk" the data item up the tree, i.e.,

while the data is larger than its parent, swap the item with its parent, and repeat, checking the data item with its new parent

Walking up the tree will take at most log2N steps (compares and data transfers), because the tree is always balanced, so the height of the tree is log2N.
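
With the array representation, the add-and-walk-up step is only a few lines. A sketch assuming a max-heap stored in a `std::vector` with the root at index 0 (`heapAdd` is my name, not the course's):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Add x to the heap: append it at the next empty gap in the bottom
// level, then walk it up while it's larger than its parent.
void heapAdd(std::vector<int>& heap, int x) {
    heap.push_back(x);
    std::size_t i = heap.size() - 1;
    while (i > 0) {
        std::size_t parent = (i - 1) / 2;
        if (heap[i] <= heap[parent]) break;   // heap property restored
        std::swap(heap[i], heap[parent]);     // one step up the tree
        i = parent;
    }
}
```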

### Deleting Data

Both uses of heaps (priority queues and heap sort) only need to delete the root, so that's the only case we'll consider here. Deleting is somewhat similar to adding in that we'll replace the deleted root with another item from the tree and bubble it down. The algorithm is made only slightly more complex because we have two children to worry about instead of just one parent.

1. Remove the root.
2. Replace it with the rightmost item in the bottom level of the tree.

The new item is almost certainly in the wrong place because it's one of the smaller data elements, but this keeps the heap full and dense.

3. If the new root has no children bigger than it is, we're done.
4. If it has just a left child that is bigger, swap it with that child and repeat from step 3 on the left subtree.
5. If it has two children that are bigger, swap it with the larger child and repeat from step 3 on the affected subtree.

Note that

• In step 5, the child promoted is known to be bigger than the other child, so it must be the largest item in the two subtrees, and the heap property holds at its new position.

Walking down the tree will take at most log2N steps (compares and data transfers). We may have to do three comparisons (parent with each child and child against child) but 3 log2N is still O(log2N).
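
The matching removal, again on an array-backed max-heap (a sketch with illustrative names; it assumes the heap is non-empty):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Remove and return the root: replace it with the rightmost item in
// the bottom level, then walk the new root down, always swapping with
// the larger child. Precondition: heap is non-empty.
int heapRemoveRoot(std::vector<int>& heap) {
    int top = heap[0];
    heap[0] = heap.back();
    heap.pop_back();
    std::size_t i = 0, n = heap.size();
    while (true) {
        std::size_t left = 2 * i + 1, right = 2 * i + 2, largest = i;
        if (left < n && heap[left] > heap[largest]) largest = left;
        if (right < n && heap[right] > heap[largest]) largest = right;
        if (largest == i) break;              // no child is bigger: done
        std::swap(heap[i], heap[largest]);    // one step down the tree
        i = largest;
    }
    return top;
}
```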

### Implementing Heaps

As mentioned above, heaps are best implemented using a linear array. The move up and down the tree are simple jumps around the array and the array is mostly filled. The only downside is that data has to be moved around, unless, of course, what we really have is an array of pointers to data, which is the best way to go with non-numeric data.

### Heap Sort

There are two versions of heap sort that I know about. Both

• are "in-place" algorithms, that is, they sort an array directly without needing a second array
• have two phases, a heap building phase, and a heap to sorted array phase
• use walkDown to do the second phase

The versions differ on how the first phase is done.

#### Version 1 of Heap Sort

##### Phase 1
• Let N be the number of data elements.
• Let M be the index of the rightmost node with children.
• For I from M down to 1:
• Walk the node at I down to its proper position.
##### Phase 2
• For I from N to 2:
• Swap element 1 (the root) with element I.
• Remove element I from future consideration as a part of the heap.
• Walk the new root down to its proper position.

The first phase takes no more than O(N log2N) swaps and compares, because each walk down takes at most log2N steps and walkDown is called roughly N/2 times. In fact, it's actually better than O(N log2N): it's O(N), although proving this requires some mathematics.

Similarly, the second phase takes O(N log2N) swaps and compares for the same reason. This is both best- and worst-case behavior. In the best case, you have a heap when you start and no swaps are needed in the first phase, but they become necessary in the second phase.
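
The two phases of Version 1 can be sketched as follows (0-based indices, so the root is index 0 and the rightmost node with children is index N/2 - 1; function names are mine):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Walk the item at index i down within a[0..n), swapping with the
// larger child until the heap property holds beneath it.
void walkDown(std::vector<int>& a, std::size_t i, std::size_t n) {
    while (true) {
        std::size_t left = 2 * i + 1, right = 2 * i + 2, largest = i;
        if (left < n && a[left] > a[largest]) largest = left;
        if (right < n && a[right] > a[largest]) largest = right;
        if (largest == i) return;
        std::swap(a[i], a[largest]);
        i = largest;
    }
}

// Version 1 of heap sort: build a max-heap bottom-up, then repeatedly
// swap the root to the end and shrink the heap.
void heapSort(std::vector<int>& a) {
    std::size_t n = a.size();
    if (n < 2) return;
    for (std::size_t i = n / 2; i-- > 0; )    // phase 1: heapify
        walkDown(a, i, n);
    for (std::size_t last = n - 1; last > 0; --last) {
        std::swap(a[0], a[last]);             // phase 2: max goes to the end
        walkDown(a, 0, last);                 // restore the heap on the rest
    }
}
```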

#### Version 2 of Heap Sort

##### Phase 1
• Let N be the number of data elements.
• For I from 2 to N:
• Walk the node at I up to its proper position.

The first phase takes O(N log2N) swaps and compares, because each walkUp takes log2N steps and it's called N times.

Version 2 is simpler, especially if you've already implemented walking both up and down. Version 1, however, needs only walking down, and its Phase 1 is faster (O(N) instead of O(N log2N)).