EECS 311: DATA STRUCTURES

Programming Assignment 4

Goal

The goal of this assignment is to do some simple bioinformatics, specifically, using Minimum Spanning Trees to find clusters in gene expression data.

The idea is inspired by this paper (PDF) by Xu, Olman and Xu on using MST's to quickly find clusters of similar gene expression data. Their article is quite readable and introduces both MST's and how they are used to find clusters.

Be sure to follow the 311 coding standards. You should also look at the miscellaneous EECS 311 C++ tips, focussing on items particularly useful in various assignments.

Task

There are two parts to this assignment:

Pair-programming is allowed if it is true pair programming, i.e., coding done by both partners together, on one computer, switching every 20 minutes or so. The analyses should be done individually.

Resources and Requirements

Use the code in mst.zip to test your code. This archive includes test data, described below, and the following code files:

To make test_mst.cpp work, you'll need to implement:

The Sample class

The Sample class has to support the code in the test file, which means:

operator>> has to be overloaded to read data into a sample. If you look at the test data files, you can see that

The Graph class

Although we will only make a Graph of Sample objects, your implementation of Graph should be generic and make no assumptions about vertices.

The Graph class will need to support this code from the test file:

Your Graph should use the minimum spanning tree approach described in Xu, Olman and Xu. It should build the MST using Kruskal's algorithm, as described in the book (Figure 9.58). That code should in turn use Weiss' DisjSets class for disjoint sets for fast union/find, included in the archive.

When determining how many clusters, i.e., how many edges to remove from the MST, you can use the simplest rule, described in Section 3.1 of the paper.

I made one change to the class. union(u, v) returns whichever node was selected to be the new root. You may or may not find this useful.

Testing

The test data in mst.zip to test your code includes:

Your code should determine that 4 clusters are optimal for the toy data, i.e., 3 edges should be removed from the MST. With any luck, it will also see that two clusters are optimal for for the cancer data.

Finally, note that the test file defines a preprocessor symbol TRACE, initially set to 0. Use this in your code for samples and graphs to control debugging output. When TRACE is 0, there should be no debugging output other than what is given in the main test file. When you submit, please restore TRACE to 0.

What to Hand in

Submit a zipped or tar-gzipped archive of one directory called last-name_first-initial_pa4, containing:

If you worked alone, call the archive last-name_first-initial_pa4.zip (use .tgz for tar-gzipped files). Example: riesbeck_c_pa4.zip.

If you pair-programmed, call the archive last-name1_last_name2_pa4.zip (use .tgz for tar-gzipped files). Example: kornhauser_tarzia_pa4.zip.

Important! Make sure your archive meets the general code submission requirements.

Every source file should have an opening header comment of the following form:

// Filename: mst.hpp (or whatever)
// EECS311 PA4 solution: Minimum spanning tree
// Date: May 23, 2007
// Authors: Dan Kornhauser, Stephen Tarzia
// Programming sessions: 5/2/07: 4-6pm; 5/6/07: 3-6pm; 5/8/07: 10am-noon
// 
// Except for code provided by the instructor or textbook, this code was written
// solely by the authors listed above, and not copied from another student
// or external source.

Email your code, in a compressed archive, to me with the Subject line EECS 311 PA4.


Valid HTML 4.01 Strict