EECS 311: DATA STRUCTURES

Programming Assignment 4

Goal

The goal of this assignment is to do some simple bioinformatics, specifically, using Minimum Spanning Trees to find clusters in gene expression data.

The idea is inspired by this paper (PDF) by Xu, Olman and Xu on using MST's to quickly find clusters of similar gene expression data. Their article is quite readable and introduces both MST's and how they are used to find clusters.

Be sure to follow the 311 coding standards. You should also look at the miscellaneous EECS 311 C++ tips, focussing on items particularly useful in various assignments.

Task

There are two parts to this assignment:

Implement the MST-based clustering algorithm in Section 3.1 of the article by Xu, Olman and Xu.
Test your implementation on some toy data and some real cancer data.

Pair-programming is allowed if it is true pair programming, i.e., coding done by both partners together, on one computer, switching every 20 minutes or so. The analyses should be done individually.

Resources and Requirements

Use the code in mst.zip to test your code. This archive includes test data, described below, and the following code files:

A program test_mst.cpp that creates a Graph of Sample objects in order to create clusters of data for the above files.
Weiss' implementation of disjoint sets in DisjSets.hpp and DisjSets.cpp

To make test_mst.cpp work, you'll need to implement:

a Sample class for sample data in sample.hpp and sample.cpp
a Graph<typename T> template class, whose vertices are objects of type T, with weights of type double

The `Sample` class

The Sample class has to support the code in the test file, which means:

Sample sample(rank) should construct a sample object where rank is an integer, indicating how many pieces of data, of type double, a sample has.
sample.id returns the ID number of the sample
sample.euclideanDistance(sample2) should return the Euclidean distance between the two samples, as a double.
in >> sample should read a line of data from an input stream in into a sample.

operator>> has to be overloaded to read data into a sample. If you look at the test data files, you can see that

The first line is a header line and can be ignored.
Every other line has one sample in tab-delimited form.
The first data field on each line has an integer id number, plus possibly other text data that should be ignored.
The first data field may or may not be surrounded by double quotation marks.

The `Graph` class

Although we will only make a Graph of Sample objects, your implementation of Graph should be generic and make no assumptions about vertices.

The Graph class will need to support this code from the test file:

Graph sampleGraph(samples.begin(), samples.end()) should be a templated constructor that builds a graph with vertices taken from the range given by the two iterators.
sampleGraph.addEdge(s1, s2, s1.euclideanDistance(s2)) should add an edge to the graph given two vertices of type T and a weight of type double. Behavior is undefined if either vertex is not in the range given to the constructor.
sampleGraph.partition(clusterLimit) should internally partition the graph's vertices into the optimal number of clusters, between 1 and clusterLimit.
sampleGraph.getPartition(i, back_inserter(nodes)) should be a templated member function that copies the nodes in the i'th partition (zero-based) into iterator given. Behavior is undefined if partition() has not been called.
sampleGraph.edgeSize() and sampleGraph.vertexSize() should return the number of edges and vertices in the graph, respectively. These should not be affected by partitioning.

Your Graph should use the minimum spanning tree approach described in Xu, Olman and Xu. It should build the MST using Kruskal's algorithm, as described in the book (Figure 9.58). That code should in turn use Weiss' DisjSets class for disjoint sets for fast union/find, included in the archive.

When determining how many clusters, i.e., how many edges to remove from the MST, you can use the simplest rule, described in Section 3.1 of the paper.

I made one change to the class. union(u, v) returns whichever node was selected to be the new root. You may or may not find this useful.

Testing

The test data in mst.zip to test your code includes:

Toy data in toy-data.txt, which is my hand-built reconstruction of the data in Figure 1 of the Xu, Olman and Xu article.
A cancer data file of reasonable size and simple format, from this course page. The data is from two different types of cancer cells, so, if all goes well, two clusters should be optimal.

Your code should determine that 4 clusters are optimal for the toy data, i.e., 3 edges should be removed from the MST. With any luck, it will also see that two clusters are optimal for for the cancer data.

Finally, note that the test file defines a preprocessor symbol TRACE, initially set to 0. Use this in your code for samples and graphs to control debugging output. When TRACE is 0, there should be no debugging output other than what is given in the main test file. When you submit, please restore TRACE to 0.

What to Hand in

Submit a zipped or tar-gzipped archive of one directory called last-name_first-initial_pa4, containing:

all your source files
your output in a plain text file called output.txt
a formal analysis of the complexity of the algorithms and data structures you use to optimally partition the samples.

If you worked alone, call the archive last-name_first-initial_pa4.zip (use .tgz for tar-gzipped files). Example: riesbeck_c_pa4.zip.

If you pair-programmed, call the archive last-name1_last_name2_pa4.zip (use .tgz for tar-gzipped files). Example: kornhauser_tarzia_pa4.zip.

Important! Make sure your archive meets the general code submission requirements.

Every source file should have an opening header comment of the following form:

// Filename: mst.hpp (or whatever)
// EECS311 PA4 solution: Minimum spanning tree
// Date: May 23, 2007
// Authors: Dan Kornhauser, Stephen Tarzia
// Programming sessions: 5/2/07: 4-6pm; 5/6/07: 3-6pm; 5/8/07: 10am-noon
// 
// Except for code provided by the instructor or textbook, this code was written
// solely by the authors listed above, and not copied from another student
// or external source.

Email your code, in a compressed archive, to me with the Subject line EECS 311 PA4.