|EECS 311: DATA STRUCTURES||
The goal of this assignment is to do some simple bioinformatics, specifically, using Minimum Spanning Trees to find clusters in gene expression data.
The idea is inspired by this paper (PDF) by Xu, Olman and Xu on using MST's to quickly find clusters of similar gene expression data. Their article is quite readable and introduces both MST's and how they are used to find clusters.
Be sure to follow the 311 coding standards. You should also look at the miscellaneous EECS 311 C++ tips, focussing on items particularly useful in various assignments.
There are two parts to this assignment:
Pair-programming is allowed if it is true pair programming, i.e., coding done by both partners together, on one computer, switching every 20 minutes or so. The analyses should be done individually.
Use the code in mst.zip to test your code. This archive includes test data, described below, and the following code files:
Sampleobjects in order to create clusters of data for the above files.
To make test_mst.cpp work, you'll need to implement:
Sampleclass for sample data in
Graph<typename T>template class, whose vertices are objects of type
T, with weights of type
Sample class has to support the code in the test
file, which means:
Sample sample(rank)should construct a sample object where
rankis an integer, indicating how many pieces of data, of type
double, a sample has.
sample.idreturns the ID number of the sample
sample.euclideanDistance(sample2)should return the Euclidean distance between the two samples, as a
in >> sampleshould read a line of data from an input stream
ininto a sample.
has to be overloaded to read data into a sample. If you look at the
test data files, you can see that
Although we will only make a
objects, your implementation of
Graph should be generic and
make no assumptions about vertices.
Graph class will need to support this code
from the test file:
Graphshould be a templated constructor that builds a graph with vertices taken from the range given by the two iterators.
sampleGraph.addEdge(s1, s2, s1.euclideanDistance(s2))should add an edge to the graph given two vertices of type
Tand a weight of type
double. Behavior is undefined if either vertex is not in the range given to the constructor.
sampleGraph.partition(clusterLimit)should internally partition the graph's vertices into the optimal number of clusters, between 1 and
sampleGraph.getPartition(i, back_inserter(nodes))should be a templated member function that copies the nodes in the
i'th partition (zero-based) into iterator given. Behavior is undefined if
partition()has not been called.
sampleGraph.vertexSize()should return the number of edges and vertices in the graph, respectively. These should not be affected by partitioning.
Graph should use the minimum spanning tree approach
described in Xu, Olman and Xu. It should build the MST using
Kruskal's algorithm, as described in the book
(Figure 9.58). That code should in turn use Weiss'
DisjSets class for disjoint sets
for fast union/find, included in the archive.
When determining how many clusters, i.e., how many edges to remove from the MST, you can use the simplest rule, described in Section 3.1 of the paper.
I made one change to the class.
union(u, v) returns
whichever node was selected to be the new root. You may or may not find this useful.
The test data in mst.zip to test your code includes:
Your code should determine that 4 clusters are optimal for the toy data, i.e., 3 edges should be removed from the MST. With any luck, it will also see that two clusters are optimal for for the cancer data.
Finally, note that the test file defines a preprocessor
TRACE, initially set to 0. Use this in your
code for samples and graphs to control debugging output. When
TRACE is 0, there should be no debugging output
other than what is given in the main test file. When you submit,
TRACE to 0.
Submit a zipped or tar-gzipped archive of one directory called last-name_first-initial_pa4, containing:
If you worked alone, call the archive
for tar-gzipped files). Example: riesbeck_c_pa4.zip.
If you pair-programmed, call the archive
for tar-gzipped files). Example: kornhauser_tarzia_pa4.zip.
Important! Make sure your archive meets the general code submission requirements.
Every source file should have an opening header comment of the following form:
// Filename: mst.hpp (or whatever) // EECS311 PA4 solution: Minimum spanning tree // Date: May 23, 2007 // Authors: Dan Kornhauser, Stephen Tarzia // Programming sessions: 5/2/07: 4-6pm; 5/6/07: 3-6pm; 5/8/07: 10am-noon // // Except for code provided by the instructor or textbook, this code was written // solely by the authors listed above, and not copied from another student // or external source.
Email your code, in a compressed archive, to me with the Subject line EECS 311 PA4.