EECS 311: DATA STRUCTURES |
Home Course Info Links Grades |
Lectures Newsgroup Homework Exams |
The goal of this assignment is to do some simple bioinformatics, specifically, using Minimum Spanning Trees to find clusters in gene expression data.
The idea is inspired by this paper (PDF) by Xu, Olman and Xu on using MST's to quickly find clusters of similar gene expression data. Their article is quite readable and introduces both MST's and how they are used to find clusters.
Be sure to follow the 311 coding standards. You should also look at the miscellaneous EECS 311 C++ tips, focussing on items particularly useful in various assignments.
There are two parts to this assignment:
Pair-programming is allowed if it is true pair programming, i.e., coding done by both partners together, on one computer, switching every 20 minutes or so. The analyses should be done individually.
Use the code in mst.zip to test your code. This archive includes test data, described below, and the following code files:
Graph
of Sample
objects in order to create clusters of data for the above files.DisjSets.hpp
and
DisjSets.cpp
To make test_mst.cpp work, you'll need to implement:
Sample
class for sample data in sample.hpp
and
sample.cpp
Graph<typename T>
template class, whose vertices are
objects of type T
, with weights of type double
Sample
classThe Sample
class has to support the code in the test
file, which means:
Sample sample(rank)
should construct a sample object
where rank
is an integer, indicating how many pieces of data, of
type double
, a sample has.sample.id
returns the ID number of the samplesample.euclideanDistance(sample2)
should return the Euclidean distance
between the two samples, as a double
.in >> sample
should read a line of data from
an input stream in
into a sample.operator>>
has to be overloaded to read data into a sample. If you look at the
test data files, you can see that
Graph
classAlthough we will only make a Graph
of Sample
objects, your implementation of Graph
should be generic and
make no assumptions about vertices.
The Graph
class will need to support this code
from the test file:
Graph sampleGraph(samples.begin(), samples.end())
should be a templated constructor that
builds a graph with vertices taken from the range given
by the two iterators.sampleGraph.addEdge(s1, s2, s1.euclideanDistance(s2))
should add an edge to the graph given two vertices of type T
and a weight of type double
.
Behavior is undefined if either vertex is not in the range given to the
constructor.sampleGraph.partition(clusterLimit)
should
internally partition the graph's vertices into
the optimal number of clusters, between 1 and clusterLimit
.sampleGraph.getPartition(i, back_inserter(nodes))
should be a templated member function that
copies the nodes in the i
'th partition (zero-based) into iterator given.
Behavior is undefined if partition()
has not been called.sampleGraph.edgeSize()
and sampleGraph.vertexSize()
should return the number of edges and vertices in the graph, respectively.
These should not be affected by partitioning.Your Graph
should use the minimum spanning tree approach
described in Xu, Olman and Xu. It should build the MST using
Kruskal's algorithm, as described in the book
(Figure 9.58). That code should in turn use Weiss' DisjSets
class for disjoint sets
for fast union/find, included in the archive.
When determining how many clusters, i.e., how many edges to remove from the MST, you can use the simplest rule, described in Section 3.1 of the paper.
I made one change to the class. union(u, v)
returns
whichever node was selected to be the new root. You may or may not find this useful.
The test data in mst.zip to test your code includes:
Your code should determine that 4 clusters are optimal for the toy data, i.e., 3 edges should be removed from the MST. With any luck, it will also see that two clusters are optimal for for the cancer data.
Finally, note that the test file defines a preprocessor
symbol TRACE
, initially set to 0. Use this in your
code for samples and graphs to control debugging output. When
TRACE
is 0, there should be no debugging output
other than what is given in the main test file. When you submit,
please restore TRACE
to 0.
Submit a zipped or tar-gzipped archive of one directory called last-name_first-initial_pa4, containing:
If you worked alone, call the archive
last-name_first-initial_pa4.zip (use .tgz
for tar-gzipped files). Example: riesbeck_c_pa4.zip.
If you pair-programmed, call the archive
last-name1_last_name2_pa4.zip (use .tgz
for tar-gzipped files). Example: kornhauser_tarzia_pa4.zip.
Important! Make sure your archive meets the general code submission requirements.
Every source file should have an opening header comment of the following form:
// Filename: mst.hpp (or whatever) // EECS311 PA4 solution: Minimum spanning tree // Date: May 23, 2007 // Authors: Dan Kornhauser, Stephen Tarzia // Programming sessions: 5/2/07: 4-6pm; 5/6/07: 3-6pm; 5/8/07: 10am-noon // // Except for code provided by the instructor or textbook, this code was written // solely by the authors listed above, and not copied from another student // or external source.
Email your code, in a compressed archive, to me with the Subject line EECS 311 PA4.