EECS 311: DATA STRUCTURES

Assignment 4: Clustering with Minimum Spanning Trees

Goal

The goal of this assignment is to do some simple data analysis, using Minimum Spanning Trees to find clusters in multi-dimensional data.

The idea is inspired by this paper (PDF) by Xu, Olman and Xu on using MST's to quickly find clusters of similar gene expression data. Their article is quite readable and introduces both MST's and how they are used to find clusters.

Test your implementation on their toy example and the provided randomly generated data. If anyone has some real data suitable for clustering, available in an easily read text form, please let me know.

In order to let you focus on just the MST and clustering part of the problem, I've done most of the first two tasks for you.

Tasks

Set up

Create a new project directory called clusters. Download clusters.zip and extract the files in it into that directory. The archive contains the Magic Makefile, some sample data files, some test code, and two classes for holding the data in memory.

PA4 Task 1: Define distance for the Sample class

sample.h and sample.cpp define a Sample class to hold multi-dimensional data with integer ID's. The API is below. Everything is defined already except for distance(). You have to do that one.

Constructor: Sample::Sample( int n )
Creates an empty sample object for an n-dimensional data point with an integer ID.
int Sample::size() const returns the dimensionality, i.e., N, for this sample.
Sample::id is the ID, or -1 if no data has been stored.
std::vector Sample::data is the stored data, if any.
double Sample::distance( const Sample & sample ) const
Returns the Euclidean distance from this sample to sample. An exception is thrown if the samples are not the same dimensionality.
istream & operator>>( istream & in, Sample & sample ) reads an ID and size() doubles into sample, from a stream, assuming all values are separated by whitespace, e.g., tabs or spaces.
bool operator<( const Sample & sample ) const returns true if this sample's ID is less than sample's.

In tests.cpp, comment out all code except for the test readSample.

Make sure all this test passes with your distance() function.

Submit to Code Critic: your distance() function and any other code you added.

PA4 Task 2: Test the SampleSet class

sampleset.h and sampleset.cpp define a SampleSet class to hold a set of samples. The API is:

Constructor: SampleSet( int n )
Creates an empty sample set for samples of size n.
size_t SampleSet::readFile( std::string file )
Reads sample data from the given file into the sample set, and returns the number of samples read. The first line of the file is assumed to be a header line and skipped.
size_t SampleSet::size( ) const
Returns the number of samples in the set.
Sample operator[]( int n ) returns the nth sample. Throws an exception if there is no nth sample.

Uncomment the test readXuToyData in tests.cpp and make sure it passes.

There is nothing to submit to Code Critic unless you added some code.

PA4 Task 3: Add clustering

Implement the following additional member functions of SampleSet. They are already declared in sampleset.h.

int SampleSet::cluster( )
Partitions the samples into clusters. Returns the number of clusters formed.
int SampleSet::clusterCount( ) const
Returns the number of clusters. If cluster() has not been called, returns the number of samples.
size_t SampleSet::clusterSize( int n ) const
Returns the size of the nth cluster (zero-based).

To cluster the samples, follow the ideas in the paper by Xu, Olman and Xu. To build the MST,

implement Kruskal's algorithm, as described in the book (Figure 9.58)
define an Edge structure for edges, holding the indices of the two vertices (samples) connected, and the weight (distance) between them
use a C++ priority_queue of edge objects
define a class for disjoint sets, following the code outlined in Weiss, but improving the API as you see fit (see note about union by size below)

The authors of the paper explore a number of options for deciding when to stop cutting links and making clusters. Brainstorm ideas for doing this with other students. (But don't share code!) For example, I had good luck with the simple idea of adding links until there was a sudden "large" jump in the size of links added to the MST. I observed the pattern of sizes and picked a threshold for "large." This doesn't work if the clusters are fairly close together. I bet there's an even better rule that can tell from the links that are rejected that all intra-cluster links are being selected first. That means the remaining links are going to connect clusters and looping should stop.

If you can find a way to add links until clustering has ended, it's fairly easy to use the internal union/find data structure to implement the cluster count and size functions, especially if you implement union by size. The book shows union by height, but both are simple and have good performance.

Uncomment the code that defines storeSortedSizes() and the remaining tests and make sure they pass.

Submit to Code Critic: all the code you added. Clearly label which items are in header files and which are in implementation files. Clearly label which data members and member functions are public and which are private.

Wrap Up

When all of the above has been completed, and approved as necessary via the Code Critic, check over your code and send it in.

Make sure your program runs standalone, with no user input of any kind. That includes not stopping at the end with a "hit any key to continue" message. If I can't run your program automatically from a script, it will be returned.

Your code should compile without errors or warnings.

Be sure your test code file has an opening header comment that identifies the authorship of the code, as follows:

// EECS311 PA4 solution: Clustering
// Date: November 12, 2009
// Authors: T.H.E.Hacker
// 
// Except for code provided by the instructor or textbook, this code was written
// solely by the authors listed above, and not copied from another student
// or external source.

Run make handin to generate a Zip archive of all your source code.

Email your analysis document and the Zip archive produced by the Magic Makefile to me. Use the Subject EECS 311 PA 4 Clustering.