Assignment 4: Clustering with Minimum Spanning Trees


The goal of this assignment is to do some simple data analysis, using Minimum Spanning Trees to find clusters in multi-dimensional data.

The idea is inspired by this paper (PDF) by Xu, Olman and Xu on using MST's to quickly find clusters of similar gene expression data. Their article is quite readable and introduces both MST's and how they are used to find clusters.

Test your implementation on their toy example and the provided randomly generated data. If anyone has some real data suitable for clustering, available in an easily read text form, please let me know.

In order to let you focus on just the MST and clustering part of the problem, I've done most of the first two tasks for you.


Set up

Create a new project directory called clusters. Download and extract the files in it into that directory. The archive contains the Magic Makefile, some sample data files, some test code, and two classes for holding the data in memory.

PA4 Task 1: Define distance for the Sample class

sample.h and sample.cpp define a Sample class to hold multi-dimensional data with integer ID's. The API is below. Everything is defined already except for distance(). You have to do that one.

In tests.cpp, comment out all code except for the test readSample.

Make sure all this test passes with your distance() function.

Submit to Code Critic: your distance() function and any other code you added.

PA4 Task 2: Test the SampleSet class

sampleset.h and sampleset.cpp define a SampleSet class to hold a set of samples. The API is:

Uncomment the test readXuToyData in tests.cpp and make sure it passes.

There is nothing to submit to Code Critic unless you added some code.

PA4 Task 3: Add clustering

Implement the following additional member functions of SampleSet. They are already declared in sampleset.h.

To cluster the samples, follow the ideas in the paper by Xu, Olman and Xu. To build the MST,

The authors of the paper explore a number of options for deciding when to stop cutting links and making clusters. Brainstorm ideas for doing this with other students. (But don't share code!) For example, I had good luck with the simple idea of adding links until there was a sudden "large" jump in the size of links added to the MST. I observed the pattern of sizes and picked a threshold for "large." This doesn't work if the clusters are fairly close together. I bet there's an even better rule that can tell from the links that are rejected that all intra-cluster links are being selected first. That means the remaining links are going to connect clusters and looping should stop.

If you can find a way to add links until clustering has ended, it's fairly easy to use the internal union/find data structure to implement the cluster count and size functions, especially if you implement union by size. The book shows union by height, but both are simple and have good performance.

Uncomment the code that defines storeSortedSizes() and the remaining tests and make sure they pass.

Submit to Code Critic: all the code you added. Clearly label which items are in header files and which are in implementation files. Clearly label which data members and member functions are public and which are private.

Wrap Up

When all of the above has been completed, and approved as necessary via the Code Critic, check over your code and send it in.

Make sure your program runs standalone, with no user input of any kind. That includes not stopping at the end with a "hit any key to continue" message. If I can't run your program automatically from a script, it will be returned.

Your code should compile without errors or warnings.

Be sure your test code file has an opening header comment that identifies the authorship of the code, as follows:

// EECS311 PA4 solution: Clustering
// Date: November 12, 2009
// Authors: T.H.E.Hacker
// Except for code provided by the instructor or textbook, this code was written
// solely by the authors listed above, and not copied from another student
// or external source.

Run make handin to generate a Zip archive of all your source code.

Email your analysis document and the Zip archive produced by the Magic Makefile to me. Use the Subject EECS 311 PA 4 Clustering.

Valid HTML 4.01 Strict