|EECS 311: DATA STRUCTURES||
The goal of this assignment is to do some simple data analysis, using Minimum Spanning Trees to find clusters in multi-dimensional data.
The idea is inspired by this paper (PDF) by Xu, Olman and Xu on using MST's to quickly find clusters of similar gene expression data. Their article is quite readable and introduces both MST's and how they are used to find clusters.
Test your implementation on their toy example and the provided randomly generated data. If anyone has some real data suitable for clustering, available in an easily read text form, please let me know.
In order to let you focus on just the MST and clustering part of the problem, I've done most of the first two tasks for you.
Create a new project directory called clusters. Download clusters.zip and extract the files in it into that directory. The archive contains the Magic Makefile, some sample data files, some test code, and two classes for holding the data in memory.
sample.h and sample.cpp
Sample class to hold multi-dimensional data with
integer ID's. The API is below. Everything is defined already
except for distance(). You have to do that one.
In tests.cpp, comment out all code except for the test readSample.
Make sure all this test passes with your distance() function.
Submit to Code Critic: your distance() function and any other code you added.
sampleset.h and sampleset.cpp
SampleSet class to hold a set of
samples. The API is:
Uncomment the test readXuToyData in tests.cpp and make sure it passes.
There is nothing to submit to Code Critic unless you added some code.
Implement the following additional member functions of SampleSet. They are already declared in sampleset.h.
To cluster the samples, follow the ideas in the paper by Xu, Olman and Xu. To build the MST,
The authors of the paper explore a number of options for deciding when to stop cutting links and making clusters. Brainstorm ideas for doing this with other students. (But don't share code!) For example, I had good luck with the simple idea of adding links until there was a sudden "large" jump in the size of links added to the MST. I observed the pattern of sizes and picked a threshold for "large." This doesn't work if the clusters are fairly close together. I bet there's an even better rule that can tell from the links that are rejected that all intra-cluster links are being selected first. That means the remaining links are going to connect clusters and looping should stop.
If you can find a way to add links until clustering has ended, it's fairly easy to use the internal union/find data structure to implement the cluster count and size functions, especially if you implement union by size. The book shows union by height, but both are simple and have good performance.
Uncomment the code that defines storeSortedSizes() and the remaining tests and make sure they pass.
Submit to Code Critic: all the code you added. Clearly label which items are in header files and which are in implementation files. Clearly label which data members and member functions are public and which are private.
When all of the above has been completed, and approved as necessary via the Code Critic, check over your code and send it in.
Make sure your program runs standalone, with no user input of any kind. That includes not stopping at the end with a "hit any key to continue" message. If I can't run your program automatically from a script, it will be returned.
Your code should compile without errors or warnings.
Be sure your test code file has an opening header comment that identifies the authorship of the code, as follows:
// EECS311 PA4 solution: Clustering // Date: November 12, 2009 // Authors: T.H.E.Hacker // // Except for code provided by the instructor or textbook, this code was written // solely by the authors listed above, and not copied from another student // or external source.
Run make handin to generate a Zip archive of all your source code.
Email your analysis document and the Zip archive produced by the Magic Makefile to me. Use the Subject EECS 311 PA 4 Clustering.