Due at 11:59 PM Monday, October 31st via e-mail to ddowney at eecs.northwestern.edu. Use EECS 395/495 Homework 4 as the e-mail subject line. PDF format is preferred, though plain text, Word, and HTML are also acceptable.

  1. (3 points) Below, three machine learning tasks are listed. For each, give the most appropriate graphical model type (Bayes Net or Markov Net), with a one-sentence justification. Also, for each task, exactly one of the following techniques is particularly necessary: (1) Structure learning, (2) MAP estimation, or (3) sampling-based (rather than exact) inference. State which of these techniques you'd use for the task, with a one-sentence justification. You should use each technique exactly once.
    1. Disease diagnosis: you're given a large database of patient data, listing symptoms, test results, and ultimate diagnoses. You also have a compendium of medical knowledge that lists some known associations between several common diseases and their symptoms. Your goal is to build a predictor of disease given symptoms and test results, based on this data.
    2. Image segmentation: You have a large set of images, and would like to segment each into k contiguous regions. You want to represent the segment of each pixel as a k-valued discrete random variable, with adjacent pixels more likely to be in the same region when they have similar colors.
    3. Large-scale document classification. Let's say you work at a major search engine and you have a set of four billion pages, of which approximately half have received at least one click in search results, and the other half have not. Your goal is to build a model that predicts, based on the words in a new document, whether it will be clicked on in search results or not.
  2. Here you'll complete an implementation of an EM codebase for performing a "set expansion" task. You'll want to download the code and data files.
    The Bayes Net you're performing EM over has the structure word->topic->context, where the topics are never observed. The topic variable is often referred to as "z" in the code. The words (see wordDictionary.txt) are the names of either companies or countries, and the contexts (see contextDictionary.txt) are two-word phrases observed to the left or the right of the words in a set of sentences on the Web (the original text is in corpusSelection.txt). The actual occurrences of words in contexts are listed (using the IDs from the dictionaries) in data.txt.
    The code already has routines for reading/writing these files, so you won't need to process them.
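    Because the net has the chain structure word->topic->context, the joint probability factorizes as P(w, z, c) = P(w) P(z | w) P(c | z), and the likelihood of an observed (word, context) pair sums out the hidden topic z. The sketch below illustrates this factorization; the parameter arrays and names here are hypothetical stand-ins, not the actual fields of TextMasterEM.java:

```java
// Minimal sketch of the word->topic->context factorization.
// pW[w] = P(w), pZgivenW[w][z] = P(z | w), pCgivenZ[z][c] = P(c | z)
// (all names are illustrative, not taken from TextMasterEM.java).
public class ChainModelSketch {
    // P(w, c) = P(w) * sum_z P(z | w) * P(c | z), marginalizing the hidden topic z
    static double jointProb(int w, int c, double[] pW,
                            double[][] pZgivenW, double[][] pCgivenZ) {
        double sum = 0.0;
        for (int z = 0; z < pZgivenW[w].length; z++) {
            sum += pZgivenW[w][z] * pCgivenZ[z][c];
        }
        return pW[w] * sum;
    }

    public static void main(String[] args) {
        double[] pW = {0.5, 0.5};
        double[][] pZgivenW = {{1.0, 0.0}, {0.0, 1.0}};   // each word deterministic in z
        double[][] pCgivenZ = {{0.7, 0.3}, {0.2, 0.8}};
        // P(w=0, c=0) = 0.5 * (1.0 * 0.7 + 0.0 * 0.2) = 0.35
        System.out.println(jointProb(0, 0, pW, pZgivenW, pCgivenZ));
    }
}
```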

    You can think of training the model as a dimensionality reduction task, which attempts to summarize all of the contextual information regarding each word w in data.txt in a small vector of numbers P(z | w). Then, we can employ the model for a set-expansion task: we efficiently expand a set of "seed" examples (for example, UK, Libya, Myanmar) by searching for those w' which have P(z | w') similar to the seeds (for the previous example, this is hopefully a list of countries).
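    To make the set-expansion idea concrete, here is a small sketch that averages the seeds' P(z | w) vectors and ranks every word by similarity to that centroid. Cosine similarity and all names here are my own choices for illustration; the provided test script may use a different similarity measure:

```java
import java.util.*;

// Illustrative set expansion over topic vectors P(z | w).
// The similarity measure (cosine) and all identifiers are hypothetical,
// not necessarily what TextMasterEM.java implements.
public class SetExpansionSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int z = 0; z < a.length; z++) {
            dot += a[z] * b[z];
            na  += a[z] * a[z];
            nb  += b[z] * b[z];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Average the seeds' topic vectors, then rank all words by similarity to the centroid.
    static List<String> expand(Map<String, double[]> pZgivenW, List<String> seeds) {
        int k = pZgivenW.get(seeds.get(0)).length;
        double[] centroid = new double[k];
        for (String s : seeds)
            for (int z = 0; z < k; z++)
                centroid[z] += pZgivenW.get(s)[z] / seeds.size();
        List<String> words = new ArrayList<>(pZgivenW.keySet());
        words.sort((a, b) -> Double.compare(cosine(pZgivenW.get(b), centroid),
                                            cosine(pZgivenW.get(a), centroid)));
        return words;
    }

    public static void main(String[] args) {
        Map<String, double[]> pZgivenW = new HashMap<>();
        pZgivenW.put("UK",    new double[]{0.9, 0.1});
        pZgivenW.put("Libya", new double[]{0.8, 0.2});
        pZgivenW.put("Shell", new double[]{0.1, 0.9});
        // Seeding with "UK" should rank "Libya" above "Shell".
        System.out.println(expand(pZgivenW, Arrays.asList("UK")));
    }
}
```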

    Even if you don't know Java, you should be able to complete this assignment. The code is in the file TextMasterEM.java, and can be compiled using
    javac TextMasterEM.java
    on any machine with a recent Java jdk installation. You can then run the code in two different modes, for example:
    java TextMasterEM train data.txt 10000 model.txt 6
    java TextMasterEM test model.txt wordDictionary.txt companies.txt Shell BT Dow
    The first example trains a model on data.txt for 10000 iterations using 6 topics; the second tests the model using companies.txt as ground truth and the three "seed" examples Shell, BT, and Dow. The test script outputs the words in decreasing order of similarity to the seeds, along with the "average precision" performance measure of the ranked list and the baseline performance of a random list. A sample model is included in the download, so you can try the test script before you fix the training routine.
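    For reference, "average precision" of a ranked list averages the precision at each rank where a correct item appears. The following is a hypothetical sketch of that measure; the method name, the sample data, and any tie-breaking conventions are mine, not necessarily those of the provided test script:

```java
import java.util.*;

// Illustrative average-precision computation for a ranked list against a
// ground-truth set. All names here are hypothetical, not from TextMasterEM.java.
public class AveragePrecisionSketch {
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        double hits = 0.0, sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits += 1.0;
                sum  += hits / (i + 1);   // precision at this rank
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("UK", "Shell", "Libya", "Myanmar");
        Set<String> countries = new HashSet<>(Arrays.asList("UK", "Libya", "Myanmar"));
        // Correct items appear at ranks 1, 3, 4, so
        // AP = (1/1 + 2/3 + 3/4) / 3 ~= 0.806
        System.out.println(averagePrecision(ranked, countries));
    }
}
```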

    Exercises:
    1. (6 points) Complete the training code. Search for the string //BEGIN code to be changed in the TextMasterEM.java file; use the comments there and your knowledge of the EM algorithm to fix the algorithm. If you use the skeleton code that's there, only four total lines need to be edited. Submit your new code along with the training script's screen output (average log likelihood, etc.) for at least 500 iterations.
    2. (3 points) Experiment with local maxima and varying topic size. For 4 and 10 topics, execute three training runs of 2000 iterations each, and test the resulting models in the following manner:
      java TextMasterEM test model.txt wordDictionary.txt companies.txt Shell BT Dow
      java TextMasterEM test model.txt wordDictionary.txt countries.txt Libya United+Kingdom Myanmar
      For each run, list the final average log likelihood of the training script, and the "average precision" of the ranked list in testing. Question: are local maxima a problem in training, and is the effect worse for a particular topic size? Hypothesize as to why this might be the case.
    3. (2 points) Do some other experiment you find interesting. Some ideas: try expanding some different subset of the words (e.g., developing countries, words that begin with A, etc.), train for a much longer time and measure performance, or see if you can understand the hidden topics' "meaning" in terms of their words or contexts.