Due Monday, March 1 at 11:59 PM via e-mail to BOTH jiangxu2011 at u.northwestern.edu and ddowney at eecs.northwestern.edu. Use "EECS 395/495 Homework 4" as the e-mail subject line. PDF format is preferred, though plain text, Word, and HTML are also acceptable.

1. Consider a Bayes Net A->B->C with three binary variables. Let P(A=1) = 0.1, P(B=1 | A=0) = 0.1, P(B=1 | A=1) = 0.8, P(C=1 | B=0) = 0.2, and P(C=1 | B=1) = 0.9. We're attempting to answer the query P(B | C=1) using various samplers. (A small simulation sketch of this network appears after part (iv).)
(i) 1 point. Using a rejection sampler, what fraction of samples will we discard because they don't match the evidence?
(ii) 1 point. Using a likelihood weighting approach, what weight is assigned to the samples with B=0? What fraction of samples have B=0?
(iii) 1 point. Using a Gibbs sampler, on average what fraction of the time will our sampler spend in the state with B=1? How is this quantity related to the P(B=1 | C=1) we are trying to estimate?
(iv) 1 point. In a sentence, state which sampler you'd prefer for this task and why, based on your answers above.
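For intuition (and as an optional sanity check on your answers), here is a minimal sketch of a rejection sampler for this chain using the CPT values above. The class and variable names are illustrative only; they are not part of any provided code.

import java.util.Random;

// Illustrative rejection sampler for the chain A -> B -> C with the CPTs above.
// Samples (A, B, C) from the prior and keeps only samples consistent with the evidence C = 1.
public class ChainRejectionSampler {
    public static void main(String[] args) {
        Random rng = new Random();
        int total = 1000000, kept = 0, keptWithB1 = 0;
        for (int i = 0; i < total; i++) {
            int a = rng.nextDouble() < 0.1 ? 1 : 0;                   // P(A=1) = 0.1
            int b = rng.nextDouble() < (a == 1 ? 0.8 : 0.1) ? 1 : 0;  // P(B=1 | A)
            int c = rng.nextDouble() < (b == 1 ? 0.9 : 0.2) ? 1 : 0;  // P(C=1 | B)
            if (c == 1) {  // discard samples that don't match the evidence
                kept++;
                if (b == 1) keptWithB1++;
            }
        }
        System.out.println("Fraction of samples kept: " + (double) kept / total);
        System.out.println("Estimated P(B=1 | C=1): " + (double) keptWithB1 / kept);
    }
}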


2. Here you'll complete an implementation of the EM algorithm for performing a "Google sets"-like task. You'll want to download the code and data files.
The Bayes Net you're performing EM over has the structure word->topic->context, where the topics are never observed. The topic variable is often referred to as "z" in the code. The words (see wordDictionary.txt) are the names of either companies or countries, and the contexts (see contextDictionary.txt) are two-word phrases observed to the left or the right of the words in a set of sentences on the Web (the original text is in corpusSelection.txt). The actual occurrences of words in contexts are listed (using the IDs from the dictionaries) in data.txt.
The code already has routines for reading/writing these files, so you won't need to process them.

You can think of training the model as a dimensionality reduction task, which attempts to summarize all of the contextual information regarding each word w in data.txt in a small vector of numbers P(z | w). Then, we can employ the model for a Google-sets-like task: we efficiently expand a set of "seed" examples (for example, UK, Libya, Myanmar) by searching for those w' which have P(z | w') similar to the seeds (for the previous example, this is hopefully a list of countries).
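As an illustration of this test-time search (an assumption for exposition only; the provided TextMasterEM code defines its own similarity measure), one simple approach is to average the seeds' P(z | w) vectors and rank each candidate word by the dot product of its vector with that average:

// Illustrative only: rank candidate words by similarity of their P(z | w)
// vectors to the mean of the seed words' vectors. The dot-product measure here
// is an assumption; the provided code may use a different measure.
public class SeedSimilarityDemo {
    static double similarityToSeeds(double[] candidate, double[][] seeds) {
        double[] mean = new double[candidate.length];
        for (double[] s : seeds)
            for (int z = 0; z < mean.length; z++) mean[z] += s[z] / seeds.length;
        double dot = 0.0;
        for (int z = 0; z < mean.length; z++) dot += candidate[z] * mean[z];
        return dot;
    }

    public static void main(String[] args) {
        double[][] seeds = { {0.7, 0.2, 0.1}, {0.6, 0.3, 0.1} };  // hypothetical P(z | w) vectors
        double[] candidate = { 0.65, 0.25, 0.10 };
        System.out.println("Similarity to seeds: " + similarityToSeeds(candidate, seeds));
    }
}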

Even if you don't know Java, you should be able to complete this assignment. The code is in the file TextMasterEM.java, and can be compiled using
javac TextMasterEM.java
on any machine with a recent Java JDK installation. You can then run the code in two different modes, for example:
java TextMasterEM train data.txt 10000 model.txt 6
java TextMasterEM test model.txt wordDictionary.txt companies.txt Shell BT Dow
The first example trains a model on data.txt for 10000 iterations using 6 topics, writing the result to model.txt; the second tests the model using companies.txt as ground truth and the three "seed" examples Shell, BT, and Dow. The test script outputs the rank order of the words in decreasing similarity to the seeds, along with the "average precision" performance measure of the list and the baseline performance of a random list. A sample model is included in the download, so you can try the test script before you fix the training routine.
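"Average precision" here presumably refers to the standard ranked-retrieval measure: the mean of precision@k over the ranks k at which a ground-truth word appears. A small sketch is below; the exact variant computed by the test script may differ slightly (e.g., in how the seeds themselves are handled), so treat the details as an assumption.

// Illustrative sketch of average precision for a ranked list.
public class AveragePrecisionDemo {
    // relevantAtRank[k] is true if the word at rank k+1 appears in the ground-truth file.
    static double averagePrecision(boolean[] relevantAtRank) {
        int hits = 0;
        double sum = 0.0;
        for (int k = 0; k < relevantAtRank.length; k++) {
            if (relevantAtRank[k]) {
                hits++;
                sum += (double) hits / (k + 1);  // precision at rank k+1
            }
        }
        return hits == 0 ? 0.0 : sum / hits;
    }

    public static void main(String[] args) {
        boolean[] ranked = { true, false, true, true, false };  // hypothetical relevance judgments
        System.out.println("Average precision: " + averagePrecision(ranked));
    }
}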

Exercises:
(i) 6 points. Complete the training code. Search for the string //BEGIN code to be changed in the TextMasterEM.java file; use the comments there and your knowledge of the EM algorithm to complete the algorithm. If you use the skeleton code that's there, only four lines in total need to be edited. In your answer, include your new code and the training code's screen output (showing the average log likelihood, etc.) for at least 500 iterations.
(ii) 3 points. Experiment with local maxima and varying topic size. For 4 and 10 topics, execute three training runs of 2000 iterations each, and test the resulting models in the following manner:
java TextMasterEM test model.txt wordDictionary.txt companies.txt Shell BT Dow
java TextMasterEM test model.txt wordDictionary.txt countries.txt Libya United+Kingdom Myanmar
For each run, list the final average log likelihood reported by the training script and the "average precision" of the ranked list in testing. Question: are local maxima a problem in training, and is the effect worse for a particular topic size? Hypothesize as to why this might be the case.
(iii) 2 points. Do some other experiment you find interesting. Some ideas: try expanding a different subset of the words (e.g., developing countries, words that begin with A, etc.), train for a much longer time and measure performance, or see if you can interpret each hidden topic's "meaning" in terms of its words or contexts.