Due at 11:59 PM Monday, October 31st via e-mail to ddowney at eecs.northwestern.edu. Use EECS 395/495 Homework 4 as the e-mail subject line. PDF format is preferred, though plain text, Word, and HTML are also acceptable.
The Bayes Net you're performing EM over has the structure word->topic->context, where the topics are never observed. The topic variable is
often referred to as "z" in the code. The words (see wordDictionary.txt) are the names of either companies
or countries, and the contexts (see contextDictionary.txt) are two-word phrases observed to the left or the right of the words in a set of sentences on the Web (the original text is in corpusSelection.txt).
The actual occurrences of words in contexts are listed (using the IDs from the dictionaries) in data.txt.
The code already has routines for reading/writing these files, so you won't need to process them.
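For reference, one EM iteration for this model alternates between computing the posterior over topics for each observed (word, context) pair and re-estimating the parameters from expected counts. The sketch below is illustrative only; the class and method names (EmSketch, emStep) are hypothetical and do not come from TextMasterEM.java, and it assumes the data is given as (word ID, context ID, count) triples.

```java
import java.util.Arrays;

// Hypothetical sketch of one EM iteration for the word -> topic -> context
// model: P(w, z, c) = P(w) P(z | w) P(c | z), with z unobserved.
public class EmSketch {
    // pZgivenW[w][z] = P(z | w); pCgivenZ[z][c] = P(c | z); each row sums to 1.
    // data holds (word, context, count) triples. Updates parameters in place
    // and returns the average log likelihood under the parameters at entry.
    public static double emStep(int[][] data, double[][] pZgivenW, double[][] pCgivenZ) {
        int numWords = pZgivenW.length;
        int numTopics = pCgivenZ.length;
        int numContexts = pCgivenZ[0].length;
        double[][] countZW = new double[numWords][numTopics];
        double[][] countZC = new double[numTopics][numContexts];
        double logLik = 0.0;
        long total = 0;
        for (int[] triple : data) {
            int w = triple[0], c = triple[1], n = triple[2];
            // E-step: posterior P(z | w, c) is proportional to P(z | w) P(c | z)
            double[] post = new double[numTopics];
            double norm = 0.0;
            for (int z = 0; z < numTopics; z++) {
                post[z] = pZgivenW[w][z] * pCgivenZ[z][c];
                norm += post[z];
            }
            logLik += n * Math.log(norm);
            total += n;
            // Accumulate expected counts, weighted by the posterior
            for (int z = 0; z < numTopics; z++) {
                double r = n * post[z] / norm;
                countZW[w][z] += r;
                countZC[z][c] += r;
            }
        }
        // M-step: renormalize the expected counts into probabilities
        for (int w = 0; w < numWords; w++) {
            double s = Arrays.stream(countZW[w]).sum();
            if (s > 0) for (int z = 0; z < numTopics; z++) pZgivenW[w][z] = countZW[w][z] / s;
        }
        for (int z = 0; z < numTopics; z++) {
            double s = Arrays.stream(countZC[z]).sum();
            if (s > 0) for (int c = 0; c < numContexts; c++) pCgivenZ[z][c] = countZC[z][c] / s;
        }
        return logLik / total;
    }
}
```

Each call should never decrease the average log likelihood, which is a useful sanity check while you debug the training routine.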
You can think of training the model as a dimensionality reduction task, which attempts to summarize all of the contextual information regarding each word w in data.txt in a small vector of numbers P(z | w).
Then, we can employ the model for a set-expansion task: we efficiently expand a set of "seed" examples (for example, UK, Libya, Myanmar) by searching for those words w' whose P(z | w') is similar to the seeds' (for the previous example, the result is hopefully a list of countries).
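The expansion step can be sketched as ranking every word by the distance between its topic distribution and the average of the seeds' distributions. The sketch below uses L1 distance as the similarity measure purely for illustration; TextMasterEM.java may use a different measure, and the names (ExpandSketch, rankBySimilarity) are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative set-expansion sketch: rank word IDs by how close each word's
// topic distribution P(z | w) is to the centroid of the seeds' distributions.
public class ExpandSketch {
    public static Integer[] rankBySimilarity(double[][] pZgivenW, int[] seeds) {
        int numTopics = pZgivenW[0].length;
        // Average the seeds' topic distributions into a centroid
        double[] centroid = new double[numTopics];
        for (int s : seeds)
            for (int z = 0; z < numTopics; z++)
                centroid[z] += pZgivenW[s][z] / seeds.length;
        // Sort all word IDs by ascending L1 distance to the centroid
        Integer[] order = new Integer[pZgivenW.length];
        for (int w = 0; w < order.length; w++) order[w] = w;
        Arrays.sort(order, Comparator.comparingDouble(w -> l1(pZgivenW[w], centroid)));
        return order;
    }

    private static double l1(double[] a, double[] b) {
        double d = 0.0;
        for (int z = 0; z < a.length; z++) d += Math.abs(a[z] - b[z]);
        return d;
    }
}
```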
Even if you don't know Java, you should be able to complete this assignment. The code is in the file TextMasterEM.java, and can be compiled using
javac TextMasterEM.java
on any machine with a recent Java jdk installation. You can then run the code in two different modes, for example:
java TextMasterEM train data.txt 10000 model.txt 6
java TextMasterEM test model.txt wordDictionary.txt companies.txt Shell BT Dow
The first example trains a model on data.txt for 10000 iterations using 6 topics; the second tests the model using companies.txt as the ground truth and the three "seed" examples Shell, BT, and Dow. The test script outputs the words ranked in decreasing similarity to the seeds, along with the "average precision" performance measure of the list and the baseline performance of a random list. A sample model is included in the download, so you can try the test script before you fix the training routine.
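For concreteness, "average precision" can be computed as the mean, over the relevant (ground-truth) items, of the precision at each rank where a relevant item appears. The sketch below is a standard formulation, not necessarily identical to the one in TextMasterEM.java; the class and method names are illustrative.

```java
import java.util.Set;

// Sketch of the average-precision measure: for each relevant item in the
// ranked list, take the precision at its rank, then average over the total
// number of relevant items.
public class AvgPrecision {
    public static double averagePrecision(String[] ranked, Set<String> relevant) {
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < ranked.length; i++) {
            if (relevant.contains(ranked[i])) {
                hits++;
                sum += (double) hits / (i + 1); // precision at this rank
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }
}
```

A perfect ranking (all relevant items first) scores 1.0, and burying relevant items deep in the list drives the score toward 0.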
Exercises:
java TextMasterEM test model.txt wordDictionary.txt companies.txt Shell BT Dow
java TextMasterEM test model.txt wordDictionary.txt countries.txt Libya United+Kingdom Myanmar
For each run, list the final average log likelihood from the training script, and the "average precision" of the ranked list in testing. Question: are local maxima a problem in training, and is the effect worse for a particular topic size? Hypothesize why this might be the case.
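One standard way to probe the local-maxima question (not necessarily what TextMasterEM.java does) is random restarts: train several times from different random initializations and compare the final average log likelihoods; large spread across restarts indicates that EM is getting stuck in different local maxima. A minimal sketch of the selection logic, with a hypothetical Trainer interface standing in for a training run:

```java
// Sketch of random-restart selection: run a trainer from several random
// seeds and keep the seed whose run achieved the best final average
// log likelihood. The Trainer interface is hypothetical.
public class Restarts {
    public interface Trainer {
        double train(long seed); // returns the final average log likelihood
    }

    public static long bestSeed(Trainer t, int restarts) {
        long best = 0;
        double bestLl = Double.NEGATIVE_INFINITY;
        for (long seed = 0; seed < restarts; seed++) {
            double ll = t.train(seed);
            if (ll > bestLl) {
                bestLl = ll;
                best = seed;
            }
        }
        return best;
    }
}
```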