Final Project

Write-up due at 11:59 PM Friday, Dec 6, via e-mail to ddowney at eecs.northwestern.edu. Use "EECS 395/495 Project" as the e-mail subject line. PDF format is preferred, though plain text, Word, and HTML are also acceptable. Note the extra credit for including LaTeX source.

This project is aimed at integrating Web Information Extraction systems through the use of word representations. Four teams will utilize four different automatically extracted knowledge bases, and use graphical models to output word representations for a given set of words that reflect the knowledge in the resources.

NELL: Never-ending Language Learner

TextRunner

WikiTables

~3.5M sentences of Web text

Each group will use one resource from the above to produce a set of word representations, i.e. numeric vectors that reflect the meanings of specific words, such that similar vectors correspond to words with similar meanings. These vectors will then be tested on how well they capture the semantics of words. We'll form four teams, one for each resource. The specific tasks for each team are as follows.

Create a technique for generating word representations from your resources. Use ideas from the course! You should limit your representations to at most 100 dimensions.
Evaluate at least two different methods for word representations, and test those on the development set. See the notes on the testing code below.
Turn in your dev and test set word representations. The format is one word per line: first the word, then the values of its representation, space- or tab-delimited (spaces in the terms themselves must be replaced with "+" signs; see the dev and test sets for examples).
Turn in a report. The report can be brief (suggestion is 1-2 pages) but it should:
1. Succinctly and completely characterize how your word representations are generated. How did you utilize the KB, and what pre-processing did you perform? If you define a graphical model, you should include a graphical representation of the model along with a mathematical statement of the distribution it encodes.
2. Provide experimental results on the development set both before and after at least one enhancement.
3. Briefly state which group members handled which portion of the project.
Note: if you write your report in LaTeX and include both your LaTeX source and figure files, you will automatically receive one point of extra credit.

You will be graded on the ingenuity of your techniques and the clarity of your report; note, you must make some use of ideas from the course (although you can use other techniques as well).
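
As an illustration of the output format described in the tasks above, here is a minimal Java sketch that turns one term and its vector into a line of the representation file. The class name RepresentationFormat and the method formatLine are hypothetical and not part of the course code. The sketch assumes values are space-delimited and written with Java's default double formatting; check the dev and test sets for the exact conventions the tester accepts.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of the course code.
class RepresentationFormat {
    // The assignment limits representations to at most 100 dimensions.
    static final int MAX_DIMENSIONS = 100;

    // Builds one line: the term (spaces replaced by "+"), then each value,
    // all separated by single spaces.
    static String formatLine(String term, double[] vector) {
        if (vector.length > MAX_DIMENSIONS) {
            throw new IllegalArgumentException(
                "Representation has " + vector.length + " dimensions; the limit is " + MAX_DIMENSIONS);
        }
        StringJoiner line = new StringJoiner(" ");
        line.add(term.trim().replace(' ', '+'));
        for (double value : vector) {
            line.add(Double.toString(value));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // prints "new+york 0.25 -1.5"
        System.out.println(formatLine("new york", new double[] {0.25, -1.5}));
    }
}
```

Rejecting oversized vectors at write time is one way to catch a violation of the 100-dimension limit before the file is turned in.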

Testing Code

You should use our Java class, RepresentationTester, which takes in a set of word representations and returns their performance on the development set, along with the performance of a random baseline.

To get you started, you can try our sample representation file which uses a simple but not entirely ineffective word representation -- a single dimension equal to the word's length in characters.

For example, from a directory that contains the compiled Java class, the unzipped devset directory, and the simplerep.txt file, you should be able to execute the following:
java RepresentationTester devset simplerep.txt
On my system, the simple representation averages 0.070, as opposed to 0.066 for the random baseline (for reference, you can consult the expected output of this run). Oddly, you might get slightly different performance numbers on your system or Java version (e.g., I got a representation score of 0.069 and a random score of 0.067 on another system). However, the random seeds are fixed in the code, so performance should be the same across multiple runs on the same system.
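
For reference, a word-length representation like the sample one could be generated along these lines. This is only a sketch: the class LengthBaseline and the method lengthLine are hypothetical names, and the provided simplerep.txt may format its values differently (for example, as decimals rather than integers).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a one-dimensional, length-based representation.
class LengthBaseline {
    // Returns one output line: the term with spaces replaced by "+",
    // followed by its length in characters as the single dimension.
    static String lengthLine(String term) {
        String formatted = term.replace(' ', '+');
        return formatted + " " + formatted.length();
    }

    public static void main(String[] args) {
        // Illustrative terms only; real terms come from the dev and test sets.
        List<String> terms = Arrays.asList("cat", "new york");
        for (String term : terms) {
            System.out.println(lengthLine(term));
        }
    }
}
```

Redirecting the output to a file gives something that can be passed to RepresentationTester in place of simplerep.txt, which is a quick way to confirm your own pipeline produces a file the tester can read.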

Presentations

Each group will present work-in-progress (5-10 minute presentation) in class on Thursday, Dec 5. This will be the last day of class, and no final presentations will be held.