Write-up due at 11:59 PM on Friday, Dec 6, via e-mail to ddowney at eecs.northwestern.edu. Use EECS 395/495 Project as the e-mail subject line. PDF format is preferred, though plain text, Word, and HTML are also acceptable. Note that extra credit is available for including LaTeX source.
This project is aimed at integrating Web Information Extraction systems through the use of word representations. Four teams will utilize four different automatically extracted knowledge bases, and use graphical models to output word representations for a given set of words that reflect the knowledge in the resources.
To get you started, you can try our sample representation file which uses a simple but not entirely ineffective word representation -- a single dimension equal to the word's length in characters.
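As a rough illustration of the word-length representation described above, here is a minimal Java sketch. The class name LengthRepresentation, the build method, and the tab-separated print layout are hypothetical and chosen only for this example; the actual layout of simplerep.txt is not specified here and may differ.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the provided course code.
public class LengthRepresentation {

    // Maps each word to a one-dimensional vector whose only entry is the
    // word's length (String.length() counts UTF-16 code units, which equals
    // the character count for ordinary ASCII words).
    static Map<String, double[]> build(List<String> words) {
        Map<String, double[]> reps = new LinkedHashMap<>();
        for (String word : words) {
            reps.put(word, new double[] { word.length() });
        }
        return reps;
    }

    public static void main(String[] args) {
        Map<String, double[]> reps = build(Arrays.asList("cat", "giraffe", "ox"));
        for (Map.Entry<String, double[]> entry : reps.entrySet()) {
            // Assumed output layout (word, tab, value); check the sample
            // file for the format RepresentationTester actually expects.
            System.out.println(entry.getKey() + "\t" + entry.getValue()[0]);
        }
    }
}
```

A richer representation would simply return longer vectors from build, one entry per dimension.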
For example, from a directory that contains the compiled Java class, the unzipped devset directory, and the simplerep.txt file, you should be able to execute the command shown in the Testing Code section below.
You will be graded on the ingenuity of your techniques and the clarity of your report. Note that you must make some use of ideas from the course (although you can use other techniques as well).
Note that if you write your report in LaTeX and include both your LaTeX source and figure files, you will automatically receive one point of extra credit.
Testing Code
You should use our Java class, RepresentationTester, which takes in a set of word representations and returns their performance on the development set, along with the performance of a random baseline.
java RepresentationTester devset simplerep.txt
On my system, the simple representation averages 0.070, compared with 0.066 for the random baseline (for reference you can consult the expected output of this run).
Oddly, you might get slightly different performance numbers on your system or Java version (e.g., I got a rep score of 0.069 and a random score of 0.067 on another system). However, the random seeds are fixed in the code, so performance should be the same across multiple runs on the same system.
Presentations
Each group will present work-in-progress (5-10 minute presentation) in class on Thursday, Dec 5. This will be the last day of class, and no final presentations will be held.