Final Project

Presentations in class Dec 4. Write-up due at 11:59 PM Friday Dec 12 via Canvas.

In this project, you will work in groups to contribute to a statistical language model.

Project Objectives

The goal of this project is to experiment with using probabilistic graphical models to model language. The class will work together to train a probabilistic language model. We have constructed a simple model as a starting point, and have provided a server that will take your training input and attempt to improve the model. Your goal is to provide the most helpful training input possible, using techniques you have learned in the course.

In the last week of class you will present what you tried, and tell us what worked and what didn't. You will also produce a final project report summarizing your findings.

You should begin developing techniques to produce training examples right away, and attempt to train our shared language model, accessible through a Web site described below. Ideally your presentation and report will include your results from several training examples submitted throughout the final three weeks of the course.

The Language Modeling Training Interface

You submit your training examples using our language model training Web site. It takes a training example file and returns a score (higher is better) equal to how much your training examples reduced the model's perplexity on held-out text.
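As a rough illustration of what the score measures, here is a minimal sketch of perplexity and of a "reduction in perplexity" score. It uses the standard definition (the exponential of the average negative log-probability per held-out token). The exact computation on our server is not described here, so treat the function name, the made-up probabilities, and the simple before-minus-after difference as assumptions.

```python
import math

def perplexity(probs):
    """Perplexity given the probability a model assigned to each held-out token."""
    if not probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_neg_log)

# Hypothetical probabilities the model assigns to the same four held-out
# tokens before and after learning from a submission (made-up numbers).
before = [0.1, 0.1, 0.1, 0.1]
after = [0.2, 0.2, 0.2, 0.2]

ppl_before = perplexity(before)
ppl_after = perplexity(after)

# Assumed form of the score: positive when the submission lowered perplexity.
score = ppl_before - ppl_after
print(ppl_before, ppl_after, score)
```

Lower perplexity means the model found the held-out text less surprising, which is why a larger reduction gives a higher score.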

Your training example file should include a set of four-word contexts of your choice, each paired with a distribution (also your choice) over the words you suggest are likely to follow that context. For example, you might provide the context "Midwestern cities such as" and suggest a distribution that assigns high probability to Chicago, Omaha, Springfield, Lincoln, Kansas, and so on. (Note that "Kansas" is in this list as the first token of "Kansas City".)

You can use the API as many times as you like, changing the contexts and distributions each time (or not). Our language model will learn from each helpful input you provide, so re-sending the same training examples will probably cease being helpful after some time. Here is an example training input that you can use to get started. The formatting requirements are:

To begin a new example, write a signal line with the phrase "Context Words:", followed by 4 words separated by spaces. The words should be drawn from the corpus vocabulary of 30,001 words. Note that punctuation marks, such as ",", "'", or "...", are perfectly fine "words" to use, and that your file should begin with a signal line.
After each signal line, you can add any number of probability lines. Each probability line should consist of a word followed by a space and then a single floating point number, like 0.1 or 0.04023930183, which corresponds to the probability of that word being the next word given that the 4 previous words are the ones in the signal phrase above.
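The two formatting rules above can be sketched as a small Python helper. This is only an illustration: the function name, the dictionary layout, and the probabilities are made up, and the single space after "Context Words:" is an assumption about the exact spacing the server expects.

```python
def format_training_file(examples):
    """Render {context_tuple: {word: prob}} as signal lines and probability lines."""
    lines = []
    for context, dist in examples.items():
        if len(context) != 4:
            raise ValueError(f"context must have 4 words, got {context!r}")
        # Signal line: the phrase "Context Words:" followed by the 4 context words.
        lines.append("Context Words: " + " ".join(context))
        # Probability lines: a word, a space, and a floating point number.
        for word, prob in dist.items():
            lines.append(f"{word} {prob}")
    return "\n".join(lines) + "\n"

# Made-up distribution for the cities example; unk0 gets some of the mass.
examples = {
    ("Midwestern", "cities", "such", "as"): {
        "Chicago": 0.3, "Omaha": 0.1, "Springfield": 0.1,
        "Lincoln": 0.1, "Kansas": 0.1, "unk0": 0.3,
    },
}

text = format_training_file(examples)
print(text)
```

The resulting string can be written to a file and submitted through the training Web site.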

API HINTS:

There is a special symbol "unk0" in the vocabulary. It stands for any word our model encounters that is not among the other 30,000 words in the vocabulary. Your training distributions should probably assign some non-negligible probability to unk0.
You may want to make sure your probabilities for each context sum to 1, though this is not a requirement.
Your document can have any number of signal lines/distributions, but if you include too few, the effect of your document on the model might not be discernible, while if you include too many, training time could be large. For best results, stick to a number of distributions between 50 and 50,000.
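One way to act on the first two hints is sketched below: fold any out-of-vocabulary words into unk0, reserve a little extra probability for unk0, and rescale so each distribution sums to 1. The function name, the tiny stand-in vocabulary, and the 0.05 reserve are all assumptions chosen for illustration, not values we recommend.

```python
def prepare_distribution(dist, vocab, unk_mass=0.05):
    """Map unknown words to unk0, reserve unk_mass for unk0, and normalize."""
    merged = {}
    for word, weight in dist.items():
        if weight < 0:
            raise ValueError(f"negative weight for {word!r}")
        key = word if word in vocab else "unk0"
        merged[key] = merged.get(key, 0.0) + weight
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("distribution has no positive weight")
    # Scale the suggested weights to sum to 1 - unk_mass ...
    out = {w: (1.0 - unk_mass) * wt / total for w, wt in merged.items()}
    # ... then add the reserved mass to unk0 so the total is 1.
    out["unk0"] = out.get("unk0", 0.0) + unk_mass
    return out

# Tiny stand-in vocabulary; the real one has 30,001 entries.
vocab = {"Chicago", "Omaha", "unk0"}
raw = {"Chicago": 3, "Omaha": 1, "Peoria": 1}  # "Peoria" is out of vocabulary here

dist = prepare_distribution(raw, vocab)
print(dist)
```

Normalizing is optional, as noted above, but it makes your submissions easier to compare with one another.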

If the API appears not to be functioning, please e-mail zswitten at gmail dot com AND the professor to alert us to the problem.

Deliverables and Deadlines

Important First Step: As soon as possible, form a group of 2-4 students and e-mail the professor with your team name and team members.

(10 points) A project presentation. These will be delivered on Thursday of the last week of class, i.e. Dec 4. Plan for 4 minutes of content with 3 minutes for Q/A. Summarize what you tried, and what worked and what didn't.
(15 points) A final report, in PDF format via Canvas. The report should contain no more than about 2 pages of written content, plus however many figures and tables are helpful to convey your results.

In both the presentation and the report, be sure to note how much you improved the model in aggregate (this is equal to the sum of your *positive* scores, since we don't keep the changes that result in negative scores), and how your marginal improvements varied over the ~three week project period. Note that the final project report is due a few days after the presentations, so there will be time to make enhancements based on the feedback on your presentation.
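A minimal sketch of the aggregate calculation, assuming you have logged the score returned for each submission in order. The scores below are made up for illustration.

```python
from itertools import accumulate

# Hypothetical scores from four submissions, in the order they were sent.
scores = [0.8, -0.3, 0.4, 0.1]

# Negative-scoring changes are not kept, so they contribute nothing.
kept = [max(s, 0.0) for s in scores]

aggregate = sum(kept)             # total improvement to report
running = list(accumulate(kept))  # cumulative improvement after each submission

print(aggregate)
print(running)
```

Plotting the running total against submission date is one simple way to show how your marginal improvements changed over the project period.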

Suggestions

Get started right away -- your approaches will surely improve as the quarter proceeds, but because everyone is training the same model, large accuracy improvements will be harder for you to achieve as the model improves over time. We will cover several suggested methods for crafting your input over the coming weeks. One option is to try to use knowledge bases to provide input focused on a single semantic class (as in the cities example above). If you try that strategy, the following knowledge sources may be helpful:

Finally, here are some text corpora that you could use to get started. The Brown Corpus (#5) is a good option. Note, whatever corpus you use, you will need to discard any of its words that don't match our vocabulary (i.e., change them to unk0).
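As one possible starting point, the sketch below maps out-of-vocabulary tokens to unk0 and then counts which word follows each four-word context, turning the counts into relative-frequency distributions. The function name, the toy token list, and the toy vocabulary are assumptions. With NLTK installed and the Brown Corpus downloaded, the tokens could instead come from nltk.corpus.brown.words(). Whether the vocabulary is case-sensitive is not stated here, so the sketch leaves case untouched.

```python
from collections import Counter, defaultdict

def build_examples(tokens, vocab, context_len=4):
    """Estimate next-word distributions for every context seen in the tokens."""
    # Replace any token outside the vocabulary with unk0.
    mapped = [t if t in vocab else "unk0" for t in tokens]
    counts = defaultdict(Counter)
    for i in range(len(mapped) - context_len):
        context = tuple(mapped[i:i + context_len])
        counts[context][mapped[i + context_len]] += 1
    # Convert raw counts to relative frequencies for each context.
    return {
        ctx: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for ctx, nxt in counts.items()
    }

# Toy stand-in for a real corpus and for the course vocabulary.
tokens = "the cat sat on the mat . the cat sat on the rug .".split()
vocab = {"the", "cat", "sat", "on", "mat", ".", "unk0"}

examples = build_examples(tokens, vocab)
print(examples[("cat", "sat", "on", "the")])
```

Raw relative frequencies from a small corpus are noisy, so you will likely want to smooth them or restrict yourself to contexts that occur several times before submitting.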

NLTK Text Corpora