Presentations in class Dec 4. Write-up due at 11:59 PM Friday Dec 12 via Canvas.

In this project, you will work in groups to contribute to a statistical language model.

Project Objectives

The goal of this project is to experiment with using probabilistic graphical models to model language. The class will work together to train a probabilistic language model. We have constructed a simple model as a starting point, and have provided a server that will take your training input and attempt to improve the model. Your goal is to provide the most helpful training input possible, using techniques you have learned in the course.

In the last week of class you will present what you tried, and tell us what worked and what didn't. You will also produce a final project report summarizing your findings.

You should begin developing techniques to produce training examples right away, and attempt to train our shared language model, accessible through a Web site described below. Ideally your presentation and report will include your results from several training examples submitted throughout the final three weeks of the course.

The Language Modeling Training Interface

You submit your training examples using our language model training Web site. It takes a training example file and returns a score (higher is better) equal to how much your training examples reduced the model's perplexity on held-out text.
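If you want to estimate locally how a change might affect the score, the sketch below may help. It assumes that perplexity is computed from the probability the model assigns to each held-out token, and that the score is the absolute drop in perplexity; the server's exact formula is not specified here, and the function names `perplexity` and `score` are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability a model assigned to each held-out token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability per token, then exponentiate.
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

def score(probs_before, probs_after):
    """Assumed scoring rule: how far perplexity dropped (higher is better)."""
    return perplexity(probs_before) - perplexity(probs_after)
```

Under this assumption, a model that moves from assigning probability 0.25 to every held-out token to assigning 0.5 would reduce perplexity from about 4 to about 2.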

Your training example file should include a set of four-word contexts of your choice, each paired with a distribution (also your choice) over the words you suggest are likely to follow that context. So you might provide the context "Midwestern cities such as" and suggest a distribution that assigns high probability to Chicago, Omaha, Springfield, Lincoln, Kansas, and so on. (Note that "Kansas" is in this list because it is the first token of "Kansas City".)
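One simple way to obtain such distributions is to count which word follows each four-word context in a corpus and normalize the counts. The sketch below illustrates that idea only; it assumes your corpus is already a list of tokens, the function name `next_word_distributions` is hypothetical, and it does not produce the file format the server expects.

```python
from collections import Counter, defaultdict

def next_word_distributions(tokens, context_len=4):
    """Map each context (a tuple of context_len tokens) to a next-word distribution."""
    counts = defaultdict(Counter)
    # Slide a window over the corpus and count the word that follows each context.
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    # Normalize the counts for each context so they sum to 1.
    dists = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        dists[context] = {word: n / total for word, n in counter.items()}
    return dists

corpus = "midwestern cities such as chicago and midwestern cities such as omaha".split()
dists = next_word_distributions(corpus)
```

In this toy corpus the context ("midwestern", "cities", "such", "as") is followed once by "chicago" and once by "omaha", so each receives probability 0.5. Raw counts like these are sparse on real corpora, so you will probably want to smooth them or combine them with other knowledge.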

You can use the API as many times as you like, changing the contexts and distributions each time (or not). Our language model will learn from each helpful input you provide, so re-sending the same training examples will probably cease being helpful after some time. Here is an example training input that you can use to get started. The formatting requirements are:

API HINTS: If the API appears not to be functioning, please e-mail both zswitten at gmail dot com and the professor to alert us to the problem.

Deliverables and Deadlines

Important First Step: As soon as possible, form a group of 2-4 students and e-mail the professor with your team name and team members.
  1. (10 points) A project presentation. These will be delivered on Thursday of the last week of class, i.e. Dec 4. Plan for 4 minutes of content with 3 minutes for Q/A. Summarize what you tried, and what worked and what didn't.
  2. (15 points) A final report, in PDF format via Canvas. The report should consist of about 2 pages of written content at most, plus however many figures and tables are helpful to convey your results.
In both the presentation and the report, be sure to note how much you improved the model in aggregate (this is equal to the sum of your *positive* scores, since we don't keep the changes that result in negative scores), and how your marginal improvements varied over the roughly three-week project period. Note that the final project report is due a few days after the presentations, so there will be time to make enhancements based on the feedback on your presentation.
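As a small illustration of the aggregate figure, the sketch below sums only the positive scores from a list of submissions. The function name `aggregate_improvement` and the example scores are made up for illustration.

```python
def aggregate_improvement(scores):
    """Sum of positive scores; submissions with negative scores are not kept."""
    return sum(s for s in scores if s > 0)

# Hypothetical scores from four submissions, in the order they were sent.
example_scores = [0.5, -0.25, 1.5, 0.0]
total = aggregate_improvement(example_scores)
```

Keeping the scores in submission order also lets you plot how your marginal improvements changed over the project period.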

Suggestions

Get started right away -- your approaches will surely improve as the quarter proceeds, but because everyone is training the same model, large accuracy improvements will be harder for you to achieve as the model improves over time. We will cover several suggested methods for crafting your input over the coming weeks.

One option is to try to use knowledge bases to provide input focused on a single semantic class (as in the cities example above). If you try that strategy, the following knowledge sources may be helpful:

Finally, here are some text corpora that you could use to get started. The Brown Corpus (#5) is a good option. Note that whichever corpus you use, you will need to replace any of its words that don't match our vocabulary with unk0.
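The vocabulary-matching step can be sketched as follows, assuming you have loaded our vocabulary into a Python set and tokenized your corpus into a list of strings. The function name `map_to_vocab` and the tiny vocabulary shown are hypothetical, and you may need to match our vocabulary's casing and tokenization before the lookup.

```python
def map_to_vocab(tokens, vocab, unk="unk0"):
    """Replace every token that is not in the vocabulary with the unknown-word token."""
    return [tok if tok in vocab else unk for tok in tokens]

# Hypothetical vocabulary and sentence, for illustration only.
vocab = {"cities", "such", "as", "chicago"}
mapped = map_to_vocab(["midwestern", "cities", "such", "as", "chicago"], vocab)
```

Here "midwestern" is not in the toy vocabulary, so it becomes unk0 and the remaining tokens pass through unchanged.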