Web Information Retrieval and Extraction (EECS 395/495)

Spring 2011
Electrical Engineering and Computer Science Department
Northwestern University

Class Meets: 12:30PM-1:50PM TTh, Tech M177

Instructor: Doug Downey

Teaching Assistant: Vaibhav Rastogi

Policies

Grading

Discussion Papers and Submitting Summaries

See the list of papers and dates for "debate" sessions to be held in class. Each student must help lead a debate, either defending a paper ("defense") or attacking it ("offense"). The discussion groups do not need to be the same as the project groups. You also do not need to sign up as a group: any individual can sign up to present for any team that currently has at most two presenters (except for the first session on March 31, which the professor and TA will lead). E-mail Vaibhav (vrastogi at u.northwestern.edu) to sign up. The deadline to sign up to lead a debate is April 4th at 11:59PM.

The debate guidelines specify how the debates will proceed. Each team will prepare slides (defense: 20-minute presentation, offense: 10-minute presentation) and can bring its own laptop or use the professor's. If you want to use the professor's laptop, e-mail your PowerPoint slides to ddowney at eecs.northwestern.edu AT LEAST TWO HOURS before class time.

Paper summaries

Each student should submit a two-paragraph summary of each paper AT LEAST TWO HOURS prior to the class time in which the paper is being discussed. E-mail your summaries to BOTH vrastogi at u.northwestern.edu and ddowney at eecs.northwestern.edu, using the subject line: "EECS 395/495 Summary Week #" where # is the numerical week of the quarter (between 1 and 10). The summaries should cover:
  1. A brief summary of what the paper is about, and its contributions.
  2. At least one area for improvement in the paper.
  3. A suggestion for "next steps" in the same direction that would make interesting future work.
  4. A brief assessment of whether you liked the paper overall.
See the following tips on: How to read papers.

Course Projects

Project proposals are due Monday, April 11 at 11:59PM, and should be about 1 page in length (single spaced). E-mail your proposal (.pdf or .doc format) to BOTH vrastogi at u.northwestern.edu and ddowney at eecs.northwestern.edu.

Preliminary Reports. Each group must submit a two-page summary of its project progress on May 11. E-mail your summaries to BOTH vrastogi at u.northwestern.edu and ddowney at eecs.northwestern.edu, using the subject line: "EECS 395/495 Progress Report." IMPORTANT: please cc all group members on your e-mail, so that the professor and the peer reviewers can easily respond to the entire group. The progress report should be readable by people (like your peer reviewers) who haven't seen your original proposal document. The report should cover:

  1. The project goals and motivation
  2. Steps you have completed, and any results you have obtained so far
  3. The key remaining steps you plan to complete before the end of the quarter
  4. Any questions or concerns you have regarding the project

Review of preliminary reports. Each group will be assigned another group's preliminary report to review. Each group should respond to that report's team members, as well as vrastogi at u.northwestern.edu and ddowney at eecs.northwestern.edu, with a review of the project progress. You should state your opinion of the project idea, and also attempt to prioritize the remainder of the project effort. What aspects of the project are most important to complete or deserve the most attention? What other questions should the project team be considering? This review should be about one page in length, and is due May 16th.

Final Reports are due Wednesday, June 1, at 11:59PM, and should be about 4 pages in length (single spaced). E-mail your reports (.pdf or .doc format) to BOTH vrastogi at u.northwestern.edu and ddowney at eecs.northwestern.edu. You should include your project goals and motivation, along with a concise and clear statement of what results you obtained. Also mention which aspects of future work would be most interesting. Clarity in your report and presentation will contribute significantly to your grade.

Final Presentations will be on June 2, and will be held in class as well as that EVENING from 6:00-8:00PM. You should sign up for a slot in either session by May 31 by e-mailing Vaibhav (vrastogi at u.northwestern.edu). Presentations are to be 10 minutes in length, with 5 minutes for questions.

Helpful Links:
Weka Machine Learning Package

Project Milestones:
Project proposal (~1 pg): Due 11:59PM April 11th (via e-mail; see above)
Meetings with prof. to finalize project plan: April 12 and 13
Preliminary Report (~2 pgs): Due 11:59PM May 11
Review of preliminary report (~1 pg): Due 11:59PM May 16
Final Report (~4 pgs): Due 11:59PM June 1
Project Presentations: June 2 (during class and from 6-8PM)

Reading

Week of March 28
  Optional: History of Web Search
  Optional: ComScore February 2011 Search Statistics
Week of April 4
  Introduction to Information Retrieval, Ch. 1-2
Week of April 18
  Scholarpedia article on Latent Semantic Analysis
Week of May 9
  Mercator: A Scalable, Extensible Web Crawler

Lectures

Week of March 28
  Intro and course objectives
Week of April 4
  Inverted Indices
Week of April 18
  Basics of Machine Learning
  Latent Semantic Analysis
Week of April 25
  Search Advertising
  Music Information Retrieval
Week of May 2
  Index Compression, Scalable Indexing and Search
Week of May 9
  Web Crawlers
Week of May 16
  Document Ranking
  See also: Dan Weld's slides on PageRank.