Web Information Retrieval and Extraction (EECS 395/495)

Spring 2011
Electrical Engineering and Computer Science Department
Northwestern University

Class Meets: 3:30PM-4:50PM TTh, Tech L170

Instructor: Doug Downey

Policies

Grading

Discussion Papers and Submitting Summaries

See the list of papers and dates for "debate" sessions to be held in class. Each student must help lead a debate, either arguing in favor of a paper ("defense") or attacking it ("offense"). The discussion groups do not need to be the same as the project groups. You also do not need to sign up as a group -- any individual can sign up to present for any team that currently has at most two presenters (except "defense" in the first session on October 11th, which the professor will lead). E-mail the prof (ddowney at eecs.northwestern.edu) to sign up. The deadline to sign up to lead a debate is October 10th at 11:59PM.

The debate guidelines specify how the debates will proceed. Each team will prepare slides (defense: 20-minute presentation, offense: 10-minute presentation) and can bring its own laptop or use the professor's. If you want to use the prof's laptop, e-mail your PowerPoint slides to ddowney at eecs.northwestern.edu AT LEAST TWO HOURS before class time.

Paper summaries

Each student should submit a two-paragraph summary of each paper PRIOR TO the class time in which the paper is being discussed. Submit these via Blackboard, in the DISCUSSION BOARD for the course. Each paper has its own thread; simply submit your summary as the next post. You can read previously submitted summaries, but do not copy other students' work. The summaries should cover:
  1. A brief summary of what the paper is about, and its contributions.
  2. At least one area for improvement in the paper.
  3. A suggestion for "next steps" in the same direction that would make interesting future work.
  4. A brief assessment of whether you liked the paper, overall.
See the following tips on: How to read papers.

Course Projects

Project proposals are due October 16 at 11:59PM and should be about 1 page in length (single spaced). Submit them in PDF format via e-mail to ddowney at eecs.northwestern.edu. SUBJECT LINE: EECS 395/495 Project Proposal. PLEASE CC all team members on your e-mail, to make grading and replying easier.

Preliminary Reports. Each group must submit a two-page summary of its project progress, due Nov 19. Submit it in PDF format via e-mail to ddowney at eecs.northwestern.edu. SUBJECT LINE: EECS 395/495 Preliminary Report. PLEASE CC all team members on your e-mail, to make replying and forwarding to the peer reviewers easier. The progress report should be readable by people (like your peer reviewers) who haven't seen your original proposal document. The report should cover:

  1. The project goals and motivation
  2. Steps you have completed, and any results you have obtained so far
  3. The key remaining steps you plan to complete before the end of the quarter
  4. Any questions or concerns you have regarding the project

Review of preliminary reports. Each group will be assigned another group's preliminary report to review. Submit the review in PDF format via e-mail to the members of the reviewed group and the professor. You should state your opinion of the project idea and also attempt to prioritize the remainder of the project effort. Which aspects of the project are most important to complete or deserve the most attention? What other questions should the project team be considering? This review should be about one page in length, and is due Nov 26th.

Final Reports are due Wednesday, Dec 5, at 11:59PM, and should be about 4 pages in length (single spaced). Submit them in PDF format via e-mail to ddowney at eecs.northwestern.edu. SUBJECT LINE: EECS 395/495 Project Report. PLEASE CC all team members on your e-mail, to make grading and replying easier. You should include your project goals and motivation, along with a concise and clear statement of the results you obtained. Also mention which aspects of future work would be most interesting. Clarity in your report and presentation will contribute significantly to your grade.

Final Presentations will be on Dec 6 (in class) and Dec 10 (the finals date for the course). You should sign up for a slot on either date by Friday, Nov 30 -- send e-mail to ddowney at eecs.northwestern.edu. Presentations are to be 8 minutes in length, with 3 minutes for questions.

Helpful Links:
Weka Machine Learning Package

Project Milestones:
Project proposal (~1 pg): Due 11:59PM Oct 16th
Meetings with prof. to finalize project plan: Oct 17 and 18
Preliminary Report (~2 pgs): Due 11:59PM Nov 19
Review of preliminary report (~1 pg): Due 11:59PM Nov 21
Final Report (~4 pgs): Due 11:59PM Dec 5
Project Presentations: Dec 6 (during class) and Dec 10 (noon-2PM, finals week)

Reading

Week of Sept 24: Optional: History of Web Search
  Optional: ComScore February 2011 Search Statistics
Week of Oct 8: Introduction to Information Retrieval, Ch. 1-2
Week of Oct 22: Mercator: A Scalable, Extensible Web Crawler

Lectures

Week of Sept 24: Intro and course objectives
Week of Oct 8: Inverted Indexes
Week of Oct 15: Scalable Indexing and Searching
Week of Oct 22: Web Crawlers
Week of Oct 29: Bloom filters and Min Hash
Week of Nov 12: Document Ranking
See also: Dan Weld's slides on PageRank.