Problem Set 1: Due 5:00PM Thursday, October 9

Pen and paper (but please type your answers)
  1. Mitchell problem 2.5 (4 points)
    Note: in problem (c), we are looking for a minimal sequence of queries which always identifies the correct hypothesis (independent of the results of the queries). In other words, find the sequence which is shortest in the worst case.
  2. How many distinct instances are possible for the learning task in question 1? (⅓ point)
  3. How many syntactically distinct hypotheses are in the hypothesis space from question 1? (⅓ point)
  4. How many semantically distinct hypotheses are in the hypothesis space from question 1? (⅓ point)
  5. Let an "Easy Hypothesis Space" (EHS) be one with the following property: immediately after the first training example is observed, the version space will always contain exactly one hypothesis. Give an example of an EHS for the learning task in question 1 above. (1 point)
  6. Extra Credit: Consider a concept learning task defined over an instance space of n distinct instances. What is the minimal number of hypotheses an EHS can contain? What is the maximal number of hypotheses an EHS can contain? For the learning task in problem 1, describe an EHS of maximal size. (2 points)
Programming
Implement a decision tree learning algorithm and apply it to the following dataset: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/mushroom/. The task is to predict whether a mushroom is edible or poisonous based on its attributes (e.g., color, size, shape). Details on the meanings of the attributes are given in the agaricus-lepiota.names file on the dataset Web page above. For convenience, the data is already split into 3/4 training data and 1/4 test data. Answer the following questions (about 1-3 sentences each); illustrative sketches of one possible implementation follow the questions.

  1. Describe how you handled missing attributes. (1 point)
  2. What is the termination criterion for your learning process? (1 point)
  3. Apply your learning algorithm to 3/4 of the mushroom dataset given here. Print out a Boolean formula in disjunctive normal form that corresponds to the learned decision tree. Also, explain in English one of the rules that was learned. (3 points)
  4. Test your algorithm on the remaining 1/4 of the data (given here) and report the accuracy on the test set. (3 points)
  5. What was the accuracy on the training data? How does the training data accuracy compare with the test data accuracy? Briefly explain any differences you see. (2 points)
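As a point of reference, here is a minimal loading sketch in Python. It assumes the agaricus-lepiota record format (a comma-separated row whose first field is the class label, 'e' for edible and 'p' for poisonous, followed by 22 single-letter attribute values, with '?' marking a missing value); the file names below are placeholders for the 3/4 and 1/4 splits provided with the assignment.

    import csv

    def load_rows(path):
        """Read one comma-separated mushroom file into (label, attributes) pairs.

        Each record's first field is the class label ('e' = edible,
        'p' = poisonous); the remaining 22 single-letter fields are attribute
        values, with '?' marking a missing value."""
        examples = []
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if row:
                    examples.append((row[0], row[1:]))
        return examples

    if __name__ == "__main__":
        # Placeholder file names -- substitute the 3/4 training and 1/4 test
        # splits linked from the assignment page.
        train = load_rows("mushroom-train.data")
        test = load_rows("mushroom-test.data")
        print(len(train), "training and", len(test), "test examples loaded")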
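The sketch below illustrates, under one set of assumptions, the pieces the questions above touch on: most-common-value imputation for '?' entries (question 1), termination when a node is pure or no attributes remain (question 2), information-gain splitting, a DNF read-off of the learned tree (question 3), and accuracy measurement (questions 4-5). It is one possible approach, not a required design; the helper names and the imputation policy are illustrative choices, and it assumes examples are (label, attribute-list) pairs as in the loading sketch above.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (in bits) of a collection of class labels."""
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def impute_missing(examples, missing="?"):
        """One simple policy for missing attribute values: replace each '?'
        with the most common value of that attribute in the training set."""
        n_attrs = len(examples[0][1])
        common = []
        for i in range(n_attrs):
            seen = [attrs[i] for _, attrs in examples if attrs[i] != missing]
            common.append(Counter(seen).most_common(1)[0][0])
        return [(label, [common[i] if v == missing else v
                         for i, v in enumerate(attrs)])
                for label, attrs in examples]

    def id3(examples, attributes):
        """Grow a decision tree over the given attribute indices.

        Termination: return a majority-class leaf when the node is pure or
        when no attributes remain to split on."""
        labels = [label for label, _ in examples]
        majority = Counter(labels).most_common(1)[0][0]
        if len(set(labels)) == 1 or not attributes:
            return majority

        def gain(a):
            # Information gain of splitting on attribute index a.
            by_value = {}
            for label, attrs in examples:
                by_value.setdefault(attrs[a], []).append(label)
            remainder = sum(len(sub) / len(labels) * entropy(sub)
                            for sub in by_value.values())
            return entropy(labels) - remainder

        best = max(attributes, key=gain)
        tree = {"attribute": best, "default": majority, "branches": {}}
        rest = [a for a in attributes if a != best]
        for value in {attrs[best] for _, attrs in examples}:
            subset = [(l, a) for l, a in examples if a[best] == value]
            tree["branches"][value] = id3(subset, rest)
        return tree

    def classify(tree, attrs):
        """Walk the tree to a leaf; attribute values never seen in training
        fall back to the node's majority class."""
        while isinstance(tree, dict):
            tree = tree["branches"].get(attrs[tree["attribute"]], tree["default"])
        return tree

    def dnf(tree, attr_names, positive="p", path=()):
        """Read the tree off as DNF for the `positive` class: one conjunction
        of attribute=value tests per root-to-leaf path ending in that class,
        joined by OR.  Attribute names come from agaricus-lepiota.names."""
        if not isinstance(tree, dict):
            return [" AND ".join(path)] if tree == positive else []
        terms = []
        for value, subtree in tree["branches"].items():
            test = "%s=%s" % (attr_names[tree["attribute"]], value)
            terms.extend(dnf(subtree, attr_names, positive, path + (test,)))
        return terms

    def accuracy(tree, examples):
        """Fraction of examples whose predicted class matches the true class."""
        hits = sum(classify(tree, attrs) == label for label, attrs in examples)
        return hits / len(examples)

A complete solution would add a small driver that imputes the training data, calls id3 on all 22 attribute indices, prints the OR of the returned DNF terms, and reports accuracy on both the training and test splits.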
Your answers to the first part should be in one document (RTF, text, PDF, or simple DOC); for the second part, your code should be in one plain-text file with instructions on how to compile and run it. Zip the two files and attach the archive to an email as directed below.
 

Submit your homework via email to f-iacobelli@northwestern.edu. Put EECS349-PS<problem set number>-<first name>-<LastName> in the subject line of your email and attach a compressed ZIP file with the solution. The ZIP file naming convention is: PS<problem set number>-<first name>-<LastName>.zip. For example, if your name is James Bond and you are submitting your solution to Problem Set 1, you will send the TA an email with EECS349-PS1-James-Bond as the subject and you will attach the file PS1-James-Bond.zip, which contains all the files that comprise your solution to Problem Set 1.