Pen and paper (Type your answers, actually)
- Mitchell problem 2.5 (4 points)
Note: in problem (c), we are looking for a minimal sequence of queries
which always identifies the correct hypothesis (independent of the
results of the queries). In other words, find the sequence which is
shortest in the worst case.
- How many distinct instances are possible for the learning task in question 1? (⅓ point)
- How many syntactically distinct hypotheses are in the hypothesis space from question 1? (⅓ point)
- How many semantically distinct hypotheses are in the hypothesis space from question 1? (⅓ point)
- Let an "Easy Hypothesis Space" (EHS) be one with the following property: immediately after the first training example is observed, the version space will always contain exactly one hypothesis. Give an example of an EHS for the learning task in question 1 above. (1 point)
- Extra Credit: Consider a concept learning task defined over an instance space of n distinct instances. What is the minimal number of hypotheses an EHS can contain? What is the maximal number of hypotheses an EHS can contain? For the learning task in question 1, describe an EHS of maximal size. (2 points)
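As a sanity check for the three counting questions above, the following sketch computes the counts for a conjunctive hypothesis space in the style of Mitchell's Chapter 2, where each attribute constraint is a specific value, "?" (any value), or the empty constraint. This is only an illustration under that assumption: the function names are hypothetical, the attribute domain sizes for question 1 are not stated here and must be supplied by you, and the example uses Mitchell's EnjoySport task rather than problem 2.5.

```python
from math import prod


def count_instances(domain_sizes):
    """One value is chosen per attribute, so the counts multiply."""
    return prod(domain_sizes)


def count_syntactic_hypotheses(domain_sizes):
    """Each constraint is one of k values, '?', or the empty constraint."""
    return prod(k + 2 for k in domain_sizes)


def count_semantic_hypotheses(domain_sizes):
    """Every hypothesis containing an empty constraint rejects all
    instances, so all of them collapse into a single hypothesis."""
    return 1 + prod(k + 1 for k in domain_sizes)


# Assumption: EnjoySport from Mitchell's Chapter 2 (one attribute with
# three values, five attributes with two values), NOT problem 2.5.
enjoy_sport = [3, 2, 2, 2, 2, 2]
print(count_instances(enjoy_sport))
print(count_syntactic_hypotheses(enjoy_sport))
print(count_semantic_hypotheses(enjoy_sport))
```

If the hypothesis representation in question 1 differs from this one (for example, if it allows disjunctions), these formulas do not apply as written.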
Programming
Implement a decision tree learning algorithm and apply it to the following dataset: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/mushroom/.
The task is to predict whether a mushroom is edible or poisonous based on its attributes (e.g., color, size, shape, etc.). Details on the meanings of the attributes are given in the agaricus-lepiota.names file on the dataset Web page above.
For convenience, the data is already split into 3/4 training data and 1/4 test data.
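One common way to choose the attribute to split on is information gain, as in ID3. The sketch below is a minimal illustration of that criterion, not a required design: the function names and the row layout (a tuple of attribute values, with class labels held in a separate list) are assumptions made for this example. A missing value is simply treated as one more attribute value here, which is only one of several possible policies.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Reduction in label entropy obtained by splitting on column `attr`."""
    total = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    # Weighted average entropy of the partitions produced by the split.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: column 0 separates the classes, column 1 does not.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["e", "e", "p", "p"]
best = max(range(2), key=lambda attr: information_gain(rows, labels, attr))
print(best)
```

Other criteria (gain ratio, Gini impurity) fit the same interface; whichever you use, state it in your instructions file.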
Answer the following questions (about 1-3 sentences each).
- Describe how you handled missing attributes. (1 point)
- What is the termination criterion for your learning process? (1 point)
- Apply your learning algorithm to 3/4 of the mushroom dataset given here. Print out a Boolean formula in disjunctive normal form that corresponds to the learned decision tree. Also, explain in English one of the rules that was learned. (3 points)
- Test your algorithm on the remaining 1/4 of the data (given here) and report the accuracy on the test set. (3 points)
- What was the accuracy on the training data? How does the training data accuracy compare with the test data accuracy? Briefly explain any differences you see. (2 points)
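For the disjunctive normal form and accuracy questions above, the following sketch shows one way to read rules off a tree and to score predictions. The tree representation (a leaf is a class label string; an internal node is a tuple of an attribute name and a dict of branches) and all function names are assumptions made for illustration. The toy tree is written by hand and is not a result learned from the mushroom data.

```python
def tree_to_dnf(tree, target="p", path=()):
    """Return one conjunction, as a list of (attribute, value) pairs,
    for every root-to-leaf path that ends in the class `target`."""
    if not isinstance(tree, tuple):          # leaf node
        return [list(path)] if tree == target else []
    attr, branches = tree
    terms = []
    for value, subtree in branches.items():
        terms.extend(tree_to_dnf(subtree, target, path + ((attr, value),)))
    return terms


def format_dnf(terms):
    """Render the conjunctions as a single OR of ANDs."""
    return " OR ".join(
        "(" + " AND ".join(f"{a}={v}" for a, v in term) + ")"
        for term in terms)


def classify(tree, example, default="e"):
    """Follow branches until a leaf; unseen values fall back to `default`."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(example[attr], default)
    return tree


def accuracy(tree, examples, labels):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(classify(tree, ex) == y for ex, y in zip(examples, labels))
    return correct / len(labels)


# Hand-written toy tree for illustration only (not learned from data).
toy_tree = ("odor", {
    "none": ("cap-color", {"green": "p", "white": "e"}),
    "foul": "p",
})
print(format_dnf(tree_to_dnf(toy_tree, target="p")))
```

Running `accuracy` once on the training examples and once on the test examples gives the two numbers to compare in the last question.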
Your answers to the first part should be in one document (RTF, Text, PDF, simple DOC). For the second part, your code should be in one plain-text file, with instructions on how to compile and run it. Zip the two files and attach the archive to an email as directed below.