Updated Apr 12 17:25:00 CDT 2017
In this assignment, you will work in teams of 2 or 3 to implement decision trees. You should collaborate on the code, but each student must turn in an individual homework write-up.
The algorithm you should implement is the same as that given in the decision tree lecture slides (slide 24, the "ID3" algorithm), except that (a) our "default" is a class value to output, rather than a Node as in the pseudocode, and (b) we will not use the "attributes" parameter. Instead, you should terminate tree-building when the example set is empty, OR all the examples have the same class value, OR no non-trivial split of the examples is possible (i.e., there is no split that partitions the data into more than one non-empty set, because all examples have the same attribute vector). In that last case, the node should output the mode class value of the examples (breaking ties arbitrarily). [Note: to prevent infinite recursion in certain corner cases, you should explicitly avoid making trivial splits, i.e., splits that send all the examples down the same child branch. However, since this note was only added on Wednesday April 12, solutions that do not handle these corner cases correctly will not be penalized.]
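The stopping conditions above can be sketched roughly as follows. This is a minimal illustration, not the required design: the helper names are invented, the tree is a bare tuple rather than your Node class, and the attribute is chosen by "most non-empty branches" as a stand-in where your real code should use information gain.

```python
from collections import Counter

def mode_class(examples):
    # most common "Class" value; ties broken arbitrarily
    return Counter(e["Class"] for e in examples).most_common(1)[0][0]

def split_on(examples, attr):
    # partition examples by their value for attr
    groups = {}
    for e in examples:
        groups.setdefault(e[attr], []).append(e)
    return groups

def id3(examples, default):
    if not examples:                      # (1) empty example set
        return default
    classes = {e["Class"] for e in examples}
    if len(classes) == 1:                 # (2) all examples share one class
        return classes.pop()
    candidates = [a for a in examples[0] if a != "Class"]
    if not candidates:                    # no attributes to split on
        return mode_class(examples)
    # stand-in attribute choice: the split with the most branches
    # (your real implementation should use information gain)
    attr = max(candidates, key=lambda a: len(split_on(examples, a)))
    groups = split_on(examples, attr)
    if len(groups) <= 1:                  # (3) only trivial splits possible:
        return mode_class(examples)       #     output the mode class
    # recurse; each child's default is this node's mode class
    return (attr, {v: id3(sub, mode_class(examples))
                   for v, sub in groups.items()})
```

Because the chosen attribute maximizes the branch count, a single-branch result at that point means every possible split is trivial, which is exactly the condition that triggers the mode-class leaf.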
We have written code to read in the data for you (parse.py). It represents each example as a dictionary, with attributes stored as key:value pairs. The target output is stored as an attribute with the key "Class". You should write your decision tree code in ID3.py, adding new methods as necessary. You will also need to change node.py.
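Concretely, an example in this format looks something like the following (the attribute names and values here are made up for illustration; the real keys come from the data file):

```python
# one example as a dictionary; the target output lives under "Class"
example = {"handicapped_infants": "y", "water_project": "n", "Class": "1"}

# input attributes are every key except "Class"
features = {k: v for k, v in example.items() if k != "Class"}
label = example["Class"]
```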
We have included a few tests in unit_tests.py that you can run individually to check your methods. NOTE: depending on how you implement pruning, the pruning test may not pass even if your implementation is acceptable. Once your code is working, run it on house_votes_84.data, and plot learning curves.
Specifically, you should experiment under two settings:
with pruning, and without pruning. Use training set sizes ranging between 10 and 300
examples.
For each training size you choose, perform 100 random runs,
for each run testing on all examples not used for training (see
testPruningOnHouseData
from unit_tests.py
for one example of this). Plot the average
accuracy of the 100 runs as one point on a learning curve (x-axis = number of training examples,
y-axis = accuracy on test data).
Connect the points to show one line representing accuracy with pruning, the other without.
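The experimental loop above might be sketched like this. The function names and interface are assumptions, not the starter code's API: `train_fn` stands in for whatever your ID3.py exposes, and the majority-class learner is a toy placeholder you would replace with your tree (once with pruning, once without) to produce the two lines.

```python
import random
from collections import Counter

def learning_curve_point(data, train_size, train_fn, runs=100):
    # train_fn(train_examples) -> predict(example); a placeholder
    # interface, not the starter code's actual API
    accuracies = []
    for _ in range(runs):
        shuffled = random.sample(data, len(data))   # fresh random split
        train, test = shuffled[:train_size], shuffled[train_size:]
        predict = train_fn(train)
        correct = sum(predict(e) == e["Class"] for e in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / runs

def majority_learner(train):
    # toy stand-in learner so the sketch runs end to end
    mode = Counter(e["Class"] for e in train).most_common(1)[0][0]
    return lambda e: mode

# invented toy data; you would use the examples from parse.py instead
toy_data = [{"f": i % 3, "Class": "a" if i % 2 else "b"} for i in range(320)]

# one (x, y) point per training size; plot the two resulting curves
# (pruned vs. unpruned) with e.g. matplotlib's pyplot.plot()
sizes = range(10, 301, 50)
curve = [(n, learning_curve_point(toy_data, n, majority_learner, runs=5))
         for n in sizes]
```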
Include your plot in your pdf, and answer two questions:
One last suggestion: you may find it helpful to consult the starter code from last year's decision tree homework for reference, but be aware that it involved continuous attributes and used a much more complex design than you will need for this homework.
Submission Instructions
You'll turn in your homework as a single zip file, in Canvas. Specifically, the zip should contain:
- PS2.pdf
- your .py code files