Updated Apr 12 17:25:00 CDT 2017
In this assignment, you will work in teams of 2 or 3 to implement decision trees. You should collaborate on the code, but each student must turn in an individual homework write-up.
The algorithm you should implement is the same as that given in the decision tree lecture slides (slide 24, the "ID3" algorithm), except that (a) our "default" is a class value to output, rather than a Node as in the pseudocode, and (b) we will not use the "attributes" parameter. Instead, you should terminate tree-building when the example set is empty, OR all the examples have the same class value, OR no non-trivial split of the examples is possible (i.e., there is no split that partitions the data into more than one non-empty set, because all examples have the same attribute vector). In that last case, the node should output the mode class value of the examples (breaking ties arbitrarily). [Note: to prevent infinite recursion in certain corner cases, you should explicitly avoid making trivial splits, i.e., splits that send all the examples down the same child branch. However, since this note was only added on Wednesday April 12, solutions that do not handle these corner cases correctly will not be penalized.]
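The stopping conditions above can be sketched roughly as follows. This is a minimal illustration, not the required design: the helper names are invented, the tree is a bare tuple rather than your Node class, and the attribute is chosen by "most non-empty branches" as a stand-in where your real code should use information gain.

```python
from collections import Counter

def mode_class(examples):
    # most common "Class" value; ties broken arbitrarily
    return Counter(e["Class"] for e in examples).most_common(1)[0][0]

def split_on(examples, attr):
    # partition examples by their value for attr
    groups = {}
    for e in examples:
        groups.setdefault(e[attr], []).append(e)
    return groups

def id3(examples, default):
    if not examples:                      # (1) empty example set
        return default
    classes = {e["Class"] for e in examples}
    if len(classes) == 1:                 # (2) all examples share one class
        return classes.pop()
    candidates = [a for a in examples[0] if a != "Class"]
    if not candidates:                    # no attributes to split on
        return mode_class(examples)
    # stand-in attribute choice: the split with the most branches
    # (your real implementation should use information gain)
    attr = max(candidates, key=lambda a: len(split_on(examples, a)))
    groups = split_on(examples, attr)
    if len(groups) <= 1:                  # (3) only trivial splits possible:
        return mode_class(examples)       #     output the mode class
    # recurse; each child's default is this node's mode class
    return (attr, {v: id3(sub, mode_class(examples))
                   for v, sub in groups.items()})
```

Because the chosen attribute maximizes the branch count, a single-branch result at that point means every possible split is trivial, which is exactly the condition that triggers the mode-class leaf.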
We have written code to read in the data for you (parse.py). It represents each example as a dictionary, with attributes stored as key:value pairs. The target output is stored as an attribute with the key "Class". You should write your decision tree code in ID3.py, adding new methods as necessary. You will also need to change node.py.
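Concretely, an example in this format looks something like the following (the attribute names and values here are made up for illustration; the real keys come from the data file):

```python
# one example as a dictionary; the target output lives under "Class"
example = {"handicapped_infants": "y", "water_project": "n", "Class": "1"}

# input attributes are every key except "Class"
features = {k: v for k, v in example.items() if k != "Class"}
label = example["Class"]
```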
We have included a few tests in unit_tests.py that you can run individually to check your methods. NOTE: depending on how you implement pruning, the pruning test may not pass even if your implementation is acceptable. Once your code is working, run it on house_votes_84.data, and plot learning curves.
Specifically, you should experiment under two settings:
with pruning, and without pruning. Use training set sizes ranging between 10 and 300
examples.
For each training size you choose, perform 100 random runs,
for each run testing on all examples not used for training (see
testPruningOnHouseData
from unit_tests.py
for one example of this). Plot the average
accuracy of the 100 runs as one point on a learning curve (x-axis = number of training examples,
y-axis = accuracy on test data).
Connect the points to show one line representing accuracy with pruning, the other without.
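The experimental loop above might be sketched like this. The function names and interface are assumptions, not the starter code's API: `train_fn` stands in for whatever your ID3.py exposes, and the majority-class learner is a toy placeholder you would replace with your tree (once with pruning, once without) to produce the two lines.

```python
import random
from collections import Counter

def learning_curve_point(data, train_size, train_fn, runs=100):
    # train_fn(train_examples) -> predict(example); a placeholder
    # interface, not the starter code's actual API
    accuracies = []
    for _ in range(runs):
        shuffled = random.sample(data, len(data))   # fresh random split
        train, test = shuffled[:train_size], shuffled[train_size:]
        predict = train_fn(train)
        correct = sum(predict(e) == e["Class"] for e in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / runs

def majority_learner(train):
    # toy stand-in learner so the sketch runs end to end
    mode = Counter(e["Class"] for e in train).most_common(1)[0][0]
    return lambda e: mode

# invented toy data; you would use the examples from parse.py instead
toy_data = [{"f": i % 3, "Class": "a" if i % 2 else "b"} for i in range(320)]

# one (x, y) point per training size; plot the two resulting curves
# (pruned vs. unpruned) with e.g. matplotlib's pyplot.plot()
sizes = range(10, 301, 50)
curve = [(n, learning_curve_point(toy_data, n, majority_learner, runs=5))
         for n in sizes]
```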
Include your plot in your pdf, and answer two questions:
One last suggestion: you may find it helpful to consult the starter code from last year's decision tree homework for reference, but be aware that it involved continuous attributes and used a much more complex design than you will need for this homework.
Submission Instructions
You'll turn in your homework as a single zip file, in Canvas. Specifically, the zip should contain:
- PS2.pdf
- your .py code files