EECS/MSAI 349 Problem Set 1

Due 11:59PM Wednesday October 17

Updated Oct 15 21:45:00 CDT 2018

In this assignment, you will work in teams of 1 or 2 to implement decision trees. You can collaborate with your partner on the code and write-up, but each student must turn in an individual homework write-up. You may discuss the homework with other teams, but do not take away any written record of those discussions. Also, do not copy any source code from the Web.

The algorithm you should implement is the same as that given in the decision tree lecture slides (slide 24, the "ID3" algorithm).

We have written code to read in the data for you (parse.py). It represents each example as a dictionary, with attributes stored as key:value pairs. The target output is stored as an attribute with the key "Class".
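To make the data format concrete, here is a small sketch of what a list of parsed examples might look like. The attribute names and values are invented for illustration; only the "Class" key comes from the description above, and the real output of parse.py may differ in detail.

```python
# Hypothetical examples in the dictionary format described above.
# The attribute names ("outlook", "windy") and their values are made up;
# only the "Class" key is taken from the assignment text.
examples = [
    {"outlook": "sunny", "windy": "no", "Class": "play"},
    {"outlook": "rainy", "windy": "yes", "Class": "stay"},
    {"outlook": "sunny", "windy": "yes", "Class": "play"},
]

# The attributes available for splitting are every key except the target.
attributes = sorted(key for key in examples[0] if key != "Class")

# Count how often each class label occurs.
class_counts = {}
for example in examples:
    label = example["Class"]
    class_counts[label] = class_counts.get(label, 0) + 1
```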

Guidelines

You should use Information Gain for choosing which attribute to split on.
You must handle missing attributes, but exactly how is up to you.
You must implement some kind of pruning, but exactly how is up to you.
You must adhere to the signature in the ID3.py file. In particular, you must implement the four given methods as described in that file. You can and should define additional methods in ID3.py, and additional fields in the Node class.
You do not need to handle numeric attributes, but your code should work for categorical attributes with an arbitrary number of attribute values, and an arbitrary number of output classes.
Do not import any modules from outside the Python standard library. If you need other modules to produce e.g. graphs, write that code in a separate source file (which you should not turn in).
If all available splits have zero information gain, you should prefer a split that is non-trivial (i.e. one where the examples do not all share the same attribute value).
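As a rough illustration of the Information Gain and non-trivial-split guidelines, the sketch below computes entropy and gain over the dictionary examples and picks a split attribute. The function names (entropy, partition, information_gain, choose_attribute) are hypothetical and are not the signatures required by ID3.py. The sketch also assumes every example has a value for every attribute, so it does not address missing attributes.

```python
import math
from collections import Counter


def entropy(examples, target="Class"):
    """Shannon entropy (in bits) of the target labels in `examples`."""
    total = len(examples)
    if total == 0:
        return 0.0
    counts = Counter(example[target] for example in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def partition(examples, attribute):
    """Group examples by their value for `attribute`."""
    groups = {}
    for example in examples:
        groups.setdefault(example[attribute], []).append(example)
    return groups


def information_gain(examples, attribute, target="Class"):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(examples)
    remainder = sum(
        (len(subset) / total) * entropy(subset, target)
        for subset in partition(examples, attribute).values()
    )
    return entropy(examples, target) - remainder


def choose_attribute(examples, attributes, target="Class"):
    """Pick the highest-gain attribute, never preferring a trivial split.

    A trivial split (one attribute value for all examples) always has zero
    gain, so ranking non-trivial splits first and then ranking by gain gives
    the usual maximum-gain choice while breaking zero-gain ties as required.
    """
    def score(attribute):
        non_trivial = len(partition(examples, attribute)) > 1
        return (non_trivial, information_gain(examples, attribute, target))

    return max(attributes, key=score)
```

On XOR-style data, where every single attribute has zero gain, this choice still returns an attribute that actually divides the examples rather than one that is constant.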

Steps to complete the homework

Get the code and data files.
Complete the decision tree code. You should use Python 3.6. Implement the four methods in ID3.py, adding new methods as necessary. You will also need to change node.py. We have included a few tests in unit_tests.py that you can run individually to check your methods. We have also included a file, mini_auto_grader.py, that gives you an idea of the kinds of tests we will run to grade your methods. NOTE: depending on how you implement pruning, it is possible (albeit unlikely) that the pruning test will not pass even if your implementation is acceptable.
Create a PDF document with the answers to the following questions:
1. (0.5 points) Which other student, if any, is in your group? (either names or netIDs are fine)
2. (0.5 points) Did you alter the Node data structure? If so, how and why? (2 sentences)
3. (1 point) How did you handle missing attributes, and why did you choose this strategy? (2 sentences)
4. (1 point) How did you perform pruning, and why did you choose this strategy? (4 sentences)
5. (2 points) Now you will try your learner on the house_votes_84.data data set and plot learning curves. Specifically, you should experiment under two settings: with pruning and without pruning. Use training set sizes ranging between 10 and 300 examples. For each training size you choose, perform 100 random runs, and for each run test on all examples not used for training (see testPruningOnHouseData from unit_tests.py for one example of this). Plot the average accuracy of the 100 runs as one point on a learning curve (x-axis = number of training examples, y-axis = accuracy on test data). Connect the points to show two lines, one representing accuracy with pruning and the other without. Include your plot in your PDF, and answer two questions:
  1. In about a sentence, what is the general trend of both lines as training set size increases, and why does this make sense?
  2. In about two sentences, how does the advantage of pruning change as the data set size increases? Does this make sense, and why or why not?
  Note: depending on your particular approach, pruning may not improve accuracy consistently or may decrease it (especially for small data set sizes). You can still receive full credit for this as long as your approach is reasonable and correctly implemented.
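The experiment in question 5 could be organized roughly as follows. This is only a sketch: train_fn and accuracy_fn are hypothetical callables standing in for your own training and testing code, and are not names from ID3.py or unit_tests.py. The seed parameter is also an assumption, added so that runs are repeatable.

```python
import random


def learning_curve(examples, train_sizes, train_fn, accuracy_fn, runs=100, seed=0):
    """Return [(train_size, mean_test_accuracy), ...].

    train_fn(train_examples) -> model and
    accuracy_fn(model, test_examples) -> float in [0, 1]
    are hypothetical placeholders for your own ID3 code.
    """
    rng = random.Random(seed)
    curve = []
    for size in train_sizes:
        accuracies = []
        for _ in range(runs):
            # Shuffle a copy so the caller's list is left untouched.
            shuffled = examples[:]
            rng.shuffle(shuffled)
            # Train on the first `size` examples, test on all the rest.
            train, test = shuffled[:size], shuffled[size:]
            model = train_fn(train)
            accuracies.append(accuracy_fn(model, test))
        curve.append((size, sum(accuracies) / runs))
    return curve
```

Calling this once with a train_fn that prunes and once with one that does not gives the two series of points to plot. If your pruning method needs a validation set, one option is to hold it out of the training examples inside train_fn. Any plotting code that needs a non-standard module belongs in a separate file, as the guidelines require.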

The correct functionality of your code is then worth ten points, making a total of fifteen points for the assignment.

One last suggestion: You may find it helpful to consult the starter code from 2016's decision tree homework for reference, but be aware that the 2016 assignment involved continuous attributes and used a much more complex design than you will need for this homework.

Submission Instructions

You'll turn in your homework as a single zip file, in Canvas. Specifically:

Create a single pdf file ps1.pdf with the answers to the questions above, and your graphs.
Create a single ZIP file containing:
- ps1.pdf
- All of your .py code files
Turn the zip file in under Problem Set 1 in Canvas.