Updated Wed Jan 8 1:25:43 CDT 2014
In this assignment you will implement a decision tree learning algorithm and apply it to a synthetic dataset. You will also implement a pruning strategy in your algorithm. You will be given labeled training data, from which you will generate a model. You will be given labeled validation data, for which you will report your model's performance. You will also be given individualized unlabeled test data for which you will generate predictions.
Here is how James Bond would submit the homework. Please adjust for your own name:
PS2-James-Bond.code that contains your source code.
README that explains how to build and run your code.
The dataset files are here:
This dataset is based on:
R. Agrawal, T. Imielinski, A. Swami (1993). Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering. 5 (6):914-925.
The dataset is from a synthesized (and therefore fictitious) people database where each person has the following attributes:
The class label is given by the group attribute. This is a binary classification problem with numeric and nominal attributes. Some attribute values are missing (as might happen in a real-world scenario). These values are indicated by a "?" in the file. In the test files the class labels are missing, and these missing labels are also indicated by a "?". The test sets are all drawn from the same distribution as the training and validation sets.
If you want, you can imagine that the task is to predict whether a loan application by the given person will be approved or denied. However, for this assignment it is not necessary (or even useful) to interpret the task or the attributes.
For this assignment you will implement a decision tree algorithm in the language of your choice. In particular, you should not use Weka or any other existing framework for generating decision trees. You are free to choose how your algorithm works. Your program must be able to:
Note: your algorithm must handle missing attributes.
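One simple way to meet this requirement is to fill each missing value with a per-column default learned from the training data. The sketch below is only an illustration, not a required approach: the function names `column_defaults` and `fill_missing` are hypothetical, and it assumes each example is a list in which a missing value (a "?" in the file) has already been converted to `None`. Other strategies, such as sending an example down several branches with fractional weights, are equally acceptable.

```python
from collections import Counter
from statistics import median


def column_defaults(rows, nominal_columns):
    """Compute one fill-in value per column from the observed (non-missing) values.

    rows: list of lists; None marks a missing value.
    nominal_columns: set of column indices that hold nominal attributes.
    """
    defaults = []
    for col in range(len(rows[0])):
        present = [row[col] for row in rows if row[col] is not None]
        if not present:
            defaults.append(None)  # nothing observed, so leave it missing
        elif col in nominal_columns:
            # Most common category for a nominal attribute.
            defaults.append(Counter(present).most_common(1)[0][0])
        else:
            # Median for a numeric attribute.
            defaults.append(median(present))
    return defaults


def fill_missing(rows, defaults):
    """Return a copy of rows with every None replaced by its column default."""
    return [[defaults[c] if v is None else v for c, v in enumerate(row)]
            for row in rows]


train = [[1.0, "a"], [None, "a"], [3.0, None], [5.0, "b"]]
defaults = column_defaults(train, nominal_columns={1})
filled = fill_missing(train, defaults)
```

Computing the defaults once from the training set and reusing them for the validation and test sets keeps the three sets consistent.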
The data files are provided to you in CSV format so that it will be easier
for you to read them in. One drawback of the CSV format is that it does not
contain metadata (as ARFF does, for example). This means that it is not
possible from the data alone to know which attributes are nominal and which are
numeric. For example, some attributes, such as car, are actually
nominal attributes that are represented as integers, as described above.
Therefore you need to represent this information somewhere. You can either put
this information directly in the code that reads in the input files, or you can
generate a metadata file of your own and write code that interprets the input
file based on the contents of the metadata file.
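As a rough sketch of the metadata approach, the attribute types can live in a small dictionary that the reader consults for every field. Several things here are assumptions for illustration: that the CSV file starts with a header row, the helper name `parse_rows`, and the column name `some_numeric`, which is a placeholder. Only `elevel`, `car`, and `group` are names taken from the assignment text.

```python
import csv
import io


def parse_rows(csv_text, schema, missing="?"):
    """Parse CSV text into dictionaries, using schema to decide each column's type.

    schema maps a column name to "numeric" or "nominal".
    A field equal to the missing marker becomes None.
    Assumes the first line of the CSV is a header row.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for name, raw in record.items():
            raw = raw.strip()
            if raw == missing:
                row[name] = None
            elif schema.get(name) == "numeric":
                row[name] = float(raw)
            else:
                # Nominal values stay as opaque strings, even if they look like integers.
                row[name] = raw
        rows.append(row)
    return rows


# Hypothetical schema and data, for illustration only.
schema = {"some_numeric": "numeric", "elevel": "nominal",
          "car": "nominal", "group": "nominal"}
sample = "some_numeric,elevel,car,group\n50000,2,?,1\n?,0,7,0\n"
rows = parse_rows(sample, schema)
```

Keeping nominal values as strings prevents the tree-building code from accidentally comparing category codes with `<` or `>`.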
Regardless of how you translate the input file into an internal representation, write your decision tree algorithm to handle a general binary classification problem. The algorithm should be able to handle another binary classification problem with a different composition of numeric and nominal attributes. For example, the algorithm itself should not assume that each example contains exactly 12 attributes, nor for example should it assume that there is an attribute named "elevel" with 5 categories.
Add a pruning strategy to your decision tree algorithm. You are free to choose the pruning strategy, but you SHOULD use the validation set for pruning. Note that you do not have to, for example, iteratively and greedily select the single *best* node to prune, as this might be computationally prohibitive. Feel free to choose an approximation instead (e.g. prune any node whose removal improves accuracy on the validation set).
Be sure you can run your algorithm both with and without pruning.
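One possible approximation is a single bottom-up pass that collapses a subtree into a leaf whenever doing so does not lower validation accuracy. The sketch below is a minimal illustration under stated assumptions, not the expected solution: the `Node` class, `predict`, `accuracy`, and `prune` are all hypothetical names, and each node is assumed to remember the majority training label it would predict as a leaf. Requiring a strict improvement instead of "no worse" is an equally reasonable choice.

```python
class Node:
    def __init__(self, majority, test=None, children=None):
        self.majority = majority        # most common training label at this node
        self.test = test                # function row -> branch key; None for a leaf
        self.children = children or {}  # branch key -> child Node


def predict(node, row):
    """Walk down the tree; fall back to the majority label on an unknown branch."""
    while node.test is not None:
        child = node.children.get(node.test(row))
        if child is None:
            return node.majority
        node = child
    return node.majority


def accuracy(root, rows, labels):
    return sum(predict(root, r) == y for r, y in zip(rows, labels)) / len(rows)


def prune(root, node, rows, labels):
    """Bottom-up pass: collapse a subtree if validation accuracy does not drop."""
    if node.test is None:
        return
    for child in node.children.values():
        prune(root, child, rows, labels)
    before = accuracy(root, rows, labels)
    saved = (node.test, node.children)
    node.test, node.children = None, {}      # tentatively turn into a leaf
    if accuracy(root, rows, labels) < before:
        node.test, node.children = saved     # pruning hurt, so undo it


# Tiny example: the split on "b" overfits and should be pruned away.
overfit = Node(0, test=lambda r: r["b"] > 0,
               children={True: Node(1), False: Node(0)})
root = Node(1, test=lambda r: r["a"] > 0,
            children={True: Node(1), False: overfit})
val_rows = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": -1, "b": 1}, {"a": -1, "b": -1}]
val_labels = [1, 1, 0, 0]
acc_before = accuracy(root, val_rows, val_labels)
prune(root, root, val_rows, val_labels)
acc_after = accuracy(root, val_rows, val_labels)
```

Because pruning modifies the tree in place, running the algorithm without pruning simply means skipping the call to `prune`.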
In your code, in contrast to the decision tree pseudo-code in the lecture notes, you may want to split on the same attribute more than once (for numeric attributes). As a result, you do not want to remove attributes when you split and recurse, and you don't need to check whether the attribute set is empty. You should, however, add a base case in your code to stop when no new split yields non-zero information gain.
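To make the base case concrete, here is a minimal sketch of choosing a threshold for one numeric attribute by information gain. The function names `entropy` and `best_threshold` are hypothetical, candidate thresholds are taken as midpoints between adjacent distinct values (one common convention, not the only one), and the values are assumed to contain no missing entries. When the function returns `None` as the threshold for every attribute, no split has positive gain and the recursion should stop with a leaf.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (gain, threshold) for the best split `value <= threshold`.

    Returns (0.0, None) when no threshold yields positive information gain.
    """
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_t = gain, (lo + hi) / 2
    return best_gain, best_t


gain, threshold = best_threshold([1, 2, 3, 4], [0, 0, 1, 1])
pure_gain, pure_threshold = best_threshold([1, 2, 3], [1, 1, 1])
```

Since the attribute is not removed after the split, the same function can be called again on each child's subset and may pick a different threshold on the same attribute. In practice, comparing the gain against a small tolerance rather than exactly zero can guard against floating-point noise.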
Put answers to the following questions in a text or PDF file, as described in the submission instructions.
This assignment is worth 20 points, broken down as follows:
It is possible to get up to 12 points of credit without implementing pruning. (If you do not implement pruning, Questions 11-12 can still receive full credit based on the output of the algorithm without pruning.)