EECS 349 Problem Set 2

Due 11:59PM

Updated Wed Jan 8 1:25:43 CDT 2014

Overview

In this assignment you will implement a decision tree learning algorithm and apply it to a synthetic dataset. You will also implement a pruning strategy in your algorithm. You will be given labeled training data, from which you will generate a model. You will be given labeled validation data, for which you will report your model's performance. You will also be given individualized unlabeled test data for which you will generate predictions.

Submission Instructions

Here is how James Bond would submit the homework. Please adjust for your own name:

Create a single text or PDF file with your answers to the questions below. Name this file PS2-James-Bond.txt or PS2-James-Bond.pdf.
Create a directory (i.e. a folder) named PS2-James-Bond.code that contains your source code.
Create a file named README that explains how to build and run your code.
Run your code on the test file and output a file in the same format, but with your predicted labels in the last column. Name this file PS2-James-Bond.csv.
Create a ZIP file named PS2-James-Bond.zip containing:
- PS2-James-Bond.txt or PS2-James-Bond.pdf
- PS2-James-Bond.code (directory)
- README
- PS2-James-Bond.csv
Ensure that the ZIP file contains all of your source code. You may have to tell the ZIP utility explicitly to include the contents of the subdirectory containing your code.
Turn in your code under Problem Set 2 in Blackboard.

Download the Dataset

The dataset files are here:

train.csv (labeled training set, 10000 instances)
validate.csv (labeled validation set, 5000 instances)
test-files (unlabeled test set, 5000 instances)

This dataset is based on:

R. Agrawal, T. Imielinski, A. Swami (1993). Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering. 5 (6):914-925.

The dataset is from a synthesized (and therefore fictitious) people database where each person has the following attributes:

salary : numeric
commission : numeric
age : numeric
gender : nominal
- f: female
- m: male
marital (marital status) : nominal
- s: single, no kids
- m: married, no kids
- d: divorced or widowed, no kids
- k: any marital status with kids
elevel (education level) : nominal
- 0: no high school diploma
- 1: high school graduate
- 2: some college
- 3: college graduate
- 4: advanced degree
car : nominal
- value from 1-20 representing make of car
zipcode : nominal
- value from 0-9 representing zip code
creditscore (FICO credit score): numeric
hvalue (house value) : numeric
hyears (years house owned) : numeric
loan (total loan amount) : numeric
group (class label) : binary (0 or 1)

The class label is given by the group attribute. This is a binary classification problem with numeric and nominal attributes. Some attribute values are missing (as might happen in a real-world scenario). These values are indicated by a "?" in the file. In the test files the class labels are missing, and these missing labels are also indicated by a "?". The test sets are all drawn from the same distribution as the training and validation sets.

If you want, you can imagine that the task is to predict whether a loan application by the given person will be approved or denied. However for this assignment it is not necessary (or even useful) to interpret the task or the attributes.

Implementation

For this assignment you will implement a decision tree algorithm in the language of your choice. In particular, you should not use Weka or any other existing framework for generating decision trees. You are free to choose how your algorithm works. Your program must be able to:

Read the training data file and generate a decision tree model.
Output the generated decision tree in disjunctive normal form.
Read the validation data file and report the accuracy of the model on that data (i.e. the percentage of the validation data that was classified correctly).
Read a test data file with missing labels (question marks) in the last column and output a copy of that file with predicted labels in the last column (replacing the question marks).
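
As an illustration of the disjunctive normal form requirement above, here is a minimal sketch in Python. It assumes a hypothetical tree representation in which every internal node stores a list of (condition text, child) pairs and every leaf stores a label; the names Node, dnf_terms and format_dnf are made up for this example and are not part of the assignment. Each root-to-leaf path that ends in label "1" becomes one conjunction, and the conjunctions are joined with OR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    # Hypothetical representation: leaves carry a label, internal nodes
    # carry (condition text, child) pairs such as ("age <= 40", child).
    label: Optional[str] = None
    branches: List[Tuple[str, "Node"]] = field(default_factory=list)


def dnf_terms(node, path=()):
    """Return one conjunction (as text) per path that ends in label "1"."""
    if node.label is not None:
        if node.label != "1":
            return []
        return [" AND ".join(path) if path else "TRUE"]
    terms = []
    for condition, child in node.branches:
        terms.extend(dnf_terms(child, path + (condition,)))
    return terms


def format_dnf(root):
    """Join the conjunctions with OR; a tree with no positive leaf is FALSE."""
    terms = dnf_terms(root)
    return " OR ".join(f"({t})" for t in terms) if terms else "FALSE"


# A tiny hand-built tree, for illustration only.
example_tree = Node(branches=[
    ("age <= 40", Node(label="1")),
    ("age > 40", Node(branches=[
        ("gender = f", Node(label="0")),
        ("gender = m", Node(label="1")),
    ])),
])
print(format_dnf(example_tree))
```

Storing the condition as text keeps the printer independent of whether a split is numeric or nominal; a real implementation would probably store the attribute and test separately and render the text on demand.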

Note: your algorithm must handle missing attributes.
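
One possible way to handle missing attributes (among several, and not necessarily the best one for this dataset) is to fill each missing value with a typical value learned from the training set: the median for numeric attributes and the most common value for nominal ones. The sketch below assumes that examples are dicts in which a missing value is stored as None, and that schema maps each input attribute name to "numeric" or "nominal"; both conventions and the function names are assumptions made for this example.

```python
from collections import Counter
from statistics import median


def fit_fill_values(examples, schema):
    """Choose one replacement value per attribute from the training data.

    schema should list input attributes only, so that the class label
    is never imputed.
    """
    fill = {}
    for name, kind in schema.items():
        present = [ex[name] for ex in examples if ex[name] is not None]
        if not present:
            continue  # nothing observed, so nothing to impute with
        if kind == "numeric":
            fill[name] = median(present)
        else:
            fill[name] = Counter(present).most_common(1)[0][0]
    return fill


def impute(example, fill):
    """Return a copy of example with missing values replaced where possible."""
    return {k: (fill.get(k, v) if v is None else v) for k, v in example.items()}


# Illustrative data, not taken from the assignment files.
train = [
    {"age": 30.0, "gender": "f"},
    {"age": None, "gender": "f"},
    {"age": 50.0, "gender": None},
    {"age": 40.0, "gender": "m"},
]
fill = fit_fill_values(train, {"age": "numeric", "gender": "nominal"})
print(impute({"age": None, "gender": None}, fill))
```

Note that the fill values are computed from the training set only and then reused for the validation and test files. Alternatives include treating "?" as its own category or sending an example down the most popular branch.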

A Note About Design

The data files are provided to you in CSV format so that it will be easier for you to read them in. One drawback of the CSV format is that it does not contain metadata (as ARFF does, for example). This means that it is not possible from the data alone to know which attributes are nominal and which are numeric. For example, zipcode and car are actually nominal attributes that are represented as integers, as described above. Therefore you need to represent this information somewhere. You can either put this information directly in the code that reads in the input files, or you can generate a metadata file of your own and write code that interprets the input file based on the contents of the metadata file.
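
To make the first option concrete, here is a minimal sketch in Python that keeps the attribute types in a small table inside the reading code. The attribute types come from the dataset description above; everything else is an assumption made for this example, in particular that the CSV files begin with a header row naming the columns (check the actual files) and the names SCHEMA, parse_value and load_examples.

```python
import csv
import io

# Attribute types from the dataset description. Treating the class
# label "group" as nominal is a choice made for this sketch.
SCHEMA = {
    "salary": "numeric", "commission": "numeric", "age": "numeric",
    "gender": "nominal", "marital": "nominal", "elevel": "nominal",
    "car": "nominal", "zipcode": "nominal", "creditscore": "numeric",
    "hvalue": "numeric", "hyears": "numeric", "loan": "numeric",
    "group": "nominal",
}


def parse_value(raw, kind):
    """Convert one CSV cell; "?" (missing) becomes None."""
    raw = raw.strip()
    if raw == "?":
        return None
    return float(raw) if kind == "numeric" else raw


def load_examples(lines, schema):
    """Read CSV text (ASSUMED to start with a header row) into dicts."""
    reader = csv.DictReader(lines)
    return [
        {name: parse_value(row[name], schema[name]) for name in reader.fieldnames}
        for row in reader
    ]


# Illustrative input, not taken from the assignment files.
sample = io.StringIO("age,zipcode,group\n34,7,1\n?,3,?\n")
rows = load_examples(sample, SCHEMA)
print(rows)
```

Because zipcode is declared nominal, the value 7 stays the string "7" and is never compared numerically. Passing the schema as a parameter, rather than reading a global, makes it easier to reuse the loader on a different dataset.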

Regardless of how you translate the input file into an internal representation, write your decision tree algorithm to handle a general binary classification problem. The algorithm should be able to handle another binary classification problem with a different composition of numeric and nominal attributes. For example, the algorithm itself should not assume that each example contains exactly 12 attributes, nor that there is an attribute named "elevel" with 5 categories.

Pruning

Add a pruning strategy to your decision tree algorithm. You are free to choose the pruning strategy, but you SHOULD use the validation set for pruning. Note that you do not have to, for example, iteratively and greedily select the single *best* node to prune, as this might be computationally prohibitive. Feel free to use an approximation instead (e.g. prune any node whose removal improves accuracy on the validation set).

Be sure you can run your algorithm both with and without pruning.
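
As one example of such an approximation, the sketch below does a single bottom-up pass and collapses a subtree into a leaf whenever doing so does not lower accuracy on the validation set. It is only a sketch under several assumptions made for brevity: the Node class is hypothetical, every split is a binary numeric test, examples are dicts with the label under "group", and an example with a missing value is assigned the majority label of the node where it gets stuck.

```python
class Node:
    def __init__(self, majority, attribute=None, threshold=None, left=None, right=None):
        self.majority = majority    # most common training label at this node
        self.attribute = attribute  # None marks a leaf
        self.threshold = threshold
        self.left = left            # branch for value <= threshold
        self.right = right          # branch for value > threshold

    def is_leaf(self):
        return self.attribute is None


def classify(node, example):
    while not node.is_leaf():
        value = example.get(node.attribute)
        if value is None:
            return node.majority  # missing value: fall back to the majority label
        node = node.left if value <= node.threshold else node.right
    return node.majority


def accuracy(root, examples, target="group"):
    correct = sum(classify(root, ex) == ex[target] for ex in examples)
    return correct / len(examples)


def prune(root, node, validation):
    """Bottom-up pass: keep a collapse unless validation accuracy drops."""
    if node.is_leaf():
        return
    prune(root, node.left, validation)
    prune(root, node.right, validation)
    before = accuracy(root, validation)
    saved = (node.attribute, node.threshold, node.left, node.right)
    node.attribute = node.threshold = node.left = node.right = None
    if accuracy(root, validation) < before:
        # The collapse hurt accuracy, so restore the subtree.
        node.attribute, node.threshold, node.left, node.right = saved


# Illustrative tree and validation data, not taken from the assignment files.
overfit = Node("1", "salary", 50, Node("0"), Node("1"))
root = Node("1", "age", 30, Node("0"), overfit)
validation = [
    {"age": 20, "salary": 10, "group": "0"},
    {"age": 40, "salary": 40, "group": "1"},
    {"age": 50, "salary": 90, "group": "1"},
]
prune(root, root, validation)
print(root.right.is_leaf(), accuracy(root, validation))
```

Ties are resolved in favor of the smaller tree here; requiring a strict improvement instead is an equally reasonable choice. Re-scoring the whole validation set for every node is simple but slow on large trees.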

Re-using attributes

In your code, in contrast to the decision tree pseudo-code in the lecture notes, you may want to split on the same attribute more than once (for numeric attributes). As a result, you do not want to remove attributes when you split and recurse, and you don't need to check if the attribute set is empty. You should, however, add a base case in your code to stop when no new split yields non-zero information gain.
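
The base case can be sketched as follows in Python. The function names are made up for this example, and the values are assumed to contain no missing entries (filter or impute them first). best_numeric_threshold tries every midpoint between adjacent distinct values and returns the best information gain found; when it returns a gain of 0.0 and no threshold for every attribute, there is nothing useful left to split on and the node should become a leaf.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(labels, partitions):
    """Entropy of labels minus the weighted entropy of the partitions."""
    total = len(labels)
    remainder = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(labels) - remainder


def best_numeric_threshold(values, labels):
    """Return (gain, threshold) for the best split "value <= threshold".

    Returns (0.0, None) when no split has positive gain, which is the
    signal to stop growing the tree along this attribute.
    """
    pairs = sorted(zip(values, labels))
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = info_gain(labels, [left, right])
        if gain > best_gain:
            best_gain, best_t = gain, (low + high) / 2
    return best_gain, best_t


print(best_numeric_threshold([1, 2, 3, 4], [0, 0, 1, 1]))
print(best_numeric_threshold([1, 2, 3], [1, 1, 1]))
```

Because the same attribute can be tested again deeper in the tree with a different threshold, the recursion passes the full attribute list down unchanged. This version rebuilds the label lists at every candidate threshold, which is quadratic in the number of examples; keeping running class counts would be faster.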

Common-Sense Guidelines

Write your program so that you do not have to modify code when switching from one task to another or when turning pruning on or off. For example, you might use command-line parameters to enable or disable pruning and to distinguish between the model generation task, the validation task, etc. An acceptable alternative is to follow the style of LIBSVM and have separate programs for each task, e.g. model-train, model-validate, model-predict, etc.
Do not hardcode the names of input or output files in your program. It should be possible to run your program on another input file.
Document the usage of your program in the README.
While it is not required for this assignment, you may find it useful to have your program be able to output the generated decision tree in a human-readable format similar to that produced by J48 in Weka.
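
As a sketch of the command-line approach mentioned in the first guideline, the Python fragment below uses the standard argparse module. All of the flag names (--train, --validate, --predict, --output, --prune) are hypothetical choices for this example; pick whatever interface suits your program and document it in the README.

```python
import argparse


def build_parser():
    """Build a parser so that file names and pruning are never hardcoded."""
    parser = argparse.ArgumentParser(description="Decision tree learner (sketch)")
    parser.add_argument("--train", required=True, help="labeled training CSV")
    parser.add_argument("--validate", help="labeled validation CSV")
    parser.add_argument("--predict", help="unlabeled test CSV to label")
    parser.add_argument("--output", help="file to write predictions to")
    parser.add_argument("--prune", action="store_true", help="enable pruning")
    return parser


# Parsing an explicit list here; a real program would call parse_args()
# with no arguments so that the actual command line is read.
args = build_parser().parse_args(
    ["--train", "train.csv", "--validate", "validate.csv", "--prune"]
)
print(args.train, args.validate, args.prune)
```

With this layout, switching pruning on or off, or pointing the program at a different dataset, only changes the command line and never the source code.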

Questions

Put answers to the following questions in a text or PDF file, as described in the submission instructions.

Answer concisely. You may include pseudocode or short fragments of actual code if it helps to answer the question. However, please keep the answer document self-contained. It should not be necessary to look at your source files to understand your answers.

1. How did you represent the decision tree in your code?
2. How did you represent examples (instances) in your code?
3. How did you choose the attribute for each node?
4. How did you handle missing attributes in examples?
5. What is the termination criterion for your learning process?
6. Apply your algorithm to the training set, without pruning. Print out a Boolean formula in disjunctive normal form that corresponds to the unpruned tree learned from the training set. For the DNF assume that group label "1" refers to the positive examples.
7. Explain in English one of the rules in this (unpruned) tree.
8. How did you implement pruning?
9. Apply your algorithm to the training set, with pruning. Print out a Boolean formula in disjunctive normal form that corresponds to the pruned tree learned from the training set.
10. What is the difference in size between the pruned and unpruned trees?
11. Test the unpruned and pruned trees on the validation set. What are the accuracies of each tree? Explain the difference, if any.
12. Which tree do you think will perform better on the unlabeled test set? Why? Run this tree on the test file and submit your predictions as described in the submission instructions.

Grading Breakdown

This assignment is worth 20 points, broken down as follows:

Algorithmic Design (Questions 1-5)
- 8 points
Disjunctive Normal Form (Questions 6-7)
- 3 points
Pruning (Questions 8-10)
- 4 points
Output of Algorithm (Questions 11-12)
- 5 points

It is possible to get up to 16 points of credit without implementing pruning: 8 points for Questions 1-5, 3 points for Questions 6-7, and 5 points for Questions 11-12. (If you do not implement pruning, Questions 11-12 can still receive full credit based on the output of the algorithm without pruning.)