EECS 349 Problem Set 1

Due 11:59PM Monday, Apr 3

Updated Mar 26 19:00:00 CDT 2017


This assignment consists of two parts: a Python programming warm-up and some machine learning experiments with Weka.

Submission Instructions

You'll turn in your homework as a single zip file, in Canvas. Specifically:

  1. Create a text file with your completed code for the Python warm-up below. Name this file PS1.py.
  2. Create a text file with your answers to the questions below. Be sure you have answered all the questions. Name this file PS1.txt.
  3. Create a file containing your Weka model (instructions below). Be sure this file can be loaded into Weka and that it runs. Name this file PS1.model.
  4. Create a text file in ARFF format with your predicted labels for the test set (instructions below). Name this file PS1.arff.
  5. Create a single ZIP file containing the four files above (PS1.py, PS1.txt, PS1.model, and PS1.arff).
  6. Turn the zip file in under Problem Set 1 in Canvas.

Python warm-up (2 points)

Complete the three functions in node_hw1.py, using the comments in the file as a guide for what to do. (You can also define helper functions with other names, but make sure you don't change the names of the three functions you're asked to implement.) For convenience, the "tester" function provides a rudimentary test of each method, BUT make sure your code works for trees of arbitrary depth. Use Python 2.7. Turn in your .py file in Canvas, as per the submission instructions specified above.
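
As an illustration of what "arbitrary depth" means in practice, here is a minimal recursive sketch. The Node class and the count_nodes function are hypothetical and are not the ones defined in node_hw1.py; they only show that a function which calls itself on each child works no matter how deep the tree is.

```python
class Node(object):
    """Hypothetical tree node; NOT the class defined in node_hw1.py."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children if children is not None else []


def count_nodes(root):
    """Return the number of nodes in the tree rooted at `root`."""
    total = 1  # count the root itself
    for child in root.children:
        total += count_nodes(child)  # recurse, so any depth is handled
    return total


# A chain three levels deep plus one extra leaf under the root.
tree = Node("a", [Node("b", [Node("c")]), Node("d")])
print(count_nodes(tree))
```

A test that only uses a root with leaf children would not catch a function that looks one level down and stops, which is why deeper trees are worth trying by hand.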

Weka Experiments (8 points)

In this part of the assignment you will run a machine learning experiment using Weka, an open source framework for machine learning and data mining. You will generate a model that predicts the quality of wine based on its chemical attributes. You will train the model on the supplied training data and use the model to predict the correct output for unlabeled test data.

Download and Install Weka

Weka is available for Windows, Mac, and Linux from http://www.cs.waikato.ac.nz/ml/weka/. Click on the "Download" link on the left-hand side and download the Stable GUI version, which is currently 3.8. You may also wish to download a Weka manual from the "Documentation" page.

Download the Dataset

The dataset files are here:

This dataset is adapted from:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

This dataset contains data for 2700 white variants of the Portuguese "Vinho Verde" wine. For each variant, 11 chemical features were measured. Each of these is a numeric attribute. They are:

Each variant was tasted by three experts. Their ratings have been combined into a single quality label: "good" or "bad". Therefore this is a binary classification problem with numeric attributes.

The dataset has been randomly split into a training set (1890 variants) and a test set (810 variants). The training set contains both chemical features and quality labels. The test set contains only the chemical features.
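
The 1890/810 split corresponds to 70% training and 30% test. The sketch below shows one common way to make such a random split. It is an assumption for illustration, not necessarily how the supplied files were produced, and the function name and seed are made up.

```python
import random


def split_indices(n_instances, train_fraction, seed):
    """Shuffle instance indices and cut them into train and test lists."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_train = int(round(n_instances * train_fraction))
    return indices[:n_train], indices[n_train:]


# 2700 instances with a 70% training share.
train_idx, test_idx = split_indices(2700, 0.7, seed=0)
print(len(train_idx), len(test_idx))
```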

Examine the Data

It is a good idea to inspect your data by hand before running any machine learning experiments, to ensure that the dataset is in the correct format and that you understand what the dataset contains. The following sections will familiarize you with the data and introduce some tools in Weka.

The ARFF Format

View train.arff and test.arff in a text editor. You should see something like this:

The files are in ARFF (Attribute-Relation File Format), a text format developed for Weka. At the top of each file you will see a list of attributes, followed by a data section with rows of comma-separated values, one for each instance. The test and training files look similar, except that the last value for each training instance is a quality label and the last value for each test instance is a question mark, since these instances are unlabeled.

For this assignment you will not need to deal with the ARFF format directly, as Weka will handle reading and writing ARFF files for you. In future experiments you may have to convert between ARFF and another data format. (You can close the text editor.)
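
If you ever need to read ARFF outside Weka, the layout described above is simple enough to parse by hand. The following is a minimal sketch, assuming a basic file with no quoted values or sparse rows. The sample text and its attribute names are made up for illustration and are not copied from train.arff or test.arff.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (attribute_names, rows).

    Handles only the basics: '%' comments, @attribute lines, and
    comma-separated @data rows. Quoted values and sparse rows are
    not supported.
    """
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


# Made-up miniature file; the attribute names are illustrative only.
sample = """@relation wine
@attribute alcohol numeric
@attribute pH numeric
@attribute quality {good,bad}
@data
9.4,3.51,good
10.1,3.20,?
"""

names, data = parse_arff(sample)
print(names)
print(data)
```

All values come back as strings, including the "?" that marks an unlabeled instance, so converting the numeric columns is left to the caller.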

The Weka ARFF Viewer

Run Weka. You will get a screen like the following:

From the Tools menu choose ArffViewer. In the window that opens, choose File > Open and open one of the data files. You should see something like the following (see important note below):

Here you see the same data as in the text editor, but parsed into a spreadsheet-like format. Although you will not need the ArffViewer for this assignment, it is a useful tool to know about when working with Weka. (You can close the ArffViewer window.)

Important Note

You may find that the ARFF files are grayed out and that you must select the All Files option from the File Format dropdown menu before the files become selectable. Even then, the ARFF Viewer may not read the files properly. A likely cause is that a .txt extension was appended to the filename when the files were downloaded, but the Viewer may still have trouble even if no .txt extension was added or after you remove one. The following steps should resolve the issue:

  1. View the ARFF in your Web browser by clicking on the link in the instructions or open the downloaded ARFF file in a text editor.
  2. Copy all the text and paste it to a new text file.
  3. If you copied the ARFF contents from the downloaded ARFF file, it is recommended that you do not overwrite the downloaded ARFF file when saving the new file on the next step. Instead, delete the downloaded ARFF file.
  4. Save the new text file with a .arff extension, carefully making sure that a .txt extension does not get appended.
  5. Open the newly saved ARFF file in the Weka ARFF Viewer to verify the Viewer can display the file in the manner illustrated in the image above.

The Weka Explorer

From the Weka GUI Chooser, click on the Explorer button to open the Weka Explorer. The Explorer is the main tool in Weka, and the one you are most likely to work with when setting up an experiment. For the remainder of this assignment you will work within the Weka Explorer. The Explorer should open to the "Preprocess" tab. The Preprocess tab allows you to inspect and modify your dataset before passing it to a machine learning algorithm. Click on the button that says "Open file..." and open train.arff. You should see something like this:

The attributes are listed in the bottom left, and summary statistics for the currently selected attribute are shown on the right side, along with a histogram. Click on each attribute (or use the down arrow key to move through them) and look at the corresponding histogram.

Now answer Question #1.

Classifier Basics

In this section you will see how to train a classifier on the data.

Baseline Classifier

Click on the "Classify" tab. Choose ZeroR as the Classifier if it is not already chosen (it is under the "rules" subtree when you click on the "Choose" button). When used in a classification problem, ZeroR simply chooses the majority class. Under "Test options" select "Use training set", then click the "Start" button to run the classifier. You should see something like this:

The classifier output pane displays information about the model created by the classifier as well as the evaluated performance of the model. In the Summary section, the row "Correctly Classified Instances" reports the accuracy of the model.
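
To make the majority-class idea concrete, here is a minimal sketch of what a ZeroR-style baseline does for classification. It is not Weka's implementation. The function names are hypothetical and the labels are made up rather than taken from the wine data.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the most frequent label, which is all this baseline learns."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return float(correct) / len(actual)


# Made-up labels, not the wine data: 3 "good" and 2 "bad".
train_labels = ["good", "bad", "good", "good", "bad"]
majority = zero_r_fit(train_labels)
predictions = [majority] * len(train_labels)  # same answer for every instance
print(majority, accuracy(predictions, train_labels))
```

Note that the attributes are never consulted: the prediction depends only on the label counts.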

Now answer Question #2.

Decision Trees

J48 is the Weka implementation of the C4.5 decision tree algorithm.

Click on the "Choose" button and select J48 under the "trees" section. Notice that the field to the right of the "Choose" button updates to say "J48 -C 0.25 -M 2". This is a command-line representation of the current settings of J48. Click on this field to open up the configuration dialog for J48:

Each classifier has a configuration dialog such as this that shows the parameters of the algorithm as well as buttons at the top for more information. When you change the settings and close the dialog, the command line representation updates accordingly. For now we will use the default settings, so hit "Cancel" to close the dialog.

Under "Test options" select "Use training set", then click the "Start" button to run the classifier. After the classifier finishes, scroll up in the output pane. You should see a textual representation of the generated decision tree.

Now answer Question #3.

Scroll back down and record the percentage of Correctly Classified Instances. Now, under "Test options", select "Cross-validation" with 10 folds. Run the classifier again and record the percentage of Correctly Classified Instances.

In both cases, the final model that is generated is based on all of the training data. The difference is in how the accuracy of that model is estimated.
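
The mechanics of splitting data into folds can be sketched in a few lines. This is a simplified illustration with a hypothetical function name. It assigns instances to folds round-robin, whereas Weka's own procedure may differ (for example, by shuffling or stratifying the data first).

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Instances are assigned to folds round-robin with no shuffling or
    stratification, which keeps the sketch short.
    """
    indices = list(range(n_instances))
    for fold in range(k):
        test = [i for i in indices if i % k == fold]
        train = [i for i in indices if i % k != fold]
        yield train, test


# With 20 instances and 10 folds, each fold holds out 2 instances.
folds = list(k_fold_indices(20, 10))
for train, test in folds[:2]:
    print(len(train), test)
```

In each round a model would be trained on the train indices and scored on the held-out test indices, and the scores from all rounds are then combined into a single estimate.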

Now answer Question #4.

Build Your Own Classifier

This is the main part of the assignment. Search through the classifiers in Weka and run some of them on the training set. You may want to try varying some of the classifier parameters as well. Choose the one you feel is most likely to generalize well to unseen examples--namely the unlabeled examples in the test set. Feel free to use validation strategies other than 10-fold cross-validation.

When you have built the classifier you want to submit, move on to the following sections.

Saving the Model

To export a classifier model you have built:

  1. Right-click on the model in the "Result list" in the bottom left corner of the Classify tab.
  2. Select "Save model".
  3. In the dialog that opens, ensure that the File Format is "Model object files".
  4. Save the model using the naming convention given in the submission instructions (e.g. PS1.model).

In order to grade your assignment it must be possible to load your model file in Weka and run it on a labeled version of test.arff. You can load your model by right-clicking in the Result list pane and selecting "Load model".

Generating Predictions

To generate a file with predictions for the test data, perform the following steps from within the Classify tab. These steps assume you already have a trained model in the Result list, which you will run on the test set. You will produce either an ARFF file or a CSV file containing your predictions. Either format is fine for the assignment, but in both cases submit your output file as PS1.arff. Our recommended steps depend on which version of Weka you're using, as detailed below.

In Weka 3.8:

  1. Under "Test options" select "Supplied test set".
  2. Click on the "Set..." button.
  3. In the "Test Instances" dialog that opens click "Open file...".
  4. Open test.arff.
  5. Close the Test Instances dialog.
  6. Under "Test options", click "More options...".
  7. Click "Choose" next to "Output predictions" and select CSV.
  8. Click the text "CSV", click the outputFile box, and enter a location to save the file. Name your file PS1.arff for the purpose of the assignment, even though the output will not actually be an ARFF file. (It's a CSV file, but Weka doesn't give us the option of outputting ARFF. If you did want an ARFF file to work with, you could convert your CSV into ARFF by naming it PS1.csv, opening it in the ARFF viewer, and then saving it as ARFF. You could also turn that file in as PS1.arff, but you don't have to.)
  9. In attributes, enter the string "1-11"; this will output the other attributes for each test example.
  10. Right-click on your model in the Result list and select "Re-evaluate model on current test set".
  11. Submit PS1.arff per the submission instructions.

In older versions (verified in Weka 3.6):

  1. Under "Test options" select "Supplied test set".
  2. Click on the "Set..." button.
  3. In the "Test Instances" dialog that opens click "Open file...".
  4. Open test.arff.
  5. Close the Test Instances dialog.
  6. Right-click on your model in the Result list and select "Re-evaluate model on current test set". Your output will look something like the picture below. Notice that the output contains a bunch of NaNs. This is because the test data is unlabeled and therefore Weka cannot compute the accuracy.
  7. Right-click again on your model and select "Visualize classifier errors".
  8. In the dialog that opens, click on the "Save" button.
  9. Save the ARFF file using the naming convention given in the submission instructions (e.g. PS1.arff).

Now answer Questions #5 through #7.

Try Another Data Set

You will now build a classifier for a second data set concerning the evaluation of cars, after which you will answer only the last two questions. (You do not have to answer Questions #1 through #7 again.)

In order to answer the questions, perform the following steps:

Download the Car Evaluation Dataset

The car evaluation dataset files are here (see important note below):

This dataset is adapted from:

Car Evaluation Database, which was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990).

Important Note

The main data file for the car evaluation data set ends in a .data extension and has an associated auxiliary data file ending in a .names extension. However, the usage for the .data file is the same as for the .arff file you are already familiar with, including the important note applicable to the wine evaluation data set files, though you have to pay special attention to the following:

  1. When opening the main data file (car_train.data or car_test.data) in the Weka Explorer, the C4.5 data files (*.data) option needs to be selected from the File Format dropdown menu.
  2. The auxiliary data files (car_train.names and car_test.names) must be located in the same folder as the main data files. (You do not need to take any action on these auxiliary data files other than to keep them in the same folder as the main data files, but inspecting the contents of the files should help you interpret how the main data files work.)

Build Classifiers

You will perform four experiments, measuring the 10-fold cross-validation accuracy of two types of classifiers (call them classifiers A and B) on two data sets (cars and wine). You can choose A and B however you like -- they can be different classifiers (for example, nearest-neighbor vs. decision trees) or the same classifier with different settings (for example, different numbers of nearest neighbors). Your goal is to choose settings such that classifier A performs well for wine evaluation but poorly for car evaluation, and vice versa for classifier B. In other words, you should strive to find a value as large as you can for the expression below:

wine_acc(A) + car_acc(B) - wine_acc(B) - car_acc(A)

where wine_acc(A) refers to the accuracy of Classifier A on the wine data set, and car_acc(B) refers to the accuracy of Classifier B on the car data set, and so on.
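
Once you have the four cross-validation accuracies, the expression is plain arithmetic. The sketch below shows the calculation with a hypothetical helper function. The accuracy values are placeholders invented for illustration, not results from any real classifier.

```python
def separation_score(wine_acc_a, car_acc_b, wine_acc_b, car_acc_a):
    """wine_acc(A) + car_acc(B) - wine_acc(B) - car_acc(A), in percent."""
    return wine_acc_a + car_acc_b - wine_acc_b - car_acc_a


# Placeholder accuracies in percent, invented purely for illustration.
score = separation_score(wine_acc_a=80.0, car_acc_b=90.0,
                         wine_acc_b=75.0, car_acc_a=85.0)
print(score)
```

A positive score means A beats B on wine by more than A beats B on cars, or, equivalently, that each classifier has a relative advantage on its intended data set.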

Note: You do not need to obtain the largest possible quantity for the above expression, and it is okay to use classifiers we have discussed in class as long as you can achieve some positive value for the above expression (a value of 2% is sufficient for the assignment).

Note that you will only need to use the training data (car_train.data) for this task. The test data (car_test.data) is provided for your personal reference, should you choose to try your car evaluation classifier on it to gain experience. Therefore, you do not need to perform the steps under Generating Predictions for this task.

Now answer the remaining questions.

Questions

Put concise answers to the following questions in a text file, as described in the submission instructions.

  1. Based on the histograms, which attribute appears to be the most useful for classifying wine, and why?
  2. What is the accuracy (the percentage of correctly classified instances) achieved by ZeroR when you run it on the training set? Why is ZeroR a helpful baseline for interpreting the performance of other classifiers?
  3. Using a decision tree Weka learned over the training set, what is the most informative single feature for this task, and what is its influence on wine quality? Does this match your answer from question 1?
  4. What is 10-fold cross-validation? What is the main reason for the difference between the percentage of Correctly Classified Instances when you measured accuracy on the training set itself, versus when you ran 10-fold cross-validation over the training set? Why is cross-validation important?
  5. What is the "command-line" for the model you are submitting? For example, "J48 -C 0.25 -M 2". What is the reported accuracy for your model using 10-fold cross-validation?
  6. In a few sentences, describe how you chose the model you are submitting. Be sure to mention your validation strategy and whether you tried varying any of the model parameters.
  7. A Wired magazine article from several years ago on the 'Peta Age' suggests that increasingly huge data sets, coupled with machine learning techniques, make model building obsolete. In particular it says: "This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology…" In a short paragraph (about four sentences), state whether you agree with this statement, and why or why not.
  8. Briefly explain what strategy you used to obtain the Classifiers A and B that performed well on one of the car or wine data sets, and not the other.
  9. Name one major difference between the output space for the car data set vs. the wine data set, that might make some classifiers that are applicable to the wine data not applicable to the car data.