1. In the car-starting domain, there are three boolean variables: CarStarts, BatteryWorks, and HasFuel.
    1. How many parameters are needed to describe the full joint probability distribution in the domain, not assuming any independencies between variables? (1 point)
      Clarifying note: here, parameters are the numbers used to specify the probability distribution over the variables. See, for example, slide 50 in the lecture notes on Bayesian Learning, which shows that the burglary net requires 10 numbers (i.e., 10 parameters); a worked version of that count appears after the questions.
    2. Using your intuition about the domain, diagram a Bayes Net that encodes the dependencies between the variables. (2 points)
    3. How many parameters are needed to describe the distribution of the variables using your Bayes Net? (Hint: this should be smaller than your answer to (a).) (1 point)
  2. Consider using instance-based learning to classify documents.
    1. Describe a simple method for computing the distance between two documents. (1 point) (An illustrative sketch of one possible method appears after the questions.)
    2. Describe an enhancement to your simple method that you think will improve accuracy. Why is your new method likely to be better? (1 point)
    3. Describe a different enhancement to your simple method that you think will improve efficiency at query-time. Why is it more efficient? (1 point)
      1. Extra credit: Describe mathematically the expected execution-time savings from your enhancement for classifying a document using k-nearest neighbor, in terms of average document size and the number of training examples. (1 point)
  3. Experiment with a machine learning package downloaded from the Web. We strongly recommend the Weka package, as it is known to have the capabilities required to answer all the questions, and the training and test sets we provide are already in the Weka format.
    1. Train two classifiers - Decision Trees and Naive Bayes - on the heart disease dataset (taken from the UCI Machine Learning repository). You should use the following Training and Test files; the different variables are described here. Train both classifiers on the training set and paste the output of your program. Your decision tree should be unpruned. (2 points)
    2. Test the trained classifiers on the test set. Which performs better? Why do you think this might be the case? (1 point)
    3. How does the training accuracy for the decision tree compare to the test accuracy? What explains the difference? (1 point)
    4. Try using some enhancement to the basic decision tree that alleviates the effect observed in (c); measure your test set performance again. Describe the technique that you used, and explain any changes to test set performance that you see. (2 points) Please provide outputs, or snippets of outputs, as evidence for your answers.
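
Worked example for the clarifying note in question 1: assuming the standard burglary-net structure from the lecture (Burglary and Earthquake are parents of Alarm, which is the sole parent of JohnCalls and MaryCalls, all boolean), the figure of 10 numbers comes from counting one probability per parent configuration of each node:

```latex
% One number per parent configuration of each boolean node;
% the complementary probability is implied and not counted separately.
\underbrace{1}_{P(B)} + \underbrace{1}_{P(E)}
  + \underbrace{2^{2}}_{P(A \mid B,E)}
  + \underbrace{2}_{P(J \mid A)}
  + \underbrace{2}_{P(M \mid A)} = 10
```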

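Illustration for question 2: the sketch below shows one possible simple document distance, assuming a bag-of-words representation compared with cosine distance and plugged into a plain k-nearest-neighbor vote. The training-set format (a list of (text, label) pairs) and the function names are hypothetical; this is only a baseline, not the intended answer.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Term-frequency vector: lower-cased, whitespace-tokenized."""
    return Counter(text.lower().split())

def cosine_distance(doc_a, doc_b):
    """1 minus the cosine similarity of the two term-frequency vectors."""
    a, b = bag_of_words(doc_a), bag_of_words(doc_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def knn_classify(query_text, training_pairs, k=3):
    """training_pairs: list of (document_text, label) tuples (hypothetical format)."""
    neighbors = sorted(training_pairs, key=lambda ex: cosine_distance(query_text, ex[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)
```
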
    Notes on files: You will have to tweak the files a little if you are not using Weka. In Weka, training files can be plain comma-separated values (CSV) or a Weka-specific format called ARFF. The files linked in this problem set are in the ARFF format, which only adds a descriptive header to an otherwise ordinary CSV file. Therefore, in order to use these files with other programs you will need to remove the top of the file (which contains the ARFF-specific declarations) and adjust the remaining CSV content accordingly; a minimal example follows below. Please see This Wiki Post for the ARFF format details; the overview should be plenty.
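
For example, a minimal ARFF file (the attribute names here are made up for illustration, not the actual heart-disease header) is just a declarative header followed by CSV rows:

```
@relation heart-toy
@attribute age numeric
@attribute chest_pain {typical,atypical,none}
@attribute class {present,absent}
@data
63,typical,present
41,none,absent
```

Everything above and including the @data line is the ARFF-specific part; the rows after it are ordinary comma-separated values.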

    Notes on classifiers: The ID3 algorithm in its pure form (chapter 3 of the book) can only classify instances with nominal attributes. Because the data for this problem contains attributes with continuous values, you will use the C4.5 algorithm; a small sketch of how a continuous attribute can be handled appears below, and section 3.7 of chapter 3 has a deeper discussion of these algorithms. Weka has one classifier, J48, that implements C4.5, so please use that for decision trees. However, the default options turn pruning on; you have to turn it off at least for (a), and only turn it back on when you have a good reason.
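
As a rough illustration of why continuous attributes need extra machinery, the sketch below (plain Python, not Weka's actual code) shows the core idea C4.5-style learners use for a numeric attribute: evaluate candidate thresholds between consecutive sorted values and keep the split with the highest information gain (C4.5 itself uses gain ratio and further refinements).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold_split(values, labels):
    """Try midpoints between consecutive sorted values of a continuous
    attribute; return the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - weighted > best_gain:
            best_gain, best_threshold = base - weighted, t
    return best_threshold, best_gain

# Hypothetical toy data: ages vs. a made-up disease label.
print(best_threshold_split([29, 40, 52, 54, 61, 67],
                           ["absent", "absent", "absent", "present", "present", "present"]))
```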

    Notes on Weka: Weka is a Java program, so you have to have Java installed. When you run Weka you will have the option of launching a few tools; for this assignment you will be using the Explorer, which has a fairly intuitive interface. In the first tab you open a file and look at its columns, some statistical properties of the columns, and some meta-information about them. The second tab, "Classify", is where you define your test set and choose a classifier to run. Once you choose a classifier, you can click on the text box next to the Choose button to obtain information about the classifier and to modify its options (for example, to enable or disable pruning!). Once you run a classifier on the data, a full description and metrics of the classification will appear on the right. You can save that output if you want.