In this lab you will be adding capabilities to the decision tree programs you implemented in program 2. The three capabilities you will add are: post-pruning, handling continuous valued features, and coping with unknown feature values.
You will again be implementing two separate programs, dt_create and dt_predict, which should work as in the previous program, but they should also be able to perform post-pruning and cope with datasets that have continuous valued features and unknown feature values.
The dt_create program should work as before but take a fourth command line argument (with a default value of 1.0 if that argument is not provided). The program should be run like this:
dt_create DATASET.names DATASET.data DATASET.tree PRUNEVALUE
This takes the dataset defined by the files DATASET.names and DATASET.data, learns a tree from that data using the pruning value PRUNEVALUE (discussed below), and stores the resulting tree in DATASET.tree. Your code should print out a nicely formatted version of the tree that is learned from the dataset before pruning and a second version of the tree after pruning (the pruned version of the tree is the one that should be written to the file DATASET.tree).
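One way to handle the optional fourth argument is sketched below. This is only an illustration, not required structure; the struct and function names (CreateArgs, parseCreateArgs) are hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch: collect dt_create's arguments, defaulting
// PRUNEVALUE to 1.0 when the fourth argument is absent.
struct CreateArgs {
    std::string namesFile, dataFile, treeFile;
    double pruneValue;
};

bool parseCreateArgs(int argc, char* argv[], CreateArgs& out) {
    if (argc < 4 || argc > 5) return false;   // usage error
    out.namesFile = argv[1];
    out.dataFile  = argv[2];
    out.treeFile  = argv[3];
    out.pruneValue = (argc == 5) ? std::atof(argv[4]) : 1.0;
    return true;
}
```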
FeatureA
  < 3.5
    FeatureB
      =B1  Class=[ 4.5, 1.0, 0.0 ]
      =B2  Class=[ 0.0, 3.5, 0.0 ]
  >= 3.5  Class=[ 1.0, 0.0, 6.0 ]
Note that since we will be pruning, a particular leaf node may cover examples from multiple classes. In your printed tree you should show this by making the class a vector giving how many examples from each class are covered by that leaf node.
The program dt_predict should be run the same as in program 2, but note that examples with unknown feature values should be predicted as discussed in class (more on this below) and a prediction should be based on the class with the largest number of examples at a leaf. Your dt_predict should print out a confusion matrix for the predictions as was done in program 2.
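A minimal sketch of building and printing a confusion matrix is shown below. The function names (confusionMatrix, printConfusionMatrix) and the row/column convention (rows are actual classes, columns are predicted classes) are assumptions for illustration, not a required layout.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical sketch: tally (actual, predicted) class pairs into a
// numClasses x numClasses confusion matrix.
std::vector<std::vector<int>> confusionMatrix(
        const std::vector<std::pair<int,int>>& actualPredicted,
        int numClasses) {
    std::vector<std::vector<int>> m(numClasses,
                                    std::vector<int>(numClasses, 0));
    for (const auto& ap : actualPredicted)
        m[ap.first][ap.second]++;
    return m;
}

// Print the matrix: one row per actual class, one column per
// predicted class.
void printConfusionMatrix(const std::vector<std::vector<int>>& m) {
    for (size_t actual = 0; actual < m.size(); ++actual) {
        for (size_t pred = 0; pred < m[actual].size(); ++pred)
            std::printf("%6d", m[actual][pred]);
        std::printf("\n");
    }
}
```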
Continuous valued features should be dealt with as discussed in class. For each continuous valued feature you should sort the examples being separated by their values for that feature and then consider "split" points as discussed in class.
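The sort-then-split idea can be sketched as follows, assuming the common convention of placing candidate thresholds midway between adjacent examples whose class labels differ; the function name candidateSplits is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: given (featureValue, classLabel) pairs for one
// continuous feature, sort by value and propose a split threshold
// midway between adjacent distinct values whose class labels differ.
std::vector<double> candidateSplits(
        std::vector<std::pair<double,int>> ex) {
    std::sort(ex.begin(), ex.end());   // sorts by value, then label
    std::vector<double> splits;
    for (std::size_t i = 1; i < ex.size(); ++i) {
        if (ex[i-1].second != ex[i].second &&
            ex[i-1].first  != ex[i].first)
            splits.push_back((ex[i-1].first + ex[i].first) / 2.0);
    }
    return splits;
}
```

Each returned threshold t would then be evaluated as a binary test (value < t versus value >= t) using the same information-gain criterion as for discrete features.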
When considering examples with an unknown value for a feature, a fraction of that example should be assigned to each branch based on the number of examples with the same class being passed down that branch. For example, when considering feature A, if five positive examples have value A1 for feature A, 10 positive examples have value A2 for feature A, and one positive example has an unknown value for feature A, one third of the unknown example should be considered as having value A1, and two thirds of the unknown example should be considered as having value A2.
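The fractional assignment amounts to normalizing the per-branch counts of known-valued examples of the same class. A minimal sketch (the function name unknownFractions is hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: split a weight-1 example with an unknown
// feature value across branches, in proportion to the known-valued
// examples of the same class that went down each branch.
std::vector<double> unknownFractions(
        const std::vector<double>& knownCounts) {
    double total = 0.0;
    for (double c : knownCounts) total += c;
    std::vector<double> frac(knownCounts.size(), 0.0);
    if (total > 0.0)
        for (std::size_t i = 0; i < knownCounts.size(); ++i)
            frac[i] = knownCounts[i] / total;
    return frac;
}
```

For the example in the text, counts of {5, 10} for branches A1 and A2 give fractions of one third and two thirds.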
When predicting an example, if an unknown value is encountered, the example should be predicted along all paths and weighted by the fraction of examples that went down each branch. For example, if 40 examples followed branch A1 of feature A and 60 examples followed branch A2 of feature A, then the prediction should be based 40% on branch A1 and 60% on branch A2.
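The weighted prediction can be sketched as a blend of the class vectors returned by each branch; the function name blendPredictions is hypothetical, and the sketch assumes each branch's result is already expressed as a per-class vector.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: when prediction encounters an unknown feature
// value, follow every branch and blend the resulting class vectors,
// weighting each branch by the fraction of training examples that
// followed it.
std::vector<double> blendPredictions(
        const std::vector<std::vector<double>>& branchClassVecs,
        const std::vector<double>& branchCounts) {
    double total = 0.0;
    for (double c : branchCounts) total += c;
    std::vector<double> out(branchClassVecs[0].size(), 0.0);
    for (std::size_t b = 0; b < branchClassVecs.size(); ++b) {
        double w = (total > 0.0) ? branchCounts[b] / total : 0.0;
        for (std::size_t k = 0; k < out.size(); ++k)
            out[k] += w * branchClassVecs[b][k];
    }
    return out;
}
```

With the 40/60 example from the text, a branch followed by 40 training examples contributes 40% of the final class vector and the other branch 60%.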
Your algorithm should perform post-pruning. This should be done bottom up. A branch should be pruned if the number of examples that would be incorrectly classified if the node is pruned is less than the number of branches times the value PRUNEVALUE. For example, if there is a node with one positive example at branch A1 and three negative examples at branch A2, and PRUNEVALUE is 1.0 then this node could be pruned (by pruning it, only one example will be incorrectly predicted while two branches are eliminated).
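The pruning test itself is simple once a node's children have been merged into a single class-count vector: the examples outside the majority class are the ones a pruned leaf would misclassify. A minimal sketch (the function name shouldPrune is hypothetical):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch of the pruning test: a node may be pruned when
// the number of examples misclassified by collapsing it to a leaf is
// less than (number of branches) * PRUNEVALUE.
bool shouldPrune(const std::vector<double>& mergedClassCounts,
                 int numBranches, double pruneValue) {
    double total = 0.0, best = 0.0;
    for (double c : mergedClassCounts) {
        total += c;
        best = std::max(best, c);
    }
    // Everything not in the majority class would be misclassified.
    double errorsIfPruned = total - best;
    return errorsIfPruned < numBranches * pruneValue;
}
```

For the example in the text, merged counts of {1, 3} over two branches with PRUNEVALUE 1.0 give one error, which is less than 2 * 1.0, so the node may be pruned. Applying this bottom up means a node is only considered for pruning after its subtrees have already been pruned where possible.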
Since pruning is used, a leaf node should record how many examples (including fractional examples) of each class that leaf node covers -- for example, if a leaf node has 5 examples of class 1, 4 examples of class 2, and 1 example of class 3, its prediction should be .5 class 1, .4 class 2, and .1 class 3.
You should write up a short report (at least one page, no more than three) discussing your design decisions in implementing the decision tree code and how your version of the code works.
Comment your code and turn in hard copies of all of the code (and the report above). You should also perform tests of your code for 12 of the pairs of data files in the directory ~rmaclin/public/datasets/8751. You should copy all of the .data and .names files from that directory to your code directory. After you have done so, run the following commands using your code and print out all of the results:
dt_create credit-a.names credit-a.train-0.data credit-a.tree0 1.3
dt_predict credit-a.names credit-a.test-0.data credit-a.tree0
dt_create credit-a.names credit-a.train-1.data credit-a.tree1 1.3
dt_predict credit-a.names credit-a.test-1.data credit-a.tree1
dt_create credit-a.names credit-a.train-2.data credit-a.tree2 1.3
dt_predict credit-a.names credit-a.test-2.data credit-a.tree2
dt_create credit-a.names credit-a.train-3.data credit-a.tree3 1.3
dt_predict credit-a.names credit-a.test-3.data credit-a.tree3
dt_create hypo.names hypo.train-0.data hypo.tree0 1.3
dt_predict hypo.names hypo.test-0.data hypo.tree0
dt_create hypo.names hypo.train-1.data hypo.tree1 1.3
dt_predict hypo.names hypo.test-1.data hypo.tree1
dt_create hypo.names hypo.train-2.data hypo.tree2 1.3
dt_predict hypo.names hypo.test-2.data hypo.tree2
dt_create hypo.names hypo.train-3.data hypo.tree3 1.3
dt_predict hypo.names hypo.test-3.data hypo.tree3
dt_create house-votes-84.names house-votes-84.train-0.data house-votes-84.tree0
dt_predict house-votes-84.names house-votes-84.test-0.data house-votes-84.tree0
dt_create house-votes-84.names house-votes-84.train-1.data house-votes-84.tree1
dt_predict house-votes-84.names house-votes-84.test-1.data house-votes-84.tree1
dt_create house-votes-84.names house-votes-84.train-2.data house-votes-84.tree2
dt_predict house-votes-84.names house-votes-84.test-2.data house-votes-84.tree2
dt_create house-votes-84.names house-votes-84.train-3.data house-votes-84.tree3
dt_predict house-votes-84.names house-votes-84.test-3.data house-votes-84.tree3
You should run further tests showing the effects of different values of PRUNEVALUE (you may select which dataset you use to show the effect of this value). Make sure to discuss these results. Also create a decision tree for your personal dataset and discuss how accurate you think the resulting tree is.
You must also submit your code electronically. To do this go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs8751-1-f2003/upload.cgi and follow the directions for uploading a file (you can do this multiple times, though it would be helpful if you would tar your files and upload one file archive).
To make your code easier to check and grade please use the following procedure for collecting the code before uploading it:
Create a directory named with your login followed by prog03 (for example, rmaclin/prog03). Note that the suffix of all C++ code files (not .h files) should be ".cc". Only code files (in C++, only .cc and .h files) and your makefile should be stored in this directory.
tar cf prog03.tar login/prog03