In this lab you will be implementing a decision tree algorithm similar to the ID3 algorithm discussed in class, but using a different formula for gain as discussed below. Note that in this version of decision tree learning you may assume that you only have discrete features, that there are no unknown values, and that no pruning need be performed.
You will actually be implementing two separate programs. The first program should take as command line arguments the .names and .data files of a dataset and should create a decision tree from that data, storing the result in a file named by the third command line argument. Your program should be run like this:
dt_create DATASET.names DATASET.data DATASET.tree
This would take a dataset defined by the files DATASET.names and DATASET.data, learn a tree from that data, and store the resulting tree in DATASET.tree. Your code should also print out a nicely formatted version of the learned tree that looks something like this (you may make it look more impressive if you like):
Outlook
  =Rain
    Wind
      =Weak
        Class=Yes
      =Strong
        Class=No
  =Overcast
    Class=Yes
  =Sunny
    Humidity
      =Normal
        Class=Yes
      =High
        Class=No
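For reference, output in that shape falls out of a simple recursive traversal. Below is a minimal sketch, not a required design: the Node structure, its field names, and the saveTree line format are all illustrative assumptions.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical tree node -- label holds the attribute name at a split
// or the class name at a leaf; values[i] labels the edge to children[i].
struct Node {
    std::string label;
    std::vector<std::string> values;
    std::vector<Node*> children;
    bool isLeaf() const { return children.empty(); }
};

// Print the tree with two spaces of indentation per level.
void printTree(const Node* n, std::ostream& out, int depth = 0) {
    std::string pad(2 * depth, ' ');
    if (n->isLeaf()) {
        out << pad << "Class=" << n->label << "\n";
        return;
    }
    out << pad << n->label << "\n";
    for (size_t i = 0; i < n->children.size(); ++i) {
        out << pad << "  =" << n->values[i] << "\n";
        printTree(n->children[i], out, depth + 2);
    }
}

// Write the tree pre-order, one node per line; a recursive reader can
// rebuild the tree by consuming the lines in the same order.
void saveTree(const Node* n, std::ostream& out) {
    if (n->isLeaf()) {
        out << "leaf " << n->label << "\n";
        return;
    }
    out << "split " << n->label << " " << n->children.size() << "\n";
    for (size_t i = 0; i < n->children.size(); ++i) {
        out << "value " << n->values[i] << "\n";
        saveTree(n->children[i], out);
    }
}

int main() {
    // Tiny hand-built fragment of the example tree above.
    Node yes{"Yes"}, no{"No"};
    Node wind{"Wind", {"Weak", "Strong"}, {&yes, &no}};
    Node root{"Outlook", {"Rain"}, {&wind}};
    printTree(&root, std::cout);
    saveTree(&root, std::cout);
}

The same pre-order walk used by saveTree makes reading the tree file back straightforward: dt_predict can call a recursive reader that rebuilds one node per line in the same order.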
Your code will need to save a representation of the resulting tree in the file named by the third command line argument (I suggest writing out the tree using a pre-order traversal, as in the sketch above). Your second program should read in a dataset and a previously learned decision tree and determine how accurate that tree is on the dataset. This program should work like this:
dt_predict DATASET.names DATASET.data DATASET.tree
where the DATASET.names and DATASET.data files define the set of data and DATASET.tree contains the previously learned tree (you may assume that the tree file will always be generated by your code and will always be correct for the supplied dataset). In this program you should count, for each example in the dataset, the class it actually is and the class the supplied tree predicts for it, and print the resulting totals for each possible combination (the result is called a confusion matrix). You should also print the overall accuracy (which is simply the sum of the counts on the diagonal divided by the total number of examples). For example, your output might look like this:
             Actual Class
               0     1
            --------------
Predicted 0 |  5     1
Class     1 |  0    10

Accuracy = 93.75%
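Counting into such a table is straightforward. The sketch below is illustrative only: the actual and predicted vectors stand in for whatever per-example prediction loop your dt_predict uses, and the hard-coded data merely reproduces the table above.

#include <iostream>
#include <iomanip>
#include <vector>

int main() {
    int numClasses = 2;
    // Stand-ins for the true class of each example and the class the
    // tree predicted for it (hypothetical data matching the table above).
    std::vector<int> actual    = {0,0,0,0,0, 1, 1,1,1,1,1,1,1,1,1,1};
    std::vector<int> predicted = {0,0,0,0,0, 0, 1,1,1,1,1,1,1,1,1,1};

    // counts[r][c] = number of examples predicted as class r whose
    // actual class is c.
    std::vector<std::vector<int> > counts(numClasses,
                                          std::vector<int>(numClasses, 0));
    for (size_t i = 0; i < actual.size(); ++i)
        counts[predicted[i]][actual[i]]++;

    std::cout << "             Actual Class\n" << "             ";
    for (int c = 0; c < numClasses; ++c) std::cout << std::setw(5) << c;
    std::cout << "\n";
    int correct = 0;
    for (int r = 0; r < numClasses; ++r) {
        std::cout << "Predicted " << r << " |";
        for (int c = 0; c < numClasses; ++c)
            std::cout << std::setw(5) << counts[r][c];
        std::cout << "\n";
        correct += counts[r][r];   // diagonal = correctly classified
    }
    std::cout << "Accuracy = "
              << 100.0 * correct / actual.size() << "%\n";
}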
In this lab we will be using a different notion of gain in place of the Information Gain formula used in ID3. For a particular set of data S and an attribute A, the gain Gain(A,S) is defined as follows:
$$G(S) = \sum_{X=1}^{\#\mathrm{classes}} \; \sum_{Y=X+1}^{\#\mathrm{classes}} p(\mathrm{Class}{=}X,\,S) \cdot p(\mathrm{Class}{=}Y,\,S)$$

$$G(A,S) = \sum_{V=1}^{\#\mathrm{values}(A)} P(\mathrm{Value}{=}V,\,S) \left( \sum_{X=1}^{\#\mathrm{classes}} \; \sum_{Y=X+1}^{\#\mathrm{classes}} p(\mathrm{Class}{=}X \mid \mathrm{Value}{=}V,\,S) \cdot p(\mathrm{Class}{=}Y \mid \mathrm{Value}{=}V,\,S) \right)$$

$$\mathrm{Gain}(A,S) = G(S) - G(A,S)$$
where p(Class=X,S) is the proportion of the set of points S that has class X, P(Value=V,S) is the proportion of the set of points S that has feature value V, and p(Class=X|Value=V,S) is the proportion of the points in S with feature value V that have class X.
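As a concrete reading of these formulas, here is a sketch of how the computation might look, assuming (purely for illustration) that examples are stored as integer-coded feature vectors with a parallel vector of class indices. Note that the inner double sum of G(A,S) is just G(S) applied to the subset of examples with Value=V, so a single helper suffices:

#include <iostream>
#include <vector>

// G(S): sum over unordered class pairs X < Y of
// p(Class=X,S) * p(Class=Y,S).  classes holds the class index of
// each example in S.
double g(const std::vector<int>& classes, int numClasses) {
    if (classes.empty()) return 0.0;
    std::vector<double> p(numClasses, 0.0);
    for (size_t i = 0; i < classes.size(); ++i) p[classes[i]] += 1.0;
    for (int c = 0; c < numClasses; ++c) p[c] /= classes.size();
    double sum = 0.0;
    for (int x = 0; x < numClasses; ++x)
        for (int y = x + 1; y < numClasses; ++y)
            sum += p[x] * p[y];
    return sum;
}

// Gain(A,S) = G(S) - sum over values V of P(Value=V,S) times G of the
// subset of S whose attribute A equals V.
double gain(const std::vector<std::vector<int> >& examples,
            const std::vector<int>& classes,
            int attribute, int numValues, int numClasses) {
    double weighted = 0.0;
    for (int v = 0; v < numValues; ++v) {
        std::vector<int> subset;   // class indices of examples with A = v
        for (size_t i = 0; i < examples.size(); ++i)
            if (examples[i][attribute] == v) subset.push_back(classes[i]);
        weighted += (double(subset.size()) / examples.size())
                    * g(subset, numClasses);
    }
    return g(classes, numClasses) - weighted;
}

int main() {
    // Toy check: one binary attribute that separates two classes
    // perfectly, so Gain = G(S) = 0.5 * 0.5 = 0.25.
    std::vector<std::vector<int> > ex = {{0}, {0}, {1}, {1}};
    std::vector<int> cls = {0, 0, 1, 1};
    std::cout << gain(ex, cls, 0, 2, 2) << "\n";   // prints 0.25
}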
You should write up a short report (at least one page, no more than three) discussing your design decisions in implementing the decision tree code and how your version of the code works.
Comment your code and turn in hard copies of all of the code (and the report above). You should also test your code on the 8 pairs of data files in the directory ~rmaclin/public/datasets/8751. Copy all of the .data and .names files from that directory to your code directory. After you have done so, run the following commands using your code and print out all of the results:
dt_create promoters.names promoters.train-0.data promoters.tree0
dt_predict promoters.names promoters.test-0.data promoters.tree0
dt_create promoters.names promoters.train-1.data promoters.tree1
dt_predict promoters.names promoters.test-1.data promoters.tree1
dt_create promoters.names promoters.train-2.data promoters.tree2
dt_predict promoters.names promoters.test-2.data promoters.tree2
dt_create promoters.names promoters.train-3.data promoters.tree3
dt_predict promoters.names promoters.test-3.data promoters.tree3
dt_create soybean.names soybean.train-0.data soybean.tree0
dt_predict soybean.names soybean.test-0.data soybean.tree0
dt_create soybean.names soybean.train-1.data soybean.tree1
dt_predict soybean.names soybean.test-1.data soybean.tree1
dt_create soybean.names soybean.train-2.data soybean.tree2
dt_predict soybean.names soybean.test-2.data soybean.tree2
dt_create soybean.names soybean.train-3.data soybean.tree3
dt_predict soybean.names soybean.test-3.data soybean.tree3
You must also submit your code electronically. To do this go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs8751-1-f2003/upload.cgi and follow the directions for uploading a file (you can do this multiple times, though it would be helpful if you would tar your files and upload one file archive).
To make your code easier to check and grade please use the following procedure for collecting the code before uploading it:
Place all of your code in a directory named login/prog02, where login is your login name (for example, mine would be rmaclin/prog02). Note that the suffix of all C++ code files (not .h files) should be ".cc". Only code files (in C++, only .cc and .h files) and your makefile should be stored in this directory. Then create the archive to upload with:
tar cf prog02.tar login/prog02