In this lab you will be implementing a genetic algorithm for producing a set of rules to predict a concept.
The format of your hypothesis as a bit string should be based on the GABIL algorithm, with one generalization: your rules should also be able to reference continuous variables.
To represent clauses corresponding to discrete features, use the standard GABIL representation: one bit for each value of that feature, where a 1 indicates that the value is allowed and a 0 indicates that it is not. For example, for feature A with three possible values (a1, a2, a3), three bits would be used. If the three bits were set to 101 in a hypothesis, that would indicate a condition of (A = a1 or A = a3).
For continuously valued attributes, you should use a representation with 2 bits for an operation and 32 bits to represent a floating point value. If both operation bits are set to 1, then any value of the continuously valued feature is allowed. If the 2 operation bits are 10, then values less than or equal to the floating point value are allowed, and if they are 01, then values greater than or equal to the floating point value are allowed. For example, if the 34 bits for continuous feature B were 01 followed by the 32 bits representing 1.5, this would indicate a test of B >= 1.5.
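As an illustration of the two clause encodings described above, here is a minimal Python sketch. It rests on assumptions the assignment does not state: the 32 value bits are treated as a big-endian IEEE 754 single-precision float, and the operation bits 00 are treated as matching nothing. The function names are hypothetical, not part of GABIL or of any required interface.

```python
import struct

def decode_discrete(bits, values):
    """Return the set of values allowed by a GABIL-style discrete clause."""
    if len(bits) != len(values):
        raise ValueError("need exactly one bit per feature value")
    return {v for b, v in zip(bits, values) if b == "1"}

def float_to_bits(x):
    """Encode x as a 32-character bit string (assumed IEEE 754 single, big-endian)."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    return format(as_int, "032b")

def bits_to_float(bits):
    """Inverse of float_to_bits."""
    (x,) = struct.unpack(">f", struct.pack(">I", int(bits, 2)))
    return x

def continuous_allows(bits, value):
    """Test a value against a 34-bit clause: 2 operation bits + 32 value bits."""
    op, threshold = bits[:2], bits_to_float(bits[2:])
    if op == "11":
        return True                 # any value is allowed
    if op == "10":
        return value <= threshold
    if op == "01":
        return value >= threshold
    return False                    # "00" is undefined in the text; assumed to match nothing

clause_a = "101"                        # A = a1 or A = a3
clause_b = "01" + float_to_bits(1.5)    # B >= 1.5
print(sorted(decode_discrete(clause_a, ["a1", "a2", "a3"])))
print(continuous_allows(clause_b, 2.0), continuous_allows(clause_b, 1.0))
```

Because point mutation can flip any of the 32 value bits, a decoded threshold may turn out to be infinity or NaN. In this sketch, comparisons against NaN are simply false, so such a clause allows no value unless its operation bits are 11.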
You should plan on implementing one cross-over operator, plus the point mutation, AddAlternative, and DropCondition operators. For cross-over, you should use a variation on the GABIL cross-over operator as follows:
The fitness of each individual should be the accuracy of its hypothesis on the training set.
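The mutation-style operators and the fitness measure mentioned above might be sketched as follows in Python. This is only one possible reading: hypotheses are assumed to be strings of "0" and "1", the caller is assumed to know the start position and width of each discrete clause, and `classify` is a hypothetical function that you would supply to apply a hypothesis to one example. The cross-over operator is not shown.

```python
import random

def point_mutation(bits, rng=random):
    """Flip one randomly chosen bit of the hypothesis."""
    i = rng.randrange(len(bits))
    flipped = "1" if bits[i] == "0" else "0"
    return bits[:i] + flipped + bits[i + 1:]

def add_alternative(bits, start, width, rng=random):
    """Generalize a discrete clause by changing one of its 0 bits to 1."""
    zeros = [i for i in range(start, start + width) if bits[i] == "0"]
    if not zeros:
        return bits                 # clause already allows every value
    i = rng.choice(zeros)
    return bits[:i] + "1" + bits[i + 1:]

def drop_condition(bits, start, width):
    """Set every bit of a clause to 1 so the feature no longer constrains the rule."""
    return bits[:start] + "1" * width + bits[start + width:]

def fitness(classify, hypothesis, examples):
    """Fraction of (features, label) training examples classified correctly."""
    correct = sum(1 for features, label in examples
                  if classify(hypothesis, features) == label)
    return correct / len(examples)

rule = "100" + "1"                  # hypothetical layout: 3 bits for feature A, 1 class bit
print(drop_condition(rule, 0, 3))   # prints 1111
print(add_alternative(rule, 0, 3))  # either 1101 or 1011
```

For a continuous feature, one plausible analogue of DropCondition is to set the two operation bits to 11, which `drop_condition` does when given the position of those bits and a width of 2. That extension is an assumption, not something GABIL itself defines.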
Your code should take a dataset and several parameters. The following parameters should be set by the user when calling the program:
You should set up your code so that it is able to save the top 11 hypotheses in a file. You should then be able to use the resulting hypotheses to predict the class for a separate set of data (a test file) and produce a confusion matrix for that data. Your output should look something like this:
Confusion matrix for hypothesis 1 from file XXX on data in file XXX:

                 Actual Class
                   0     1
                --------
Predicted   0 |    5     1
Class       1 |    0    10

Accuracy = 93.75%

Confusion matrix for hypothesis 2 from file XXX on data in file XXX:
...
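A report like the one above could be produced along these lines. The sketch is in Python and assumes two classes labeled 0 and 1; the function names and the file names passed in are placeholders rather than a required interface. The sample data is built to reproduce the counts in the example output.

```python
def confusion_matrix(predicted, actual, n_classes=2):
    """matrix[p][a] counts examples predicted as class p whose actual class is a."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for p, a in zip(predicted, actual):
        matrix[p][a] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def format_report(matrix, hyp_no, hyp_file, data_file):
    """Render a two-class matrix in roughly the layout shown in the assignment."""
    return "\n".join([
        f"Confusion matrix for hypothesis {hyp_no} from file {hyp_file} "
        f"on data in file {data_file}:",
        "                 Actual Class",
        "                   0     1",
        "                --------",
        f"Predicted   0 | {matrix[0][0]:4d}  {matrix[0][1]:4d}",
        f"Class       1 | {matrix[1][0]:4d}  {matrix[1][1]:4d}",
        f"Accuracy = {accuracy(matrix):.2%}",
    ])

# 16 test examples arranged to match the counts in the sample output.
predicted = [0] * 6 + [1] * 10
actual = [0] * 5 + [1] + [1] * 10
matrix = confusion_matrix(predicted, actual)
report = format_report(matrix, 1, "XXX", "XXX")
print(report)
```

With these counts, 15 of the 16 examples lie on the diagonal, which gives the 93.75% shown in the sample.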
Print out a copy of all of your code files. You should hand in printouts demonstrating how your program works by running your program on several data sets, including your own. For the rules produced from your own data set, analyze the rules and assess how accurately you think they capture the concept expressed by your data.
You should also write up a short report (at least one page, no more than three) discussing your design decisions in implementing the genetic algorithm and how your version of the code works.
You must also submit your code electronically. To do this go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs8751-1-s2003/upload.cgi and follow the directions for uploading a file (you can do this multiple times, though it would be helpful if you would tar your files and upload one file archive).
To make your code easier to check and grade please use the following procedure for collecting the code before uploading it:
rmaclin/prog04_cc
Note that the suffix of all C++ code files (not .h files) should be ".cc". Only code files (for example, in C++, only .cc and .h files) should be stored in this directory.
tar cf prog04.tar login/prog04_PLcode