One important aspect of understanding Machine Learning is the critical role that data plays in ML. In this assignment I want you to familiarize yourself with one of the common formats (C4.5) for the standard data sets that can be found at the UCI Machine Learning repository. The C4.5 format allows researchers to create data sets of interest and deposit them at the UCI repository so that other researchers can use the data in their learning programs and compare their results.
In this assignment you will design a class (in C++ or Java -- if you wish to use a different language contact me) called DataSet that is capable of representing a data set described in C4.5 format (see below) and the methods needed to read such a data set from a file and write out a data set in the same format.
After implementing the DataSet class you should build your own data set to be used in later testing. Your data set should include at least 50 data points and at least eight different features (in addition to the class). Your data set should have at least two discrete features and at least two continuous features. Unknown values may be useful but are not necessary. Try to pick a data set representing something in which you have interest.
I have ported a number of the data sets from the UCI repository to my home machine so that we can easily access these data sets without having to ftp them. The files for the data sets can be found in the directory:
~rmaclin/public/datasets
Each data set in the C4.5 format consists of two files, one ending in an extension .names which gives the feature names, possible feature values, and classification values for a data set and a second file ending in .data that lists the actual data points in a data set. For example, one data set is the labor data set which is stored in the files labor.names and labor.data. labor.names looks like this:
good, bad. | Classes

duration: continuous
wage increase first year: continuous
wage increase second year: continuous
wage increase third year: continuous
cost of living adjustment: none,tcf,tc
working hours: continuous
pension: none,ret_allw,empl_contr
standby pay: continuous
shift differential: continuous
education allowance: yes,no
statutory holidays: continuous
vacation: belowaverage,average,generous
longterm disability assistance: yes,no
contribution to dental plan: none,half,full
bereavement assistance: yes,no
contribution to health plan: none,half,full
The first line of any .names file lists the classes into which data points are divided. In labor.names, the first line indicates that points are labeled good or bad (good or bad is their classification). The first line has the format:
Name1, Name2, ..., NameN.
Each of the names is a class that a point can be labeled with (and is the focus of our learning in an inductive learning system). In the labor data set, data points are labeled "good" or "bad". In the labor.names file a comment is added at the end of the first line (the characters " | Classes") which is ignored by the parser. Following the first line is a blank line and then a list of the feature names and the possible values of those features.
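As a rough illustration of the first-line format described above, here is a minimal Java sketch. The class name ClassLineSketch and the method parseClasses are hypothetical names chosen for this example, not part of the assignment. The sketch assumes that everything from the first "|" onward is a comment and that the list of class names ends with a single ".".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses the first line of a .names file.
class ClassLineSketch {
    static List<String> parseClasses(String line) {
        // Assumption: a '|' starts a comment that runs to the end of the line.
        int bar = line.indexOf('|');
        String body = (bar >= 0 ? line.substring(0, bar) : line).trim();
        // Drop the "." that terminates the list of class names.
        if (body.endsWith(".")) {
            body = body.substring(0, body.length() - 1);
        }
        List<String> classes = new ArrayList<>();
        for (String name : body.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                classes.add(trimmed);
            }
        }
        return classes;
    }

    public static void main(String[] args) {
        System.out.println(parseClasses("good, bad. | Classes")); // prints [good, bad]
    }
}
```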
A feature name is simply any string of characters ending in ":". Some of the feature names in labor.names are "duration", "cost of living adjustment", and "contribution to dental plan". Following the ":" is one of two things, either the single word "continuous" or a list of names separated by commas. If the single word "continuous" appears then this feature is assumed to have values that are real numbers (this includes features with integer values). Examples of such features in labor.names include "duration", "wage increase first year", "wage increase second year", etc. On the other hand, if a list of names appears after the ":" then this list is assumed to indicate all of the different possible "discrete" values the feature may take on. For example, in labor.names, "cost of living adjustment" has "none,tcf,tc" following it. This means that the possible values of this feature for each data point are "none", "tcf" or "tc". The feature "pension" has possible values of "none", "ret_allw" or "empl_contr". Such features are generally called nominal or discrete features.
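One possible way to turn a feature line into an in-memory description is sketched below. FeatureSketch and its fields are hypothetical names used only for illustration. The sketch assumes each feature line looks like the lines in labor.names above: a name, a ":", and then either the word "continuous" or a comma-separated list of values, with no trailing comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical description of one feature from a .names file.
class FeatureSketch {
    final String name;
    final boolean continuous;
    final List<String> values; // possible discrete values; empty when continuous

    FeatureSketch(String name, boolean continuous, List<String> values) {
        this.name = name;
        this.continuous = continuous;
        this.values = values;
    }

    static FeatureSketch parse(String line) {
        // The feature name is everything before the first ':'.
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("no ':' in feature line: " + line);
        }
        String name = line.substring(0, colon).trim();
        String rest = line.substring(colon + 1).trim();
        if (rest.equals("continuous")) {
            return new FeatureSketch(name, true, new ArrayList<>());
        }
        // Otherwise the rest of the line lists the discrete values.
        List<String> values = new ArrayList<>();
        for (String v : rest.split(",")) {
            String trimmed = v.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return new FeatureSketch(name, false, values);
    }

    public static void main(String[] args) {
        FeatureSketch f = parse("cost of living adjustment: none,tcf,tc");
        System.out.println(f.name + " -> " + f.values);
    }
}
```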
The .names file for a data set describes the features, feature values and class values for each point in the data set. The .data file actually lists the data points making up that data set. The first five (out of 57) data points in the labor data set data file (labor.data) are:
1,5.0,?,?,?,40,?,?,2,?,11,average,?,?,yes,?,good
2,4.5,5.8,?,?,35,ret_allw,?,?,yes,11,belowaverage,?,full,?,full,good
?,?,?,?,?,38,empl_contr,?,5,?,11,generous,yes,half,yes,half,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,4.5,4.5,5.0,?,40,?,?,?,?,12,average,?,half,yes,half,good
Each data point is simply a list of the values for each feature in the data set (in the order they appear in the .names file) followed by the class value for that data point, with the values separated by commas. So, for example, the first line indicates a data point with the following feature/feature value pairs:
duration = 1
wage increase first year = 5.0
wage increase second year = ?
wage increase third year = ?
cost of living adjustment = ?
working hours = 40
pension = ?
standby pay = ?
shift differential = 2
education allowance = ?
statutory holidays = 11
vacation = average
longterm disability assistance = ?
contribution to dental plan = ?
bereavement assistance = yes
contribution to health plan = ?
The final value on the line is the class this point has been labeled with (in this case "good"). Each data point must have one value for each feature. The values must be listed in the order they appear in the .names file and must be of the appropriate type (a number for continuous features or one of the possible feature values for discrete features). The only exception to this rule is that if a feature value is not known for a particular data point, a "?" may be included to indicate that the value is unknown for this data point. In the first data point, several of the feature values are unknown (including "wage increase second year", "wage increase third year", "cost of living adjustment", etc.). Some data sets (especially the labor data set) have lots of examples with unknown values and others have none.
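The rules for a data line can be checked with something like the following sketch. DataLineSketch, split, and isValid are hypothetical names. The sketch assumes values never contain commas and treats any string that Java's Double.parseDouble accepts as a valid continuous value, which is slightly more permissive than "a number" in the everyday sense.

```java
import java.util.List;

// Hypothetical helpers for reading one line of a .data file.
class DataLineSketch {
    static final String UNKNOWN = "?";

    // Splits a line into featureCount feature values plus the class label.
    static String[] split(String line, int featureCount) {
        String[] parts = line.trim().split(",", -1);
        if (parts.length != featureCount + 1) {
            throw new IllegalArgumentException(
                "expected " + (featureCount + 1) + " values but found " + parts.length);
        }
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    // Checks one raw value against its feature description.
    static boolean isValid(String raw, boolean continuous, List<String> allowed) {
        if (raw.equals(UNKNOWN)) {
            return true; // "?" is allowed for any feature
        }
        if (continuous) {
            try {
                Double.parseDouble(raw);
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return allowed.contains(raw);
    }

    public static void main(String[] args) {
        String[] p = split("1,5.0,?,?,?,40,?,?,2,?,11,average,?,?,yes,?,good", 16);
        System.out.println("class label: " + p[p.length - 1]);
    }
}
```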
You should implement a class DataSet (and any supporting classes you think you need) to hold a data set in memory. The class should hold a description of all the features and their possible feature values, plus a vector representing each individual data point. You should then implement one method that reads a pair of data files as described above into your class and a separate method that prints out the same data (do NOT make the output a part of the input process). To test your methods you should run them on the data sets in the public directory mentioned above to demonstrate that they produce the same data as output. Print out the output produced for your own data set (discussed below), plus the iris and labor data sets.
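To make the separation between input and output concrete, here is one minimal sketch of how such a class might be organized. It is not a required design. DataSetSketch and all of its members are hypothetical names. The read methods take lines that have already been loaded (for example with java.nio.file.Files.readAllLines), values are kept as strings so that a value such as "5.0" is written back unchanged, and comments after "|" are dropped rather than preserved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical skeleton: reading fills the fields, writing only reads them.
class DataSetSketch {
    final List<String> classNames = new ArrayList<>();
    final List<String> featureNames = new ArrayList<>();
    final List<String> featureSpecs = new ArrayList<>(); // "continuous" or "v1,v2,..."
    final List<String[]> points = new ArrayList<>();

    void readNames(List<String> lines) {
        boolean seenClasses = false;
        for (String line : lines) {
            int bar = line.indexOf('|'); // assumption: '|' starts a comment
            if (bar >= 0) line = line.substring(0, bar);
            line = line.trim();
            if (line.isEmpty()) continue;
            if (!seenClasses) {
                if (line.endsWith(".")) line = line.substring(0, line.length() - 1);
                for (String c : line.split(",")) classNames.add(c.trim());
                seenClasses = true;
            } else {
                int colon = line.indexOf(':');
                if (colon < 0) throw new IllegalArgumentException("bad feature line: " + line);
                featureNames.add(line.substring(0, colon).trim());
                featureSpecs.add(line.substring(colon + 1).trim());
            }
        }
    }

    void readData(List<String> lines) {
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;
            String[] values = line.trim().split(",", -1);
            for (int i = 0; i < values.length; i++) values[i] = values[i].trim();
            points.add(values);
        }
    }

    // Output is built only from the stored fields, never during reading.
    String namesText() {
        StringBuilder out = new StringBuilder();
        out.append(String.join(", ", classNames)).append(".\n\n");
        for (int i = 0; i < featureNames.size(); i++) {
            out.append(featureNames.get(i)).append(": ").append(featureSpecs.get(i)).append('\n');
        }
        return out.toString();
    }

    String dataText() {
        StringBuilder out = new StringBuilder();
        for (String[] p : points) out.append(String.join(",", p)).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        DataSetSketch ds = new DataSetSketch();
        ds.readNames(List.of("good, bad. | Classes", "", "duration: continuous",
                             "pension: none,ret_allw,empl_contr"));
        ds.readData(List.of("1,?,good", "2,ret_allw,good"));
        System.out.print(ds.namesText());
        System.out.print(ds.dataText());
    }
}
```

A fuller version would also validate each value against its feature description while reading. The sketch leaves that out to stay short.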
I also want each of you to construct a data set. Your data set should have at least eight features and at least 50 data points. You may choose any type of data you are interested in but please try to avoid offensive concepts. For your data set you should construct two files, a DATASETNAME.names file and a DATASETNAME.data file, where DATASETNAME is the name you give the data set. You should also make sure that your data set is not trivial, where we define trivial as being possible to classify correctly using only one feature.
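A rough way to check the "not trivial" requirement for discrete features is sketched below. TrivialitySketch and determinesClass are hypothetical names, and the test used here is only one possible reading of "possible to classify based on only one feature": a feature is flagged if each of its known values always appears with the same class. Unknown values ("?") are skipped. The check is not meaningful for continuous features, where most values occur only once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical check: does one column, by itself, fix the class label?
class TrivialitySketch {
    // Each point is an array of feature values followed by the class label.
    static boolean determinesClass(List<String[]> points, int feature) {
        Map<String, String> labelForValue = new HashMap<>();
        for (String[] p : points) {
            String value = p[feature];
            String label = p[p.length - 1];
            if (value.equals("?")) {
                continue; // assumption: unknown values are ignored by this check
            }
            String previous = labelForValue.putIfAbsent(value, label);
            if (previous != null && !previous.equals(label)) {
                return false; // same value seen with two different classes
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String[]> points = List.of(
            new String[] {"yes", "a", "good"},
            new String[] {"no", "a", "bad"},
            new String[] {"yes", "b", "good"});
        System.out.println(determinesClass(points, 0)); // prints true
        System.out.println(determinesClass(points, 1)); // prints false
    }
}
```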
NOTES: Comment your code and produce a general description of your DataSet class. Print out this code, the general description and several tests of your input and output methods.
Also print out a copy of each of the files making up your data set. Then write a short report discussing the interesting aspects of your data set and why you chose it (and what it means).
You must also submit your code electronically. To do this create a tar file of all of your code as well as your data set and email it to rmaclin@gmail.com.