Computer Science 5751 : Machine Learning
Experimentation Code Description
Introduction
For programs 2-5 in 5751 you will be implementing various learning algorithms
to get a better understanding of how Machine Learning algorithms work.
To enable you to explore the ML algorithms without having to spend too much
time programming other code aspects (such as parsing data files) you will
be implementing the algorithms by completing a set of experimentation code
that I will provide to you.
For each project I will provide a new code skeleton with many of the base
details set up for you and you will have to complete a small number of
functions that pertain to the ML algorithm itself.
On this page I will give a short overview of the workings of the experimentation
code and Hari will provide a more extensive introduction in lab.
Files and Code Organization
The code will be provided for you in compressed tar archives.
Each archive will contain a directory with a number of .C (code) and .h (header)
files and a makefile.
In addition, there may be scripts for running the code in a particular way
plus possibly some simple example data files.
Instructions for uncompressing and unpacking the archived code will be
provided with each program.
The Make File (makefile)
In general, to produce all of the executables for a particular project you
will simply need to go to the directory containing the code and type "make".
The makefile will take care of the rest.
If you are unsure of whether or not your .o (object) files or executables are
up to date you can eliminate them all and start over by first typing
"make clean" and then typing "make".
If for some reason you decide to add a file of your own to the system you
will need to edit the makefile and add the new .C name and .o name to the
appropriate lines.
In general, if you are adding more files pertaining to a particular learning
algorithm you should go to the lines starting with "LEARNMETHSRCS = " and
"LEARNMETHOBJS = " and add the .C and .o to the appropriate line.
Once you finish editing the makefile you should then update the header
dependencies by typing "make depend".
This will automatically read your files (and the existing files) for header
dependencies.
If you get warnings from this step you should examine them carefully.
Code Files
For the projects you will be pursuing you will generally only have to add
to two files, a .C and .h file provided to you.
The other files in the code should remain largely untouched (though you
are welcome to add debugging comments as you see fit).
Below I will give a short description of the other files in the system:
- Utility File (util) -
The utility files util.h and util.C provide several simple utility routines
such as random number generation, two-dimensional array allocation,
and string-to-double or string-to-integer conversion.
You will likely not need these utilities (except possibly the routines
for 2d array allocation).
- Data Set Utility Files (char_input_numbered_stream,
string_space, read_data_util) -
These six .h and .C files provide
basic utilities for the input routines used in reading in data sets.
- char_input_numbered_stream - these files provide an input
stream class that tracks the line number and character number
on a line during input. This is especially useful when errors
come up in parsing an input file.
- string_space - these files provide the StringPointer and
StringSpace classes. These classes are also critical for parsing
since they allow for quick lookup of strings as they are read in
(via the hash table in the StringSpace) and for maintaining a
single viable copy of string names (via the StringPointer within
a StringSpace).
- read_data_util - provides a few basic routines involving
strings and white space that are useful in reading in a data
set.
- Data Set Reading Files (read_bp_data, read_c45_data) -
these files provide routines to parse data sets in the bp (backprop)
and the C4.5 formats and produce DataSet objects.
- Data Set (data_set) - the critical file here is
data_set.h. This file describes how data is stored
in a DataSet structure. Basically a DataSet structure has an array
of Feature items describing the input features, another array
describing the output features, an array of ClassDescriptor items
describing how the output features map to class values and an array
of Example items representing data points. For a longer description
check out data_set.h. Hari will spend time describing this for you.
- Data Map (data_map) - also plan on reading data_map.h
carefully. A DataMap is a description of a subset of a DataSet and
an ordering on that subset to be used in learning on or classifying
that subset.
- Command Line (command_line) - this code is written to be
extensible. One of the ways this is done is by allowing files to be
added to the system that add new command line options corresponding
to new learning methods, different learning parameters, etc. These
files provide the base routines for implementing the extensible command
line option system.
- Learning Methods (learning_method and the corresponding files
for each project) - the learning_method files implement a base learning
class that we build on in constructing our other learning methods.
The LearningMethod class is specialized to produce each of our
specific learning algorithms.
- Executables (train, classify and nfold) - running make will build
two (or three) executable files - train, classify, and (possibly) nfold.
The corresponding .C file for each executable contains the
main routine for that executable.
- train - routine used to train a new model on an entire data set.
This routine will employ the learn function of your learning
method to create a model for the data and save that model. It
may also attempt to classify the training data using the model
(check out the command line options).
- classify - routine used to classify new data based on a previously
created model. This routine will use the read routine in your
learning method to read in a model and call classify on the
provided data set.
- nfold - routine used to perform N-fold cross validation tests.
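To make the relationship between the LearningMethod class, a DataMap, and the nfold executable more concrete, here is a minimal sketch. It is not the provided code: the Example, DataSet, and DataMap types below are simplified stand-ins invented for illustration (the real definitions are in data_set.h and data_map.h), the learn and classify signatures are assumptions, the MajorityClass learner is a made-up toy, and model saving and reading are left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-ins; the real structures in data_set.h are richer.
struct Example { std::vector<double> inputs; int label; };
struct DataSet { std::vector<Example> examples; };

// A DataMap is modeled here as an ordered list of indices into a DataSet.
using DataMap = std::vector<std::size_t>;

// Hypothetical base class in the spirit of learning_method.h.
class LearningMethod {
public:
    virtual ~LearningMethod() = default;
    // Build a model from the examples selected by the map.
    virtual void learn(const DataSet& data, const DataMap& map) = 0;
    // Predict a class label for one example.
    virtual int classify(const Example& ex) const = 0;
};

// Toy specialization: always predicts the most frequent training label.
class MajorityClass : public LearningMethod {
public:
    void learn(const DataSet& data, const DataMap& map) override {
        std::map<int, int> counts;
        for (std::size_t i : map) ++counts[data.examples[i].label];
        int best_count = 0;
        for (const auto& kv : counts) {
            if (kv.second > best_count) {
                best_count = kv.second;
                prediction_ = kv.first;
            }
        }
    }
    int classify(const Example&) const override { return prediction_; }
private:
    int prediction_ = 0;
};

// N-fold cross validation: example i is held out in fold (i % n).
// Returns the fraction of held-out examples classified correctly.
double nfold_accuracy(LearningMethod& method, const DataSet& data, std::size_t n) {
    assert(n > 0);
    std::size_t correct = 0;
    for (std::size_t fold = 0; fold < n; ++fold) {
        DataMap train_map, test_map;
        for (std::size_t i = 0; i < data.examples.size(); ++i)
            (i % n == fold ? test_map : train_map).push_back(i);
        method.learn(data, train_map);
        for (std::size_t i : test_map)
            if (method.classify(data.examples[i]) == data.examples[i].label)
                ++correct;
    }
    if (data.examples.empty()) return 0.0;
    return static_cast<double>(correct) / data.examples.size();
}
```

In this sketch, training on a whole data set corresponds to calling learn with a map that lists every example, while cross validation only changes which indices appear in the training and test maps. The DataSet itself is never copied or reordered.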
I will generally provide scripts that can be used to call the different
routines.
If you are interested in seeing what options are available for each
executable, simply type the name of the executable with no options and
it will print out a list of possible options.
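The idea of an extensible command line, where each learning method contributes its own options and an executable lists them when run with no options, can be sketched roughly as follows. The OptionRegistry class, the option names, and the "-name value" syntax are all assumptions made for illustration. The actual interface is the one in the command_line files.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical registry; not the interface of the provided command_line code.
class OptionRegistry {
public:
    // Register an option with its help text and initial value.
    void add(const std::string& name, const std::string& help,
             const std::string& initial_value) {
        assert(!name.empty());
        options_[name] = Option{help, initial_value};
    }

    // Parse "-name value" pairs; false on an unknown or incomplete option.
    bool parse(int argc, const char* const argv[]) {
        for (int i = 1; i < argc; i += 2) {
            auto it = options_.find(argv[i]);
            if (it == options_.end() || i + 1 >= argc) return false;
            it->second.value = argv[i + 1];
        }
        return true;
    }

    // Current value of a registered option (throws if the name is unknown).
    std::string value(const std::string& name) const {
        return options_.at(name).value;
    }

    // One line per registered option, in alphabetical order.
    std::string usage(const std::string& program) const {
        std::ostringstream out;
        out << "usage: " << program << " [options]\n";
        for (const auto& kv : options_)
            out << "  " << kv.first << " <value>  " << kv.second.help
                << " (value: " << kv.second.value << ")\n";
        return out.str();
    }

private:
    struct Option { std::string help; std::string value; };
    std::map<std::string, Option> options_;
};

// A learning method's file could contribute options like this (made-up names).
void add_example_options(OptionRegistry& registry) {
    registry.add("-epochs", "number of training passes", "100");
    registry.add("-rate", "learning rate", "0.1");
}

// Returns the usage text when no options are given or parsing fails,
// and an empty string when the arguments were accepted.
std::string check_args(OptionRegistry& registry, int argc,
                       const char* const argv[]) {
    const std::string program = argc > 0 ? argv[0] : "program";
    if (argc <= 1 || !registry.parse(argc, argv)) return registry.usage(program);
    return std::string();
}
```

A main routine built on this sketch would call add_example_options, pass argc and argv to check_args, and print the returned text and exit whenever it is non-empty. New learning methods then only need to supply another function like add_example_options.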
Debugging
The makefile sets the -g compiler flag, so the resulting executables include
debugging information and can be run under gdb. This lets you debug the code
without having to spend time adding output statements to the code and
recompiling.