CS 5751 (Spring 2001) Experimentation Code Description

Computer Science 5751 : Machine Learning

Experimentation Code Description

Introduction

For programs 2-5 in 5751 you will be implementing various learning algorithms to get a better understanding of how Machine Learning algorithms work. To enable you to explore the ML algorithms without having to spend too much time programming other code aspects (such as parsing data files) you will be implementing the algorithms by completing a set of experimentation code that I will provide to you. For each project I will provide a new code skeleton with many of the base details set up for you and you will have to complete a small number of functions that pertain to the ML algorithm itself. On this page I will give a short overview of the workings of the experimentation code and Hari will provide a more extensive introduction in lab.

Files and Code Organization

The code will be provided for you in compressed, tarred archives. Each archive will contain a directory with a number of .C (code) and .h (header) files and a makefile. In addition, there may be scripts for running the code in a particular way, plus possibly some simple example data files. Instructions for uncompressing and unpacking the archived code will be provided with each program.

The Make File (makefile)

In general, to produce all of the executables for a particular project you will simply need to go to the directory containing the code and type "make". The makefile will take care of the rest. If you are unsure of whether or not your .o (object) files or executables are up to date you can eliminate them all and start over by first typing "make clean" and then typing "make".

If for some reason you decide to add a file of your own to the system you will need to change the makefile. To do this, edit the makefile and add the .C name and .o name to the appropriate lines. In general, if you are adding more files pertaining to a particular learning algorithm you should go to the lines starting with "LEARNMETHSRCS = " and "LEARNMETHOBJS = " and add the .C name to the first and the .o name to the second. Once you finish editing the makefile you should then update the header dependencies by typing "make depend". This will automatically scan your files (and the existing files) for header dependencies. If you get warnings from this step you should examine them carefully.

Code Files

For the projects you will be pursuing you will generally only have to add to two files, a .C and .h file provided to you. The other files in the code should remain largely untouched (though you are welcome to add debugging comments as you see fit). Below I will give a short description of the other files in the system:

Utility Files (util) - the utility files util.h and util.C provide several simple utility routines such as random number generation, two-dimensional array allocation, and string-to-double or string-to-integer conversion. You will likely not need these utilities (except possibly the routines for 2d array allocation).
Data Set Utility Files (char_input_numbered_stream, string_space, read_data_util) - These six .h and .C files provide basic utilities for the input routines used in reading in data sets.
- char_input_numbered_stream - these files provide an input stream class that tracks the line number and the character number on the line during input. This is especially useful when errors come up in parsing an input file.
- string_space - these files provide the StringPointer and StringSpace classes. These classes are also critical for parsing since they allow for quick lookup of strings as they are read in (via the hash table in the StringSpace) and for maintaining a single viable copy of string names (via the StringPointer within a StringSpace).
- read_data_util - provides a few basic routines involving strings and white space that are useful in reading in a data set.
Data Set Reading Files (read_bp_data, read_c45_data) - these files provide routines to parse data sets in the bp (backprop) and the C4.5 formats and produce DataSet objects.
Data Set (data_set) - the critical file here is data_set.h. This file describes how data is stored in a DataSet structure. Basically a DataSet structure has an array of Feature items describing the input features, another array describing the output features, an array of ClassDescriptor items describing how the output features map to class values, and an array of Example items representing data points. For a longer description check out data_set.h. Hari will spend time describing this for you.
Data Map (data_map) - also plan on reading data_map.h carefully. A DataMap is a description of a subset of a DataSet and an ordering on that subset to be used in learning on or classifying that subset.
Command Line (command_line) - this code is written to be extensible. One of the ways this is done is by allowing files to be added to the system that add new command line options corresponding to new learning methods, different learning parameters, etc. These files provide the base routines for implementing the extensible command line option system.
Learning Methods (learning_method and the corresponding files for each project) - the learning_method files implement a base learning class that we build on in constructing our other learning methods. The LearningMethod class is specialized to produce each of our specific learning algorithms.
Executables (train, classify and nfold) - your make will construct two (or three) executable files - train, classify, and (possibly) nfold. The corresponding .C file for each executable contains the main routine for that executable.
- train - routine used to train a new model on an entire data set. This routine will employ the learn function of your learning method to create a model for the data and save that model. It may also attempt to classify the training data using the model (check out the command line options).
- classify - routine used to classify new data based on a previously created model. This routine will use the read routine in your learning method to read in a model and call classify on the provided data set.
- nfold - routine used to perform N-fold cross validation tests.
I will generally provide scripts that can be used to call the different routines. If you are interested in seeing what options are available for each executable, simply type the name of the executable with no options and it will print out a list of possible options.
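
To make the relationship between a learning method and the train and classify executables concrete, here is a minimal sketch. It is not the actual course code: ToyExample, ToyLearningMethod, MajorityClass, train_and_save and load_and_classify are hypothetical names invented for this illustration, and the real LearningMethod, DataSet and Example declarations in learning_method.h and data_set.h will differ. The sketch only assumes what is described above: a base class that is specialized per algorithm, a learn function that builds a model which can be saved, and a read function that loads a model before classifying.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one data point; the real Example is richer.
struct ToyExample {
    std::vector<double> inputs;
    int label;  // class index, assumed to be >= 0
};

// Hypothetical base class playing the role of LearningMethod.
class ToyLearningMethod {
public:
    virtual ~ToyLearningMethod() {}
    virtual void learn(const std::vector<ToyExample>& data) = 0;
    virtual int classify(const ToyExample& ex) const = 0;
    virtual void write(std::ostream& out) const = 0;  // save the model
    virtual void read(std::istream& in) = 0;          // load a saved model
};

// A trivial specialization: always predict the most frequent training class.
class MajorityClass : public ToyLearningMethod {
public:
    MajorityClass() : prediction_(0), trained_(false) {}

    void learn(const std::vector<ToyExample>& data) override {
        std::vector<int> counts;
        for (std::size_t i = 0; i < data.size(); ++i) {
            assert(data[i].label >= 0);
            std::size_t label = static_cast<std::size_t>(data[i].label);
            if (label >= counts.size()) counts.resize(label + 1, 0);
            ++counts[label];
        }
        prediction_ = 0;
        for (std::size_t c = 1; c < counts.size(); ++c) {
            if (counts[c] > counts[static_cast<std::size_t>(prediction_)]) {
                prediction_ = static_cast<int>(c);
            }
        }
        trained_ = !data.empty();
    }

    int classify(const ToyExample&) const override {
        assert(trained_);  // a model must be learned or read first
        return prediction_;
    }

    void write(std::ostream& out) const override { out << prediction_ << '\n'; }

    void read(std::istream& in) override {
        trained_ = static_cast<bool>(in >> prediction_);
    }

private:
    int prediction_;
    bool trained_;
};

// Roughly what train does: learn a model on the data set and save it.
std::string train_and_save(ToyLearningMethod& method,
                           const std::vector<ToyExample>& data) {
    method.learn(data);
    std::ostringstream out;
    method.write(out);
    return out.str();
}

// Roughly what classify does: read a saved model, then label each example.
std::vector<int> load_and_classify(ToyLearningMethod& method,
                                   const std::string& saved_model,
                                   const std::vector<ToyExample>& data) {
    std::istringstream in(saved_model);
    method.read(in);
    std::vector<int> labels;
    for (std::size_t i = 0; i < data.size(); ++i) {
        labels.push_back(method.classify(data[i]));
    }
    return labels;
}
```

In this sketch the two helper functions only touch the base class interface, which is why a new algorithm can be added by writing one new subclass without changing the code that drives training and classification.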

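The line and character tracking described above for char_input_numbered_stream can be illustrated with a small wrapper around a standard input stream. This is only a sketch of the idea: NumberedCharStream and first_bad_char are hypothetical names made up for this example, and the class provided with the course code may have a different interface.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical illustration only: the real class declared in
// char_input_numbered_stream.h may have a different name and interface.
class NumberedCharStream {
public:
    explicit NumberedCharStream(std::istream& in)
        : in_(in), line_(1), column_(0) {}

    // Read one character and update the position counters.
    // Returns false once the underlying stream is exhausted.
    bool get(char& c) {
        if (!in_.get(c)) return false;
        if (c == '\n') {
            ++line_;      // the next character starts a new line
            column_ = 0;
        } else {
            ++column_;    // 1-based column of the character just read
        }
        return true;
    }

    int line() const { return line_; }
    int column() const { return column_; }

private:
    std::istream& in_;
    int line_;
    int column_;
};

// Scan comma-separated numbers and describe the first unexpected character,
// or return an empty string when the whole input is acceptable.
std::string first_bad_char(std::istream& in) {
    NumberedCharStream numbered(in);
    char c;
    while (numbered.get(c)) {
        bool ok = (c >= '0' && c <= '9') || c == ',' || c == '.' ||
                  c == ' ' || c == '\n';
        if (!ok) {
            std::ostringstream msg;
            msg << "line " << numbered.line() << ", column "
                << numbered.column() << ": unexpected '" << c << "'";
            return msg.str();
        }
    }
    return "";
}
```

Because the position is updated as each character is read, a parser built on such a stream can report exactly where a malformed data file went wrong instead of just failing.
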
Debugging