Computer Science 8751
Machine Learning
Programming Assignment 1
Using WEKA and Creating Your Own Dataset (30 points)
Due Wednesday, February 25, 2009
Introduction
For this class we will use the WEKA machine learning toolkit, which is
implemented in Java.
To get WEKA, go to http://www.cs.waikato.ac.nz/ml/weka.
You should download the code and familiarize yourself with it.
WEKA has online documentation built in. You can also find a copy of a chapter
from Witten & Frank's Data Mining book, which discusses the code, at
http://www.cs.waikato.ac.nz/ml/weka/book.html. Note that a new edition of
Witten and Frank's book has come out with more extensive documentation; you
may want to purchase a copy.
In addition to downloading WEKA, you will develop your own dataset.
I would prefer that this dataset relate to your thesis research topic, but if
you are having difficulty creating an appropriate dataset, you may create one
on another topic of interest to you. Note: NO cricket datasets.
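Whatever dataset you build has to end up in a format WEKA can read, and ARFF
is WEKA's native text format. The sketch below shows one way to write a small
ARFF file by hand. It is in Python purely for illustration (WEKA itself is
Java), the helper name to_arff is hypothetical, and the toy relation,
attributes, and rows are made up. It assumes attribute names and nominal
values contain no spaces or commas, which would otherwise need quoting.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes is a list of (name, spec) pairs. spec is the string
    "numeric" for a continuous feature, or a list of allowed values
    for a nominal feature. Assumes names and values need no quoting.
    """
    lines = ["@relation " + relation, ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            lines.append("@attribute " + name + " {" + ",".join(spec) + "}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Made-up toy data: one continuous feature, one nominal feature, and a
# nominal class variable listed last.
toy_attributes = [
    ("length", "numeric"),
    ("colour", ["red", "blue"]),
    ("class", ["yes", "no"]),
]
toy_rows = [(1.5, "red", "yes"), (0.7, "blue", "no")]
arff_text = to_arff("toy", toy_attributes, toy_rows)
print(arff_text)
```

Writing arff_text to a file ending in .arff gives something WEKA should be
able to open. By default WEKA treats the last attribute as the class variable
when none is specified.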
To Do
- Pick a dataset from the UCI ML dataset repository that you want to work with.
This dataset should have at least 200 examples and at least 10 features, of
which at least one should be continuous and at least one nominal.
You may have to convert the dataset into a format appropriate for WEKA.
- Create a second dataset based on your research. It should have at least
50 examples and at least 5 features, including at least one continuous and
at least one nominal feature.
- Pick a learning method other than J48.
- Using the J48 decision tree method and the method you chose, perform five
three-fold cross-validation experiments on each dataset.
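To make the cross-validation item concrete, here is a minimal sketch of how
five three-fold experiments partition a dataset. WEKA does the splitting for
you, so this is only an illustration of the idea, in Python rather than Java.
It assumes the five experiments differ only in the random seed used to shuffle
the examples, and, to stay short, it does not stratify folds by class. The
function name repeated_kfold and its parameters are hypothetical.

```python
import random


def repeated_kfold(n_examples, n_folds=3, n_runs=5, base_seed=1):
    """Yield (run, fold, train_indices, test_indices) tuples.

    Each run shuffles the example indices with its own seed, deals them
    into n_folds folds, and uses every fold exactly once as the test set.
    """
    for run in range(n_runs):
        rng = random.Random(base_seed + run)  # assumed: one seed per run
        indices = list(range(n_examples))
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin into the folds.
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield run, f, train, test


splits = list(repeated_kfold(50))
for run, fold, train, test in splits:
    print(run, fold, len(train), len(test))
```

With 50 examples this produces 15 train/test splits (5 runs of 3 folds). Each
learning method would be trained and tested once per split, and within a run
every example is used for testing exactly once.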
To Hand In
- Write up a description of your dataset and include a printout of its data
points. Your description should include a discussion of each
feature, a discussion of the class variable, and some expectations regarding
likely class models that will be learned from this dataset.
- Using confusion matrices and appropriate bar graphs, present a discussion
of the accuracy of the two learning algorithms on each of the datasets you
used for testing. Also include documentation of your results.
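As a reminder of how accuracy relates to a confusion matrix, the sketch below
builds a matrix from actual and predicted class labels and computes accuracy
as the diagonal total divided by the total count. It is illustrative Python,
not WEKA output. The function names are hypothetical, the labels are made up,
and it assumes rows index the actual class and columns the predicted class.

```python
def confusion_matrix(actual, predicted, labels):
    """Count examples by (actual class, predicted class).

    Rows are actual classes and columns are predicted classes, both in
    the order given by labels (an assumed convention).
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels for five test examples.
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]
matrix = confusion_matrix(actual, predicted, ["yes", "no"])
print(matrix)            # [[2, 1], [1, 1]]
print(accuracy(matrix))  # 0.6
```

Summing the matrices from the three folds of one experiment gives a single
matrix covering every example, and the resulting accuracy per method and
dataset is the kind of number a bar graph can compare.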