Computer Science 8751
Machine Learning
Programming Assignment 1
Using WEKA and Creating Your Own Data Set (20 points)
Due Thursday, September 29, 2005
Introduction
For this class we will use the WEKA machine learning code, which is
implemented in Java.
To get WEKA, go to the webpage http://www.cs.waikato.ac.nz/ml/weka.
Download the code and familiarize yourself with it.
WEKA has online documentation built into it. You can also find a copy of a
chapter from Witten & Frank's Data Mining book, which discusses the code, at
http://www.cs.waikato.ac.nz/ml/weka/book.html (note that a new version of
Witten and Frank's book has come out with more extensive documentation; you
may want to purchase a copy of this book).
In addition to downloading WEKA, you will develop your own dataset.
I would prefer that this dataset relate to your thesis research topic, but if
you have difficulty creating an appropriate dataset on that topic, you may
create one on any topic of interest to you.
To Do
- Pick a dataset from the UCI ML dataset repository that you want to work with.
This dataset should have at least 200 examples and at least 10 features,
including at least one continuous feature and at least one nominal feature.
You may have to convert the dataset into a format WEKA can read, such as ARFF,
WEKA's native file format.
- Create a second dataset based on your research. It should have at least
50 examples and at least 5 features, including at least one continuous
feature and at least one nominal feature.
- Pick a second learning algorithm in WEKA, other than J48. Preferably it
should not be a tree algorithm, since J48 is itself a tree algorithm.
- Perform ten 10-fold cross-validation experiments on each dataset, using both
the J48 decision tree method from WEKA and the other learning algorithm you
chose.
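
As an illustration of the last item, the following minimal Python sketch
shows one reading of "ten 10-fold cross-validation experiments": ten
independent runs, each shuffling the data with a different random seed and
splitting it into 10 folds. It is not WEKA code. The function name
repeated_kfold and its parameters are hypothetical, chosen for this example.
WEKA does its own fold assignment (and, for a nominal class, stratifies the
folds by class, which this sketch does not), so treat this only as a picture
of the protocol.

```python
import random


def repeated_kfold(n_examples, k=10, repeats=10, base_seed=1):
    """Yield (run, fold, train_idx, test_idx) for `repeats` runs of k-fold CV.

    Hypothetical helper for illustration; indices refer to dataset rows.
    """
    for run in range(repeats):
        # A different seed per run gives a different shuffle, so each of the
        # ten experiments partitions the data differently.
        rng = random.Random(base_seed + run)
        order = list(range(n_examples))
        rng.shuffle(order)
        for fold in range(k):
            # Every k-th shuffled example (starting at `fold`) is held out.
            test_idx = order[fold::k]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield run, fold, train_idx, test_idx


# With a 50-example dataset, ten runs of 10 folds give 100 train/test splits.
splits = list(repeated_kfold(50))
print(len(splits))  # 100
```

Within one run every example is tested exactly once, so each run yields one
accuracy estimate per algorithm; the ten runs show how much that estimate
varies with the random partition.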
To Hand In
- Write up a description of your dataset and include a printout of its data
points. Your description should include a discussion of each
feature, a discussion of the class variable, and some expectations regarding
likely class models that will be learned from this dataset.
- Using confusion matrices and appropriate bar graphs, present a discussion
of the accuracy of the two learning algorithms on each of the datasets you
used for testing. Also include documentation of your results.
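
For reference, the sketch below shows how a confusion matrix and the
accuracy derived from it relate to a list of actual and predicted class
labels. WEKA reports these figures itself; this is only a minimal Python
illustration. The helper names confusion_matrix and accuracy are
hypothetical, and the labels are made up for the example, not results from
any dataset.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a matrix with rows = actual class, columns = predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Fraction of examples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels, for illustration only.
labels = ["yes", "no"]
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy(matrix))
```

Reading across a row shows how the examples of one actual class were
classified, which makes it easy to see which classes an algorithm confuses
even when two algorithms have similar overall accuracy.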