In this lab you will implement the K-Means clustering algorithm discussed in class and in the clustering paper available from the class web page.
We will be implementing a K-Means algorithm that will include a number of extra features. In the basic K-Means algorithm we pick a set of K cluster centers (or centroids), then repeatedly do the following:

1. For each data point, determine the closest centroid and assign that data point to that centroid.
2. For each centroid, calculate a simple gradient consisting of the sum of the differences between that centroid and the points that make up its cluster.
3. Move each centroid towards the "center" of its cluster (step an amount in the gradient direction based on the gradient calculated above and a movement or learning rate).
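One epoch of the loop above can be sketched as follows. This is only an illustrative sketch, not the required interface for the lab: the names (Point, kmeansEpoch, learningRate) and the exact stepping details are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double squaredDistance(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// One K-Means epoch: assign each point to its nearest centroid, accumulate
// the gradient (sum of point-minus-centroid differences per cluster), then
// step each centroid by learningRate times its gradient. Returns the largest
// distance any centroid moved, so the caller can stop once no centroid moves
// more than the -maxmove threshold.
double kmeansEpoch(const std::vector<Point>& data,
                   std::vector<Point>& centroids,
                   double learningRate) {
    std::size_t k = centroids.size();
    std::size_t dims = centroids[0].size();
    std::vector<Point> gradient(k, Point(dims, 0.0));
    for (const Point& p : data) {
        std::size_t best = 0;
        for (std::size_t c = 1; c < k; ++c)
            if (squaredDistance(p, centroids[c]) <
                squaredDistance(p, centroids[best]))
                best = c;
        for (std::size_t i = 0; i < dims; ++i)
            gradient[best][i] += p[i] - centroids[best][i];
    }
    double maxMove = 0.0;
    for (std::size_t c = 0; c < k; ++c) {
        double moved = 0.0;
        for (std::size_t i = 0; i < dims; ++i) {
            double step = learningRate * gradient[c][i];
            centroids[c][i] += step;
            moved += step * step;
        }
        maxMove = std::max(maxMove, std::sqrt(moved));
    }
    return maxMove;
}
```

Note that with this stepping rule each centroid converges toward the mean of its assigned points; a learning rate that is too large can overshoot, which is one reason the -maxmove stopping test is useful.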
We will use a basic set of parameters to control this algorithm:
We will also have a set of extra parameters that will control other aspects of the algorithm:
Although K-Means is not a supervised learning method, we will still be using the same dataset code we have used previously to implement this algorithm. We will be using the class information associated with the data points to do some evaluation of our clusters.
After your code runs it should print:
For example, your code might produce the following:
% kmeans breast-cancer-wisconsin.names breast-cancer-wisconsin.data -maxmove 0.01
Generating clusters:
Terminating learning early (31 epochs), no centroid moved more than 0.010
Centroid 1: [0.652,0.410,0.353,0.693,0.372,0.944,0.462,0.222,0.137]  Class (0) Distribution: [3,48]
Centroid 2: [0.144,0.018,0.021,0.018,0.103,0.049,0.246,0.011,0.001]  Class (0) Distribution: [94,0]
Centroid 3: [0.145,0.012,0.022,0.017,0.112,0.034,0.058,0.007,0.010]  Class (0) Distribution: [213,0]
Centroid 4: [0.581,0.763,0.705,0.586,0.492,0.858,0.666,0.519,0.078]  Class (0) Distribution: [2,71]
Centroid 5: [0.734,0.461,0.524,0.197,0.429,0.504,0.391,0.533,0.119]  Class (0) Distribution: [9,60]
Centroid 6: [0.443,0.065,0.091,0.077,0.134,0.060,0.133,0.068,0.010]  Class (0) Distribution: [137,12]
Centroid 7: [0.754,0.914,0.911,0.764,0.710,0.720,0.739,0.899,0.408]  Class (0) Distribution: [0,50]
When the more advanced features are invoked you should indicate this during the learning process:
% kmeans -dataset breast-cancer-wisconsin -k 2 -maxmove 0.01 -split 1.8
Generating clusters:
Terminating learning early (22 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 2.577 -- splitting cluster
Terminating learning early (24 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 2.146 -- splitting cluster
Terminating learning early (26 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 2.146 -- splitting cluster
Terminating learning early (53 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 1.902 -- splitting cluster
Terminating learning early (22 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 1.902 -- splitting cluster
Terminating learning early (28 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 1.985 -- splitting cluster
Terminating learning early (23 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 1.869 -- splitting cluster
Terminating learning early (21 epochs), no centroid moved more than 0.010
Centroid 1: [0.206,0.021,0.034,0.029,0.111,0.037,0.112,0.014,0.007]  Class (0) Distribution: [436,1]
Centroid 2: [0.732,0.427,0.455,0.400,0.327,0.955,0.472,0.251,0.077]  Class (0) Distribution: [5,66]
Centroid 3: [0.635,0.821,0.829,0.728,0.608,0.966,0.707,0.694,0.128]  Class (0) Distribution: [0,42]
Centroid 4: [0.873,0.812,0.757,0.178,0.509,0.539,0.412,0.686,0.107]  Class (0) Distribution: [1,24]
Centroid 5: [0.794,0.890,0.885,0.737,0.777,0.732,0.638,0.852,0.855]  Class (0) Distribution: [0,24]
Centroid 6: [0.606,0.287,0.314,0.245,0.273,0.268,0.313,0.195,0.103]  Class (0) Distribution: [9,23]
Centroid 7: [0.701,0.846,0.759,0.646,0.524,0.250,0.735,0.705,0.063]  Class (0) Distribution: [0,25]
Centroid 8: [0.514,0.290,0.331,0.146,0.520,0.285,0.354,0.855,0.032]  Class (0) Distribution: [6,8]
Centroid 9: [0.338,0.510,0.483,0.756,0.505,0.919,0.683,0.682,0.130]  Class (0) Distribution: [1,28]
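The "Largest distance within cluster" test driving the splits in the run above can be sketched as follows. One reasonable reading, assumed here, is the largest pairwise distance between two points assigned to the same cluster; the helper names are hypothetical, not part of the required interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Euclidean distance between two points of equal dimension.
double euclidean(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

// Find the largest distance between any two points assigned to the same
// cluster. If that distance exceeds the -split threshold, the caller would
// split the offending cluster (reported via worstCluster) and recluster.
double largestWithinClusterDistance(const std::vector<Point>& data,
                                    const std::vector<int>& assignment,
                                    std::size_t* worstCluster) {
    double worst = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i)
        for (std::size_t j = i + 1; j < data.size(); ++j)
            if (assignment[i] == assignment[j]) {
                double d = euclidean(data[i], data[j]);
                if (d > worst) {
                    worst = d;
                    if (worstCluster) *worstCluster = assignment[i];
                }
            }
    return worst;
}
```

In the transcript above this check fires repeatedly, adding one centroid per split, until every cluster's largest internal distance falls below the -split value of 1.8.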
kmeans should also have one further option, -show; when this option is used you should also print the names of the data points that are part of each cluster:
% kmeans -dataset labor -maxmove 0.01 -k 4 -show
Generating clusters:
Terminating learning early (60 epochs), no centroid moved more than 0.010
Centroid 1: [0.583,0.256,0.378,0.415,0.495,0.281,0.420,0.947,0.612,0.313,0.313,0.413,0.300,0.275,0.725,0.250,0.904,0.096,0.005,0.542,0.458,0.150,0.850,0.150,0.695,0.305,0.374,0.402,0.448]  Class (0) Distribution: [4,9]
Members: Example_12 Example_45 Example_21 Example_7 Example_25 Example_8 Example_17 Example_43 Example_37 Example_34 Example_31 Example_30 Example_16
Centroid 2: [0.569,0.472,0.445,0.550,0.713,0.286,0.168,0.794,0.402,0.402,0.521,0.542,0.314,0.503,0.497,0.412,0.157,0.337,0.546,0.714,0.286,0.267,0.228,0.694,0.764,0.236,0.150,0.188,0.811]  Class (0) Distribution: [22,3]
Members: Example_13 Example_24 Example_53 Example_29 Example_27 Example_41 Example_56 Example_50 Example_54 Example_20 Example_46 Example_9 Example_48 Example_10 Example_35 Example_14 Example_28 Example_11 Example_52 Example_1 Example_15 Example_39 Example_19 Example_23 Example_49
Centroid 3: [0.201,0.240,0.500,0.500,0.800,0.000,0.200,0.869,0.800,0.200,0.000,0.433,0.248,0.500,0.500,0.267,0.401,0.399,0.200,0.000,1.000,1.000,0.000,0.000,0.201,0.799,1.000,0.000,0.000]  Class (0) Distribution: [0,5]
Members: Example_36 Example_33 Example_18 Example_44 Example_40
Centroid 4: [0.744,0.301,0.357,0.623,0.231,0.526,0.474,0.683,0.263,0.263,0.737,0.500,0.444,0.655,0.345,0.404,0.228,0.609,0.310,0.693,0.307,0.406,0.594,0.242,0.822,0.178,0.321,0.679,0.244]  Class (0) Distribution: [11,3]
Members: Example_26 Example_51 Example_3 Example_32 Example_38 Example_47 Example_42 Example_22 Example_4 Example_2 Example_5 Example_6 Example_55 Example_0
Conduct experiments to try to determine an appropriate number of clusters to use for the breast-cancer-wisconsin data. A rough estimate of how good a set of clusters is can be obtained by totaling, over all clusters, the smaller of the two counts in each cluster's class breakdown (these points can be thought of as errors). Include output from several runs to demonstrate how your parameter choices change the results.
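This rough error estimate can be computed mechanically from the printed class distributions. For the seven-cluster run shown earlier, the smaller counts are 3, 0, 0, 2, 9, 12, and 0, giving 26 errors out of the 699 points. A minimal sketch (the helper name is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Total the smaller count from each cluster's two-class distribution.
// Each pair holds the per-class point counts printed for one cluster;
// the minority count in each cluster is treated as that cluster's errors.
int clusterErrors(const std::vector<std::pair<int, int>>& distributions) {
    int errors = 0;
    for (const auto& d : distributions)
        errors += std::min(d.first, d.second);
    return errors;
}
```

Comparing this total across runs with different numbers of clusters gives a simple (if rough) basis for choosing K.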
Next, run experiments to see whether the cluster-splitting mechanism can be made to approximate the number of clusters you chose.
Also run your code on your own dataset (note that you will likely need to use a smaller number of clusters). Discuss what conclusions you can draw from your results.
Print out a copy of all of your code files. You should hand in printouts demonstrating how your program works by running it on several datasets, including your own. Make sure to include tests of a variety of parameter values.
You should also write up a short report (at least one page, no more than three) discussing your design decisions in implementing the K-means algorithm and how your version of the code works.
You must also submit your code electronically. To do this go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs8751-1-f2003/upload.cgi and follow the directions for uploading a file (you can do this multiple times, though it would be helpful if you would tar your files and upload one file archive).
To make your code easier to check and grade please use the following procedure for collecting the code before uploading it:
rmaclin/prog04

Note that the suffix of all C++ code files (not .h files) should be ".cc". Only code files (in C++, only .cc and .h files) and your makefile should be stored in this directory.
tar cf prog04.tar login/prog04