1 Concept questions
Please provide a brief answer to each question.
2
1. 2. 3.
4. 5.
What’s the main difference between supervised and unsupervised learning? (5 points) Will different initializations for k-means lead to different results?
Give a short proof (can be in words but using correct logic) why k-means algorithm will converge in finite number of iterations.
What is the main difference between k-means and generalized k-means algorithm? Explain how the choice of the similarity/dissimilarity/distance will impact the result.
Consider the following simple graph
Write down the graph Laplacian matrix and find the eigenvectors associated with the zero eigen- value. Explain how do you find out the number of disconnected clusters in graph and identify these disconnected clusters using these eigenvectors.
Image compression using clustering [40 points]
14 235
In this programming assignment, you are going to apply clustering algorithms for image compression. Your task is to implement K-means for this purpose. It is required you implement the algorithms yourself rather than calling k-means from a package. However, it is ok to use standard packages for supplementary tasks, e.g. file i/o, linear algebra, and visualization.
Formatting instruction
As a starting point, we suggest the following input/output signature for your k-means algorithm.
Input
- pixels: the input image representation. Each row contains one data point (pixel). For image dataset, it contains 3 columns, each column corresponding to Red, Green, and Blue component. Each component has an integer value between 0 and 255.
- k: the number of desired clusters.
1
Output
- class: cluster assignment of each data point in pixels. The assignment should be 1, 2, 3, etc. For k = 5, for example, each cell of class should be either 1, 2, 3, 4, or 5. The output should be a column vector with size(pixels, 1) elements.
- centroid: location of k centroids (or representatives) in your result. With images, each centroid corresponds to the representative color of each cluster. The output should be a matrix with K rows and 3 columns. The range of values should be [0, 255], possibly floating point numbers.Hand-inBoth of your code and report will be evaluated. Upload the code as a zip file, and the report as a pdf, separately from the zip file. In your report, answer to the following questions:
- (20 points) Use k-means with squared-l2 norm as a metric, for GeorgiaTech.bmp and football.bmp and also choose a third picture of your own to work on. We recommend size of 320 × 240 or smaller. Run your k-means implementation with these pictures, with several different k = 2, 4, 8, 16.Run your k-means implementation (with squared-l2 norm) with random initialization centroids. Please try multiple time and report the best one (in terms of the image quality).Please write in your report, how long does it take to converge for each k (report the number of iterations, and elapsed time in seconds) and also include the resulted compressed pictures for each k in the report.
- (20 points) Now try your k-means with the Manhattan distance (or l1 distance) and repeat the same steps in Part (1). Please note that the assignment of data point should be based on the Manhattan distance, and the cluster centroid (by minimizing the sum of deviance – as a result of using the Manhattan distance) will be taken as the “median” of each cluster. Comment on the difference of image compression results using the two methods.
Note
3
You may see errors caused by empty clusters when you use too large k. Your implementation should treat this exception as well. That is, do not terminate even if you have an empty cluster, but auto- matically decrement to a smaller number of clusters.
We recommend you to test your code with several different pictures so that you can detect some problems that might happen occasionally.
If we detect plagarism from any other student’s code or from the web, you will not be eligible for any credit for the entire homework, not just for this problem.
Political blogs dataset [30 points]
We will study a political blogs dataset first compiled for the paper Lada A. Adamic and Natalie Glance, “The political blogosphere and the 2004 US Election”, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). It is assumed that blog-site with the same political orientation are more likely to link to each other, thus, forming a “community” or “cluster” in a graph. In this question, we will see whether or not this hypothesis is likely to be true based on data.
The dataset nodes.txt contains a graph with n = 1490 vertices (“nodes”) corresponding to political blogs.
2
The dataset edges.txt contains edges between the vertices. You may remove isolated nodes (nodes that are not connected any other nodes) in the pre-processing.
We will treat the network as an undirected graph; thus, when constructing the adjacency matrix, make it symmetrical by, e.g., set the entry in the adjacency matrix to be one whether there is an edge between the two nodes (in either direction).
In addition, each vertex has a 0-1 label (in the 3rd column of the data file) corresponding to the true political orientation of that blog. We will consider this as the true label and check whether spectral clustering will cluster nodes with the same political orientation as possible.
4
1.
2.
(15 points) Use spectral clustering to find the k = 2, 5, 10, 20 clusters in the network of political blogs (each node is a blog, and their edges are defined in the file edges.txt). Find majority labels in each cluster, for different k values, respectively. For example, if there are k = 2 clusters, and their labels are {0, 1, 1, 1} and {0, 0, 1} then the majority label for the first cluster is 1 and for the second cluster is 0. It is required you implement the algorithms yourself rather than calling from a package.
Now compare the majority label with the individual labels in each cluster, and report the mismatch rate for each cluster, when k = 2, 5, 10, 20. For instance, in the example above, the mismatch rate for the first cluster is 1/4 (only the first node differs from the majority) and the the second cluster is 1/3.
(15 points) Tune your k and find the number of clusters to achieve a reasonably small mismatch rate. Please explain how you tune k and what is the achieved mismatch rate. Please explain intuitively what this results tells about the network community structure.
(Bonus Question) Math of k-means clustering [5 points]
Given m data points xi, i = 1, . . . , m, K-means clustering algorithm groups them into k clusters by mini- mizing the distortion function over {rij,μj}
mk
J =rij∥xi −μj∥2, (1)
i=1 j=1
where rij = 1 if xi belongs to the j-th cluster and rij = 0 otherwise.
1. (3 points) Derive mathematically that using the squared Euclidean distance ∥xi − μj ∥2 as the dissim- ilarity function, the centroid that minimizes the distortion function J for given assignments rij are given by
j i rijxi μ = irij .
That is, μj is the center of j-th cluster.
Hint: You may start by taking the partial derivative of J with respect to μj, with rij fixed.
2. (2 points) Derive mathematically what should be the assignment variables rij be to minimize the distortion function J, when the centroids μj are fixed.




