Description
This notebook is tested under python 3.*.*, and the corresponding packages can be downloaded from miniconda. You may also want to get yourself familiar with several packages:
Please implement the functions that have “raise NotImplementedError”, and after you finish the coding, please delete or comment “raise NotImplementedError”.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
%matplotlib inline
import sys
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import image
from mpl_toolkits.mplot3d import axes3d
from tqdm import tqdm
print('Version information')
print('python: {}'.format(sys.version))
print('matplotlib: {}'.format(matplotlib.__version__))
print('numpy: {}'.format(np.__version__))
# Set random seed so output is all same
np.random.seed(1)
# Load image
import imageio
1. KMeans Clustering
KMeans is trying to solve the following optimization problem:
where one needs to partition the N observations into K clusters: S={S1,S2,…,SK}S={S1,S2,…,SK} and each cluster has μiμi as its center.
1.1 pairwise distance
In this section, you are asked to implement pairwise_dist function.
Given X∈RNxDX∈RNxD and Y∈RMxDY∈RMxD, obtain the pairwise distance matrix dist∈RNxMdist∈RNxM using the euclidean distance metric, where disti,j=||Xi−Yj||2disti,j=||Xi−Yj||2.
DO NOT USE FOR LOOP in your implementation — they are slow and will make your code too slow to pass our grader. Use array broadcasting instead.
1.2 KMeans Implementation
In this section, you are asked to implement _init_centers [5pts], _update_assignment [10pts], _update_centers [10pts] and _get_loss function [5pts].
For the function signature, please see the corresponding doc strings.
1.3 Find the optimal number of clusters
In this section, you are asked to implement find_optimal_num_clusters function.
You will now use the elbow method to find the optimal number of clusters.
1.4 Autograder test to find centers for data points
To obtain these 5 points, you need to be pass the tests set up in the autograder. These will test the centers created by your implementation. Be sure to upload the correct files to obtain these points.
class KMeans(object):
def __init__(self): #No need to implement
pass
def pairwise_dist(self, x, y): # [5 pts]
"""
Args:
x: N x D numpy array
y: M x D numpy array
Return:
dist: N x M array, where dist2[i, j] is the euclidean distance between
x[i, :] and y[j, :]
"""
#l2 norm, ord=2
dist = np.linalg.norm(x[:, np.newaxis, :] - y, ord=2, axis=2)
return dist
# raise NotImplementedError
def _init_centers(self, points, K, **kwargs): # [5 pts]
"""
Args:
points: NxD numpy array, where N is # points and D is the dimensionality
K: number of clusters
kwargs: any additional arguments you want
Return:
centers: K x D numpy array, the centers.
"""
#K random indices from points
indices = np.random.choice(points.shape[0], size=K, replace=False)
#form the numpy array with these indices
centers = points[indices,:]
return centers
# raise NotImplementedError
def _update_assignment(self, centers, points): # [10 pts]
"""
Args:
centers: KxD numpy array, where K is the number of clusters, and D is the dimension
points: NxD numpy array, the observations
Return:
cluster_idx: numpy array of length N, the cluster assignment for each point
Hint: You could call pairwise_dist() function.
"""
#distances between points and all cluster centers
distances = self.pairwise_dist(points,centers)
#index of minimum distance for each row
cluster_idx = np.argmin(distances,axis=1)
return cluster_idx
# raise NotImplementedError
def _update_centers(self, old_centers, cluster_idx, points): # [10 pts]
"""
Args:
old_centers: old centers KxD numpy array, where K is the number of clusters, and D is the dimension
cluster_idx: numpy array of length N, the cluster assignment for each point
points: NxD numpy array, the observations
Return:
centers: new centers, K x D numpy array, where K is the number of clusters, and D is the dimension.
"""
K,D = old_centers.shape[0], old_centers.shape[1]
#intialize centers as zero array
centers = np.zeros((K,D))
for i in range(K):
#find mean of all points having i as cluster idx
centers[i] = np.mean(points[cluster_idx==i, :], axis=0)
return centers
# raise NotImplementedError
def _get_loss(self, centers, cluster_idx, points): # [5 pts]
"""
Args:
centers: KxD numpy array, where K is the number of clusters, and D is the dimension
cluster_idx: numpy array of length N, the cluster assignment for each point
points: NxD numpy array, the observations
Return:
loss: a single float number, which is the objective function of KMeans.
"""
#find squared distances between all points and cluster centers
distances = np.linalg.norm(points[:, np.newaxis, :] - centers, ord=2, axis=2) ** 2
#select distance from cluster center
distance_from_cluster_center = distances[np.arange(len(distances)), cluster_idx]
#loss is sum of all these distances
loss = np.sum(distance_from_cluster_center)
return loss
# raise NotImplementedError
def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, verbose=False, **kwargs):
"""
Args:
points: NxD numpy array, where N is # points and D is the dimensionality
K: number of clusters
max_iters: maximum number of iterations (Hint: You could change it when debugging)
abs_tol: convergence criteria w.r.t absolute change of loss
rel_tol: convergence criteria w.r.t relative change of loss
verbose: boolean to set whether method should print loss (Hint: helpful for debugging)
kwargs: any additional arguments you want
Return:
cluster assignments: Nx1 int numpy array
cluster centers: K x D numpy array, the centers
loss: final loss value of the objective function of KMeans
"""
centers = self._init_centers(points, K, **kwargs)
for it in range(max_iters):
cluster_idx = self._update_assignment(centers, points)
centers = self._update_centers(centers, cluster_idx, points)
loss = self._get_loss(centers, cluster_idx, points)
K = centers.shape[0]
if it:
diff = np.abs(prev_loss - loss)
if diff < abs_tol and diff / prev_loss < rel_tol:
break
prev_loss = loss
if verbose:
print('iter %d, loss: %.4f' % (it, loss))
return cluster_idx, centers, loss
def find_optimal_num_clusters(self, data, max_K=15): # [10 pts]
"""Plots loss values for different number of clusters in K-Means
Args:
image: input image of shape(H, W, 3)
max_K: number of clusters
Return:
losses: an array of loss denoting the loss of each number of clusters
"""
losses = []
#for each value of K, find the loss and append to array
for i in range(1,max_K+1):
cluster_idx, centers, loss = self.__call__(data, i)
losses.append(loss)
return losses
# raise NotImplementedError
# Helper function for checking the implementation of pairwise_distance fucntion. Please DO NOT change this function
# TEST CASE
x = np.random.randn(2, 2)
y = np.random.randn(3, 2)
print("*** Expected Answer ***")
print("""==x==
[[ 1.62434536 -0.61175641]
[-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
[ 1.74481176 -0.7612069 ]
[ 0.3190391 -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
[1.85780729 2.29426447 1.18155842]]""")
print("\n*** My Answer ***")
print("==x==")
print(x)
print("==y==")
print(y)
print("==dist==")
print(KMeans().pairwise_dist(x, y))
def image_to_matrix(image_file, grays=False):
"""
Convert .png image to matrix
of values.
params:
image_file = str
grays = Boolean
returns:
img = (color) np.ndarray[np.ndarray[np.ndarray[float]]]
or (grayscale) np.ndarray[np.ndarray[float]]
"""
img = plt.imread(image_file)
# in case of transparency values
if len(img.shape) == 3 and img.shape[2] > 3:
height, width, depth = img.shape
new_img = np.zeros([height, width, 3])
for r in range(height):
for c in range(width):
new_img[r, c, :] = img[r, c, 0:3]
img = np.copy(new_img)
if grays and len(img.shape) == 3:
height, width = img.shape[0:2]
new_img = np.zeros([height, width])
for r in range(height):
for c in range(width):
new_img[r, c] = img[r, c, 0]
img = new_img
return img
image_values = image_to_matrix('./images/bird_color_24.png')
r = image_values.shape[0]
c = image_values.shape[1]
ch = image_values.shape[2]
# flatten the image_values
image_values = image_values.reshape(r*c,ch)
k = 6 # feel free to change this value
cluster_idx, centers, loss = KMeans()(image_values, k)
updated_image_values = np.copy(image_values)
# assign each pixel to cluster mean
for i in range(0,k):
indices_current_cluster = np.where(cluster_idx == i)[0]
updated_image_values[indices_current_cluster] = centers[i]
updated_image_values = updated_image_values.reshape(r,c,ch)
plt.figure(None,figsize=(9,12))
plt.imshow(updated_image_values)
plt.show()
KMeans().find_optimal_num_clusters(image_values)
Silhouette Coefficient Evaluation
The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighbouring cluster.
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
def intra_cluster_dist(cluster_idx, data, labels): # [4 pts]
"""
Calculates the average distance from a point to other points within the same cluster
Args:
cluster_idx: the cluster index (label) for which we want to find the intra cluster distance
data: NxD numpy array, where N is # points and D is the dimensionality
labels: 1D array of length N where each number indicates of cluster assignement for that point
Return:
intra_dist_cluster: 1D array where the i_th entry denotes the average distance from point i
in cluster denoted by cluster_idx to other points within the same cluster
"""
# indices of rows with cluster==cluster_idx
indices = [i for i in range(len(labels)) if labels[i]==cluster_idx]
# find rows
rows_of_interest = data[indices]
# initialize intra_dist_cluster
intra_dist_cluster = np.zeros((rows_of_interest.shape[0]))
for i in range(intra_dist_cluster.shape[0]):
x = np.concatenate((rows_of_interest[:i,:], rows_of_interest[i+1:,:]))
y = rows_of_interest[i, :]
intra_dist_cluster[i] = np.mean(np.sqrt(np.sum((x - y[None,:])**2, axis = 1)))
return intra_dist_cluster
# raise NotImplementedError
def inter_cluster_dist(cluster_idx, data, labels): # [4 pts]
"""
Calculates the average distance from one cluster to the nearest cluster
Args:
cluster_idx: the cluster index (label) for which we want to find the intra cluster distance
data: NxD numpy array, where N is # points and D is the dimensionality
labels: 1D array of length N where each number indicates of cluster assignement for that point
Return:
inter_dist_cluster: 1D array where the i-th entry denotes the average distance from point i in cluster
denoted by cluster_idx to the nearest neighboring cluster
"""
# find number of distinct clusters
num_clusters = len(np.unique(labels))
# separate data points by cluster
separate_data = []
for j in range(num_clusters):
indices = [i for i in range(len(labels)) if labels[i] == j]
separate_data.append(data[indices])
# indices of rows with cluster==cluster_idx
indices = [i for i in range(len(labels)) if labels[i] == cluster_idx]
# find rows
rows_of_interest = data[indices]
# initialize inter_dist_cluster
inter_dist_cluster = np.zeros((rows_of_interest.shape[0]))
# find min. average distance from each point to nearest cluster
for i in range(rows_of_interest.shape[0]):
y = rows_of_interest[i, :]
distances = []
for j in range(len(separate_data)):
if j != cluster_idx:
x = separate_data[j]
distances.append(np.mean(np.sqrt(np.sum((x - y[None, :]) ** 2, axis=1))))
inter_dist_cluster[i] = min(distances)
return inter_dist_cluster
# raise NotImplementedError
def silhouette_coefficient(data, labels): #[2 pts]
"""
Finds the silhouette coefficient of the current cluster assignment
Args:
data: NxD numpy array, where N is # points and D is the dimensionality
labels: 1D array of length N where each number indicates of cluster assignement for that point
Return:
silhouette_coefficient: Silhouette coefficient of the current cluster assignment
"""
#unique clusters
clusters = np.unique(labels)
#initialize silhouette coefficient
SC = 0
#for each value of cluster_idx, find inter and intra cluster distance, add SC(i) to SC
for cluster_idx in clusters:
intra_dist_cluster = intra_cluster_dist(cluster_idx, data, labels)
inter_dist_cluster = inter_cluster_dist(cluster_idx, data, labels)
#max of inter and intra cluster distance for each point
max_of_both = np.max(np.array([intra_dist_cluster, inter_dist_cluster]), axis=0)
#calculate SC of each point
S = [x/y for x, y in zip(inter_dist_cluster - intra_dist_cluster, max_of_both)]
#add sum of all SCs for that cluster to SC
SC+= np.sum(S)
#SC is average of sum of all individual SCs
SC/= data.shape[0]
return SC
# raise NotImplementedError
def plot_silhouette_coefficient(data, max_K=15):
"""
Plot silhouette coefficient for different number of clusters, no need to implement
"""
clusters = np.arange(2, max_K+1)
silhouette_coefficients = []
for k in range(2, max_K+1):
labels, _, _ = KMeans()(data, k)
silhouette_coefficients.append(silhouette_coefficient(data, labels))
plt.plot(clusters, silhouette_coefficients)
return silhouette_coefficients
data = np.random.rand(200,3) * 100
plot_silhouette_coefficient(data)
Limitation of K-Means
One of the limitations of K-Means Clustering is that it dependes largely on the shape of the dataset. A common example of this is trying to cluster one circle within another (concentric circles). A K-means classifier will fail to do this and will end up effectively drawing a line which crosses the circles. You can visualize this limitation in the cell below.
# visualize limitation of kmeans, do not have to implement
from sklearn.datasets.samples_generator import (make_circles, make_moons)
X1, y1 = make_circles(factor=0.5, noise=0.05, n_samples=1500)
X2, y2 = make_moons(noise=0.05, n_samples=1500)
def visualise(X, C, K):# Visualization of clustering. You don't need to change this function
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=C,cmap='rainbow')
plt.title('Visualization of K = '+str(K), fontsize=15)
plt.show()
pass
cluster_idx1, centers1, loss1 = KMeans()(X1, 2)
visualise(X1, cluster_idx1, 2)
cluster_idx2, centers2, loss2 = KMeans()(X2, 2)
visualise(X2, cluster_idx2, 2)
2. EM algorithm [20 pts]
2.1 Performing EM Update [10 pts]
A univariate Gaussian Mixture Model (GMM) has two components, both of which have their own mean and standard deviation. The model is defined by the following parameters:
For a dataset of N datapoints, find the following:
2.1.1. Write the marginal probability of x, i.e. p(x)p(x) [2pts]
2.1.2. E-Step: Compute the posterior probability, i.e, p(zi=k|xi)p(zi=k|xi), where k = {0,1} [2pts]
2.1.3. M-Step: Compute the updated value of μμ (You can keep σσ fixed for this) [3pts]
2.1.4. M-Step: Compute the updated value for σσ (You can keep μμ fixed for this) [3pts]
2.1.1 Marginal probability of x, i.e. p(x)p(x)
Given: A univariate Gaussian mixture model with two components, defined by their parameters as:
Note: Considering p(z=0)=θp(z=0)=θ, and σσ and 3σ3σ as standard deviation as suggested by TAs on piazza:
Marginal probability of x can be calculated as:
2.1.2 E-Step: Posterior probability, i.e, p(zi=k|xi)p(zi=k|xi), where k = {0,1}
For k=0, we get:
Similarly, for k=0, we get:
2.1.3 M-Step: Computing the updated value of μμ
To find the updated value of μμ, we need to solve the optimization problem:
To solve this optimization problem, we will find its derivative and equate it to 0.
Representing p(Z=1|X,θoldp(Z=1|X,θold as τ(z=1)τ(z=1) and p(Z=0|X,θoldp(Z=0|X,θold as τ(z=0)τ(z=0) and using the value of the derivative of the Gaussian pdf w.r.t. μμ, we get:
Equating the above value of derivative to 0, we get:
This gives us the updated value of μμ as:
2.1.4 M-Step: Computing the updated value for σσ:
For this part, we are solving the same optimization problem, but this time, we will find the partial derivative w.r.t. σσ.
Representing p(Z=1|X,θoldp(Z=1|X,θold as τ(z=1)τ(z=1) and p(Z=0|X,θoldp(Z=0|X,θold as τ(z=0)τ(z=0) and using the value of the derivative of the Gaussian pdf w.r.t. σσ, we get:
Equating the above value of derivative to 0, we get:
\
This gives us the updated value of σσ as:
2.2 EM Algorithm in ABO Blood Groups [10 pts]
In the ABO blood group system, each individual has a phenotype and a genotype as shown below. The genotype is made of underlying alleles (A, B, O).
In a research experiment, scientists wanted to model the distribution of the genotypes of the population. They collected the phenotype information from the participants as this could be directly observed from the individual’s blood group. The scientists, however want to use this data to model the underlying genotype information. In order to help them obtain an understanding, you suggest using the EM algorithm to find out the genotype distribution.
You know that the probability of that an allele is present in an individual is independent of the probability of any other allele, i.e, P(AO)=P(OA)=P(A)∗P(O)P(AO)=P(OA)=P(A)∗P(O) and so on. Also note that the genotype pairs: (AO, OA) and (BO, OB) are identical and can be treated as AO, BO respectively. You also know that the alleles follow a multinomial distribution.
Let nA,nB,nO,nABnA,nB,nO,nAB be the number of individuals with the phenotypes A, B, O and AB respectively.\ Let nAA,nAO,nBB,nBO,nABnAA,nAO,nBB,nBO,nAB be the numbers of individuals with genotypes AA, AO, BB, BO and AB respectively.\ The satisfy the following conditions:
"", ComparePerformance().accuracy_GNB_cleandata(clean_data, independent))