CS7641 Homework 2 - KMeans Clustering Solved


This notebook is tested under Python 3.*.*, and the corresponding packages can be downloaded from miniconda. You may also want to familiarize yourself with several packages:

Please implement the functions that have “raise NotImplementedError”, and after you finish the coding, please delete or comment out “raise NotImplementedError”.

In [1]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

%matplotlib inline  

import sys
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import image
from mpl_toolkits.mplot3d import axes3d
from tqdm import tqdm

print('Version information')

print('python: {}'.format(sys.version))
print('matplotlib: {}'.format(matplotlib.__version__))
print('numpy: {}'.format(np.__version__))

# Set random seed so output is all same
np.random.seed(1)

# Load image
import imageio
Version information
python: 3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
matplotlib: 3.1.0
numpy: 1.16.4

1. KMeans Clustering

KMeans is trying to solve the following optimization problem:

$$\arg\min_{S} \sum_{i=1}^{K} \sum_{x_j \in S_i} ||x_j - \mu_i||^2$$

where one needs to partition the N observations into K clusters $S = \{S_1, S_2, \dots, S_K\}$, and each cluster has $\mu_i$ as its center.

1.1 pairwise distance

In this section, you are asked to implement pairwise_dist function.

Given $X \in \mathbb{R}^{N \times D}$ and $Y \in \mathbb{R}^{M \times D}$, obtain the pairwise distance matrix $dist \in \mathbb{R}^{N \times M}$ using the Euclidean distance metric, where $dist_{i,j} = ||X_i - Y_j||_2$.

DO NOT USE FOR LOOPS in your implementation: they are slow and will make your code too slow to pass our grader. Use array broadcasting instead.
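For intuition, here is one broadcasting-based sketch (an illustrative alternative, not necessarily the approach used in the solution below) that expands the squared distance as $||x_i||^2 + ||y_j||^2 - 2\,x_i \cdot y_j$:

# illustrative sketch: pairwise Euclidean distances without loops,
# using the expansion ||x_i - y_j||^2 = ||x_i||^2 + ||y_j||^2 - 2 * x_i . y_j
sq = np.sum(x ** 2, axis=1)[:, None] + np.sum(y ** 2, axis=1)[None, :] - 2 * x @ y.T
dist = np.sqrt(np.maximum(sq, 0))  # clip tiny negative values caused by floating-point error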

1.2 KMeans Implementation

In this section, you are asked to implement _init_centers [5pts], _update_assignment [10pts], _update_centers [10pts] and _get_loss function [5pts].

For the function signature, please see the corresponding doc strings.

1.3 Find the optimal number of clusters

In this section, you are asked to implement find_optimal_num_clusters function.

You will now use the elbow method to find the optimal number of clusters.

1.4 Autograder test to find centers for data points

To obtain these 5 points, you need to pass the tests set up in the autograder. These will test the centers created by your implementation. Be sure to upload the correct files to obtain these points.

In [2]:
class KMeans(object):
    
    def __init__(self): #No need to implement
        pass
    
    def pairwise_dist(self, x, y): # [5 pts]
        """
        Args:
            x: N x D numpy array
            y: M x D numpy array
        Return:
            dist: N x M array, where dist[i, j] is the Euclidean distance between
                  x[i, :] and y[j, :]
        """
        #l2 norm, ord=2
        dist = np.linalg.norm(x[:, np.newaxis, :] - y, ord=2, axis=2)
        return dist
#         raise NotImplementedError

    def _init_centers(self, points, K, **kwargs): # [5 pts]
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            kwargs: any additional arguments you want
        Return:
            centers: K x D numpy array, the centers. 
        """
        #K random indices from points
        indices = np.random.choice(points.shape[0], size=K, replace=False)
        #form the numpy array with these indices
        centers = points[indices,:]
        return centers
        
#         raise NotImplementedError

    def _update_assignment(self, centers, points): # [10 pts]
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            points: NxD numpy array, the observations
        Return:
            cluster_idx: numpy array of length N, the cluster assignment for each point
            
        Hint: You could call pairwise_dist() function.
        """
        #distances between points and all cluster centers
        distances = self.pairwise_dist(points,centers)
        #index of minimum distance for each row
        cluster_idx = np.argmin(distances,axis=1)
        return cluster_idx
#         raise NotImplementedError

    def _update_centers(self, old_centers, cluster_idx, points): # [10 pts]
        """
        Args:
            old_centers: old centers KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            centers: new centers, K x D numpy array, where K is the number of clusters, and D is the dimension.
        """
        K,D = old_centers.shape[0], old_centers.shape[1]
        # initialize centers as a zero array
        centers = np.zeros((K,D))
        for i in range(K):
            #find mean of all points having i as cluster idx
            centers[i] = np.mean(points[cluster_idx==i, :], axis=0)
        return centers        
#         raise NotImplementedError

    def _get_loss(self, centers, cluster_idx, points): # [5 pts]
        """
        Args:
            centers: KxD numpy array, where K is the number of clusters, and D is the dimension
            cluster_idx: numpy array of length N, the cluster assignment for each point
            points: NxD numpy array, the observations
        Return:
            loss: a single float number, which is the objective function of KMeans. 
        """
        #find squared distances between all points and cluster centers
        distances = np.linalg.norm(points[:, np.newaxis, :] - centers, ord=2, axis=2) ** 2
        #select distance from cluster center
        distance_from_cluster_center = distances[np.arange(len(distances)), cluster_idx]
        #loss is sum of all these distances
        loss = np.sum(distance_from_cluster_center)
        return loss
        
#         raise NotImplementedError
        
    def __call__(self, points, K, max_iters=100, abs_tol=1e-16, rel_tol=1e-16, verbose=False, **kwargs):
        """
        Args:
            points: NxD numpy array, where N is # points and D is the dimensionality
            K: number of clusters
            max_iters: maximum number of iterations (Hint: You could change it when debugging)
            abs_tol: convergence criteria w.r.t absolute change of loss
            rel_tol: convergence criteria w.r.t relative change of loss
            verbose: boolean to set whether method should print loss (Hint: helpful for debugging)
            kwargs: any additional arguments you want
        Return:
            cluster assignments: Nx1 int numpy array
            cluster centers: K x D numpy array, the centers
            loss: final loss value of the objective function of KMeans
        """
        centers = self._init_centers(points, K, **kwargs)
        for it in range(max_iters):
            cluster_idx = self._update_assignment(centers, points)
            centers = self._update_centers(centers, cluster_idx, points)
            loss = self._get_loss(centers, cluster_idx, points)
            K = centers.shape[0]
            if it:
                diff = np.abs(prev_loss - loss)
                if diff < abs_tol and diff / prev_loss < rel_tol:
                    break
            prev_loss = loss
            if verbose:
                print('iter %d, loss: %.4f' % (it, loss))
        return cluster_idx, centers, loss
    
    def find_optimal_num_clusters(self, data, max_K=15): # [10 pts]
        """Plots loss values for different number of clusters in K-Means
        
        Args:
            data: NxD numpy array (here, a flattened image reshaped to (H*W, 3))
            max_K: maximum number of clusters to try
        Return:
            losses: a list of loss values, one for each number of clusters
        """
        losses = []
        # for each value of K, run K-Means and record the final loss
        for i in range(1, max_K + 1):
            cluster_idx, centers, loss = self.__call__(data, i)
            losses.append(loss)
        # plot the elbow curve: loss vs. number of clusters
        plt.plot(np.arange(1, max_K + 1), losses)
        plt.xlabel('Number of clusters K')
        plt.ylabel('K-Means loss')
        plt.show()
        return losses
        
#         raise NotImplementedError
In [3]:
# Helper function for checking the implementation of the pairwise_dist function. Please DO NOT change this function
# TEST CASE
x = np.random.randn(2, 2)
y = np.random.randn(3, 2)

print("*** Expected Answer ***")
print("""==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]""")


print("\n*** My Answer ***")
print("==x==")
print(x)
print("==y==")
print(y)
print("==dist==")
print(KMeans().pairwise_dist(x, y))
*** Expected Answer ***
==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]

*** My Answer ***
==x==
[[ 1.62434536 -0.61175641]
 [-0.52817175 -1.07296862]]
==y==
[[ 0.86540763 -2.3015387 ]
 [ 1.74481176 -0.7612069 ]
 [ 0.3190391  -0.24937038]]
==dist==
[[1.85239052 0.19195729 1.35467638]
 [1.85780729 2.29426447 1.18155842]]
In [4]:
def image_to_matrix(image_file, grays=False):
    """
    Convert .png image to matrix
    of values.
    params:
    image_file = str
    grays = Boolean
    returns:
    img = (color) np.ndarray[np.ndarray[np.ndarray[float]]]
    or (grayscale) np.ndarray[np.ndarray[float]]
    """
    img = plt.imread(image_file)
    # in case of transparency values
    if len(img.shape) == 3 and img.shape[2] > 3:
        height, width, depth = img.shape
        new_img = np.zeros([height, width, 3])
        for r in range(height):
            for c in range(width):
                new_img[r, c, :] = img[r, c, 0:3]
        img = np.copy(new_img)
    if grays and len(img.shape) == 3:
        height, width = img.shape[0:2]
        new_img = np.zeros([height, width])
        for r in range(height):
            for c in range(width):
                new_img[r, c] = img[r, c, 0]
        img = new_img
    return img
In [5]:
image_values = image_to_matrix('./images/bird_color_24.png')

r = image_values.shape[0]
c = image_values.shape[1]
ch = image_values.shape[2]
# flatten the image_values
image_values = image_values.reshape(r*c,ch)

k = 6 # feel free to change this value
cluster_idx, centers, loss = KMeans()(image_values, k)
updated_image_values = np.copy(image_values)

# assign each pixel to cluster mean
for i in range(0,k):
    indices_current_cluster = np.where(cluster_idx == i)[0]
    updated_image_values[indices_current_cluster] = centers[i]
    
updated_image_values = updated_image_values.reshape(r,c,ch)

plt.figure(None,figsize=(9,12))
plt.imshow(updated_image_values)
plt.show()
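As an aside, the per-cluster loop above can be replaced by a single fancy-indexing step, since centers is a K x D array and cluster_idx holds one center index per pixel. A minimal sketch using the same variable names as above:

# illustrative alternative to the per-cluster loop above:
# each pixel takes the color of its assigned cluster center in one indexing step
updated_image_values = centers[cluster_idx].reshape(r, c, ch)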
In [6]:
KMeans().find_optimal_num_clusters(image_values)
Out[6]:
[25575.970952916756,
 14136.693323245287,
 10225.60985715669,
 6993.510281152104,
 6388.54308345205,
 4749.894457724676,
 3383.379286682485,
 3418.0366633949584,
 2738.2662250749277,
 2554.129469081243,
 2065.325028335666,
 1929.4098314621665,
 1965.2534688501296,
 1706.074071535149,
 1580.6903867544215]

Silhouette Coefficient Evaluation

The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighbouring cluster.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
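Concretely, using the standard formulation (which is what the functions below compute): for each point i, let a(i) be the average distance to the other points in its own cluster and b(i) the average distance to the points of the nearest neighboring cluster. Then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad SC = \frac{1}{N}\sum_{i=1}^{N} s(i)$$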

In [7]:
def intra_cluster_dist(cluster_idx, data, labels): # [4 pts]
    """
    Calculates the average distance from a point to other points within the same cluster
    
    Args:
        cluster_idx: the cluster index (label) for which we want to find the intra cluster distance
        data: NxD numpy array, where N is # points and D is the dimensionality
        labels: 1D array of length N where each entry indicates the cluster assignment for that point
    Return:
        intra_dist_cluster: 1D array where the i_th entry denotes the average distance from point i 
                            in cluster denoted by cluster_idx to other points within the same cluster
    """
    # indices of rows with cluster==cluster_idx
    indices = [i for i in range(len(labels)) if labels[i]==cluster_idx]
    # find rows
    rows_of_interest = data[indices]
    # initialize intra_dist_cluster
    intra_dist_cluster = np.zeros((rows_of_interest.shape[0]))
    
    for i in range(intra_dist_cluster.shape[0]):
        x = np.concatenate((rows_of_interest[:i,:], rows_of_interest[i+1:,:]))
        y = rows_of_interest[i, :]
        intra_dist_cluster[i] = np.mean(np.sqrt(np.sum((x - y[None,:])**2, axis = 1)))
        
    return intra_dist_cluster
    #     raise NotImplementedError

def inter_cluster_dist(cluster_idx, data, labels): # [4 pts]
    """
    Calculates the average distance from one cluster to the nearest cluster
    Args:
        cluster_idx: the cluster index (label) for which we want to find the inter cluster distance
        data: NxD numpy array, where N is # points and D is the dimensionality
        labels: 1D array of length N where each entry indicates the cluster assignment for that point
    Return:
        inter_dist_cluster: 1D array where the i-th entry denotes the average distance from point i in cluster
                            denoted by cluster_idx to the nearest neighboring cluster
    """
    # find number of distinct clusters
    num_clusters = len(np.unique(labels))
    # separate data points by cluster
    separate_data = []
    for j in range(num_clusters):
        indices = [i for i in range(len(labels)) if labels[i] == j]
        separate_data.append(data[indices])

    # indices of rows with cluster==cluster_idx
    indices = [i for i in range(len(labels)) if labels[i] == cluster_idx]
    # find rows
    rows_of_interest = data[indices]

    # initialize inter_dist_cluster
    inter_dist_cluster = np.zeros((rows_of_interest.shape[0]))

    # find min. average distance from each point to nearest cluster
    for i in range(rows_of_interest.shape[0]):
        y = rows_of_interest[i, :]
        distances = []
        for j in range(len(separate_data)):
            if j != cluster_idx:
                x = separate_data[j]
                distances.append(np.mean(np.sqrt(np.sum((x - y[None, :]) ** 2, axis=1))))
        inter_dist_cluster[i] = min(distances)
    return inter_dist_cluster
    
#     raise NotImplementedError

def silhouette_coefficient(data, labels): #[2 pts]
    """
    Finds the silhouette coefficient of the current cluster assignment
    
    Args:
        data: NxD numpy array, where N is # points and D is the dimensionality
        labels: 1D array of length N where each entry indicates the cluster assignment for that point
    Return:
        silhouette_coefficient: Silhouette coefficient of the current cluster assignment
    """
    #unique clusters
    clusters = np.unique(labels)
    #initialize silhouette coefficient
    SC = 0
    
    #for each value of cluster_idx, find inter and intra cluster distance, add SC(i) to SC
    for cluster_idx in clusters:
        intra_dist_cluster = intra_cluster_dist(cluster_idx, data, labels)
        inter_dist_cluster = inter_cluster_dist(cluster_idx, data, labels)
        #max of inter and intra cluster distance for each point
        max_of_both = np.max(np.array([intra_dist_cluster, inter_dist_cluster]), axis=0)
        #calculate SC of each point
        S = [x/y for x, y in zip(inter_dist_cluster - intra_dist_cluster, max_of_both)]
        #add sum of all SCs for that cluster to SC
        SC+= np.sum(S)
        
    #SC is average of sum of all individual SCs
    SC/= data.shape[0]
    
    return SC
        
        
#     raise NotImplementedError
In [8]:
def plot_silhouette_coefficient(data, max_K=15):
    """
    Plot silhouette coefficient for different number of clusters, no need to implement
    """
    clusters = np.arange(2, max_K+1)
    
    silhouette_coefficients = []
    for k in range(2, max_K+1):
        labels, _, _ = KMeans()(data, k)
        silhouette_coefficients.append(silhouette_coefficient(data, labels))
    plt.plot(clusters, silhouette_coefficients)
    return silhouette_coefficients


data = np.random.rand(200,3) * 100
plot_silhouette_coefficient(data)
Out[8]:
[0.27434894586234465,
 0.23974317309219537,
 0.25170126787277375,
 0.29110188891166133,
 0.32392134092333413,
 0.2978439345698108,
 0.24118691265277048,
 0.2861673492358512,
 0.27509945216139964,
 0.2710429481302329,
 0.27049817823622974,
 0.28742447415210476,
 0.28737886551571834,
 0.2721522663631721]

Limitation of K-Means

One limitation of K-Means clustering is that it depends heavily on the shape of the clusters in the dataset. A common example of this is trying to cluster one circle within another (concentric circles). K-Means will fail to do this and will end up effectively drawing a line that cuts across both circles. You can visualize this limitation in the cell below.

In [9]:
# visualize limitation of kmeans, do not have to implement
from sklearn.datasets import make_circles, make_moons

X1, y1 = make_circles(factor=0.5, noise=0.05, n_samples=1500)
X2, y2 = make_moons(noise=0.05, n_samples=1500)

def visualise(X, C, K):# Visualization of clustering. You don't need to change this function   
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=C,cmap='rainbow')
    plt.title('Visualization of K = '+str(K), fontsize=15)
    plt.show()
    pass

cluster_idx1, centers1, loss1 = KMeans()(X1, 2)
visualise(X1, cluster_idx1, 2)

cluster_idx2, centers2, loss2 = KMeans()(X2, 2)
visualise(X2, cluster_idx2, 2)

2. EM algorithm [20 pts]

2.1 Performing EM Update [10 pts]

A univariate Gaussian Mixture Model (GMM) has two components, both of which have their own mean and standard deviation. The model is defined by the following parameters:

$z \sim \text{Bernoulli}(\theta)$
$p(x|z=0) \sim \mathcal{N}(\mu, \sigma)$
$p(x|z=1) \sim \mathcal{N}(2\mu, 3\sigma)$

For a dataset of N datapoints, find the following:

2.1.1. Write the marginal probability of x, i.e. $p(x)$ [2pts]

2.1.2. E-Step: Compute the posterior probability, i.e., $p(z_i = k | x_i)$, where $k \in \{0, 1\}$ [2pts]

2.1.3. M-Step: Compute the updated value of $\mu$ (you can keep $\sigma$ fixed for this) [3pts]

2.1.4. M-Step: Compute the updated value for $\sigma$ (you can keep $\mu$ fixed for this) [3pts]

2.1.1 Marginal probability of x, i.e. $p(x)$
Given: A univariate Gaussian mixture model with two components, defined by their parameters as:

$z \sim \text{Bernoulli}(\theta)$
$p(x|z=0) \sim \mathcal{N}(\mu, \sigma)$
$p(x|z=1) \sim \mathcal{N}(2\mu, 3\sigma)$

Note: Following the TAs' clarification on Piazza, we take $p(z=0) = \theta$ and treat $\sigma$ and $3\sigma$ as the standard deviations of the two components.

The marginal probability of x can then be calculated as:

$$p(x) = \sum_z p(z)\,p(x|z) = p(z=0)\,p(x|z=0) + p(z=1)\,p(x|z=1) = \theta\,\mathcal{N}(\mu,\sigma) + (1-\theta)\,\mathcal{N}(2\mu,3\sigma)$$
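Here $\mathcal{N}(\mu, \sigma)$ denotes the Gaussian density with mean $\mu$ and standard deviation $\sigma$; written out with the standard pdf (shown only for completeness):

$$p(x) = \theta\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) + (1-\theta)\,\frac{1}{\sqrt{2\pi}\,(3\sigma)}\exp\!\left(-\frac{(x-2\mu)^2}{2\,(3\sigma)^2}\right)$$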

2.1.2 E-Step: Posterior probability, i.e., $p(z_i = k | x_i)$, where $k \in \{0, 1\}$
For k = 0, we get:

$$p(z_i=0|x_i) = \frac{p(z_i=0)\,p(x_i|z_i=0)}{p(x_i)} = \frac{\theta\,\mathcal{N}(\mu,\sigma)}{\theta\,\mathcal{N}(\mu,\sigma) + (1-\theta)\,\mathcal{N}(2\mu,3\sigma)}$$

Similarly, for k = 1, we get:

$$p(z_i=1|x_i) = \frac{p(z_i=1)\,p(x_i|z_i=1)}{p(x_i)} = \frac{(1-\theta)\,\mathcal{N}(2\mu,3\sigma)}{\theta\,\mathcal{N}(\mu,\sigma) + (1-\theta)\,\mathcal{N}(2\mu,3\sigma)}$$
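As a quick numerical illustration of this E-step (a sketch with made-up values for $\theta$, $\mu$, $\sigma$; it assumes scipy is available, which the notebook itself does not import):

from scipy.stats import norm
import numpy as np

theta, mu, sigma = 0.6, 1.0, 2.0           # hypothetical parameter values
x = np.array([0.5, 2.0, 3.5])              # a few illustrative datapoints

p0 = theta * norm.pdf(x, loc=mu, scale=sigma)                 # p(z=0) * p(x|z=0)
p1 = (1 - theta) * norm.pdf(x, loc=2 * mu, scale=3 * sigma)   # p(z=1) * p(x|z=1)

tau0 = p0 / (p0 + p1)   # posterior p(z_i = 0 | x_i), i.e. tau(z=0)
tau1 = p1 / (p0 + p1)   # posterior p(z_i = 1 | x_i), i.e. tau(z=1)
print(tau0 + tau1)      # sanity check: responsibilities sum to 1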

2.1.3 M-Step: Computing the updated value of $\mu$
To find the updated value of $\mu$, we need to solve the optimization problem:

$$\theta^{new} = \arg\max_{\theta} \sum_z p(Z|X,\theta^{old})\,\ln p(X,Z|\theta)$$

To solve this optimization problem, we will find its derivative and equate it to 0.

Writing $p(Z=1|X,\theta^{old})$ as $\tau(z=1)$ and $p(Z=0|X,\theta^{old})$ as $\tau(z=0)$, and using the derivative of the log of the Gaussian pdf with respect to $\mu$, we get:

$$\frac{\partial}{\partial \mu} \sum_z p(Z|X,\theta^{old})\,\ln p(X,Z|\theta) = \frac{\partial}{\partial \mu}\Big[\tau(z=1)\ln\big((1-\theta)\,\mathcal{N}(2\mu,3\sigma)\big) + \tau(z=0)\ln\big(\theta\,\mathcal{N}(\mu,\sigma)\big)\Big] = \tau(z=1)\,\frac{\partial \ln \mathcal{N}(2\mu,3\sigma)}{\partial \mu} + \tau(z=0)\,\frac{\partial \ln \mathcal{N}(\mu,\sigma)}{\partial \mu} = \tau(z=1)\,\frac{2(x-2\mu)}{9\sigma^2} + \tau(z=0)\,\frac{x-\mu}{\sigma^2}$$

Setting the above derivative to 0, we get:

$$\tau(z=1)\,\frac{2(x-2\mu)}{9\sigma^2} + \tau(z=0)\,\frac{x-\mu}{\sigma^2} = 0$$
$$2x\,\tau(z=1) - 4\mu\,\tau(z=1) + 9x\,\tau(z=0) - 9\mu\,\tau(z=0) = 0$$
$$x\,\big(2\tau(z=1) + 9\tau(z=0)\big) = \mu\,\big(9\tau(z=0) + 4\tau(z=1)\big)$$

This gives us the updated value of $\mu$ as:

$$\mu = \frac{x\,\big(2\tau(z=1) + 9\tau(z=0)\big)}{9\tau(z=0) + 4\tau(z=1)}$$

2.1.4 M-Step: Computing the updated value for $\sigma$

For this part, we are solving the same optimization problem, but this time we take the partial derivative with respect to $\sigma$.

Writing $p(Z=1|X,\theta^{old})$ as $\tau(z=1)$ and $p(Z=0|X,\theta^{old})$ as $\tau(z=0)$, and using the derivative of the log of the Gaussian pdf with respect to $\sigma$, we get:

$$\frac{\partial}{\partial \sigma} \sum_z p(Z|X,\theta^{old})\,\ln p(X,Z|\theta) = \frac{\partial}{\partial \sigma}\Big[\tau(z=1)\ln\big((1-\theta)\,\mathcal{N}(2\mu,3\sigma)\big) + \tau(z=0)\ln\big(\theta\,\mathcal{N}(\mu,\sigma)\big)\Big] = \tau(z=1)\,\frac{\partial \ln \mathcal{N}(2\mu,3\sigma)}{\partial \sigma} + \tau(z=0)\,\frac{\partial \ln \mathcal{N}(\mu,\sigma)}{\partial \sigma} = \tau(z=1)\,\frac{(x-2\mu)^2 - 9\sigma^2}{9\sigma^3} + \tau(z=0)\,\frac{(x-\mu)^2 - \sigma^2}{\sigma^3}$$

Setting the above derivative to 0, we get:

$$\tau(z=1)\,\frac{(x-2\mu)^2 - 9\sigma^2}{9\sigma^3} + \tau(z=0)\,\frac{(x-\mu)^2 - \sigma^2}{\sigma^3} = 0$$
$$\tau(z=1)\big((x-2\mu)^2 - 9\sigma^2\big) + 9\tau(z=0)\big((x-\mu)^2 - \sigma^2\big) = 0$$
$$\tau(z=1)(x-2\mu)^2 - 9\tau(z=1)\,\sigma^2 + 9\tau(z=0)(x-\mu)^2 - 9\tau(z=0)\,\sigma^2 = 0$$
$$\sigma^2 = \frac{\tau(z=1)(x-2\mu)^2 + 9\tau(z=0)(x-\mu)^2}{9\,\big(\tau(z=1) + \tau(z=0)\big)}$$

This gives us the updated value of $\sigma$ as:

$$\sigma = \frac{1}{3}\sqrt{\frac{\tau(z=1)(x-2\mu)^2 + 9\tau(z=0)(x-\mu)^2}{\tau(z=1) + \tau(z=0)}}$$
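Putting the E-step and both M-step updates together, here is a small numpy sketch of one EM iteration for this model. It assumes scipy is available and that the single-point updates above generalize to N datapoints by summing the responsibilities over the data, which is an assumption of this sketch; the mixture-weight update for $\theta$ is the standard one and is not derived above.

import numpy as np
from scipy.stats import norm

def em_step(x, theta, mu, sigma):
    """One EM iteration for the mixture theta*N(mu, sigma) + (1-theta)*N(2mu, 3sigma)."""
    # E-step: responsibilities tau0 = p(z=0|x), tau1 = p(z=1|x)
    p0 = theta * norm.pdf(x, loc=mu, scale=sigma)
    p1 = (1 - theta) * norm.pdf(x, loc=2 * mu, scale=3 * sigma)
    tau0, tau1 = p0 / (p0 + p1), p1 / (p0 + p1)

    # M-step for mu: the single-point formula above, summed over all points
    mu_new = np.sum(x * (9 * tau0 + 2 * tau1)) / np.sum(9 * tau0 + 4 * tau1)
    # M-step for sigma: likewise summed over points, keeping mu fixed as in the derivation
    sigma_new = np.sqrt(np.sum(9 * tau0 * (x - mu) ** 2 + tau1 * (x - 2 * mu) ** 2)
                        / (9 * len(x)))
    # standard mixture-weight update (not derived in this section)
    theta_new = np.mean(tau0)
    return theta_new, mu_new, sigma_new

x = np.random.randn(500) + 1.0   # illustrative data
print(em_step(x, theta=0.5, mu=1.0, sigma=1.0))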

2.2 EM Algorithm in ABO Blood Groups [10 pts]

In the ABO blood group system, each individual has a phenotype and a genotype as shown below. The genotype is made of underlying alleles (A, B, O).

Phenotype   Genotype
A           AA, AO, OA
B           BB, BO, OB
O           OO
AB          AB

In a research experiment, scientists wanted to model the distribution of the genotypes of the population. They collected the phenotype information from the participants, as this could be directly observed from an individual's blood group. The scientists, however, want to use this data to model the underlying genotype information. To help them obtain an understanding, you suggest using the EM algorithm to find the genotype distribution.

You know that the probability that an allele is present in an individual is independent of the probability of any other allele, i.e., $P(AO) = P(OA) = P(A) \cdot P(O)$ and so on. Also note that the genotype pairs (AO, OA) and (BO, OB) are identical and can be treated as AO and BO respectively. You also know that the alleles follow a multinomial distribution.

$$p(O) = 1 - p(A) - p(B)$$
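For illustration only (this is not part of the original prompt), the allele-independence assumption links the observable phenotype probabilities to the allele probabilities, since each genotype is an unordered pair of alleles:

$$P(\text{phenotype A}) = p(A)^2 + 2\,p(A)\,p(O), \qquad P(\text{phenotype B}) = p(B)^2 + 2\,p(B)\,p(O)$$
$$P(\text{phenotype O}) = p(O)^2, \qquad P(\text{phenotype AB}) = 2\,p(A)\,p(B)$$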

Let $n_A, n_B, n_O, n_{AB}$ be the number of individuals with phenotypes A, B, O and AB respectively. Let $n_{AA}, n_{AO}, n_{BB}, n_{BO}, n_{AB}$ be the number of individuals with genotypes AA, AO, BB, BO and AB respectively. They satisfy the following conditions:

"", ComparePerformance().accuracy_GNB_cleandata(clean_data, independent))
  • Homework-2-i8kgpq.zip