DSCI553 Assignment 6-Clustering Solved

1. Overview of the Assignment

In Assignment 6, you will implement the Bradley-Fayyad-Reina (BFR) algorithm. The goal is to familiarize you with the clustering process in general and with various distance measurements. The dataset you will use is synthetic.

2. Assignment Requirements

2.1 Programming Language and Library Requirements

a. You must use Python to implement the algorithm. You can only use the following external Python libraries: numpy and sklearn.

2.2 Programming Environment

Python 3.6, JDK 1.8, Scala 2.11 and Spark 2.4.4
We will use Python 3.6 to test your code. There will be a 20% penalty if we cannot run your code due to library version inconsistency.

2.3 Write your own code

Do not share your code with other students!!

We will combine all the code we can find from the Web (e.g., GitHub) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all the detected plagiarism.

3. Dataset

Since the BFR algorithm strongly assumes that the clusters are normally distributed with independent dimensions, we generated synthetic datasets by initializing some random centroids and creating data points around those centroids with some standard deviations to form the clusters. We also added some other data points as outliers to evaluate the algorithm. Data points that are outliers belong to a cluster named or indexed as “-1”. Figure 1 shows an example of a part of the dataset. The first column is the data point index. The second column is the name/index of the cluster that the data point belongs to. The remaining columns represent the features/dimensions of the data point.

Figure 1: An example of the dataset

a. hw6_clustering.txt is the synthetic clustering dataset. The dataset is available on Vocareum (public data folder).
b. We generate the testing dataset using a similar method. Note that its number of dimensions may differ from that of hw6_clustering.txt. We do not share the testing dataset.
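The column layout described above can be parsed with a few lines of Python. This is only a minimal sketch: the comma delimiter, the integer index and label columns, and the helper name `parse_points` are assumptions, since Figure 1 is not reproduced here. The number of feature columns is not hard-coded because the testing dataset may have a different dimensionality.

```python
import numpy as np

def parse_points(lines, delimiter=","):
    """Parse rows of the form: index, cluster label, feature_1, ..., feature_d.

    The delimiter is an assumption; adjust it to match the real file.
    """
    indices, labels, features = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(delimiter)
        indices.append(int(parts[0]))      # column 1: data point index
        labels.append(int(parts[1]))       # column 2: ground-truth cluster, -1 for outliers
        features.append([float(v) for v in parts[2:]])  # remaining columns: features
    return np.array(indices), np.array(labels), np.array(features)

# Tiny made-up sample, not taken from hw6_clustering.txt.
sample = ["0,1,0.5,-1.25", "1,-1,10.0,20.0", "2,1,0.75,-1.0"]
idx, truth, X = parse_points(sample)
```

In a real run, the lines would come from iterating over the opened input file rather than from an in-memory list.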

4. Task

You will implement the Bradley-Fayyad-Reina (BFR) algorithm to cluster the data contained in hw6_clustering.txt.
In BFR, there are three sets of points that you need to keep track of:

Discard set (DS), Compression set (CS), Retained set (RS)

For each cluster in the DS and CS, the cluster is summarized by:

N: the number of points
SUM: the sum of the coordinates of the points
SUMSQ: the sum of squares of coordinates
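These three statistics are enough to recover a cluster's centroid (SUM / N) and per-dimension variance (SUMSQ / N minus the squared centroid), and two clusters can be merged by adding their statistics. The class below is a hypothetical helper written for illustration, not part of the assignment's required interface.

```python
import numpy as np

class ClusterSummary:
    """Hypothetical helper holding the N, SUM, SUMSQ statistics of one cluster."""

    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)
        self.sumsq = np.zeros(dim)

    def add(self, point):
        # Fold one point into the statistics; the point itself need not be kept.
        point = np.asarray(point, dtype=float)
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def merge(self, other):
        # Merging two clusters is just element-wise addition of their statistics.
        self.n += other.n
        self.sum += other.sum
        self.sumsq += other.sumsq

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Per-dimension population variance: E[x^2] - (E[x])^2
        return self.sumsq / self.n - (self.sum / self.n) ** 2

summary = ClusterSummary(dim=2)
for p in [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]:
    summary.add(p)
```

Keeping only these sums is what allows the points of DS and CS clusters to be discarded after they are assigned.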

The conceptual steps of the BFR algorithm (Please refer to the slide for details):

Implementation details of the BFR algorithm: (just for your reference, the number of input clusters = n_cluster parameter given as input)

Step 1. Load 20% of the data randomly.
Step 2. Run K-Means (e.g., from sklearn) with a large K (e.g., 5 times the number of input clusters) on the data in memory using the Euclidean distance as the similarity measurement.

Step 3. In the K-Means result from Step 2, move all the clusters that contain only one point to RS (outliers).
Step 4. Run K-Means again to cluster the rest of the data points with K = the number of input clusters.

Step 5. Use the K-Means result from Step 4 to generate the DS clusters (i.e., discard their points and generate statistics).

The initialization of DS is now finished. At this point, you have K DS clusters (from Step 5) and some RS points (from Step 3).
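Step 3 amounts to separating points whose K-Means cluster has exactly one member from everything else. A minimal sketch is shown below, assuming the labels are a plain list such as the result of sklearn's `KMeans(n_clusters=...).fit_predict(X)` converted with `.tolist()`; the helper name `split_singletons` is made up for illustration.

```python
from collections import Counter

def split_singletons(labels):
    """Return (singleton_positions, remaining_positions) for a list of cluster labels."""
    sizes = Counter(labels)  # cluster label -> number of points in that cluster
    singles = [i for i, lab in enumerate(labels) if sizes[lab] == 1]
    rest = [i for i, lab in enumerate(labels) if sizes[lab] > 1]
    return singles, rest

# Made-up labels: clusters 0 and 2 have two points each, clusters 1 and 3 have one.
labels = [0, 2, 0, 1, 2, 3]
rs_positions, kept_positions = split_singletons(labels)
```

The positions in `rs_positions` would go to RS, and K-Means would be run again on the points at `kept_positions` as in Step 4.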

Step 6. Run K-Means on the points in the RS with a large K (e.g., 5 times the number of input clusters) to generate CS (clusters with more than one point) and RS (clusters with only one point).

Step 7. Load another 20% of the data randomly.
Step 8. For the new points, compare them to each of the DS clusters using the Mahalanobis Distance and assign them to the nearest DS cluster if the distance is < 2√𝑑.
Step 9. For the new points that are not assigned to DS clusters, use the Mahalanobis Distance and assign the points to the nearest CS cluster if the distance is < 2√𝑑.
Step 10. For the new points that are not assigned to a DS cluster or a CS cluster, assign them to RS.

Step 11. Run K-Means on the RS with a large K (e.g., 5 times the number of input clusters) to generate CS (clusters with more than one point) and RS (clusters with only one point).

Step 12. Merge CS clusters that have a Mahalanobis Distance < 2√𝑑.

Repeat Steps 7 – 12.

If this is the last run (after the last chunk of data), merge CS clusters with DS clusters that have a Mahalanobis Distance < 2√𝑑.
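The distance test used in Steps 8, 9 and 12 can be sketched from the N, SUM and SUMSQ statistics alone. The code below assumes the usual BFR form of the Mahalanobis distance (a Euclidean distance after dividing each dimension by the cluster's standard deviation), a threshold of 2√d where d is the number of dimensions, and non-zero variance in every dimension. The function names and the tuple layout are hypothetical.

```python
import numpy as np

def mahalanobis(point, n, sum_, sumsq):
    """Normalized distance from a point to a cluster given its N, SUM, SUMSQ."""
    point = np.asarray(point, dtype=float)
    centroid = np.asarray(sum_, dtype=float) / n
    variance = np.asarray(sumsq, dtype=float) / n - centroid ** 2
    # Assumes variance > 0 in every dimension; a real implementation
    # would need to decide how to treat zero-variance dimensions.
    return float(np.sqrt(np.sum(((point - centroid) / np.sqrt(variance)) ** 2)))

def nearest_cluster(point, summaries):
    """summaries: list of (n, sum, sumsq). Returns a cluster position or None."""
    threshold = 2 * np.sqrt(len(point))  # assumed threshold 2 * sqrt(d)
    distances = [mahalanobis(point, *s) for s in summaries]
    best = int(np.argmin(distances))
    return best if distances[best] < threshold else None

# Two made-up 2-D clusters, each with unit variance: centroids (1, 1) and (11, 11).
summaries = [
    (4, [4.0, 4.0], [8.0, 8.0]),
    (4, [44.0, 44.0], [488.0, 488.0]),
]
close = nearest_cluster([1.5, 0.5], summaries)  # within the threshold of cluster 0
far = nearest_cluster([5.0, 5.0], summaries)    # beyond the threshold of both clusters
```

A point for which `nearest_cluster` returns `None` against the DS clusters would next be tested against the CS clusters, and would fall through to RS if it fails both tests.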

At each run, including the initialization step, you need to count and output the number of the discard points, the number of the clusters in the CS, the number of the compression points, and the number of the points in the retained set.
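Each run corresponds to one randomly loaded 20% chunk of the data. One way to produce these chunks, sketched below, is to shuffle the point positions once and cut them into five parts. This assumes the chunks are meant to be disjoint and to cover the whole dataset; the helper name `make_chunks` and the fixed seed are illustrative choices.

```python
import random

def make_chunks(n_points, n_chunks=5, seed=0):
    """Shuffle positions 0..n_points-1 and split them into n_chunks nearly equal parts."""
    positions = list(range(n_points))
    random.Random(seed).shuffle(positions)  # seeded for reproducible runs
    base, extra = divmod(n_points, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional point each.
        size = base + (1 if i < extra else 0)
        chunks.append(positions[start:start + size])
        start += size
    return chunks

chunks = make_chunks(n_points=23)
```

The first chunk would feed the initialization (Steps 1 to 6), and each remaining chunk one pass of Steps 7 to 12.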

Input format: (we will use the following command to execute your code)

python3 task.py <input_file> <n_cluster> <output_file>

Param: input_file: the name of the input file (e.g., hw6_clustering.txt), including the file path.
Param: n_cluster: the number of the clusters.
Param: output_file: the name of the output txt file, including the file path.
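Reading the three positional parameters could look like the following sketch. The function name `parse_args` and the usage message are illustrative; the only things taken from the command above are the argument order and that n_cluster is a number.

```python
def parse_args(argv):
    """Expect argv = [script, input_file, n_cluster, output_file]."""
    if len(argv) != 4:
        raise SystemExit("usage: python3 task.py <input_file> <n_cluster> <output_file>")
    input_file = argv[1]
    n_cluster = int(argv[2])  # the number of clusters is given as a string
    output_file = argv[3]
    return input_file, n_cluster, output_file

# In task.py this would be called with sys.argv; a literal list is used here.
args = parse_args(["task.py", "hw6_clustering.txt", "10", "output.txt"])
```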

Output format:

The output file is a text file, containing the following information (see Figure 2):

a. The intermediate results (the line is named as “The intermediate results”). Each following line should start with “Round {𝑖}:”, where 𝑖 is the count for the round (including the initialization, i.e., the initialization would be “Round 1:”). You need to output the numbers in the order of “the number of the discard points”, “the number of the clusters in the compression set”, “the number of the compression points”, and “the number of the points in the retained set”.

Leave one blank line before writing out the clustering results.

b. The clustering results (the line is named as “The clustering results”), including each data point's index and its cluster assignment after the BFR algorithm. The cluster assignments should be in [0, the number of clusters). The cluster of outliers should be represented as -1.
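A possible way to write the two parts of the output file is sketched below. Figure 2 is not reproduced here, so the exact punctuation is an assumption: the header lines are copied from the quoted names above, and comma-separated values with no spaces are a guess that should be checked against Figure 2. The function name `write_results` is hypothetical.

```python
import io

def write_results(out, rounds, assignments):
    """rounds: list of (discard points, CS clusters, compression points, RS points).
    assignments: dict mapping data point index -> cluster label (-1 for outliers)."""
    out.write("The intermediate results\n")
    for i, counts in enumerate(rounds, start=1):  # initialization is Round 1
        out.write("Round {}: {}\n".format(i, ",".join(str(c) for c in counts)))
    out.write("\n")  # one blank line before the clustering results
    out.write("The clustering results\n")
    for index in sorted(assignments):
        out.write("{},{}\n".format(index, assignments[index]))

# Made-up numbers, written to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_results(buffer, [(100, 2, 5, 3), (210, 1, 4, 2)], {2: 0, 0: 1, 1: -1})
lines = buffer.getvalue().splitlines()
```

For the real output, the same function can be given a file object opened with `open(output_file, "w")`.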

 
