1.Concept: Birds of a feather flock together.

a.Feathers are the same when they have similar size, color, density, texture, etc. Each category / class of bird has a distinct range of values for these parameters.

b.When a new data point / sample with no pre-defined class, represented by a combination of these parameters, is presented, the algorithm tries to assess the class or classes to which this new data point may belong

c.To assess the likely class, the algorithm uses the concept of distance. There are various ways of calculating distance; in our case we will use Euclidean distance

2.Well suited for classification tasks where the relationships between features and target classes are numerous, complex and difficult to understand, yet items in a class tend to be fairly homogeneous in their attribute values

3.Not suitable if the data is noisy and the target classes do not have a clear demarcation in terms of attribute values
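For reference, the Euclidean distance mentioned in 1.c can be written as follows. The symbols p, q and n are notation chosen here for illustration (two points described by n parameters each) and do not come from the notes.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

% Worked example with n = 2, p = (0, 0), q = (3, 4):
d(p, q) = \sqrt{(0 - 3)^2 + (0 - 4)^2} = \sqrt{9 + 16} = 5
```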

a.Of the K nearest neighbours, the new point is assigned to the group to which most of the K points belong

a.Measuring similarity by the distance between points, using the Euclidean method

b.This distance is calculated between the test point / sample point and every point in the training data; the K points with the smallest distances are the nearest neighbours.

c.The majority class of the nearest neighbours is assigned to the test point
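The three steps above (measure distance, find the nearest neighbours, take the majority class) can be sketched in plain Scala as follows. This is a minimal illustration, not a production implementation: the names `KnnSketch`, `Labeled`, `euclidean` and `classify` are hypothetical, and the sketch assumes the whole training set fits in memory.

```scala
object KnnSketch {
  // A labelled training point: feature vector plus class label (hypothetical type).
  final case class Labeled(features: Vector[Double], label: String)

  // Euclidean distance between two feature vectors of equal length.
  def euclidean(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "feature vectors must have the same length")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  // Classify `query` by majority vote among its k nearest training points.
  def classify(train: Seq[Labeled], query: Vector[Double], k: Int): String = {
    require(k >= 1 && k <= train.size, "k must be between 1 and the training set size")
    // Distance from the query to every training point, keeping the k closest.
    val nearest = train.sortBy(p => euclidean(p.features, query)).take(k)
    // Count labels among the k neighbours; a tied vote is resolved arbitrarily.
    nearest.groupBy(_.label).map { case (label, pts) => (label, pts.size) }.maxBy(_._2)._1
  }
}
```

For example, `KnnSketch.classify(train, Vector(1.1, 1.0), 3)` returns the label held by most of the three training points closest to (1.1, 1.0).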

a.What should be the value of K? Why not take all the observations as K, or why not take only one observation?

b.If we take all the data points as K, a new data point will always be assigned to the class that has the majority of data points in the given data set (this is called bias)

c.On the other hand, if we take K = 1, the single nearest point decides the class. There is then a high chance that the classification of the new data point will be affected by noise (this is called variance)

d.A practical way to identify a good value of K for the classification is to test the model on several K values and select the one that gives the best performance on data not used for training
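Trying several K values, as described above, can be sketched like this. It is an illustrative sketch under stated assumptions: the names `ChooseK`, `accuracy` and `bestK` are hypothetical, performance is measured as plain accuracy on a separate validation set, and the classifier is a simple in-memory majority vote over Euclidean distance.

```scala
object ChooseK {
  type Point = (Vector[Double], String) // (features, class label)

  // Euclidean distance between two feature vectors.
  def distance(a: Vector[Double], b: Vector[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Majority class among the k training points nearest to the query.
  def classify(train: Seq[Point], query: Vector[Double], k: Int): String =
    train.sortBy(p => distance(p._1, query)).take(k)
      .groupBy(_._2).map { case (label, pts) => (label, pts.size) }.maxBy(_._2)._1

  // Fraction of validation points whose predicted label matches the true label.
  def accuracy(train: Seq[Point], validation: Seq[Point], k: Int): Double =
    validation.count { case (x, y) => classify(train, x, k) == y }.toDouble / validation.size

  // Try each candidate K and keep the one with the highest validation accuracy.
  def bestK(train: Seq[Point], validation: Seq[Point], candidates: Seq[Int]): Int =
    candidates.maxBy(k => accuracy(train, validation, k))
}
```

With a training set that contains one mislabelled point, K = 1 tends to follow that noisy point, K equal to the training set size always predicts the majority class, and an intermediate K chosen by `bestK` can avoid both problems.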

a.The distance formula is highly dependent on how features / attributes / dimensions are measured.

b.Those dimensions which have a larger possible range of values will dominate the result of the distance calculation using the Euclidean formula

c.To ensure all the dimensions have similar scale, we normalize the data on all the dimensions / attributes

d.There are multiple ways of normalizing the data. We will use Z-score standardization

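e.Using the Z-score formula, z = (x - mean) / standard deviation, where the mean and standard deviation are computed separately for each dimension, normalize all the numerical dimensions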
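Z-score standardization of every dimension can be sketched in plain Scala as below. The names `ZScore`, `standardize` and `standardizeRows` are hypothetical, and the sketch assumes the population standard deviation (dividing by n); dividing by n - 1 instead is an equally common choice.

```scala
object ZScore {
  // Standardize one dimension: subtract the mean, divide by the standard deviation.
  // Assumption: population standard deviation (divide by n, not n - 1).
  def standardize(column: Seq[Double]): Seq[Double] = {
    val n = column.size
    val mean = column.sum / n
    val std = math.sqrt(column.map(x => (x - mean) * (x - mean)).sum / n)
    // A constant column has standard deviation 0; map it to zeros instead of dividing by zero.
    if (std == 0.0) column.map(_ => 0.0) else column.map(x => (x - mean) / std)
  }

  // Standardize every dimension (column) of a row-oriented data set.
  def standardizeRows(rows: Seq[Vector[Double]]): Seq[Vector[Double]] =
    rows.transpose.map(standardize).transpose.map(_.toVector)
}
```

After `ZScore.standardizeRows`, a dimension measured in thousands and a dimension measured in single units both have mean 0 and standard deviation 1, so neither dominates the Euclidean distance merely because of its scale.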

1.K-NN algorithm

a.Most common nearest neighbour implementation algorithm.

( Lab-1 Diagnosing breast cancer using KNN clustering)

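import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Note: KMeans is an unsupervised clustering algorithm; it does not use class
// labels and is not a K-NN classifier.

// Load and parse the data (sc is the SparkContext provided by spark-shell).
// Assumes every comma-separated field is numeric; a header row or a
// non-numeric column would make toDouble fail and must be removed first.
val data = sc.textFile("/user/data/wisc_bc_data.csv")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)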

Source: https://spark.apache.org/docs/1.3.1/mllib-clustering.html#k-means