Implementation of K Nearest Neighbors
Last Updated :
09 Nov, 2022
Prerequisite: K nearest neighbors
Introduction
Say we are given a data set of items, each having numerically valued features (like Height, Weight, Age, etc). If the count of features is n, we can represent the items as points in an n-dimensional grid. Given a new item, we can calculate the distance from the item to every other item in the set. We pick the k closest neighbors and we see where most of these neighbors are classified in. We classify the new item there.
So the problem becomes how we can calculate the distances between items. The solution to this depends on the data set. If the values are real we usually use the Euclidean distance. If the values are categorical or binary, we usually use the Hamming distance.
Algorithm:
Given a new item:
1. Find distances between new item and all other items
2. Pick k shorter distances
3. Pick the most common class in these k distances
4. That class is where we will classify the new item
Reading Data
Let our input file be in the following format:
Height, Weight, Age, Class
1.70, 65, 20, Programmer
1.90, 85, 33, Builder
1.78, 76, 31, Builder
1.73, 74, 24, Programmer
1.81, 75, 35, Builder
1.73, 70, 75, Scientist
1.80, 71, 63, Scientist
1.75, 69, 25, Programmer
Each item is a line and under “Class” we see where the item is classified in. The values under the feature names (“Height” etc.) are the value the item has for that feature. All the values and features are separated by commas.
Place these data files in the working directory data2 and data. Choose one and paste the contents as-is into a text file named data.
We will read from the file (named “data.txt”) and we will split the input by lines:
f = open('data.txt', 'r');
lines = f.read().splitlines();
f.close();
The first line of the file holds the feature names, with the keyword “Class” at the end. We want to store the feature names into a list:
# Split the first line by commas,
# remove the first element and
# save the rest into a list. The
# list now holds the feature
# names of the data set.
features = lines[0].split(', ')[:-1];
Then we move on to the data set itself. We will save the items into a list, named items, whose elements are dictionaries (one for each item). The keys to these item-dictionaries are the feature names, plus “Class” to hold the item class. In the end, we want to shuffle the items in the list (this is a safety measure, in case the items are in a weird order).
Python3
items = [];
for i in range ( 1 , len (lines)):
line = lines[i].split( ', ' );
itemFeatures = { "Class" : line[ - 1 ]};
for j in range ( len (features)):
f = features[j];
v = float (line[j]);
itemFeatures[f] = v;
items.append(itemFeatures);
shuffle(items);
|
Classifying the data
With the data stored into items, we now start building our classifier. For the classifier, we will create a new function, Classify. It will take as input the item we want to classify, the items list, and k, the number of the closest neighbors.
If k is greater than the length of the data set, we do not go ahead with the classifying, as we cannot have more closest neighbors than the total amount of items in the data set. (alternatively, we could set k as the items length instead of returning an error message)
if(k > len(Items)):
# k is larger than list
# length, abort
return "k larger than list length";
We want to calculate the distance between the item to be classified and all the items in the training set, in the end keeping the k shortest distances. To keep the current closest neighbors we use a list, called neighbors. Each element in the least holds two values, one for the distance from the item to be classified and another for the class the neighbor is in. We will calculate distance via the generalized Euclidean formula (for n dimensions). Then, we will pick the class that appears most of the time in neighbors and that will be our pick. In code:
Python3
def Classify(nItem, k, Items):
if (k > len (Items)):
return "k larger than list length" ;
neighbors = [];
for item in Items:
distance = EuclideanDistance(nItem, item);
neighbors = UpdateNeighbors(neighbors, item, distance, k);
count = CalculateNeighborsClass(neighbors, k);
return FindMax(count);
|
The external functions we need to implement are EuclideanDistance, UpdateNeighbors, CalculateNeighborsClass, and FindMax.
Finding Euclidean Distance
The generalized Euclidean formula for two vectors x and y is this:
distance = sqrt{(x_{1}-y_{1})^2 + (x_{2}-y_{2})^2 + ... + (x_{n}-y_{n})^2}
In code:
Python3
def EuclideanDistance(x, y):
S = 0 ;
for key in x.keys():
S + = math. pow (x[key] - y[key], 2 );
return math.sqrt(S);
|
Updating Neighbors
We have our neighbors list (which should at most have a length of k) and we want to add an item to the list with a given distance. First, we will check if neighbors have a length of k. If it has less, we add the item to it regardless of the distance (as we need to fill the list up to k before we start rejecting items). If not, we will check if the item has a shorter distance than the item with the max distance in the list. If that is true, we will replace the item with max distance with the new item.
To find the max distance item more quickly, we will keep the list sorted in ascending order. So, the last item in the list will have the max distance. We will replace it with a new item and we will sort it again.
To speed this process up, we can implement an Insertion Sort where we insert new items in the list without having to sort the entire list. The code for this though is rather long and, although simple, will bog the tutorial down.
Python3
def UpdateNeighbors(neighbors, item, distance, k):
if ( len (neighbors) > distance):
neighbors[ - 1 ] = [distance, item[ "Class" ]];
neighbors = sorted (neighbors);
return neighbors;
|
CalculateNeighborsClass
Here we will calculate the class that appears most often in neighbors. For that, we will use another dictionary, called count, where the keys are the class names appearing in neighbors. If a key doesn’t exist, we will add it, otherwise, we will increment its value.
Python3
def CalculateNeighborsClass(neighbors, k):
count = {};
for i in range (k):
if (neighbors[i][ 1 ] not in count):
count[neighbors[i][ 1 ]] = 1 ;
else :
count[neighbors[i][ 1 ]] + = 1 ;
return count;
|
FindMax
We will input to this function the dictionary count we build in CalculateNeighborsClass and we will return its max.
Python3
def FindMax(countList):
maximum = - 1 ;
classification = "";
for key in countList.keys():
if (countList[key] > maximum):
maximum = countList[key];
classification = key;
return classification, maximum;
|
Conclusion
With that, this kNN tutorial is finished.
You can now classify new items, setting k as you see fit. Usually, for k an odd number is used, but that is not necessary. To classify a new item, you need to create a dictionary with keys the feature names, and the values that characterize the item. An example of classification:
newItem = {'Height' : 1.74, 'Weight' : 67, 'Age' : 22};
print Classify(newItem, 3, items);
The complete code of the above approach is given below:-
Python3
import math
from random import shuffle
f = open (fileName, 'r' )
lines = f.read().splitlines()
f.close()
features = lines[ 0 ].split( ', ' )[: - 1 ]
items = []
for i in range ( 1 , len (lines)):
line = lines[i].split( ', ' )
itemFeatures = { 'Class' : line[ - 1 ]}
for j in range ( len (features)):
f = features[j]
v = float (line[j])
itemFeatures[f] = v
items.append(itemFeatures)
shuffle(items)
return items
S = 0
for key in x.keys():
S + = math. pow (x[key] - y[key], 2 )
return math.sqrt(S)
def CalculateNeighborsClass(neighbors, k):
count = {}
for i in range (k):
if neighbors[i][ 1 ] not in count:
count[neighbors[i][ 1 ]] = 1
else :
count[neighbors[i][ 1 ]] + = 1
return count
def FindMax( Dict ):
maximum = - 1
classification = ''
for key in Dict .keys():
if Dict [key] > maximum:
maximum = Dict [key]
classification = key
return (classification, maximum)
neighbors = []
for item in Items:
distance = EuclideanDistance(nItem, item)
neighbors = UpdateNeighbors(neighbors, item, distance, k)
count = CalculateNeighborsClass(neighbors, k)
return FindMax(count)
def UpdateNeighbors(neighbors, item, distance,
k, ):
if len (neighbors) < k:
neighbors.append([distance, item[ 'Class' ]])
neighbors = sorted (neighbors)
else :
if neighbors[ - 1 ][ 0 ] > distance:
neighbors[ - 1 ] = [distance, item[ 'Class' ]]
neighbors = sorted (neighbors)
return neighbors
if K > len (Items):
return - 1
correct = 0
total = len (Items) * (K - 1 )
l = int ( len (Items) / K)
for i in range (K):
trainingSet = Items[i * l:(i + 1 ) * l]
testSet = Items[:i * l] + Items[(i + 1 ) * l:]
for item in testSet:
itemClass = item[ 'Class' ]
itemFeatures = {}
for key in item:
if key ! = 'Class' :
itemFeatures[key] = item[key]
guess = Classify(itemFeatures, k, trainingSet)[ 0 ]
if guess = = itemClass:
correct + = 1
accuracy = correct / float (total)
return accuracy
def Evaluate(K, k, items, iterations):
accuracy = 0
for i in range (iterations):
shuffle(items)
accuracy + = K_FoldValidation(K, k, items)
print accuracy / float (iterations)
items = ReadData( 'data.txt' )
Evaluate( 5 , 5 , items, 100 )
if __name__ = = '__main__' :
main()
|
Output:
0.9375
The output can vary from machine to machine. The code includes a Fold Validation function, but it is unrelated to the algorithm, it is there for calculating the accuracy of the algorithm.
Similar Reads
Implementation of K-Nearest Neighbors from Scratch using Python
Instance-Based LearningK Nearest Neighbors Classification is one of the classification techniques based on instance-based learning. Models based on instance-based learning to generalize beyond the training examples. To do so, they store the training examples first. When it encounters a new instance
8 min read
K-Nearest Neighbors and Curse of Dimensionality
In high-dimensional data, the performance of the k-nearest neighbor (k-NN) algorithm often deteriorates due to increased computational complexity and the breakdown of the assumption that similar points are proximate. These challenges hinder the algorithm's accuracy and efficiency in high-dimensional
6 min read
Mathematical explanation of K-Nearest Neighbour
KNN stands for K-nearest neighbour is a popular algorithm in Supervised Learning commonly used for classification tasks. It works by classifying data based on its similarity to neighboring data points. The core idea of KNN is straightforward when a new data point is introduced the algorithm finds it
4 min read
k-nearest neighbor algorithm in Python
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning method. It operates for classification as well as regression: Classification: For a new data point, the algorithm identifies its nearest neighbors based on a distance metric (e.g., Euclidean distance). The predicted class is dete
4 min read
r-Nearest neighbors
r-Nearest neighbors are a modified version of the k-nearest neighbors. The issue with k-nearest neighbors is the choice of k. With a smaller k, the classifier would be more sensitive to outliers. If the value of k is large, then the classifier would be including many points from other classes. It is
5 min read
K Nearest Neighbors with Python | ML
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine Learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining, and intrusion detection. The K-Nearest Neighbors (KNN) algorithm is a simple, easy
5 min read
Implementation of Radius Neighbors from Scratch in Python
Radius Neighbors is also one of the techniques based on instance-based learning. Models based on instance-based learning generalize beyond the training examples. To do so, they store the training examples first. When it encounters a new instance (or test example), then they instantly build a relatio
8 min read
K-Nearest Neighbor(KNN) Algorithm
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at whatâs nearby. Imagine a streaming service wants to predict if a new user is likely to cancel their subscription (churn) based on their age. They checks the ages of its existing users and whether they churned or stayed. If mo
10 min read
Implementation of KNN using OpenCV
KNN is one of the most widely used classification algorithms that is used in machine learning. To know more about the KNN algorithm read here KNN algorithm Today we are going to see how we can implement this algorithm in OpenCV and how we can visualize the results in 2D plane showing different featu
3 min read
How To Predict Diabetes using K-Nearest Neighbor in R
In this article, we are going to predict Diabetes using the K-Nearest Neighbour algorithm and analyze on Diabetes dataset using the R Programming Language. What is the K-Nearest Neighbor algorithm?The K-Nearest Neighbor (KNN) algorithm is a popular supervised learning classifier frequently used by d
13 min read