K-Nearest Neighbor(KNN) Algorithm
Last Updated: 29 Jan, 2025
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what’s nearby. Imagine a streaming service wants to predict whether a new user is likely to cancel their subscription (churn) based on their age. It checks the ages of its existing users and whether they churned or stayed. If most of the “K” users closest in age to the new user canceled their subscription, KNN will predict that the new user might churn too. The key idea is that users with similar ages tend to behave similarly, and KNN uses this closeness to make decisions.
Getting Started with K-Nearest Neighbors
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs the work at the time of classification.
As an example, consider the following set of data points containing two features, plotted below:

KNN Algorithm working visualization
The new point is classified as Category 2 because most of its closest neighbors are blue squares. KNN assigns the category based on the majority of nearby points.
The image shows how KNN predicts the category of a new data point based on its closest neighbours.
- The red diamonds represent Category 1 and the blue squares represent Category 2.
- The new data point checks its closest neighbours (circled points).
- Since the majority of its closest neighbours are blue squares (Category 2) KNN predicts the new data point belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
What is ‘K’ in K-Nearest Neighbours?
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the algorithm how many nearby points (neighbours) to look at when it makes a decision.
Example:
Imagine you’re trying to decide which fruit a new item is based on its shape and size. You compare it to fruits you already know.
- If k = 3, the algorithm looks at the 3 closest fruits to the new one.
- If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple because most of its neighbours are apples.
How to choose the value of k for KNN Algorithm?
The value of k is critical in KNN as it determines the number of neighbors considered when making predictions. The optimal value of k depends on the characteristics of the input data. If the dataset has significant outliers or noise, a higher k can help smooth out the predictions and reduce the influence of noisy data. However, choosing a very high value can lead to underfitting, where the model becomes too simplistic.
Statistical Methods for Selecting k:
- Cross-Validation: A robust method for selecting the best k is to perform cross-validation. This involves splitting the data into several folds, training the model on some folds and testing it on the remaining one, and repeating this for each fold. The value of k that gives the highest average validation accuracy is usually the best choice (see the sketch after this list).
- Elbow Method: In the elbow method we plot the model’s error rate or accuracy for different values of k. As we increase k, the error usually decreases at first, but after a certain point it changes much more slowly. The point where the curve forms an “elbow” is taken as the best k.
- Odd Values for k: It is also recommended to choose an odd value for k, especially in binary classification tasks, to avoid ties when deciding the majority class.
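For example, a minimal cross-validation sketch (assuming the scikit-learn library is available; the Iris dataset and the candidate range of k values are illustrative choices, not part of this article's example) could look like this:
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset; replace with your own features X and labels y
X, y = load_iris(return_X_y=True)

# Evaluate several odd candidate values of k with 5-fold cross-validation
scores = {}
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best k:", best_k)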
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbors; these neighbors are then used for the classification or regression task. To identify the nearest neighbors we use the distance metrics below:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or space. You can think of it like the shortest path you would walk if you were to go directly from one point to another.
[Tex]\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{i_j})^2}[/Tex]
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and vertical lines (like a grid or city streets). It’s also called “taxicab distance” because a taxi can only drive along the grid-like streets of a city.
[Tex]d\left ( x,y \right )={\sum_{i=1}^{n}\left | x_i-y_i \right |}[/Tex]
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and Manhattan distances as special cases.
[Tex]d\left ( x,y \right )=\left ( {\sum_{i=1}^{n}\left ( x_i-y_i \right )^p} \right )^{\frac{1}{p}}[/Tex]
From the formula above, when p = 2 it reduces to the Euclidean distance, and when p = 1 it reduces to the Manhattan distance.
So you can think of Minkowski as a flexible distance formula that can behave like either Manhattan or Euclidean distance depending on the value of p, as sketched below.
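As a quick illustration (the helper function below is a hypothetical name, not a library call), all three metrics can be computed with NumPy for a pair of 2-D points:
Python
import numpy as np

def minkowski_distance(x, y, p):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return np.sum(np.abs(np.array(x) - np.array(y)) ** p) ** (1 / p)

a, b = [1, 2], [4, 6]
print(minkowski_distance(a, b, p=2))  # Euclidean: 5.0
print(minkowski_distance(a, b, p=1))  # Manhattan: 7.0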
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.
A step-by-step explanation of how KNN works is given below:
Step 1: Selecting the optimal value of K
- K represents the number of nearest neighbors that need to be considered while making the prediction.
Step 2: Calculating distance
- To measure the similarity between the target point and the training data points, Euclidean distance is used. The distance is calculated between each data point in the dataset and the target point.
Step 3: Finding Nearest Neighbors
- The k data points with the smallest distances to the target point are its nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
- When you want to classify a data point into a category (like spam or not spam), the KNN algorithm looks at the K closest points in the dataset. These closest points are called neighbors. The algorithm then looks at which category the neighbors belong to and picks the one that appears most often. This is called majority voting.
- In regression, the algorithm still looks for the K closest points, but instead of voting for a class it takes the average of the values of those K neighbors. This average is the predicted value for the new point (a small sketch of this appears after the figure below).

Working of KNN Algorithm
It shows how a test point is classified based on its nearest neighbours. As the test point moves, the algorithm identifies the closest ‘k’ data points (5 in this case) and assigns the test point the majority class label, which here is the grey class.
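For the regression case, a minimal from-scratch sketch (the function name and the toy data below are illustrative assumptions) simply averages the target values of the k closest training points:
Python
import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    # Distance from the new point to every training point
    dists = np.linalg.norm(np.array(X_train) - np.array(x_new), axis=1)
    # Indices of the k closest points, then the mean of their target values
    nearest = np.argsort(dists)[:k]
    return np.mean(np.array(y_train)[nearest])

X_train = [[1], [2], [3], [10], [11]]
y_train = [100, 110, 120, 300, 310]
print(knn_regress(X_train, y_train, [2.5], k=3))  # mean of 100, 110, 120 -> 110.0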
Python Implementation of KNN Algorithm
1. Importing Libraries:
Python
import numpy as np
from collections import Counter
- Counter: used to count the occurrences of elements in a list or iterable. In KNN, after finding the labels of the k nearest neighbors, Counter helps count how many times each label appears.
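For instance, a quick check of how Counter settles a majority vote (the labels below are made up for illustration):
Python
from collections import Counter

votes = ['A', 'B', 'A']                     # labels of the k nearest neighbors
print(Counter(votes).most_common(1))        # [('A', 2)]
print(Counter(votes).most_common(1)[0][0])  # 'A' wins the vote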
2. Defining the Euclidean Distance Function:
Python
def euclidean_distance(point1, point2):
    # Straight-line distance between two points
    return np.sqrt(np.sum((np.array(point1) - np.array(point2))**2))
- euclidean_distance: calculates the Euclidean distance between two points.
3. KNN Prediction Function:
Python
def knn_predict(training_data, training_labels, test_point, k):
    # Compute the distance from the test point to every training point
    distances = []
    for i in range(len(training_data)):
        dist = euclidean_distance(test_point, training_data[i])
        distances.append((dist, training_labels[i]))
    # Sort by distance so the closest points come first
    distances.sort(key=lambda x: x[0])
    # Labels of the k closest neighbors
    k_nearest_labels = [label for _, label in distances[:k]]
    # Majority vote: return the most common label among them
    return Counter(k_nearest_labels).most_common(1)[0][0]
- distances.append: each distance is paired with the corresponding label (training_labels[i]) of the training data, and the pair is stored in a list called distances.
- distances.sort: the list of distances is sorted in ascending order so that the closest points are at the beginning of the list.
- k_nearest_labels: the function then selects the labels of the k closest neighbors.
- The labels of the k nearest neighbors are counted using the Counter class, and the most frequent label is returned as the prediction for the test_point. This is based on the majority vote of the k neighbors.
4. Training Data, Labels and Test Point:
Python
training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']
test_point = [4, 5]
k = 3
5. Prediction and Output:
Python
prediction = knn_predict(training_data, training_labels, test_point, k)
print(prediction)
Output:
A
The algorithm calculates the distances of the test point [4, 5] to all training points, selects the 3 closest points (as k = 3), and determines their labels. Since the majority of the closest points are labelled ‘A’, the test point is classified as ‘A’.
In practice we can also use the Scikit-Learn Python library, which has built-in functions for building a KNN model; for that, refer to Implementation of KNN classifier using Sklearn. A minimal sketch is shown below.
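As a rough sketch of that approach (assuming scikit-learn is installed; the toy data simply mirrors the from-scratch example above), the same prediction with KNeighborsClassifier might look like this:
Python
from sklearn.neighbors import KNeighborsClassifier

training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']

# Fit a KNN classifier with k = 3 and predict the label of the test point
model = KNeighborsClassifier(n_neighbors=3)
model.fit(training_data, training_labels)
print(model.predict([[4, 5]]))  # expected: ['A']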
Applications of the KNN Algorithm
Here are some real life applications of KNN Algorithm.
- Recommendation Systems: Many recommendation systems, such as those used by Netflix or Amazon, rely on KNN to suggest products or content. KNN looks at user behavior and finds similar users. If user A and user B have similar preferences, KNN might recommend movies that user A liked to user B.
- Spam Detection: KNN is widely used in filtering spam emails. By comparing the features of a new email with those of previously labeled spam and non-spam emails, KNN can predict whether a new email is spam or not.
- Customer Segmentation: In marketing, KNN is used to segment customers based on their purchasing behavior. By comparing new customers to existing customers, KNN can easily group customers into segments with similar choices and preferences. This helps businesses target the right customers with the right products or advertisements.
- Speech Recognition: KNN is often used in speech recognition systems to transcribe spoken words into text. The algorithm compares the features of the spoken input with those of known speech patterns. It then predicts the most likely word or command based on the closest matches.
Advantages and Disadvantages of the KNN Algorithm
Advantages:
- Easy to implement: The KNN algorithm is easy to implement because its complexity is relatively low compared to other machine learning algorithms.
- No training required: KNN stores all data in memory and doesn’t require a separate training phase, so when new data points are added it automatically uses them for future predictions.
- Few Hyperparameters: The only parameters required for a KNN model are the value of k and the choice of distance metric.
- Flexible: It works for classification problems (like “is this email spam or not?”) and also for regression tasks (like predicting house prices based on nearby similar houses).
Disadvantages:
- Doesn’t scale well: KNN is considered a “lazy” algorithm; because all the work happens at prediction time, it becomes very slow, especially with large datasets.
- Curse of Dimensionality: When the number of features increases, KNN struggles to classify data accurately, a problem known as the curse of dimensionality.
- Prone to Overfitting: Because the algorithm is affected by the curse of dimensionality, it is also prone to overfitting.
K-Nearest Neighbor(KNN) Algorithm – FAQs
Why is KNN a lazy learner?
The KNN algorithm does not build a model during the training phase. It memorizes the entire training dataset and performs its computation on that dataset at the time of classification.
Why is KNN non-parametric?
The KNN algorithm does not make assumptions about the underlying distribution of the data it is analyzing.
What is the difference between KNN and K-means?
- KNN is a supervised machine learning model used for classification problems, whereas K-means is an unsupervised machine learning model used for clustering.
- The “K” in KNN is the number of nearest neighbors, whereas the “K” in K-means is the number of clusters.