K-nearest neighbors (KNN for short) is a supervised learning algorithm that uses distance calculations between data points to classify them. A new data point is assigned a class by finding its k nearest neighbors in the training data and adopting their majority class.
How does the algorithm work?
Classification is the general task of assigning the points in a data set to certain classes. To do this, a model is trained that can then independently decide for new points which class they most likely belong to. Such models can come from the domain of supervised or unsupervised learning; the distinction is whether the training data set already contains class assignments or not.
If, for example, we want to divide the customers of a company into three different groups depending on their purchasing power and the number of purchases, a supervised learning algorithm is one in which the customers in the training data set have already been assigned to a customer group, and the model infers the group of new customers based on this given classification. In unsupervised learning, on the other hand, the customers in the training data set are not yet classified, and the model must find groupings on its own based on the structures it recognizes in the data.
Suppose we take the following customers as an example to explain the KNN algorithm:
Customer | Total Sales | Number of Purchases | Group |
---|---|---|---|
A | 10,000 € | 5 | A |
B | 1,500 € | 1 | B |
C | 7,500 € | 3 | A |
For the k-Nearest Neighbor algorithm, a concrete value for k must first be chosen. It specifies how many neighbors the new data point will be compared with. For our example, we choose k = 2. Now suppose a new customer D has bought from us for a total of 9,000 € across four purchases. To classify this customer, we search for the two (k = 2) nearest neighbors of this data point and determine their classes.

In our case, these are customers A and C, who both belong to group "A". We therefore classify the new customer D as an A-customer as well, since the majority of its nearest neighbors (here: both of them) belong to group "A".
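To make this concrete, here is a minimal sketch in Python that reproduces the toy example, using the raw, unscaled values from the table (note that total sales dominate the distance in this form):

```python
import numpy as np

# Known customers: [total sales in €, number of purchases] and their groups
customers = np.array([
    [10000, 5],   # customer A
    [1500, 1],    # customer B
    [7500, 3],    # customer C
])
groups = np.array(["A", "B", "A"])

# New customer D: 9,000 € total sales, 4 purchases
new_customer = np.array([9000, 4])

# Euclidean distances from D to every known customer
distances = np.sqrt(((customers - new_customer) ** 2).sum(axis=1))

# Take the k = 2 nearest neighbors and assign their majority group
k = 2
nearest = np.argsort(distances)[:k]
labels, counts = np.unique(groups[nearest], return_counts=True)
print(labels[counts.argmax()])  # -> A
```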
Besides the choice of the value k, the distance calculation between the points determines the quality of the model. There are different calculation methods for this.
How can distances be calculated?
Depending on the use case and the characteristics of the data, various distance functions can be used to determine the nearest neighbors. We will take a closer look at these in this chapter.
Euclidean Distance
The Euclidean distance is the most widely used and can be applied to real-valued vectors with any number of dimensions. It is calculated by taking the difference between the two points in each dimension, squaring these differences, and summing them up. The square root of this sum is the final result.
\[d(x,y) = \sqrt{\sum_{i=1}^{n}(y_i - x_i)^2}\]
Intuitively, this corresponds to drawing a straight line between the two points x and y and measuring its length.
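As a small illustration, the formula can be implemented directly with NumPy; the two example points are arbitrary:

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between two points, following the formula above."""
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(np.sum((y - x) ** 2))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```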
Manhattan Distance
The Manhattan distance, on the other hand, sums the absolute differences between the points in all dimensions and is therefore also called the "taxicab distance": the procedure resembles the route of a cab through the grid-like streets of Manhattan, New York.
\[d(x,y) = \sum_{i=1}^{n}|y_i - x_i|\]
Using this distance function makes sense especially when the individual dimensions should be compared separately. For example, if two houses are compared by their number of rooms and their living area in square meters, it makes little sense to take the Euclidean distance; instead, the difference in rooms and the difference in living area are looked at separately. Otherwise, dimensions with completely different units would be mixed together.
Other Distance Functions
In addition, there are other distance functions that can be used for special data formats. For example, the Hamming distance is useful for Boolean values such as True and False. The Minkowski distance, on the other hand, is a generalization of the Euclidean and Manhattan distances.
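A brief sketch of how these metrics compare in practice, here using SciPy's distance module on arbitrary example vectors:

```python
from scipy.spatial import distance

x = [1, 2, 3]
y = [4, 6, 3]

print(distance.euclidean(x, y))       # straight-line distance: 5.0
print(distance.cityblock(x, y))       # Manhattan (taxicab) distance: 7
print(distance.minkowski(x, y, p=3))  # Minkowski distance with p = 3

# Hamming distance for Boolean vectors: fraction of positions that differ
print(distance.hamming([True, False, True], [True, True, True]))  # 0.333...
```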
Which applications use KNN?
When working with large data sets, classification helps to get a first impression of the feature values and the distribution of the data points. Beyond that, there are many other applications for classification:
- Market segmentation: An attempt is made to find similar customer groups with comparable purchasing behavior or other characteristics.
- Image segmentation: An attempt is made to find the locations within an image that belong to a specific object, e.g. all pixels that are part of a car or the road.
- Document Clustering: Within a document, an attempt is made to find passages with a similar content focus.
- Recommendation Engine: When recommending products, similar customers are searched for with the help of k-Nearest Neighbors, and their purchased products are suggested to the respective customer if they have not yet purchased them.
- Healthcare: When testing drugs and their effectiveness, KNN is used to find particularly similar patients; the drug is then administered to one of them but not the other. This makes it possible to compare which effects were triggered by the drug and which would have occurred anyway.
What are the advantages and disadvantages of the k-Nearest Neighbor algorithm?
The k-Nearest Neighbor model enjoys great popularity because it is easy to understand and apply. Moreover, there are only two hyperparameters to tune, namely the number of neighbors k and the distance metric. On the other hand, this is of course also a disadvantage, since it leaves little room to adapt the algorithm to the concrete application.
Despite its simplicity, the k-Nearest Neighbors algorithm requires a lot of time and memory for large data sets, since it stores the entire training set and computes distances to all points at prediction time. This quickly becomes a cost factor for larger projects, which is why such projects often rely on other models, such as k-Means clustering.
What is the difference between k-nearest neighbors and k-means?
Although the names of the k-Nearest Neighbors algorithm and k-Means clustering sound very similar at first, they actually have relatively little in common and are used for completely different applications. The k in k-Means Clustering describes the number of classes into which the algorithm divides a data set. In k-Nearest Neighbors, on the other hand, the k stands for the number of neighbors that are used to determine the class of the new data point.
Furthermore, the k-Nearest Neighbors model is a supervised learning model, since it requires existing group assignments in order to classify new points. k-Means clustering, on the other hand, is an unsupervised learning algorithm, since it recognizes different groups independently based on the structures in the data and assigns the data points to these classes.
How can you do k-nearest neighbor matching in Python?
K-Nearest Neighbor (K-NN) matching is a useful technique for finding similar data points within a dataset. In this section, we’ll walk you through the process of performing K-NN matching in Python using a public dataset and provide code examples.
Step 1: Import Libraries
Begin by importing the necessary Python libraries, including numpy for numerical operations and sklearn for the K-NN implementation.
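A minimal version of this step might look like this (the later snippets continue from these imports):

```python
# numpy for numerical operations, scikit-learn for the K-NN implementation
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
```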

Step 2: Load a Public Dataset
For this example, let’s use the Iris dataset, a well-known dataset available in the scikit-learn library.
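The dataset can be loaded directly from scikit-learn; here we only keep the feature matrix, since nearest-neighbor matching does not need the labels:

```python
# Load the Iris dataset and keep only the feature matrix
iris = load_iris()
X = iris.data  # 150 samples with 4 measurements each
```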

Step 3: Create a k-nearest neighbor Model
Next, create a K-NN model using scikit-learn’s NearestNeighbors class. You’ll need to specify the number of neighbors (K) you want to consider.
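Continuing the sketch, with K = 3 as an arbitrary choice:

```python
# Create a K-NN model with K = 3 neighbors and fit it to the data
knn = NearestNeighbors(n_neighbors=3)
knn.fit(X)
```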

Step 4: Perform K-NN Matching
Now that you have a K-NN model trained on your dataset, you can use it to find the K nearest neighbors for a given data point. In this example, we’ll find the nearest neighbors for a random data point from the dataset.
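One possible implementation, picking the query point at random from the feature matrix (the random seed is arbitrary; since the query point comes from the dataset itself, its closest neighbor will be the point itself at distance 0):

```python
# Pick a random data point from the dataset as the query point
rng = np.random.default_rng(0)
query_point = X[rng.integers(len(X))].reshape(1, -1)

# Find its K nearest neighbors in the dataset
distances, indices = knn.kneighbors(query_point)
```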

Step 5: Review Results
You can now review the results of the K-NN matching. distances will contain the distances from the query point to its K nearest neighbors, and indices will contain the indices of those neighbors in the original dataset.
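For example, continuing from the previous snippets:

```python
# Distances and row indices of the K nearest neighbors
print("Distances:", distances)
print("Indices:", indices)

# Look up the actual neighbor rows in the feature matrix
print(X[indices[0]])
```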

Step 6: Interpretation and Further Analysis
The results will provide you with the K nearest neighbors to the query point along with their distances. You can interpret these results for various purposes, such as finding similar data points, clustering, or outlier detection.
This example demonstrates the basic steps to perform K-NN matching in Python using a public dataset. You can apply the same principles to your own datasets and use cases. K-NN matching is a versatile technique with applications in various domains, including recommendation systems, image analysis, and anomaly detection.
How can you improve the results of k-nearest neighbor?
K-NN, while a straightforward algorithm, can benefit from careful considerations and adjustments to optimize its performance. Here are key strategies for improving K-NN:
- Optimal K-Value Selection: Choosing the right number of neighbors (K) is critical. A small K may result in noisy predictions, while a large K can introduce bias. Utilize techniques like cross-validation to strike the right balance and determine the ideal K for your dataset (see the sketch after this list).
- Feature Selection and Engineering: The quality of your features plays a pivotal role in K-NN’s success. Identify the most relevant features and consider feature engineering to create new attributes that enhance the algorithm’s ability to discern patterns in your data.
- Distance Metrics: The choice of a distance metric, such as Euclidean or Manhattan, is vital. It impacts how k-nearest neighbor perceives feature scales and data distribution. Experiment with different metrics or employ custom distance measures tailored to your domain’s characteristics.
- Scaling Features: Features often exhibit different scales, which can affect K-NN’s performance. Standardizing features (z-score normalization) or scaling them to a consistent range (min-max scaling) prevents certain features from exerting undue influence.
- Data Preprocessing: Dealing with missing values and outliers is imperative. Employ techniques like imputation for missing data and robust distance measures for handling outliers, bolstering K-NN’s resilience to noisy data.
- Weighted K-NN: Standard K-NN treats all neighbors equally. Weighted K-NN assigns varying weights based on distances, allowing closer neighbors to exert more influence on predictions, which can be advantageous in scenarios where certain neighbors hold greater significance.
- Dimensionality Reduction: High-dimensional data can pose challenges due to the “curse of dimensionality.” Dimensionality reduction techniques like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) preserve essential information while reducing dimensionality, enhancing K-NN’s efficiency.
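As an illustration of the K selection, distance metric, and scaling points above, here is a minimal sketch that combines feature scaling with a cross-validated search over K and the metric on the Iris data; the candidate values are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale the features first so no single feature dominates the distance,
# then classify with K-NN
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Search over K and the distance metric with 5-fold cross-validation
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],
    "knn__metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
```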
By implementing these strategies thoughtfully, tailored to your specific dataset and problem, you can significantly enhance the performance and reliability of the K-NN algorithm. Remember that experimentation and iterative refinement are key to achieving optimal results in your machine learning endeavors.
This is what you should take with you
- k-Nearest Neighbors is a supervised learning algorithm that uses distance calculations between data points to divide them into groups.
- A new point can be assigned to a group by looking at the k neighboring data points and using their majority class.
- Such a classification method can be useful for navigating large data sets, making product recommendations for new customers, or for dividing test and control groups in medical trials.
Other Articles on the Topic of k-Nearest Neighbors
IBM has written an interesting post on the k-Nearest Neighbor model.