Machine Learning Basics with the K-Nearest Neighbors Algorithm

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. Pause! Let us unpack that.


A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm) is one that relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data.

Imagine that a computer is a child, we are its supervisor (e.g. a parent, guardian, or teacher), and we want the child (computer) to learn what a pig looks like. We will show the child several different pictures, some of which are pigs and the rest of which could be pictures of anything else (cats, dogs, etc.).

When we see a pig, we shout “pig!” When it’s not a pig, we shout “no, not pig!” After doing this several times with the child, we show them a picture and ask “pig?” and they will correctly (most of the time) say “pig!” or “no, not pig!” depending on what the picture is. That is supervised machine learning.


Supervised machine learning algorithms are used to solve classification or regression problems.

A classification problem has a discrete value as its output. For example, “likes pineapple on pizza” and “does not like pineapple on pizza” are discrete. There is no middle ground. The analogy above of teaching a child to identify a pig is another example of a classification problem.

[Image: example classification data, with age as the predictor and a 0/1 "likes pineapple on pizza" label]

This image shows a basic example of what classification data might look like. We have a predictor (or set of predictors) and a label. In the image, we might be trying to predict whether someone likes pineapple (1) on their pizza or not (0) based on their age (the predictor).

It is standard practice to represent the output (label) of a classification algorithm as an integer such as 1, -1, or 0. In this instance, these numbers are purely representational. Mathematical operations should not be performed on them because doing so would be meaningless. Think for a moment. What is “likes pineapple” + “does not like pineapple”? Exactly. We cannot add them, so we should not add their numeric representations.

A regression problem has a real number (a number with a decimal point) as its output. For example, we could use the data in the table below to estimate someone’s weight given their height.

[Image: example regression data, with height as the independent variable and weight as the dependent variable]

Data used in a regression analysis will look similar to the data shown in the image above. We have an independent variable (or set of independent variables) and a dependent variable (the thing we are trying to guess given our independent variables). For instance, we could say height is the independent variable and weight is the dependent variable.

Also, each row is typically called an example, observation, or data point, while each column (not including the label/dependent variable) is often called a predictor, dimension, independent variable, or feature.

An unsupervised machine learning algorithm makes use of input data without any labels; in other words, there is no teacher (label) telling the child (computer) when it is right or when it has made a mistake so that it can self-correct.

Unlike supervised learning that tries to learn a function that will allow us to make predictions given some new unlabeled data, unsupervised learning tries to learn the basic structure of the data to give us more insight into the data.

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.

“Birds of a feather flock together.”

[Image: scatter plot of colored data points, with similar points clustered near each other]

Notice in the image above that most of the time, similar data points are close to each other. The KNN algorithm hinges on this assumption being true enough for the algorithm to be useful. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in our childhood: calculating the distance between points on a graph.

Note: An understanding of how we calculate the distance between points on a graph is necessary before moving on. If you are unfamiliar with or need a refresher on how this calculation is done, thoroughly read “Distance Between 2 Points” in its entirety, and come right back.

There are other ways of calculating distance, and one way might be preferable depending on the problem we are solving. However, the straight-line distance (also called the Euclidean distance) is a popular and familiar choice.
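For concreteness, here is a minimal sketch of that straight-line calculation in Python; it assumes the two points are plain sequences of numbers with the same number of coordinates.

```python
import math

def euclidean_distance(point1, point2):
    # Straight-line distance: square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```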

The KNN Algorithm

  1. Load the data
  2. Initialize K to your chosen number of neighbors
  3. For each example in the data:
     3.1. Calculate the distance between the query example and the current example from the data.
     3.2. Add the distance and the index of the example to an ordered collection.
  4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances.
  5. Pick the first K entries from the sorted collection.
  6. Get the labels of the selected K entries.
  7. If regression, return the mean of the K labels.
  8. If classification, return the mode of the K labels.

The KNN implementation (from scratch)
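The original embedded implementation is not reproduced here. Below is a minimal from-scratch sketch in Python that follows the numbered steps above; it assumes each row of the data is a list whose last entry is the label and whose remaining entries are numeric features. The helper names (knn, mean, mode, euclidean_distance) are this sketch's own, and the toy data at the bottom is purely illustrative.

```python
from collections import Counter
import math

def euclidean_distance(point1, point2):
    # Straight-line distance between two equal-length sequences of numbers.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

def mean(labels):
    # Used for regression (step 7).
    return sum(labels) / len(labels)

def mode(labels):
    # Used for classification (step 8).
    return Counter(labels).most_common(1)[0][0]

def knn(data, query, k, distance_fn, choice_fn):
    neighbor_distances_and_indices = []

    # 3. For each example in the data...
    for index, example in enumerate(data):
        # 3.1 Calculate the distance between the query and the current example
        #     (the last entry of each row is the label, so it is excluded).
        distance = distance_fn(example[:-1], query)
        # 3.2 Add the distance and the index of the example to an ordered collection.
        neighbor_distances_and_indices.append((distance, index))

    # 4. Sort the collection from smallest to largest by distance.
    # 5. Pick the first K entries from the sorted collection.
    k_nearest = sorted(neighbor_distances_and_indices)[:k]

    # 6. Get the labels of the selected K entries.
    k_nearest_labels = [data[index][-1] for _, index in k_nearest]

    # 7./8. Return the mean (regression) or the mode (classification) of the K labels.
    return k_nearest, choice_fn(k_nearest_labels)

# Toy regression data (illustrative only): [height, weight].
reg_data = [[65.0, 113.0], [67.0, 128.0], [68.0, 142.0], [70.0, 155.0], [72.0, 164.0]]
_, predicted_weight = knn(reg_data, query=[66.0], k=3,
                          distance_fn=euclidean_distance, choice_fn=mean)

# Toy classification data (illustrative only): [age, likes pineapple on pizza (1/0)].
clf_data = [[22, 1], [23, 1], [21, 1], [18, 1], [19, 0],
            [25, 0], [27, 0], [29, 0], [31, 1], [45, 0]]
_, predicted_like = knn(clf_data, query=[33], k=3,
                        distance_fn=euclidean_distance, choice_fn=mode)

print(predicted_weight, predicted_like)
```

Passing distance_fn and choice_fn as arguments keeps the same function usable for regression, classification, and (as we will see later) plain nearest-neighbor search.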

Choosing the right value for K

To select the K that is right for our data, we run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors we encounter while maintaining the algorithm's ability to make accurate predictions on data it has not seen before.
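For example, a minimal sketch of that search (using scikit-learn and its bundled iris data set purely as stand-ins; our own data would work the same way) might loop over candidate values of K and keep the one with the lowest cross-validated error:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_error = None, float("inf")
for k in range(1, 26):
    # Mean accuracy across 5 folds; error is 1 - accuracy.
    accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if 1 - accuracy < best_error:
        best_k, best_error = k, 1 - accuracy

print(f"best K = {best_k}, cross-validated error = {best_error:.3f}")
```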

Here are some things to keep in mind:

  1. As we decrease the value of K to 1, our predictions become less stable. Just think for a minute: imagine K=1 and a query point surrounded by several reds and one green (I’m thinking about the top left corner of the colored plot above), where the green point happens to be the single nearest neighbor. Reasonably, we would think the query point is most likely red, but because K=1, KNN incorrectly predicts that it is green.
  2. Inversely, as we increase the value of K, our predictions become more stable due to majority voting (or averaging) and are thus more likely to be accurate (up to a certain point). Eventually, however, we begin to witness an increasing number of errors. It is at this point that we know we have pushed the value of K too far.
  3. In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker.

Advantages

  1. The algorithm is simple and easy to implement.
  2. There’s no need to build a model, tune several parameters, or make additional assumptions.
  3. The algorithm is versatile. It can be used for classification, regression, and search (as we will see in the next section).

Disadvantages

  1. The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.

KNN’s main disadvantage of becoming significantly slower as the volume of data increases makes it an impractical choice in environments where predictions need to be made rapidly. Moreover, there are faster algorithms that can produce more accurate classification and regression results.

However, provided you have sufficient computing resources to speedily handle the data you are using to make predictions, KNN can still be useful in solving problems that have solutions that depend on identifying similar objects. An example of this is using the KNN algorithm in recommender systems, an application of KNN-search.

Recommender Systems

At scale, this would look like recommending products on Amazon, articles on Medium, movies on Netflix, or videos on YouTube, although we can be certain they all use more efficient means of making recommendations due to the enormous volume of data they process.

However, we could replicate one of these recommender systems on a smaller scale using what we have learned here in this article. Let us build the core of a movies recommender system.

What question are we trying to answer?

Given our movies data set, what are the 5 most similar movies to a movie query?

Gather movies data

If we worked at Netflix, Hulu, or IMDb, we could grab the data from their data warehouse. Since we don’t work at any of those companies, we have to get our data through some other means. We could use some movies data from the UCI Machine Learning Repository, IMDb’s data set, or painstakingly create our own.

Explore, clean, and prepare the data

Wherever we obtained our data, there may be some things wrong with it that we need to correct to prepare it for the KNN algorithm. For example, the data may not be in the format that the algorithm expects, or there may be missing values that we should fill or remove from the data before piping it into the algorithm.

Our KNN implementation above relies on structured data. It needs to be in a table format. Additionally, the implementation assumes that all columns contain numerical data and that the last column of our data has labels that we can perform some function on. So, wherever we got our data from, we need to make it conform to these constraints.
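As a rough sketch of that kind of preparation (assuming pandas and a hypothetical raw file named movies.csv; the column handling here is illustrative, not the article's actual cleaning code):

```python
import pandas as pd

df = pd.read_csv("movies.csv")                 # hypothetical raw file
numeric = df.select_dtypes(include="number")   # the KNN sketch expects numeric columns only
numeric = numeric.fillna(numeric.mean())       # fill missing values with column means
numeric["label"] = 0                           # placeholder label column, kept last
data = numeric.values.tolist()                 # list-of-lists format the knn() sketch expects
```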

The data below is an example of what our cleaned data might resemble. The data contains thirty movies, including data for each movie across seven genres and their IMDb ratings. The labels column has all zeros because we aren’t using this data set for classification or regression.
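The table itself is not reproduced here, but as a purely illustrative layout (the movie names, genre columns, and ratings below are placeholders, not the actual data set), the cleaned file might look like this:

```
ID,Movie,IMDb Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label
1,Movie A,8.0,1,1,0,0,0,0,1,0
2,Movie B,7.2,0,1,1,0,1,0,0,0
3,Movie C,6.9,0,0,0,1,0,0,0,0
...
30,Movie Z,7.5,0,1,0,0,0,1,0,0
```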

Additionally, there are relationships among the movies that will not be accounted for (e.g. actors, directors, and themes) when using the KNN algorithm, simply because the data that captures those relationships is missing from the data set. Consequently, when we run the KNN algorithm on our data, similarity will be based solely on the included genres and the IMDb ratings of the movies.

Use the algorithm

Imagine for a moment. We are navigating the MoviesXb website, a fictional IMDb spin-off, and we encounter The Post. We aren’t sure we want to watch it, but its genres intrigue us; we are curious about other similar movies. We scroll down to the “More Like This” section to see what recommendations MoviesXb will make, and the algorithmic gears begin to turn.

The MoviesXb website sends a request to its back-end for the 5 movies that are most similar to The Post. The back-end has a recommendation data set exactly like ours. It begins by creating the row representation (better known as a feature vector) for The Post, then it runs a program similar to the one below to search for the 5 movies that are most similar to The Post, and finally sends the results back to the MoviesXb website.
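That program is not embedded here, but a minimal sketch of the search might look like the following. It reuses the knn and euclidean_distance helpers from the from-scratch sketch earlier, and it assumes a hypothetical movies_recommendation_data.csv laid out like the illustration above (an ID and title column first, then the numeric rating, genre, and label columns); the feature vector given for The Post is likewise illustrative.

```python
def recommend_movies(movie_query, k_recommendations):
    # Read the raw rows, skipping the header and keeping titles around for display.
    raw_movies_data = []
    with open("movies_recommendation_data.csv", "r") as md:
        next(md)  # skip the header line
        for line in md:
            raw_movies_data.append(line.strip().split(","))

    # Keep only the numeric columns the knn() sketch expects:
    # drop ID and title, keep rating + genre flags + the all-zero label.
    movies_recommendation_data = [list(map(float, row[2:])) for row in raw_movies_data]

    # choice_fn is a no-op here: for a search we only want the neighbors themselves.
    recommendation_indices, _ = knn(
        movies_recommendation_data, movie_query, k=k_recommendations,
        distance_fn=euclidean_distance, choice_fn=lambda labels: None,
    )

    # Map the neighbor indices back to the original rows, which still carry titles.
    return [raw_movies_data[index] for _, index in recommendation_indices]

if __name__ == "__main__":
    # Illustrative feature vector for The Post: IMDb rating followed by seven genre flags.
    the_post = [7.2, 1, 1, 0, 0, 0, 0, 1]
    for movie in recommend_movies(movie_query=the_post, k_recommendations=5):
        print(movie[1])  # print each recommended movie's title
```

Because choice_fn is ignored, this is KNN used purely as a search: the K nearest feature vectors are the recommendations.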

When we run this program, we see that MoviesXb recommends 12 Years A Slave, Hacksaw Ridge, Queen of Katwe, The Wind Rises, and A Beautiful Mind. Now that we fully understand how the KNN algorithm works, we can explain exactly how it came to make these recommendations. Congratulations!

The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It’s easy to implement and understand, but has a major drawback of becoming significantly slower as the size of the data in use grows.

KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).

In the case of classification and regression, we saw that choosing the right K for our data is done by trying several Ks and picking the one that works best.

Finally, we looked at an example of how the KNN algorithm could be used in recommender systems, an application of KNN-search.


[1] For the sake of simplicity, the KNN movie recommender implemented in this article does not handle the case where the movie query might be part of the recommendation data set. This might be unreasonable in a production system and should be dealt with appropriately.

If you learned something new or enjoyed reading this article, please clap it up 👏 and share it so that others will see it. Feel free to leave a comment too.


FAQs

What is the k-nearest neighbor algorithm in machine learning?

The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. It is one of the simplest and most popular methods used for classification and regression in machine learning today.

Do we need a learning method for a K-nearest neighbors algorithm?

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

What is the best way to choose k in KNN?

There is no single formula. The usual approach is to run the algorithm with several different values of K (for example, using a validation set or cross-validation) and pick the K that gives the fewest errors on data the model has not seen before. Very small values of K make predictions unstable, very large values wash out local structure, and for classification an odd K is often preferred so that majority votes cannot tie.

How do you evaluate the k-nearest neighbors algorithm?

Two common approaches are cross-validation and a training/testing split. Both are useful for evaluating the performance of the KNN algorithm, and the choice between them depends on the specific problem and the available resources.
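As a rough illustration of both approaches (using scikit-learn and its bundled iris data set as stand-ins for whatever data set is at hand):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Training/testing split: fit on 80% of the data, score on the held-out 20%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# 5-fold cross-validation: average accuracy across five different train/test partitions.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```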

Why is KNN called a lazy learner?

K-NN is a non-parametric algorithm, which means that it does not make any assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the data set and, at the time of classification, performs an action on the stored data.

When not to use KNN?

One such situation is when dealing with large datasets and high-dimensional data, as KNN becomes computationally expensive and less effective in these cases. Another situation is when the classes in the dataset are highly unbalanced, with one class having significantly fewer examples than the others.

How can I improve my KNN algorithm?

One way to improve the KNN algorithm is to select the most relevant features for the classification task. This can reduce the dimensionality of the data, speed up the computation, and avoid the curse of dimensionality. Feature selection can be done using various methods, such as filter, wrapper, or embedded approaches.
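A minimal sketch of one such approach, a filter method (again using scikit-learn and the iris data set purely as stand-ins), keeps only the features with the highest univariate ANOVA F-scores before fitting KNN:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Keep the 2 features that best separate the classes, then fit KNN on the reduced data.
X_reduced = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_reduced, y)
print(X_reduced.shape)  # (150, 2)
```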

Which algorithm is better than KNN?

It depends on the task, but in comparisons on image classification, for example, while both algorithms can classify accurately, an SVM often provides significantly better classification accuracy and classification speed than KNN.

What is the difference between KNN and K-Means?

KNN is a supervised learning algorithm mainly used for classification problems, whereas K-Means (aka K-means clustering) is an unsupervised learning algorithm. K in K-Means refers to the number of clusters, whereas K in KNN is the number of nearest neighbors (based on the chosen distance metric).

What are the disadvantages of KNN?

The KNN algorithm has limitations in terms of scalability and the training process. It can be computationally expensive for large datasets, and the memory requirements can be significant. Additionally, KNN does not explicitly learn a model and assumes equal importance of all features.

What are the difficulties with the k-nearest neighbor algorithm?

KNN has some drawbacks and challenges, such as computational expense, slow speed, memory and storage issues for large datasets, sensitivity to the choice of k and the distance metric, and susceptibility to the curse of dimensionality.

What causes overfitting in KNN?

To avoid overfitting and underfitting in KNN, it is important to choose an appropriate value for K. A small K value can lead to overfitting, while a large K value can result in underfitting. The optimal K value depends on the specific dataset and problem at hand.

What is the KBANN algorithm in machine learning?

KBANN (Knowledge-Based Artificial Neural Networks) is a hybrid learning system built on top of connectionist learning techniques. It maps problem-specific “domain theories”, represented in propositional logic, into neural networks and then refines this reformulated knowledge using backpropagation.

What is the algorithm of the nearest neighbor?

Algorithm
  1. Initialize all vertices as unvisited.
  2. Select an arbitrary vertex, set it as the current vertex u, and mark u as visited.
  3. Find the shortest edge connecting the current vertex u and an unvisited vertex v.
  4. Set v as the current vertex u and mark v as visited.
  5. If all the vertices in the domain are visited, terminate; otherwise, go to step 3.
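A small sketch of this greedy procedure in Python (the distance matrix below is a made-up four-vertex example, used only to show the mechanics):

```python
def nearest_neighbor_tour(distance_matrix, start=0):
    """Greedy nearest-neighbor tour over a symmetric distance matrix."""
    n = len(distance_matrix)
    unvisited = set(range(n)) - {start}  # initialize all other vertices as unvisited
    tour = [start]
    current = start
    while unvisited:
        # Find the shortest edge from the current vertex to an unvisited vertex.
        nxt = min(unvisited, key=lambda v: distance_matrix[current][v])
        tour.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    return tour

# Toy 4-vertex example (distances are illustrative).
d = [[0, 2, 9, 10],
     [2, 0, 6, 4],
     [9, 6, 0, 8],
     [10, 4, 8, 0]]
print(nearest_neighbor_tour(d))  # [0, 1, 3, 2]
```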

What is the KNN algorithm equation?

The k-nearest neighbor classifier fundamentally relies on a distance metric: the better that metric reflects label similarity, the better the classifier will be. The most common choice is the Minkowski distance, dist(x, z) = (\sum_{r=1}^{d} |x_r - z_r|^p)^{1/p}.
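A small sketch of that formula in Python (assuming the two points are equal-length sequences of numbers; p = 2 gives the Euclidean distance and p = 1 the Manhattan distance):

```python
def minkowski_distance(x, z, p=2):
    # dist(x, z) = (sum over r of |x_r - z_r|^p) ^ (1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1 / p)

print(minkowski_distance((0, 0), (3, 4), p=2))  # 5.0 (Euclidean)
print(minkowski_distance((0, 0), (3, 4), p=1))  # 7.0 (Manhattan)
```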

What is the conclusion about the KNN algorithm in machine learning?

Conclusion. The KNN algorithm in machine learning is a simple, yet versatile supervised algorithm that can be used to solve both classification and regression problems.
