
K-Nearest Neighbor (KNN)

The K-Nearest Neighbor (KNN) algorithm is a simple but widely used machine learning technique for classification and regression. KNN classifies a given input by finding the K most similar (closest) data points to that input and then using either the most common class among those points (for classification) or their average value (for regression) as the prediction.

  • KNN classifies points based on the similarity of nearby data points.
  • KNN uses distance metrics (such as Euclidean distance) to calculate how far apart data points are and therefore determine which points are the nearest neighbors.
  • KNN is an instance-based learning method and makes no assumptions about the underlying distribution of the input data, unlike many other machine learning algorithms.

The K-Nearest Neighbors (KNN) algorithm is often described as a “lazy learner” because it does not build a model from the training data ahead of time. Instead, KNN simply stores the entire dataset and, at prediction time, performs all of the computation needed to produce an answer.
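
As a rough illustration of this “lazy” behavior (a minimal sketch assuming scikit-learn and NumPy are available; the tiny dataset is made up for the example), fit() essentially just stores the training set, while predict() does the distance computations and voting:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up dataset: two clusters of points with labels 0 and 1.
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                      # no coefficients are learned; the data is simply stored/indexed
print(knn.predict([[1.1, 1.0]]))   # distances and voting happen now -> [0]
```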

For example, consider data points plotted on two features, where each point belongs to one of two classes: Category 1 or Category 2.


KNN assigns a category to a new point based on the points nearest to it. As shown in the image below, KNN predicts whether or not a new point belongs to a category based on the location of its closest neighbors.

In the image, green points belong to Category 1 and red points belong to Category 2. The new point is located where the circles are drawn. Looking at its neighbors, most of them are red (Category 2), so KNN determines that the new point belongs to Category 2.

KNN works by using proximity and majority voting to make predictions.

What does ‘K’ mean in K-Nearest Neighbors?
In the K-Nearest Neighbors (KNN) algorithm, ‘k’ simply refers to how many neighboring points (the closest points) the algorithm takes into account when deciding how to classify a new point of interest. This value is defined by you.

For example, let’s say you want to determine what kind of fruit you have based only on its size and shape, by comparing it against fruit you already recognize.

If you define k = 3, the algorithm finds the 3 points nearest to the new point. If two of those 3 points are apples, the algorithm concludes that the new point is an apple, because the majority of its nearest neighbors are apples.
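
A minimal sketch of this majority vote, assuming NumPy is available; the size/shape measurements and labels below are purely illustrative:

```python
import numpy as np
from collections import Counter

# Known fruit: [size_cm, roundness] with labels (values are made up).
fruits = np.array([[7.0, 0.9], [7.5, 0.95], [6.8, 0.85], [12.0, 0.4], [11.5, 0.35]])
labels = ["apple", "apple", "apple", "banana", "banana"]

new_fruit = np.array([7.2, 0.88])
k = 3

# Euclidean distance from the new fruit to every known fruit.
distances = np.linalg.norm(fruits - new_fruit, axis=1)
nearest = [labels[i] for i in np.argsort(distances)[:k]]

# Majority vote among the k nearest neighbors.
print(Counter(nearest).most_common(1)[0][0])   # -> "apple"
```
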
How do I choose the value of ‘k’ for my KNN Algorithm?
The value of k you provide to your KNN algorithm determines how many neighbors are evaluated when predicting the output for an input point.

Selecting a good value of k is important for obtaining accurate results. If your input data contains substantial noise or significant outliers, choosing a larger value of k can result in more reliable predictions.

On the other hand, if the chosen value of k is too large, the model becomes overly simplistic and fails to capture important patterns (it underfits). For this reason, the value of k should depend on the characteristics of your input data.

Methods to Determine K #

  • Cross-Validation: k-fold cross-validation can help you find the most suitable value of k for KNN. The dataset is divided into several groups (folds); for each candidate value of k, a model is trained on some folds and evaluated on the remaining fold, rotating so that every fold is used for evaluation once. The candidate k with the highest mean accuracy across all evaluation rounds is the most suitable choice. A minimal sketch appears after this list.
  • Elbow Method: The elbow method is another way to determine a suitable k value, by plotting the error (or accuracy) obtained for each candidate value of k. As k increases, the error initially decreases; eventually, increasing k yields only minimal further reduction. The optimal k is where the error curve levels off, forming an “elbow.”
  • Odd Values for K: Choosing an odd value of k is recommended when KNN is used for binary classification, because it prevents ties in the majority vote.
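
As a rough sketch of the cross-validation approach (assuming scikit-learn is installed and using its bundled Iris dataset purely as an example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several candidate values of k with 5-fold cross-validation
# and keep the one with the highest mean accuracy.
scores = {}
for k in range(1, 16, 2):                      # odd values to avoid voting ties
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```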

Distance Metrics Used in KNN Algorithm #

In the K-Nearest Neighbors (KNN) algorithm, distance metrics play a key role in finding the closest data points (neighbors). These neighbors are then used to make predictions in classification and regression tasks.

To measure how close points are, KNN commonly uses the following distance metrics:

1. Euclidean Distance #

Euclidean distance is the most common way to measure distance. It represents the straight-line distance between two points in space.

You can imagine it as the shortest path between two points—like walking directly from one location to another.

d(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}
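
A small NumPy sketch of this formula (the function name is just illustrative):

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance: square root of the sum of squared differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean_distance([0, 0], [3, 4]))   # -> 5.0
```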

2. Manhattan Distance #

Manhattan distance measures distance along a grid. Instead of moving diagonally, you can only move horizontally or vertically—just like driving through city streets.

That’s why it’s also called taxicab distance.

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
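
The corresponding sketch for Manhattan distance (again with an illustrative function name):

```python
import numpy as np

def manhattan_distance(x, y):
    """Grid distance: sum of absolute differences along each axis."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

print(manhattan_distance([0, 0], [3, 4]))   # -> 7.0
```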

3. Minkowski Distance #

Minkowski distance is a generalized form of distance measurement. It combines both Euclidean and Manhattan distances into a single flexible formula.

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

  • When p = 1, it becomes Manhattan distance
  • When p = 2, it becomes Euclidean distance

So, Minkowski distance acts like a family of distance measures, allowing you to adjust how distance is calculated based on your needs.
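
A brief sketch showing that Minkowski distance reduces to Manhattan distance at p = 1 and Euclidean distance at p = 2 (the function name here is just for illustration):

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Generalized distance: p-th root of the sum of |differences|^p."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a, b = [0, 0], [3, 4]
print(minkowski_distance(a, b, p=1))   # 7.0  (Manhattan)
print(minkowski_distance(a, b, p=2))   # 5.0  (Euclidean)
```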

Working of KNN algorithm #

K-Nearest Neighbors (KNN) predicts the label or value of an unknown data point by looking at the closest K points (neighbors) in the training dataset. The algorithm uses a similarity (distance) measure to determine how closely the unknown point relates to the known points. The steps are outlined below, followed by a small from-scratch sketch.

  1. Choose a value of K
    K is the number of neighbors (the closest points) that will be used to calculate a prediction.
  2. Calculate distance
    Euclidean distance is commonly used to measure the similarity between the target point and the training data points. Distances are calculated between the target point and every point in the dataset.
  3. Find the nearest neighbors
    The k number of points in the data set with the smallest distance from the target point are referred to as the nearest neighbors.
  4. Vote for classification or average for regression
    In classification problems (e.g., spam or not spam), KNN will consider the K nearest neighbors to the data point being classified, and then find the most commonly occurring class among those neighbors (majority voting).
    In regression problems, KNN still finds the K nearest neighbors, but averages the values of those neighbors to arrive at a predicted value for the new, unknown data point.
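
The steps above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration rather than a production implementation; the function name and the tiny datasets are made up for the example:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the label (majority vote) or value (mean) of x_new from its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)

    # Step 3: indices of the k nearest neighbors.
    nearest_idx = np.argsort(distances)[:k]
    nearest_labels = [y_train[i] for i in nearest_idx]

    # Step 4: majority vote for classification, average for regression.
    if task == "classification":
        return Counter(nearest_labels).most_common(1)[0][0]
    return float(np.mean(nearest_labels))

# Made-up example data.
X_train = [[1, 1], [1, 2], [5, 5], [6, 5]]
print(knn_predict(X_train, ["A", "A", "B", "B"], [1.5, 1.5], k=3))                       # -> "A"
print(knn_predict(X_train, [10.0, 12.0, 50.0, 55.0], [5.5, 5.0], k=2, task="regression"))  # -> 52.5
```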

Conclusion: The Core of KNN #

K-Nearest Neighbors (KNN) is a very simple, “lazy” learning algorithm that makes predictions based on distance to existing, known data. It does not build an elaborate model and has no explicit training phase; KNN simply memorizes the training data and, when asked to make a prediction, searches that data for the examples most similar to the item being identified.

Five Big Ideas About KNN

  • k, the number of “nearest neighbors” to consider, determines how sensitive the KNN model is. A small k makes the model sensitive to noise; a large k makes it overly simple. Use odd values of k to avoid ties when classifying with KNN.
  • How you measure “closeness” is the basis of KNN’s accuracy. Whether you use Euclidean distance, Manhattan distance, or some other metric to determine how similar two examples are greatly affects how accurate KNN is.
  • KNN is a “dual-threat” algorithm: it can perform both classification (by majority voting) and regression (by averaging the values of the nearest neighbors).
  • KNN is a simple, easy-to-use algorithm with no “learning” time; however, it can be slow on large datasets and is sensitive to feature scale (see the brief sketch after this list).
  • KNN is an ideal first-line algorithm because it is easy to use, requires no assumptions about the distribution of the data, and has a straightforward rationale for the decisions it makes.
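
Because KNN relies on raw distances, features with large numeric ranges can dominate the result, so scaling the features first usually helps. A minimal sketch using scikit-learn (assumed installed), again on its bundled Iris data purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize each feature (zero mean, unit variance) before computing distances.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(model, X, y, cv=5).mean())
```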
