Visualize error rate vs. K plot to find the most suitable K value.
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression. It stores the training data and classifies new test data based on a distance metric: it finds the K nearest neighbors of a test point and assigns the class held by the majority of them.
Selecting the K value that maximizes model accuracy is always a challenge for a data scientist. I assume you know the basic idea behind KNN, but I will give a short overview later in this article. For a comprehensive explanation of how the algorithm works, I suggest going through the article below:
In this article, I will demonstrate a practical approach to finding a good value of K for the KNN algorithm.
Table of Contents
- Overview of KNN
- Distance Metrics
- How to choose a K value?
- KNN model implementation
- Key Takeaways
1. Overview of KNN
Using K-Nearest Neighbors, we predict the category of a test point from the available class labels by measuring the distance between the test point and the training points, then looking at the K nearest ones. This raises an obvious question:
How to calculate the distance?
Let me answer that question in the next section on distance metrics.
2. Distance Metrics
The distance metric is a hyperparameter that defines how we measure the distance between the training feature values and new test inputs.
Usually we use the Euclidean distance, the most widely used measure, to calculate the distance between test samples and training data. It is the length of the straight line from point (x1, y1) to point (x2, y2), that is, the square root of (x2 - x1)^2 + (y2 - y1)^2.
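As a minimal sketch of that calculation, the snippet below computes the Euclidean distance in plain Python. The function name euclidean_distance and the sample points are made up for illustration; they are not taken from the original code.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical points (x1, y1) = (1, 2) and (x2, y2) = (4, 6)
d = euclidean_distance((1, 2), (4, 6))
print(d)  # sqrt(3**2 + 4**2) = 5.0
```

The same function works for any number of features, which is what KNN needs when each row has more than two columns.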
With the Euclidean distance in hand, let’s throw some light on the prediction method in KNN.
To classify an unknown record:
- Initialize the K value.
- Calculate the distance between the test input and every training point, and keep the K nearest neighbors.
- Check the class categories of those nearest neighbors to determine which class the test input falls into.
- Classify by majority vote among the neighbors.
- Return the class category.
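The steps above can be sketched from scratch in a few lines of Python. This is an illustrative version, not the author's code: the function knn_predict and the toy points are hypothetical, and ties in the vote are broken by whichever label was counted first.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, paired with that point's label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k smallest distances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of the nearest neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, query=(2.0, 2.0), k=3))  # A
print(knn_predict(train_points, train_labels, query=(6.0, 7.0), k=3))  # B
```

Note that math.dist requires Python 3.8 or later.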
We understand the process of classifying an unknown record, but what about choosing an optimal K value?
Let’s answer it.
3. How to choose a K value?
The K value is the number of nearest neighbors to consult. For each prediction, we have to compute the distances between the test point and the labeled training points. KNN has no real training phase: it defers all of this work to prediction time, which is why it is called a lazy learning algorithm and why prediction is computationally expensive.
- In the illustration for this section, if we proceed with K=3, we predict that the test input belongs to class B, and if we continue with K=7, we predict that it belongs to class A.
- This shows that the K value has a powerful effect on KNN performance.
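To see how the prediction can flip with K, here is a small sketch on hypothetical one-dimensional data. The points are invented so that the two closest neighbors are class B and the next five are class A; they are not the points from the illustration.

```python
from collections import Counter

# Hypothetical 1-D training points as (position, class) pairs; the test input sits at 0.0
training = [(1.0, "B"), (2.0, "B"),
            (3.0, "A"), (4.0, "A"), (5.0, "A"), (6.0, "A"), (7.0, "A")]
test_input = 0.0

def predict(k):
    # Sort by distance to the test input and vote among the k closest labels
    nearest = sorted(training, key=lambda item: abs(item[0] - test_input))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(predict(3))  # B: neighbors are B, B, A
print(predict(7))  # A: neighbors are B, B, A, A, A, A, A
```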
Then how to select the optimal K value?
- There are no pre-defined statistical methods to find the most favorable value of K.
- Start from an arbitrary K value and begin computing.
- A small K leads to unstable decision boundaries.
- A larger K smooths the decision boundaries, which often helps classification, although a K that is too large can blur the distinction between classes.
- Plot the error rate against K over a defined range of values, then choose the K with the minimum error rate.
Implementing the model will make the idea of choosing the optimal K value concrete.
4. KNN model implementation
Let’s start by importing all the required packages, then read the telecommunication data file using the read_csv() function.
The data has 12 columns: region, tenure, age, marital, address, income, ed, employ, retire, gender, reside, and custcat. The target column, ‘custcat’, categorizes the customers into four groups:
- 1: Basic Service
- 2: E-Service
- 3: Plus Service
- 4: Total Service
- We collect all independent features into the X data-frame and the target field into y. Then we normalize the features.
- We split the data, taking 80% for training and the remaining 20% for testing.
- We import the classifier from the sklearn library and fit it with K=4, which gives an accuracy of 0.32.
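The steps above can be sketched with scikit-learn as follows. This is a hedged illustration, not the code from the repository: it uses synthetic stand-in data from make_classification instead of the telecom file, so the printed accuracy will not match the 0.32 reported here. With the real file, X would hold the eleven feature columns and y the custcat column. The random_state values are arbitrary, and fitting the scaler on the training split only is my choice, which may differ from the original.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in data: 1000 synthetic samples, 11 features, 4 classes (NOT the telecom file)
X, y = make_classification(
    n_samples=1000, n_features=11, n_informative=6, n_redundant=0,
    n_classes=4, random_state=4,
)

# Hold out 20% for testing, keep 80% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Normalize: fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit KNN with K=4 and score it on the held-out split
knn = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train)
accuracy = accuracy_score(y_test, knn.predict(X_test))
print(f"Test accuracy at K=4: {accuracy:.2f}")
```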
Now it’s time to improve the model and find out the optimal k value.
From the plot, you can see that the smallest error is 0.59 at K=37. Next, we plot accuracy against the K value.
Now you see the improved results: an accuracy of 0.41 at K=37. This agrees with the error plot, since accuracy is simply 1 minus the error rate, so the K with the minimum error is also the K with the best accuracy.
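The sweep behind both plots can be sketched as a loop over K. Again, this is an assumption-laden illustration on synthetic stand-in data: the search range of 1 to 40 is assumed, and the best K it prints will not be the 37 found on the telecom data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (NOT the telecom file): 1000 samples, 11 features, 4 classes
X, y = make_classification(
    n_samples=1000, n_features=11, n_informative=6, n_redundant=0,
    n_classes=4, random_state=4,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

k_values = list(range(1, 41))  # assumed search range
error_rates = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    predictions = knn.predict(X_test)
    # Error rate = fraction of test points classified incorrectly
    error_rates.append(float(np.mean(predictions != y_test)))

# Accuracy is the complement of the error rate
accuracies = [1.0 - e for e in error_rates]
best_index = int(np.argmin(error_rates))
best_k = k_values[best_index]
print(f"Minimum error {error_rates[best_index]:.2f} at K={best_k}")
print(f"Accuracy at that K: {accuracies[best_index]:.2f}")
```

To draw the curves, pass k_values with error_rates (or accuracies) to matplotlib's plt.plot. Picking K on the same split used for the final score tends to be optimistic, so a separate validation split or cross-validation is a more careful variant.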
Our principal focus here is on determining the optimal K value, but you can perform exploratory data analysis and may achieve even greater accuracy. The data file and code are available in my GitHub repository.
5. Key Takeaways
- We obtained an accuracy of 0.41 at K=37, which is higher than the accuracy of 0.32 obtained at K=4.
- A very small K tends to give unstable, noise-sensitive classifications.
- A common rule of thumb puts K near the square root of N, where N is the total number of samples; treat it as a starting point rather than a guarantee.
- Use an error plot or accuracy plot to find the most favorable K value.
- KNN handles multi-class problems well, but it is sensitive to outliers.
- KNN is used broadly in the area of pattern recognition and analytical evaluation.
That’s all folks,
See you in my next article.
References: Elbow Method in Supervised Machine Learning