K-Nearest Neighbors (KNN) Classifier from Scratch

This repository contains a Python implementation of a K-Nearest Neighbors (KNN) classifier from scratch. The KNN classifier is applied to the "BankNote_Authentication" dataset, which consists of four features (variance, skew, curtosis, and entropy) and a class attribute indicating whether a banknote is real or forged.

Dataset

The dataset used for training and testing the KNN classifier is provided in the "BankNote_Authentication.csv" file. The dataset is loaded into a pandas DataFrame and then shuffled so that the subsequent train/test split does not depend on the row order of the file.

KNN Classifier Implementation

The KNN classifier is implemented in the KNN_Classifier class. The class takes the following inputs during initialization:

  • x_train: The training data features.
  • y_train: The training data labels.
  • x_test: The test data features.
  • k: The number of nearest neighbors to consider.

The KNN classifier consists of the following methods:

1. euclidean_distance

This method calculates the Euclidean distance between a training row and a test row. It takes two input vectors and computes the Euclidean distance according to the formula:

distance = sqrt(sum((x_train_row[i] - x_test_row[i])^2))
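The formula above can be sketched in plain Python as follows. This is a minimal illustration rather than the repository's exact code; the length check is an added assumption, since the README does not say how mismatched rows are handled.

```python
import math

def euclidean_distance(x_train_row, x_test_row):
    """Return the Euclidean distance between two equal-length feature vectors."""
    # Assumption: mismatched lengths are treated as an error.
    if len(x_train_row) != len(x_test_row):
        raise ValueError("both rows must have the same number of features")
    # Sum the squared per-feature differences, then take the square root.
    squared_diffs = ((a - b) ** 2 for a, b in zip(x_train_row, x_test_row))
    return math.sqrt(sum(squared_diffs))

print(euclidean_distance([0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 2.0, 4.0]))  # 5.0
```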

2. predict

This method predicts the class of each test point from its K nearest neighbors in the training data. For each test point, the Euclidean distance to every training point is calculated, the K training points with the smallest distances are selected, and the votes for each class label among them are counted. The class with the most votes is the prediction. If two or more classes receive the same number of votes, the tie is broken in favor of the class that comes first in the training data.
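A minimal sketch of the voting step is shown below. The function names `predict_one` and `predict` are illustrative and written as free functions rather than class methods. `math.dist` from the standard library computes the same Euclidean distance as the formula in the previous section. The tie-break is one reading of the rule described above: among the tied classes, the one whose label appears earliest in `y_train` wins.

```python
import math
from collections import Counter

def predict_one(x_train, y_train, x_test_row, k):
    """Predict the label of a single test row by majority vote of its k nearest neighbors."""
    if not 1 <= k <= len(x_train):
        raise ValueError("k must be between 1 and the number of training rows")
    # Sort training indices by distance to the test row.
    # sorted() is stable, so equidistant rows keep their training order.
    order = sorted(range(len(x_train)),
                   key=lambda i: math.dist(x_train[i], x_test_row))
    votes = Counter(y_train[i] for i in order[:k])
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Tie-break (assumed reading): the tied class seen first in y_train wins.
    for label in y_train:
        if label in tied:
            return label

def predict(x_train, y_train, x_test, k):
    """Predict a label for every row in x_test."""
    return [predict_one(x_train, y_train, row, k) for row in x_test]

x_train = [[0.0], [1.0], [4.0], [5.0]]
y_train = [1, 1, 0, 0]
# The second test point is equally close to one row of each class,
# so with k=2 the vote is tied and class 1 (first in y_train) wins.
print(predict(x_train, y_train, [[0.2], [2.5]], k=2))  # [1, 1]
```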

3. calc_accuracy

This method calculates the accuracy of the KNN classifier by comparing the predicted labels with the true labels for the test data. The accuracy is computed as the ratio of correctly classified instances to the total number of instances in the test set.
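As a rough sketch, the accuracy calculation might look like the following. Returning the correct count and the total alongside the ratio is an assumption made here for convenience; the README only specifies the ratio itself.

```python
def calc_accuracy(y_pred, y_true):
    """Return (correct, total, accuracy) for predicted versus true labels."""
    if len(y_pred) != len(y_true) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    # Count the positions where the prediction matches the true label.
    correct = sum(1 for p, t in zip(y_pred, y_true) if p == t)
    total = len(y_true)
    return correct, total, correct / total

print(calc_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # (3, 4, 0.75)
```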

Normalization

Before training and testing the KNN classifier, the feature columns are normalized separately using the mean and standard deviation of the values in the training data. Each feature is transformed using the function:

f(v) = (v - mean) / std

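This normalization puts all features on a comparable scale, so that no single feature dominates the distance calculation merely because its values span a larger range. Because the mean and standard deviation come from the training data only, the same training statistics are applied to the test rows.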
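The transformation can be sketched with the standard library as below. Two details are assumptions, because the README does not specify them: the population standard deviation is used (`statistics.pstdev`), and a constant column with zero standard deviation is left unscaled to avoid division by zero. The helper names `fit_normalizer` and `normalize` are illustrative.

```python
import statistics

def fit_normalizer(x_train):
    """Compute per-column mean and standard deviation from the training rows only."""
    columns = list(zip(*x_train))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: population standard deviation; zero is replaced by 1.0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def normalize(rows, means, stds):
    """Apply f(v) = (v - mean) / std to every value, column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

x_train = [[1.0, 10.0], [3.0, 30.0]]
x_test = [[2.0, 40.0]]
means, stds = fit_normalizer(x_train)
print(normalize(x_train, means, stds))  # [[-1.0, -1.0], [1.0, 1.0]]
print(normalize(x_test, means, stds))   # [[0.0, 2.0]]
```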

Training and Testing

The dataset is split into 70% for training and 30% for testing. The training and test sets are created by dividing the feature and label arrays accordingly.
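One way to sketch the shuffle and the 70/30 split is shown below. The function name, the `seed` parameter, and the use of integer division to find the cut point are assumptions for illustration; the repository shuffles a pandas DataFrame, whereas this sketch works on plain lists.

```python
import random

def shuffle_and_split(features, labels, train_percent=70, seed=0):
    """Shuffle rows and labels together, then split them into train and test parts."""
    indices = list(range(len(features)))
    # A seeded generator keeps the split reproducible (assumption, not from the README).
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(indices) * train_percent // 100
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

features = [[float(i)] for i in range(10)]
labels = list(range(10))
x_train, y_train, x_test, y_test = shuffle_and_split(features, labels)
print(len(x_train), len(x_test))  # 7 3
```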

The KNN classifier is then trained on the training data and tested on the test data for each value of K from 1 to 15, the range reported in the Experiment table. For each value of K, the classifier's accuracy is calculated and stored in a list.
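The loop over K values could be sketched as follows. To keep the example self-contained, `sweep_k` takes any prediction function with the signature `predict_fn(x_train, y_train, x_test, k)`; the `majority_baseline` used here is only a stand-in so the loop can run, not a KNN classifier. The printed field labels follow the example output in the Results section.

```python
from collections import Counter

def sweep_k(predict_fn, x_train, y_train, x_test, y_test, k_values):
    """Evaluate predict_fn for each k and collect (k, correct, total, accuracy)."""
    results = []
    for k in k_values:
        y_pred = predict_fn(x_train, y_train, x_test, k)
        correct = sum(1 for p, t in zip(y_pred, y_test) if p == t)
        results.append((k, correct, len(y_test), correct / len(y_test)))
    return results

def majority_baseline(x_train, y_train, x_test, k):
    """Stand-in predictor: always returns the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label for _ in x_test]

results = sweep_k(majority_baseline, [[0.0], [1.0], [2.0]], [0, 0, 1],
                  [[0.5], [2.5]], [0, 1], range(1, 3))
for k, correct, total, accuracy in results:
    print(f"K Value: {k}")
    print(f"Number of correctly classified instances: {correct}")
    print(f"Total number of instances: {total}")
    print(f"Accuracy: {accuracy}")
```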

Experiment

The KNN classifier is evaluated using different values of K ranging from 1 to 15. The accuracy of the classifier is measured for each K value, and the results are summarized in the following table:

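| K  | Accuracy           |
|----|--------------------|
| 1  | 1.0                |
| 2  | 1.0                |
| 3  | 1.0                |
| 4  | 1.0                |
| 5  | 1.0                |
| 6  | 1.0                |
| 7  | 1.0                |
| 8  | 1.0                |
| 9  | 1.0                |
| 10 | 1.0                |
| 11 | 1.0                |
| 12 | 0.9975728155339806 |
| 13 | 1.0                |
| 14 | 0.9975728155339806 |
| 15 | 0.9975728155339806 |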

Results

For each value of K, the classifier prints a summary to the console: the K value, the number of correctly classified test instances, the total number of instances in the test set, and the accuracy.

An example of the output:

K Value: 12
Number of correctly classified instances: 411
Total number of instances: 412
Accuracy: 0.9975728155339806

Conclusion

This code provides an implementation of a KNN classifier from scratch in Python. It demonstrates the steps involved in training and testing a KNN classifier, including data normalization, distance calculation, and prediction. By running the classifier with different values of K, the code reports the accuracy for each K value.

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request.

Team

License

This program is licensed under the MIT License.