Increasing data classification accuracy using k nearest neighborhood algorithm based on training data pre-clusterization

UDC 004.855.5

INCREASING DATA CLASSIFICATION ACCURACY USING k NEAREST NEIGHBORHOOD ALGORITHM BASED ON TRAINING DATA PRE-CLUSTERIZATION

V. I. Oreshkov, Ph.D. (technical sciences), associate professor, CAD department, RSREU, Ryazan, Russia;
orcid.org/0000-0003-0316-4927

This paper considers data classification with the k nearest neighbors algorithm, a metric method of machine learning. The aim of the work is to develop a methodology for adaptively adjusting the algorithm's neighborhood parameter k in order to improve its accuracy by taking into account the non-uniform distribution of training examples in the feature space. In regions of the feature space where the density of training examples is high, the algorithm works well for any value of the neighborhood parameter, since the neighbors it selects are compactly located and, as a rule, belong to the same class. In sparse regions, large values of the neighborhood parameter bring distant examples, which may belong to different classes, into the classification process, and this degrades classification accuracy. To solve this problem, the article proposes a neighborhood parameter that is not fixed for all examples but varies with the density of examples in the surrounding region of the feature space. To this end, the training set is clustered beforehand, and the density of each cluster is characterized by the root-mean-square distance from the cluster's examples to its centroid. From this density value, a neighborhood parameter is calculated for each cluster and used to classify any object that falls within that cluster. In the experiment, the modified algorithm was more accurate under cross-validation.
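
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. The abstract does not state how the cluster density is converted into a neighborhood parameter, so the linear rule in `k_per_cluster` (tightest cluster gets `k_max`, most spread-out cluster gets `k_min`) is an assumption, as are the use of k-means for pre-clustering, its naive initialization, the choice to search for neighbors over the whole training set, and all function names and toy data.

```python
import math
from collections import Counter


def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's k-means with a naive, deterministic initialization."""
    # Assumption: start from evenly spaced training points.
    centroids = [points[i * len(points) // n_clusters] for i in range(n_clusters)]

    def nearest(p):
        return min(range(n_clusters), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(n_iter):
        assign = [nearest(p) for p in points]
        # Move each centroid to the mean of its members (keep it if empty).
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, [nearest(p) for p in points]


def rms_spread(members, centroid):
    """Root-mean-square distance from cluster members to the centroid."""
    if not members:  # simplification: an empty cluster is given spread 0.0
        return 0.0
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in members) / len(members))


def k_per_cluster(spreads, k_min=1, k_max=7):
    """Assumed rule: linear map from spread to k, dense -> k_max, sparse -> k_min."""
    lo, hi = min(spreads), max(spreads)
    if hi == lo:
        return [k_max] * len(spreads)
    return [round(k_max - (s - lo) / (hi - lo) * (k_max - k_min)) for s in spreads]


def fit(points, n_clusters):
    """Pre-cluster the training set and derive one k per cluster."""
    centroids, assign = kmeans(points, n_clusters)
    spreads = [
        rms_spread([p for p, a in zip(points, assign) if a == c], centroids[c])
        for c in range(n_clusters)
    ]
    return centroids, k_per_cluster(spreads)


def classify(query, points, labels, centroids, ks):
    """Take k from the query's nearest cluster, then vote among the k nearest points."""
    c = min(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return Counter(labels[i] for i in order[:ks[c]]).most_common(1)[0][0]


# Toy data: a dense group of class "a" and a sparse group of class "b".
dense = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (-0.1, 0), (0, -0.1)]
sparse = [(8, 8), (12, 8), (8, 12), (12, 12), (10, 7), (10, 13)]
points = dense + sparse
labels = ["a"] * len(dense) + ["b"] * len(sparse)

centroids, ks = fit(points, n_clusters=2)
label_dense = classify((0.05, 0.05), points, labels, centroids, ks)
label_sparse = classify((10, 10), points, labels, centroids, ks)
```

On this toy set the dense cluster receives the larger neighborhood parameter and the sparse cluster the smaller one, which is the behavior the abstract argues for; the bounds `k_min` and `k_max` would have to be tuned, for example by cross-validation.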

Keywords: data mining, machine learning, supervised learning, training example, classification, clustering, class, cluster, centroid, cross-validation.

Vestnik of Ryazan State Radio Engineering University

Issue 76