RSREU web portal: https://rsreu.ru

UDC 004.93'12

PREPROCESSING METHODS FOR CLASSIFICATION WITH MISSING DATA: REVIEW

K. A. Maykov, PhD (technical sciences), full professor, BMSTU, Moscow
P. A. Gavrilov, master's student, BMSTU, Moscow

The problem of classification with missing data is considered. The aim of this work is to investigate the features and limitations of several known preprocessing methods for handling missing values. Missing data mechanisms are described, and different approaches to pattern classification with missing data are outlined. Statistical imputation methods (mean, median, mode, and hot deck imputation) are considered. A comparative analysis of these imputation methods is presented, using the k-nearest neighbor algorithm as the classifier and 10-fold cross-validation to evaluate classifier quality. The choice of software for the numerical experiments is justified. The results show that all of the above methods perform similarly when 5–20% of the data are missing, whereas hot deck imputation yields lower cross-validation scores than mean, median, and mode imputation when 30–40% of the data are missing. At 40% missing data, median imputation outperforms the other reviewed methods.

Key words: machine learning, classification, missing data, data preprocessing.
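
As an illustration of the four imputation methods named in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `impute_column`, the use of `None` to mark a missing value, and the choice of a random hot deck (donor drawn uniformly from the observed values of the same feature) are assumptions made for this example; the abstract does not state which hot deck variant or which software was used.

```python
import random
import statistics


def impute_column(values, strategy="mean", rng=None):
    """Return a copy of `values` with None entries filled in.

    Hypothetical helper for illustration only.
    strategy: "mean", "median", "mode", or "hot_deck".
    "hot_deck" here means a random hot deck: each missing entry is replaced
    by a donor value drawn at random from the observed entries of the same
    column (an assumed variant, not taken from the paper).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")

    if strategy == "hot_deck":
        rng = rng or random.Random(0)  # fixed seed for reproducibility
        return [rng.choice(observed) if v is None else v for v in values]

    statistic = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill = statistic(observed)  # one fill value for the whole column
    return [fill if v is None else v for v in values]


# One feature column with two missing entries.
column = [1.0, None, 2.0, 2.0, None, 7.0]
print(impute_column(column, "mean"))      # [1.0, 3.0, 2.0, 2.0, 3.0, 7.0]
print(impute_column(column, "median"))    # [1.0, 2.0, 2.0, 2.0, 2.0, 7.0]
print(impute_column(column, "hot_deck"))  # missing entries replaced by random donors
```

Mean, median, and mode imputation replace every missing entry in a feature with a single statistic, while hot deck imputation reuses actual observed values, so the imputed column keeps the set of values that occur in the data. The imputed data set can then be passed to a classifier such as k-nearest neighbors and scored with cross-validation.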
