UDC 007:681.512.2
KNOWLEDGE MODELS TO CORRECT DATA DRIFT IN DATA MINING
I. Yu. Kashirin, Dr. Sc. (Tech.), full professor, RSREU, Ryazan, Russia;
orcid.org/0000-0003-1694-7410
The article describes a new method for detecting and correcting data drift in machine learning (ML) models. Data drift is an adverse change over time in the patterns that generate the basic features of the initial training, test and validation data sets, which degrades the prediction accuracy of ML models in data mining. Existing methods for detecting and correcting data drift and the varieties of drift are reviewed, and the problem of deep semantic changes in data, caused by the dynamics of the main concepts and relations in the subject area where ML models are applied, is formulated. The new drift correction method underlies a new technology for designing classification, regression and forecasting models for specially formalized subject areas. When the scope of the “sliding window” is chosen, the structure of the domain knowledge model is considered first; this model can use the ontological representation of the Semantic Web concept. Input features of the training data set are grouped according to the structure of concepts and relationships in the knowledge base. Alternative paradigmatic relations are tracked, and for each of them the drift of the semantic features corresponding to the chosen paradigm is studied locally. As an example for the experimental part of the study, the subject area of communication services was chosen, with data taken from the international Kaggle repository. The software was implemented in Python v.3.8 using the Spyder v.4 toolkit. The experimental results show that the new method and technology for data drift correction are effective and open qualitatively new possibilities for automatic data analysis. The aim of the work is to present a new method for detecting and correcting data drift, together with the corresponding technology, which enables automatic search, monitoring and correction of data sets as they evolve over time.
Key words: data drift, ML-models, data mining, forecasting accuracy, knowledge base, semantic networks, ontological knowledge models, hierarchical number
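The idea of grouping input features by knowledge-base concepts and studying drift locally inside a sliding window can be illustrated with a minimal sketch. It is not the author's implementation. The concept groups, feature names and threshold below are hypothetical, the rows are synthetic, and the two-sample Kolmogorov-Smirnov statistic is used only as a generic stand-in for a drift measure, since the abstract does not name one.

```python
from bisect import bisect_right

# Hypothetical concept-to-feature grouping. In the described method this
# structure would come from the domain ontology, not a hand-written dict.
CONCEPT_GROUPS = {
    "Tariff": ["monthly_charges"],
    "Usage": ["call_minutes", "data_gb"],
}


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    ref, cur = sorted(reference), sorted(current)
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


def drift_by_concept(rows, groups, window, threshold=0.5):
    """Compare the first `window` rows with the last `window` rows,
    separately for each concept group.

    Returns {concept: (largest KS statistic among its features, drift flag)}.
    The threshold is an illustrative assumption, not a recommended value.
    """
    reference, current = rows[:window], rows[-window:]
    report = {}
    for concept, features in groups.items():
        score = max(
            ks_statistic([r[f] for r in reference], [r[f] for r in current])
            for f in features
        )
        report[concept] = (score, score > threshold)
    return report


# Synthetic time-ordered rows: only call_minutes shifts in the second half.
rows = [
    {
        "monthly_charges": 50 + i % 5,
        "call_minutes": (100 if i < 10 else 200) + i % 5,
        "data_gb": float(i % 5),
    }
    for i in range(20)
]

report = drift_by_concept(rows, CONCEPT_GROUPS, window=5)
```

In this toy data only the "Usage" group is flagged, which shows how a per-concept report localizes drift to one part of the knowledge model instead of signalling it for the data set as a whole.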