RSREU web portal: http://rsreu.ru

UDC 004.912

EXPERIMENTAL STUDY OF TEXT DOCUMENTS VECTORIZATION TECHNIQUES AND THEIR CLUSTERING ALGORITHMS EFFICIENCY

K. K. Otradnov, Lecturer, Federal State Budget Educational Institution of Higher Education «Moscow Technological University»
V. K. Raev, Professor, Department of Instrumental and Applied Software, Institute of Information Technologies, Moscow Technological University (MIREA)

The aim of this work is an experimental comparison of the quality and speed of text processing across several vector representations of documents and several text clustering algorithms. The representations compared are: document-term with the TF-IDF frequency metric, with and without n-grams; document-associative-semantic group with the TF-IDF frequency metric; and document-topic using Latent Dirichlet Allocation (LDA). The clustering algorithms are K-means, DBSCAN, Affinity Propagation, Agglomerative Clustering, and BIRCH. Efficiency was assessed by the processing time for a test sample of 10,000 texts on the available hardware platform, and quality by the V-measure, Adjusted Rand Index (ARI), Silhouette, and expert assessment metrics. The experiments showed that the best quality at the shortest running time is achieved by the non-hierarchical clustering algorithms K-means and Affinity Propagation, using the document-term model with TF-IDF without n-grams and the document-lexical-semantic group model with TF-IDF.
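As a rough illustration of two ingredients named in the abstract, the sketch below builds document-term TF-IDF vectors (with or without word n-grams) and computes the Adjusted Rand Index between two labelings. It is not the authors' implementation: the abstract does not state which TF-IDF variant or software was used, so the plain `tf * ln(N / df)` weighting, the whitespace tokenizer, and the function names `tokens`, `tfidf`, and `adjusted_rand_index` are all assumptions made for this example.

```python
import math
from collections import Counter


def tokens(text, ngram_max=1):
    """Lowercased whitespace tokens plus word n-grams up to ngram_max (assumed tokenizer)."""
    words = text.lower().split()
    out = []
    for n in range(1, ngram_max + 1):
        out += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return out


def tfidf(docs, ngram_max=1):
    """Return one {term: weight} dict per document, weight = tf * ln(N / df).

    This unsmoothed variant is an assumption; libraries often use smoothed forms.
    """
    counts = [Counter(tokens(d, ngram_max)) for d in docs]
    df = Counter(term for c in counts for term in c)  # document frequency
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({t: (f / total) * math.log(n_docs / df[t]) for t, f in c.items()})
    return vectors


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    def pairs(k):
        return k * (k - 1) // 2

    n = len(labels_true)
    if n < 2:
        return 1.0  # degenerate case: no pairs to compare
    index = sum(pairs(v) for v in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(pairs(v) for v in Counter(labels_true).values())
    sum_b = sum(pairs(v) for v in Counter(labels_pred).values())
    expected = sum_a * sum_b / pairs(n)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # both labelings are trivial in the same way
    return (index - expected) / (max_index - expected)


# Hypothetical toy corpus, not the 10,000-text sample from the study.
docs = [
    "neural network training",
    "neural network inference",
    "stock market report",
]
unigram_vectors = tfidf(docs, ngram_max=1)
bigram_vectors = tfidf(docs, ngram_max=2)
```

Passing `ngram_max=2` adds terms such as `"neural network"` to the vocabulary, which is the kind of enlargement the with/without n-gram comparison is about. ARI is invariant to relabeling of clusters, so it can compare a clustering result against reference classes directly.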

Key words: clustering efficiency, quality metrics, processing time, clustering algorithms, vector document representation models.
