Analysis of text document vectorization methods

UDC 004.912

ANALYSIS OF TEXT DOCUMENT VECTORIZATION METHODS

O. A. Popova, teacher, TGMU, Tyumen, Russia;

orcid.org/0009-0006-3530-5703

The article analyzes methods for vectorizing text data. The data form the optional, variable part of medical training courses. The purpose of the work is to choose the optimal method for vectorizing text data on medical topics. The choice of a vectorization method is pressing because of the need to improve the quality of a recommender system that selects educational content for students. The selected method will later be recommended as the text pre-processing procedure for that recommender system. The paper presents four vectorization methods: BinaryBOW, Bag of words, TF-IDF, and Word2Vec. The results of the experiment established the successful application of the neural-network-based method, Word2Vec. Its algorithm is based on the predictability of the result, drawing on the semantic proximity of words, machine learning, and the vector representation of words. The article also presents the choice of vectorizer hyperparameters for the machine learning model in accordance with the text data set.
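As a rough illustration of how the three count-based methods named above differ, the following minimal sketch builds BinaryBOW, Bag of words, and TF-IDF vectors by hand. The toy corpus, the function names, and the plain `log(N / df)` form of IDF are assumptions made for illustration only; they are not taken from the article, and library implementations typically use smoothed variants. Word2Vec is omitted because it requires training a neural model.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; not the article's medical course data).
docs = [
    "heart rate and blood pressure",
    "blood test and blood count",
    "heart surgery",
]

# Whitespace tokenization and a sorted vocabulary shared by all vectors.
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})


def bag_of_words(tokens):
    """Bag of words: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def binary_bow(tokens):
    """BinaryBOW: 1 if the vocabulary word occurs at all, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]


def tf_idf(tokens):
    """TF-IDF with tf = count / doc length and idf = log(N / df).

    This unsmoothed form is an assumption; other variants exist.
    """
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)  # documents containing w
        vec.append(tf * math.log(n_docs / df))
    return vec


bow = [bag_of_words(t) for t in tokenized]
binary = [binary_bow(t) for t in tokenized]
tfidf = [tf_idf(t) for t in tokenized]
```

In this sketch the second document repeats "blood", so its Bag of words vector records a count of 2 where BinaryBOW records 1, while TF-IDF gives more weight to words that occur in fewer documents.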

Keywords: vectorization, BinaryBOW, Bag of words, TF-IDF, Word2Vec, Skip-gram, softmax, corpus, neural network, vector

Vestnik of Ryazan State Radio Engineering University

Issue 85
