+7 (4912) 72-03-73
 
RSREU web portal: https://rsreu.ru

UDC 007:681.512.2

EMBEDDINGS OF HIERARCHICAL NUMBERS TO ENRICH TRANSFORMER LANGUAGE MODELS WITH EXTERNAL ONTOLOGICAL KNOWLEDGE

I. Yu. Kashirin, Dr. of Technical Sciences, Full Professor, RSREU, Ryazan, Russia;

orcid.org/0000-0003-1694-7410

A new method is considered for vectorizing input natural-language sentences in generative large language models (LLMs). The method is based on an algebra of hierarchical numbers used in algorithms for computing the semantic proximity of words and sentences. It is suited to local subject areas and has been tested in the «political news» subject area, for which an OWL ontology and a corresponding graphical representation were developed in the form of a semantic network whose basic concepts are marked up with hierarchical numbers. The semantic network comprises a general layer and an applied layer. The general ontology uses ICF+ relations, which make it possible to simplify polymorphic operations in knowledge models. The software implementation of the method uses generative neural networks with DistilBERT attention. Knowledge enrichment of a pre-trained neural network is carried out by generating new semantic embeddings for the words (concepts) of natural-language sentences and injecting them into a new neural network before further training in the selected local subject area. A general corpus from the Hugging Face Datasets repository and a local corpus compiled by the author from English-language political articles of international electronic media (in particular RT, CNN, TASS, NYTimes, and WSJ) were used as training corpora to obtain the new neural network model mIYu-bert v.2.0. The experimental part of the material is based on Python 3 (Anaconda 3), OWL2EL, and the mIYu-bert v.2.0 software package; the latter toolkit was implemented by the author. The series of experiments performed allows the new method of using hierarchical numbers in the further training of LLMs to compute semantic similarity to be qualified as the basis of a technology that is not inferior in efficiency to the international analogues available today.
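The semantic-proximity calculation on hierarchical numbers can be illustrated with a minimal sketch. Here a hierarchical number is assumed to encode a concept's path in the ontology taxonomy (e.g. "1.2.1" = root → child 2 → child 1), and proximity is taken as a Wu-Palmer-style ratio over the deepest common prefix; both the encoding and the formula are illustrative assumptions, not the algebra defined in the paper.

```python
# Illustrative sketch: hierarchical numbers as dot-separated taxonomy paths,
# with semantic proximity derived from the deepest shared prefix.
# The encoding and formula below are assumptions for demonstration only.

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared prefix (in levels) of two hierarchical numbers."""
    n = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        n += 1
    return n

def similarity(a: str, b: str) -> float:
    """Wu-Palmer-style proximity: 2 * depth(common ancestor) / (depth(a) + depth(b))."""
    da, db = len(a.split(".")), len(b.split("."))
    return 2 * common_prefix_len(a, b) / (da + db)

# Toy taxonomy fragment for a "political news" domain (hypothetical codes):
codes = {"election": "1.2.1", "referendum": "1.2.2", "tariff": "1.3.1"}
print(similarity(codes["election"], codes["referendum"]))  # siblings: 2*2/6
print(similarity(codes["election"], codes["tariff"]))      # more distant: 2*1/6
```

Siblings sharing a parent score higher than concepts that diverge near the root, which is the behaviour a proximity measure over a taxonomy should exhibit.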
The aim of this paper is to present a new method for enriching LLM language models with embeddings of hierarchical numbers based on OWL ontologies for local subject areas.
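The enrichment step itself can be sketched in miniature: the embedding matrix of a pre-trained model is extended with rows for new domain concepts, each row derived from the concept's hierarchical number, before further training. The vector dimension, the encoding function, and the random stand-in for the pre-trained matrix are all illustrative assumptions; this is not the author's mIYu-bert pipeline.

```python
# Illustrative sketch of knowledge enrichment: extend a pre-trained embedding
# matrix with vectors built from hierarchical numbers of new domain concepts.
# Dimensions, encoding, and data are assumptions for demonstration only.
import numpy as np

DIM = 4  # toy embedding size (real models use hundreds of dimensions)

def embed_hierarchical_number(code: str, dim: int = DIM) -> np.ndarray:
    """Map a hierarchical number like '1.2.1' to a unit vector:
    each path component fills one coordinate; deeper levels are truncated."""
    parts = [int(p) for p in code.split(".")][:dim]
    vec = np.zeros(dim)
    vec[: len(parts)] = parts
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Random stand-in for a pre-trained model's word-embedding matrix (10 tokens).
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(10, DIM))

# New domain concepts with hierarchical numbers taken from the ontology.
new_concepts = {"election": "1.2.1", "referendum": "1.2.2"}
new_rows = np.stack([embed_hierarchical_number(c) for c in new_concepts.values()])

# Enriched matrix: original vocabulary plus the new concept rows,
# ready to initialize an extended model before fine-tuning.
enriched = np.vstack([pretrained, new_rows])
print(enriched.shape)  # (12, 4)
```

In a real pipeline the extended matrix would replace the model's embedding layer (after enlarging its vocabulary) and the network would then be fine-tuned on the local corpus.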

Key words: embeddings of hierarchical numbers, neural DistilBERT models, natural language analysis, ontological taxonomies, semantic similarity.
