
UDC 007:681.512.2

DISTINCTIVE FEATURES OF DECIMAL HIERARCHICAL NUMBERS IN LARGE LANGUAGE MODEL EMBEDDINGS INTERPRETATION

I. Yu. Kashirin, Doctor of Technical Sciences, Professor, Department of Computing and Applied Mathematics, RSREU, Ryazan, Russia;

orcid.org/0000-0003-1694-7410

This paper discusses a new method for analyzing input natural language sentences for LLM language models. The method is based on the algebra of decimal hierarchical numbers used in algorithms for calculating the semantic similarity of words, phrases, and sentences. The method is suitable for local subject areas and has been tested in the «political news» subject area. For this local domain, an OWL ontology and a corresponding graphical representation in the form of a semantic network were developed, with basic entities marked up with decimal hierarchical numbers. The semantic network includes general and application levels. A fragment of the general ontology is represented by relations, which significantly reduces the computational complexity of algebraic operations on knowledge graphs and, consequently, the time required to calculate the semantic similarity of natural language constructs. The software implementation of the method uses the well-known DistilBERT technology for language neural networks with an attention mechanism. Knowledge enrichment of the pre-trained neural network is achieved by generating new semantic embeddings for the words (entities) of natural language sentences and integrating them into a new neural network before fine-tuning in the local domain. The training corpora for the new neural network model mIYu-bert v.2.0 were a general corpus from the Hugging Face Datasets repository and a local corpus of materials extracted by the author from English-language political articles in international electronic media outlets, including RT, Meduza, CNN, TASS, NYTimes, Bloomberg, and WSJ. The experimental portion of the material is based on the Python v.3 (Anaconda 3) programming language toolkit, the DistilBERT LLM, and the mIYu-bert v.3.1 software package; the latter toolkit was implemented by the author.
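The paper's algebra of decimal hierarchical numbers is not reproduced in this abstract; as a minimal illustrative sketch of the general idea (the labels and the prefix-based measure below are assumptions for illustration, not the author's exact algebra), ontology entities can be labeled by their path from the taxonomy root, so that entities sharing a deeper common ancestor receive a higher similarity score:

```python
# Sketch only: decimal hierarchical numbers label ontology nodes by
# their path from the root, e.g. "1.2.3". A simple similarity measure
# is the shared-prefix length relative to the deeper of the two labels.
# (Hypothetical measure for illustration, not the paper's algebra.)

def dhn_similarity(a: str, b: str) -> float:
    """Prefix-overlap similarity of two decimal hierarchical numbers."""
    pa, pb = a.split("."), b.split(".")
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))

# Siblings under the same parent score higher than distant relatives:
print(dhn_similarity("1.2.3", "1.2.4"))  # shares prefix "1.2": 2/3
print(dhn_similarity("1.2.3", "1.5"))    # shares only "1": 1/3
print(dhn_similarity("1.2", "3.4"))      # disjoint subtrees: 0.0
```

Such prefix comparisons are cheap (linear in label depth), which is consistent with the abstract's claim that the markup reduces the computational cost of operations on knowledge graphs.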
The completed series of experiments allows us to qualify the new method of using decimal hierarchical numbers in the retraining of LLM models for calculating semantic similarity as the basis for a technology that matches currently available international analogues in efficiency and does not exceed them in computational complexity. The aim of this paper is to describe a new method for calculating semantic similarity in LLM language models using decimal hierarchical numbers based on OWL ontologies, as well as universal algebras for generating knowledge graphs in local subject areas.
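The semantic-similarity computation over embeddings mentioned above typically reduces to a vector comparison such as cosine similarity; the following self-contained sketch uses toy vectors (a real pipeline would obtain them from DistilBERT, and the example entities here are assumptions for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings (hypothetical; real DistilBERT
# embeddings are 768-dimensional):
e_sanctions = [0.9, 0.1, 0.3]
e_embargo   = [0.8, 0.2, 0.3]
e_weather   = [0.1, 0.9, 0.2]

# Semantically related terms should score closer to 1:
print(cosine_similarity(e_sanctions, e_embargo))
print(cosine_similarity(e_sanctions, e_weather))
```

In the method described above, ontology-derived markup would refine such raw embedding comparisons within the local subject area.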

Key words: decimal hierarchical numbers, universal algebras, DistilBERT language models, natural language analysis, ontological taxonomies, semantic similarity.
