UDC 007:681.512.2
FORMATION OF LLM-ORIENTED KNOWLEDGE RESOURCES BASED ON GENERATION OF AUGMENTED INFORMATION
I. Yu. Kashirin, Dr. in technical sciences, full professor, RSREU, Ryazan, Russia;
orcid.org/0000-0003-1694-7410
A new technology for designing question-answering systems based on large generative language models (LLMs) is considered. The disadvantages of LLMs are investigated, the main one being the lack of up-to-date information that has appeared in information services relatively recently. The new technology follows a modern approach in which large models are supplemented with resources of new, up-to-date, or domain-specific knowledge. Such systems are called Retrieval-Augmented Generation (RAG) systems. They use additional databases to improve the quality of dialogue. The author proposes using a method of expanding the semantic space for the vectorization of natural language texts to generate new knowledge resources. The method is based on a system of operations on a set of hierarchical numbers generated as semantic indices of dictionary concepts and dictionary definitions of events. This makes it possible to calculate the semantic proximity of dictionary constructions more accurately. The new approach can be used for specialized subject areas. The proposed technology has been implemented in software as the IYuRAG v.1.0 RAG system. The design used the previously developed CorpusMining v.2.1 module for collecting thematic corpora, which is based on the Googlesearch and BeautifulSoup4 tools in Python v.3.10 and Anaconda v.2.1. In addition, the LLM RoBERTa-transformers toolkit was used. The IYuRAG v.1.0 RAG system makes it possible to generate knowledge resources in the «Political news / Armed conflicts» domain. The question-answering module of the RAG system enhances the capabilities of existing LLMs. The aim of this article is to present a new method for designing RAG systems that uses hierarchical numbers to expand the semantic space in large neural network generative models.
Key words: augmented information generation, hierarchical number embeddings, neural network transformers, natural language analysis, ontological taxonomies, semantic space.
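
The abstract does not define the operations on hierarchical numbers, so the following is only an illustrative sketch of how a semantic proximity between hierarchical indices might be computed. It assumes that a hierarchical number is a dot-separated path in a concept taxonomy (for example "1.2.1") and substitutes a Wu-Palmer-style measure for the author's operations. The names parse_hnumber, common_prefix_len, and proximity, as well as the sample indices, are hypothetical.

```python
from typing import Tuple

# Assumption: a hierarchical number is a path of integer indices in a taxonomy,
# written as a dot-separated string such as "1.2.1".
HNumber = Tuple[int, ...]


def parse_hnumber(text: str) -> HNumber:
    """Convert a dot-separated index like "1.2.1" into a tuple (1, 2, 1)."""
    return tuple(int(part) for part in text.split("."))


def common_prefix_len(a: HNumber, b: HNumber) -> int:
    """Depth of the deepest taxonomy node shared by both indices."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def proximity(a: HNumber, b: HNumber) -> float:
    """Wu-Palmer-style similarity: 2 * depth(shared ancestor) / (depth(a) + depth(b)).

    This is a stand-in measure chosen for illustration, not the operation
    system described in the article.
    """
    if not a or not b:
        return 0.0
    return 2 * common_prefix_len(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical indices for three dictionary concepts.
    ceasefire = parse_hnumber("1.2.1")
    truce = parse_hnumber("1.2.2")
    election = parse_hnumber("3.1")
    print(proximity(ceasefire, truce))     # sibling concepts share a two-level prefix
    print(proximity(ceasefire, election))  # concepts in different top-level branches
```

Under these assumptions, concepts that share a longer index prefix score closer to 1, and concepts in different top-level branches score 0. Such a score could be combined with embedding-based similarity when ranking retrieved passages.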
