UDC 004.032.26
AUTOMATIC ARCHITECTURE-EXPANSION TRAINING FOR MIXTURE-OF-EXPERTS MODELS
A. K. Klimenko, post-graduate student, BMSTU, Moscow, Russia;
orcid.org/0009-0009-2412-0641
K. A. Maikov, Doctor of Technical Sciences, full professor, ICS7 BMSTU, Moscow, Russia;
orcid.org/0000-0000-0000-000X
V. V. Tishkina, PhD in Technical Sciences, associate professor, RSREU, Ryazan, Russia;
orcid.org/0000-0002-6320-3513
Mixture-of-Experts (MoE) architectures enable language-model scaling without a proportional increase in computational cost by activating only a subset of parameters per token. Classical approaches, however, fix the number of experts a priori, often yielding sub-optimal capacity and slower convergence. We propose a training method that automatically grows the expert pool during optimization. A new expert is inserted when the validation metric plateaus; the newcomer is initialized by a small random perturbation of an existing expert and warmed up with an increased learning rate. On GLUE benchmarks the method converges 5–8 % faster than static-MoE baselines of comparable final size, while the nesting of hypothesis spaces theoretically guarantees a non-increasing loss. The method thus provides a theoretically justified way to improve the quality characteristics of the solution and to reduce resource consumption.
Key words: mixture of experts, MoE, adaptive training, dynamic architecture expansion.
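The sketch below illustrates the expansion step summarized in the abstract: detect a plateau of the validation metric, clone and perturb an existing expert, and warm up the newcomer with a larger learning rate. It is a minimal PyTorch-style illustration; the module structure, the plateau criterion, and all hyper-parameter values are assumptions for demonstration, not the exact configuration used in this work.

```python
import copy
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy MoE layer: a linear router plus a growable list of feed-forward experts."""
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # Soft routing for brevity; sparse top-k gating would be used in practice.
        gates = torch.softmax(self.router(x), dim=-1)              # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, d_model, n_experts)
        return (outs * gates.unsqueeze(1)).sum(dim=-1)

    def grow_expert(self, noise_std=0.01):
        """Clone a random existing expert, perturb its weights, and widen the router."""
        src = self.experts[torch.randint(len(self.experts), (1,)).item()]
        new_expert = copy.deepcopy(src)
        old = self.router
        self.router = nn.Linear(old.in_features, old.out_features + 1)
        with torch.no_grad():
            for p in new_expert.parameters():
                p.add_(noise_std * torch.randn_like(p))
            # Copy old routing weights; the new row starts at zero so the fresh
            # expert initially receives a small routing probability.
            self.router.weight[:-1].copy_(old.weight)
            self.router.bias[:-1].copy_(old.bias)
            self.router.weight[-1].zero_()
            self.router.bias[-1].zero_()
        self.experts.append(new_expert)
        return new_expert


def plateaued(history, patience=3, min_delta=1e-3):
    """True if the validation loss has not improved by min_delta for `patience` epochs."""
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    return min(history[-patience:]) > best_before - min_delta


# Hypothetical usage inside a training loop: grow on plateau, then rebuild the
# optimizer with a separate parameter group giving the newcomer a higher warm-up LR.
layer = MoELayer(d_model=16, d_ff=32, n_experts=2)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)
val_history = [0.92, 0.80, 0.74, 0.735, 0.734, 0.7339]  # illustrative values only

if plateaued(val_history):
    new_expert = layer.grow_expert(noise_std=0.01)
    new_ids = {id(p) for p in new_expert.parameters()}
    old_params = [p for p in layer.parameters() if id(p) not in new_ids]
    optimizer = torch.optim.Adam([
        {"params": old_params, "lr": 1e-3},
        {"params": new_expert.parameters(), "lr": 5e-3},  # assumed warm-up learning rate
    ])
```

In this illustration the new expert's parameters form their own optimizer group, which is one simple way to realize the increased warm-up learning rate mentioned in the abstract; the actual schedule used by the method is described in the body of the paper.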
