UDC 004.032.26
AUTOMATIC ARCHITECTURE-EXPANSION TRAINING FOR MIXTURE-OF-EXPERTS MODELS
A. K. Klimenko, post-graduate student, BMSTU, Moscow, Russia;
orcid.org/0009-0009-2412-0641
K. A. Maikov, Doctor of Technical Sciences, full professor, ICS7 BMSTU, Moscow, Russia;
orcid.org/0000-0000-0000-000X
V. V. Tishkina, PhD in Technical Sciences, associate professor, RSREU, Ryazan, Russia;
orcid.org/0000-0002-6320-3513
Mixture-of-Experts (MoE) architectures enable language-model scaling without a proportional increase in computational cost by activating only a subset of parameters per token. Classical approaches, however, fix the number of experts a priori, often yielding sub-optimal capacity and slower convergence. We propose a training method that automatically grows the expert pool during optimization. A new expert is inserted when the validation metric plateaus; the newcomer is initialized by a small random perturbation of an existing expert and warmed up with an increased learning rate. On GLUE benchmarks the method converges 5–8 % faster than static-MoE baselines of comparable final size, while the nesting of hypothesis spaces theoretically guarantees a non-increasing loss. The method thus provides a theoretically justified way to improve the quality characteristics of the solution and to reduce resource consumption.
Key words: mixture of experts, MoE, adaptive training, dynamic architecture expansion.
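The sketch below illustrates the expansion step summarized in the abstract: detect a plateau of the validation metric, clone and perturb an existing expert, and warm up the newcomer with a larger learning rate. It is a minimal PyTorch-style illustration; the module structure, the plateau criterion, and all hyper-parameter values are assumptions for demonstration, not the exact configuration used in this work.

```python
import copy
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy MoE layer: a linear router plus a growable list of feed-forward experts."""
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # Soft routing for brevity; sparse top-k gating would be used in practice.
        gates = torch.softmax(self.router(x), dim=-1)              # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, d_model, n_experts)
        return (outs * gates.unsqueeze(1)).sum(dim=-1)

    def grow_expert(self, noise_std=0.01):
        """Clone a random existing expert, perturb its weights, and widen the router."""
        src = self.experts[torch.randint(len(self.experts), (1,)).item()]
        new_expert = copy.deepcopy(src)
        old = self.router
        self.router = nn.Linear(old.in_features, old.out_features + 1)
        with torch.no_grad():
            for p in new_expert.parameters():
                p.add_(noise_std * torch.randn_like(p))
            # Copy old routing weights; the new row starts at zero so the fresh
            # expert initially receives a small routing probability.
            self.router.weight[:-1].copy_(old.weight)
            self.router.bias[:-1].copy_(old.bias)
            self.router.weight[-1].zero_()
            self.router.bias[-1].zero_()
        self.experts.append(new_expert)
        return new_expert


def plateaued(history, patience=3, min_delta=1e-3):
    """True if the validation loss has not improved by min_delta for `patience` epochs."""
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    return min(history[-patience:]) > best_before - min_delta


# Hypothetical usage inside a training loop: grow on plateau, then rebuild the
# optimizer with a separate parameter group giving the newcomer a higher warm-up LR.
layer = MoELayer(d_model=16, d_ff=32, n_experts=2)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)
val_history = [0.92, 0.80, 0.74, 0.735, 0.734, 0.7339]  # illustrative values only

if plateaued(val_history):
    new_expert = layer.grow_expert(noise_std=0.01)
    new_ids = {id(p) for p in new_expert.parameters()}
    old_params = [p for p in layer.parameters() if id(p) not in new_ids]
    optimizer = torch.optim.Adam([
        {"params": old_params, "lr": 1e-3},
        {"params": new_expert.parameters(), "lr": 5e-3},  # assumed warm-up learning rate
    ])
```

In this illustration the new expert's parameters form their own optimizer group, which is one simple way to realize the increased warm-up learning rate mentioned in the abstract; the actual schedule used by the method is described in the body of the paper.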
