Training LLMs: Do We Understand Our Optimizers?
Talk by Dr. Antonio Orvieto
Date: 25.09.25  Time: 12.00 - 13.15  Room: Y27H12
Why does Adam so consistently outperform SGD when training Transformer language models? Despite many proposed explanations, this optimizer gap is still not fully understood. In this talk, we present results from two complementary studies.

First, using over 2000 language model training runs, we compare Adam with simplified variants such as signed gradient descent and signed momentum. We find that while signed momentum is faster than SGD, it still lags behind Adam; crucially, however, constraining Adam's momentum parameters to be equal (beta1 = beta2) retains near-optimal performance. This is of great practical importance and also reveals a new insight: Adam in this form admits a robust statistical interpretation and a clear link to mollified sign descent.

Second, through carefully tuned comparisons of SGD with momentum and Adam, we show that SGD can in fact match Adam in small-batch training, but loses ground as batch size grows. Analyzing both Transformer experiments and quadratic models with stochastic differential equations, we shed new light on the role of batch size in shaping training dynamics.

http://orvi.altervista.org/
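For readers unfamiliar with the optimizer variants mentioned above, the following is a minimal sketch (not code from the talk) of the per-parameter update rules for signed gradient descent, signed momentum, and Adam with the constraint beta1 = beta2. The function names, hyperparameter values, and the toy quadratic objective are illustrative assumptions, not the speaker's experimental setup.

```python
import numpy as np


def sign_sgd_step(w, g, lr):
    # Signed gradient descent: step using only the sign of each gradient coordinate.
    return w - lr * np.sign(g)


def signed_momentum_step(w, g, m, lr, beta=0.9):
    # Signed momentum: accumulate an EMA of gradients, then step with its sign.
    m = beta * m + (1 - beta) * g
    return w - lr * np.sign(m), m


def adam_step(w, g, m, v, t, lr, beta1=0.95, beta2=0.95, eps=1e-8):
    # Standard Adam update; setting beta1 == beta2 (as here, an assumed value)
    # corresponds to the constrained variant discussed in the abstract.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


if __name__ == "__main__":
    # Toy ill-conditioned quadratic with noisy gradients, just to exercise the updates.
    rng = np.random.default_rng(0)
    A = np.diag([1.0, 10.0])
    w_sign = np.array([5.0, 5.0])
    w_sgnm = np.array([5.0, 5.0])
    w_adam = np.array([5.0, 5.0])
    m_sgnm = np.zeros(2)
    m_adam, v_adam = np.zeros(2), np.zeros(2)

    for t in range(1, 201):
        noise = rng.normal(scale=0.1, size=2)
        grad = lambda w: A @ w + noise  # stochastic gradient of 0.5 * w^T A w

        w_sign = sign_sgd_step(w_sign, grad(w_sign), lr=0.01)
        w_sgnm, m_sgnm = signed_momentum_step(w_sgnm, grad(w_sgnm), m_sgnm, lr=0.01)
        w_adam, m_adam, v_adam = adam_step(w_adam, grad(w_adam), m_adam, v_adam, t, lr=0.05)

    print("sign SGD:       ", w_sign)
    print("signed momentum:", w_sgnm)
    print("Adam (b1 = b2): ", w_adam)
```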