Why Cosine Annealing Works, and Why We Don't Actually Need It
Page created: Oct 25 2025 at 12:00 AM
Learning rate schedules are used to coax better training performance out of deep learning models, and they play a pivotal role in training stability and generalization. But a fundamental question has lingered: why are schedules like cosine annealing so effective? Is there something magical about these schedule functions?
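For concreteness, cosine annealing sweeps the learning rate from a maximum down to a minimum along half a cosine wave. A minimal sketch (the function name and parameters here are illustrative, not taken from any particular library):

```python
import math

def cosine_annealing_lr(t, T, lr_max, lr_min=0.0):
    """Cosine annealing: decay the learning rate from lr_max at t = 0
    to lr_min at t = T along half a cosine wave."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

# The rate starts at lr_max, crosses the midpoint at t = T/2,
# and reaches lr_min at t = T.
lrs = [cosine_annealing_lr(t, T=100, lr_max=0.1) for t in (0, 50, 100)]
# lrs is approximately [0.1, 0.05, 0.0]
```

The question the rest of the article addresses is why this particular shape, or any shape at all, should help.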
Constant Setup
Let \(n>1\) and \(\alpha[t], \mathbf{w}[t], \mathbf{g}[t], \Delta[t+1] \in \mathbb{R}^{n \times 1}\).
For each coordinate \(i\),
Let \(0 < \mu \le 1\) be a trust-region penalty constant that enforces the per-iteration subproblem
\[\boxed{ \trobjsca = \stepmom + \mu\,\stepcorrng }\]
Since
\(\frac{d^2\trobjsca}{d{}\alpha[t,i]^2} = \dengmom\),
the subproblem defined by \(\trobjsca\) is strongly convex in \(\alpha[t,i]\). Let the normalized gradient be \(\ngrad = \frac{\mathbf{g}[t,i]}{\dengmomsqrt}\), where \(\numngradcorr=1\).
Minimizing \(\trobjsca\) w.r.t. \(\alpha[t,i]\) gives
\[\alpha[t,i] = \frac{\mu}{\dengmomsqrt},\]
which is in the form
\[\alpha[t,i] = \mu\,\Phi[t,i], \quad \Phi[t,i] = \frac{1}{\dengmomsqrt} .\]
Iterative Setup
In general, to enforce the per-iteration subproblem, replace the trust-region constant \(\mu\) with an iterative form \(0 \le \mu[t] \le 1\), where
\(\mu[t]=\mu\,\digamma[t], \quad 0\le \digamma[t] \le 1\),
and \(0 < \mu \le 1\) is the trust-region penalty constant.
The trust-region objective becomes
\[\boxed{ \trobjsca = \stepmom + \mu[t]\,\stepcorrng }\]
Minimizing \(\trobjsca\) w.r.t. \(\alpha[t,i]\) gives
\[\alpha[t,i] = \frac{\mu[t]}{\dengmomsqrt},\]
which is in the form
\[\alpha[t,i] = \mu[t]\,\Phi[t,i].\]
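The iterative rule above can be sketched numerically. In the sketch below, the shrinking factor \(\digamma[t]\) is given a hypothetical half-cosine shape, and the per-coordinate factors \(\Phi[t,i]\) are made-up constants; neither choice is prescribed by the derivation.

```python
import math

def digamma_schedule(t, T):
    # Hypothetical choice of the shrinking factor digamma[t]: half a
    # cosine decaying from 1 to 0, so 0 <= digamma[t] <= 1 for 0 <= t <= T.
    return 0.5 * (1 + math.cos(math.pi * t / T))

def step_sizes(t, T, mu, phi):
    # Per-coordinate rule alpha[t, i] = mu[t] * Phi[t, i],
    # with mu[t] = mu * digamma[t] and 0 < mu <= 1.
    mu_t = mu * digamma_schedule(t, T)
    return [mu_t * p for p in phi]

# Made-up per-coordinate factors Phi[t, i] (in the derivation, each is
# the inverse square root of the subproblem's curvature).
phi = [0.8, 1.0, 0.5]
alphas_start = step_sizes(0, 100, mu=0.9, phi=phi)  # mu[0] = mu, full steps
alphas_end = step_sizes(100, 100, mu=0.9, phi=phi)  # mu[T] is ~0, steps vanish
```

Under this reading, a cosine-annealed learning rate is just one admissible choice of \(\digamma[t]\) in the trust-region subproblem, which is the sense in which the schedule function itself is not special.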