Why Cosine Annealing Works, and Why We Don't Actually Need It
Page created: Oct 25 2025 at 12:00 AM
Learning rate schedules are used to coax better training performance out of deep learning models, and they play a pivotal role in training stability and generalization. But a fundamental question has lingered: why are schedules like cosine annealing so effective? Is there something magical about these schedule functions?
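For concreteness, cosine annealing sweeps the learning rate from a maximum down to a minimum along half a cosine wave. A minimal sketch (the function name and parameters here are illustrative, not taken from any particular library):

```python
import math

def cosine_annealing_lr(t, T, lr_max, lr_min=0.0):
    """Cosine annealing: decay the learning rate from lr_max at t = 0
    to lr_min at t = T along half a cosine wave."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

# The rate starts at lr_max, crosses the midpoint at t = T/2,
# and reaches lr_min at t = T.
lrs = [cosine_annealing_lr(t, T=100, lr_max=0.1) for t in (0, 50, 100)]
# lrs is approximately [0.1, 0.05, 0.0]
```

The question the rest of the article addresses is why this particular shape, or any shape at all, should help.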
Constant Setup
Let \(n>1\) and \(\alpha[t], \mathbf{w}[t], \mathbf{g}[t], \Delta[t+1] \in \mathbb{R}^{n \times 1}\).
For each coordinate \(i\),
Let \(0 < \mu \le 1\) be a trust-region penalty constant that enforces the per-iteration subproblem
\[\boxed{ \trobjsca = \stepmom + \mu\,\stepcorrng }\]
Since
\(\frac{d^2\trobjsca}{d{}\alpha[t,i]^2} = \dengmom\),
the subproblem defined by \(\trobjsca\) is strongly convex in \(\alpha[t,i]\). Let the normalized gradient be \(\ngrad = \frac{\mathbf{g}[t,i]}{\dengmomsqrt}\), where \(\numngradcorr=1\).
Minimizing \(\trobjsca\) w.r.t. \(\alpha[t,i]\) gives
\[\alpha[t,i] = \frac{\mu}{\dengmomsqrt},\]
which is in the form
\[\alpha[t,i] = \mu\,\Phi[t,i], \quad \Phi[t,i] = \frac{1}{\dengmomsqrt} .\]
Iterative Setup
In general, to enforce the per-iteration subproblem, replace the trust-region constant \(\mu\) with an iterative form \(0 \le \mu[t] \le 1\), where
\(\mu[t]=\mu\,\digamma[t], \quad 0\le \digamma[t] \le 1\),
and \(0 < \mu \le 1\) is the trust-region penalty constant.
The trust-region objective becomes
\[\boxed{ \trobjsca = \stepmom + \mu[t]\,\stepcorrng }\]
Minimizing \(\trobjsca\) w.r.t. \(\alpha[t,i]\) gives
\[\alpha[t,i] = \frac{\mu[t]}{\dengmomsqrt},\]
which is in the form
\[\alpha[t,i] = \mu[t]\,\Phi[t,i].\]
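The iterative rule above can be sketched numerically. In the sketch below, the shrinking factor \(\digamma[t]\) is given a hypothetical half-cosine shape, and the per-coordinate factors \(\Phi[t,i]\) are made-up constants; neither choice is prescribed by the derivation.

```python
import math

def digamma_schedule(t, T):
    # Hypothetical choice of the shrinking factor digamma[t]: half a
    # cosine decaying from 1 to 0, so 0 <= digamma[t] <= 1 for 0 <= t <= T.
    return 0.5 * (1 + math.cos(math.pi * t / T))

def step_sizes(t, T, mu, phi):
    # Per-coordinate rule alpha[t, i] = mu[t] * Phi[t, i],
    # with mu[t] = mu * digamma[t] and 0 < mu <= 1.
    mu_t = mu * digamma_schedule(t, T)
    return [mu_t * p for p in phi]

# Made-up per-coordinate factors Phi[t, i] (in the derivation, each is
# the inverse square root of the subproblem's curvature).
phi = [0.8, 1.0, 0.5]
alphas_start = step_sizes(0, 100, mu=0.9, phi=phi)  # mu[0] = mu, full steps
alphas_end = step_sizes(100, 100, mu=0.9, phi=phi)  # mu[T] is ~0, steps vanish
```

Under this reading, a cosine-annealed learning rate is just one admissible choice of \(\digamma[t]\) in the trust-region subproblem, which is the sense in which the schedule function itself is not special.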