Annealing as a Variational Trust-Region Window Problem

Learning-Rate Schedules as Optimal Window Functions

Page created: Sep 11 2025 at 12:00 AM

Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation. Oluwasegun Somefun. “Annealing as a Variational Trust-Region Window Problem.” 2025.

Recall that momentum (Polyak, 1969) is a lowpass regularized stochastic gradient algorithm: it applies a first-order lowpass filter to its stochastic gradient before multiplying by a possibly optimal, iteration-dependent learning rate. Here, more specifically, we show how Adaptive Moment Estimation (Adam) and Momentum Orthogonalized with Newton-Schulz (Muon) encode the principle of trust-region optimal learning rates for the lowpass regularized stochastic gradient algorithm.

The final component of the framework reinterprets learning-rate schedules as extremal solutions to variational problems. Annealing is not a heuristic decay; it is a solution to the minimization of Dirichlet energy on a vanishing trust-region.

1. Introduction: From Heuristics to First-Principles Design

In the discipline of stochastic optimization, the training of deep neural networks has frequently been characterized by the adjustment of empirical parameters. Among these, the learning-rate schedule—the discrete-time function governing step-size attenuation—is often treated as a collection of heuristics lacking spectral justification. However, by adopting a signal-processing perspective, we may reinterpret these schedules not as arbitrary decay curves, but as optimal window functions designed to solve specific variational problems.

This analysis shifts the pedagogical focus from “tuning hyperparameters” to “designing signal-processing elements” that satisfy fundamental stability criteria. The core of this transition lies in addressing the systemic requirements of the discrete-time update.

The Vanishing Trust-Region Radius Problem

This foundational concept treats the learning process as a sequence of parameter updates constrained within a time-varying “trust region.” To ensure asymptotic convergence to a local optimum and dampen the influence of stochastic gradient noise, the radius of this region must systematically contract—or “vanish”—over a finite training window $\tau$. The problem constitutes a challenge in designing the trajectory of this contraction to maximize smoothness and preserve the exponential stability of the update dynamics.

The necessity for such stability leads us to a formal mathematical framework where the “shape” of the learning-rate decay is derived from the requirement to minimize specific mathematical energy functionals.

2. The Mathematical Foundation: The Trust-Region Framework

To derive a schedule from first principles, we consider the parameter update step

\[\Delta(t+1, i) = w(t+1, i) - w(t, i) = -a(t, i) g(t, i).\]

We define the iterative trust-region radius $\delta(t)$ such that its square represents the expected squared step size:

\[\delta^2(t) \equiv \mathbb{E}[\Delta^2(t+1, i)].\]

Under the assumption that the objective function is a log-likelihood, the gradient expectation satisfies

\[\mathbb{E}[g(t)] = 0.\]

Consequently, the recursion for the expected squared parameter error $\epsilon = w - w^*$ is expressed as

\[\mathbb{E}[\epsilon^2(t+1, i)] = \mathbb{E}[\epsilon^2(t, i)] + \mathbb{E}[\Delta^2(t+1, i)] + 2\mathbb{E}[\epsilon(t, i)\Delta(t+1, i)].\]

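Given that $\mathbb{E}[g(t)] = 0$ and $\Delta(t+1, i) = -a(t, i) g(t, i)$, the cross-term $2\mathbb{E}[\epsilon(t, i)\Delta(t+1, i)] = -2\mathbb{E}[a(t, i)\epsilon(t, i) g(t, i)]$ reduces to a correlation between the parameter error and the gradient. By minimizing this expected error subject to the trust-region constraint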

\[\mathbb{E}[\Delta^2(t+1, i)] \le \delta^2(t),\]

we define the normalized gradient

\[\bar{g}(t, i) = \frac{g(t, i)}{\sqrt{\mathbb{E}[g^2(t, i)]}}.\]

This leads to the core learning-rate formula, where the schedule is effectively the numerator of an adaptive moment estimation structure:

\[a(t, i) = \frac{\delta(t)}{\sqrt{\mathbb{E}[g^2(t, i)]}}.\]
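The learning-rate formula can be sketched in a few lines of Python. This is a minimal illustration, not the notes' implementation: the exponential moving average used to estimate $\mathbb{E}[g^2(t, i)]$, the decay factor `beta2`, the guard `eps`, and the toy gradient stream are all assumptions borrowed from common Adam-style practice.

```python
import math

def trust_region_rate(delta_t, second_moment, eps=1e-8):
    """a(t, i) = delta(t) / sqrt(E[g^2(t, i)]); eps (an assumption) guards against division by zero."""
    return delta_t / (math.sqrt(second_moment) + eps)

def ema_second_moment(prev, grad, beta2=0.999):
    """Hypothetical running estimate of E[g^2]: an exponential moving average of squared gradients."""
    return beta2 * prev + (1.0 - beta2) * grad * grad

# Toy run on a single coordinate i with a fixed, made-up gradient stream.
grads = [0.5, -2.0, 1.0, 4.0]
v = grads[0] ** 2      # initialise the estimate with the first squared gradient
delta = 0.01           # trust-region radius delta(t), held constant here for simplicity
steps = []
for g in grads:
    v = ema_second_moment(v, g)
    a = trust_region_rate(delta, v)
    steps.append(-a * g)   # Delta(t+1, i) = -a(t, i) * g(t, i)

print(steps)
```

Because the gradient is divided by an estimate of its own root-mean-square, the magnitude of each step is set mainly by $\delta(t)$, which is what lets the schedule act as the trust-region radius.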

For any schedule to be mathematically consistent over a learning window $\tau$, it must satisfy two boundary conditions:

Initial Condition: $\delta(0) = \mu$
Final Condition: $\delta(\tau - 1) = 0$

The specific functional form of $\delta(t)$ is determined by the choice of variational “energy” we seek to minimize over the interval $[0, \tau - 1]$.

3. Comparative Deep Dive: Three Optimal Window Functions

3.1 Linear Decay: The Minimal Variation Solution

Linear Decay is the unique minimizer of the Dirichlet Energy functional under the two boundary conditions. The functional measures the “roughness” of a signal as the sum of its squared successive differences.

Functional: $\sum_{t=1}^{\tau-1} (\delta(t) - \delta(t-1))^2$
Resulting Form: $\delta(t) = \mu\left(1 - \frac{t}{\tau-1}\right)$
Analysis:

This schedule provides a constant rate of change in the trust-region radius. It is the optimal path for a learner requiring the most consistent velocity across the training window, avoiding sudden accelerations in the parameter space.
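A short numerical check makes the minimal-variation claim concrete. The sketch below builds the linear schedule, evaluates the Dirichlet energy, and compares it with a copy whose endpoints are unchanged but whose midpoint is nudged; the window length `tau = 11`, the value `mu = 1.0`, and the size of the perturbation are arbitrary illustrative choices.

```python
def linear_decay(t, tau, mu=1.0):
    """delta(t) = mu * (1 - t / (tau - 1)) for t = 0, ..., tau - 1."""
    return mu * (1.0 - t / (tau - 1))

def dirichlet_energy(delta):
    """Sum of squared successive differences of the schedule."""
    return sum((delta[t] - delta[t - 1]) ** 2 for t in range(1, len(delta)))

tau, mu = 11, 1.0
linear = [linear_decay(t, tau, mu) for t in range(tau)]

# Same boundary values, but one interior point is moved off the straight line.
perturbed = list(linear)
perturbed[5] += 0.05

e_linear = dirichlet_energy(linear)        # equals mu**2 / (tau - 1) up to rounding
e_perturbed = dirichlet_energy(perturbed)  # strictly larger than e_linear

print(e_linear, e_perturbed)
```

Every step of the linear schedule has the same size $\mu/(\tau-1)$, so its energy is $\mu^2/(\tau-1)$; any interior perturbation makes one step shorter and its neighbour longer, which raises the sum of squares.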

3.2 Cosine Annealing: The Regularized Symmetry Solution

Cosine Annealing arises from Regularized Dirichlet Energy Minimization. By introducing a regularization term that penalizes the average squared magnitude, $\sum \frac{1}{2}(u^2(t) + u^2(t-1))$, we encode a requirement for symmetry across the window.

Functional: $\sum_{t=1}^{\tau-1} (\delta(t) - \delta(t-1))^2 - \lambda [\text{Symmetry Constraint}]$
Resulting Form: $\delta(t) = \mu \cos^2\left(\frac{\pi t}{2(\tau - 1)}\right)$
Analysis:

This schedule exhibits a slow initial decay, effectively preserving a larger trust region for longer durations to facilitate early-stage exploration. The contraction is fastest at the midpoint of the window and, by symmetry, slows again as it approaches the $\tau - 1$ boundary, where the slope of $\cos^2$ returns to zero.
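The shape of the cosine window can be inspected numerically. The sketch below evaluates the resulting form, checks it against the equivalent half-angle expression $\frac{\mu}{2}\left(1 + \cos\frac{\pi t}{\tau-1}\right)$, and records the per-step decrements together with the symmetry relation $\delta(t) + \delta(\tau - 1 - t) = \mu$. The values `tau = 11` and `mu = 1.0` are illustrative assumptions.

```python
import math

def cosine_annealing(t, tau, mu=1.0):
    """delta(t) = mu * cos^2(pi * t / (2 * (tau - 1)))."""
    return mu * math.cos(math.pi * t / (2 * (tau - 1))) ** 2

tau, mu = 11, 1.0
cosine = [cosine_annealing(t, tau, mu) for t in range(tau)]

# Equivalent half-angle form: (mu / 2) * (1 + cos(pi * t / (tau - 1))).
half_angle = [0.5 * mu * (1.0 + math.cos(math.pi * t / (tau - 1))) for t in range(tau)]

# Symmetry about the midpoint of the window: delta(t) + delta(tau - 1 - t) = mu.
symmetry_gap = max(abs(cosine[t] + cosine[tau - 1 - t] - mu) for t in range(tau))

# Per-step decrements delta(t-1) - delta(t), useful for seeing where the decay is fastest.
drops = [cosine[t - 1] - cosine[t] for t in range(1, tau)]

print(symmetry_gap, drops)
```

Note that the final value is zero only up to floating-point rounding, because `math.cos(math.pi / 2)` is not exactly zero.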

3.3 Square-root Linear Decay: The Squared-Variation Minimizer

When the Dirichlet energy minimization is applied to the squared trust-region radius $\delta^2(t)$ rather than $\delta(t)$, we arrive at the square-root linear form.

Functional: $\sum_{t=1}^{\tau-1} (\delta^2(t) - \delta^2(t-1))^2$
Resulting Form: $\delta(t) = \mu \sqrt{1 - \frac{t}{\tau - 1}}$
Analysis:

This is the most “aggressive” schedule. It maintains a large trust-region radius for the majority of the training duration before undergoing a high-velocity dive to zero at the final iterations, maximizing the “learning energy” available before the deadline.
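As a quick illustration, the sketch below confirms that the squared radius of this schedule decreases in equal steps, which is exactly the linear solution of the Dirichlet problem posed on $\delta^2(t)$, and shows how much radius is still available halfway through the window. The values `tau = 11` and `mu = 1.0` are illustrative assumptions.

```python
import math

def sqrt_linear_decay(t, tau, mu=1.0):
    """delta(t) = mu * sqrt(1 - t / (tau - 1))."""
    return mu * math.sqrt(1.0 - t / (tau - 1))

tau, mu = 11, 1.0
delta = [sqrt_linear_decay(t, tau, mu) for t in range(tau)]

# The squared radius is linear in t, so its successive decrements are all mu^2 / (tau - 1).
delta_sq = [d * d for d in delta]
sq_steps = [delta_sq[t - 1] - delta_sq[t] for t in range(1, tau)]

# Halfway through the window the radius is still mu * sqrt(1/2), about 0.707 * mu,
# whereas the linear schedule has already dropped to 0.5 * mu.
halfway = delta[(tau - 1) // 2]

print(sq_steps, halfway)
```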

4. Synthesis: Comparison of Optimal Schedules

| Schedule Name | Functional Form | Variational Objective | Learner Insight |
| --- | --- | --- | --- |
| Linear Decay | $\mu(1 - \frac{t}{\tau-1})$ | Dirichlet Energy on $\delta(t)$ | Most consistent decay; Dirichlet energy scales as $O((\tau-1)^{-1})$ |
| Cosine Annealing | $\mu \cos^2(\frac{\pi t}{2(\tau - 1)})$ | Regularized Dirichlet Energy | Symmetric exploration; ideal for non-convex landscapes requiring late-stage settling |
| Square-root Linear | $\mu \sqrt{1 - \frac{t}{\tau - 1}}$ | Dirichlet Energy on $\delta^2(t)$ | Aggressive resource allocation; maintains high learning power until the final contraction |

Thematic Takeaways

Variational Scaling: the minimized Dirichlet energy scales as $O((\tau - 1)^{-1})$ in the window length.

Forced Convergence: by satisfying $\delta(\tau - 1) = 0$, these schedules artificially enforce convergence, since the step size vanishes at the end of the window whether or not an optimum has been reached.

Spectral Regularization: every popular schedule is an extremal solution to a smoothness problem, implying that the choice of schedule is actually a choice of how we regularize the update trajectory of the discrete-time system.
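The $O((\tau - 1)^{-1})$ scaling can be checked empirically. The sketch below evaluates the Dirichlet energy of each schedule for several window lengths and multiplies it by $\tau - 1$; if the scaling holds, the product should stay roughly constant. It assumes $\mu = 1$ (the energy scales with $\mu^2$, or $\mu^4$ when taken on $\delta^2$), and the window lengths are arbitrary.

```python
import math

def linear(x):
    return 1.0 - x

def cosine(x):
    return math.cos(0.5 * math.pi * x) ** 2

def sqrt_linear(x):
    return math.sqrt(max(0.0, 1.0 - x))

def dirichlet_energy(proto, tau, power=1):
    """Energy of proto(t / (tau - 1)) ** power over t = 0, ..., tau - 1."""
    vals = [proto(t / (tau - 1)) ** power for t in range(tau)]
    return sum((vals[t] - vals[t - 1]) ** 2 for t in range(1, tau))

scaled = {}
for tau in (101, 201, 401):
    scaled[tau] = (
        (tau - 1) * dirichlet_energy(linear, tau),                # stays near 1
        (tau - 1) * dirichlet_energy(cosine, tau),                # stays near pi^2 / 8
        (tau - 1) * dirichlet_energy(sqrt_linear, tau, power=2),  # energy on delta^2, near 1
    )
    print(tau, scaled[tau])
```

The cosine constant $\pi^2/8 \approx 1.23$ is slightly larger than the linear constant of 1, consistent with linear decay being the minimizer of the unregularized energy on $\delta(t)$.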

5. Extending the Framework: Lipschitz-Smooth Prototypes and Interval Shifts

The three optimal schedules belong to a broader class of Smooth Prototype Functions $\phi(x)$. Any monotonically decaying function on the interval $[0, 1]$ where $\phi(0)=1,\quad \phi(1)=0$ can serve as a window, provided its first derivative is Lipschitz-continuous.

To transform these half-window prototypes into complex shapes used in modern architectures, we apply Interval Shifts:

Symmetric Mapping ($q_0$): $x = |2\tilde{x} - 1|$
Constant Regions ($q_1$): $x = \max\left(0, \frac{\tilde{x} - \tilde{\epsilon}}{1 - \tilde{\epsilon}}\right)$

By composing these transformations $q_1(q_0(\tilde{x}))$, we can rigorously derive the Trapezoidal and Tukey windows.
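The composition can be sketched directly. In the example below, the linear prototype $\phi(x) = 1 - x$ yields a trapezoidal window and the cosine prototype $\phi(x) = \cos^2(\pi x / 2)$ yields a Tukey-style tapered-cosine window. The grid size, the value $\tilde{\epsilon} = 0.5$, and the function names are illustrative assumptions; the sketch also assumes $\tilde{\epsilon} < 1$.

```python
import math

def q0(x_tilde):
    """Symmetric mapping: folds [0, 1] onto itself around the midpoint."""
    return abs(2.0 * x_tilde - 1.0)

def q1(x, eps_tilde):
    """Constant region: inputs below eps_tilde map to 0, the rest is rescaled to [0, 1]."""
    return max(0.0, (x - eps_tilde) / (1.0 - eps_tilde))

def linear_proto(x):
    return 1.0 - x

def cosine_proto(x):
    return math.cos(0.5 * math.pi * x) ** 2

def window(proto, x_tilde, eps_tilde):
    """Full window: prototype applied to the composed interval shift q1(q0(x_tilde))."""
    return proto(q1(q0(x_tilde), eps_tilde))

n = 21
grid = [k / (n - 1) for k in range(n)]
trapezoid = [window(linear_proto, x, 0.5) for x in grid]
tukey = [window(cosine_proto, x, 0.5) for x in grid]

print(trapezoid)
print(tukey)
```

With $\tilde{\epsilon} = 0.5$ the middle half of the window is held at 1 and the outer quarters ramp linearly (trapezoid) or follow a cosine taper (Tukey) down to 0 at both ends.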

6. The Impact on Learning Dynamics: Exponential Stability

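The stability of the parameter update is governed by the ratio of successive learning rates. Writing $d(t, i) = \sqrt{\mathbb{E}[g^2(t, i)]}$ for the denominator of the learning-rate formula, so that $a(t, i) = \delta(t)/d(t, i)$, this ratio is

\[r(t, i) = \frac{a(t, i)}{a(t-1, i)} = \frac{\delta(t)}{\delta(t-1)} \frac{d(t-1, i)}{d(t, i)}.\]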

As training approaches the window boundary,

\[\frac{\delta(t)}{\delta(t-1)} \to 0.\]

This vanishing ratio forces the update step

\[\Delta(t+1, i) \to 0,\]

artificially inducing Exponential Stability.
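The vanishing ratio can be tabulated for the three schedules. The sketch below computes only the schedule factor $\delta(t)/\delta(t-1)$, assuming the remaining factor $d(t-1, i)/d(t, i)$ stays equal to 1; that is a simplification for illustration, not a claim made in these notes. The window length `tau = 11` and $\mu = 1$ are also illustrative.

```python
import math

def linear(t, tau):
    return 1.0 - t / (tau - 1)

def cosine(t, tau):
    return math.cos(math.pi * t / (2 * (tau - 1))) ** 2

def sqrt_linear(t, tau):
    return math.sqrt(1.0 - t / (tau - 1))

def radius_ratios(schedule, tau):
    """delta(t) / delta(t-1) for t = 1, ..., tau - 1.

    delta(t-1) is strictly positive for every t <= tau - 1, so the division is safe.
    """
    return [schedule(t, tau) / schedule(t - 1, tau) for t in range(1, tau)]

tau = 11
ratios = {f.__name__: radius_ratios(f, tau) for f in (linear, cosine, sqrt_linear)}

for name, r in ratios.items():
    print(name, [round(v, 3) for v in r])
```

In each case the ratio stays below 1 throughout the window and reaches 0 (up to rounding for the cosine form) at the final step, which is the contraction that drives $\Delta(t+1, i)$ to zero.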

7. Conclusion: The Learner’s Toolkit

We have moved from viewing learning rates as hyperparameters to be tuned to viewing them as signal-processing windows to be designed.

Select Linear Decay for minimal variation.
Select Cosine Annealing for symmetric exploration and convergence.
Select Square-root Linear for aggressive late-stage lock-in.

Ultimately, these schedules serve as the mathematical anchors of the discrete-time learning system, ensuring that the trajectory toward the local minimum is both smooth and asymptotically stable.

Polyak, B. T. (1969). The Conjugate Gradient Method in Extremal Problems. USSR Computational Mathematics and Mathematical Physics, 9(4), 94–112. https://doi.org/10.1016/0041-5553(69)90035-4
