Trust-Region Optimal Learning Rate Principle

A Modular Principle

Page created: Sep 11 2025 at 12:00 AM

Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation. Oluwasegun Somefun. “Trust-Region Optimal Learning Rate Principle.” 2025.

Recall that momentum (Polyak, 1969) is a lowpass regularized stochastic gradient algorithm. One that applies a first-order lowpass filter to its stochastic gradient, before multiplying by a possibly optimal iteration-dependent learning rate. Here, more specifically, we show how Adaptive Moment Estimation (Adam) and Momentum Orthogonalized with Newton-Schultz (Muon) encode the principle of trust-region optimal learning rates for the lowpass regularized stochastic gradient algorithm.

Transitional Sentence: Scaling the blueprint up to the skyscraper level, we can see how these 1D insights power modern AI.

6. The “So What?”: Modular Design and Trust Regions

Modern optimizers like Adam and Muon implement two modular principles:

Gradient Filtering: Designing \(\beta, \gamma\)
Learning Rate Normalization: Scaling by \(1/\lambda\)

5. Modern Applications: Scaling the Concept (Adam & Muon)

This “filter + normalization” framework underlies modern optimizers.

Modular Design Principle

Filter \(v(t) = H_{\beta,\gamma}\{g(t)\}\)
Normalize \(w(t+1) = w(t) - \mu \lambda^{-1}(t) v(t)\)

Modern Optimizer Mapping

Adam (Vectorized)
Performs coordinate-wise normalization: \(\alpha(i) = \frac{\mu}{\sqrt{\mathbb{E}[g^2(i)]}}\)
enforcing: \(\mathbb{E}[\Delta^2] \le \mu^2\)
Muon (Matrix-level)
Uses matrix inverse square-root normalization: \(C^{-1/2}\)
ensuring: \(\|\Delta\|_F^2 \le \mu^2\)

Modern Optimizer Mapping

Adam (Vectorized):

\[\alpha(i) = \frac{\mu}{\sqrt{\mathbb{E}[g^2(i)]}}\]

Ensures trust-region constraint:

\[\mathbb{E}[\Delta^2] \le \mu^2.\]

Muon (Matrix-Valued):

Uses matrix normalization:

\[C^{-1/2}\]

to ensure:

\[\|\Delta\|_F^2 \le \mu^2.\]

Section Insight: The Trust-Region Radius

For Heavy-Ball, the stable learning rate lies roughly in:

\[\frac{(1 \pm \sqrt{\beta})^2}{\lambda}.\]

Normalizing by \(1/\lambda\) defines a trust-region radius \(\mu\):

“Never take a step larger than this safe boundary.”

Takeaway 4: The Modular Blueprint for Adam and Muon

This perspective yields a modular design principle:

1. Filter Design

Choose \(\beta, \gamma\) to shape the gradient signal:

\[v(t) = \beta v(t-1) + \gamma g(t)\]

2. Learning-Rate Normalization

Apply a trust-region scaling:

\[\text{update} = \mu \cdot \lambda^{-1} \cdot v(t)\]

where:

\(\lambda\) = scale of the gradient
\(\mu\) = step-size budget

Adam (Coordinate-wise normalization)

\[\lambda_i = \sqrt{\mathbb{E}[g_i^2]}\]

Update:

\[\Delta x_i = \mu \frac{v_i}{\lambda_i}\]

Normalizes each coordinate independently
Enforces:

\[\|\Delta x_i\|^2 \le \mu^2\]

Muon (Operator-level normalization)

Uses matrix scaling:

\[\Delta W = \mu \, (\mathbf{G}^\top \mathbf{G})^{-1/2} \, v\]

Typically approximated via Newton–Schulz iteration
Enforces Frobenius constraint:

\[\|\Delta W\|_F^2 \le \mu^2\]

Unified Insight

Both follow:

\[\text{update} = \mu \cdot \lambda^{-1} \cdot v(t)\]

The only difference is how \(\lambda\) is defined.

“New optimizers can be designed explicitly by (i) designing a gradient filter and (ii) choosing a learning-rate rule that encodes gradient normalization.”

4. The Learning-Rate Subsystem: Trust-Region Normalization

The second module of the framework is the learning-rate subsystem, which is strategically separated from the temporal smoothing of the gradient filter.

The Normalization Principle

We define the general update rule:

\[w(t+1) = w(t) - \alpha(t) v(t)\]

With normalization:

\[\alpha(t) = \mu \lambda^{-1}(t)\]

Where:

\(\mu\) = Trust-Region Radius
\(\lambda(t)\) = gradient scale

Strategic Distinction

Temporal Smoothing (\(H(z)\)): Controls memory/history
Spatial Scaling (\(\mu\)): Limits update magnitude

5. Multi-Scale Formulations: Scalar, Vectorized, and Matrix-Valued

The modular principle extends to higher dimensions.

Vectorized Formulation (Adam-style)

For \(w(t) \in \mathbb{R}^n\):

\[b(t,i) = \rho b(t-1,i) + (1-\rho) g^2(t,i)\]

Constraint:

\[\mathbb{E}[\Delta^2(t+1,i)] \le \mu^2\] \[\mathbb{E}[\|\Delta(t+1)\|_2^2] \le n\mu^2\]

Matrix-Valued Formulation (Muon-style)

Using Frobenius geometry:

\[\langle A, B \rangle_F = \mathrm{tr}(A^T B)\]

Scale:

\[C(t) = v(t)v(t)^T\]

Normalized gradient:

\[\bar{v}(t) = C^{-1/2}(t) v(t)\]

Constraint:

\[\|\Delta(t+1)\|_F^2 = n\mu^2\]

Implementation via Iterative Methods

Matrix inverse square root computed via Newton–Schulz iteration (Lakić polynomial family).

Conclusion: A New Era of Modular Optimizers

We are entering an era of optimizer engineering.

Instead of relying on physical metaphors, we can:

Treat gradients as signals
Design filters explicitly
Control transient and steady-state behavior

Momentum is no longer a heuristic—it is:

\[\text{a structured signal-processing operator}\]

Nesterov works because it is an optimal pole-zero configuration for transient suppression.

Reality Check

These insights are derived under simplifying assumptions:

1D quadratic loss
Stationary gradients
Linear dynamics

Real deep learning systems are:

High-dimensional
Non-stationary
Heavy-tailed

Thus:

Structural roles (filtering + normalization) generalize
Optimal parameter values do not

6. Conclusion: Your New Perspective on Learning

You have moved beyond seeing AI training as a black box of “weights and biases.” You now understand it as a sophisticated signal processing task. Momentum is the heartbeat of this process—a digital filter that separates the signal of the “truth” from the noise of the data.

3-Point Takeaway

Momentum is a Filter
It is a first-order lowpass filter designed to attenuate high-frequency noise and preserve the underlying signal.
Poles and Zeros are the Keys
The difference between Heavy-Ball and Nesterov is determined by: \(\gamma\)
Stability is the True Goal
Momentum expands the stability region, allowing aggressive learning rates while maintaining safe updates within a trust region: \(\mu\)

Open Question

If momentum is just a first-order filter, what about:

Higher-order filters: \(H(z) = \frac{b_0 + b_1 z^{-1} + \cdots}{1 + a_1 z^{-1} + \cdots}\)
Resonant systems
Nonlinear filters

👉 What unexplored filter designs could unlock the next generation of optimizers?

Polyak, B. T. (1969). The Conjugate Gradient Method in Extremal Problems. USSR Computational Mathematics and Mathematical Physics, 9(4), 94–112. https://doi.org/10.1016/0041-5553(69)90035-4