Trust-Region Optimal Learning Rate Principle
A Modular Principle
Page created: Sep 11 2025 at 12:00 AM
Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation. Oluwasegun Somefun. “Trust-Region Optimal Learning Rate Principle.” 2025.
Recall that momentum (Polyak, 1969) is a lowpass regularized stochastic gradient algorithm: it applies a first-order lowpass filter to its stochastic gradient before multiplying by a possibly optimal, iteration-dependent learning rate. Here, more specifically, we show how Adaptive Moment Estimation (Adam) and Momentum Orthogonalized with Newton–Schulz (Muon) encode the principle of trust-region optimal learning rates for the lowpass regularized stochastic gradient algorithm.
Scaling the blueprint up to the skyscraper level, we can see how these 1D insights power modern AI.
6. The “So What?”: Modular Design and Trust Regions
Modern optimizers like Adam and Muon implement two modular principles:
- Gradient Filtering: Designing \(\beta, \gamma\)
- Learning Rate Normalization: Scaling by \(1/\lambda\)
5. Modern Applications: Scaling the Concept (Adam & Muon)
This “filter + normalization” framework underlies modern optimizers.
Modular Design Principle
- Filter: \(v(t) = H_{\beta,\gamma}\{g(t)\}\)
- Normalize: \(w(t+1) = w(t) - \mu \lambda^{-1}(t) v(t)\)
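As a minimal sketch of the two modules, consider the 1D quadratic loss \(\tfrac{1}{2}\lambda w^2\), an assumption made here so that the gradient scale \(\lambda\) is known exactly. The function name `run_filtered_normalized` and the default values of `beta`, `gamma`, and `mu` are illustrative choices, not values prescribed by these notes.

```python
def run_filtered_normalized(lam, w0=1.0, beta=0.9, gamma=0.1, mu=1.0, steps=500):
    """Iterate w(t+1) = w(t) - mu * (1/lam) * v(t) on the loss 0.5 * lam * w**2.

    Assumption: lam is the (known) curvature, so the gradient is g(t) = lam * w(t).
    """
    w, v = w0, 0.0
    trajectory = [w]
    for _ in range(steps):
        g = lam * w                    # gradient of the 1D quadratic
        v = beta * v + gamma * g       # filter module: first-order lowpass
        w = w - mu * (1.0 / lam) * v   # normalization module: alpha = mu / lam
        trajectory.append(w)
    return trajectory


flat = run_filtered_normalized(lam=0.01)
steep = run_filtered_normalized(lam=100.0)
print(abs(flat[-1]), abs(steep[-1]))
```

Because the step is normalized by \(1/\lambda\), the two trajectories should coincide up to floating-point rounding, even though the curvatures differ by four orders of magnitude.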
Modern Optimizer Mapping
- Adam (Vectorized): performs coordinate-wise normalization, \(\alpha(i) = \frac{\mu}{\sqrt{\mathbb{E}[g^2(i)]}}\), enforcing the trust-region constraint \(\mathbb{E}[\Delta^2] \le \mu^2\).
- Muon (Matrix-Valued): uses matrix inverse square-root normalization, \(C^{-1/2}\), ensuring \(\|\Delta\|_F^2 \le \mu^2\).
Section Insight: The Trust-Region Radius
For Heavy-Ball, the stable learning rate lies roughly between the two values:
\[\frac{(1 \pm \sqrt{\beta})^2}{\lambda}.\]
Normalizing by \(1/\lambda\) defines a trust-region radius \(\mu\):
“Never take a step larger than this safe boundary.”
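To see where the two values \((1 \pm \sqrt{\beta})^2/\lambda\) come from, one can check the 1D quadratic case numerically. The sketch below assumes the classical heavy-ball recursion \(w(t+1) = w(t) - \alpha\lambda w(t) + \beta\,(w(t) - w(t-1))\), which corresponds to \(\gamma = 1\) in the filter notation; the helper name `heavy_ball_spectral_radius` is hypothetical. Between the two values the characteristic roots are complex with modulus \(\sqrt{\beta}\), so the contraction rate no longer depends on \(\lambda\).

```python
import cmath
import math


def heavy_ball_spectral_radius(alpha, lam, beta):
    """Largest root modulus of z**2 - (1 + beta - alpha*lam) * z + beta = 0."""
    p = 1.0 + beta - alpha * lam
    disc = cmath.sqrt(p * p - 4.0 * beta)
    return max(abs((p + disc) / 2.0), abs((p - disc) / 2.0))


beta, lam = 0.81, 4.0
lo = (1.0 - math.sqrt(beta)) ** 2 / lam   # lower endpoint
hi = (1.0 + math.sqrt(beta)) ** 2 / lam   # upper endpoint

for alpha in (lo, 0.5 * (lo + hi), hi):
    rho = heavy_ball_spectral_radius(alpha, lam, beta)
    print(f"alpha={alpha:.4f}  spectral radius={rho:.4f}  sqrt(beta)={math.sqrt(beta):.4f}")
```

Below the lower endpoint the roots become real and the larger one moves toward 1, which is the slow, overdamped regime.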
Takeaway 4: The Modular Blueprint for Adam and Muon
This perspective yields a modular design principle:
1. Filter Design
Choose \(\beta, \gamma\) to shape the gradient signal:
\[v(t) = \beta v(t-1) + \gamma g(t)\]
2. Learning-Rate Normalization
Apply a trust-region scaling:
\[\text{update} = \mu \cdot \lambda^{-1} \cdot v(t)\]
where:
- \(\lambda\) = scale of the gradient
- \(\mu\) = step-size budget
Adam (Coordinate-wise normalization)
\[\lambda_i = \sqrt{\mathbb{E}[g_i^2]}\]
Update:
\[\Delta x_i = \mu \frac{v_i}{\lambda_i}\]
- Normalizes each coordinate independently
- Enforces: \(\mathbb{E}[\Delta^2] \le \mu^2\)
Muon (Operator-level normalization)
Uses matrix scaling:
\[\Delta W = \mu \, (\mathbf{G}^\top \mathbf{G})^{-1/2} \, v\]
- Typically approximated via Newton–Schulz iteration
- Enforces Frobenius constraint: \(\|\Delta\|_F^2 \le \mu^2\)
Unified Insight
Both follow:
\[\text{update} = \mu \cdot \lambda^{-1} \cdot v(t)\]
The only difference is how \(\lambda\) is defined.
“New optimizers can be designed explicitly by (i) designing a gradient filter and (ii) choosing a learning-rate rule that encodes gradient normalization.”
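As an illustration of the coordinate-wise case, the sketch below estimates \(\mathbb{E}[g_i^2]\) with an exponential moving average and divides the filtered gradient by its square root. The decay `rho`, the small constant `eps`, and the function name `adam_style_step` are assumptions introduced for this example. Bias correction, which practical Adam implementations include, is omitted for brevity.

```python
import math


def adam_style_step(g, v, b, beta=0.9, gamma=0.1, rho=0.99, mu=0.01, eps=1e-12):
    """One coordinate-wise filter + normalize step; returns (delta, v, b).

    g, v, b are equal-length lists: gradient, filtered gradient, and the
    running second-moment estimate used as lambda_i**2.
    """
    v = [beta * vi + gamma * gi for vi, gi in zip(v, g)]             # filter
    b = [rho * bi + (1.0 - rho) * gi * gi for bi, gi in zip(b, g)]   # E[g_i^2] estimate
    lam = [math.sqrt(bi) + eps for bi in b]                          # lambda_i
    delta = [mu * vi / li for vi, li in zip(v, lam)]                 # mu * v_i / lambda_i
    return delta, v, b


# Two coordinates whose gradients differ in scale by four orders of magnitude.
g = [100.0, -0.01]
v, b = [0.0, 0.0], [0.0, 0.0]
for _ in range(2000):
    delta, v, b = adam_style_step(g, v, b)
print(delta)
```

With a constant gradient, both coordinates should end up with a step of magnitude close to `mu`, regardless of the raw gradient scale.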
4. The Learning-Rate Subsystem: Trust-Region Normalization
The second module of the framework is the learning-rate subsystem, which is strategically separated from the temporal smoothing of the gradient filter.
The Normalization Principle
We define the general update rule:
\[w(t+1) = w(t) - \alpha(t) v(t)\]
With normalization:
\[\alpha(t) = \mu \lambda^{-1}(t)\]
Where:
- \(\mu\) = Trust-Region Radius
- \(\lambda(t)\) = gradient scale
Strategic Distinction
- Temporal Smoothing (\(H(z)\)): Controls memory/history
- Spatial Scaling (\(\mu\)): Limits update magnitude
5. Multi-Scale Formulations: Scalar, Vectorized, and Matrix-Valued
The modular principle extends to higher dimensions.
Vectorized Formulation (Adam-style)
For \(w(t) \in \mathbb{R}^n\):
\[b(t,i) = \rho b(t-1,i) + (1-\rho) g^2(t,i)\]
Constraint:
\[\mathbb{E}[\Delta^2(t+1,i)] \le \mu^2\]
\[\mathbb{E}[\|\Delta(t+1)\|_2^2] \le n\mu^2\]
Matrix-Valued Formulation (Muon-style)
Using Frobenius geometry:
\[\langle A, B \rangle_F = \mathrm{tr}(A^T B)\]
Scale:
\[C(t) = v(t)v(t)^T\]
Normalized gradient:
\[\bar{v}(t) = C^{-1/2}(t) v(t)\]
Constraint:
\[\|\Delta(t+1)\|_F^2 = n\mu^2\]
Implementation via Iterative Methods
The matrix inverse square root is computed via Newton–Schulz iteration (Lakić polynomial family).
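As a rough sketch of how such an iteration can look, the code below applies the classic cubic Newton–Schulz iteration \(X \leftarrow \tfrac{3}{2}X - \tfrac{1}{2}XX^TX\) to a Frobenius-normalized matrix, which for a \(v\) of full row rank converges to \((vv^T)^{-1/2}v\). This variant is an assumption made for illustration: Muon implementations typically use a tuned higher-degree polynomial, and the helper names here are hypothetical. Plain Python lists are used to keep the example dependency-free.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(v, iters=20):
    """Approximate (v v^T)^{-1/2} v with the cubic Newton-Schulz iteration."""
    fro = math.sqrt(sum(x * x for row in v for x in row))
    x = [[e / fro for e in row] for row in v]   # singular values now lie in (0, 1]
    for _ in range(iters):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rp, rq)] for rp, rq in zip(x, xxt_x)]
    return x


mu = 0.1
v = [[2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]                  # filtered gradient, n = 2 rows
v_bar = newton_schulz_orthogonalize(v)
delta = [[mu * e for e in row] for row in v_bar]
sq_fro = sum(e * e for row in delta for e in row)
print(sq_fro, len(v) * mu ** 2)
```

The squared Frobenius norm of the resulting update should come out close to \(n\mu^2\), in line with the matrix-valued constraint stated above.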
Conclusion: A New Era of Modular Optimizers
We are entering an era of optimizer engineering.
Instead of relying on physical metaphors, we can:
- Treat gradients as signals
- Design filters explicitly
- Control transient and steady-state behavior
Momentum is no longer a heuristic; it is a structured signal-processing operator.
Nesterov works because it is an optimal pole-zero configuration for transient suppression.
Reality Check
These insights are derived under simplifying assumptions:
- 1D quadratic loss
- Stationary gradients
- Linear dynamics
Real deep learning systems are:
- High-dimensional
- Non-stationary
- Heavy-tailed
Thus:
- Structural roles (filtering + normalization) generalize
- Optimal parameter values do not
6. Conclusion: Your New Perspective on Learning
You have moved beyond seeing AI training as a black box of “weights and biases.” You now understand it as a sophisticated signal processing task. Momentum is the heartbeat of this process—a digital filter that separates the signal of the “truth” from the noise of the data.
3-Point Takeaway
- Momentum is a Filter: It is a first-order lowpass filter designed to attenuate high-frequency noise and preserve the underlying signal.
- Poles and Zeros are the Keys: The difference between Heavy-Ball and Nesterov is determined by \(\gamma\).
- Stability is the True Goal: Momentum expands the stability region, allowing aggressive learning rates while maintaining safe updates within a trust region \(\mu\).
Open Question
If momentum is just a first-order filter, what about:
- Higher-order filters: \(H(z) = \frac{b_0 + b_1 z^{-1} + \cdots}{1 + a_1 z^{-1} + \cdots}\)
- Resonant systems
- Nonlinear filters
👉 What unexplored filter designs could unlock the next generation of optimizers?
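As a starting point for experimenting with the higher-order case, a general rational filter \(H(z)\) can be applied to a gradient sequence with a direct-form difference equation. The sketch below is only an illustration: the function name `iir_filter` and the coefficient layout (`b` for the numerator, `a` for the denominator with `a[0] = 1`) are conventions assumed here, and nothing in it says which coefficients would make a good optimizer.

```python
def iir_filter(g_seq, b, a):
    """Apply H(z) = (b[0] + b[1] z^-1 + ...) / (1 + a[1] z^-1 + ...) to g_seq.

    Difference equation: v(t) = sum_k b[k] g(t-k) - sum_{k>=1} a[k] v(t-k).
    """
    assert a[0] == 1.0, "denominator is assumed to be normalized so a[0] = 1"
    v_seq = []
    for t in range(len(g_seq)):
        acc = sum(b[k] * g_seq[t - k] for k in range(len(b)) if t - k >= 0)
        acc -= sum(a[k] * v_seq[t - k] for k in range(1, len(a)) if t - k >= 0)
        v_seq.append(acc)
    return v_seq


# First-order special case: v(t) = beta * v(t-1) + gamma * g(t).
beta, gamma = 0.9, 0.1
g_seq = [1.0, 1.0, 1.0, 1.0]
print(iir_filter(g_seq, b=[gamma], a=[1.0, -beta]))
```

With `b = [gamma]` and `a = [1, -beta]` this reduces to the first-order momentum filter, so longer coefficient lists can be swapped in without changing the rest of the update.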
- Polyak, B. T. (1969). The Conjugate Gradient Method in Extremal Problems. USSR Computational Mathematics and Mathematical Physics, 9(4), 94–112. https://doi.org/10.1016/0041-5553(69)90035-4