Optimal Learning Rate Functions

Minimizing Trust-region Subproblems.

Page created: Sep 11 2025 at 12:00 AM

Oluwasegun Somefun. “AutoSGM: Trust-region Optimal Learning rates.” AutoSGM Framework, 2025.

Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation.


Table of contents
  1. Optimal Learning Rate Functions
    1. Notes
    2. Optimal Iteration-Dependent Learning Rate Oracles
      1. Per-coordinate Setup
      2. Matrix Setup
      3. with Smoothed Gradients
        1. Per-coordinate Setup
        2. Matrix Setup
        3. Matrix Inverse Square-root (Realizations)
          1. Full Eigenvalue Decomposition (EVD)
          2. Newton-Schulz Iteration

AutoSGM reframes stochastic gradient algorithms used in deep learning optimization through the lens of a first-order lowpass filter applied to a stochastic gradient, and the existence of an optimal iteration-dependent learning rate choice.

We show here that the automatic learning rate perspective in the AutoSGM framework captures several stochastic gradient learning variants that have been viewed as preconditioning methods.


Notes

We assume both the objective function \(f\) and its gradient are Lipschitz continuous (Bottou et al., 2018).

For mathematical convenience in deriving optimal learning rates, we restrict the objective to a log-likelihood, \(f=\ln p(\mathbf{w})\). Let \(\mathbb{E}\) denote expectation with respect to the model distribution \(p(\mathbf{w})\). Under this model, the expected gradient w.r.t \(\mathbf{w}\) is zero (Moon & Stirling, 2000; Van Trees et al., 2013).

In practice, many widely used training objectives admit log‑likelihood interpretations but differ from this simplified model.

Define \(\mathbf{\varepsilon}[t] = \mathbf{w}[t] - {\mathbf{w}}^\star\) as the parameter error: the gap between the current weights and a local optimum.

In practice, we can realize the expectations in the derived learning rate functions using time-averages via exponential moving averages (EMA) with long-term memory (Diniz, 2020; Haykin, 2014).
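As a concrete sketch of this realization, the following example tracks \(\mathbb{E}[\mathbf{g}^2]\) with a debiased EMA; the smoothing constant \(\beta\) and the synthetic gradient stream are illustrative assumptions, not part of the framework itself.

```python
import numpy as np

def ema_update(m, x, beta, t):
    """One EMA step with bias correction (long-term memory form).

    m    : running EMA state (same shape as x)
    x    : new sample whose expectation we are tracking
    beta : smoothing constant; larger beta means longer memory
    t    : 1-based iteration count, used to debias the cold start
    """
    m = beta * m + (1.0 - beta) * x
    m_hat = m / (1.0 - beta ** t)  # debiased estimate of E[x]
    return m, m_hat

# Track E[g^2] over a stream of noisy gradients g ~ N(0.5, 1),
# so the true second moment is 0.5^2 + 1 = 1.25 per coordinate.
rng = np.random.default_rng(0)
m = np.zeros(3)
for t in range(1, 1001):
    g = rng.normal(loc=0.5, size=3)  # stand-in stochastic gradient
    m, m_hat = ema_update(m, g * g, beta=0.99, t=t)
```

The bias correction only matters during the cold start; for large \(t\) the raw EMA state and the debiased estimate coincide.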


Optimal Iteration-Dependent Learning Rate Oracles

Per-coordinate Setup

Let \(n>1\), \(\alpha[t], \mathbf{w}[t], \mathbf{g}[t], \Delta[t+1] \in \mathbb{R}^{n \times 1}\)

For each coordinate \(i\),

\[\boxed{ {\mathbf{w}[t+1, i]} = {\mathbf{w}[t,i]} + {\Delta[t+1,i]} }\]
\[\boxed{ \step }\]

Let \(0 < \mu \le 1\) be a trust-region penalty constant that enforces the per-iteration subproblem

\[\boxed{ \trobjsca = \stepmom + \mu\,\stepcorrng }\]

Since

\(\frac{d^2\trobjsca}{d{}\alpha[t,i]^2} = \dengmom\),

the subproblem defined by \(\trobjsca\) is strongly convex in \(\alpha[t,i]\). Let the normalized gradient be \(\ngrad = \frac{\mathbf{g}[t,i]}{\dengmomsqrt}\), where \(\numngradcorr=1\).

Minimizing \(\trobjsca\) w.r.t \(\alpha[t,i]\) gives

$$ \boxed{ \begin{align}\label{lrmom} \alpha[t,i] = \mu\,\frac{\numngradcorr}{\dengmomsqrt} = \mu\,\frac{1}{\dengmomsqrt} \end{align} } $$

Realizing this learning rate involves estimating the gradient’s moment at each iteration.


More generally,

\[\boxed{ \trobjsca = \stepmom + \mu\,\stepcorru }\]

So, instead of \(\mathbf{u}[t,i] \triangleq \ngrad\) as above, let \(\mathbf{u}[t,i] \triangleq \mathbf{w}[t,i]\).

Minimizing \(\trobjsca\) w.r.t \(\alpha[t,i]\) gives

$$ \begin{align} \label{lrparcor} \boxed{ \alpha[t,i] = \mu\,\frac{\numwgcorr}{\dengmomsqrt} } \end{align} $$

Realizing this learning rate involves estimating the gradient’s moment, and the weight-gradient partial correlation at each iteration. It turns out that minimizing the mean squared parameter error \(\mathbb{E}[\mathbf{\varepsilon}^2[t+1]]\) yields this same iteration-dependent optimal learning rate with \(\mu=1\).


The derived optimal learning rates for the two trust-region subproblems are instances of a general iteration-dependent learning-rate oracle of the form

\[\alpha[t,i] = \mu\,\Phi[t,i], \quad \Phi[t,i] = \frac{\mathbf{a}[t,i]}{\mathbf{d}[t,i]}.\]

In general, we can replace the trust-region constant \(\mu\) with an iterative form \(0 \le \mu[t] \le 1\), where \(\mu[t]=\mu\,\digamma[t]\), and \(0\le \digamma[t] \le 1\).


\[\boxed{ \begin{aligned} \mathbf{w}[t+1,i] = \mathbf{w}[t,i] -\alpha[t,i]\,\mathbf{g}[t,i] \end{aligned} }\]
$$ \begin{aligned} &\mathbf{m}[t, i] = \dengmom\\ &\mathbf{d}[t, i] = {\mathbf{m}[t]}^{\frac{1}{2}}\\ &\ngrad = {\mathbf{g}[t,i]}/{\mathbf{d}[t, i]} \\ &\boxed{ \mathbf{a}[t, i] =\begin{cases} & 1 \\ & \numngradcorr \\ & \numwgcorr \end{cases} }\\ &\mathbf{w}[t+1,i] = \mathbf{w}[t,i] - \mu[t]\, \mathbf{a}[t, i] \,\ngrad \end{aligned} $$
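A minimal sketch of the boxed per-coordinate recursion for the \(\mathbf{a}[t,i]=1\) case, with the moment \(\mathbf{m}[t,i]\) realized as an EMA; the constants (\(\mu\), \(\beta\), and the small \(\epsilon\) guarding the division) are illustrative assumptions, not prescribed by the derivation.

```python
import numpy as np

def percoord_step(w, g, m, mu, beta=0.999, eps=1e-8):
    """One per-coordinate update w <- w - mu * g / sqrt(E[g^2]),
    with E[g^2] realized as an EMA (the a[t,i] = 1 case).

    Names follow the boxed algorithm above; eps is an implementation
    detail guarding against division by zero, not in the derivation.
    """
    m = beta * m + (1.0 - beta) * g * g   # m[t,i] ~ E[g[t,i]^2]
    d = np.sqrt(m) + eps                  # d[t,i] = m[t,i]^(1/2)
    g_hat = g / d                         # normalized gradient
    w = w - mu * g_hat                    # trust-region step, a = 1
    return w, m

# Minimize f(w) = 0.5 * ||w||^2 (so g = w) from w = [2, -3].
w = np.array([2.0, -3.0])
m = np.zeros_like(w)
for _ in range(500):
    w, m = percoord_step(w, w, m, mu=0.05, beta=0.9)
```

On this quadratic the normalized step behaves like a sign-gradient step of size \(\mu\), so the iterate settles into a small neighborhood of the optimum whose radius scales with \(\mu\).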

Matrix Setup

Now, in this setup, let \(n, m>1\) and assume \(\mathbf{w}[t], \mathbf{g}[t], \Delta[t+1] \in \mathbb{R}^{n \times m}\), while \(\alpha[t] \in \mathbb{R}^{n \times n}\) is symmetric.

\[\boxed{ \matstep }\] \[\boxed{ \trobjmat = \gmatstepmom + \mu\,\gmatstepcorrng }\]

Rewriting,

\[\boxed{ \trobjmat = \matstepmom + \mu\,\matstepcorrng }\]

Since

\(\frac{d^2\mkern1pt\trobjmat}{d\mkern1pt\alpha[t]^2} = \matdengmom \ge 0\),

the subproblem defined by \(\trobjmat\) is convex in \(\alpha[t]\).

Let the normalized gradient be \(\nmatgrad = {\matdengmomsqrt}{\mathbf{g}[t]}\), where \(\matngradcorr=\mathbf{I}_{n\times n}\).

Minimizing \(\trobjmat\) w.r.t \(\alpha[t]\) gives

$$ \boxed{ \alpha[t] = \mu\mkern1pt{\matngradcorr}\mkern1pt{\matdengmomsqrt} = \mu\,{\matdengmomsqrt} } $$

Similarly, minimizing

\[\boxed{ \trobjmat = \matstepmom + \mu\,\matstepcorru }\]

where, instead of \(\mathbf{u}[t] \triangleq \nmatgrad\) as above, we have \(\mathbf{u}[t] \triangleq \mathbf{w}[t]\),

leads to

$$ \boxed{ \alpha[t] = \mu\mkern1pt{\matwngradcorr}\mkern1pt{\matdengmomsqrt} } $$

The derived optimal learning rate matrices for the two trust-region subproblems are instances of a general iteration-dependent learning-rate oracle of the form

\[\alpha[t] = \mu\,\Phi[t], \quad \Phi[t] = {\mathbf{a}[t]}\mkern1mu{\bar{\mathbf{d}}[t]} \in \mathbb{R}^{n \times n}.\]
\[\boxed{ \begin{aligned} \mathbf{w}[t+1] = \mathbf{w}[t] -\alpha[t]\,\mathbf{g}[t] \end{aligned} }\]
$$ \begin{aligned} &\mathbf{m}[t] = \matdengmom\\ &\bar{\mathbf{d}}[t] = \mathbf{m}[t]^{\text{-}\frac{1}{2}}\\ &\nmatgrad = \bar{\mathbf{d}}[t] \mathbf{g}[t]\\ &\boxed{ \mathbf{a}[t] = \begin{cases} & 1 \\ & \matwngradcorr \end{cases} }\\ &\mathbf{w}[t+1] = \mathbf{w}[t] - \mu[t]\,\mathbf{a}[t]\,\nmatgrad \end{aligned} $$
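The boxed matrix recursion can be sketched as follows for the \(\mathbf{a}[t]=1\) case, with \(\mathbf{m}[t] \approx \mathbb{E}[\mathbf{g}\mathbf{g}^T]\) realized by an EMA and the inverse square root computed by eigendecomposition; \(\mu\), \(\beta\), and the \(\epsilon\)-regularization are illustrative assumptions.

```python
import numpy as np

def matrix_oracle_step(w, g, m, mu, beta=0.999, eps=1e-6):
    """One matrix-oracle update (the a[t] = 1 case of the box above):
    m ~ E[g g^T] via an EMA, d_bar = m^(-1/2) via eigendecomposition,
    then w <- w - mu * d_bar @ g.  eps regularizes near-singular m."""
    m = beta * m + (1.0 - beta) * (g @ g.T)
    lam, U = np.linalg.eigh(m + eps * np.eye(m.shape[0]))
    d_bar = (U / np.sqrt(lam)) @ U.T      # m^(-1/2), symmetric n x n
    w = w - mu * (d_bar @ g)
    return w, m

# Feed anisotropic gradient noise g = S z, z ~ N(0, I); the EMA state m
# should then approach the true second moment E[g g^T] = S S^T.
rng = np.random.default_rng(0)
n = 4
S = rng.normal(size=(n, n))
w = np.zeros((n, 1))
m = np.eye(n)
for _ in range(3000):
    g = S @ rng.normal(size=(n, 1))
    w, m = matrix_oracle_step(w, g, m, mu=0.01)
```

Because the normalized gradient \(\bar{\mathbf{d}}[t]\,\mathbf{g}[t]\) is approximately whitened, this sketch makes the preconditioning interpretation of the matrix oracle explicit.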

with Smoothed Gradients

Per-coordinate Setup

Replace the stochastic gradient with a smoothed version:

\[\boxed{ \stepv }\]

where the smoothed gradient \(\mathbf{v}[t,i] = \fof\{\mathbf{g}[t,i]\}\), and \(\fof\) is a first-order filter with the transfer function:

\[H(z) = \eta \, \frac{1 - \gamma z^{-1}}{1 - \beta z^{-1}}, \quad 0 \le \beta < 1, \ \gamma < \beta\]

where \(\beta\) and \(\gamma\) are the tunable pole and zero parameters of the first-order filter. The multiplier \(\eta\) is a function of \(\beta\) and \(\gamma\) chosen to ensure that \(\mathbf{v}[t,i]\) and \(\mathbf{g}[t,i]\) have the same mean value, \(\expv = \expg\). Also, by the mean-squared input–output relation of an LTI system with impulse response \(h[k]\) (Papoulis, 1977),

\[\dengmomv \propto \zeta_{\beta,\gamma}\, \dengmom .\]

where \(\zeta_{\beta,\gamma}\) is a variance-reduction constant whose maximum equals the energy of the filter's impulse response: \(\zeta_{\beta,\gamma} \le \sum_{k=0}^{\infty} h[k]^2 = {\eta^2}\frac{1+\gamma^2-2\gamma\beta}{(1-\beta^2)}\).
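The impulse-response energy expression above can be checked numerically. This sketch takes \(\eta = (1-\beta)/(1-\gamma)\), one choice consistent with unit DC gain (so that \(\mathbf{v}\) and \(\mathbf{g}\) share the same mean); the specific \(\beta, \gamma\) values are illustrative.

```python
import numpy as np

# First-order filter H(z) = eta * (1 - gamma z^-1) / (1 - beta z^-1).
beta, gamma = 0.9, 0.1
eta = (1.0 - beta) / (1.0 - gamma)       # unit DC gain: H(1) = 1

# Impulse response: h[0] = eta, h[k] = eta * (beta - gamma) * beta**(k-1).
K = 4000                                 # long enough that beta**K ~ 0
h = np.empty(K)
h[0] = eta
h[1:] = eta * (beta - gamma) * beta ** np.arange(K - 1)

energy = np.sum(h ** 2)                  # sum_k h[k]^2
closed_form = eta ** 2 * (1 + gamma ** 2 - 2 * gamma * beta) / (1 - beta ** 2)
```

Under this choice of \(\eta\), the DC gain \(\sum_k h[k]\) also evaluates to one, which is what matches the means of \(\mathbf{v}\) and \(\mathbf{g}\).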

Repeating the learning-rate derivation with \(\mathbf{v}[t,i]\),

\[\boxed{ \trobjsca = \stepmom + \mu\,\stepcorrngv }\]

yields

\[\alpha[t,i] = \mu\,\frac{1}{\sqrt{\dengmomv}} = \mu^{\prime}\,\frac{1}{\sqrt{\dengmom}}.\]

The variance-reduction factor is a scaling constant, so it can be removed by absorbing it into the trust-region constant, giving the rescaled trust-region constant:

\[\mu^{\prime} \triangleq \frac{\mu}{\sqrt{\zeta}} \in (0, 1).\]

Therefore, the derived per-coordinate learning rate retains the same functional form (after absorbing the variance-reduction factor into a \(\mu \in (0, 1)\)):

$$ \boxed{ \alpha[t,i] = \mu\,\frac{1}{\sqrt{\dengmom}} } $$

\[\boxed{ \begin{aligned} \mathbf{w}[t+1,i] = \mathbf{w}[t,i] -\alpha[t,i]\,\mathbf{v}[t,i] \end{aligned} }\]
$$ \begin{aligned} &\mathbf{v}[t,i] = \fof\{\mathbf{g}[t,i]\}\\ &\mathbf{m}[t, i] = \dengmom\\ &\mathbf{d}[t, i] = {\mathbf{m}[t]}^{\frac{1}{2}}\\ &\ngrad = {\mathbf{g}[t,i]}/{\mathbf{d}[t, i]} \\ &\ngradv = {\mathbf{v}[t,i]}/{\mathbf{d}[t, i]} \\ &\boxed{ \mathbf{a}[t, i] =\begin{cases} & 1 \\ & \numngradvcorr \\ & \numwvcorr \end{cases} }\\ &\mathbf{w}[t+1,i] = \mathbf{w}[t,i] - \mu[t]\, \mathbf{a}[t, i] \,\ngradv \end{aligned} $$
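A sketch of the smoothed-gradient recursion above for the \(\mathbf{a}[t,i]=1\) case, taking the filter \(\fof\) as a pure-pole EMA (\(\gamma=0\), \(\eta=1-\beta\)); the resulting update is Adam-like, and the constants \(\mu, \beta_1, \beta_2, \epsilon\) are illustrative assumptions.

```python
import numpy as np

def smoothed_step(w, g, v, m, mu, beta1=0.9, beta2=0.999, eps=1e-8):
    """Per-coordinate update with a smoothed gradient (a[t,i] = 1 case).
    The first-order filter is a plain EMA (gamma = 0, eta = 1 - beta1);
    eps guards the division. No bias correction, for brevity."""
    v = beta1 * v + (1.0 - beta1) * g       # v[t,i] = F{g[t,i]}
    m = beta2 * m + (1.0 - beta2) * g * g   # m[t,i] ~ E[g[t,i]^2]
    w = w - mu * v / (np.sqrt(m) + eps)     # w <- w - mu * v / d
    return w, v, m

# Minimize f(w) = 0.5 * ||w||^2 (so g = w) from a random start.
rng = np.random.default_rng(0)
w = 3.0 * rng.normal(size=4)
v = np.zeros_like(w)
m = np.zeros_like(w)
for _ in range(2000):
    w, v, m = smoothed_step(w, w, v, m, mu=0.01)
```

Note that the denominator uses the moment of the raw gradient \(\mathbf{g}\), as in the boxed recursion; only the step direction is smoothed.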

Matrix Setup

Now, in this setup,

\[\mathbf{v}[t] = \fof\{\mathbf{g}[t]\}\]

Let \(n, m>1\) and assume \(\mathbf{w}[t], \mathbf{g}[t], \mathbf{v}[t], \Delta[t+1] \in \mathbb{R}^{n \times m}\), while \(\alpha[t] \in \mathbb{R}^{n \times n}\) is symmetric.

\[\boxed{ \matstepv }\] \[\boxed{ \trobjmat = \gmatstepmom + \mu[t]\,\gmatstepcorrngv }\]

Since

\(\frac{d^2\mkern1pt\trobjmat}{d\mkern1pt\alpha[t]^2} = \matdengmomv \ge 0\),

the subproblem defined by \(\trobjmat\) is convex in \(\alpha[t]\).

Let the normalized smooth gradient be \(\nmatgradv = {\matdenvmomsqrt}{\mathbf{v}[t]}\), where \(\matngradvcorr=\mathbf{I}_{n\times n}\).

Minimizing \(\trobjmat\) w.r.t \(\alpha[t]\) gives

$$ \boxed{ \alpha[t] = \mu\mkern1pt{\matngradvcorr}\mkern1pt{\matdenvmomsqrt} = \mu\,{\matdenvmomsqrt} } $$

Unlike the per‑coordinate case, which normalizes using a scalar variance, the matrix case normalization uses a full covariance matrix, which automatically removes all variance‑scaling effects from smoothing.


\[\boxed{ \begin{aligned} \mathbf{w}[t+1] = \mathbf{w}[t] -\alpha[t]\,\mathbf{v}[t] \end{aligned} }\]
$$ \begin{aligned} &\mathbf{v}[t] = \fof\{\mathbf{g}[t]\}\\ &\mathbf{m}[t] = \matdengmomv\\ &\bar{\mathbf{d}}[t] = \mathbf{m}[t]^{\text{-}\frac{1}{2}}\\ &\nmatgradv = \bar{\mathbf{d}}[t] \mathbf{v}[t]\\ &\boxed{ \mathbf{a}[t] = \begin{cases} & 1 \\ & \matwngradvcorr \end{cases} }\\ &\mathbf{w}[t+1] = \mathbf{w}[t] - \mu[t]\,\mathbf{a}[t]\,\nmatgradv \end{aligned} $$

Matrix Inverse Square-root (Realizations)

Let \(\mathbf{m}[t]\) be a symmetric positive definite matrix.
There are several ways to realize its inverse square root \(\bar{\mathbf{d}}[t] = \mathbf{m}[t]^{\text{-}\frac{1}{2}}\). The typical cost is \(O(n^{3})\) per iteration.

Full Eigenvalue Decomposition (EVD)

Compute the full eigenvalue decomposition \(\mathbf{m}[t] = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T\), then \(\bar{\mathbf{d}}[t] = \mathbf{U}\mathbf{\Lambda}^{-1/2}\mathbf{U}^T\).

This is slower for large \(n\) and carries the overhead of computing eigenvectors, but it is the most accurate and numerically stable option.

$$ \boxed{ \begin{aligned} &\text{1: Eigenvalue decomposition}\\ &\quad \mathbf{U}\,\mathbf{\Lambda}\,\mathbf{U}^T = \mathbf{m}[t] \quad\text{with}\quad \mathbf{\Lambda}=\operatorname{diag}(\lambda_1,\ldots,\lambda_n),\ \lambda_i>0\\ &\text{2: Form inverse square root of eigenvalues}\\ &\quad \mathbf{\Lambda}^{-1/2} = \operatorname{diag}(\lambda_1^{-1/2},\ldots,\lambda_n^{-1/2})\\ &\text{3: Reconstruct inverse square root}\\ &\quad \bar{\mathbf{d}}[t] = \mathbf{U}\,\mathbf{\Lambda}^{-1/2}\,\mathbf{U}^T\\ &\bar{\mathbf{d}}[t]\approx \mathbf{m}[t]^{-1/2} \end{aligned} } $$
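The three steps above translate directly to NumPy's `eigh`; the test matrix below is an arbitrary symmetric-positive-definite construction used only for the sanity check.

```python
import numpy as np

def inv_sqrt_evd(m_mat):
    """Inverse matrix square root via full eigenvalue decomposition,
    following steps 1-3 above; m_mat must be symmetric positive definite."""
    lam, U = np.linalg.eigh(m_mat)        # 1: m = U diag(lam) U^T
    inv_sqrt_lam = 1.0 / np.sqrt(lam)     # 2: lam^(-1/2), elementwise
    return (U * inv_sqrt_lam) @ U.T       # 3: U diag(lam^(-1/2)) U^T

# Sanity check: d_bar @ m @ d_bar should recover the identity.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
m_mat = A @ A.T + 5.0 * np.eye(5)         # symmetric positive definite
d_bar = inv_sqrt_evd(m_mat)
```

`eigh` exploits symmetry and returns real eigenvalues in ascending order, which is why no explicit symmetrization is needed here.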

To save time, an approximate result can be obtained via randomized eigenvalue decomposition, which uses randomized methods to compute only the top \(l\) eigencomponents, where \(l \ll n\). The cost drops to \(O(n^2 l)\). Such low-rank approximations are cheaper than a full EVD while maintaining reasonable accuracy for moderate \(n\).

Newton-Schulz Iteration

For symmetric positive definite matrices, we may compute \(\bar{\mathbf{d}}[t] = \mathbf{m}[t]^{-1/2}\) by maintaining two coupled polynomial sequences in the matrix \(\mathbf{m}[t]\) that share the same residual (Higham, 2008).

Initialize with a normalization constant \(c\), \(\mathbf{y}_0 = \mathbf{m}[t]/c\) and \(\mathbf{z}_0 = \mathbf{I}\).

\[\mathbf{y}_{k+1} = \frac{1}{2}\mathbf{y}_k\,(3\mathbf{I} - \mathbf{z}_k\mathbf{y}_k)\] \[\mathbf{z}_{k+1} = \frac{1}{2}(3\mathbf{I} - \mathbf{z}_k\mathbf{y}_k)\,\mathbf{z}_k\]

As \(k\to\infty\), \(\mathbf{y}_k \to \mathbf{m}[t]^{1/2}/\sqrt{c}\) and \(\mathbf{z}_k \to \sqrt{c}\,\mathbf{m}[t]^{-1/2}\), i.e. the normalized square root and inverse square root. Upon convergence, \(\bar{\mathbf{d}}[t] \approx \mathbf{z}_k/\sqrt{c}\).

One can monitor the residual \(\mathbf{s}_k = \mathbf{I} - \mathbf{z}_k\,\mathbf{y}_k\) and stop when \(\| \mathbf{s}_k\| < \epsilon\). Each step costs a few matrix–matrix multiplications, which achieve very high throughput and parallelism on GPUs via optimized BLAS libraries, so the iteration is cheap in practice for a small number of iterations and moderate \(n\).

$$ \boxed{ \begin{aligned} &c = \|\mathbf{m}[t]\|\\ &\mathbf{y} = \mathbf{m}[t]/c\\ &\mathbf{z} = \mathbf{I}\\ &\textbf{for } k = 0, 1, 2, \ldots K\\ &\quad \mathbf{r} = 0.5\,(3\mathbf{I} - \mathbf{z}\,\mathbf{y})\\ &\quad \mathbf{y} = \mathbf{y}\,\mathbf{r}\\ &\quad \mathbf{z} = \mathbf{r}\,\mathbf{z}\\ &\textbf{end for}\\ &\bar{\mathbf{d}}[t] \approx \mathbf{z}/\sqrt{c} \end{aligned} } $$
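The boxed loop translates directly to NumPy; the Frobenius norm for \(c\), the iteration cap \(K\), and the stopping tolerance are illustrative choices in this sketch.

```python
import numpy as np

def inv_sqrt_newton_schulz(m_mat, K=50, tol=1e-12):
    """Coupled Newton-Schulz iteration for m^(-1/2), following the
    boxed pseudocode; m_mat must be symmetric positive definite.
    Normalizing by c = ||m|| keeps the iteration convergent."""
    n = m_mat.shape[0]
    eye = np.eye(n)
    c = np.linalg.norm(m_mat)             # normalization constant
    y = m_mat / c
    z = eye.copy()
    for _ in range(K):
        r = 0.5 * (3.0 * eye - z @ y)
        y = y @ r                         # y -> m^(1/2) / sqrt(c)
        z = r @ z                         # z -> sqrt(c) * m^(-1/2)
        if np.linalg.norm(eye - z @ y) < tol:  # residual stopping rule
            break
    return z / np.sqrt(c)                 # d_bar ~ m^(-1/2)

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
m_mat = A @ A.T + 4.0 * np.eye(4)
d_bar = inv_sqrt_newton_schulz(m_mat)
```

Because the residual contracts quadratically once the iterates are close, only a handful of matrix multiplications are typically needed for well-conditioned \(\mathbf{m}[t]\).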


  1. Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2), 223–311.
  2. Moon, T. K., & Stirling, W. C. (2000). Mathematical Methods and Algorithms for Signal Processing. Prentice Hall.
  3. Van Trees, H. L., Bell, K. L., & Tian, Z. (2013). Detection Estimation and Modulation Theory, Detection, Estimation, and Filtering Theory, Part I (2nd ed.). Wiley.
  4. Diniz, P. S. R. (2020). Adaptive Filtering: Algorithms and Practical Implementation (5th edition). Springer International Publishing. https://doi.org/10.1007/978-3-030-29057-3
  5. Haykin, S. (2014). Adaptive Filter Theory (5th, intern.). Pearson.
  6. Papoulis, A. (1977). Signal Analysis. McGraw-Hill College.
  7. Higham, N. J. (2008). Functions of Matrices. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898717778
  8. Honig, M. L., & Messerschmitt, D. G. (1984). Adaptive Filters: Structures, Algorithms and Applications. Kluwer Academic Publishers.



Page last modified: Dec 24 2025 at 12:00 AM.