Stochastic Gradient Learning Dynamics
Linear Time (Iteration) Varying System.
Oluwasegun Somefun. “Smooth Learning Dynamics.” AutoSGM Framework, 2025.
Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation.
The AutoSGM framework exposes the exact update trajectory of each trainable parameter in a gradient-generating system (such as a deep neural network) under the stochastic gradient algorithm, with a lowpass filter (momentum) and an iteration-dependent learning-rate oracle, as the dynamics of a first-order linear time (iteration) varying (LTV) filter.
This LTV description makes it possible to apply linear systems, control and signal‑processing tools to reason about stability, transient response, noise attenuation and steady-state convergence tradeoffs.
Formally, at each iteration \(t\), a single parameter \(\mathbf{w}[t, i]\) updated via its gradient component \(\mathbf{g}[t, i]\) follows the first-order linear filter trajectory
\[\state{\Delta\mathbf{w}[t+1, i]} = \filter{\beta}\,\gain{r[t, i]}\cdot\state{\Delta\mathbf{w}[t,i]} + \filter{\eta}\,\gain{\alpha[t,i]}\cdot \input{\mathbf{e}[t,i]}.\]
Weight decay
Depending on the weight-decay mechanism, the generated input trajectory for what has been called decoupled weight decay is exactly equivalent to the standard (coupled) weight decay when \(\gainx{\rho^\prime} = \gainx{\rho}\,\filter{\eta^{-1}}\), \(\filter{\beta}=\filter{\gamma}\), and \(\statex{\tilde{\mathbf{w}}[t, i]} = \statex{\mathbf{w}[t, i]}\).
Finally, by integrating the LTV filter, the actual parameter update is recovered
\(\boxed{\statex{\mathbf{w}[t+1, i]} = \statex{\mathbf{w}[t,i]} + \state{\Delta \mathbf{w}[t+1,i]}}\),
where
- \(\gain{\alpha[t,i]} = {\gain{\mu}}\,\gain{\digamma[t]} \cdot \frac{\gain{\mathbf{a}[t,i]}}{\gain{\mathbf{d}[t,i]}}\) is an iteration-dependent learning-rate oracle, composed of a trust-region constant \(0 < \gain{\mu} < 1\), a window function \(0 \le \gain{\digamma[t]} \le 1\) serving as the learning-rate schedule, and the oracle’s numerator and denominator functions, denoted respectively as \(\gain{\mathbf{a}[t,i]}\) and \(\gain{\mathbf{d}[t,i]}\).
- \(\gain{r[t,i]} = \gain{\alpha[t,i]}/\gain{\alpha[t-1,i]}\) is the learning rate ratio,
- \(\filter{\beta}\) is the lowpass filter’s pole parameter selected for stability \(0 \le \filter{\beta} < 1\),
- \(\filter{\gamma}\) is the lowpass filter’s zero parameter, selected such that \(\filter{\gamma} < \filter{\beta}\),
- \(\filter{\eta}\) is a constant selected as \((1-\filter{\beta})/(1-\filter{\gamma})\) such that the steady-state (DC) gain of the lowpass filter is unity,
- \(\gainx{\rho, \rho^\prime} \ge 0\) denote small weight-decay constants, which can be selected relative to \(\filter{\eta}\),
- \(\statex{\tilde{\mathbf{w}}[t, i]} = \gain{\mathbf{d}[t,i]}\,\statex{\mathbf{w}[t, i]}\) is the scaled parameter via the learning-rate’s denominator.
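The unity-DC-gain choice of \(\filter{\eta}\) can be checked numerically. The sketch below assumes a generic first-order lowpass transfer function \(H(z) = \eta\,(1-\gamma z^{-1})/(1-\beta z^{-1})\) with pole \(\beta\) and zero \(\gamma\) for illustration, and evaluates it at the DC point \(z=1\):

```python
# Illustrative check: for a first-order lowpass H(z) = eta*(1 - gamma/z)/(1 - beta/z),
# choosing eta = (1 - beta)/(1 - gamma) makes the steady-state (DC) gain H(1) unity.
beta, gamma = 0.9, 0.1          # pole and zero, with gamma < beta < 1
eta = (1 - beta) / (1 - gamma)  # normalizing constant from the text

def dc_gain(beta, gamma, eta):
    """Evaluate H(z) at z = 1 (zero frequency)."""
    return eta * (1 - gamma) / (1 - beta)

print(dc_gain(beta, gamma, eta))  # ≈ 1.0
```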
Under the mild assumptions of local smoothness of the loss function generating the gradient, and bounded gradient moments, the trajectory input \(\input{\mathbf{e}[t, i]}\) is bounded.
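A single iteration of the LTV recursion, together with the integration step that recovers the parameter, can be sketched in a few lines. Everything here (variable names, the default pole/zero values, the constant oracle) is illustrative, not the framework's actual implementation:

```python
def ltv_step(dw, w, e, alpha, alpha_prev, beta=0.9, gamma=0.1):
    """One iteration of the first-order LTV filter trajectory (illustrative sketch).

    dw:    previous parameter change  Delta w[t, i]
    w:     current parameter          w[t, i]
    e:     filter input               e[t, i] (gradient-derived trajectory input)
    alpha, alpha_prev: learning-rate oracle values at iterations t and t-1
    """
    eta = (1 - beta) / (1 - gamma)              # unity DC-gain constant
    r = alpha / alpha_prev                      # learning-rate ratio r[t, i]
    dw_next = beta * r * dw + eta * alpha * e   # lowpass-filtered parameter change
    w_next = w + dw_next                        # integrate: w[t+1] = w[t] + Delta w[t+1]
    return dw_next, w_next
```

With a negative input (e.g. the negative gradient direction), the parameter moves down, and the momentum term carries part of the previous step forward.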
A Proximal Subproblem
Any first‑order linear filter like
\[\boxed{\state{\Delta\mathbf{w}[t+1, i]} = \filter{\beta}\,\gain{r[t, i]}\cdot\state{\Delta\mathbf{w}[t,i]} + \filter{\eta}\,\gain{\alpha[t,i]}\cdot \input{\mathbf{e}[t,i]}}\]can be written as the solution of a proximal subproblem.
This first‑order linear filter is the explicit closed‑form optimal solution to a regularized weighted least-squares objective that balances the next trajectory step \(\Delta\mathbf{w}[t+1,i]\) between aligning with \(\mathbf{e}[t,i]\) and staying close to a weighted version of the previous trajectory step \(\Delta\mathbf{w}[t,i]\).
\[\Delta\mathbf{w}[t+1,i] = \arg\min_{\Delta} Q\bigl(\Delta\bigr)\] \[Q\bigl(\Delta\bigr)= \;-\; \eta\,\Delta \cdot \mathbf{e}[t,i] \;+\; \frac{1}{2\,\alpha[t,i]}\, \Bigl(\Delta - \beta\, r[t,i]\cdot\Delta\mathbf{w}[t,i]\Bigr)^2\]Rewriting as a proximal operator,
\[\Delta\mathbf{w}[t+1,i] = \mathbf{prox}_{\gain{\alpha[t,i]},\;-\eta\,\Delta \cdot\mathbf{e}[t,i]}\big(\beta\, r[t,i]\cdot\Delta\mathbf{w}[t,i]\big)\]Despite the underlying deep-learning loss being non‑convex, this per-iteration subproblem is strongly convex in \(\Delta\) with modulus \(\alpha[t,i]^{-1}\), and is driven by \(\mathbf{e}[t,i]\).
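One can verify numerically that the closed-form filter update minimizes \(Q\): setting \(dQ/d\Delta = -\eta\,\mathbf{e}[t,i] + \alpha[t,i]^{-1}\bigl(\Delta - \beta\,r[t,i]\,\Delta\mathbf{w}[t,i]\bigr) = 0\) recovers the recursion. A small self-contained check, with all numeric values chosen arbitrarily for illustration:

```python
# Check that the filter update Delta* = beta*r*dw + eta*alpha*e minimizes
# Q(Delta) = -eta*Delta*e + (Delta - beta*r*dw)**2 / (2*alpha).
beta, r, eta, alpha = 0.9, 1.0, 0.5, 0.01
dw, e = 0.2, -1.5   # previous step and filter input (arbitrary values)

def Q(d):
    return -eta * d * e + (d - beta * r * dw) ** 2 / (2 * alpha)

d_star = beta * r * dw + eta * alpha * e   # closed-form minimizer

# Q is strongly convex with modulus 1/alpha, so d_star beats any perturbation:
# Q(d_star + h) - Q(d_star) = h**2 / (2*alpha) > 0 for h != 0.
for h in (-0.05, -0.001, 0.001, 0.05):
    assert Q(d_star) < Q(d_star + h)
```

The strong-convexity modulus \(1/\alpha\) is visible directly in the quadratic term, so the minimizer is unique whenever the learning rate is finite and positive.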
Apart from unification, this LTV filter and its resulting convex quadratic subproblem formulation concretely highlight:
- Lowpass regularization: the smooth path of parameter changes empirically enhances the robustness and generalization of solutions.
- Stability: each parameter change is the solution of a strongly convex quadratic subproblem. Provided the learning rate at each iteration is finite, the lowpass filter ensures bounded, well‑defined and stable parameter changes.
This makes the design of such practical stochastic gradient learning algorithms more principled, analyzable, and unifying across several methods.
Stability Conditions and Transient Behavior
As stated, the overall filtered SGM dynamics is \(\boxed{\state{\Delta\mathbf{w}[t+1, i]} = \filter{\beta}\,\gain{r[t, i]}\cdot\state{\Delta\mathbf{w}[t,i]} + \filter{\eta}\,\gain{\alpha[t,i]}\cdot \input{\mathbf{e}[t,i]}}\)
- The lowpass pole \(\filter{\beta}\) and learning-rate ratio \(\gain{r[t,i]}\) shape both the low-frequency properties of the filter (its response to slowly changing inputs) and the exponential stability margin of the system, \(\lvert \filter{\beta^{-1}}\,\gain{r[t,i]} \rvert < 1\).
Together, \(\filter{\beta}\) and \(\gain{\alpha[t,i]}\) shape how quickly and smoothly the learning dynamics settle into steady-state.
If selected properly, they ensure bounded and convergent behavior over time.
For BIBO stability and uniform exponential stability (boundedness and asymptotic behavior of the trajectory solution), the necessary and sufficient conditions are \(0 < \filter{\beta} < 1\) and \(\sup_t \gain{\alpha[t,i]} < \infty\). These ensure that a bounded input \(\input{\mathbf{e}[t,i]}\) leads to a bounded trajectory output \(\state{\Delta\mathbf{w}[t+1, i]}\).
In addition, given \(0 < \filter{\beta} < 1\), improving exponential stability requires \(\gain{\digamma[t]} \to 0\) as \(t \to \infty\), which implies both \(\gain{r[t]} \to 0\) and \(\gain{\alpha[t]} \to 0\). This explains why learning-rate annealing is important.
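These stability conditions can be illustrated with a toy simulation: with \(0<\filter{\beta}<1\), a bounded stochastic input, and an annealed window \(\gain{\digamma[t]}\to 0\), the parameter changes stay bounded throughout and decay toward zero. The cosine-style schedule and uniform input below are arbitrary choices for illustration, not the framework's prescribed schedule:

```python
import math, random

random.seed(0)
beta, gamma = 0.9, 0.1
eta = (1 - beta) / (1 - gamma)
mu = 0.1                        # trust-region constant, 0 < mu < 1

def alpha_at(t):
    """Learning-rate oracle with a cosine-style annealing window (illustrative)."""
    window = 0.5 * (1 + math.cos(math.pi * min(t / 200, 1.0)))  # digamma[t] -> 0
    return mu * window + 1e-12  # small floor keeps the ratio r[t] well-defined

dw, peak = 0.0, 0.0
alpha_prev = alpha_at(0)
for t in range(1, 400):
    alpha = alpha_at(t)
    r = alpha / alpha_prev                 # learning-rate ratio r[t]
    e = random.uniform(-1, 1)              # bounded stochastic input e[t]
    dw = beta * r * dw + eta * alpha * e   # LTV filter recursion
    peak = max(peak, abs(dw))
    alpha_prev = alpha

print(peak, abs(dw))  # peak stays bounded; final |dw| has decayed toward 0
```

Because the schedule is monotonically decreasing, \(r[t] \le 1\) throughout, so the state coefficient \(\beta\,r[t]\) stays strictly inside the unit circle and the trajectory contracts once the annealing window closes.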