$$ \newcommand{\errt}{\textcolor{RubineRed}{\varepsilon_t}} \newcommand{\errtm}{\textcolor{RubineRed}{\varepsilon_{t-1}}} \newcommand{\errtp}{\textcolor{RubineRed}{\varepsilon_{t+1}}} \newcommand{\errti}{\textcolor{RubineRed}{\varepsilon_0}} \newcommand{\derrtp}{\textcolor{RubineRed}{\Delta\varepsilon_{t+1}}} \newcommand{\derrt}{\textcolor{RubineRed}{\Delta\varepsilon_{t}}} \newcommand{\derrti}{\textcolor{RubineRed}{\Delta\varepsilon_{0}}} $$ $$ \newcommand{\mean}[1]{\mathbb{E}[#1]} \newcommand{\mgradt}{\mean{\gradt}} \newcommand{\mgradtm}{\mean{\gradtm}} \newcommand{\merrt}{\mean{\errt}} \newcommand{\merrtm}{\mean{\errtm}} \newcommand{\merrtp}{\mean{\errtp}} \newcommand{\mderrtp}{\mean{\derrtp}} \newcommand{\mderrt}{\mean{\derrt}} $$ $$ \newcommand{\sgradt}{{\gradt^2}} \newcommand{\sgradtm}{{\gradtm^2}} \newcommand{\serrt}{{\errt^2}} \newcommand{\serrtm}{{\errtm^2}} \newcommand{\serrtp}{{\errtp^2}} \newcommand{\sderrtp}{{\derrtp^2}} \newcommand{\sderrt}{{\derrt^2}} $$ $$ \newcommand{\vgradt}{\mean{\gradt^2}} \newcommand{\vgradtm}{\mean{\gradtm^2}} \newcommand{\verrt}{\mean{\errt^2}} \newcommand{\verrtm}{\mean{\errtm^2}} \newcommand{\verrtp}{\mean{\errtp^2}} \newcommand{\vderrtp}{\mean{\derrtp^2}} \newcommand{\vderrt}{\mean{\derrt^2}} $$

AutoSGM

Momentum: a principled signal-processing framework.

Too Long Can't Read 🎧

The word momentum has a physical meaning: mass times velocity. But what exactly does momentum represent in the context of stochastic gradient learning? Typical explanations lean on this physical analogy to describe a set of mathematical transformations on the gradient that often help accelerate gradient descent, making the word `momentum` stick. Yet, they rarely address the deeper question: what exactly is this set of operations, and how do we design it?

This is where the AutoSGM framework comes in.

In these notes, using a simple problem setup that captures the essence of more complex scenarios, we recast the stochastic gradient learning dynamics as a linear time-invariant system, introduce the AutoSGM framework, derive explicit stability certificates, and uncover automatic learning-rate rules and admissible system-parameter values for exponential convergence (in the mean) of the learning algorithm.

✨ Why AutoSGM?

The AutoSGM framework unifies `momentum` as gradient smoothing via a first-order linear filter in the stochastic gradient method. As we certify accelerated convergence for this simple problem setup, at the end of these notes, we will see that the AutoSGM framework allows us to discover Nesterov's Accelerated Gradient (NAG) as a precisely engineered system from first-principles.

The results for this setup hold exactly under a quadratic loss, diagonal Hessian setting.

Outside such a setting, there is no literal guarantee of two-step convergence, and exact parameter choices for the system may deviate from the optimal values of this ideal setting.

Nevertheless, our system-theoretic perspective continues to offer an interpretable lens, valuable intuition for principled parameter choices, and insight into observations in more complicated setups.

Problem setup

Consider the simple but representative problem of minimizing, at each time-step or iteration $t>0$, a twice-differentiable, weighted quadratic loss function, $$f(\rwt) = \tfrac{1}{2} u_t^2 \cdot \errt^2,$$ where the model is a 1-dimensional parameter-vector $\rwt$, the error relative to an optimum value $\rw^*$ is $\errt = \rwt - \rw^* $, and the weighting variable $u_t$ is modeled as a wide-sense stationary, zero-mean Gaussian data process, statistically independent of the error.

This implies $u_t$ has a mean $\mean{u_t} = 0$, a finite second moment (or variance) $\mean{u_t^2} = \lambda$, with $0<\lambda<\infty$, and uncorrelated values across time: $\mean{u_t\,u_k} = 0, \hbox{for } t\ne k$. Independence further implies that for any powers $i,j$, the joint moment $\mean{u_t^i \errt^j} = \mean{u_t^i}\,\mean{\errt^j}$.

Differentiating $f(\rwt)$ with respect to $\rwt$, we get the gradient $\gradt$ and Hessian $\rH_t$ as follows, $$ \begin{align} \gradt = \nabla f(\rwt) = u_t^2\cdot \varepsilon_t, & \quad \mean{\gradt} = \lambda\, \mean{\errt}.\\ \rH_t = \nabla^2 f(\rwt) = u_t^2,& \quad \mean{\rH_t} = \lambda. \end{align} $$ To minimize $f(\rwt)$, an optimization algorithm of choice is the Stochastic Gradient Method (SGM). The simplest form of this algorithm is an iterative update rule of the form $$ \rwtp = \rwt - \alpha_t \, \gradt.$$ At each iteration, the update rule receives an $\alpha_t > 0$ from a learning-rate oracle and $\gradt$ from a gradient-generating oracle. Although the algorithm updates $\rwt$ by taking a step in the direction of the negative gradient, it does not strictly assume its input is either the exact gradient or stochastic, nor does it guarantee a descent of $f(\rwt)$ per iteration. Nonetheless, the notions of stochasticity and descent have become commonly associated with the update rule, even if they are not required for it to function. Subtracting $\rw^*$ from both sides of the update yields the error recursion $$ \errtp = \errt - \alpha_t \, \gradt.$$

We will further assume that the system dynamics are linear time-invariant (LTI). This means $\alpha_t = \alpha, \forall t$, a constant learning rate across iterations, and so $$ \errtp = \errt - \alpha \, \gradt.$$ Goal. In this controlled and interpretable setting, we desire that the algorithm drives its mean error, starting from an initial non-zero value $\errti$, to converge to its optimum value of zero in approximately one or two iterations.
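To make the goal concrete, here is a minimal sketch of the mean-error recursion $\merrtp = (1-\alpha\lambda)\,\merrt$, obtained by taking expectations of the update (anticipating the analysis below); the values of $\lambda$ and the initial mean error are illustrative assumptions, not prescribed by the setup:

```python
# Mean-error recursion of plain SGM: E[eps_{t+1}] = (1 - alpha*lam) * E[eps_t].
# lam (the Hessian mean E[u_t^2]) and the initial error are illustrative values.
lam = 2.0          # E[u_t^2]
alpha = 1.0 / lam  # the learning-rate choice the analysis below certifies
e = 1.0            # initial mean error E[eps_0]
history = [e]
for t in range(3):
    e = (1.0 - alpha * lam) * e
    history.append(e)
print(history)  # -> [1.0, 0.0, 0.0, 0.0]: one-step convergence at alpha = 1/lam
```

With any admissible $0 < \alpha < \tfrac{2}{\lambda}$ the mean error contracts geometrically; $\alpha = \tfrac{1}{\lambda}$ is the deadbeat case.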

As we move forward, we will discover that ensuring convergence in this setting is equivalent to certifying the stability conditions of LTI filters.

Error Dynamics (Mean Convergence)

Recalling that $\mean{u_t^2} = \lambda$ and $\mgradt = \lambda\,\merrt$, we can write the error equation as $\errtp = \errt + \derrtp$, where $\mderrtp = -\alpha \, \mgradt$. Since $\merrt = \merrtm + \mderrt$, we get $\mgradt = \lambda\,\merrtm + \lambda\, \mderrt$, and the dynamics

$$ \begin{equation} \label{eq:smc1} \begin{aligned} \mderrtp &= -\alpha\,\lambda\,\mderrt -\alpha\,\lambda\, \merrtm \\ \merrt &= \mderrt + \merrtm. \end{aligned} \end{equation} $$ Denoting $\rmt = \merrt$, $\rst = \mderrt$, $\beta_p = -\alpha \lambda$, and $k_p = -\alpha \lambda$, the expected error dynamics are $$ \begin{equation}\label{eq:smeq} \boxed{ \begin{aligned} \rstp &= \beta_p\,\rst + k_p \, \rmtm \\ \rmt &= \rst + \rmtm \end{aligned}}. \end{equation} $$ Equation $(\ref{eq:smeq})$ tells us that the expected error $\rmt$ is the sum of the expected change-level error $\rst$ and the previous expected error $\rmtm$, and that the expected change-level error dynamics is that of a single-pole LTI filter, $$\rstp = \beta_p\,\rst + k_p \, \rmtm$$ with input $\rmtm$; its transfer function is $$H_{s}(z) = \frac{k_p}{1-\beta_p\,z^{-1}}.$$ The change-level system $H_{s}(z)$ is both asymptotically and exponentially stable if its single pole satisfies $\lvert \beta_p\rvert < 1$. Since $\alpha >0$, $\beta_p = -\alpha \lambda$ is negative, so the condition translates to $-1 < -\alpha\lambda < 0$, that is, $\alpha\lambda < 1$ and $\alpha\lambda > 0$, which implies $$\begin{equation}\label{eq:mewin} \boxed{0 < \alpha < \tfrac{1}{\lambda}}. \end{equation}$$ The overall dynamics of the expected error, written in matrix form, is $$ \begin{equation} \underbrace{ \begin{bmatrix} \rstp \\ \rmt \end{bmatrix} }_{\rxt} = \underbrace{ \begin{bmatrix} \beta_p & k_p \\ 1 & 1 \end{bmatrix} }_{\rA} \underbrace{ \begin{bmatrix} \rst \\ \rmtm \end{bmatrix} }_{\rxtm}. \label{eq:smeqmat} \end{equation} $$ Compactly, the two-state LTI system is $\rxt = \rA^{t}\,\rxi$. Again, exponential and asymptotic stability of the two-state system requires its eigenvalues, $z_1 = 0$ and $\textcolor{Mahogany}{z_2 =1-\alpha\lambda}$, the roots of the characteristic polynomial $\det(z\mathbf{I}-\mathbf{A}) = z(z-(1-\alpha\lambda)) = 0$, to be distinct and have magnitude strictly less than 1.
The maximum-eigenvalue condition $\lvert z_2 \rvert < 1$ implies $-1 < 1-\alpha\lambda < 1$, therefore $\alpha\lambda < 2$ and $\alpha\lambda > 0$, meaning $0 < \alpha < \tfrac{2}{\lambda}$, and, more strictly, $0 < \alpha \le \tfrac{1}{\lambda}$.

By exploiting the rank-1 system matrix in $\rxt = \rA^{t}\,\rxi$, and taking any standard norm $\|\cdot\|$ (1-norm, 2-norm, $\infty$-norm) of both sides, with the corresponding induced matrix norm, we can obtain a certificate on the exponential convergence of the expected error states, for $t\ge 1$ as $$ \boxed{\| \rxt \| \le 2 \cdot |\textcolor{Mahogany}{z_2}|^{t-1}\, \|\rxi\|}.$$
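As a numeric sanity check (with illustrative values of $\alpha$ and $\lambda$, not prescribed by the notes), we can confirm the eigenvalue structure of $\rA$ and the norm certificate:

```python
import numpy as np

# Two-state mean-error system matrix A = [[beta_p, k_p], [1, 1]],
# with beta_p = k_p = -alpha*lam. The values of alpha, lam are illustrative.
lam, alpha = 2.0, 0.25                      # alpha*lam = 0.5, inside (0, 1/lam]
A = np.array([[-alpha * lam, -alpha * lam],
              [1.0, 1.0]])

z = np.sort(np.abs(np.linalg.eigvals(A)))
print(z)  # eigenvalue magnitudes: 0 and |1 - alpha*lam| = 0.5

# Certificate: ||x_t|| <= 2 * |z2|^(t-1) * ||x_0|| for t >= 1 (2-norm shown)
x0 = np.array([0.3, 1.0])                   # arbitrary initial state [s_0, m_{-1}]
x = x0.copy()
for t in range(1, 8):
    x = A @ x
    assert np.linalg.norm(x) <= 2 * 0.5 ** (t - 1) * np.linalg.norm(x0) + 1e-12
```

The rank-1 structure is what makes the bound geometric after the first step: $\rA^2 = z_2\,\rA$, so the state contracts by $\lvert z_2 \rvert$ every iteration after $t=1$.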

Reflections: Mean Convergence

The system analysis led to $$ \boxed{\alpha = \frac{1}{\lambda}}. $$
Play with the controls
  1. Click on the RAW button
  2. Ensure the Input selector reads Step or Impulse
  3. Click Play, to watch the sweep $0 < \alpha \le \tfrac{1}{\lambda}$
  1. Click on the Pause button
  2. Click on the RAWx2 button
  3. Ensure the Input selector reads Step or Impulse
  4. Click Play, to watch the sweep $0 < \alpha < \tfrac{2}{\lambda}$
Observations
  1. Coupling. Once the change-level error diverges, the actual error also diverges.
  2. Values of $\alpha$ extremely close to zero yield slow but monotonic convergence.
  3. In contrast, taking $\alpha$ close to $\tfrac{1}{\lambda}$ practically achieves the design goal.

If we analyze convergence of the error in the mean-square sense, we get a tighter range $$ \boxed{0 < \tfrac{1}{\sqrt{3}}\cdot\tfrac{1}{\lambda} \le \alpha \le \tfrac{1}{\lambda}} $$

and further, matching the mean-square convergence rate to the mean convergence rate, that is $\lvert 1 - 3\,\alpha^2\lambda^2 \rvert = \lvert 1 - \alpha \lambda \rvert$, so that approximately, both converge to $0$ at the same rate, leads to the choice $$ \boxed{\alpha = \tfrac{1}{3}\cdot\tfrac{1}{\lambda}}. $$
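A quick check (with an assumed, illustrative $\lambda$) that $\alpha = \tfrac{1}{3\lambda}$ indeed solves the rate-matching equation:

```python
# Rate matching: |1 - 3*alpha^2*lam^2| = |1 - alpha*lam| at alpha = 1/(3*lam).
# lam is an illustrative value; any lam > 0 gives the same two-thirds rate.
lam = 2.0
alpha = 1.0 / (3.0 * lam)
ms_rate = abs(1.0 - 3.0 * (alpha * lam) ** 2)   # mean-square contraction factor
mean_rate = abs(1.0 - alpha * lam)              # mean contraction factor
print(ms_rate, mean_rate)                       # both equal 2/3
```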

On the next page, we will introduce the AutoSGM structure, which smooths the raw gradient input to the SGM by means of an LTI lowpass filter.

This has been popularly referred to as momentum.

In other words, the SGM algorithm now uses a smoothed version of the gradient: we replace $\gradt$ with a smooth version $\gradvt$, and the error recursion becomes $$ \errtp = \errt - \alpha \, \gradvt.$$

It is still the same SGM; the key difference is that the algorithm's input is now adorned with a more protective cloth.

AutoSGM

The SGM algorithm now uses a smoothed version of the gradient: we replace $\gradt$ with $\gradvt = H_{\beta,\gamma}(z) \{ \gradt \}$, where \[ H_{\beta,\gamma}(z) = \eta\,\frac{1 - \gamma z^{-1}}{1 - \beta z^{-1}}. \] In the transfer function, we have the filter's pole $ 0 < \beta < 1$, the filter's zero $ \gamma \le \beta $, and the gain $0 < \eta \le 1$. Here, we will always ensure the DC gain of the filter is $1$, that is, $\eta = \tfrac{1-\beta}{1-\gamma}$.

The update rule is now $ \rwtp = \rwt - \alpha \, \gradvt$; in terms of the error, we have $$ \errtp = \errt - \alpha \, \gradvt. $$

A causal realization of this first-order LTI filter is $$ \gradvt = \beta \, \gradvtm + \eta\,\big(\gradt - \gamma \, \gradtm \big). $$ Therefore, we directly get an explicit change-level dynamics that inherits the lowpass structure of $H_{\beta,\gamma}(z)$, \[ \derrtp = \errtp - \errt = -\alpha \, \gradvt. \]

In general, for any choice of $(\beta,\gamma)$, the filter's canonical realization is \[ \boxed{ \begin{aligned} \rqt &= \beta \, \rqtm + \gradt\\ \gradvt &= \eta\,(\rqt - \gamma\,\rqtm) \end{aligned}}. \]
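As a sketch (the gradient sequence and parameter values below are illustrative assumptions), the canonical realization produces exactly the same output as the direct-form recursion $\gradvt = \beta \, \gradvtm + \eta\,(\gradt - \gamma \, \gradtm)$ when both start from rest:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(50)         # an arbitrary gradient sequence
beta = 0.9
gamma = beta / (1.0 + beta)         # the NAG zero; any gamma <= beta works
eta = (1.0 - beta) / (1.0 - gamma)  # unit DC gain

# Direct form: v_t = beta*v_{t-1} + eta*(g_t - gamma*g_{t-1})
v_direct, v_prev, g_prev = [], 0.0, 0.0
for gt in g:
    v = beta * v_prev + eta * (gt - gamma * g_prev)
    v_direct.append(v)
    v_prev, g_prev = v, gt

# Canonical form: q_t = beta*q_{t-1} + g_t;  v_t = eta*(q_t - gamma*q_{t-1})
v_canon, q_prev = [], 0.0
for gt in g:
    q = beta * q_prev + gt
    v_canon.append(eta * (q - gamma * q_prev))
    q_prev = q

assert np.allclose(v_direct, v_canon)  # identical outputs from zero state
```

The canonical form needs only one state $\rqt$ and one multiplication by $\beta$ per step, which is why it is the preferred realization.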

Tuning $\gamma$ recovers classic methods, which often use $\eta= \tfrac{1}{1-\gamma}$.

  • Heavy-Ball sets $\gamma = 0$; the realization simplifies to $$\gradvt = \beta \, \gradvtm + \eta\,\gradt.$$
  • Nesterov's Accelerated Gradient (NAG) sets $\gamma = \tfrac{\beta}{1+\beta}$; the realization simplifies to \[ \begin{aligned} \rqt &= \beta \, \rqtm + \,\gradt\\ \gradvt &= \frac{\eta\,(\beta \, \rqt + \gradt)}{1+\beta} . \end{aligned} \]
The change-level error dynamics now take the appearance of a lowpass filter \[ \derrtp = \beta \, \derrt + \alpha\,\eta\,\bigl(\gamma\,\gradtm - \gradt \bigr), \] instead of the former version \[ \derrtp = -\alpha\,\gradt. \]
Next, we analyze the mean error dynamics. We will discover that, in this setup, $\gamma = \tfrac{\beta}{1+\beta}$ is an optimal choice for any admissible $\beta$.

AutoSGM: Error Dynamics (Mean Convergence)

Recall that $\mgradt = \lambda\,\merrt$, and that $\merrt = \merrtm + \mderrt$, so that $\mgradt = \lambda\,\merrtm + \lambda\,\mderrt$, and $\mderrtp = \beta \, \mderrt + \alpha\,\eta\,\bigl(\gamma\,\mgradtm - \mgradt \bigr) $.

Denoting $\beta_p = \beta - \alpha\,\eta\,\lambda $, $k_p = - \alpha\,\eta\,\lambda\,(1-\gamma)$, $\rmt = \merrt$, and $\rst = \mderrt$, we get $$ \begin{equation} \label{eq:ac1} \begin{aligned} \rstp &= \beta_p\,\rst + k_p\,\rmtm \\ \rmt &= \rst + \rmtm. \end{aligned} \end{equation} $$

The lowpass-filtered change-level system is both asymptotically and exponentially stable if its single pole satisfies $-1 < \beta_p < 1$. Since $ 0< \beta < 1 $ and $\alpha > 0$, the condition $ -1 < \beta - \alpha\,\eta\,\lambda < 1 $ translates to $-(1-\beta) < \alpha\,\eta\,\lambda < 1+\beta$, which implies $$\begin{equation}\label{eq:alrwin} \boxed{0 < \alpha < \tfrac{1+\beta}{\eta}\,\tfrac{1}{\lambda}}. \end{equation} $$ The overall dynamics of the expected error, written in matrix form, is $$ \begin{equation} \underbrace{ \begin{bmatrix} \rstp \\ \rmt \end{bmatrix} }_{\rxt} = \underbrace{ \begin{bmatrix} \beta_p & k_p \\ 1 & 1 \end{bmatrix} }_{\rA} \underbrace{ \begin{bmatrix} \rst \\ \rmtm \end{bmatrix} }_{\rxtm}. \label{eq:amsyseq} \end{equation} $$ Compactly, the two-state LTI system is $\rxt = \rA^{t}\,\rxi$.

The characteristic polynomial $\det(z\mathbf{I}-\mathbf{A}) = 0$ gives $z^2 -(1+\beta_p)\,z + (\beta_p-k_p) = 0 $, where $\textcolor{Mahogany}{\beta_p-k_p = \beta - \alpha\,\eta\,\lambda\,\gamma}$. Letting the discriminant be $D = (1+\beta_p)^2 - 4\,(\beta_p-k_p)$, the two system eigenvalues are $ z_{1,2} = \tfrac{1+\beta_p}{2} \pm \tfrac{\sqrt{D}}{2}. $ Again, exponential and asymptotic stability of the system requires $1 > \lvert z_2 \rvert \ge \lvert z_1 \rvert$ (the ordering labels $z_2$ as the maximum eigenvalue).
  1. Repeated, real roots: $\lvert z_1 \rvert = \lvert z_2 \rvert = \textcolor{Mahogany}{\tfrac{1+\beta_p}{2}} < 1$. This translates to $\beta_p < 1$, already satisfied by $\beta_p < \beta$.
  2. Distinct, real roots: $\lvert z_1 \rvert < \lvert z_2 \rvert = \textcolor{Mahogany}{\tfrac{1+\beta_p}{2} + \tfrac{\sqrt{D}}{2} } < 1$. This translates to $\gamma < 1$, already satisfied by $\gamma \le \beta$.
  3. Distinct, complex-conjugate roots: $\lvert z_1 \rvert = \lvert z_2 \rvert = \textcolor{Mahogany}{\sqrt{\beta_p-k_p}}< 1$. Since $-(1-\beta) < \alpha\,\eta\,\lambda < 1+\beta$, then $0 \le \beta - \alpha\,\eta\,\lambda\,\gamma < 1$ translates to $\gamma \ge -\tfrac{(1-\beta)}{1+\beta}$, and matching $\beta - \alpha\,\eta\,\lambda\,\gamma = 0$ leads to the choice $\gamma = \tfrac{\beta}{1+\beta}$,
$$\begin{equation}\label{eq:azero} \boxed{-\tfrac{(1-\beta)}{1+\beta} \le \gamma < 1}. \end{equation} $$
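Numerically (with illustrative $\beta$ and $\lambda$), matching $\beta - \alpha\,\eta\,\lambda\,\gamma = 0$ at the edge $\alpha\,\eta\,\lambda = 1+\beta$ makes the system matrix nilpotent, so the mean-error state dies out in two steps:

```python
import numpy as np

# Deadbeat design: gamma = beta/(1+beta), eta = (1-beta)/(1-gamma),
# and alpha at the edge alpha*eta*lam = 1+beta. beta, lam are illustrative.
beta, lam = 0.5, 2.0
gamma = beta / (1.0 + beta)
eta = (1.0 - beta) / (1.0 - gamma)
alpha = (1.0 + beta) / (eta * lam)

beta_p = beta - alpha * eta * lam            # = beta - (1+beta) = -1
k_p = -alpha * eta * lam * (1.0 - gamma)     # = -(1+beta)/(1+beta) = -1
A = np.array([[beta_p, k_p], [1.0, 1.0]])

print(A)      # [[-1, -1], [1, 1]] up to rounding
print(A @ A)  # ~zero matrix: A is nilpotent, so x_t = A^t x_0 = 0 for t >= 2
```

Both eigenvalues sit at the origin, which is the fastest (deadbeat) placement a second-order discrete system allows.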

For distinct roots, by eigenvalue decomposition of the system matrix in $\rxt = \rA^{t}\,\rxi$, and taking any standard norm $\|\cdot\|$ (1-norm, 2-norm, $\infty$-norm) of both sides, with the corresponding induced matrix norm, we can obtain a certificate for both exponential, and asymptotic convergence in the mean as $$ \boxed{\| \rxt \| \le \tfrac{8}{\lvert\sqrt{D}\rvert} \cdot |\textcolor{Mahogany}{z_2}|^{t}\, \|\rxi\|}.$$ Observe that, alternatively, in this setup, freely selecting $\gamma = 0$, and matching $\beta - \alpha\,\eta\,\lambda\,\gamma = 0$ implies $\beta=0$ for the fastest convergence, which is a fall-back to the plain SGM.

AutoSGM: Reflections

Recall: Our holy grail is to get the mean error to converge to zero in at most two iterations.

The system analysis leads to $$ \boxed{ \begin{aligned} 0 < \beta < 1,\\ \gamma = \frac{\beta}{1+\beta},\\ \alpha = \frac{1+\beta}{\eta}\, \frac{1}{\lambda}. \end{aligned} } $$
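A minimal simulation of the mean-level recursion under these boxed choices (with illustrative $\beta$, $\lambda$, and initial mean error) exhibits the two-step convergence:

```python
# Mean-level AutoSGM recursion with the boxed parameter choices.
# m tracks E[eps_t]; the mean gradient is E[grad_t] = lam * E[eps_t].
# beta, lam, and the initial mean error are illustrative values.
beta, lam = 0.5, 2.0
gamma = beta / (1.0 + beta)
eta = (1.0 - beta) / (1.0 - gamma)
alpha = (1.0 + beta) / (eta * lam)

m, v_prev, g_prev = 1.0, 0.0, 0.0    # E[eps_0] = 1, filter starts at rest
trajectory = [m]
for t in range(5):
    g = lam * m                                   # mean gradient
    v = beta * v_prev + eta * (g - gamma * g_prev)  # smoothed mean gradient
    m = m - alpha * v                             # mean update
    trajectory.append(m)
    v_prev, g_prev = v, g
print(trajectory)  # [1.0, -beta, ~0, ~0, ...]: zero (up to rounding) from t = 2 on
```

Note the first step overshoots to $-\beta$ before the second step lands exactly on the optimum, regardless of which admissible $\beta$ is chosen.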

Play with the controls
  1. Click on the OPT button
  2. Ensure the Input selector reads Step
  3. Slide the $\beta$ controls, and observe the system.
  1. Click on the PHB button
  2. Ensure the Input selector reads Step
  3. Slide the $\beta$ controls, and observe the system.
  1. Repeat the above steps for the NAG button.
Observations using the derived learning rate choice.
  1. Optimal $\gamma = \tfrac{\beta}{1+\beta}$. Design goal met for any $0 < \beta < 1$.
  2. PHB: $\gamma = 0$. Optimal if $\beta = 0$, and the learning rate simplifies to $\alpha = \tfrac{1}{\lambda}$.
  3. NAG: $\gamma = \tfrac{\beta}{1+\beta}$. Optimal in this setup, if $\alpha = \tfrac{1+\beta}{\eta}\tfrac{1}{\lambda}$.

In this setup, published optimal parameter choices for both PHB and NAG collapse to $\alpha=\tfrac{1}{\lambda}, \beta=\gamma=0$. This suggests that the primary function of gradient smoothing is not accelerating convergence. Here, the AutoSGM framework exposes a closed-form parameterization of the learning algorithm that certifies exponential, asymptotic convergence, such that the mean error can be driven to zero in at most two iterations for any $0 \le \beta < 1$, which includes the plain method.

This analysis highlights fundamental themes in linear systems, signal processing, and control theory. We believe this perspective better motivates what has long been called momentum in accelerated learning applications: not as a heuristic but as a precisely engineered dynamic system.

Notice that the overall LTI system dynamics, written in matrix form, both before smoothing the gradient, Equation $(\ref{eq:smeq})$, and after smoothing the gradient, Equation $(\ref{eq:amsyseq})$, take the same form: $$ \begin{equation*} \underbrace{ \begin{bmatrix} \rstp \\ \rmt \end{bmatrix} }_{\rxt} = \underbrace{ \begin{bmatrix} \beta_p & k_p \\ 1 & 1 \end{bmatrix} }_{\rA} \underbrace{ \begin{bmatrix} \rst \\ \rmtm \end{bmatrix} }_{\rxtm}. \end{equation*} $$ Compactly, $\rxt = \rA^{t}\,\rxi$. However, after gradient smoothing was introduced, the system matrix $\rA$ transitioned from a rank-1 structure to a full-rank matrix, with $\beta_p \ne k_p$. This transformation expands the admissible parameter space, allowing $\beta_p$ and $k_p$ to operate over a relatively wider dynamic range. In effect, modifying a rank-deficient matrix into a full-rank one acts as a regularization of the matrix. We refer to this effect of gradient smoothing as a lowpass regularization of the overall system dynamics.
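A numeric illustration of this rank transition (parameter values are illustrative; note that at the exact deadbeat edge $\beta_p = k_p = -1$ the matrix momentarily becomes rank-1 again, so a generic admissible $\alpha$ is used here):

```python
import numpy as np

# Before smoothing: beta_p = k_p = -alpha*lam, so the rows of A are
# proportional (rank 1). After smoothing, with generic admissible
# parameters, beta_p != k_p and A is full rank. All values illustrative.
lam = 2.0
A_plain = np.array([[-0.5, -0.5], [1.0, 1.0]])  # alpha*lam = 0.5

beta = 0.9
gamma = beta / (1.0 + beta)
eta = (1.0 - beta) / (1.0 - gamma)
alpha = 0.5 / (eta * lam)                       # generic, inside the window
beta_p = beta - alpha * eta * lam               # 0.4
k_p = -alpha * eta * lam * (1.0 - gamma)        # != beta_p here
A_smooth = np.array([[beta_p, k_p], [1.0, 1.0]])

print(np.linalg.matrix_rank(A_plain), np.linalg.matrix_rank(A_smooth))  # 1 2
```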

Bibliography

  1. Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17.
  2. Nesterov, Y. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2), 372-376.
  3. Feuer, A., & Weinstein, E. (1985). Convergence analysis of LMS filters with uncorrelated Gaussian data. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(1), 222-230. https://doi.org/10.1109/TASSP.1985.1164493
  4. Feuer, A., & Berman, N. (1986). Performance analysis of the smoothed least mean square (SLMS) algorithm. Signal Processing, 11(3), 265-276. https://doi.org/10.1016/0165-1684(86)90036-8
  5. Rugh, W. J. (1996). Linear System Theory (2nd ed.). Prentice Hall.
  6. Proakis, J. G., & Manolakis, D. G. (2006). Digital Signal Processing: Principles, Algorithms, and Applications (4th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
  7. Lessard, L., Recht, B., & Packard, A. (2016). Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1), 57-95. https://doi.org/10.1137/15M1009597

First-order LTI change-level and overall error dynamics.

  1. First plot shows the iteration-domain response of the SGM's error $\varepsilon[t]$ from an optimum point and its change-level $\Delta\varepsilon[t]$.
  2. Second plot shows two roots of the overall error behavior.
  3. Third plot shows change-level pole position.
Highlighted in the second and third plots are the stable and unstable (outside the unit circle) locations.