Momentum: a principled signal-processing framework.
The word momentum has a physical meaning: mass times velocity. But what exactly does momentum represent in the context of stochastic gradient learning? Typical explanations lean on this physical analogy to describe a set of mathematical transformations on the gradient that often help accelerate gradient descent, which is how the word `momentum` stuck. Yet they rarely address the deeper question: what exactly is this set of operations, and how do we design it?
This is where the AutoSGM framework comes in.
In these notes, using a simple problem setup that captures the essence of more complex scenarios, we recast the stochastic gradient learning dynamics as a linear time-invariant system, introduce the AutoSGM framework, derive explicit stability certificates, and uncover automatic learning-rate rules and admissible system parameter values for the exponential convergence (in the mean) of the learning algorithm.
The AutoSGM framework unifies `momentum` as gradient smoothing via a first-order linear filter in the stochastic gradient method. Beyond certifying accelerated convergence for this simple problem setup, we will see at the end of these notes that the AutoSGM framework allows us to discover Nesterov's Accelerated Gradient (NAG) as a precisely engineered system from first principles.
This implies $u_t$ has a mean $\mean{u_t} = 0$, a finite second-moment (or variance) $\mean{u_t^2} = \lambda$, with $0<\lambda<\infty$, and uncorrelated values across time $\mean{u_t\,u_k} = 0, \hbox{for } t\ne k$. Independence further implies that for any powers $i,j$, the joint moment $\mean{u_t^i \errt^j} = \mean{u_t^i}\mean{\errt^j}$.
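As a quick sanity check, the sketch below draws Gaussian samples, one concrete distribution with exactly these moments (the analysis only needs the moments themselves; the value $\lambda = 2$ is illustrative), and verifies the three properties empirically:

```python
import random
import statistics

random.seed(0)
lam = 2.0  # illustrative variance E[u_t^2]

# Gaussian u_t: zero mean, variance lam, drawn independently across time.
u = [random.gauss(0.0, lam ** 0.5) for _ in range(200_000)]

m1 = statistics.fmean(u)                                             # ≈ 0
m2 = statistics.fmean(x * x for x in u)                              # ≈ lam
lag1 = statistics.fmean(u[t] * u[t + 1] for t in range(len(u) - 1))  # ≈ 0 (uncorrelated)
print(m1, m2, lag1)
```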
Differentiating $f(\rwt)$ with respect to $\rwt$, we get the gradient $\gradt$ and Hessian $\rH_t$ as follows, $$ \begin{align} \gradt = \nabla f(\rwt) = u_t^2\cdot \varepsilon_t, & \quad \mean{\gradt} = \lambda\, \mean{\errt}.\\ \rH_t = \nabla^2 f(\rwt) = u_t^2,& \quad \mean{\rH_t} = \lambda. \end{align} $$

To minimize $f(\rwt)$, an optimization algorithm of choice is the Stochastic Gradient Method (SGM). Its simplest form is an iterative update rule $$ \rwtp = \rwt - \alpha_t \, \gradt.$$ At each iteration, the update rule receives an $\alpha_t > 0$ from a learning-rate oracle and $\gradt$ from a gradient-generating oracle. Although the algorithm updates $\rwt$ by taking a step in the direction of the negative gradient, it does not strictly assume its input is either the exact gradient or stochastic, nor does it guarantee a descent of $f(\rwt)$ per iteration. Nonetheless, the notions of stochasticity and descent have become commonly associated with the update rule, even if they are not required for it to function. Written in terms of the error $\errt$, the update rule reads $$ \errtp = \errt - \alpha_t \, \gradt.$$ We will further assume that the system dynamics are linear time-invariant (LTI). This means $\alpha_t = \alpha, \forall t$, a constant learning rate across iterations, so that $$ \errtp = \errt - \alpha \, \gradt.$$

Goal. In this controlled and interpretable setting, we desire that the algorithm drives its mean error, starting from an initial non-zero value $\errti$, to its optimum value of zero in approximately one or two iterations.
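To make the update rule concrete, here is a minimal simulation sketch of the scalar setup above, assuming Gaussian noise $u_t$ and illustrative values for $\lambda$, $\alpha$, and the initial error. Since $u_t$ is independent of $\errt$, the mean error contracts by $(1-\alpha\lambda)$ per step, so averaging over many runs the empirical mean should track $(1-\alpha\lambda)^T$:

```python
import random
import statistics

random.seed(0)
lam = 2.0                 # illustrative E[u_t^2]
alpha = 1.0 / (3 * lam)   # illustrative step size, well inside the stable range 0 < α < 1/λ
T, runs = 10, 40_000

finals = []
for _ in range(runs):
    e = 1.0                                         # initial error
    for _ in range(T):
        g = random.gauss(0.0, lam ** 0.5) ** 2 * e  # stochastic gradient u_t^2 ε_t
        e -= alpha * g                              # SGM step
    finals.append(e)

mean_final = statistics.fmean(finals)
predicted = (1 - alpha * lam) ** T                  # mean contracts by (1-αλ) per step
print(mean_final, predicted)
```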
Recalling that $\mean{u_t^2} = \lambda$ and $\mgradt = \lambda\,\merrt$, we can write the error equation as $\errtp = \errt + \derrtp$, where $\mderrtp = -\alpha \, \mgradt$. From this we get $\mgradt = \lambda\,\merrtm + \lambda\, \mderrt$, and the dynamics
$$ \begin{equation} \label{eq:smc1} \begin{aligned} \mderrtp &= -\alpha\,\lambda\,\mderrt -\alpha\,\lambda\, \merrtm \\ \merrt &= \mderrt + \merrtm. \end{aligned} \end{equation} $$ Denoting $\rmt = \merrt$, $\rst = \mderrt$, $\beta_p = -\alpha \lambda$, and $k_p = -\alpha \lambda$, the expected error dynamics becomes $$ \begin{equation}\label{eq:smeq} \boxed{ \begin{aligned} \rstp &= \beta_p\,\rst + k_p \, \rmtm \\ \rmt &= \rst + \rmtm \end{aligned}}. \end{equation} $$ Equation $(\ref{eq:smeq})$ tells us that the expected error $\rmt$ is the sum of the expected change-level error $\rst$ and the previous expected error $\rmtm$, and that the expected change-level error dynamics is that of a single-pole LTI filter, $$\rstp = \beta_p\,\rst + k_p \, \rmtm$$ with input $\rmtm$; its transfer function is $$H_{s}(z) = \frac{k_p}{1-\beta_p\,z^{-1}}.$$ The change-level system $H_{s}(z)$ is both asymptotically and exponentially stable if its single pole satisfies $\lvert \beta_p\rvert < 1$. Since $\alpha >0$, $\beta_p = -\alpha \lambda$ is negative, so this translates to $-1 < -\alpha\lambda < 0$, that is, $\alpha\lambda < 1$ and $\alpha\lambda > 0$, which imply $$\begin{equation}\label{eq:mewin} \boxed{0 < \alpha < \tfrac{1}{\lambda}}. \end{equation}$$ The overall dynamics of the expected error, written in matrix form, is $$ \begin{equation} \underbrace{ \begin{bmatrix} \rstp \\ \rmt \end{bmatrix} }_{\rxt} = \underbrace{ \begin{bmatrix} \beta_p & k_p \\ 1 & 1 \end{bmatrix} }_{\rA} \underbrace{ \begin{bmatrix} \rst \\ \rmtm \end{bmatrix} }_{\rxtm}. \label{eq:smeqmat} \end{equation} $$ Compactly, the two-state LTI system is $\rxt = \rA^{t}\,\rxi$. Again, exponential and asymptotic stability of the two-state system requires its eigenvalues $z_1 = 0$ and $\textcolor{Mahogany}{z_2 =1-\alpha\lambda}$, the roots of the characteristic polynomial $\det(z\mathbf{I}-\mathbf{A}) = z(z-(1-\alpha\lambda)) = 0$, to be distinct and have a magnitude strictly less than 1.
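The eigenvalues can be cross-checked numerically; the sketch below, assuming illustrative values of $\alpha$ and $\lambda$ (any pair with $0 < \alpha\lambda < 1$), forms the system matrix and recovers $z_1 = 0$ and $z_2 = 1-\alpha\lambda$ from its trace and determinant:

```python
alpha, lam = 0.25, 2.0        # illustrative values with 0 < αλ < 1
bp = kp = -alpha * lam        # β_p = k_p = -αλ
A = [[bp, kp], [1.0, 1.0]]    # two-state system matrix

tr = A[0][0] + A[1][1]                        # trace = 1 + β_p = 1 - αλ
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]   # determinant = β_p - k_p = 0
disc = (tr * tr - 4 * det) ** 0.5
z1, z2 = (tr - disc) / 2, (tr + disc) / 2     # roots of z^2 - tr·z + det
print(z1, z2)
```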
The maximum system eigenvalue condition $\lvert z_2 \rvert < 1$ implies $-1 < 1-\alpha\lambda < 1$, and therefore $2>\alpha\lambda$ and $\alpha\lambda > 0$, meaning $0 < \alpha < \tfrac{2}{\lambda}$, and, more strictly, $0 < \alpha \le \tfrac{1}{\lambda}$. By exploiting the rank-1 system matrix in $\rxt = \rA^{t}\,\rxi$, and taking any standard norm $\|\cdot\|$ (1-norm, 2-norm, $\infty$-norm) of both sides, with the corresponding induced matrix norm, we can obtain a certificate on the exponential convergence of the expected error states, for $t\ge 1$ as $$ \boxed{\| \rxt \| \le 2 \cdot |\textcolor{Mahogany}{z_2}|^{t-1}\, \|\rxi\|}.$$
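The certificate itself can be checked numerically; the sketch below, assuming illustrative parameter values and an arbitrary initial state, iterates the recursion and confirms that the $\infty$-norm of the state never exceeds the bound:

```python
alpha, lam = 0.3, 2.0          # illustrative values with 0 < αλ < 1
bp = kp = -alpha * lam         # β_p = k_p = -αλ
z2 = 1.0 - alpha * lam         # the nonzero eigenvalue

def step(x):
    # one application of the system matrix A = [[β_p, k_p], [1, 1]]
    return [bp * x[0] + kp * x[1], x[0] + x[1]]

x = [1.0, -0.5]                # arbitrary initial state
n0 = max(abs(v) for v in x)    # ∞-norm of the initial state
violations = 0
for t in range(1, 31):
    x = step(x)                                  # x is now A^t times the initial state
    bound = 2 * abs(z2) ** (t - 1) * n0          # certificate right-hand side
    if max(abs(v) for v in x) > bound + 1e-12:
        violations += 1
print(violations)
```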
If we analyze convergence of the error in the mean-square sense, we get a tighter range $$ \boxed{0 < \tfrac{1}{\sqrt{3}}\cdot\tfrac{1}{\lambda} \le \alpha \le \tfrac{1}{\lambda}} $$
and further, matching the mean-square convergence rate to the mean convergence rate, that is $\lvert 1 - 3\,\alpha^2\lambda^2 \rvert = \lvert 1 - \alpha \lambda \rvert$, so that both approximately converge to $0$ at the same rate, leads to the choice $$ \boxed{\alpha = \tfrac{1}{3}\cdot\tfrac{1}{\lambda}}. $$ It is still the same SGM; the key difference is that the algorithm's input is now adorned with a more protective cloth. The SGM algorithm now uses a smoothed version of the gradient: we replace $\gradt$ with $\gradvt = H_{\beta,\gamma}(z) \{ \gradt \}$, \[ H_{\beta,\gamma}(z) = \eta\,\frac{1 - \gamma z^{-1}}{1 - \beta z^{-1}}. \] In this transfer function, we have the filter's pole $ 0 < \beta < 1$, the filter's zero $ \gamma \le \beta $, and the gain $0 < \eta \le 1$. Here, we will always ensure the DC gain of the filter is $1$, that is, $\eta = \tfrac{1-\beta}{1-\gamma}$.
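The unit-DC-gain constraint is easy to verify numerically; the sketch below, assuming illustrative $\beta,\gamma$ values, evaluates $H_{\beta,\gamma}(z)$ at $z=1$ (DC) and $z=-1$ (the highest discrete frequency), confirming unit gain at DC and attenuation at high frequency, i.e. lowpass behavior:

```python
beta, gamma = 0.9, 0.45            # illustrative pole and zero, with γ ≤ β
eta = (1 - beta) / (1 - gamma)     # enforce unit DC gain

def H(z):
    # transfer function H_{β,γ}(z) = η (1 - γ z^{-1}) / (1 - β z^{-1})
    return eta * (1 - gamma / z) / (1 - beta / z)

dc_gain = H(1.0)        # z = 1  ↔ zero frequency
hf_gain = abs(H(-1.0))  # z = -1 ↔ Nyquist frequency
print(dc_gain, hf_gain)
```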
The update rule is now $ \rwtp = \rwt - \alpha \, \gradvt, $ using the error, we have $$ \errtp = \errt - \alpha \, \gradvt. $$
A causal realization of this first-order LTI filter is $$ \gradvt = \beta \, \gradvtm + \eta\,\big(\gradt - \gamma \, \gradtm \big). $$ Since \[ \derrtp = \errtp - \errt = -\alpha \, \gradvt, \] the change-level dynamics directly inherits the lowpass structure of $H_{\beta,\gamma}(z)$. In general, for any choice of $(\beta,\gamma)$, the filter's canonical realization is \[ \boxed{ \begin{aligned} \rqt &= \beta \, \rqtm + \gradt\\ \gradvt &= \eta\,(\rqt - \gamma\,\rqtm) \end{aligned}}. \] The change-level now takes the appearance of a lowpass filter \[ \derrtp = \beta \, \derrt + \alpha\,\eta\,\bigl(\gamma\,\gradtm - \gradt \bigr), \] instead of the former version \[ \derrtp = -\alpha\,\gradt. \]
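As a sanity check that the canonical realization implements the same filter as the direct difference equation, the sketch below (assuming illustrative $\beta,\gamma$ values and zero initial state) feeds an identical random gradient stream through both forms and compares the outputs:

```python
import random

random.seed(1)
beta, gamma = 0.8, 0.4
eta = (1 - beta) / (1 - gamma)   # unit DC gain

grads = [random.gauss(0.0, 1.0) for _ in range(50)]  # stand-in gradient stream

# Direct difference equation: v_t = β v_{t-1} + η (g_t - γ g_{t-1})
v_direct, v, g_prev = [], 0.0, 0.0
for g in grads:
    v = beta * v + eta * (g - gamma * g_prev)
    g_prev = g
    v_direct.append(v)

# Canonical realization: q_t = β q_{t-1} + g_t ;  v_t = η (q_t - γ q_{t-1})
v_canon, q = [], 0.0
for g in grads:
    q_prev = q
    q = beta * q_prev + g
    v_canon.append(eta * (q - gamma * q_prev))

max_gap = max(abs(a - b) for a, b in zip(v_direct, v_canon))
print(max_gap)   # ≈ 0: the two realizations coincide
```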
Recall that $\mgradt = \lambda\,\merrt$, and that $\merrt = \merrtm + \mderrt$, so that $\mgradt = \lambda\,\merrtm + \lambda\,\mderrt$, and $\mderrtp = \beta \, \mderrt + \alpha\,\eta\,\bigl(\gamma\,\mgradtm - \mgradt \bigr) $.
Denoting $\beta_p = \beta - \alpha\,\eta\,\lambda$ and $k_p = - \alpha\,\eta\,\lambda\,(1-\gamma)$, with $\rmt = \merrt$ and $\rst = \mderrt$ as before, we get
$$
\begin{equation}
\label{eq:ac1}
\begin{aligned}
\rstp &= \beta_p\,\rst + k_p\,\rmtm \\
\rmt &= \rst + \rmtm.
\end{aligned}
\end{equation}
$$
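Spelling out the intermediate step, with the same two-state matrix structure as in equation $(\ref{eq:smeqmat})$, the characteristic polynomial of the new system is $$\det(z\mathbf{I}-\rA) = z^2 - (1+\beta_p)\,z + (\beta_p - k_p) = z^2 - \bigl(1 + \beta - \alpha\,\eta\,\lambda\bigr)\,z + \bigl(\beta - \alpha\,\eta\,\lambda\,\gamma\bigr),$$ with roots $z_{1,2} = \tfrac{1}{2}\bigl((1+\beta_p) \pm \sqrt{D}\bigr)$ and discriminant $D = (1+\beta_p)^2 - 4\,(\beta_p - k_p)$.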
For distinct roots, by eigenvalue decomposition of the system matrix in $\rxt = \rA^{t}\,\rxi$, and taking any standard norm $\|\cdot\|$ (1-norm, 2-norm, $\infty$-norm) of both sides, with the corresponding induced matrix norm, we can obtain a certificate for both exponential and asymptotic convergence in the mean: $$ \boxed{\| \rxt \| \le \tfrac{8}{\lvert\sqrt{D}\rvert} \cdot |\textcolor{Mahogany}{z_2}|^{t}\, \|\rxi\|}.$$ Observe that, alternatively, in this setup, freely selecting $\gamma = 0$ while matching $\beta - \alpha\,\eta\,\lambda\,\gamma = 0$ forces $\beta=0$ for the fastest convergence, which is a fall-back to the plain SGM.
The system analysis leads to $$ \boxed{ \begin{aligned} 0 < \beta < 1,\\ \gamma = \frac{\beta}{1+\beta},\\ \alpha = \frac{1+\beta}{\eta}\, \frac{1}{\lambda}. \end{aligned} } $$
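As a numerical illustration (assuming illustrative values for $\beta$ and $\lambda$; any $0<\beta<1$ works), the sketch below applies these parameter rules together with the unit-DC-gain scaling $\eta = \tfrac{1-\beta}{1-\gamma}$, and checks that the resulting mean-error system matrix satisfies $\rA^2 = 0$: the expected error is driven exactly to zero in two iterations, matching the goal stated at the outset.

```python
lam = 2.0     # illustrative curvature E[u_t^2]
beta = 0.5    # any pole in (0, 1)

gamma = beta / (1 + beta)              # zero-placement rule
eta = (1 - beta) / (1 - gamma)         # unit-DC-gain scaling
alpha = (1 + beta) / (eta * lam)       # learning-rate rule

bp = beta - alpha * eta * lam          # β_p
kp = -alpha * eta * lam * (1 - gamma)  # k_p
A = [[bp, kp], [1.0, 1.0]]             # mean-error system matrix

# A^2 by direct multiplication; with the rules above, A is nilpotent.
A2 = [[sum(A[i][k] * A[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
print(A2)
```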