An AutoSGM Framework: Conjugate Gradient Method

Solving a Linear System.

Page created: Sep 11 2025 at 12:00 AM

Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation. Oluwasegun Somefun. “An AutoSGM Framework: Conjugate Gradient Method.” 2025.

The lowpass regularized stochastic gradient algorithm applies a first-order lowpass filter to its stochastic gradient, before multiplying by a possibly optimal iteration-dependent learning rate. We show here that this modular principle is exactly found in the standard conjugate gradient method applied to a linear system (Hestenes & Stiefel, 1952; Polyak, 1969; Shewchuk, 1994), and therefore the Conjugate Gradient Method represents an automatic stochastic gradient method (AutoSGM), where the algorithm’s hyperparameters are all iterative.

Formally, at each iteration \(t\), the parameter vector \(\mathbf{w}[t]\) is updated via the gradient \(\mathbf{g}[t]\) passed through a first-order lowpass filter

\[\boxed{ \begin{aligned} \statez{\mathbf{q}[t]} = \filter{\beta}\,\cdot \statez{\mathbf{q}[t-1]} + \input{\mathbf{g}[t]}\\ \input{\mathbf{v}[t]} = \statez{\mathbf{q}[t]} - \filter{\gamma}\,\cdot \statez{\mathbf{q}[t-1]} \\ \end{aligned} }\]

\[\boxed{ \begin{aligned} \state{\mathbf{w}[t+1]} = \state{\mathbf{w}[t]} - \gain{\alpha[t]} \cdot \input{\mathbf{v}[t]}\\ \state{\mathbf{\varepsilon}[t+1]} = \state{\mathbf{\varepsilon}[t]} - \gain{\alpha[t]} \cdot \input{\mathbf{v}[t]} \end{aligned} }\]

We will now assume that the filter parameters are possibly iteration-dependent.

\[\boxed{ \begin{aligned} \statez{\mathbf{q}[t]} = \filter{\beta[t-1]}\,\cdot \statez{\mathbf{q}[t-1]} + \input{\mathbf{g}[t]}\\ \input{\mathbf{v}[t]} = \statez{\mathbf{q}[t]} - \filter{\gamma[t-1]}\,\cdot \statez{\mathbf{q}[t-1]} \\ \state{\mathbf{w}[t+1]} = \state{\mathbf{w}[t]} - \gain{\alpha[t]} \cdot \input{\mathbf{v}[t]}\\ \end{aligned} }\]

The learning algorithm is designed to minimize the objective

\[f\big(\state{\mathbf{w}[t+1]}\big) = f\big(\state{\mathbf{w}[t]}\big) + \state{\mathbf{\varepsilon}[t+1]}^\intercal \mathbf{R}\,\state{\mathbf{\varepsilon}[t+1]} + \mathbf{n}[t+1]^\intercal \state{\mathbf{\varepsilon}[t+1]}\]

where \(\mathbf{R}\) is an \(n\times n\) symmetric positive definite (SPD) matrix, and \(\mathbf{n}[t+1]\) is the output of linear system

\[\boxed{ \mathbf{n}[t+1] = \mathbf{n}[t] + \boldsymbol{\xi}[t] }\]

driven by a zero-mean noise process with finite moments, where \(\mathbb{E}[\boldsymbol{\xi}[t]] = 0, \mathbb{E}\|\boldsymbol{\xi}[t]\|^2=\sigma^2< \infty, \mathbb{E}\|\boldsymbol{\xi}[t]\|^4 < \infty.\)

Let \(\mathbf{b} = \mathbf{R} \state{\mathbf{w}^\ast}\), then

\[\boxed{ \begin{aligned} \input{\mathbf{g}[t]} = \mathbf{R} \state{\mathbf{\varepsilon}[t]} + \mathbf{n}[t] = \mathbf{R} \state{\mathbf{w}[t]} - \mathbf{b} + \mathbf{n}[t] \end{aligned}. }\]

Therefore, by linearity, we have the gradient generating system

\[\boxed{ \begin{aligned} \input{\mathbf{g}[t+1]} = \mathbf{R} \state{\mathbf{\varepsilon}[t+1]} = \input{\mathbf{g}[t]} - \gain{\alpha[t]}\cdot\mathbf{R}\input{\mathbf{v}[t]} + \boldsymbol{\xi}[t] \end{aligned}. }\]

The classic method assumes \(\sigma^2 = 0\), and solves \(\mathbf{R} \state{\mathbf{w}[t]} = \mathbf{b}\). In exact arithmetic, the algorithm converges in at most \(n\) iterations for the linear time-invariant system described by matrix \(\mathbf{R}\).

Using the matrix \(\mathbf{R}\)-orthogonality or conjugacy conditions:

\[\boxed{ \begin{aligned} -\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{v}[t]} = 0.\\ \statez{\mathbf{q}[t+1]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]} = 0. \\ \input{\mathbf{v}[t+1]}^\intercal \mathbf{R}\input{\mathbf{v}[t]} = 0. \end{aligned} }\]

where \(\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{v}[t]} = \state{\mathbf{\varepsilon}[t+1]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}\)

After removing any extraneous scaling constants, for the simplest case \(\sigma^2 = 0\), we will arive at

\[\boxed{ \begin{aligned} \gain{\alpha[t]} = \frac{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{v}[t]}}{\input{\mathbf{v}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}} = \frac{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{g}[t]}}{\input{\mathbf{v}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}} \end{aligned}, }\]

\[\boxed{ \begin{aligned} \filter{\beta[t]} = \frac{-\input{\mathbf{g}[t+1]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}}{\statez{\mathbf{q}[t]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}} \end{aligned}, }\]

\[\boxed{ \begin{aligned} \filter{\gamma[t]} = \filter{\beta[t]} + \frac{\input{\mathbf{g}[t+1]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}}{\statez{\mathbf{q}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}} = \filter{\beta[t]} - \frac{\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{g}[t+1]}}{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{g}[t]}} \end{aligned}. }\]

The learning rate satisfies the stationarity condition of the objective function, which is an \(\mathbf{R}\)-orthogonality condition between the next parameter error and the current filter’s output.
The two filter parameters are designed to enforce \(\mathbf{R}\)-orthogonality conditions on successive states and outputs of the first-order filter. Since the update directions \(\mathbf{v}[t]\) are constructed from the filter states \(\mathbf{q}[t]\), imposing these two filter design conditions ensures that successive filter outputs represent non-overlapping information in the \(\mathbf{R}\)-energy space. First, \(\mathbf{q}[t+1]^\intercal\mathbf{R}\mathbf{q}[t]=0\) ensures that the internal filter state does not accumulate redundant components. Second, \(\mathbf{v}[t+1]^\intercal \mathbf{R}\mathbf{v}[t]=0\) enforces conjugacy of the update directions, ensuring that successive steps do not reintroduce previously eliminated components of the error. Any nonzero \(\mathbf{R}\)-inner product between successive filter states and update directions implies residual overlap in the quadratic energy, leading to redundant updates.

More generally,

\[\boxed{ \begin{aligned} \gain{\alpha[t]} = \frac{\big(\input{\mathbf{g}[t]} + \boldsymbol{\xi}[t]\big)^\intercal \input{\mathbf{v}[t]}}{\input{\mathbf{v}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}} \end{aligned}, }\]

\[\boxed{ \begin{aligned} \filter{\gamma[t]} = \filter{\beta[t]} - \frac{\input{\mathbf{g}[t+1]}^\intercal \big(\input{\mathbf{g}[t+1]} - \boldsymbol{\xi}[t]\big)}{\input{\mathbf{g}[t]}^\intercal \big(\input{\mathbf{g}[t]} + \boldsymbol{\xi}[t]\big)} \end{aligned}. }\]

For the simplest case, \(\sigma^2=0\), a possible pseudocode for this particular AutoSGM is:

\[\boxed{ \begin{aligned} \input{\hbox{INITIALIZE}}\\ \epsilon \approx 0\\ \state{\mathbf{w}[0]} = \mathbf{0}\\ \input{\mathbf{g}[0]} = \mathbf{R} \state{\mathbf{w}[0]} - \mathbf{b} + \mathbf{n}[0]\\ \statez{\mathbf{q}[0]} = \mathbf{0} + \input{\mathbf{g}[0]}\\ \input{\mathbf{v}[0]} = \statez{\mathbf{q}[0]} - \mathbf{0} \\ t = 0\\ \input{\hbox{LOOP}}\\ \gain{\alpha[t]} = \frac{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{v}[t]}}{\input{\mathbf{v}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}}\\ \state{\mathbf{w}[t+1]} = \state{\mathbf{w}[t]} - \gain{\alpha[t]} \cdot \input{\mathbf{v}[t]}\\ \\ \input{\hbox{Generate Gradient}}\\ \input{\mathbf{g}[t+1]} = \input{\mathbf{g}[t]} - \gain{\alpha[t]}\cdot\mathbf{R}\input{\mathbf{v}[t]}\\ \input{\hbox{or}}\\ \input{\mathbf{g}[t+1]} = \mathbf{R} \state{\mathbf{w}[t+1]} - \mathbf{b}\\ \input{\hbox{STOP, if }} \input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{g}[t+1]} < \epsilon^2 \\ \\ \filter{\beta[t]} = \frac{-\input{\mathbf{g}[t+1]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}}{\statez{\mathbf{q}[t]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}}\\ \filter{\gamma[t]} = \filter{\beta[t]} - \frac{\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{g}[t+1]}}{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{g}[t]}}\\ \\ \statez{\mathbf{q}[t+1]} = \filter{\beta[t]}\,\cdot \statez{\mathbf{q}[t]} + \input{\mathbf{g}[t+1]}\\ \input{\mathbf{v}[t+1]} = \statez{\mathbf{q}[t+1]} - \filter{\gamma[t]}\,\cdot \statez{\mathbf{q}[t]} \\ t = t+1\\ \input{\hbox{END LOOP}} \end{aligned} }\]

Hestenes, M. R., & Stiefel, E. (1952). Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the National Bureau of Standards, 49(6), 409. https://doi.org/10.6028/jres.049.044
Polyak, B. T. (1969). The Conjugate Gradient Method in Extremal Problems. USSR Computational Mathematics and Mathematical Physics, 9(4), 94–112. https://doi.org/10.1016/0041-5553(69)90035-4
Shewchuk, J. R. (1994). An Introduction to the Conjugate Gradient Method without the Agonizing Pain. School of Computer Science, Carnegie Mellon University.