Conjugate Gradient as AutoSGM Solving a Linear Time-Invariant System.
Page created: Sep 11 2025 at 12:00 AM
Oluwasegun Somefun. “Conjugate Gradient as AutoSGM.” AutoSGM Framework, 2025.
Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation.
We show here that the AutoSGM framework captures what has been called the Conjugate Gradient Method (Hestenes & Stiefel, 1952; Polyak, 1969; Shewchuk, 1994), and goes further by generalizing it.
Recall that AutoSGM reframes stochastic gradient optimizers through the lens of a first-order lowpass filter applied to a stochastic gradient, together with the existence of an optimal iteration-dependent learning-rate choice. These two elements are exactly the ones found in the standard conjugate gradient method, with the filter’s zero parameter turned off.
Formally, at each iteration \(t\), the parameter vector \(\mathbf{w}[t]\) is updated via the gradient \(\mathbf{g}[t]\) passed through a first-order lowpass filter
\[\boxed{ \begin{aligned} \statez{\mathbf{q}[t]} = \filter{\beta}\,\cdot \statez{\mathbf{q}[t-1]} + \input{\mathbf{g}[t]}\\ \input{\mathbf{v}[t]} = \statez{\mathbf{q}[t]} - \filter{\gamma}\,\cdot \statez{\mathbf{q}[t-1]} \\ \end{aligned} }\]
\[\boxed{ \begin{aligned} \state{\mathbf{w}[t+1]} = \state{\mathbf{w}[t]} - \gain{\alpha[t]} \cdot \input{\mathbf{v}[t]}\\ \state{\mathbf{\varepsilon}[t+1]} = \state{\mathbf{\varepsilon}[t]} - \gain{\alpha[t]} \cdot \input{\mathbf{v}[t]} \end{aligned} }\]
We will now assume that the filter parameters are possibly iteration-dependent, \(\lvert\filter{\beta[t]}\rvert < 1,\ \filter{\gamma[t]} \ne 1\).
\[\boxed{ \begin{aligned} \statez{\mathbf{q}[t]} = \filter{\beta[t-1]}\,\cdot \statez{\mathbf{q}[t-1]} + \input{\mathbf{g}[t]}\\ \input{\mathbf{v}[t]} = \statez{\mathbf{q}[t]} - \filter{\gamma[t-1]}\,\cdot \statez{\mathbf{q}[t-1]} \\ \state{\mathbf{w}[t+1]} = \state{\mathbf{w}[t]} - \gain{\alpha[t]} \cdot \input{\mathbf{v}[t]}\\ \end{aligned} }\]
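As a concrete reading of these recursions, a single iteration can be sketched in Python. This is a minimal illustration only: the function name and the use of plain NumPy arrays are our own choices, and the schedules supplying \(\gain{\alpha[t]}, \filter{\beta[t]}, \filter{\gamma[t]}\) are assumed given.

```python
import numpy as np

# Minimal sketch of one AutoSGM iteration as written above.
# q_prev is the filter state q[t-1]; alpha, beta, gamma are assumed
# supplied by (possibly iteration-dependent) schedules.
def autosgm_step(w, q_prev, g, alpha, beta, gamma):
    q = beta * q_prev + g      # filter state:  q[t] = beta*q[t-1] + g[t]
    v = q - gamma * q_prev     # filter output: v[t] = q[t] - gamma*q[t-1]
    w_next = w - alpha * v     # update:        w[t+1] = w[t] - alpha*v[t]
    return w_next, q

# With beta = gamma = 0 the filter is transparent and the step is plain SGD.
w = np.array([1.0, 2.0])
g = np.array([0.5, -0.5])
w_next, _ = autosgm_step(w, np.zeros(2), g, alpha=0.1, beta=0.0, gamma=0.0)
```

With both filter parameters zeroed, the output equals the raw gradient, which is the degenerate (plain SGD) corner of the framework.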
The learning algorithm is designed to minimize the objective
\[f\big(\state{\mathbf{w}[t+1]}\big) = f\big(\state{\mathbf{w}[t]}\big) + \state{\mathbf{\varepsilon}[t+1]}^\intercal \mathbf{R}\,\state{\mathbf{\varepsilon}[t+1]} + \mathbf{n}[t+1]^\intercal \state{\mathbf{\varepsilon}[t+1]}\]
where \(\mathbf{R}\) is an \(n\times n\) symmetric positive definite (SPD) matrix, and \(\mathbf{n}[t] \sim \mathcal{N}(0, \sigma^2), 0 \le \sigma^2 < \infty\). Given \(\mathbf{n}[t]\), \(\delta\mathbf{n}[t] \sim \mathcal{N}(-\mathbf{n}[t], \sigma^2)\) models the white noise process
\[\boxed{ \begin{aligned} \mathbf{n}[t+1]= \mathbf{n}[t] + \delta\mathbf{n}[t] \end{aligned}. }\]
Let \(\mathbf{b} = \mathbf{R} \state{\mathbf{w}^\ast}\), then
\[\boxed{ \begin{aligned} \input{\mathbf{g}[t]} = \mathbf{R} \state{\mathbf{\varepsilon}[t]} + \mathbf{n}[t] = \mathbf{R} \state{\mathbf{w}[t]} - \mathbf{b} + \mathbf{n}[t] \end{aligned}. }\]
Therefore, by linearity, we have the gradient generating system
\[\boxed{ \begin{aligned} \input{\mathbf{g}[t+1]} = \mathbf{R} \state{\mathbf{\varepsilon}[t+1]} + \mathbf{n}[t+1] = \input{\mathbf{g}[t]} - \gain{\alpha[t]}\cdot\mathbf{R}\input{\mathbf{v}[t]} + \delta\mathbf{n}[t] \end{aligned}. }\]
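The linearity claim is easy to check numerically. The sketch below is our own illustration; the SPD matrix, the noise scale, the learning rate, and the direction \(\input{\mathbf{v}[t]}\) are all arbitrary choices. It confirms that the recursion reproduces a direct evaluation of the gradient at \(\state{\mathbf{w}[t+1]}\).

```python
import numpy as np

# Numerical check of the gradient generating system (sigma^2 > 0).
rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
R = A @ A.T + n * np.eye(n)              # SPD matrix
w_star = rng.standard_normal(n)
b = R @ w_star                           # b = R w*

sigma = 0.01
w = rng.standard_normal(n)               # w[t]
n_t = sigma * rng.standard_normal(n)     # n[t]
g = R @ w - b + n_t                      # g[t] = R eps[t] + n[t]
v = g.copy()                             # some filter output v[t]
alpha = 0.1

n_next = sigma * rng.standard_normal(n)  # n[t+1]
d_n = n_next - n_t                       # delta n[t] = n[t+1] - n[t]

w_next = w - alpha * v                   # parameter update
g_recursive = g - alpha * (R @ v) + d_n  # the recursion above
g_direct = R @ w_next - b + n_next       # direct evaluation at w[t+1]
```

The two computations agree exactly (up to floating point), since both sides are the same affine function of \(\state{\mathbf{w}[t]}\) and the noise.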
The classic method assumes \(\sigma^2 = 0\), and solves \(\mathbf{R} \state{\mathbf{w}[t]} = \mathbf{b}\). In exact arithmetic, the algorithm converges in at most \(n\) iterations for the linear time-invariant system described by matrix \(\mathbf{R}\).
Using the matrix \(\mathbf{R}\)-orthogonality or conjugacy conditions:
\[\boxed{ \begin{aligned} -\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{v}[t]} = 0.\\ \statez{\mathbf{q}[t+1]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]} = 0. \\ \input{\mathbf{v}[t+1]}^\intercal \mathbf{R}\input{\mathbf{v}[t]} = 0. \end{aligned} }\]
where \(\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{v}[t]} = \state{\mathbf{\varepsilon}[t+1]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}\) in the noise-free case, since \(\mathbf{R}\) is symmetric.
After removing any extraneous scaling constants, for the simplest case \(\sigma^2 = 0\), we arrive at
\[\boxed{ \begin{aligned} \gain{\alpha[t]} = \frac{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{v}[t]}}{\input{\mathbf{v}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}} = \frac{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{g}[t]}}{\input{\mathbf{v}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}} \end{aligned}, }\]
\[\boxed{ \begin{aligned} \filter{\beta[t]} = \frac{-\input{\mathbf{g}[t+1]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}}{\statez{\mathbf{q}[t]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}} \end{aligned}, }\]
\[\boxed{ \begin{aligned} \filter{\gamma[t]} = \filter{\beta[t]} + \frac{\input{\mathbf{g}[t+1]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}}{\statez{\mathbf{q}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}} = \filter{\beta[t]} - \frac{\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{g}[t+1]}}{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{g}[t]}} \end{aligned}. }\]
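Substituting this \(\filter{\gamma[t]}\) into the filter output makes the cancellation explicit: in exact arithmetic the \(\filter{\beta[t]}\) terms drop out, leaving the Fletcher–Reeves-style direction update

\[\boxed{ \begin{aligned} \input{\mathbf{v}[t+1]} = \statez{\mathbf{q}[t+1]} - \filter{\gamma[t]}\,\cdot \statez{\mathbf{q}[t]} = \input{\mathbf{g}[t+1]} + \frac{\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{g}[t+1]}}{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{g}[t]}}\,\cdot \statez{\mathbf{q}[t]} \end{aligned}. }\]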
The learning rate satisfies the stationarity condition of the objective function, which is an \(\mathbf{R}\)-orthogonality condition between the next parameter error and the current filter’s output. The two filter parameters are designed to enforce the \(\mathbf{R}\)-orthogonality conditions of successive states and outputs of the first-order filter. More generally,
\[\boxed{ \begin{aligned} \gain{\alpha[t]} = \frac{\big(\input{\mathbf{g}[t]} + \delta\mathbf{n}[t]\big)^\intercal \input{\mathbf{v}[t]}}{\input{\mathbf{v}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}} \end{aligned}, }\]
\[\boxed{ \begin{aligned} \filter{\beta[t]} = \frac{-\input{\mathbf{g}[t+1]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}}{\statez{\mathbf{q}[t]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}} \end{aligned}, }\]
\[\boxed{ \begin{aligned} \filter{\gamma[t]} = \filter{\beta[t]} - \frac{\input{\mathbf{g}[t+1]}^\intercal \big(\input{\mathbf{g}[t+1]} - \delta\mathbf{n}[t]\big)}{\input{\mathbf{g}[t]}^\intercal \big(\input{\mathbf{g}[t]} + \delta\mathbf{n}[t]\big)} \end{aligned}. }\]
Pseudocode for this AutoSGM, following the prevalent style of presenting the conjugate gradient algorithm, is:
\[\boxed{ \begin{aligned} \input{\hbox{INITIALIZE}}\\ \epsilon \approx 0\\ \state{\mathbf{w}[0]} = \mathbf{0}\\ \input{\mathbf{g}[0]} = \mathbf{R} \state{\mathbf{w}[0]} - \mathbf{b} + \mathbf{n}[0]\\ \statez{\mathbf{q}[0]} = \mathbf{0} + \input{\mathbf{g}[0]}\\ \input{\mathbf{v}[0]} = \statez{\mathbf{q}[0]} - \mathbf{0} \\ t = 0\\ \input{\hbox{LOOP}}\\ \gain{\alpha[t]} = \frac{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{v}[t]}}{\input{\mathbf{v}[t]}^\intercal \mathbf{R}\input{\mathbf{v}[t]}}\\ \state{\mathbf{w}[t+1]} = \state{\mathbf{w}[t]} - \gain{\alpha[t]} \cdot \input{\mathbf{v}[t]}\\ \\ \input{\hbox{Generate Gradient}}\\ \input{\mathbf{g}[t+1]} = \input{\mathbf{g}[t]} - \gain{\alpha[t]}\cdot\mathbf{R}\input{\mathbf{v}[t]}\\ \input{\hbox{or}}\\ \input{\mathbf{g}[t+1]} = \mathbf{R} \state{\mathbf{w}[t+1]} - \mathbf{b}\\ \input{\hbox{STOP, if }} \input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{g}[t+1]} < \epsilon^2 \\ \\ \filter{\beta[t]} = \frac{-\input{\mathbf{g}[t+1]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}}{\statez{\mathbf{q}[t]}^\intercal \mathbf{R}\statez{\mathbf{q}[t]}}\\ \filter{\gamma[t]} = \filter{\beta[t]} - \frac{\input{\mathbf{g}[t+1]}^\intercal \input{\mathbf{g}[t+1]}}{\input{\mathbf{g}[t]}^\intercal \input{\mathbf{g}[t]}}\\ \\ \statez{\mathbf{q}[t+1]} = \filter{\beta[t]}\,\cdot \statez{\mathbf{q}[t]} + \input{\mathbf{g}[t+1]}\\ \input{\mathbf{v}[t+1]} = \statez{\mathbf{q}[t+1]} - \filter{\gamma[t]}\,\cdot \statez{\mathbf{q}[t]} \\ t = t+1\\ \input{\hbox{END LOOP}} \end{aligned} }\]
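The loop above translates directly into code. The sketch below is a minimal NumPy rendering of the pseudocode for the noise-free case \(\sigma^2 = 0\); the particular SPD matrix \(\mathbf{R}\) and target \(\state{\mathbf{w}^\ast}\) are our own illustrative choices. Consistent with the \(n\)-step convergence claim, it recovers \(\state{\mathbf{w}^\ast}\) after at most \(n\) iterations.

```python
import numpy as np

# Minimal NumPy rendering of the pseudocode above, sigma^2 = 0.
rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
R = A @ A.T + n * np.eye(n)          # symmetric positive definite
w_star = rng.standard_normal(n)
b = R @ w_star

eps = 1e-10
w = np.zeros(n)                      # w[0]
g = R @ w - b                        # g[0] (noise-free)
q = g.copy()                         # q[0] = 0 + g[0]
v = q.copy()                         # v[0] = q[0] - 0

for t in range(n):
    Rv = R @ v
    alpha = (g @ v) / (v @ Rv)       # optimal learning rate
    w = w - alpha * v
    g_next = g - alpha * Rv          # gradient generating system
    if g_next @ g_next < eps**2:     # STOP condition
        break
    Rq = R @ q
    beta = -(g_next @ Rq) / (q @ Rq)            # enforces q[t+1]^T R q[t] = 0
    gamma = beta - (g_next @ g_next) / (g @ g)  # enforces v[t+1]^T R v[t] = 0
    q_next = beta * q + g_next       # filter state update
    v = q_next - gamma * q           # filter output
    q, g = q_next, g_next
```

After the loop, `w` agrees with `w_star` to floating-point precision, matching the classic conjugate gradient behavior on this linear time-invariant system.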
Hestenes, M. R., & Stiefel, E. (1952). Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the National Bureau of Standards, 49(6), 409. https://doi.org/10.6028/jres.049.044
Polyak, B. T. (1969). The Conjugate Gradient Method in Extremal Problems. USSR Computational Mathematics and Mathematical Physics, 9(4), 94–112. https://doi.org/10.1016/0041-5553(69)90035-4
Shewchuk, J. R. (1994). An Introduction to the Conjugate Gradient Method without the Agonizing Pain. School of Computer Science, Carnegie Mellon University.