Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation. Oluwasegun Somefun. “An AutoSGM Framework: Conjugate Gradient Method.” 2025.
The lowpass regularized stochastic gradient algorithm applies a first-order lowpass filter to its stochastic gradient, before multiplying by a possibly optimal iteration-dependent learning rate. We show here that this modular principle is exactly found in the standard conjugate gradient method applied to a linear system (Hestenes & Stiefel, 1952; Polyak, 1969; Shewchuk, 1994), and therefore the Conjugate Gradient Method represents an automatic stochastic gradient method (AutoSGM), where the algorithm’s hyperparameters are all iterative.
Formally, at each iteration \(t\), the parameter vector \(\mathbf{w}[t]\) is updated via the gradient \(\mathbf{g}[t]\) passed through a first-order lowpass filter
driven by a zero-mean noise process with finite moments, where \(\mathbb{E}[\boldsymbol{\xi}[t]] = 0, \mathbb{E}\|\boldsymbol{\xi}[t]\|^2=\sigma^2< \infty, \mathbb{E}\|\boldsymbol{\xi}[t]\|^4 < \infty.\)
Let \(\mathbf{b} = \mathbf{R} \state{\mathbf{w}^\ast}\), then
The classic method assumes \(\sigma^2 = 0\), and solves \(\mathbf{R} \state{\mathbf{w}[t]} = \mathbf{b}\). In exact arithmetic, the algorithm converges in at most \(n\) iterations for the linear time-invariant system described by matrix \(\mathbf{R}\).
Using the matrix \(\mathbf{R}\)-orthogonality or conjugacy conditions:
The learning rate satisfies the stationarity condition of the objective function, which is an \(\mathbf{R}\)-orthogonality condition between the next parameter error and the current filter’s output.
The two filter parameters are designed to enforce \(\mathbf{R}\)-orthogonality conditions on successive states and outputs of the first-order filter. Since the update directions \(\mathbf{v}[t]\) are constructed from the filter states \(\mathbf{q}[t]\), imposing these two filter design conditions ensures that successive filter outputs represent non-overlapping information in the \(\mathbf{R}\)-energy space. First, \(\mathbf{q}[t+1]^\intercal\mathbf{R}\mathbf{q}[t]=0\) ensures that the internal filter state does not accumulate redundant components. Second, \(\mathbf{v}[t+1]^\intercal \mathbf{R}\mathbf{v}[t]=0\) enforces conjugacy of the update directions, ensuring that successive steps do not reintroduce previously eliminated components of the error. Any nonzero \(\mathbf{R}\)-inner product between successive filter states and update directions implies residual overlap in the quadratic energy, leading to redundant updates.
Hestenes, M. R., & Stiefel, E. (1952). Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the National Bureau of Standards, 49(6), 409. https://doi.org/10.6028/jres.049.044
Polyak, B. T. (1969). The Conjugate Gradient Method in Extremal Problems. USSR Computational Mathematics and Mathematical Physics, 9(4), 94–112. https://doi.org/10.1016/0041-5553(69)90035-4
Shewchuk, J. R. (1994). An Introduction to the Conjugate Gradient Method without the Agonizing Pain. School of Computer Science, Carnegie Mellon University.