Momentum as Lowpass Filter Design

Fastest mean convergence can be achieved without Momentum.

Page created: Sep 11 2025 at 12:00 AM

Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation. Oluwasegun Somefun. “Momentum as Lowpass Filter Design.” 2025.


Here we introduce the concept of momentum (Polyak, 1969) as a lowpass-regularized stochastic gradient algorithm: one that applies a first-order lowpass filter to its stochastic gradient before multiplying by a possibly optimal iteration-dependent learning rate. For analytic purposes, we study a minimal stochastic system in which the algorithm’s hyperparameters are assumed time-invariant. More specifically, we show how Nesterov’s momentum arises as a solution to the fastest mean error convergence of this problem. More importantly, we show that, structurally, the plain stochastic gradient method, Heavy-Ball momentum, and Nesterov’s momentum correspond to different filter designs of the same algorithm.


Understanding Momentum as a Lowpass Filter

The way we talk about training neural networks is often rooted in classical mechanics. We describe momentum as a ball rolling down a hilly landscape, gaining “inertia” to carry it over local bumps. While these physical analogies are intuitive, they are imprecise. In the world of high-dimensional, non-convex optimization, “inertia” is a myth that obscures the underlying mechanics.

Traditionally, we view momentum as an estimation of the first moment of a gradient. However, if we look at the discrete, iteration-level operations, we see a literal mathematical realization of a first-order filter acting on the stochastic gradient.

By treating gradients as a noisy signal and momentum as a first-order lowpass filter, we can move from heuristic “tricks” to a modular framework for building the next generation of AI training tools.


Our central claim is that a literal mathematical interpretation already exists via classical signal processing tools related to first-order lowpass filters.


In the prevailing discourse of stochastic optimization, momentum is frequently characterized through the lens of “first-moment estimation”—a heuristic view where the algorithm maintains a moving average of past gradients to smooth out noise. However, this perspective fails to fully elucidate the system-theoretic mechanisms behind accelerated learning. By reframing momentum as a signal-processing operation, we move toward a more granular understanding. This abstraction allows us to treat optimizers not as black-box heuristics, but as modular dynamical systems whose behaviors are dictated by fundamental properties such as frequency response, group delay, and pole-zero stability.

The core thesis of this analysis is that momentum-based updates are mathematically equivalent to passing stochastic gradients through a first-order linear lowpass filter. This “gradient filtering” does more than just estimate a mean; it dynamically regularizes the update path by attenuating high-frequency stochastic noise while passing the low-frequency “signal” of the true gradient.

To understand how modern AI learns, we must look beyond the simple idea of “moving down a hill.” Traditional momentum is often described through analogies of physical inertia—giving the gradient “mass” to roll past local minima. However, this walkthrough explores a more rigorous and powerful perspective: Momentum is a first-order linear lowpass filter. By treating the noisy signals of machine learning as a data stream to be smoothed, we can design optimizers that are not just fast, but dynamically stable.

In this guide, we reframe momentum not merely as a way to “go faster”, but as a method for cleaning a noisy signal. It is the difference between a model that gets distracted by the “static” of individual data points and one that stays locked onto the fastest path to the truth. By treating the training process as a digital filter, we can design optimizers that are resilient, stable, and theoretically precise.

Key Insight
In AI optimization, momentum acts as a first-order Lowpass Filter for stochastic gradients. It attenuates (weakens) high-frequency noise—the random variance between data points—while preserving the low-frequency “signal” that points toward the global minimum.

This “cleaning” process is governed by the same mathematical principles used in audio engineering and control systems: the placement of Poles and Zeros.

Architectural Pivot: From Moment Estimation to Gradient Filtering

The prevailing paradigm in stochastic optimization interprets momentum primarily as a heuristic for “moment estimation,” where algorithms like Adam or Polyak’s Heavy-Ball are viewed as mechanisms to track the first moment of noisy gradient signals. This specification advocates for a strategic architectural pivot: abstracting momentum as a first-order linear lowpass filter acting on a stochastic gradient signal. By treating the optimization path as the output of a linear iteration-invariant (LII) system, we move beyond statistical averages toward a framework defined by pole-zero placement and dynamical regularization.

This shift is a prerequisite for true modularity in optimizer design. It allows the system architect to decouple the temporal filtering (which regulates the “memory” and spectral shape of the gradient) from the spatial normalization (which determines the learning rate, or step magnitude). Crucially, while the analysis below demonstrates that the fastest mean error convergence can be achieved with or without filtering (e.g., in a deadbeat stochastic gradient method), the primary utility of the gradient filter is not acceleration alone, but the dynamical regularization of update steps to ensure stability across complex error surfaces.

We pass the gradient through the filter \(H_{\beta,\gamma}\), designed via pole (\(\beta\)) and zero (\(\gamma\)) placement. By shifting the pole and zero, we obtain different flavors of momentum.

We use this setup to show that momentum cannot be mathematically viewed as a first-moment estimation of the gradient.

Foundation: A Simple 1-D Error Setup

Before tackling high-dimensional neural networks, we start in a minimalist “laboratory”: a 1D quadratic environment. Stripping away tensor algebra allows us to view the mathematical skeleton of error dynamics with absolute transparency.

We define our environment using the following variables:

  • \(w(t)\) (Parameter): The coordinate of our model at iteration \(t\).
  • \(\epsilon(t)\) (Error): The distance between our current parameter and the optimum \(\bigl(w(t) - w^*\bigr)\). Driving this to zero is our primary objective.
  • \(\psi(t)\) (Data Process): A zero-mean, wide-sense stationary random variable representing the stochastic “noise” in our data samples.
  • \(\lambda\) (Scale): The variance or second moment \(\mathbb{E}[\psi^2(t)]\). This represents the “curvature” or scale of the optimization landscape.

Section Insight: Why start here?
In this 1D world, the “so what” is clarity. By modeling the loss as

\[f(w(t)) = \frac{1}{2}\psi^2(t)\epsilon^2(t),\]

we can analytically track how the expected error evolves over time, turning optimization into a predictable linear system.
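A minimal sketch of this per-sample loss and its stochastic gradient, assuming the gradient is taken with respect to \(w\), so that \(g(t) = \psi^2(t)\epsilon(t)\). The function names are illustrative and do not come from the source.

```python
def loss(psi: float, eps: float) -> float:
    """Per-sample loss f = 0.5 * psi^2 * eps^2, with eps = w - w_star."""
    return 0.5 * psi**2 * eps**2


def stochastic_gradient(psi: float, eps: float) -> float:
    """Derivative of the loss with respect to w (equivalently, with respect to eps)."""
    return psi**2 * eps


# Central finite-difference check of the gradient at an arbitrary point.
psi, eps, h = 1.5, -0.4, 1e-6
numeric = (loss(psi, eps + h) - loss(psi, eps - h)) / (2 * h)
analytic = stochastic_gradient(psi, eps)
```

If \(\psi(t)\) is additionally assumed independent of \(\epsilon(t)\), taking expectations gives \(\mathbb{E}[g(t)] = \lambda\, m(t)\) with \(m(t) = \mathbb{E}[\epsilon(t)]\), which is what makes the mean error dynamics linear.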

Now that the environment is set, let’s look at how a basic algorithm, the “unfiltered” baseline, navigates this space.

Baseline: Plain Stochastic Gradient Method

Mean Error Dynamics

To uncover the structural facts that persist in higher dimensions, we utilize a one-dimensional stochastic quadratic setup. This “minimalist model” serves as a strategic laboratory, exposing the linear iteration-invariant dynamics that govern convergence. Within this model, the expected error

\[m(t) = \mathbb{E}[\epsilon(t)]\]

and expected error change

\[s(t) = \mathbb{E}[\Delta(t)]\]

form a two-state system.

In a plain Stochastic Gradient Method (SGM), where the update is

\[w(t+1) = w(t) - \alpha g(t),\]

the system matrix \(A\) is defined as:

\[A = \begin{bmatrix} 1 & 1 \\ k_p & a_p \end{bmatrix}\]

For plain SGM,

\[a_p = -\alpha \lambda \quad \text{and} \quad k_p = -\alpha \lambda.\]

Because

\[a_p = k_p,\]

the system is rank-one. In this memoryless configuration, the expected error can theoretically achieve “deadbeat” behavior (converging in a single step). The fastest possible mean convergence occurs at a learning rate of

\[\alpha = \frac{1}{\lambda}.\]

In this mode, the expected error drops to zero in exactly one step—a behavior called Deadbeat. While mathematically “fast,” this is often disastrous in practice. Why? Because this “fastest” mode is hypersensitive to noise variability. We trade this one-step speed for a slower, smoother contraction to reduce variance and ensure the model doesn’t bounce out of the optimum.
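The deadbeat claim can be checked with a short sketch. It assumes the expected-error recursion \(m(t+1) = (1 - \alpha\lambda)\,m(t)\), which follows from \(a_p = k_p = -\alpha\lambda\) when \(\psi(t)\) is independent of \(\epsilon(t)\). The helper name and the numeric values are illustrative.

```python
def mean_error_path(alpha: float, lam: float, m0: float, steps: int) -> list[float]:
    """Iterate m(t+1) = (1 - alpha * lam) * m(t), the expected-error recursion of plain SGM."""
    path = [m0]
    for _ in range(steps):
        path.append((1.0 - alpha * lam) * path[-1])
    return path


lam = 4.0
# alpha = 1 / lam: the mean error is annihilated after a single step.
deadbeat = mean_error_path(alpha=1.0 / lam, lam=lam, m0=1.0, steps=3)
# alpha = 0.5 / lam: a slower, geometric contraction with factor 0.5.
slower = mean_error_path(alpha=0.5 / lam, lam=lam, m0=1.0, steps=3)
```

This tracks only the mean. It says nothing about the variance of the iterates, which is the quantity the slower setting is meant to protect.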

Because plain SGM is so sensitive to jitter, we need a way to smooth the path: the gradient filter.

The Lowpass Filter

As captured by the transfer function

\[H_{\beta,\gamma}(z)\]

the dynamics of any momentum variant are defined by three fundamental components:

  • Pole (\(\beta\)): Controls the memory and Bounded-Input, Bounded-Output (BIBO) stability of the filter, which requires
    \(|\beta| < 1\)

  • Zero (\(\gamma\)): Reshapes the phase and frequency response, acting as a lead or lag compensator to distinguish between different momentum styles.

  • Gain (\(\eta\)): A normalization constant, typically set to
    \(\frac{1-\beta}{1-\gamma}\)
    to ensure the steady-state frequency response is normalized to unity.

By treating the gradient as a signal and the update rule as a dynamical system, we can transition from qualitative descriptions of “inertia” to a formal framework that analyzes how these filters interact with the underlying error surface.

Instead of using the raw gradient \(g(t)\), we pass it through a first-order linear lowpass filter

\[H_{\beta,\gamma}(z).\]

This transforms the noisy input into a refined velocity \(v(t)\).

In signal processing, a lowpass filter attenuates high-frequency “chatter” (noise) while allowing the low-frequency “trend” (the true gradient) to pass through.

  • The Pole (\(\beta\)): Controls the filter’s memory and damping. It must stay inside the unit circle
    \(|\beta| < 1\)
    for stability.

  • The Zero (\(\gamma\)): Controls the phase and “lead” of the filter. For lowpass behavior, we typically require
    \(\gamma < \beta.\)

Section Insight: The Non-Expansive Property
The secret to stabilization lies in the following energy inequality:

\[\mathbb{E}[v^2(t)] \le \mathbb{E}[g^2(t)].\]

Mathematically, the filter is non-expansive because its power gain

\[|H(e^{j\omega})|^2 \le 1\]

for all frequencies. This ensures that the momentum-weighted update never contains more “energy” or variance than the raw noise that created it.
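As a rough numerical check of this bound, the sketch below evaluates the power gain on a frequency grid. It assumes the first-order form \(H_{\beta,\gamma}(z) = \eta\,\frac{1-\gamma z^{-1}}{1-\beta z^{-1}}\) with \(\eta = \frac{1-\beta}{1-\gamma}\) given in the Transfer Function section. The pole and zero values are arbitrary choices satisfying \(\gamma < \beta\).

```python
import cmath
import math


def power_gain(omega: float, beta: float, gamma: float) -> float:
    """|H(e^{j omega})|^2 for H(z) = eta (1 - gamma z^-1) / (1 - beta z^-1)."""
    eta = (1.0 - beta) / (1.0 - gamma)
    z_inv = cmath.exp(-1j * omega)
    h = eta * (1.0 - gamma * z_inv) / (1.0 - beta * z_inv)
    return abs(h) ** 2


beta, gamma = 0.9, 0.3
omegas = [math.pi * k / 200 for k in range(201)]  # grid on [0, pi]
gains = [power_gain(w, beta, gamma) for w in omegas]
max_gain = max(gains)  # attained at omega = 0, where H(1) = 1
```

On this grid the gain is largest at \(\omega = 0\) and smallest at \(\omega = \pi\), which is the lowpass shape the text describes. A grid check is only suggestive, not a proof.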

The gradient filtering subsystem is defined by the transfer function \(H_{\beta,\gamma}(z)\), which regulates how gradient information is attenuated or amplified over iterations. In an iteration-invariant setup, this filter conditions the wide-sense stationary stochastic gradient \(g(t)\) into a smoothed update signal \(v(t)\).

The Transfer Function

The behavior of the filtered gradient is governed by:

\[H_{\beta,\gamma}(z) = \eta \frac{1-\gamma z^{-1}}{1-\beta z^{-1}}\]

Where:

  • Pole (\(\beta\)): Controls the exponential decay of information. BIBO stability requires
    \(|\beta| < 1\)

  • Zero (\(\gamma\)): Reshapes the transient response and determines the “type” of momentum (e.g., Nesterov vs. Heavy-Ball).

  • Gain Constant (\(\eta\)): Defined as
    \(\eta = \frac{1-\beta}{1-\gamma}\)
    ensuring
    \(H(1) = 1\)

Canonical Realization

To implement this filter in software, we utilize a state-space representation. For any choice of \((\beta, \gamma)\), the canonical realization generates an intermediate state \(q(t)\):

\[q(t) = \beta q(t-1) + g(t)\]

\[v(t) = \eta \bigl(q(t) - \gamma q(t-1)\bigr)\]
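A minimal sketch of this realization, assuming a zero initial state \(q(-1) = 0\) and the gain \(\eta = \frac{1-\beta}{1-\gamma}\). The function name is hypothetical.

```python
from typing import Iterable, List


def filter_gradients(grads: Iterable[float], beta: float, gamma: float) -> List[float]:
    """Apply q(t) = beta q(t-1) + g(t), then v(t) = eta (q(t) - gamma q(t-1))."""
    if not abs(beta) < 1.0:
        raise ValueError("BIBO stability requires |beta| < 1")
    eta = (1.0 - beta) / (1.0 - gamma)  # unity DC gain, H(1) = 1
    q_prev = 0.0  # assumed zero initial state
    out = []
    for g in grads:
        q = beta * q_prev + g
        out.append(eta * (q - gamma * q_prev))
        q_prev = q
    return out


# A constant gradient passes through unchanged once the transient has decayed.
v = filter_gradients([1.0] * 200, beta=0.9, gamma=0.0)
# With beta = gamma = 0 the filter is the identity (plain SGM).
identity = filter_gradients([3.0, -2.0, 5.0], beta=0.0, gamma=0.0)
```

Only one scalar of state, `q_prev`, is carried between iterations, so the memory cost matches that of a conventional momentum buffer.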

Stability and Energy Constraints

  • Lowpass Condition:
    \(\gamma < \beta\)

  • Non-expansive Energy:
    \(\mathbb{E}[v^2(t)] \le \mathbb{E}[g^2(t)]\)

    Equality holds when \(\beta = \gamma\) (the pole and zero cancel, so \(H \equiv 1\)), which includes the plain case \(\beta = \gamma = 0\).

  • Spectral Behavior:
    The placement of \(\beta\) and \(\gamma\) reshapes the update dynamics rather than merely providing acceleration, allowing for a controlled attenuation of high-frequency noise in the stochastic signal.

With the filter smoothing our inputs, we must now analyze how it reshapes the “engine” of learning.

Expected Error Dynamics

The “So What?” of Full-Rank Lowpass Regularization

The introduction of gradient filtering transforms \(A\) into a full-rank system:

\[a_p \neq k_p.\]

By decoupling \(a_p\) and \(k_p\), momentum creates a frequency-domain constraint. This allows for lowpass regularization, where the optimizer uses its additional state to buffer the parameter path against stochastic variability.

The stability of this system is determined by the eigenvalues

\[(z_1, z_2)\]

of \(A\), governed by the discriminant:

\[D = (1 + a_p)^2 - 4(a_p - k_p).\]
  • Repeated Real Roots (\(D = 0\)): Representing critical damping.
  • Distinct Real Roots (\(D > 0\)): Representing over-damped, non-oscillatory dynamics.
  • Complex-Conjugate Roots (\(D < 0\)): Resulting in under-damped, oscillatory transients that can provide faster decay toward the mean.
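The three cases can be sketched as a small classifier on the discriminant. The function name and the tolerance used to detect \(D = 0\) are illustrative choices.

```python
def classify_roots(a_p: float, k_p: float, tol: float = 1e-12) -> str:
    """Classify the roots of z^2 - (1 + a_p) z + (a_p - k_p) via the discriminant D."""
    d = (1.0 + a_p) ** 2 - 4.0 * (a_p - k_p)
    if abs(d) <= tol:
        return "repeated real (critically damped)"
    if d > 0:
        return "distinct real (over-damped)"
    return "complex conjugate (under-damped)"


rank_one = classify_roots(a_p=-0.5, k_p=-0.5)    # D = 0.25
repeated = classify_roots(a_p=-1.0, k_p=-1.0)    # D = 0
oscillatory = classify_roots(a_p=0.5, k_p=-0.5)  # D = -1.75
```

The classification describes the shape of the transient only. Stability is a separate question that depends on the root magnitudes.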

When momentum is applied, the optimizer no longer tracks a single value; it becomes a two-state system. We represent the expected error dynamics through the system matrix \(A\):

\[A = \begin{bmatrix} 1 & 1 \\ k_p & a_p \end{bmatrix}.\]

Here, the system tracks:

  • Expected Error \(m(t)\)
  • Expected Error Change \(s(t)\)

The constants \(k_p\) and \(a_p\) are derived from our filter parameters \((\beta, \gamma)\) and the scale \(\lambda\).
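One way to obtain these constants is sketched here as an assumption, not as the source's stated derivation. Take the update as \(w(t+1) = w(t) - \alpha v(t)\), define \(\Delta(t) = \epsilon(t+1) - \epsilon(t)\), and assume \(\psi(t)\) is independent of \(\epsilon(t)\) so that \(\mathbb{E}[g(t)] = \lambda m(t)\). Taking expectations of the filter recursion then gives \(k_p = -\alpha\lambda(1-\beta)\) and \(a_p = \beta - \alpha\eta\lambda\). These reduce to the plain-SGM values \(a_p = k_p = -\alpha\lambda\) when \(\beta = \gamma = 0\).

```python
import cmath


def error_system(alpha: float, beta: float, gamma: float,
                 lam: float) -> tuple[float, float, list[list[float]]]:
    """Return (k_p, a_p, A) under the assumptions stated in the lead-in."""
    eta = (1.0 - beta) / (1.0 - gamma)
    k_p = -alpha * lam * (1.0 - beta)  # equals -alpha * eta * lam * (1 - gamma)
    a_p = beta - alpha * eta * lam
    A = [[1.0, 1.0], [k_p, a_p]]
    return k_p, a_p, A


def eigenvalues(a_p: float, k_p: float) -> tuple[complex, complex]:
    """Roots of z^2 - (1 + a_p) z + (a_p - k_p) = 0."""
    trace, det = 1.0 + a_p, a_p - k_p
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


# Plain SGM as a special case: expect eigenvalues 1 - alpha * lam and 0.
k_p, a_p, A = error_system(alpha=0.1, beta=0.0, gamma=0.0, lam=2.0)
z1, z2 = eigenvalues(a_p, k_p)
```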

The stability of this engine is determined by its eigenvalues \((z_1, z_2)\):

  • Repeated Real Roots: Smooth, monotonic decay
  • Distinct Real Roots: Dominated by the slowest mode
  • Complex-Conjugate Roots: Underdamped oscillations that can accelerate convergence

Section Insight: The Stability Certificate
For the optimizer to be “safe,” the eigenvalues must remain within the unit circle:

\[|z| < 1.\]

If the eigenvalues exit this circle, the system becomes unstable and the error explodes.

Deadbeat Convergence in Nesterov Momentum

The framework identifies Nesterov momentum as the “fastest mode” of convergence for the expected error dynamics. By setting the zero at:

\[\gamma = \frac{\beta}{1+\beta}\]

and the learning rate such that:

\[\alpha \eta \lambda = 1 + \beta\]

the system achieves deadbeat behavior.

The characteristic polynomial:

\[z^2 - (1+a_p)z + (a_p - k_p) = 0\]

yields eigenvalues:

\[z_1 = z_2 = 0\]

resulting in the fastest possible exponential decay of the mean error.
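The deadbeat setting can be checked by iterating the mean dynamics directly. This sketch assumes the update \(w(t+1) = w(t) - \alpha v(t)\), a zero initial filter state, and \(\mathbb{E}[g(t)] = \lambda m(t)\) (independence of \(\psi(t)\) and \(\epsilon(t)\)). The helper name and numeric values are illustrative.

```python
def mean_error_trajectory(alpha: float, beta: float, gamma: float,
                          lam: float, m0: float, steps: int) -> list[float]:
    """Expected-error path of w(t+1) = w(t) - alpha * v(t), using E[g(t)] = lam * m(t)."""
    eta = (1.0 - beta) / (1.0 - gamma)
    m, q_prev, path = m0, 0.0, [m0]
    for _ in range(steps):
        g = lam * m                     # expected gradient (independence assumed)
        q = beta * q_prev + g           # filter state
        v = eta * (q - gamma * q_prev)  # filtered gradient
        m, q_prev = m - alpha * v, q
        path.append(m)
    return path


beta, lam = 0.5, 2.0
gamma = beta / (1.0 + beta)         # Nesterov zero placement
eta = (1.0 - beta) / (1.0 - gamma)
alpha = (1.0 + beta) / (eta * lam)  # so that alpha * eta * lam = 1 + beta
path = mean_error_trajectory(alpha, beta, gamma, lam, m0=1.0, steps=4)
```

In this sketch the mean error reaches zero after two iterations and stays there, as expected for a two-state system whose eigenvalues are both zero.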

Nesterov Momentum (NAG)

To see this engine in action, let’s compare trivial configurations of the pole and zero.

Heavy-Ball Momentum (HB)

Note that the expected gradient is … Therefore HB momentum cannot be mathematically viewed as a first-moment estimation of the gradient.

Plain

We analyze the characteristic polynomial using Jury stability conditions.

Define:

\[\bar{\alpha} = \alpha \eta \lambda\]

Stability Region: Nesterov Momentum

For:

\[-\frac{1}{3} \le \beta < 1\]

\[0 < \alpha \eta \lambda < \frac{2(1+\beta)^2}{1+2\beta}\]

Stability Region: Heavy-Ball Momentum

For:

\[0 \le \beta < 1\]

\[0 < \alpha \eta \lambda < 2(1+\beta)\]
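The two regions can be wrapped in a small helper that returns the upper limit on \(\bar{\alpha} = \alpha\eta\lambda\) and converts it to a learning-rate limit. The function names and the string labels for the variants are hypothetical. The zero placements \(\gamma = 0\) (Heavy-Ball) and \(\gamma = \frac{\beta}{1+\beta}\) (Nesterov) are taken from the text.

```python
def max_stable_gain(beta: float, variant: str) -> float:
    """Upper limit on alpha_bar = alpha * eta * lambda from the stated stability regions."""
    if variant == "heavy_ball":
        if not 0.0 <= beta < 1.0:
            raise ValueError("Heavy-Ball region is stated for 0 <= beta < 1")
        return 2.0 * (1.0 + beta)
    if variant == "nesterov":
        if not -1.0 / 3.0 <= beta < 1.0:
            raise ValueError("Nesterov region is stated for -1/3 <= beta < 1")
        return 2.0 * (1.0 + beta) ** 2 / (1.0 + 2.0 * beta)
    raise ValueError(f"unknown variant: {variant}")


def max_learning_rate(beta: float, lam: float, variant: str) -> float:
    """Convert the alpha_bar limit into a limit on alpha, using eta = (1 - beta) / (1 - gamma)."""
    bound = max_stable_gain(beta, variant)
    gamma = 0.0 if variant == "heavy_ball" else beta / (1.0 + beta)
    eta = (1.0 - beta) / (1.0 - gamma)
    return bound / (eta * lam)


hb_bound = max_stable_gain(0.5, "heavy_ball")
nag_bound = max_stable_gain(0.5, "nesterov")
hb_alpha = max_learning_rate(0.5, lam=1.0, variant="heavy_ball")
```

At \(\beta = 0\) both limits reduce to \(\bar{\alpha} < 2\), i.e. \(0 < \alpha\lambda < 2\) for the unfiltered case.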

Insights

Polyak’s Heavy-Ball (HB) and Nesterov Accelerated Gradient (NAG) are specific choices of pole-zero placement.

The two pillars of momentum, Polyak’s Heavy-Ball (HB) and Nesterov Accelerated Gradient (NAG), are often viewed as distinct algorithms. Through our filter lens, however, they are simply different configurations of the same system.

  • Heavy-Ball (HB): The “degenerate” case. It places the zero at the origin:
    \(\gamma = 0\)
    While it builds effective speed, it is prone to complex-conjugate roots, which cause the model to oscillate or “wobble” as it approaches the minimum.

  • Nesterov (NAG): A more sophisticated “deadbeat” filter. It places the zero strategically relative to the pole using the formula:
    \(\gamma = \frac{\beta}{1+\beta}\)
    Combined with \(\alpha \eta \lambda = 1 + \beta\), this placement drives both eigenvalues of the error system to zero:
    \(z_1 = z_2 = 0\)

Section Insight: The Superiority of Nesterov in 1D
Nesterov can achieve a decay rate of zero, i.e., deadbeat behavior, matching perfectly tuned SGM while retaining filtering benefits.

Heavy-Ball is fundamentally limited. With the zero fixed at the origin, its best achievable decay rate (spectral radius) is

\[\rho = \sqrt{\beta}.\]

Thus, Nesterov enables more aggressive lead-compensation by dynamically placing the zero.
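A rough numerical comparison of the two placements, using the characteristic polynomial \(z^2 - (1+a_p)z + (a_p - k_p) = 0\). The mapping \(a_p = \beta - \bar{\alpha}\), \(k_p = -\bar{\alpha}(1-\gamma)\) with \(\bar{\alpha} = \alpha\eta\lambda\) is an assumption of this sketch. It follows from taking the update as \(w(t+1) = w(t) - \alpha v(t)\) with \(\mathbb{E}[g(t)] = \lambda m(t)\), and the source does not state it explicitly.

```python
import cmath
import math


def spectral_radius(alpha_bar: float, beta: float, gamma: float) -> float:
    """Largest |z| among the roots of z^2 - (1 + a_p) z + (a_p - k_p) = 0."""
    a_p = beta - alpha_bar            # assumed mapping, see lead-in
    k_p = -alpha_bar * (1.0 - gamma)  # assumed mapping, see lead-in
    trace, det = 1.0 + a_p, a_p - k_p
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))


beta = 0.81
target = math.sqrt(beta)
grid = [k / 1000.0 for k in range(1, 4000)]  # candidate alpha_bar values
best_hb = min(spectral_radius(a, beta, 0.0) for a in grid)    # Heavy-Ball, gamma = 0
nag = spectral_radius(1.0 + beta, beta, beta / (1.0 + beta))  # Nesterov deadbeat point
```

On this grid the Heavy-Ball spectral radius never drops below \(\sqrt{\beta}\), while the Nesterov deadbeat point evaluates to zero up to floating-point error.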


Takeaway: The Surprise—Acceleration Isn’t Always the Point

We often think momentum exists to accelerate optimization. But in a perfectly tuned 1D problem, plain SGM already achieves optimal speed with step size:

\[\alpha = \frac{1}{\lambda}\]

This produces deadbeat control, that is, convergence of the mean error in a single step.


So why use momentum?

Because real gradients are noisy.

Momentum is a lowpass filter that attenuates high-frequency noise.

Benefits:

  • Smooths updates by reducing gradient noise variance
  • Expands stability region by increasing the allowable range of \(\alpha\) before divergence.

The principal benefit of momentum is not acceleration alone, but dynamical regularization of the parameter update steps.

  1. Mean Convergence vs. Stability
    Contrary to popular belief, Nesterov is not inherently “faster” than a perfectly tuned Plain Stochastic Gradient Method (SGM) in 1D mean convergence. Plain SGM can reach zero error just as quickly. The true value of NAG lies in its stability region.

  2. Stability Expansion
    Momentum (especially NAG) creates a significantly larger “safe zone” for learning rates. While Plain SGM requires a hyper-specific learning rate to be fast, NAG remains stable across a much wider range.

  3. Oscillation Suppression
    Heavy-Ball’s lack of a sophisticated zero leads to oscillations. NAG’s configuration minimizes this by enforcing: \(z_1 = z_2 = 0\)

The modern view is that momentum’s primary benefit is dynamic regularization.

  1. Noise Attenuation (Spectral Smoothing)
    Acting as a lowpass filter, momentum preserves the low-frequency dataset structure while removing high-frequency noise.

  2. Energy Non-Expansiveness
    These filters satisfy: \(\mathbb{E}[v^2(t)] \le \mathbb{E}[g^2(t)]\)
    ensuring the update never amplifies noise.

  3. Trust-Region Compliance
    Momentum helps keep updates within a safe radius \(\mu\), preventing unstable jumps in parameter space.
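The noise-attenuation and non-expansiveness claims can be illustrated with a small Monte Carlo estimate. The sketch assumes zero-mean white Gaussian gradients, which is an input model chosen for illustration and not one taken from the source. The function name, sample size, and seed are arbitrary.

```python
import random


def filtered_power_ratio(beta: float, gamma: float,
                         n: int = 20000, seed: int = 0) -> float:
    """Estimate E[v^2] / E[g^2] when g is zero-mean white Gaussian noise (assumed model)."""
    rng = random.Random(seed)
    eta = (1.0 - beta) / (1.0 - gamma)
    q_prev, g_power, v_power = 0.0, 0.0, 0.0
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)
        q = beta * q_prev + g           # filter state
        v = eta * (q - gamma * q_prev)  # filtered gradient
        q_prev = q
        g_power += g * g
        v_power += v * v
    return v_power / g_power


ratio_hb = filtered_power_ratio(beta=0.9, gamma=0.0)     # Heavy-Ball placement
ratio_plain = filtered_power_ratio(beta=0.0, gamma=0.0)  # no filtering
```

With no filtering the ratio is exactly one. With the Heavy-Ball placement the estimate falls well below one, consistent with the filter removing most of the high-frequency noise power.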


Open Question

If momentum is just a first-order filter, what about:

  • Resonant systems
  • Nonlinear filters

👉 What unexplored filter designs could unlock the next generation of optimizers?


Reality Check

These insights are derived under simplifying assumptions:

  • 1D quadratic loss
  • Stationary gradients
  • Linear dynamics

Real deep learning systems are:

  • High-dimensional
  • Non-stationary
  • Heavy-tailed

Thus:

  • Structural roles (filtering + normalization) generalize
  • Optimal parameter values do not

Conclusion

By treating optimization as a signal processing task, we replace opaque momentum coefficients with interpretable pole-zero placements. This modular strategy—separating temporal lowpass filtering from spatial trust-region normalization—provides a rigorous foundation for designing large-scale stochastic optimizers.

Through this lens, stability and convergence are no longer emergent—they are engineered via the system’s characteristic polynomial.

  1. Polyak, B. T. (1969). The Conjugate Gradient Method in Extremal Problems. USSR Computational Mathematics and Mathematical Physics, 9(4), 94–112. https://doi.org/10.1016/0041-5553(69)90035-4



Page last modified: Dec 24 2025 at 12:00 AM.