Lowpass filtering versus Mean estimation via EMAs

How gradient smoothing, commonly called momentum, is not an EMA

Oluwasegun Somefun. “Momentum is not an EMA.” AutoSGM Framework, 2025.

Please cite this page if you use information from these notes for your work, research, or anything that requires academic or formal citation.


Table of contents
  1. Lowpass filtering versus Mean estimation via EMAs
    1. The Single‑Pole Low‑Pass Filter
      1. EMA vs. Typical Lowpass filtering Regimes
      2. Frequency‑Domain Representation
    2. Why a Zero Breaks EMA Behavior
      1. Non-monotone
    3. Takeaways
    4. References

It is tempting to conflate the general lowpass filtering of gradients, commonly called momentum, with a typical exponential moving average (EMA) of gradients, since both involve recursive exponential smoothing.

However, the two mechanisms are fundamentally different modes of the same single-pole filter. Thinking in signal‑processing terms makes the distinction clear.

An EMA is just an operating point of the single-pole lowpass filter, either as a long-term average (typically, \(0.9 < \beta \approx 1\)) or a short-term average (often, \(0 < \beta < 0.9\)):

\[H(z) = \eta\,\frac{1}{1 - \beta z^{-1}},\]

This is different from the general first-order lowpass filter (\(0 < \beta < 1,\ \gamma \ne 0\))

\[H(z) = \eta\,\frac{1 - \gamma z^{-1}}{1 - \beta z^{-1}},\]

whose numerator zero \(\gamma\) fundamentally alters the impulse response. Because the filter is no longer a pure single‑pole integrator, it cannot behave as an EMA for any non-zero choice of \(\gamma\). This first-order filter is the mechanics underlying what is commonly called momentum.


The Single‑Pole Low‑Pass Filter

Consider a first-order filter with no zero (\(\gamma=0\)), only a single pole \(\beta\). The filter reduces to the single‑pole IIR (infinite impulse response) recursion:

\[y_{t} = \beta y_{t-1} + (1-\beta) x_t,\]

where \(x_t\) is the input and \(y_t\) is the filtered output.

  • This is a causal linear filter with impulse response
    \(h[t] = (1-\beta)\beta^t, \quad t \geq 0,\) i.e. an exponentially decaying weighting of past inputs.

  • The update direction is therefore the cumulative contribution of past inputs, exponentially weighted by \(\lvert \beta \rvert < 1\).
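As a quick sanity check, the recursion and its convolution view agree numerically. A minimal NumPy sketch (the \(\beta\) value and input signal are illustrative):

```python
import numpy as np

# Sketch: check that the single-pole recursion y_t = beta*y_{t-1} + (1-beta)*x_t
# equals convolution of the input with h[t] = (1-beta)*beta^t.
rng = np.random.default_rng(0)
beta = 0.9
x = rng.standard_normal(50)

# Recursive form (zero initial state).
y_rec = np.empty_like(x)
y = 0.0
for t, xt in enumerate(x):
    y = beta * y + (1 - beta) * xt
    y_rec[t] = y

# Convolution form with the exponential impulse response.
h = (1 - beta) * beta ** np.arange(len(x))
y_conv = np.array([np.sum(h[: t + 1] * x[t::-1]) for t in range(len(x))])

assert np.allclose(y_rec, y_conv)
```

The weights \(h[k]\) also sum to \(1 - \beta^T\) over a window of length \(T\), approaching 1 as the window grows, which is the normalization property used below.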


EMA vs. Typical Lowpass filtering Regimes

  • EMA regime:
    The output behaves like a long time‑average statistic only if
    \(0.9 < \beta < 1\) and \(\gamma=0.\)
    In this high-\(\beta\) regime (extreme smoothing), the filter has a long memory and a large time-lag, and it functions to approximate statistical expectations, implicitly assuming ergodicity of the input. This is why, in practice, values like \(0.99, 0.999, 0.9999\) are effective even when the ergodic assumption does not strictly hold.
    The output behaves like a shorter time‑average statistic if \(0 < \beta \le 0.9\) (normal smoothing); in this range the filter is an extremely poor estimator of the expectation under ergodicity, but it tracks the input faster, owing to its shorter time-lag.

  • Typical Lowpass filtering regime:
    In this regime, we are not interested in estimating expected values, but merely in reducing the high-frequency noise content of a signal: the filter simply returns a low-pass filtered version of the input. This is typical momentum. Since we want smoothing, \(0 < \beta \le 0.9\) (normal smoothing); the value \(0.9\) has long been used (since before deep learning) as a good default when nothing is known about the frequency characteristics of the input signal.
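The two regimes can be illustrated on a noisy constant signal; in the sketch below (illustrative \(\beta\) values and synthetic data), the high-\(\beta\) output settles near the true mean while the moderate-\(\beta\) output retains much of the input's variation:

```python
import numpy as np

# Sketch: the same single-pole recursion in its two regimes.
# High beta ~ long-memory mean estimation; moderate beta ~ smoothing
# that still tracks the input. All values below are illustrative.
rng = np.random.default_rng(1)
x = 3.0 + rng.standard_normal(20000)  # noisy signal with true mean 3.0

def ema(x, beta):
    y, out = 0.0, np.empty_like(x)
    for t, xt in enumerate(x):
        y = beta * y + (1 - beta) * xt
        out[t] = y
    return out

y_mean = ema(x, 0.999)  # EMA regime: approximates E[x] under ergodicity
y_smooth = ema(x, 0.9)  # lowpass regime: smooths, but still follows x

# The high-beta output hugs the true mean far more tightly than the
# moderate-beta output does.
assert abs(y_mean[-1] - 3.0) < 0.2
assert np.std(y_mean[-5000:]) < np.std(y_smooth[-5000:])
```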


Frequency‑Domain Representation

The transfer function of the single-pole filter is

\[H(z) = (1-\beta)\,\frac{1}{1 - \beta z^{-1}},\]

evaluated on the unit circle \(z = e^{j\omega}\).

  • Magnitude response:
    Low frequencies (\(\omega\) near \(0\)) pass with gain near 1.
    Higher frequencies are attenuated, and the attenuation grows stronger as \(\beta \to 1\).
    With high \(\beta\), the filter acts as a very narrow-band low‑pass filter, and thus as an EMA.

  • Phase response:
    The filter introduces a delay (phase lag) that grows with \(\beta\).
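These frequency-domain claims are easy to verify by evaluating \(H(e^{j\omega})\) directly; a short sketch with illustrative \(\beta\) values:

```python
import numpy as np

# Sketch: evaluate H(z) = (1-beta)/(1 - beta z^{-1}) on the unit circle
# z = e^{j w} for a moderate and a high beta (illustrative values).
def freq_response(beta, w):
    z = np.exp(1j * w)
    return (1 - beta) / (1 - beta / z)

# DC gain is exactly 1 for any beta: the mean of the input passes unchanged.
assert np.isclose(abs(freq_response(0.9, 0.0)), 1.0)
assert np.isclose(abs(freq_response(0.999, 0.0)), 1.0)

# At a nonzero frequency, attenuation is stronger for the higher beta:
# the high-beta filter is a much narrower-band lowpass (the EMA regime).
assert abs(freq_response(0.999, 0.1)) < abs(freq_response(0.9, 0.1)) < 1.0

# Phase lag (-angle of H) at the same frequency also grows with beta.
assert -np.angle(freq_response(0.999, 0.1)) > -np.angle(freq_response(0.9, 0.1))
```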


Why a Zero Breaks EMA Behavior

For the filter

\[H(z)=\eta\,\frac{1-\gamma z^{-1}}{1-\beta z^{-1}}, \qquad 0<\beta<1,\]

taking the inverse \(z\)-transform of \(H(z)\) gives the filter’s unit impulse response as

\[\begin{equation} \label{eq:ImpulseResponse} h[t] = \eta \Bigl(\beta^t u[t] - \gamma \beta^{t-1} u[t-1]\Bigr), \end{equation}\]

where \(u[t]\) is the unit step signal that takes the value \(1\) for \(t \ge 0\) and \(0\) otherwise. Using the convolution property of linear, time-invariant systems, it immediately follows that the smoothed output is given by \(\begin{equation} \label{eq:Convolution} y_{t} = \sum_{k = 0}^t h[k]\,x_{t-k}. \end{equation}\)

The simplified impulse response becomes

\[h[0]=\eta,\qquad h[t]=\eta\,\beta^{t-1}(\beta-\gamma),\quad t\ge 1.\]

The filter is normalized if \(\sum_{k=0}^{\infty} h[k] = 1\). To see this, compute \(\sum_{k=0}^{\infty} h[k] = \eta + \eta(\beta-\gamma)\sum_{k=1}^{\infty}\beta^{k-1} = \eta\left(\frac{1-\gamma}{1-\beta}\right),\) which equals 1 only for the specific choice \(\eta = \frac{1-\beta}{1-\gamma}\).
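Both the closed-form weights and the normalizing gain can be checked against the difference equation \(y_t = \beta y_{t-1} + \eta(x_t - \gamma x_{t-1})\) implied by \(H(z)\); a small sketch with illustrative \(\beta, \gamma\):

```python
import numpy as np

# Sketch: verify the closed-form impulse response of
#   H(z) = eta (1 - gamma z^{-1}) / (1 - beta z^{-1}),
# i.e. h[0] = eta, h[t] = eta * beta^(t-1) * (beta - gamma) for t >= 1,
# and that eta = (1 - beta)/(1 - gamma) makes the weights sum to 1.
# beta and gamma below are illustrative, with gamma != 0.
beta, gamma = 0.9, 0.5
eta = (1 - beta) / (1 - gamma)

T = 200
h = np.empty(T)
h[0] = eta
h[1:] = eta * beta ** np.arange(T - 1) * (beta - gamma)

# Impulse response computed directly from the difference equation
# y_t = beta * y_{t-1} + eta * (x_t - gamma * x_{t-1}).
x = np.zeros(T)
x[0] = 1.0
y = np.zeros(T)
for t in range(T):
    y[t] = beta * (y[t - 1] if t > 0 else 0.0) \
         + eta * (x[t] - gamma * (x[t - 1] if t > 0 else 0.0))

assert np.allclose(y, h)         # closed form matches the recursion
assert np.isclose(h.sum(), 1.0)  # normalized by eta = (1-beta)/(1-gamma)
```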

This factorization makes the key point immediate:

  • If \(\gamma < \beta\) (with \(\gamma \ne 0\)), the weights are positive but do not follow a single exponential decay, so the filter cannot represent an exponential average.
  • If \(\gamma > \beta\), then \(h[k] < 0\) for all \(k\ge 1\), so the filter assigns negative weights, which is impossible for an EMA.
  • If \(\gamma = \beta\), the zero cancels the pole and the filter reduces to a pure gain \(\eta\), no longer integrating past inputs, again incompatible with EMA behavior.
  • Only when \(\gamma = 0\) does the zero vanish and the filter reduce to the pure single-pole form required for an EMA.

Therefore, any numerator zero other than \(\gamma = 0\) destroys the positive, monotone, normalized, decaying impulse response structure required for an EMA. A first-order low-pass filter with a zero is therefore never an EMA unless the zero is removed.

Non-monotone

Recall that the typical single-pole version has geometrically decaying weights \(h[k] = (1-\beta)\beta^k,\) which satisfy \(\frac{h[k]}{h[k-1]} = \beta < 1\), for \(k>0\).

For the complete first-order filter, when \(\gamma < \beta\), the factor \((\beta-\gamma)\) is positive, so the weights \(h[k]\), \(k \ge 1\), are positive. When \(\gamma > \beta\), the factor \((\beta-\gamma)\) is negative, so those weights are negative.

To see that this sequence is not exponentially monotone, note that the initial ratio \(\frac{h[1]}{h[0]} = \beta - \gamma\) is not necessarily a decay factor like \(\beta\), inducing a structural jump at \(k=1\); geometric decay with ratio \(\frac{h[k]}{h[k-1]} = \beta\) only sets in for \(k\ge 2\). So this weighting sequence does not possess a pure exponential decay shape. Because an EMA requires a single, consistent exponential decay ratio, this sequence cannot represent an estimator of a statistical average, even though the weights are positive and eventually monotone.
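The ratio structure is easy to confirm numerically; in this sketch (illustrative parameter values), the first ratio is \(\beta-\gamma\), all later ratios are \(\beta\), and choosing \(\gamma > \beta\) flips the sign of every weight after \(h[0]\):

```python
import numpy as np

# Sketch: the weight-ratio structure of the zero-pole filter, using the
# closed form h[0] = eta, h[k] = eta * beta^(k-1) * (beta - gamma), k >= 1.
def weights(beta, gamma, eta=1.0, T=10):
    h = np.empty(T)
    h[0] = eta
    h[1:] = eta * beta ** np.arange(T - 1) * (beta - gamma)
    return h

beta = 0.9
h = weights(beta, gamma=0.5)
ratios = h[1:] / h[:-1]
assert np.isclose(ratios[0], beta - 0.5)  # structural jump at k = 1
assert np.allclose(ratios[1:], beta)      # geometric decay only afterwards

# gamma > beta makes every weight after h[0] negative.
h_neg = weights(beta, gamma=0.95)
assert np.all(h_neg[1:] < 0)
```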

In particular, the EMA is the unique linear estimator of a mean whose weights form a normalized, non‑negative, pure exponential sequence. This structure is forced by the requirements of exponential forgetting, minimum‑variance estimation under i.i.d. noise, and the Markov sufficiency of the one‑step recursion. Any deviation from this exponential form, such as the introduction of a numerator zero in the filter structure, breaks the probabilistic interpretation entirely. A time average such as an EMA is therefore a specific statistical estimator of the mean whose weighting structure is uniquely determined by probability theory.


Takeaways

A single-pole lowpass filter’s interpretation as an EMA (short-term or long-term) depends on its pole location (the value of \(\beta\)) and the intent behind its use:

  • High \(\beta\) regime:
    The pole of the filter is very close to the unit circle (the edge of stability, often called marginal stability).
    • The filter has long memory and effectively integrates over a large window of past inputs.
    • This makes the output track only the slowly‑varying mean of the input (under ergodic assumptions).
    • High‑frequency content (rapid changes, oscillations, transient spikes) is heavily attenuated, i.e. a lot of information is lost.
    • In estimation, this is the EMA regime: the output is treated as a proxy for a statistical expectation.
  • Low–Moderate \(\beta\) regime:
    The pole is further inside the unit circle.
    • The filter has shorter memory and responds more directly to the current input.
    • The output is a smoothed version of the full input signal, not its mean estimate (under ergodic assumptions).
    • High‑frequency noise is reduced, but the underlying variations are still preserved.
    • In stochastic gradient learning, this is the momentum regime: the filter smooths and shapes the input trajectory rather than estimating a mean value. We call this lowpass regularization.

Momentum is not an EMA. Conflating the two misses the point: it is the same lowpass filter operating in a different regime, with a different intent, namely lowpass smoothing rather than mean estimation.

EMA operations are achieved via a single-pole lowpass filter, but general smoothing operations can be achieved with anything from simple to more complicated filter configurations.


References

Given a normalized EMA \(y_{t} = \beta y_{t-1} + (1-\beta) x_t\), or its un-normalized version \(y_{t} = \beta y_{t-1} + x_t\), the probabilistic foundations of this type of update, including its

  • uniqueness as a linear estimator with normalized, non‑negative, exponentially decaying weights;
  • minimum‑variance estimator property as \(\beta \to 1\) under i.i.d. noise; and
  • representation as a one‑step Markov‑sufficient recursion;

are established in the classical literature on adaptive filtering, time‑series analysis, stochastic approximation, and Kalman filtering. Works that provide standard derivations include:

  • Winters, P. R. (1960). Forecasting Sales by Exponentially Weighted Moving Averages. Management Science, 6(3), 324–342.
    Establishes exponential smoothing as the unique minimum‑variance linear estimator under i.i.d. noise with exponential discounting. This is the earliest formal derivation of the EMA as a statistical estimator, not a filter.

  • Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley.
    Provides the canonical derivation of exponential smoothing as the unique estimator whose weights form a normalized exponential sequence, and connects the EMA to ARIMA(0,1,1) structure.

  • Kalman, R. E. (1960). A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1), 35–45.
    Shows that the scalar Kalman filter with constant gain reduces exactly to the EMA recursion, establishing its minimum‑variance and Markov‑sufficient properties. This provides the origin of recursive least squares (RLS) in adaptive filtering.

  • Robbins, H., & Monro, S. (1951). A Stochastic Approximation Method. Annals of Mathematical Statistics, 22(3), 400–407.
    Demonstrates that the EMA arises as the unique stable linear estimator under constant‑step stochastic approximation.

  • Jazwinski, A. H. (1970). Stochastic Processes and Filtering Theory. Academic Press.
    The earliest rigorous derivation of constant‑gain recursive estimators from probability theory, establishing exponential forgetting, minimum‑variance estimation, and one‑step Markov sufficiency.

  • Tsypkin, Ya. Z. (1971). Adaptation and Learning in Automatic Systems. Academic Press.
    Provides one of the earliest unified treatments of adaptive estimation, exponential forgetting, and constant‑gain recursive filters, establishing the probabilistic structure underlying EMA‑type estimators.

  • Ljung, L., & Söderström, T. (1983). Theory and Practice of Recursive Identification. MIT Press.
    Derives the EMA as the unique solution to an exponentially weighted least‑squares estimation and connects it to recursive identification theory.

Particularly, the EMA can be viewed as

  • scalar Kalman filter with constant gain
  • constant‑step Robbins–Monro estimator
  • a unique exponentially‑weighted least‑squares estimator
  1. Kalman, R. E. (1960) shows that the scalar Kalman filter with constant gain \(K = 1 - \beta\) produces \(y_{t} = \beta y_{t-1} + (1-\beta) x_t\), and that this recursion is the optimal minimum‑variance estimator under Gaussian i.i.d. noise and exponential prior decay. This is the most rigorous probabilistic justification of the EMA.

  2. Robbins, H., & Monro, S. (1951) show that the update \(y_{t} = y_{t-1} + \alpha_t(x_t - y_{t-1})\) reduces to the EMA when \(\alpha_t\) is constant.

  3. Tsypkin, Ya. Z. (1971) and Ljung, L., & Söderström, T. (1983) show that, for estimating a constant mean, minimizing \(\sum_{k = 0}^\infty \beta^k\,(x_{t-k} - y)^2\) over \(y\) yields the EMA, and that this is the only estimator consistent with exponential forgetting and recursive sufficiency.
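This exponentially weighted least-squares view can be checked numerically: the closed-form minimizer equals the (truncated) EMA. A small sketch with an illustrative \(\beta\) and synthetic data:

```python
import numpy as np

# Sketch: minimizing sum_k beta^k (x_{t-k} - y)^2 over y gives the
# exponentially weighted average y* = sum_k beta^k x_{t-k} / sum_k beta^k,
# whose weights are the (truncated) normalized EMA weights (1-beta) beta^k.
rng = np.random.default_rng(2)
beta = 0.9
T = 500                     # long enough that truncation error is negligible
x = rng.standard_normal(T)  # x[k] plays the role of x_{t-k}; x[0] is newest

w = beta ** np.arange(T)
y_star = np.sum(w * x) / np.sum(w)  # closed-form least-squares minimizer

# Run the EMA recursion over the same window, oldest sample first,
# so the most recent sample x[0] gets weight (1 - beta).
y_ema = 0.0
for xt in x[::-1]:
    y_ema = beta * y_ema + (1 - beta) * xt

assert np.isclose(y_star, y_ema)
```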

Together, these works show that as an EMA, the single pole lowpass filter is not an arbitrary smoother. As an EMA, it is the unique linear estimator of a mean whose weighting structure is fixed by probability theory: normalized, non‑negative, and purely exponential. Any deviation from this structure via a general filter structure, such as introducing a numerator zero into the transfer-function, breaks the probabilistic interpretation entirely.