Memory as Polynomial Projection: The Mathematics of Long-Context Predictive Modeling
Serendeep Rudraraju

The most-cited result in the SSM-vs-Transformer debate is Repeat After Me [1], which proves that state-space models with fixed-size hidden state cannot copy strings of unbounded length. Two-layer Transformers, by contrast, can copy strings whose length is exponential in their parameter count. This is mathematically tight, empirically reproduced, and reposted on Twitter every six weeks.
It is also irrelevant to the problem most state-space models were designed to solve.
I've spent the last year reading SSM papers like someone watching two different sports through the same window. There is the long-context-as-retrieval game (needle in a haystack, multi-hop tracing, copy-paste over a million tokens), where Transformers and hybrids are clearly winning. And there is the long-context-as-continuous-prediction game (predict the next tactile reading given a minute of vibration data; forecast a financial regime given a year of multivariate ticks), where state-space models are quietly running up the score. Hierarchical state-space models [2] beat causal Transformers, LSTMs, S4, and Mamba by at least 23% on MSE across six real-world sensor datasets. Almost nobody is writing about that, because the benchmark isn't a chatbot.
This post follows one mathematical idea, projecting a signal onto a polynomial basis under a chosen measure, from a 2020 functional-analysis result through to Mamba-3 at ICLR 2026 and HiSS for sensor prediction, and argues that conflating the two long-context problems is why the discourse keeps drawing the wrong conclusions.
TL;DR
"Long context" is two problems. For exact recall of discrete tokens, attention wins by a theorem [1]. For bounded sufficient statistics of a continuous signal across temporal scales, state-space models win by construction. The mathematical idea unifying the SSM tradition, HiPPO [4], is the optimal projection of a continuous signal's history onto a polynomial basis under a chosen measure. Everything from S4 [5] through Mamba-3 [9] is an engineering refinement of that one idea, and HiSS [2] stacks it hierarchically across temporal resolutions. The 2026 production architecture is hybrid because the correct answer was never "pick one."
Two problems wearing the same costume
The long-context discourse has collapsed two structurally different problems into one benchmark genre. Naming them separately clears most of the architectural debate.
Long-context-for-retrieval. The objective is exact recall of a finite set of tokens from a long history. The benchmarks are RULER's needle-in-a-haystack, multi-hop tracing, aggregation, and variable tracking [3]. The cost function is binary: did the model pull the right needle. The information-theoretic shape requires storage proportional to what you must recall: if there are $k$ needles and each could be anywhere in $n$ tokens, you need roughly $O(k \log n)$ bits of state to localize them at retrieval time.
Long-context-for-prediction. The objective is the next continuous-valued step given a long history. The benchmarks are sensor forecasting [2], time-series prediction [20], financial regime-switching forecasts, biomedical signals. The cost function is MSE or NLL over a continuous distribution. The information-theoretic shape is forgiving: you don't need to remember the past exactly, only retain sufficient statistics, the smallest representation of the past that preserves the predictive distribution of the future. For a Gaussian process that's mean plus covariance. For an SSM it's the projection coefficients onto a chosen basis.
These have different complexity floors. Exact recall has a hard storage lower bound. Continuous prediction tolerates lossy compression as long as the compressed past stays predictively sufficient.
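To make "sufficient statistics" concrete, here is a minimal sketch under the simplest possible assumption: the readings are i.i.d. Gaussian. That assumption is mine, chosen for illustration, and is not a setting from any of the cited papers; the class name `RunningGaussian` is hypothetical. The state is three numbers no matter how long the history gets, yet it keeps everything needed for the predictive distribution of the next reading.

```python
class RunningGaussian:
    """Constant-size summary of a stream: count, mean, sum of squared deviations.

    Illustrative assumption: readings are i.i.d. Gaussian, so (mean, variance)
    is all that is needed to predict the next reading.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's online update: O(1) time and memory per sample.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Unbiased sample variance; reported as 0.0 until two samples are seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningGaussian()
for reading in [0.5, 1.5, 2.0, 4.0, -1.0]:
    stats.update(reading)
print(stats.n, stats.mean, stats.variance())
```

Exact recall has no analogue of this: there is no three-number summary from which you can reproduce the fifth reading verbatim.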
| Problem | Best architecture | Reason |
|---|---|---|
| Needle-in-haystack (single needle) | SSM or hybrid | Both pass; SSM cheaper per token |
| RULER multi-hop tracing, aggregation | Transformer or attention-heavy hybrid | SSMs degrade as length grows [3] |
| Phonebook lookup, exact copy | Transformer | Repeat-After-Me theorem [1] |
| Long-context summarization | Hybrid (Jamba-style) [10] | Mix of exact recall and bounded statistics |
| Continuous sensor prediction (tactile, IMU) | HiSS [2] | Multi-resolution physical processes match the hierarchy |
| Time-series forecasting with regime switches | Hybrid SSM or attention-based foundation model | Actively contested |
| Long-Range Arena Path-X (length 16K) | S4 / S5 | Transformers score near random [5][6] |
| Streaming inference at long context | Mamba-3 [9] | Constant time and memory per emitted token |
The benchmark you reach for silently encodes which problem you think you're solving. If you only have a hammer, every problem looks like NIAH.
HiPPO, and the polynomial that ate signal processing
The mathematical question behind all of this is older than transformers: given a continuous input signal $f(t)$, what is the optimal way to compress its history into a finite-dimensional state $c(t) \in \mathbb{R}^N$, such that the compression is updated online and approximates $f$ as well as possible under a chosen importance measure?
HiPPO [4] gives a closed-form answer. Pick a measure $\mu$ over the past (this is where you encode what "remembering" means) and a basis of orthogonal polynomials $\{g_n\}$ under that measure. At every time $t$, the best approximation of $f$'s history by a polynomial of degree less than $N$ is the unique projection
$$f(s) \approx \sum_{n=0}^{N-1} c_n(t)\, g_n(s), \quad s \in (-\infty, t]$$
The coefficients $c(t)$ that minimize the $L^2(\mu)$ error evolve according to a linear ODE
$$\dot{c}(t) = A\, c(t) + B\, f(t)$$
where $A$ and $B$ depend only on the choice of basis and measure. This is a theorem, not a heuristic. Different measures give different $A$. The most useful one, the LegS (scaled Legendre) measure, weights the entire history uniformly over a window $[0, t]$ that stretches as $t$ grows: "remember everything, with no preferred timescale." For LegS the right-hand side also picks up a $1/t$ factor. Its $A$ matrix has a closed form that S4 inherits almost verbatim.
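As a sanity check on what "projection onto a polynomial basis" buys you, here is a small batch sketch, assuming a uniform measure on $[0, t]$ and unnormalized Legendre polynomials (HiPPO uses a normalized basis and, crucially, updates the coefficients online through the ODE rather than recomputing them). The function names `legendre_projection` and `reconstruct` are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_projection(f, t, N, quad_points=64):
    """Coefficients of the best L2 approximation of f on [0, t] by a
    polynomial of degree < N, under the uniform measure (batch, not online)."""
    x, w = L.leggauss(quad_points)        # Gauss-Legendre nodes/weights on [-1, 1]
    s = 0.5 * t * (x + 1.0)               # map nodes to [0, t]
    fs = f(s)
    coeffs = np.empty(N)
    for n in range(N):
        P_n = L.legval(x, [0] * n + [1])  # n-th Legendre polynomial at the nodes
        # <f, P_n> / <P_n, P_n>, using <P_n, P_n> = 2 / (2n + 1) on [-1, 1]
        coeffs[n] = 0.5 * (2 * n + 1) * np.sum(w * fs * P_n)
    return coeffs

def reconstruct(coeffs, s, t):
    """Evaluate the projected polynomial at times s in [0, t]."""
    return L.legval(2.0 * s / t - 1.0, coeffs)

if __name__ == "__main__":
    f = lambda s: np.sin(2 * np.pi * s)
    s = np.linspace(0.0, 1.0, 201)
    for N in (4, 8, 16):
        c = legendre_projection(f, 1.0, N)
        print(N, np.max(np.abs(reconstruct(c, s, 1.0) - f(s))))
```

The reconstruction error falls quickly as $N$ grows for a smooth signal, which is the whole bet: a handful of coefficients stands in for the entire history.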
The structure of the LegS $A$ matrix is what makes it tractable:
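```python
import numpy as np

def hippo_legs_A(N):
    """HiPPO-LegS state matrix: lower-triangular, diagonal entries -(n + 1)."""
    n = np.arange(N)
    # Off-diagonal magnitude sqrt((2n + 1)(2k + 1)) for row n, column k.
    A = -np.sqrt((2 * n[:, None] + 1) * (2 * n[None, :] + 1))
    # Keep the strictly lower triangle; the LegS diagonal is -(n + 1).
    A = np.tril(A, k=-1) - np.diag(n + 1)
    return A
```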
The numerical analysis of the resulting continuous-time ODE has its own paper [16] — the LegS ODE is singular at $t = 0$, but the well-posedness can be proved rigorously, which matters more than it should given how many later results depend on this object behaving well.
To go from continuous to discrete, you pick a discretization rule. Zero-order hold, bilinear (Tustin), exponential-Euler: each trades fidelity for compute. The discrete-time recurrence under zero-order hold is
$$c_{t+1} = \bar{A}\, c_t + \bar{B}\, f_t, \quad \bar{A} = e^{\Delta A}, \quad \bar{B} = (\bar{A} - I)\, A^{-1} B$$
which, if you squint, is the linear RNN of an introductory deep learning course — except $A$ has been chosen by functional analysis instead of by random initialization plus prayer.
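The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's implementation: `expm_taylor`, `discretize`, and `run_recurrence` are hypothetical helper names, the matrix exponential is a plain scaling-and-squaring Taylor series (fine for small, well-scaled matrices; in practice you would likely reach for a library routine), and the $3 \times 3$ matrix is a made-up stable lower-triangular stand-in for a HiPPO matrix.

```python
import numpy as np

def expm_taylor(M, terms=20, squarings=10):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    X = M / (2 ** squarings)
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for i in range(1, terms + 1):
        term = term @ X / i
        E = E + term
    for _ in range(squarings):
        E = E @ E
    return E

def discretize(A, B, dt):
    """A_bar = exp(dt * A), B_bar = (A_bar - I) A^{-1} B."""
    A_bar = expm_taylor(dt * A)
    B_bar = (A_bar - np.eye(A.shape[0])) @ np.linalg.solve(A, B)
    return A_bar, B_bar

def run_recurrence(A_bar, B_bar, f):
    """c_{t+1} = A_bar c_t + B_bar f_t, starting from c_0 = 0."""
    c = np.zeros((A_bar.shape[0], 1))
    for f_t in f:
        c = A_bar @ c + B_bar * f_t
    return c

# Hypothetical stable lower-triangular system, for illustration only.
A = np.array([[-1.0, 0.0, 0.0], [-1.7, -2.0, 0.0], [-2.2, -3.9, -3.0]])
B = np.array([[1.0], [1.7], [2.2]])
A_bar, B_bar = discretize(A, B, dt=0.1)
c_final = run_recurrence(A_bar, B_bar, np.ones(2000))
print(np.allclose(c_final, -np.linalg.solve(A, B)))
```

The final check uses the fact that this rule is exact whenever the input is constant across each step: fed a constant input, the discrete state settles on the continuous system's steady state $-A^{-1}B$.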
The Legendre Memory Unit [15] showed in 2019 that this trick can handle dependencies across 100,000 time steps with a tiny number of internal state variables. That paper got polite citations and not much else. The thing it foreshadowed, that polynomials (the original neural network) were unreasonably good at long-range memory, only landed when S4 walked through the same door three years later.
The family tree: S4 → S5 → Mamba → Mamba-2 → Mamba-3
Each step in the lineage refines the same idea, HiPPO's projection, for a different compute regime. The naming is confusing on purpose. The through-line is not.
S4 (Gu, Goel, Ré, 2022) [5] took HiPPO's continuous ODE, discretized it (with the bilinear transform in the original paper), and reduced the computation of the discrete-time impulse response to evaluating a Cauchy kernel, exploiting the diagonal-plus-low-rank structure of $A$. The convolutional view lets you train the model as a long causal convolution; the recurrent view lets you run it autoregressively at inference. The same operator, two compute regimes. S4 was the first architecture to clear the Long-Range Arena Path-X task at length 16,000, which every prior model (Transformers included) had scored at random.
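The "same operator, two compute regimes" claim is easy to check numerically for any time-invariant SSM. The sketch below is a naive illustration with hypothetical function names: it builds the impulse response with a loop and applies it with `np.convolve`, whereas S4 gets the kernel through its structured factorization and applies it with an FFT.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, T):
    """Impulse response K[t] = C @ A_bar^t @ B_bar for t = 0..T-1 (naive loop)."""
    K = np.empty(T)
    v = B_bar.copy()
    for t in range(T):
        K[t] = C @ v
        v = A_bar @ v
    return K

def run_convolutional(K, u):
    """Training-style view: causal convolution y[t] = sum_{s<=t} K[t-s] u[s]."""
    return np.convolve(u, K)[: len(u)]

def run_recurrent(A_bar, B_bar, C, u):
    """Inference-style view: h_t = A_bar h_{t-1} + B_bar u_t, y_t = C h_t."""
    h = np.zeros_like(B_bar)
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        h = A_bar @ h + B_bar * u_t
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
N, T = 4, 32
A_bar = rng.normal(size=(N, N))
A_bar *= 0.9 / np.linalg.norm(A_bar, 2)   # keep the recurrence stable
B_bar, C, u = rng.normal(size=N), rng.normal(size=N), rng.normal(size=T)
K = ssm_kernel(A_bar, B_bar, C, T)
print(np.allclose(run_convolutional(K, u), run_recurrent(A_bar, B_bar, C, u)))
```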
S5 (Smith, Warrington, Linderman, 2022) [6] replaced S4's bank of single-input-single-output channels with one multi-input-multi-output SSM. It replaced the convolution view with a parallel scan. It diagonalized the state matrix, which S4D had already shown was a benign simplification. The result was an 87.2% LRA average and a much simpler implementation. The math is the same; the engineering is cleaner.
Mamba (Gu and Dao, 2023) [7] broke the most sacred property of the lineage so far: time-invariance. In S4 and S5, $A$, $B$, $C$ are fixed parameters; the model treats every input the same way. Mamba makes $B$, $C$, and the step size $\Delta$ depend on the input. The model can now selectively remember or forget. The cost is real: the FFT-based convolution trick disappears because the operator is no longer LTI. The gain is also real: a hardware-aware selective scan in custom CUDA, 5× higher inference throughput than comparable Transformers, and the ability to ignore irrelevant tokens instead of compressing them with equal weight.
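A reference-style sketch of what selectivity changes, written as a slow NumPy loop rather than the fused CUDA scan. The shapes, the projection names (`W_B`, `W_C`, `W_delta`, `b_delta`), and the simplified $\bar{B} = \Delta B$ step are assumptions for illustration; treat this as a reading aid, not Mamba's actual code.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def selective_scan(x, A, W_B, W_C, W_delta, b_delta):
    """
    x: (T, D) inputs. A: (D, N) negative reals, one diagonal state matrix per channel.
    W_B, W_C: (D, N) projections producing B_t and C_t from the current input.
    W_delta: (D, D), b_delta: (D,) produce the per-channel step size.
    """
    T, D = x.shape
    h = np.zeros_like(A)                                # state, (D, N)
    y = np.empty((T, D))
    for t in range(T):
        delta = softplus(x[t] @ W_delta + b_delta)      # (D,), input-dependent step
        B_t = x[t] @ W_B                                # (N,), input-dependent
        C_t = x[t] @ W_C                                # (N,), input-dependent
        A_bar = np.exp(delta[:, None] * A)              # decay in (0, 1)
        B_bar = delta[:, None] * B_t[None, :]           # simplified first-order step
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C_t
    return y

rng = np.random.default_rng(0)
T, D, N = 10, 3, 4
x = rng.normal(size=(T, D))
A = -np.exp(rng.normal(size=(D, N)))
W_B, W_C = rng.normal(size=(D, N)), rng.normal(size=(D, N))
y = selective_scan(x, A, W_B, W_C, rng.normal(size=(D, D)), np.zeros(D))
print(y.shape)
```

The step size is the gate: as $\Delta \to 0$ the state is carried forward untouched and the current token is ignored, while a large $\Delta$ decays the old state and writes the new input. Because $\Delta$, $B$, and $C$ change per token, there is no single convolution kernel left to precompute.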
Mamba-2 (Dao and Gu, ICML 2024) [8] is the one I keep rereading. The Structured State Space Duality (SSD) result proves that an SSM with $A = \alpha I$ (a scalar times the identity) is equivalent to a masked linear attention with a 1-semiseparable causal mask. SSMs and Transformers are different decompositions of the same token-mixing matrix. The linear form (recurrence) is what you use for inference; the quadratic form (matmul) is what you use for training. Mamba-2 runs 2-8× faster than Mamba-1 by computing the operator via block-decomposition of this semiseparable matrix, which is matmul-heavy and hardware-friendly. We spent a year of architectural debate over which paradigm wins. The chairs had been rearranged.
Mamba-3 (ICLR 2026) [9] ships three changes, all mathematical:
- Complex-valued state spaces. Real-valued linear systems are provably incapable of certain state-tracking tasks like parity and modular arithmetic at fixed depth. Complex eigenvalues recover these capabilities at no asymptotic cost.
- Exponential-trapezoidal discretization. Mamba-1 and Mamba-2 used a first-order exponential-Euler step. Mamba-3 uses a second-order accurate exponential-trapezoidal rule, which preserves more of the continuous-time dynamics at the same parameter count.
- MIMO formulation revisited. Improves inference-time hardware utilization. The post-training era is inference-heavy [18]; the architecture is being engineered for that.
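The first bullet is the least intuitive, so here is a toy illustration of why rotation helps. It is a sketch of the general idea only, with a hypothetical function name, and not Mamba-3's parameterization: a one-dimensional complex state whose input-dependent transition is a unit-modulus rotation counts modulo $m$ exactly, whereas a real decay factor in $(0, 1)$, which is what $e^{\Delta A}$ with negative real $A$ produces, can only shrink the state.

```python
import cmath
import math

def modular_count(bits, m):
    """Count the ones in `bits` modulo m using a single complex state."""
    step = cmath.exp(2j * math.pi / m)   # rotation by one m-th of a turn
    h = 1 + 0j
    for b in bits:
        a_t = step if b else 1.0         # input-dependent transition, |a_t| = 1
        h = a_t * h
    # Read the count back off the phase of the state.
    return round(cmath.phase(h) / (2 * math.pi / m)) % m

print(modular_count([1, 1, 1], 2))        # parity of three ones
print(modular_count([1, 0, 1, 1, 1], 3))  # four ones, counted mod 3
```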
The result is a model with half the state size of Mamba-2 at comparable perplexity. Mamba-2 had grown the state relative to Mamba-1, because SSD made larger states cheap to train; Mamba-3 shrinks it again. Shipping a successor with less memory is one of the more counterintuitive product roadmaps in machine learning, and it follows from a clear-eyed read of where the inference-cost curve has bent.
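To see what "second-order accurate" means in practice, here is a scalar sketch comparing a one-point rule with a trapezoidal rule for the input term, both using the exact exponential decay for the state. This is textbook numerical integration under my own choice of test problem ($\dot{x} = -x + \sin t$), and the function names are hypothetical; it is not Mamba-3's exact parameterization.

```python
import math

def integrate(a, b, u, T, dt, rule):
    """Integrate x' = a x + b u(t) from x(0) = 0 up to time T with step dt."""
    steps = round(T / dt)
    decay = math.exp(a * dt)              # exact homogeneous decay over one step
    x = 0.0
    for k in range(steps):
        t0, t1 = k * dt, (k + 1) * dt
        if rule == "euler":
            # One-point (right-endpoint) rule for the input integral: first order.
            x = decay * x + dt * b * u(t1)
        elif rule == "trapezoidal":
            # Average both endpoints, decaying the left one across the step: second order.
            x = decay * x + 0.5 * dt * b * (decay * u(t0) + u(t1))
        else:
            raise ValueError(rule)
    return x

def exact(T):
    """Closed-form solution for a = -1, b = 1, u = sin, x(0) = 0."""
    return 0.5 * (math.sin(T) - math.cos(T) + math.exp(-T))

for rule in ("euler", "trapezoidal"):
    errs = [abs(integrate(-1.0, 1.0, math.sin, 1.0, dt, rule) - exact(1.0))
            for dt in (0.01, 0.005)]
    print(rule, errs, errs[0] / errs[1])
```

Halving the step roughly halves the error of the one-point rule and roughly quarters the error of the trapezoidal rule, at the cost of one extra input evaluation per step.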
| Model | Year | Key innovation | State type | Selectivity | Compute model | Killer result |
|---|---|---|---|---|---|---|
| HiPPO | 2020 | Polynomial projection theory | Real, lower-triangular (LegS) | No | — | Theoretical framework [4] |
| LMU | 2019 | ODE-derived recurrent memory | Real | No | RNN | 100K+ step memory [15] |
| S4 | 2022 | Structured $A$, closed-form impulse response | Real, DPLR | No | Long convolution / RNN | First to clear LRA Path-X [5] |
| S4D | 2022 | Diagonal simplification | Real, diagonal | No | Conv / RNN | Same spectrum as LegS |
| S5 | 2022 | MIMO + parallel scan | Real, diagonal | No | Parallel scan | 87.2% LRA average [6] |
| Mamba (S6) | 2023 | Input-dependent $B$, $C$, $\Delta$ | Real, diagonal | Yes | Selective scan (custom CUDA) | 5× throughput vs Transformer [7] |
| Mamba-2 | 2024 | Structured State Space Duality | Real, scalar $\times I$ | Yes | Block matmul via 1-semiseparable | 2-8× faster than Mamba-1 [8] |
| HiSS | 2024 | Two-level temporal hierarchy | Inherits base SSM | Inherits base | Two stacked SSMs | At least 23% lower MSE on sensor prediction [2] |
| Mamba-3 | 2026 | Complex states, exp-trapezoidal, MIMO | Complex | Yes | Refined for inference hardware | Half state size at parity [9] |
| Jamba | 2024 | Interleaved Mamba + attention + MoE | Mixed | Mixed | Mixed | 256K effective context [10] |
| Taipan | 2024 | Mamba-2 + selective attention | Mixed | Mixed | Mixed | Accurate to 1M tokens [11] |
The naming convention deserves one note. S4 to S5 to S6 was renamed Mamba because "S6" sounds like a midrange Audi. The rest of the field gave up on numbered nomenclature and started naming everything after either snakes or African capitals.
The Structured State Space Duality, or: how we spent a year on notation
The SSD result deserves its own beat because it reframed the entire debate.
A semiseparable matrix is one whose lower-triangular blocks have low rank — specifically, every contiguous submatrix below the diagonal has rank at most $k$ for some small $k$. These matrices have been studied in numerical linear algebra since the 1990s; they are the discrete-time generalization of rank-structured operators. Most "linear time" sequence-modeling algorithms turn out to be renamed variants of techniques you can find in textbooks on hierarchical and rank-structured matrices.
The SSD result [8] is that an SSM with $A = \alpha I$ produces, when unrolled across a sequence, the matrix $M = L \circ (C B^\top)$, where the decay mask $L$ is exactly 1-semiseparable. Masked linear attention with a structured causal mask whose entries follow a multiplicative decay computes that same matrix. The two operators are different ways of computing the same underlying matrix.
This gives you two algorithms for free:
- Linear form (recurrence): compute the output sequentially, $O(T)$ in the sequence length $T$, with constant memory per step. Use this for inference.
- Quadratic form (matmul): materialize the full $T \times T$ operator and apply it via dense matrix multiplication. Use this for training, where matmuls are the GPU's preferred unit of work.
Same operator. Two compute regimes. Different hardware bottlenecks. Mamba-2's SSD algorithm combines them: it splits the sequence into chunks, uses the quadratic form inside each chunk, and carries state between chunks with the recurrence.
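The duality is small enough to verify directly. The sketch below is a single-channel illustration with hypothetical function names, writing $T$ for the sequence length and $N$ for the state size, and assuming per-step decay scalars $a_t \in (0, 1)$; it computes the same output once by recurrence and once by materializing the masked matrix.

```python
import numpy as np

def decay_mask(a):
    """L[t, s] = a[s+1] * ... * a[t] for s <= t, zero above the diagonal."""
    log_cum = np.cumsum(np.log(a))
    return np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))

def ssm_linear(a, B, C, x):
    """Recurrent form: h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_quadratic(a, B, C, x):
    """Attention-like form: y = (L * (C B^T)) x, with L the decay mask."""
    M = decay_mask(a) * (C @ B.T)
    return M @ x

rng = np.random.default_rng(0)
T, N = 16, 4
a = rng.uniform(0.5, 1.0, size=T)
B, C, x = rng.normal(size=(T, N)), rng.normal(size=(T, N)), rng.normal(size=T)
print(np.allclose(ssm_linear(a, B, C, x), ssm_quadratic(a, B, C, x)))
print(np.linalg.matrix_rank(decay_mask(a)[8:, :8]))  # off-diagonal block of the mask
```

The second print is the semiseparable property in miniature: any block of the mask taken from below the diagonal is an outer product, so it has rank one.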
What this means in practice: the question "are state-space models or Transformers the right architecture" has been a category error since May 2024. The actual question is which decomposition of the token-mixing matrix matches your compute budget and your task structure. SSMs and Transformers occupy adjacent regions of the same design space.
The Repeat-After-Me theorem: where SSMs lose, honestly
The case against SSMs as a general-purpose architecture is real, and it's worth stating cleanly.
The Repeat-After-Me theorem [1] states that a two-layer Transformer can copy strings of length exponential in its parameter count, while generalized SSMs with fixed hidden-state size cannot copy strings longer than what fits in their state. This is a capacity statement, not a training artifact. You can verify it by counting bits: a state of dimension $d$ stored at fixed precision carries at most $O(d)$ bits of information; copying a length-$n$ string over a vocabulary of size $V$ requires $n \log_2 V$ bits. Once $n \log_2 V$ exceeds the state's capacity, the state cannot represent the input losslessly. The model has to forget something.
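The bit count is one line of arithmetic. The helper below is a hypothetical illustration of that counting argument only, and the example numbers are made up, not taken from any particular model.

```python
import math

def max_copy_length(state_dim, bits_per_dim, vocab_size):
    """Longest string a fixed-size state could hold losslessly, by bit counting alone."""
    capacity_bits = state_dim * bits_per_dim
    bits_per_token = math.log2(vocab_size)
    return math.floor(capacity_bits / bits_per_token)

# A 64-dimensional state at 16 bits per dimension holds 1024 bits,
# which is room for 128 tokens from a 256-symbol vocabulary.
print(max_copy_length(state_dim=64, bits_per_dim=16, vocab_size=256))  # 128
```

This is an upper bound from capacity alone; it says nothing about whether training can actually find an encoding that reaches it.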
Empirically the gap is worse than the theorem predicts. Even when you enlarge Mamba's hidden state so it could in principle hold the input, Mamba needs roughly 100× more training data than a Transformer to learn the copying task [1]. The loss surface for SSM copying is hostile in ways the capacity argument doesn't capture. Mimetic initialization [12] closes part of this gap by initializing SSM weights to mimic attention patterns; the asymptotic ceiling stays where it is.
So: any task that requires exact-token recall (Phonebook lookup, retrieval over discrete tokens, in-context learning that depends on copying) will favor attention. The harder RULER tasks beyond NIAH are mostly of this type [3]. The discourse is correct about this.
What the discourse misses is that production SSMs in 2026 don't run alone. Jamba interleaves Mamba and attention blocks [10]. Zamba 2 ships one attention layer per Mamba block. Nemotron Nano 2 and distilled Llama hybrids replace up to 93% of attention sub-layers with Mamba-2. The hybrid pattern is the engineering response to the theorem. And for predictive modeling on continuous signals, the theorem doesn't apply: continuous prediction doesn't require exact recall, only sufficient statistics, and an SSM's state is a fixed-size summary of the past by construction.
The Repeat-After-Me theorem is the rare ML result that is both mathematically tight and immediately misread on Twitter.
Hierarchies: where HiSS quietly wins
This is the part of the story almost no LLM-focused content discusses.
The setup in HiSS [2] is conceptually simple. Take a sensor sequence of length $T$. Divide it into $\lceil T/k \rceil$ chunks of size $k$. Pass each chunk through a shared low-level SSM (typically S4-style). For each chunk, take the SSM's output at the $k$-th element of that chunk (the recurrent state after consuming the chunk's full input). Concatenate these $k$-th outputs across chunks to form a rarefied feature sequence of length $\lceil T/k \rceil$. Pass this rarefied sequence through a higher-level SSM. Take its output as the prediction.
```python
import torch.nn.functional as F

def hiss_forward(x, k, low_ssm, high_ssm, out_head):
    """x: (batch, T, d_in). k: chunk size."""
    B, T, d_in = x.shape
    pad = (-T) % k
    x = F.pad(x, (0, 0, 0, pad))                       # right-pad along time
    chunks = x.reshape(B, -1, k, d_in)                 # (B, T/k, k, d_in)
    local = low_ssm(chunks.reshape(-1, k, d_in))       # (B*T/k, k, d_hid)
    local = local.reshape(B, -1, k, local.shape[-1])   # (B, T/k, k, d_hid)
    rarefied = local[:, :, -1, :]                      # take last step per chunk
    global_feats = high_ssm(rarefied)                  # (B, T/k, d_hid)
    return out_head(global_feats)
```
The math is doing something specific. The low-level SSM gives you a local feature representation at the original sampling rate. The high-level SSM gives you a representation at $1/k$ of the sampling rate, operating on the low-level's terminal states. Two SSMs at different temporal resolutions stacked together approximate a system that processes multiple temporal frequencies simultaneously.
Why this works: physical processes are multi-scale. A tactile sensor on a robot's gripper measures vibrations at 1 kHz, which encode contact dynamics evolving at around 1 Hz, which in turn encode grasp configurations evolving at around 0.1 Hz. A single-scale SSM is being asked to compress all three scales into one polynomial projection. The hierarchy gives each scale its own projection. The match between architecture and signal is what produces the 23% MSE improvement across the six datasets in the CSP-Bench benchmark [2]: tactile sensing on ReSkin pads, accelerometer-based IMU state prediction, and four other continuous-prediction tasks. The gap holds across dataset sizes and survives standard filtering preprocessing.
There is a natural extension. The chunk size $k$ doesn't have to be fixed. Dynamic Chunking [13] (Hwang et al., 2025) learns the chunk boundaries end-to-end, letting the model adapt its temporal resolution to the signal. This is the obvious next step and it's already published.
While LLM Twitter spent six months arguing whether SSMs could ever beat Transformers, HiSS beat them on tactile prediction by 23% and nobody noticed, because the benchmark wasn't a chatbot.
The 2026 frontier
Where the field is, as of May 2026, and what's actually being deployed.
Hybrid is the default at scale. Jamba (256K effective context, MoE) [10]. Zamba 2 (one-attention-per-block). Nemotron Nano 2. Distilled Llama hybrids that replace up to 93% of attention with Mamba-2 blocks. Every team I know shipping production at long context is shipping hybrid. The "pure" SSM language model is now mostly a research artifact; the "pure" Transformer at long context is increasingly an economic mistake.
Mamba-3 is the inference-era answer. [9] The smaller state and second-order discretization optimize for cost per token at deployment, not training-scale perplexity. Cartesia [18] and Together AI are betting on this. Their explicit framing is that the LLM market has shifted toward post-training and inference-heavy deployment, and the architecture follows the economics.
Time-series foundation models are mostly attention. Chronos-2, Moirai-2, TimesFM. The SSM-for-time-series story is technically promising (S-Mamba, DS3M), but the training is unstable in ways nobody has fully fixed [20]. The foundation-model players defaulted to attention-based backbones because those shipped first. I expect that to change as Mamba-3 propagates.
HiSS-style hierarchies are quietly spreading. Dynamic Chunking [13] is the obvious generalization. The pattern is moving from sensor-prediction-specific to general sequence modeling. The open empirical question is whether HiSS-style hierarchies scale to language. Most published work has stuck to continuous prediction. Almost nobody has run the scaling experiment for tokens. This is the most interesting open question in the family right now.
Training-free context extension exists. LongMamba [14] (ICLR 2025) extends Mamba's effective receptive field without retraining. Useful in practice if you're deploying a pretrained SSM and discover you need more context than the training recipe used.
Where this leaves you
Long-context is two problems. Exact recall over discrete tokens: Transformers, by theorem. Continuous prediction under bounded sufficient statistics: state-space models, by construction. The mathematical idea unifying the SSM tradition is HiPPO's polynomial projection: choose a measure, project the past onto an orthogonal polynomial basis, evolve the coefficients by an ODE. Everything from S4 to Mamba-3 is engineering on top of that one idea. The 23% MSE gap that HiSS opens on continuous sensor prediction comes from matching architecture to signal: multi-resolution to multi-resolution.
Concrete actions:
- For LLM workloads with retrieval components, default to a hybrid (Jamba, Zamba 2, or a Mamba-2-heavy distill). Pure attention is fine; pure SSM is theoretically limited.
- For continuous-signal prediction (sensors, IMU, vibration, biological signals), try HiSS before any Transformer. The math says it should win; the published benchmarks agree.
- For inference-cost-constrained deployments, evaluate Mamba-3. The complex-state and exponential-trapezoidal changes are genuinely new, not marketing.
- For research projects, the open question is whether HiSS-style hierarchies scale to language. Almost nobody has run that experiment yet, which is why somebody should.
The polynomial that ate signal processing in the 1800s is now eating long-context predictive modeling in the 2020s. Which, given how long the polynomial has been around, ought to embarrass everyone exactly the right amount.
Sources
1. Jelassi, S., Brandfonbrener, D., Kakade, S.M., Malach, E. "Repeat After Me: Transformers are Better than State Space Models at Copying." ICML 2024.
2. Bhirangi, R., Wang, C., Pattabiraman, V., Majidi, C., Gupta, A., Hellebrekers, T., Pinto, L. "Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling." 2024.
3. Hsieh, C-P., et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" 2024.
4. Gu, A., Dao, T., Ermon, S., Rudra, A., Ré, C. "HiPPO: Recurrent Memory with Optimal Polynomial Projections." NeurIPS 2020.
5. Gu, A., Goel, K., Ré, C. "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR 2022.
6. Smith, J.T.H., Warrington, A., Linderman, S.W. "Simplified State Space Layers for Sequence Modeling." 2022.
7. Gu, A., Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." 2023.
8. Dao, T., Gu, A. "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024.
9. Mamba-3 authors. "Mamba-3: Improved Sequence Modeling using State Space Principles." ICLR 2026.
10. Lieber, O., et al. "Jamba: A Hybrid Transformer-Mamba Language Model." 2024.
11. Pham, C., et al. "Taipan: Efficient and Expressive State Space Language Models with Selective Attention." 2024.
12. Trockman, A., et al. "Mimetic Initialization Helps State Space Models Learn to Recall." 2024.
13. Hwang, S., et al. "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling." 2025.
14. "LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement." ICLR 2025.
15. Voelker, A., Kajić, I., Eliasmith, C. "Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks." NeurIPS 2019.
16. Bahri, M., Galuzzi, B., Mongelli, M. "Numerical Analysis of HiPPO-LegS ODE for Deep State Space Models." 2024.
17. Dao, T. "State Space Duality (Mamba-2) Part I-III." Tri Dao's blog, 2024.
18. Cartesia AI. "Mamba-3: An Inference-First State Space Model." 2026.
19. Rush, A. "The Annotated S4."
20. Yang, K., et al. "Towards Long-Context Time Series Foundation Models." 2024.
- Series prerequisite: "Mamba & State-Space Models." Earlier deep dive on the architecture; this post is the mathematical follow-up.