The Prompt & The Ponder

Memory as Polynomial Projection: The Mathematics of Long-Context Predictive Modeling

Serendeep Rudraraju — Tue, 19 May 2026 15:11:38 GMT

The most-cited result in the SSM-vs-Transformer debate is Repeat After Me [1], which proves that state-space models with a fixed-size hidden state cannot copy strings of unbounded length, while two-layer Transformers can copy strings whose length is exponential in their parameter count. This is mathematically tight, empirically reproduced, and reposted on Twitter every six weeks.

It is also irrelevant to the problem most state-space models were designed to solve.

I've spent the last year reading SSM papers like someone watching two different sports through the same window. There is the long-context-as-retrieval game (needle in a haystack, multi-hop tracing, copy-paste over a million tokens), where Transformers and hybrids are clearly winning. And there is the long-context-as-continuous-prediction game (predict the next tactile reading given a minute of vibration data; forecast a financial regime given a year of multivariate ticks), where state-space models are quietly running up the score. Hierarchical state-space models [2] beat causal Transformers, LSTMs, S4, and Mamba by at least 23% in MSE on six real-world sensor datasets. Almost nobody is writing about that, because the benchmark isn't a chatbot.

This post follows one mathematical idea, projecting a signal onto a polynomial basis under a chosen measure, from a 2020 functional-analysis result through to Mamba-3 at ICLR 2026 and HiSS for sensor prediction, and argues that conflating the two long-context problems is why the discourse keeps drawing the wrong conclusions.

TL;DR

"Long context" is two problems. For exact recall of discrete tokens, attention wins by a theorem [1]. For bounded sufficient statistics of a continuous signal across temporal scales, state-space models win by construction. The mathematical idea unifying the SSM tradition, HiPPO [4], is the optimal projection of a continuous signal's history onto a polynomial basis under a chosen measure. Everything from S4 [5] through Mamba-3 [9] is an engineering refinement of that one idea, and HiSS [2] stacks it hierarchically across temporal resolutions. The 2026 production architecture is hybrid because the correct answer was never "pick one."

Two problems wearing the same costume

The long-context discourse has collapsed two structurally different problems into one benchmark genre. Naming them separately clears most of the architectural debate.

Long-context-for-retrieval. The objective is exact recall of a finite set of tokens from a long history. The benchmarks are RULER's needle-in-a-haystack, multi-hop tracing, aggregation, and variable tracking [3]. The cost function is binary: did the model pull the right needle. The information-theoretic shape requires storage proportional to what you must recall — if there are k needles and each could be anywhere in n tokens, you need roughly O(k log n) bits of state to localize them at retrieval time.

Long-context-for-prediction. The objective is the next continuous-valued step given a long history. The benchmarks are sensor forecasting [2], time-series prediction [20], financial regime-switching forecasts, biomedical signals. The cost function is MSE or NLL over a continuous distribution. The information-theoretic shape is forgiving: you don't need to remember the past exactly, only retain sufficient statistics, the smallest representation of the past that preserves the predictive distribution of the future. For a Gaussian process that's mean plus covariance. For an SSM it's the projection coefficients onto a chosen basis.

These have different complexity floors. Exact recall has a hard storage lower bound. Continuous prediction tolerates lossy compression as long as the compressed past stays predictively sufficient.

Problem	Best architecture	Reason
Needle-in-haystack (single needle)	SSM or hybrid	Both pass; SSM cheaper per token
RULER multi-hop tracing, aggregation	Transformer or attention-heavy hybrid	SSMs degrade as length grows [3]
Phonebook lookup, exact copy	Transformer	Repeat-After-Me theorem [1]
Long-context summarization	Hybrid (Jamba-style) [10]	Mix of exact recall and bounded statistics
Continuous sensor prediction (tactile, IMU)	HiSS [2]	Multi-resolution physical processes match the hierarchy
Time-series forecasting with regime switches	Hybrid SSM or attention-based foundation model	Actively contested
Long-Range Arena Path-X (length 16K)	S4 / S5	Transformers score near random [5][6]
Streaming inference at long context	Mamba-3 [9]	Constant time and memory per emitted token

The benchmark you reach for silently encodes which problem you think you're solving. If you only have a hammer, every problem looks like NIAH.

HiPPO, and the polynomial that ate signal processing

The mathematical question behind all of this is older than transformers: given a continuous input signal $f(t)$, what is the optimal way to compress its history into a finite-dimensional state $c(t) \in \mathbb{R}^N$, such that the compression is updated online and approximates $f$ as well as possible under a chosen importance measure?

HiPPO [4] gives a closed-form answer. Pick a measure $\mu$ over the past (this is where you encode what "remembering" means) and a basis of polynomials $\{g_n\}$ orthogonal under that measure. At every time $t$, the best approximation of $f$'s history by a polynomial of degree less than $N$ is the unique projection

$$f(s) \approx \sum_{n=0}^{N-1} c_n(t)\, g_n(s), \quad s \in (-\infty, t]$$

The coefficients $c(t)$ that minimize the $L^2(\mu)$ error evolve according to a linear ODE

$$\dot{c}(t) = A\, c(t) + B\, f(t)$$

where $A$ and $B$ depend only on the choice of basis and measure. This is a theorem, not a heuristic. Different measures give different $A$. The most useful one, the LegS (scaled Legendre) measure, says "remember the entire history, uniformly weighted, over a window that keeps stretching as $t$ grows." For LegS the right-hand side also picks up a $1/t$ factor from that growing window. Its $A$ matrix has a closed form that S4 inherits almost verbatim.
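To make "projection onto a polynomial basis" concrete, here is a minimal offline sketch, assuming the uniform measure on $[-1, 1]$ and Legendre polynomials. HiPPO's contribution is doing this online with an ODE; the snippet only shows the static projection that the ODE tracks. The function names and the test signal are illustrative, not taken from any of the papers.

```python
import numpy as np
from numpy.polynomial import legendre as leg

def legendre_project(f, N, n_quad=200):
    """Coefficients of the L2 projection of f on [-1, 1] onto P_0 .. P_{N-1}."""
    x, w = leg.leggauss(n_quad)              # Gauss-Legendre nodes and weights
    V = leg.legvander(x, N - 1)              # V[i, n] = P_n(x_i)
    norms = 2.0 / (2 * np.arange(N) + 1)     # integral of P_n^2 over [-1, 1]
    return (V * (w * f(x))[:, None]).sum(axis=0) / norms

def l2_error(f, coeffs, n_quad=200):
    """L2 norm of the residual f - sum_n c_n P_n under the uniform measure."""
    x, w = leg.leggauss(n_quad)
    resid = f(x) - leg.legval(x, coeffs)
    return float(np.sqrt(np.sum(w * resid ** 2)))

def signal(s):
    # Illustrative two-frequency "history"; any smooth function works here.
    return np.sin(4 * s) + 0.5 * np.cos(9 * s)

for N in (4, 8, 16, 32):
    print(N, l2_error(signal, legendre_project(signal, N)))
```

The error shrinks as $N$ grows, which is the whole bargain: a fixed number of coefficients buys a controlled, lossy summary of the signal rather than an exact copy.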

The structure of the LegS $A$ matrix is what makes it tractable:

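import numpy as np

def hippo_legs_A(N):
    """HiPPO-LegS state matrix. Lower-triangular with diagonal decay."""
    n = np.arange(N)
    # Strictly-lower entries: A[n, k] = -sqrt((2n+1)(2k+1)) for n > k.
    A = -np.sqrt((2 * n[:, None] + 1) * (2 * n[None, :] + 1))
    # Diagonal entries: A[n, n] = -(n + 1); everything above the diagonal is zero.
    A = np.tril(A, k=-1) - np.diag(n + 1)
    return A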

The numerical analysis of the resulting continuous-time ODE has its own paper [16] — the LegS ODE is singular at $t = 0$, but the well-posedness can be proved rigorously, which matters more than it should given how many later results depend on this object behaving well.

To go from continuous to discrete, you pick a discretization rule. Zero-order hold, bilinear (Tustin), exponential-Euler: each trades fidelity for compute. The discrete-time recurrence under zero-order hold with step size $\Delta$ is

$$c_{t+1} = \bar{A}\, c_t + \bar{B}\, f_t, \quad \bar{A} = e^{\Delta A}, \quad \bar{B} = (\bar{A} - I)\, A^{-1} B$$

(exponential-Euler keeps the same $\bar{A}$ and simplifies $\bar{B}$ to $\Delta B$)

which, if you squint, is the linear RNN of an introductory deep learning course — except $A$ has been chosen by functional analysis instead of by random initialization plus prayer.
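A small sketch of that recurrence, assuming the time-invariant form of the LegS system with $A_{nn} = -(n+1)$ and $B_n = \sqrt{2n+1}$, and the zero-order-hold formulas $\bar{A} = e^{\Delta A}$, $\bar{B} = A^{-1}(\bar{A} - I)B$. The `expm` helper is a teaching stand-in written for this example (a library routine such as SciPy's would normally be used), and all function names are hypothetical.

```python
import numpy as np

def hippo_legs(N):
    """LegS matrices: A[n,k] = -sqrt((2n+1)(2k+1)) for n > k, A[n,n] = -(n+1), B[n] = sqrt(2n+1)."""
    n = np.arange(N)
    r = np.sqrt(2 * n + 1)
    A = -np.tril(np.outer(r, r), k=-1) - np.diag(n + 1)
    return A, r.copy()

def expm(M, terms=20, squarings=8):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    S = M / (2 ** squarings)
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ S / k
        E = E + term
    for _ in range(squarings):
        E = E @ E
    return E

def discretize_zoh(A, B, dt):
    """Zero-order hold: Abar = exp(dt A), Bbar = A^{-1} (Abar - I) B."""
    Abar = expm(dt * A)
    Bbar = np.linalg.solve(A, (Abar - np.eye(len(A))) @ B)
    return Abar, Bbar

def run_recurrence(Abar, Bbar, f):
    """Linear RNN: c_{t+1} = Abar c_t + Bbar f_t, starting from c = 0."""
    c = np.zeros(len(Bbar))
    for f_t in f:
        c = Abar @ c + Bbar * f_t
    return c

A, B = hippo_legs(8)
Abar, Bbar = discretize_zoh(A, B, dt=0.1)
print(run_recurrence(Abar, Bbar, np.ones(2000)).round(6))
```

Fed a constant input, the state settles at $-A^{-1}B$, which for these matrices works out to $(1, 0, \dots, 0)$: a constant signal needs only the constant polynomial. That is a quick sanity check that the state really holds projection coefficients and not arbitrary hidden units.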

The Legendre Memory Unit [15] showed in 2019 that this trick can handle dependencies across 100,000 time steps with a tiny number of internal state variables. That paper got polite citations and not much else. The thing it foreshadowed, that polynomials (the original neural network) were unreasonably good at long-range memory, only landed when S4 walked through the same door three years later.

The family tree: S4 → S5 → Mamba → Mamba-2 → Mamba-3

Each step in the lineage refines the same idea, HiPPO's projection, for a different compute regime. The naming is confusing on purpose. The through-line is not.

S4 (Gu, Goel, Ré, 2022) [5] took HiPPO's continuous ODE, discretized it with the bilinear transform, and made the resulting convolution kernel cheap to compute by putting $A$ in diagonal-plus-low-rank form and reducing the kernel to a Cauchy-kernel evaluation. The convolutional view lets you train the model as a long causal convolution; the recurrent view lets you run it autoregressively at inference. The same operator, two compute regimes. S4 was the first architecture to clear the Long-Range Arena Path-X task at length 16,000, on which every prior model (Transformers included) had scored at chance.

S5 (Smith, Warrington, Linderman, 2022) [6] replaced S4's bank of single-input-single-output channels with one multi-input-multi-output SSM. It replaced the convolution view with a parallel scan. It diagonalized the state matrix, which S4D had already shown was a benign simplification. The result was an 87.2% LRA average and a much simpler implementation. The math is the same; the engineering is cleaner.

Mamba (Gu and Dao, 2023) [7] broke the most sacred property of the lineage so far: time-invariance. In S4 and S5, $A$, $B$, $C$ are fixed parameters; the model treats every input the same way. Mamba makes $B$, $C$, and the step size $\Delta$ depend on the input. The model can now selectively remember or forget. The cost is real: the FFT-based convolution trick disappears because the operator is no longer LTI. The gain is also real: a hardware-aware selective scan in custom CUDA, 5× higher inference throughput than comparable Transformers, and the ability to ignore irrelevant tokens instead of compressing them with equal weight.
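The selectivity mechanism can be sketched as a plain sequential loop. This is a reference-style illustration under stated assumptions, not Mamba's fused CUDA kernel: it assumes a diagonal $A$ with negative entries, a single input channel, the simplified update $\bar{B} = \Delta B$, and per-step `delta`, `B`, `C` that some upstream projection has already computed from the input. All names are hypothetical.

```python
import numpy as np

def softplus(z):
    """Keeps step sizes positive; a common choice for producing delta from the input."""
    return np.log1p(np.exp(z))

def selective_scan(x, a, delta, B, C):
    """Sequential reference for a diagonal selective SSM on a 1-D input.

    x: (T,) inputs; a: (N,) negative diagonal of A;
    delta: (T,) input-dependent step sizes; B, C: (T, N) input-dependent projections.
    """
    h = np.zeros(len(a))
    ys = np.empty(len(x))
    for t in range(len(x)):
        h = np.exp(delta[t] * a) * h + delta[t] * B[t] * x[t]   # decay old state, write new input
        ys[t] = C[t] @ h                                        # input-dependent readout
    return ys, h

rng = np.random.default_rng(0)
T, N = 6, 4
x = rng.standard_normal(T)
a = -np.arange(1.0, N + 1)
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
delta = softplus(0.5 * x - 1.0)   # hypothetical scalar projection of the input
ys, h_final = selective_scan(x, a, delta, B, C)
```

The two limits show what "selective" buys: when `delta[t]` is near zero the decay factor is near one and nothing is written, so the token is ignored; when `delta[t]` is large the old state decays away and the current token dominates.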

Mamba-2 (Dao and Gu, ICML 2024) [8] is the one I keep rereading. The Structured State Space Duality (SSD) result proves that an SSM with $A = \alpha I$ (a scalar times the identity) is equivalent to a masked linear attention with a 1-semiseparable causal mask. SSMs and Transformers are different decompositions of the same token-mixing matrix. The linear form (recurrence) is what you use for inference; the quadratic form (matmul) is what you use for training. Mamba-2 runs 2-8× faster than Mamba-1 by computing the operator via block-decomposition of this semiseparable matrix, which is matmul-heavy and hardware-friendly. We spent a year of architectural debate over which paradigm wins. The chairs had been rearranged.

Mamba-3 (ICLR 2026) [9] ships three changes, all mathematical:

Complex-valued state spaces. Real-valued linear systems are provably incapable of certain state-tracking tasks like parity and modular arithmetic at fixed depth. Complex eigenvalues recover these capabilities at no asymptotic cost.
Exponential-trapezoidal discretization. Mamba-1 and Mamba-2 used a first-order exponential-Euler step. Mamba-3 uses a second-order accurate exponential-trapezoidal rule, which preserves more of the continuous-time dynamics at the same parameter count.
MIMO formulation revisited. Improves inference-time hardware utilization. The post-training era is inference-heavy [18]; the architecture is being engineered for that.

The result is a model with half the state size of Mamba-2 at comparable perplexity. Shipping a successor whose headline feature is a smaller state is one of the more counterintuitive product roadmaps in machine learning, and it follows from a clear-eyed read of where the inference-cost curve has bent.
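A toy illustration of the complex-state point above, assuming nothing about Mamba-3's actual parameterization: a single complex state multiplied by an input-dependent unit-modulus eigenvalue keeps a running count modulo $m$ in its phase. A decay factor restricted to $(0, 1)$ can only shrink the state and has no way to rotate it. The function name is hypothetical.

```python
import numpy as np

def mod_counter(steps, m=2):
    """Track sum(steps) mod m in the phase of one complex state."""
    h = 1.0 + 0.0j
    for s in steps:
        h = np.exp(2j * np.pi * s / m) * h      # rotate by s/m of a full turn
    phase = np.angle(h) % (2 * np.pi)           # map to [0, 2*pi)
    return int(np.round(phase * m / (2 * np.pi))) % m

bits = [1, 0, 1, 1, 0, 1]
print(mod_counter(bits, m=2), sum(bits) % 2)
```

With `m=2` this is parity; larger `m` gives modular counting with the same one-dimensional state.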

Model	Year	Key innovation	State type	Selectivity	Compute model	Killer result
HiPPO	2020	Polynomial projection theory	Real, lower-triangular (LegS)	No	n/a	Theoretical framework [4]
LMU	2019	ODE-derived recurrent memory	Real	No	RNN	100K+ step memory [15]
S4	2022	Structured $A$, closed-form impulse response	Real, DPLR	No	Long convolution / RNN	First to clear LRA Path-X [5]
S4D	2022	Diagonal simplification	Real, diagonal	No	Conv / RNN	Same spectrum as LegS
S5	2022	MIMO + parallel scan	Real, diagonal	No	Parallel scan	87.2% LRA average [6]
Mamba (S6)	2023	Input-dependent $B$, $C$, $\Delta$	Real, diagonal	Yes	Selective scan (custom CUDA)	5× throughput vs Transformer [7]
Mamba-2	2024	Structured State Space Duality	Real, scalar $\times I$	Yes	Block matmul via 1-semiseparable	2-8× faster than Mamba-1 [8]
HiSS	2024	Two-level temporal hierarchy	Inherits base SSM	Inherits base	Two stacked SSMs	At least 23% lower MSE on sensor prediction [2]
Mamba-3	2026	Complex states, exp-trapezoidal, MIMO	Complex	Yes	Refined for inference hardware	Half state size at parity [9]
Jamba	2024	Interleaved Mamba + attention + MoE	Mixed	Mixed	Mixed	256K effective context [10]
Taipan	2024	Mamba-2 + selective attention	Mixed	Mixed	Mixed	Accurate to 1M tokens [11]

The naming convention deserves one note. S4 to S5 to S6 was renamed Mamba because "S6" sounds like a midrange Audi. The rest of the field gave up on numbered nomenclature and started naming everything after either snakes or African capitals.

The Structured State Space Duality, or: how we spent a year on notation

The SSD result deserves its own beat because it reframed the entire debate.

A semiseparable matrix is one whose lower-triangular blocks have low rank — specifically, every contiguous submatrix below the diagonal has rank at most $k$ for some small $k$. These matrices have been studied in numerical linear algebra since the 1990s; they are the discrete-time generalization of rank-structured operators. Most "linear time" sequence-modeling algorithms turn out to be renamed variants of techniques you can find in textbooks on hierarchical and rank-structured matrices.

The SSD result [8] is that an SSM with $A = \alpha I$ produces, when unrolled across a sequence, a matrix that is exactly 1-semiseparable. The masked-attention matrices of a particular class (those with a structured causal mask whose entries follow a multiplicative decay) are also 1-semiseparable. The two operators are different ways of computing the same underlying matrix.

This gives you two algorithms for free:

Linear form (recurrence): compute the output sequentially, $O(T)$ in the sequence length $T$, with constant memory per step. Use this for inference.
Quadratic form (matmul): materialize the full $T \times T$ operator and apply it via dense matrix multiplication. Use this for training, where matmuls are the GPU's preferred unit of work.

Same operator. Two compute regimes. Different hardware bottlenecks. Mamba-2's SSD kernel uses both at once: the quadratic form inside fixed-size chunks, and the recurrence to carry state from one chunk to the next.
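The equivalence of the two forms is easy to check numerically. The sketch below assumes a single input channel and a positive scalar decay $a_t$ per step (so $A_t = a_t I$); it is a didactic check, not the chunked SSD algorithm, and the function names are made up for this example.

```python
import numpy as np

def ssm_recurrent(x, a, B, C):
    """Linear form: h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t."""
    h = np.zeros(B.shape[1])
    ys = np.empty(len(x))
    for t in range(len(x)):
        h = a[t] * h + B[t] * x[t]
        ys[t] = C[t] @ h
    return ys

def ssm_quadratic(x, a, B, C):
    """Quadratic form: y = (L * (C B^T)) x with L[i, j] = a_{j+1} ... a_i for i >= j."""
    log_cum = np.cumsum(np.log(a))                              # requires a > 0
    L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))    # 1-semiseparable decay mask
    return (L * (C @ B.T)) @ x                                  # masked "attention" matrix times x

rng = np.random.default_rng(0)
T, N = 16, 4
x = rng.standard_normal(T)
a = rng.uniform(0.5, 1.0, size=T)
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
print(np.allclose(ssm_recurrent(x, a, B, C), ssm_quadratic(x, a, B, C)))
```

Reading `C @ B.T` as query-key scores and `L` as a decaying causal mask is exactly the linear-attention view; the loop is the SSM view. The exponentiated upper triangle is discarded by `np.tril`, so for long sequences a numerically careful version would avoid forming it at all.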

What this means in practice: the question "are state-space models or Transformers the right architecture" has been a category error since May 2024. The actual question is which decomposition of the token-mixing matrix matches your compute budget and your task structure. SSMs and Transformers occupy adjacent regions of the same design space.

The Repeat-After-Me theorem: where SSMs lose, honestly

The case against SSMs as a general-purpose architecture is real, and it's worth stating cleanly.

The Repeat-After-Me theorem [1] states that a two-layer Transformer can copy strings of length exponential in its parameter count, while generalized SSMs with fixed hidden-state size cannot copy strings longer than what fits in their state. This is a capacity statement, not a training artifact. You can verify it by counting bits: a state of dimension $d$ stored at fixed precision carries $O(d)$ bits of information, while copying a length-$n$ string over a vocabulary of size $V$ requires at least $n \log_2 V$ bits. Once $n$ exceeds roughly $d / \log_2 V$, the state cannot represent the input losslessly. The model has to forget something.
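The bit-counting argument fits in a few lines. This is back-of-envelope arithmetic that assumes every state dimension stores a fixed number of usable bits, which is generous to the SSM; the function name and the example sizes are illustrative.

```python
import math

def max_copy_length(state_dim, bits_per_dim, vocab_size):
    """Longest string a fixed-size state could hold losslessly, by bit counting alone."""
    state_bits = state_dim * bits_per_dim      # total capacity of the state
    bits_per_token = math.log2(vocab_size)     # cost of one uniformly random token
    return math.floor(state_bits / bits_per_token)

# A 4096-dimensional state at 16 bits per dimension, 2**16-token vocabulary.
print(max_copy_length(4096, 16, 2 ** 16))
```

The bound grows linearly with the state, so doubling the copyable length means doubling the state. Attention has no such ceiling because its "state" is the whole context.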

Empirically the gap is worse than the theorem predicts. Even when you enlarge Mamba's hidden state so it could in principle hold the input, Mamba needs roughly 100× more training data than a Transformer to learn the copying task [1]. The loss surface for SSM copying is hostile in ways the capacity argument doesn't capture. Mimetic initialization [12] closes part of this gap by initializing SSM weights to mimic attention patterns; the asymptotic ceiling stays where it is.

So: any task that requires exact-token recall (Phonebook lookup, retrieval over discrete tokens, in-context learning that depends on copying) will favor attention. The harder RULER tasks beyond NIAH are mostly of this type [3]. The discourse is correct about this.

What the discourse misses is that production SSMs in 2026 don't run alone. Jamba interleaves Mamba and attention blocks [10]. Zamba 2 ships one attention layer per Mamba block. Nemotron Nano 2 and distilled Llama hybrids replace up to 93% of attention sub-layers with Mamba-2. The hybrid pattern is the engineering response to the theorem. And for predictive modeling on continuous signals, the theorem doesn't apply: continuous prediction doesn't require exact recall, only sufficient statistics, and an SSM's state is a bounded summary statistic by construction.

The Repeat-After-Me theorem is the rare ML result that is both mathematically tight and immediately misread on Twitter.

Hierarchies: where HiSS quietly wins

This is the part of the story almost no LLM-focused content discusses.

The setup in HiSS [2] is conceptually simple. Take a sensor sequence of length $T$. Divide it into $\lceil T/k \rceil$ chunks of size $k$. Pass each chunk through a shared low-level SSM (typically S4-style). For each chunk, take the SSM's output at the $k$-th element of that chunk (the recurrent state after consuming the chunk's full input). Concatenate these $k$-th outputs across chunks to form a rarefied feature sequence of length $\lceil T/k \rceil$. Pass this rarefied sequence through a higher-level SSM. Take its output as the prediction.

import torch.nn.functional as F

def hiss_forward(x, k, low_ssm, high_ssm, out_head):
    """x: (batch, T, d_in). k: chunk size. Below, n = ceil(T / k) is the chunk count."""
    B, T, d_in = x.shape
    pad = (-T) % k
    x = F.pad(x, (0, 0, 0, pad))                              # zero-pad the end of the time axis
    chunks = x.reshape(B, -1, k, d_in)                        # (B, n, k, d_in)
    local = low_ssm(chunks.reshape(-1, k, d_in))              # (B*n, k, d_hid)
    local = local.reshape(B, -1, k, local.shape[-1])          # (B, n, k, d_hid)
    rarefied = local[:, :, -1, :]                             # last step per chunk: (B, n, d_hid)
    global_feats = high_ssm(rarefied)                         # (B, n, d_hid)
    return out_head(global_feats)

The math is doing something specific. The low-level SSM gives you a local feature representation at the original sampling rate. The high-level SSM gives you a representation at $1/k$ of the sampling rate, operating on the low-level's terminal states. Two SSMs at different temporal resolutions stacked together approximate a system that processes multiple temporal frequencies simultaneously.

Why this works: physical processes are multi-scale. A tactile sensor on a robot's gripper measures vibrations at 1 kHz, which encode contact dynamics evolving at around 1 Hz, which in turn encode grasp configurations evolving at around 0.1 Hz. A single-scale SSM is being asked to compress all three scales into one polynomial projection. The hierarchy gives each scale its own projection. The match between architecture and signal is what produces the improvement of at least 23% in MSE across the six datasets in the CSP-Bench benchmark [2]: tactile sensing on ReSkin pads, accelerometer-based IMU state prediction, and four other continuous-prediction tasks. The gap holds across dataset sizes and survives standard filtering preprocessing.

There is a natural extension. The chunk size $k$ doesn't have to be fixed. Dynamic Chunking [13] (Hwang et al., 2025) learns the chunk boundaries end-to-end, letting the model adapt its temporal resolution to the signal. This is the obvious next step and it's already published.

While LLM Twitter spent six months arguing whether SSMs could ever beat Transformers, HiSS beat them on tactile prediction by 23% and nobody noticed, because the benchmark wasn't a chatbot.

The 2026 frontier

Where the field is, as of May 2026, and what's actually being deployed.

Hybrid is the default at scale. Jamba (256K effective context, MoE) [10]. Zamba 2 (one-attention-per-block). Nemotron Nano 2. Distilled Llama hybrids that replace up to 93% of attention with Mamba-2 blocks. Every team I know shipping production at long context is shipping hybrid. The "pure" SSM language model is now mostly a research artifact; the "pure" Transformer at long context is increasingly an economic mistake.

Mamba-3 is the inference-era answer. [9] The smaller state and second-order discretization optimize for cost per token at deployment, not training-scale perplexity. Cartesia [18] and Together AI are betting on this. Their explicit framing is that the LLM market has shifted toward post-training and inference-heavy deployment, and the architecture follows the economics.

Time-series foundation models are mostly attention. Chronos-2, Moirai-2, TimesFM. The SSM-for-time-series story is technically promising (S-Mamba, DS3M), but the training is unstable in ways nobody has fully fixed [20]. The foundation-model players defaulted to encoder-decoder attention because it shipped first. I expect that to change as Mamba-3 propagates.

HiSS-style hierarchies are quietly spreading. Dynamic Chunking [13] is the obvious generalization. The pattern is moving from sensor-prediction-specific to general sequence modeling. The open empirical question is whether HiSS-style hierarchies scale to language. Most published work has stuck to continuous prediction. Almost nobody has run the scaling experiment for tokens. This is the most interesting open question in the family right now.

Training-free context extension exists. LongMamba [14] (ICLR 2025) extends Mamba's effective receptive field without retraining. Useful in practice if you're deploying a pretrained SSM and discover you need more context than the training recipe used.

Where this leaves you

Long-context is two problems. Exact recall over discrete tokens: Transformers, by theorem. Continuous prediction under bounded sufficient statistics: state-space models, by construction. The mathematical idea unifying the SSM tradition is HiPPO's polynomial projection: choose a measure, project the past onto an orthogonal polynomial basis, evolve the coefficients by an ODE. Everything from S4 to Mamba-3 is engineering on top of that one idea. The 23% MSE gap that HiSS opens on continuous sensor prediction comes from matching architecture to signal: multi-resolution to multi-resolution.

Concrete actions:

For LLM workloads with retrieval components, default to a hybrid (Jamba, Zamba 2, or a Mamba-2-heavy distill). Pure attention is fine; pure SSM is theoretically limited.
For continuous-signal prediction (sensors, IMU, vibration, biological signals), try HiSS before any Transformer. The math says it should win; the published benchmarks agree.
For inference-cost-constrained deployments, evaluate Mamba-3. The complex-state and exponential-trapezoidal changes are genuinely new, not marketing.
For research projects, the open question is whether HiSS-style hierarchies scale to language. Almost nobody has run that experiment yet, which is why somebody should.

The polynomial that ate signal processing in the 1800s is now eating long-context predictive modeling in the 2020s. Which, given how long the polynomial has been around, ought to embarrass everyone exactly the right amount.

Sources

[1] Jelassi, S., Brandfonbrener, D., Kakade, S.M., Malach, E. "Repeat After Me: Transformers are Better than State Space Models at Copying." ICML 2024.
[2] Bhirangi, R., Wang, C., Pattabiraman, V., Majidi, C., Gupta, A., Hellebrekers, T., Pinto, L. "Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling." 2024.
[3] Hsieh, C-P., et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" 2024.
[4] Gu, A., Dao, T., Ermon, S., Rudra, A., Ré, C. "HiPPO: Recurrent Memory with Optimal Polynomial Projections." NeurIPS 2020.
[5] Gu, A., Goel, K., Ré, C. "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR 2022.
[6] Smith, J.T.H., Warrington, A., Linderman, S.W. "Simplified State Space Layers for Sequence Modeling." 2022.
[7] Gu, A., Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." 2023.
[8] Dao, T., Gu, A. "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024.
[9] Mamba-3 authors. "Mamba-3: Improved Sequence Modeling using State Space Principles." ICLR 2026.
[10] Lieber, O., et al. "Jamba: A Hybrid Transformer-Mamba Language Model." 2024.
[11] Pham, C., et al. "Taipan: Efficient and Expressive State Space Language Models with Selective Attention." 2024.
[12] Trockman, A., et al. "Mimetic Initialization Helps State Space Models Learn to Recall." 2024.
[13] Hwang, S., et al. "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling." 2025.
[14] "LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement." ICLR 2025.
[15] Voelker, A., Kajić, I., Eliasmith, C. "Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks." NeurIPS 2019.
[16] Bahri, M., Galuzzi, B., Mongelli, M. "Numerical Analysis of HiPPO-LegS ODE for Deep State Space Models." 2024.
[17] Dao, T. "State Space Duality (Mamba-2) Part I-III." Tri Dao's blog, 2024.
[18] Cartesia AI. "Mamba-3: An Inference-First State Space Model." 2026.
[19] Rush, A. "The Annotated S4."
[20] Yang, K., et al. "Towards Long-Context Time Series Foundation Models." 2024.
Series prerequisite: "Mamba & State-Space Models." Earlier deep dive on the architecture; this post is the mathematical follow-up.

Energy-Based Transformers: The 1982 Architecture Finally Got Compatible Training Tricks

Serendeep Rudraraju — Thu, 07 May 2026 15:54:38 GMT

In July 2025, Alexi Gladstone and his collaborators put a paper on arXiv claiming that a neural-network idea first written down in 1982 scales 35% faster than the modern Transformer. Ten months later, no independent lab has published a replication. Both of these things are true. Both of them matter.

The Transformer scaling story has been monolithic since 2020. Bigger pretraining, more data, Chinchilla-optimal mixtures. Energy-Based Models, the framework that grew out of John Hopfield's 1982 network and was honored by the 2024 Nobel Prize in Physics, were left for dead by ~2012. Then there's a paper. An ICLR 2026 oral. $1.03B raised by Yann LeCun's AMI Labs in March 2026 for the EBM-flavored cousin. And a replication gap nobody is talking about.

TL;DR

Energy-Based Transformers replace softmax-over-logits with a scalar energy and an iterative inference loop, sidestepping the partition function that historically broke EBMs. They scale 35% faster than Transformer++ (under 800M params), match the System 2 thesis Yann LeCun has been making since 2022, and have triggered a 2026 ecosystem of follow-up work. EBTs are the first EBM to cross the threshold without collapsing. They have not yet been independently replicated. Both halves of that sentence are load-bearing.

What an Energy-Based Transformer Actually Computes

Strip away the framing and an EBT is a Transformer that outputs a single scalar instead of a distribution, and treats prediction as gradient descent on that scalar. That's it. The novelty is in how you train it.

Mechanically: an EBT maps an input x and a candidate prediction ŷ to one scalar E_θ(x, ŷ) ∈ ℝ. Lower energy means more compatible. The unnormalized joint is p_θ(x, ŷ) ∝ exp(−E_θ(x, ŷ)) — the same Boltzmann form Hopfield wrote down 44 years ago. LeCun's 2006 Tutorial on Energy-Based Learning puts it cleanly in the abstract: "Energy-Based Models capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy."

The break from a normal Transformer happens at inference. A standard decoder hands you the answer in one forward pass: softmax over the logits, argmax or sample, done. An EBT initializes a random guess ŷ_0 ~ N(0, I) and runs gradient descent on it:

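import torch

def ebt_inference(x, model, n_steps=8, alpha=0.1, create_graph=True):
    # Assumption: the prediction lives in the same space (and shape) as x.
    y = torch.randn_like(x).requires_grad_(True)   # random init, ŷ_0 ~ N(0, I)
    for _ in range(n_steps):
        energy = model(x, y).sum()                 # scalar (summed over the batch)
        # create_graph=True keeps the graph so training can backprop through the loop;
        # pass False for pure inference.
        grad = torch.autograd.grad(energy, y, create_graph=create_graph)[0]   # ∇_y E
        y = y - alpha * grad                       # descend on the prediction
    return y                                       # prediction after n_steps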

Note carefully: gradients are with respect to ŷ, not the weights. The weights are frozen at inference; what's being optimized is the prediction itself, treated as a free variable on the energy landscape that the model has learned. The architecture compares to three things at once:

flowchart TB
    subgraph S["Standard Transformer — 1 pass"]
        direction LR
        s1[x] --> s2[forward] --> s3[ŷ]
    end
    subgraph D["Diffusion Transformer — N denoising steps"]
        direction LR
        d1[x, noise] --> d2["forward × N<br/>(predict ε)"] --> d3[ŷ]
    end
    subgraph E["Energy-Based Transformer — N gradient steps on ŷ"]
        direction LR
        e1["x, ŷ₀"] --> e2["forward × N<br/>(scalar E, ∇E on ŷ)"] --> e3[ŷ_N]
    end

Two thinking modes flow from this structure. Increase N and the model "thinks longer" — more gradient steps, deeper basin in the energy landscape. Or sample M random initializations, run each to convergence, and pick argmin_j E_θ(x, ŷ_{N,j}) — the model verifies its own attempts and ships the best one. Both buy quality with FLOPs, at inference, with no architectural change.
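Both modes can be tried on a toy energy without any neural network. The sketch below assumes a hand-written energy with two wells per coordinate, tilted toward the context `x`, and an analytic gradient in place of autograd; it shows the control flow of "more steps" and "best of M" and makes no claim about the EBT paper's actual energy function. All names are hypothetical.

```python
import numpy as np

def energy(x, y):
    """Toy energy: a double well per coordinate, tilted toward the context x."""
    return float(np.sum((y ** 2 - 1.0) ** 2 + 0.5 * (y - x) ** 2))

def grad_energy(x, y):
    """Analytic gradient of the toy energy with respect to the prediction y."""
    return 4.0 * y * (y ** 2 - 1.0) + (y - x)

def think_longer(x, y0, n_steps, alpha=0.01):
    """Mode 1: spend more gradient steps on a single candidate."""
    y = np.array(y0, dtype=float)
    for _ in range(n_steps):
        y = y - alpha * grad_energy(x, y)
    return y

def best_of_m(x, m, n_steps, rng, alpha=0.01):
    """Mode 2: refine m random starts and keep the lowest-energy result."""
    candidates = [think_longer(x, rng.standard_normal(x.shape), n_steps, alpha)
                  for _ in range(m)]
    energies = np.array([energy(x, y) for y in candidates])
    return candidates[int(energies.argmin())], energies

rng = np.random.default_rng(0)
x = np.array([0.8, -0.5, 0.3, -0.9])
best, energies = best_of_m(x, m=8, n_steps=200, rng=rng)
print(energies.round(3), best.round(3))
```

Different starts fall into different wells, so their final energies differ. Picking the minimum is the self-verification step: the same scalar that guided each descent also ranks the finished candidates.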

Why This Didn't Work for 40 Years

Three failures stacked. The 1982 framework had real, structural reasons not to scale. The 2025 paper didn't fix the framework — it routed around it.

Failure one: the partition function. EBMs need Z_θ = ∫ exp(−E_θ(x, y')) dy' to produce a real probability. That integral is usually intractable. Maximum-likelihood training has a gradient that depends on Z, so every update needs samples from the model itself. Goodfellow, Bengio, and Courville devote an entire chapter — Ch. 18, "Confronting the Partition Function" — to the problem. The textbook framing: the integral is "intractable for many interesting models," so the field built models that "do not involve computing p(x) at all." Softmax classifiers. Autoregressive language models. Transformers. Every dominant deep-learning architecture is structured to dodge the EBM tax.
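To see why the integral is the problem, here is a brute-force estimate of $Z$ on a regular grid, assuming a Gaussian energy so the true answer, $(2\pi)^{d/2}$, is known. The grid bounds and resolution are arbitrary choices for illustration, and the function names are hypothetical.

```python
import numpy as np

def gaussian_energy(y):
    """E(y) = ||y||^2 / 2, evaluated over the last axis."""
    return 0.5 * np.sum(y ** 2, axis=-1)

def partition_function_grid(energy_fn, dim, lo=-6.0, hi=6.0, points_per_dim=41):
    """Riemann-sum estimate of Z = integral of exp(-E(y)) dy, plus the number of evaluations."""
    axis = np.linspace(lo, hi, points_per_dim)
    cell = (axis[1] - axis[0]) ** dim                       # volume of one grid cell
    grid = np.stack(np.meshgrid(*([axis] * dim), indexing="ij"), axis=-1)
    z = float(np.exp(-energy_fn(grid)).sum() * cell)
    return z, points_per_dim ** dim

for dim in (1, 2, 3):
    z, n_evals = partition_function_grid(gaussian_energy, dim)
    print(dim, n_evals, z, (2 * np.pi) ** (dim / 2))
```

The estimate is accurate in every case shown; the cost is what kills it. Evaluations grow as `points_per_dim ** dim`, so a grid that is trivial in three dimensions is out of reach long before the dimensionality of a token embedding.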

LeCun himself, in the 2006 tutorial, conceded the cost in one of the drier lines in machine-learning literature:

"Hence probabilistic modeling comes with a high price, and should be avoided when the application does not require it."

— Yann LeCun et al., A Tutorial on Energy-Based Learning, 2006

Even the framework's leading advocate said the math wasn't worth the cost most of the time.

Failure two: contrastive divergence is broken. The standard workaround was contrastive divergence with short-run MCMC, due to Hinton in 2002. Du and Mordatch's 2020 paper is blunt about what was happening: CD has "a gradient term neglected in the popular contrastive divergence formulation" that "is important in avoiding training instabilities that previously limited applicability and scalability of energy-based models." The 2010s ML establishment didn't ignore EBMs out of fashion. They had a documented instability problem, and nobody could confidently train an EBM past the size where it stopped fitting on a single GPU.

Failure three: nobody made one work at scale. From RBMs in 2009 through Du and Mordatch's 2019 ImageNet result, no EBM crossed a billion parameters with stable training. The EBT paper itself, in §3.4, puts a number on it: "zero publicly known Foundation EBMs" prior to its publication. From 2009 to 2025, while feed-forward Transformers crossed a trillion parameters, the EBM camp had nothing at the scale anyone in industry would notice.

The Royal Swedish Academy gave Hopfield and Hinton the 2024 Nobel in Physics for "foundational discoveries and inventions that enable machine learning with artificial neural networks." Hopfield's network is "described in a manner equivalent to the energy in the spin system found in physics." This is the end-of-an-era citation. The framework is recognized as foundational at exactly the moment the field decides it's also salvageable.

Then July 2025 happened.

What Gladstone et al. Changed

EBTs aren't a new kind of EBM. They're a new training procedure for the same old framework, one that happens to dodge every classical failure mode by accident.

The training trick is the headline. No contrastive divergence. No MCMC. No partition-function approximation. The training loss is the standard supervised loss between the converged prediction ŷ_N and the ground-truth y (cross-entropy for tokens, MSE for image patches), backpropagated through the entire N-step inference trajectory. Side by side:

# mcmc_sample, ebt_inference, supervised_loss, K, and N are placeholders, not real APIs.

# Classical EBM training (the historical approach that didn't scale)
def classical_ebm_step(x, y, model, optimizer):
    optimizer.zero_grad()
    pos_energy = model(x, y)                        # energy of the data sample
    y_neg = mcmc_sample(model, x, n_chain_steps=K)  # short-run sample from p_θ
    neg_energy = model(x, y_neg)                    # stands in for the log Z gradient
    loss = (pos_energy - neg_energy).mean()         # CD-style, biased
    loss.backward()                                 # unstable in practice
    optimizer.step()

# EBT training (Gladstone et al. 2025)
def ebt_step(x, y, model, optimizer):
    optimizer.zero_grad()
    y_pred = ebt_inference(x, model, n_steps=N)     # full inference loop
    loss = supervised_loss(y_pred, y)               # cross-entropy / MSE
    loss.backward()                                 # backprop *through* the loop (Hessian-vector products)
    optimizer.step()

The training signal becomes: shape the energy landscape so that gradient descent on ŷ from a random start lands at the right answer. The verifier and the generator in one model. The partition function never appears.

Three stability tricks earn their keep (§3.3 of the paper). A replay buffer recycles previously-optimized ŷ trajectories so the energy landscape is well-defined far from initialization. Langevin noise in the inference update (ŷ_{i+1} = ŷ_i − α∇E + η_i) lets the model escape spurious local minima rather than collapse onto one mode. Randomized step size and step count keep the model from overfitting to a specific optimization schedule. None of these is novel on its own. The combination is what hadn't been tried at this scale.
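The Langevin update above is small enough to sketch directly. This is a toy, not the paper's code: a hand-written quadratic energy stands in for the learned E_θ, its gradient is written by hand rather than taken by autodiff, and the names `energy_grad` and `ebt_style_inference`, along with every default value, are my own assumptions.

```python
import random


def energy_grad(y_hat: float, target: float) -> float:
    """Gradient of the toy energy E(y) = 0.5 * (y - target)**2.

    In a real EBT this gradient would come from autodiff through the
    network; the closed form here is an illustrative stand-in.
    """
    return y_hat - target


def ebt_style_inference(
    target: float,
    n_steps: int = 16,
    alpha: float = 0.3,
    noise_std: float = 0.005,
    seed: int = 0,
) -> float:
    """Run y_{i+1} = y_i - alpha * grad E(y_i) + eta_i from a random start."""
    rng = random.Random(seed)
    y_hat = rng.gauss(0.0, 1.0)  # random initial guess
    for _ in range(n_steps):
        eta = rng.gauss(0.0, noise_std)  # Langevin noise term
        y_hat = y_hat - alpha * energy_grad(y_hat, target) + eta
    return y_hat


prediction = ebt_style_inference(target=2.0)
print(f"prediction after descent: {prediction:.3f}")
```

On a single-basin energy like this one the noise only adds jitter. Its stated job in the paper, escaping spurious minima, only shows up on a landscape with more than one basin.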

The lead author concedes the obvious on his blog: "There is a long way to go in scaling these models up (I'm mainly looking at you, potential stability issues)." Stable enough for an 800M-parameter paper. Not yet stable enough to bet a frontier model on.

The System 2 connection is structural, not rhetorical. LeCun's 2022 paper explicitly proposed reasoning as energy minimization in an actor module — same equation form the EBT inference procedure uses. The structural lineage is real: descend the energy landscape until convergence, output the basin you land in. Unlike o1 or DeepSeek-R1, where System 2 emerges from RL on tasks with verifiable rewards (math, code), EBTs claim System 2 emerges from pretraining alone, on any modality. That's a stronger claim. Whether it survives at frontier scale is the open question.

My conjecture, labeled as such: the deeper unlock isn't any single trick. It's that compute is now cheap enough to backprop through 8–32 inference steps during training. Hessian-vector products were prohibitive at the scale 2019 EBMs were trying. Today they're a constant-factor overhead on top of a Transformer that costs $10M to train anyway.

Lead with the Win, Concede the Caveats

The headline numbers are real and peer-reviewed (ICLR 2026 oral). Every one comes with a caveat that a senior engineer will find on the second read of the table.

The wins. EBTs achieve "an up to 35% higher scaling rate" than Transformer++ across data, batch size, parameters, FLOPs, and depth. That is a slope improvement on the fitted scaling curves, not absolute speed at a fixed point. On image denoising, EBTs land higher PSNR (27.25 vs 26.58) and lower MSE (122.55 vs 142.98) than DiT at σ=0.1 noise, with 99% fewer forward passes. Given more inference compute, EBTs improve "29% more than the Transformer++": a delta-of-deltas, but a non-trivial one. The architectural sibling EBT-Policy (Davies et al., October 2025) beats Diffusion Policy on simulated and real robotic manipulation, converges in 2 inference steps versus 100 (~50× reduction), and recovers zero-shot from failed action sequences without retry training. That last result, in robotics, is the cleanest production-shape win EBTs have so far.
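The slope-versus-point distinction is easy to lose, so here is a minimal sketch with made-up numbers. I'm assuming "scaling rate" means the exponent b of a fitted power law, loss = a · compute^(−b). The constants below are synthetic and chosen only so that one curve has a 35% larger exponent; none of them come from the paper.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute).

    Returns (a, b), where b is the scaling exponent (the "rate").
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# Synthetic curves: the "steeper" model starts worse but improves faster.
compute = [10.0 ** k for k in range(1, 7)]
baseline = [10.0 * c ** -0.100 for c in compute]
steeper = [12.0 * c ** -0.135 for c in compute]

_, b_base = fit_power_law(compute, baseline)
_, b_steep = fit_power_law(compute, steeper)
rate_gain = b_steep / b_base - 1.0

print(f"fitted exponents: {b_base:.3f} vs {b_steep:.3f}")
print(f"relative rate gain: {rate_gain:.0%}")
print(f"loss at smallest budget: {baseline[0]:.2f} vs {steeper[0]:.2f}")
print(f"loss at largest budget:  {baseline[-1]:.2f} vs {steeper[-1]:.2f}")
```

In this toy, the steeper curve loses at the smallest budget and wins at the largest. A higher rate says where the curves are headed, not who is ahead at the scale you can afford today.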

The architecture comparison reads like this:

Dimension	Standard Transformer	Diffusion Transformer (DiT)	Energy-Based Transformer (EBT)
Output	Softmax over vocab logits	Predicted noise / velocity	Single scalar energy `E_θ(x, ŷ)`
Training	Cross-entropy on next token	Denoising score-matching	Supervised loss on `ŷ_N`, backprop through N-step inference
Inference	1 forward pass	N denoising steps (default 250)	N gradient steps × (forward + backward through energy head)
Test-time compute lever	Beam search, CoT	Number of denoising steps	N (steps) and M (random restarts)
Scaling claim	Chinchilla-optimal	Monotonic FID gains to 675M	"Up to 35% higher rate" vs Transformer++
Production deployments	Universal	Stable Diffusion, Flux, Sora-class	None known. 800M research artifact.
Math sidesteps	None — softmax is closed	Score parameterization	Never computes `Z`; backprops through inference loop

The caveats — and every one is in the paper. The scale ceiling is 800M parameters. Every claim is extrapolated from sub-1B scaling curves. Frontier transformers are 100B–10T. Whether the 35% slope holds, accelerates, or collapses past 1B is unknown. EBT loses to Transformer++ on GSM8K (43.3 vs 49.6 with thinking, per The Batch's reading of Table 3) — the strongest reasoning benchmark in the table is one EBT doesn't win. Pretraining perplexity is worse (33.43 vs 31.36). The "EBT generalizes better than its perplexity suggests" framing is real but selectively true.

The bidirectional EBT for masked text collapses. The paper's own admission: it predicts "the" for all masked tokens. This is classical EBM mode collapse, not fully solved, just routed around in the autoregressive variant. Training compute overhead is real: The Decoder reports 3.3×–6.6× more FLOPs to train, The Batch reports ~10× to reach matched perplexity. The two numbers measure different things. Both are caveats.

And the "29% more System 2 improvement"? Measured on perplexity. Not AIME, not MMLU-Pro, not HumanEval. The paper does not benchmark against o1 or DeepSeek-R1.

Lead with the win. But say the rest.

The 2026 Ecosystem

Ten months in, the field is treating this seriously. There's a paper trail, a workshop, a billion dollars, and a critique. There is no replication.

The theory side has caught up. Mathieu Blondel and collaborators at Google DeepMind (arXiv 2512.15605, three revisions in 2026) prove a function-space bijection: autoregressive language models are energy-based models. Not metaphorically. Bijectively. Nobody has yet retrofitted Llama or Mistral with EBT-style inference, but the math now says you could.

The practitioners are running the experiment. Ying Nian Wu's group at UCLA (arXiv 2602.06584, February 2026) gets a 0.2B model with 30 "rethinking" iterations to beat baselines 10–15× larger on math reasoning. Same energy-as-optimization framing, different head. And the small-model-beats-big-model result is exactly the "test-time compute beats parameter count" thesis the field has been arguing about since DeepSeek-R1.

The robotics side is shipping. EBT-Policy converges in 2 steps versus Diffusion Policy's 100. Recovery from failed action sequences without retraining. Robotics has fewer ideological tribes than language modeling. The architecture wins on the metrics or it doesn't, and EBT-Policy is winning.

The EBM workshop is back at ICLR. The NFAM workshop at ICLR 2026 (April 26, 2026, Rio) is the first dedicated associative-memory and EBM workshop at a top-tier venue in years. Speakers include Jay McClelland, Paul Liang, Ben Hoover. The fact that the workshop exists is the field signaling it's worth a workshop again.

The money is real, even if it's not branded EBT. AMI Labs raised $1.03B at $3.5B pre-money in March 2026 to build JEPA-based world models. Logical Intelligence launched in 2026 as "the World's First Energy-Based Model for Critical Systems," with LeCun as Founding Chair of the Technical Research Board. JEPA is EBM-flavored. World models are EBM-flavored. The 2026 industry narrative is energy-based, even if the EBT brand isn't the carrier.

The critique exists, and it's pointed. NRGPT v3 by Dehmamy, Hoover, Saha, Kozachkov, Slotine, and Krotov (arXiv 2512.16762, most recent revision May 1, 2026) calls out EBT for "implementation challenges, primarily due to the potential for information leakage in naïve implementations." It's the closest thing to a 2026 EBT skeptic paper, and it comes from a group that includes Dmitry Krotov, a long-time Hopfield-network theorist with no axe to grind against EBMs in general.

The gap nobody is closing: as of May 2026, no independent lab has published a replication of Gladstone's 35% scaling-rate claim. The supporting theory is real. The empirical confirmation is missing. Ten months in, that itself is the story.

The Strongest Skeptic Case

The case against EBTs is strong enough to engage seriously, and most of the argument comes from inside the paper.

The steelman, paraphrased: EBTs are an iterative-refinement Transformer with an "energy" framing. DiT already does iterative refinement at inference (250 denoising steps standard). Count PonderNet and Universal Transformers, and the lineage of "think longer at inference" architectures predates EBTs by years. If you took an EBT, dropped the Boltzmann interpretation, and called the scalar output "a learned step-size signal," the contribution becomes "stability tricks for training iterative-refinement Transformers at scale." Real. But more incremental than "40-year-old idea beats Transformers."

The benchmark wins concentrate on out-of-distribution and structured-reasoning tasks. The in-distribution losses (GSM8K, pretraining perplexity) are also real. Selecting only for the wins is selection bias, and a senior reviewer would catch it.

Inference cost is obscured. A single EBT inference step requires a forward and a backward pass through the energy head: roughly 2× the FLOPs of a vanilla Transformer pass. With N=4 gradient steps, that's 8× a standard Transformer's per-token inference cost, before you've gotten any "thinking" benefit relative to a standard Transformer that also spends 8× via beam search or longer chain-of-thought.
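The arithmetic is worth having as a function you can plug your own N into. This is a back-of-envelope sketch: it takes the rough "backward costs about as much as forward" figure from the paragraph above at face value, and it assumes random restarts (the M from the comparison table) multiply cost linearly. The function name and its parameters are mine.

```python
def ebt_cost_multiplier(
    n_steps: int,
    n_restarts: int = 1,
    passes_per_step: float = 2.0,
) -> float:
    """Per-token inference cost relative to one standard forward pass.

    passes_per_step=2.0 encodes the rough forward + backward estimate;
    n_restarts scales cost linearly, which is an assumption of this sketch.
    """
    if n_steps < 1 or n_restarts < 1:
        raise ValueError("n_steps and n_restarts must be at least 1")
    return passes_per_step * n_steps * n_restarts


for n in (1, 2, 4, 8):
    print(f"N={n}: {ebt_cost_multiplier(n):.0f}x a standard forward pass")
```

Swap in a measured `passes_per_step` if you have one. The 2.0 is an estimate, not a benchmark.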

The "EBM rebrand" version of the case, named: Gladstone et al. trained an iterative-refinement transformer with end-to-end backprop through the inference loop. The energy interpretation is mathematically clean but functionally close to a learned step-size schedule. The contribution is "stability tricks for iterative-refinement transformers at scale." That's a real contribution. It's not "the framework Hopfield invented in 1982 is back to beat Transformers."

My honest take, labeled as opinion: the strongest skeptic case is a hybrid. The empirical wins at this scale are mixed. The architectural lineage from existing iterative-refinement work is closer than the paper's framing implies. Until someone trains a 70B-parameter EBT and beats a 70B Llama on reasoning, "EBMs vindicated" is a thesis with promising data, not a settled result.

The right framing is narrower: the first time an EBM crossed 100M parameters and didn't collapse, with intriguing scaling that we can't verify at frontier scale yet.

That's calibration, not dismissal.

Where This Leaves You

Don't bet production on EBTs. Track them. Know what would change your mind.

Track the GitHub repo. github.com/alexiglad/EBT is Apache-2.0, ~627 stars at time of writing, and includes custom flash-attention with second-derivative support. If a third-party fork crosses 5B parameters with the scaling rate maintained, that's the signal.
Read NRGPT v3. arXiv 2512.16762 is the most rigorous 2026 alternative framing. The "information leakage in naïve implementations" critique is specific enough to read before you commit engineering time to a fork.
Watch JEPA and AMI Labs more than the EBT brand. $1.03B is going into the EBM-flavored cousin, not the EBT label. If the next big architectural deployment is energy-based, it's likely JEPA-shaped, not EBT-shaped — and the deployment will tell you which version of the framework actually shipped.
Don't migrate inference budgets yet. EBT inference is roughly 2N× standard Transformer per token. Without a frontier-scale reasoning win, the FLOP economics don't pencil for production serving.
Update your priors when one of three things happens. A 10B+ EBT is published. Someone independently reproduces the 35% scaling claim. A major lab announces an EBT-based deployment. None of these has happened. Two of them might in 2026.

The frame for senior engineers: EBTs are the post-Transformer architecture worth paying attention to because they could be wrong. The framework is old. The training trick is new. The scaling claim is unreplicated. The field is treating it seriously enough to fund the EBM-flavored adjacent work. That's the configuration where unexpected results land.

Closing

The 1982 framework was right about the math. Wrong about the training. The 2025 paper didn't change the math; it ducked it.

Hopfield wrote down the energy function 44 years ago. LeCun wrote the tutorial 20 years ago. Gladstone wrote the training loop last summer. The hard part is what it always was: showing it scales when nobody is paying you to ignore the caveats.

Sources

Gladstone et al., Energy-Based Transformers Are Scalable Learners and Thinkers (arXiv 2507.02092) — the central EBT paper, ICLR 2026 oral; cited for every benchmark number
Davies et al., EBT-Policy (arXiv 2510.27545) — robotics application; cited for the 50× speedup over Diffusion Policy
Blondel et al., Autoregressive Language Models Are Secretly Energy-Based Models (arXiv 2512.15605) — Google DeepMind theory paper proving the function-space bijection
Dehmamy et al., NRGPT v3 (arXiv 2512.16762) — 2026 alternative-framing paper, closest to a critique
Kong et al., Inference-Time Rethinking with Latent Thought Vectors (arXiv 2602.06584) — UCLA group, 0.2B model beating 10–15× larger baselines
LeCun, Chopra, Hadsell, Ranzato, Huang, A Tutorial on Energy-Based Learning (2006) — verbatim quotes on the partition-function tax
LeCun, A Path Towards Autonomous Machine Intelligence (2022) — System 2 = energy minimization argument
Du and Mordatch, Improved Contrastive Divergence Training of EBMs (arXiv 2012.01316) — documents the gradient term CD ignores
Goodfellow, Bengio, Courville, Deep Learning Ch. 18: Confronting the Partition Function — textbook framing of why EBM training is hard
The Royal Swedish Academy of Sciences, Nobel Prize in Physics 2024 Press Release — Hopfield and Hinton citation
Peebles and Xie, Scalable Diffusion Models with Transformers (arXiv 2212.09748) — the DiT comparison architecture
TechCrunch, AMI Labs Raises $1.03B (March 2026) — the funding context
Logical Intelligence — Yann LeCun page — "World's First Energy-Based Model for Critical Systems"
NFAM Workshop @ ICLR 2026 — first dedicated EBM workshop at a top-tier venue in years
Yann LeCun's LinkedIn endorsement of EBT (July 2025) — primary-source industry signal
Alexi Gladstone's blog post on EBTs (2025) — lead author's own framing, including the stability concession
DeepLearning.AI The Batch on EBTs (September 2025) — practitioner coverage; source for the ~10× FLOPs caveat
The Decoder on EBTs (July 2025) — source for the 3.3×–6.6× training-FLOP overhead
github.com/alexiglad/EBT — official EBT codebase, Apache-2.0

The Open Source AI Lie: Weight-Washing, Broken Definitions, and Who Benefits

Serendeep Rudraraju — Wed, 08 Apr 2026 18:38:17 GMT

Meta says Llama is open source. The Open Source Initiative, the organization that has maintained the definition of "open source" since 1998, says it isn't. Meta ignores them. A billion downloads later, the man who wrote the original Open Source Definition says the whole attempt to define open source AI has failed.

You've probably called Llama "open source" in an architecture doc at some point. I have. Most of us have. And we were wrong, in ways that have legal and regulatory consequences that aren't obvious until they bite you.

I should warn you up front: I started writing this to make a clean argument against weight-washing, and the other side's numbers kept getting in the way. A billion Llama downloads. Surgical copilots and maternal health chatbots in East Africa built on these models. Thirteen million HuggingFace users who never needed training data to build useful things. The case against caring about the definition is stronger than I wanted it to be.

TL;DR

No major AI model meets the Open Source AI Definition. Not Llama, not DeepSeek, not Mistral, not Qwen, not Gemma. Releasing weights without training data is the AI equivalent of distributing a compiled binary and calling it open source. The EU AI Act grants regulatory benefits to "open source" AI, which means getting the label right has financial consequences. Meanwhile, the people who wrote the original definition are fighting each other about whether their own compromise went too far.

A 28-Year-Old Definition Meets a Trillion-Dollar Industry

Quick history, because it matters.

In 1986, Richard Stallman published the Free Software Definition. In its modern form it lists four freedoms: use, study, modify, share. All of them depend on one prerequisite: access to the source code. Without it, "study" and "modify" are empty promises.

In 1998, Christine Peterson coined the term "open source" at a meeting in Palo Alto. Bruce Perens adapted the Debian Free Software Guidelines into the Open Source Definition. He and Eric Raymond founded the OSI to steward it. The definition's core requirement: access to the "preferred form of the work for making modifications." Source code. Not binaries. Not bytecode. The human-readable thing.

For 26 years, nobody argued about what "source" meant.

Then we started shipping AI models, and the word stopped being obvious.

An AI model isn't one thing. It's several: architecture code, training code, training data, and model weights. The weights are the output of training. The result, not the recipe. When Meta releases Llama's weights, it's handing you the end product of a process you can't see, can't reproduce, and can't audit. The architecture is there. The inference code is there. But the training data, the thing that shaped what the model actually learned, is nowhere.

Bruce Schneier put it bluntly in November 2024:

"Since for a neural network, the training data is the source code—it's how the model gets programmed—the definition makes no sense."

— Bruce Schneier, "AI Industry Is Trying to Subvert the Definition of Open Source AI"

Here's how the analogy maps:

graph LR
    subgraph Traditional Software
        A[Source Code] -->|compile| B[Binary / .exe]
    end
    subgraph AI Model
        C[Training Data] -->|train| D[Model Weights]
        E[Training Code] -->|train| D
    end

    classDef released fill:#22c55e,color:#000
    classDef withheld fill:#ef4444,color:#fff

    class A released
    class B,C,D,E withheld

Green = what "open source" requires you to release. Red = what most AI companies actually withhold. The weights are the compiled artifact. The training data is the source.

That comparison sticks. Releasing weights without training data is like shipping a .exe and calling it open source. Sure, you can run it. You can even fine-tune it, the way you might hex-edit a binary and hope for the best. What you can't do is figure out how it was built, reproduce it, check whether the safety claims hold up, or fix the training process when something goes wrong.

The Honesty Audit

Enough abstraction. I went through the five most-downloaded "open" AI models and checked what they actually give you.

	Llama 3	DeepSeek R1	Mistral 7B	Qwen 2.5	Gemma 2
Weights	Yes	Yes	Yes	Yes	Yes
Inference code	Yes	Yes	Yes	Yes	Yes
Training code	No	Partial	No	No	No
Training data	No	No	No	No	No
License	Custom (Meta)	MIT	Apache 2.0	Apache 2.0	Custom (Google)
OSI-approved license	No	Yes	Yes	Yes	No
Commercial restrictions	700M MAU cap	None	None	None	Yes
Use restrictions	Acceptable use policy	Separate policy	None	None	Yes
Calls itself "open source"	Yes	Yes	Varies	Yes	No
Passes OSAID 1.0	No	No	No	No	No

Stare at that table for a second. The training code and training data rows are some variation of No. Zero training data across the board. Not one passes OSAID 1.0.

The details are worth unpacking, though, because these companies aren't all doing the same thing.

Llama is the worst offender. Meta wrote its own license—not OSI-approved—that caps commercial use at 700 million monthly active users. Think about who that cap targets. It's not protecting indie developers. It's letting Meta harvest community contributions while making sure Google, Amazon, and Microsoft can't compete with Llama derivatives. There's an acceptable use policy restricting whole categories of applications. The Free Software Foundation classified Llama 3.1 as nonfree in January 2025. Google and Microsoft, when asked, agreed to stop calling their restricted models "open source." Meta refused.

DeepSeek R1 comes closest to honesty. MIT license, same one used by jQuery, Rails, and Node.js. No MAU caps, no use restrictions, nothing weird in the grant. But no training data, no full training pipeline. Sit with this for a moment: a Chinese company backed by a quantitative trading firm ships under a more permissive license than the American social media company that won't shut up about "open source AI" as a force for democracy.

Mistral earned enormous goodwill by releasing Mistral 7B under Apache 2.0 in September 2023. Then they pivoted. Larger, more capable models went behind proprietary licenses or API-only access. CEO Arthur Mensch reframed the strategy as "open science" rather than "open source." Credit where it's due: at least that's a more honest label than what Meta uses.

Qwen 2.5 (Alibaba) ships under Apache 2.0, no restrictions. Same playbook as DeepSeek. Whether that's genuine openness or market penetration dressed up nicely, I'll leave to you.

Gemma surprised me. Google calls it "open weights," not "open source." The license is custom and restrictive, which is annoying. But the labeling is honest. Google watched Meta catch heat and apparently decided that not lying about what they're releasing was worth more than the marketing bump.

The models that actually pass the definition? Pythia from EleutherAI. OLMo from AI2. T5 from Google Research. Amber from LLM360. Full code, full weights, full training data. You've almost certainly never shipped any of them to production.

The Institutional Crisis Nobody's Talking About

The OSI spent two years trying to fix this. Twenty-five organizations at the table: Microsoft, Google, Meta, Amazon, the usual suspects. On October 28, 2024, at the All Things Open conference, they published OSAID 1.0.

The compromise: you need code, weights, and "sufficiently detailed information about the data used to train the system, so that a skilled person can build a substantially equivalent system." Not the actual data. A description of the data.

Purists hated it. A description isn't a dataset. Pragmatists ignored it. The community was already building on weights and didn't care what any definition said. The OSI managed to publish something both sides could attack, which is impressive in its own way.

Then it got worse.

In March 2025, Bradley Kuhn of the Software Freedom Conservancy and Richard Fontana of Red Hat ran for the OSI board. Their platform: repeal OSAID 1.0. They made it through the election. Then, about an hour after voting closed, OSI emailed non-incumbent candidates with a Board Member Agreement they had 47 hours to sign. Buried in it: a clause requiring board members to "support publicly all Board decisions, especially those that do not have unanimous consent."

Kuhn and Fontana struck the gag clause and sent it back with alternative language allowing public dissent. OSI said the modifications were invalid. Disqualified both. Threw out every vote cast for them.

Before that, a Debian developer named Luke Faraone had been rejected as a candidate: he submitted his application at 9 PM Pacific time, and OSI retroactively declared the deadline was UTC, which made him late. A community petition demanding full vote counts pulled 88% support. OSI didn't release them.

Bruce Perens, the man who wrote the Open Source Definition in 1998, watched all of this play out and said what a lot of people were thinking:

"The problem before the Open Source AI Definition was openwashing, saying that something was open source when it was not. They hoped that an AI-specific definition would reduce openwashing. If you look at the OSI's own anniversary report, the problem now that the definition is a year old, is... openwashing."

— Bruce Perens, FOSS Force, September 2025

He's now working on something called the "Post-Open" framework, a licensing model that moves beyond open source entirely. The guy who co-founded the OSI has decided the concept he helped create can't stretch to cover AI. I don't know what clearer signal you need that this is broken.

The Counter-Argument You Can't Dismiss

This is the part where the argument I've been building runs into a wall.

Thirteen million HuggingFace users. Two million public models, nearly all built by fine-tuning or distilling weights that came with no training data attached. A billion Llama downloads. Qwen alone spawned 113,000 derivative models. According to Epoch AI, open-weight models lag closed-source state-of-the-art by about three months now, down from a much larger gap. On some benchmarks the difference shrank from 8% to 1.7% in a single year.

Nobody needed training data for any of that.

And the downstream impact is concrete:

Domain	Project	What It Does
Healthcare	Mendel AI (Llama 3)	36% improvement in clinical record extraction
Surgery	Activ Surgical (Llama 3)	Real-time AI surgical copilot
Medical QA	DeepSeek-R1-Distill	>92% accuracy on USMLE Step 1
Agriculture	Digital Green (Llama)	Multilingual advisory for developing nations
Maternal health	Jacaranda PROMPTS (Llama)	AI clinical help desk across Kenya, Ghana, Eswatini

Mendel didn't need Meta's training data to hit 36% improvement. Jacaranda didn't audit Llama's training pipeline before building an SMS-based maternal health system for three African countries. These are shipping products. People are healthier because of them. And they were built on weights that fail every open source purity test I've outlined above.

Yann LeCun, formerly Meta's chief AI scientist and now running AMI Labs, frames it as a matter of principle:

"In the future, our entire information diet is going to be mediated by [AI] systems. They will constitute basically the repository of all human knowledge. And you cannot have this kind of dependency on a proprietary, closed system."

— Yann LeCun, Yann LeCun On How An Open Source Approach Could Shape AI

The pragmatist's case goes further than vibes. Training data release is a legal minefield. The US Copyright Office concluded in a May 2025 report that AI training on copyrighted works is not categorically fair use. These datasets contain trillions of tokens scraped from millions of copyrighted sources. Nobody is getting redistribution rights for all of that. In healthcare, GDPR and HIPAA make the data unshareable by law. And even if someone handed you the complete training data and code for Llama 3, you'd need north of $100 million in compute to reproduce the training run. The data is meaningful in theory and useless in practice to basically everyone who would download it.

Then there's geography. Chinese models (DeepSeek under MIT, Qwen under Apache 2.0) now make up 41% of HuggingFace downloads, more than US-origin models. If stricter openness requirements make American companies look less open by comparison, the ecosystem just shifts further east. That's not an argument for or against anything, but it's a thing that's happening.

I keep turning this over. Open weights aren't open source, but they're enormously better than the closed alternative. Making the definition stricter might produce fewer open releases, not more. That argument is mostly right. But it's not entirely right.

Why It Still Matters

Open weights being valuable doesn't make calling them "open source" harmless. Those are different claims.

The regulatory loophole is already being exploited. The EU AI Act, Article 53, gives lighter compliance obligations to "open source" AI. That exemption was written by people who assumed the phrase meant something specific. If Meta can stick "open source" on Llama and pocket the regulatory relief, that's not a definitional quibble. It's money. The exemption has a hole in it, and companies are walking through.

You can't audit what you can't see. About 5% of AI researchers share code in their papers. Model cards on HuggingFace use 947 different section naming conventions, so there's no consistency in what gets documented. When a company claims their model was tested for bias, deduped for harmful content, filtered for quality, and then hands you only the weights, what you have is a claim without evidence. You can observe the model's outputs. You cannot investigate its inputs. If it exhibits bias, you can describe the symptoms. You can't diagnose the cause.

Copyright law might not work here at all. The D.C. Circuit ruled in Thaler v. Perlmutter (March 2025, cert denied 2026) that an AI cannot be the author of a copyrighted work. Follow the logic: if AI-generated code can't be copyrighted, then open source licenses, which are copyright licenses, might not attach to AI output. The entire legal mechanism that makes open source work might not apply. This isn't a hypothetical edge case. It's an unresolved question that affects everyone building on these models, and I haven't seen a convincing answer from anyone.

And the erosion compounds. "Open source" accumulated meaning over 28 years through a specific deal: you can see what you're running. Inspect it. Reproduce it. Improve it. Each time Meta puts that label on a model with a custom restrictive license and zero training data, the deal gets a little weaker. The words absorb more ambiguity. At some point "open source" just means "you can download it," which is what Meta wants, because then the label is free and the obligation is zero.

Where This Leaves You

I've been going back and forth on this for weeks, and I don't think there's a clean resolution.

If training data is the source code of AI, and I think Schneier's analogy holds, then nothing from Meta, DeepSeek, Mistral, Alibaba, or Google qualifies as open source. The four freedoms require that you can see and reproduce the thing you're using. Weights don't give you that.

But thirteen million people built useful things with weights alone. A maternal health system in Kenya doesn't care about definitional purity. The 1998 definition was written for a world where "source" meant text files you could read and compile. It doesn't map cleanly onto a trillion tokens scraped from the internet, tangled in copyright, privacy law, and trade secrets.

I land here: open weights are good. Calling them "open source" is bad. Both of those can be true at the same time.

Some things you can do with that:

Stop writing "open source" in your architecture docs when you mean Llama. Say "open weights." It's accurate, your compliance team won't get confused, and it doesn't corrode a phrase that still means something for actual software.

Read the license. I know, nobody does. But Llama's 700 million MAU cap has already bitten companies that assumed "open source" meant no strings. DeepSeek's MIT license actually has no strings. Those are different things and they matter when lawyers get involved.

If you need reproducibility, if you need to audit what a model learned or verify a safety claim or understand why it's producing biased output, use OLMo or Pythia. They're not as capable as Llama for most tasks. They're the only ones that earn the label.

Keep an eye on EU AI Act enforcement. The GPAI obligations kicked in August 2025. Regulators may end up caring about the definition more than the open source community does, and "we called it open source on our website" is going to be an awkward defense when it clearly isn't.

Open source meant something specific for 28 years. The AI industry would very much like you to forget what.

Sources

Bruce Schneier: "AI Industry Is Trying to Subvert the Definition of Open Source AI" — The source-code-as-training-data argument
OSI: Open Source AI Definition 1.0 — The official (contested) definition
Liesenfeld & Dingemanse, "Rethinking open source generative AI," ACM FAccT '24 — Academic framework for measuring AI openness
FOSS Force: Brock and Perens Reflect on a Year of Open Source AI Debate — Perens's one-year assessment
Software Freedom Conservancy: "OSAID Erodes the Meaning of Open Source" — Kuhn's opposition to OSAID
The Register: OSI Election AI Drama — Board election controversy
Mark Zuckerberg: "Open Source AI Is the Path Forward" — Meta's strategic case
Epoch AI: Open Weights vs. Closed Weights Models — Performance gap data
HuggingFace: State of Open Source Spring 2026 — Ecosystem statistics
EU AI Act, Article 53 — Open source exemption text
Hunton Andrews Kurth: "How Open Are Open Source AI Models Really?" — Legal analysis
Thaler v. Perlmutter, D.C. Circuit (March 2025) — AI cannot be a copyright author
TechCrunch: Llama Models Hit 1B Downloads — Adoption numbers
MIT Technology Review: DeepSeek — DeepSeek cost and capabilities

14 min read

The Post-Transformer Era: State Space Models, Mamba, and What Comes After Attention

Serendeep Rudraraju — Tue, 10 Feb 2026 18:42:35 GMT

What if the most important architectural innovation since Transformers isn't trying to replace attention — but to escape its quadratic scaling problem entirely?

I've been watching State Space Models go from "interesting paper" to "IBM ships it in production" in about two years. Mamba showed up in December 2023 as a research curiosity. By late 2025, IBM built Granite 4.0 on it. AI21 shipped Jamba with 256K context on a single GPU. Mistral released Codestral Mamba and it beat CodeLlama 34B at code generation — with a pure SSM, no attention at all.

The field moved fast enough that most practitioners I talk to are still working off outdated assumptions. "Mamba can't do in-context learning." "SSMs are just fancy RNNs." "You need special hardware." None of that is true anymore, and the gap between what people think and what's actually shipping is getting wider.

TL;DR

This post covers how selective state spaces work, why they scale linearly where Transformers scale quadratically, and which production models you should care about. The short version: Mamba achieves 5x higher throughput than Transformers with O(n) scaling. But pure SSMs still struggle with retrieval tasks. Hybrid architectures — a handful of attention layers mixed into a stack of Mamba layers — are winning in production. You'll walk away with a decision framework for when to use what.

The quadratic problem

Every Transformer layer computes this:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

That QK^T term is an n × n matrix, where n is your sequence length. Every token attends to every other token. The complexity is O(n² · d) per layer.
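To make the n × n term concrete, here is a minimal NumPy sketch of single-head attention. The toy sizes (8 tokens, 4 dimensions) and the random inputs are assumptions for illustration, and there is no masking or batching.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for a single head (no masking)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # shape (n, n): one score per token pair
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 8, 4                                         # toy sizes, chosen for illustration
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, weights = attention(Q, K, V)

print(weights.shape)  # (8, 8)
print(out.shape)      # (8, 4)
```

The weight matrix is the part that grows: double n and it has four times as many entries, while the output only doubles.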

When Vaswani et al. published "Attention Is All You Need" in 2017, sequences were 512 tokens long. The quadratic cost was a rounding error. Then context windows started growing.

Sequence Length	Attention Pairs	KV Cache (7B model, fp16)
2K (GPT-3 era)	4 million	~1 GB
4K tokens	16 million	~2 GB
32K tokens	1 billion	~16 GB
128K tokens	16.4 billion	~64 GB
1M tokens	1 trillion	~500 GB (impractical)

That 128K row is where things get ugly. A 7B parameter Transformer with full multi-head attention needs roughly 64 GB at 128K context just for the KV cache in half precision, and the cache keeps growing linearly from there. That's the memory cost of storing key and value tensors so each new token can attend to everything before it. The model weights themselves are only about 14 GB in half precision. The cache dwarfs the model.

# The scaling gap in one snippet
def attention_flops(seq_len):
    return seq_len ** 2  # O(n²): one score per token pair

def mamba_flops(seq_len):
    return seq_len       # O(n): one state update per token

# At 128K (128,000) tokens:
# Attention: 128,000² = 16,384,000,000 (16.4B pairwise ops per head)
# Mamba:     128,000  (128K state updates)
# That's a 128,000x difference. Per layer.

This wasn't a problem when GPT-3 had a 2K context window. It became a problem when the field decided it needed models that could read entire codebases, process hour-long transcripts, and maintain conversations that span days. Claude runs at 200K context. Gemini hit 1M+. Reaching those numbers with pure attention requires staggering amounts of memory and compute.

The whole industry spent 2023-2024 trying to fix this with engineering patches. FlashAttention. KV cache quantization. Sliding window attention. Ring attention. All useful. None of them change the fundamental math. The complexity is still quadratic. You're just making each unit of quadratic work cheaper.

State Space Models take a different approach: change the math.

How selective state spaces work

The lineage goes HiPPO (2020) → S4 (2021) → Mamba (2023). Each step solved a specific limitation.

HiPPO (Albert Gu, 2020) figured out that you could represent a running history of a sequence as coefficients of orthogonal polynomials — Legendre, Laguerre — updated continuously. Think of it as a mathematical compression scheme: instead of storing every past token, you project the history onto a set of basis functions and keep just the coefficients. This gave SSMs a principled way to compress long-range context into a fixed-size state without the information just decaying to zero like it does in vanilla RNNs.
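The compression idea is easy to try in a static setting. The sketch below is not the HiPPO update itself (which maintains the coefficients online as new samples arrive); it is a one-shot least-squares projection of a fixed window onto Legendre polynomials, with a made-up test signal and N = 16 chosen for illustration.

```python
import numpy as np
from numpy.polynomial import legendre

# A toy "history": 1,000 samples of a smooth signal on the interval [-1, 1].
t = np.linspace(-1.0, 1.0, 1000)
signal = np.sin(3 * t) + 0.5 * np.cos(7 * t)

# Least-squares projection onto Legendre polynomials of degree 0..N-1.
N = 16
coeffs = legendre.legfit(t, signal, deg=N - 1)   # N numbers summarize 1,000 samples

# Reconstruct the whole history from the coefficients alone.
reconstruction = legendre.legval(t, coeffs)
max_error = np.max(np.abs(signal - reconstruction))

print(coeffs.shape)   # (16,)
print(max_error)      # small for a smooth signal like this one
```

Sixteen coefficients stand in for a thousand samples. A rougher signal would need more coefficients for the same error, which is the trade-off the state size controls.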

S4 (2021-2022) proved that properly structured SSMs, initialized with HiPPO matrices, could handle sequences of tens of thousands of steps, demolishing Transformers on the Long Range Arena benchmark. S4 exploited a key equivalence: a linear time-invariant SSM can be computed as a convolution, allowing parallel training on GPUs. This spawned a family of variants (S4D, S5, DSS) through 2022, each simplifying the parameterization.

But S4 had a fatal limitation: its parameters were fixed. The A, B, C matrices didn't change based on what the model was actually reading. Every token got processed identically. The model couldn't decide "this token matters, pay attention" versus "this is noise, forget it." In the paper's language, S4 lacked content-based reasoning.

Mamba (Albert Gu & Tri Dao, December 2023) fixed exactly that problem. The core idea: make the SSM parameters functions of the input.

The underlying system is deceptively simple. You have a continuous-time state equation:

h'(t) = A · h(t) + B · x(t)
y(t)  = C · h(t)

State h gets updated based on input x, modulated by matrices A, B, C. Output y reads from the state through C. Discretize it (zero-order hold) and you get a recurrence:

h_t = Ā · h_{t-1} + B̄ · x_t
y_t = C · h_t

Each step is O(N) where N is the state dimension, constant with respect to sequence length. The state h is a fixed-size vector regardless of whether you've processed 100 tokens or 100,000.
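Here is a minimal NumPy sketch of that pipeline for a single input channel, assuming a diagonal A (as in S4D-style variants) so the zero-order hold has a closed form. The sizes, the step size 0.1, and the random C and x are illustrative assumptions. It also checks the equivalence S4 relies on: with fixed parameters, the recurrence and a convolution give the same output.

```python
import numpy as np

def discretize_zoh(A_diag, B, delta):
    """Zero-order hold for diagonal A: A_bar = exp(dA), B_bar = (exp(dA) - 1) / A * B."""
    A_bar = np.exp(delta * A_diag)
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

def run_recurrence(A_bar, B_bar, C, x):
    """h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C . h_t (O(N) work per step)."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

def run_convolution(A_bar, B_bar, C, x):
    """Same output via the fixed kernel K_k = C . (A_bar^k * B_bar)."""
    L = len(x)
    K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
    return np.convolve(x, K)[:L]

rng = np.random.default_rng(0)
N, L = 4, 32                                   # state size and sequence length (toy values)
A_diag = -np.arange(1, N + 1, dtype=float)     # stable (negative) diagonal entries
B = np.ones(N)
C = rng.standard_normal(N)
x = rng.standard_normal(L)

A_bar, B_bar = discretize_zoh(A_diag, B, delta=0.1)
y_rec = run_recurrence(A_bar, B_bar, C, x)
y_conv = run_convolution(A_bar, B_bar, C, x)
print(np.allclose(y_rec, y_conv))              # True
```

The recurrent form is what you want at inference time (constant memory per step). The convolutional form is what makes training parallel, but only while Ā, B̄, and C stay fixed across the sequence.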

What makes Mamba different from every SSM before it is selectivity. Instead of fixed parameters:

Δ: input-dependent  ← softplus(Parameter + s_Δ(x))
B: input-dependent  ← Linear(x)
C: input-dependent  ← Linear(x)
A: fixed            ← remains static

The step size Δ controls how much the model focuses on the current input versus preserving previous state. Large Δ means "gate open, let this in." Small Δ means "gate closed, keep what I have." B and C also adapt to the input, allowing content-dependent reading and writing of state.

This is formally a generalization of RNN gating. The Mamba paper proves it (Theorem 1): when N=1, A=−1, B=1, the selective SSM reduces to h_t = (1 − g_t) · h_{t-1} + g_t · x_t, which is exactly the classical gated recurrence. But with N=16 (the default), each channel carries a 16-dimensional state instead of a single gated scalar.
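The gating reduction can be checked numerically. In this sketch the array z is a random stand-in for the pre-activation (Parameter + s_Δ(x_t)), which is an assumption for illustration; with zero-order hold, Δ = softplus(z) turns out to give exactly the gate g = sigmoid(z).

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
L = 50
x = rng.standard_normal(L)
z = rng.standard_normal(L)        # stand-in for Parameter + s_delta(x_t)

# Selective SSM with N = 1, A = -1, B = 1, discretized with zero-order hold.
A = -1.0
delta = softplus(z)               # input-dependent step size
A_bar = np.exp(delta * A)         # exp(-delta_t)
B_bar = (A_bar - 1.0) / A         # 1 - exp(-delta_t)

h_ssm, h = [], 0.0
for t in range(L):
    h = A_bar[t] * h + B_bar[t] * x[t]
    h_ssm.append(h)

# Classical gated recurrence with g_t = sigmoid(z_t).
g = sigmoid(z)
h_gate, h = [], 0.0
for t in range(L):
    h = (1.0 - g[t]) * h + g[t] * x[t]
    h_gate.append(h)

print(np.allclose(h_ssm, h_gate))  # True
```

The identity behind it is exp(−softplus(z)) = 1 − sigmoid(z), so a large Δ pushes the gate toward 1 (take the input) and a small Δ pushes it toward 0 (keep the state).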

flowchart TD
    Input([Input]) --> LP1[Linear Projection]
    Input --> Skip((skip))
    LP1 --> Conv[Conv1D]
    Conv --> SSM[Selective SSM]
    SSM --> LP2[Linear Projection]
    Skip --> LP2
    LP2 --> Output([Output])

Here's the catch: making parameters input-dependent breaks the convolution equivalence that S4 relied on for fast parallel training. You can't precompute a fixed convolution kernel when the kernel changes at every step. Mamba sidesteps this with a hardware-aware parallel scan algorithm.

Instead of materializing the full expanded state (shape B×L×D×N) in GPU HBM (slow memory), Mamba loads parameters into SRAM (fast memory), performs discretization and the recurrence in SRAM, and writes only the output (shape B×L×D) back to HBM. The paper reports the fused scan as 20-40x faster than a standard PyTorch scan implementation, and up to 3x faster than previous methods on A100s. During training, intermediate states are recomputed during backprop instead of stored, trading compute for memory.
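Why a recurrence can be parallelized at all: the update h_t = a_t · h_{t-1} + b_t composes associatively, so partial results can be combined in a tree. The sketch below uses a simple Hillis-Steele scan in NumPy to show the idea. It is an illustration of the associativity only, not Mamba's fused CUDA kernel, and the scalar state and random inputs are assumptions.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, one step at a time."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def parallel_scan(a, b):
    """Hillis-Steele inclusive scan: about log2(L) rounds, each fully vectorized."""
    a, b = a.copy(), b.copy()
    L = len(a)
    shift = 1
    while shift < L:
        # Combine each element with the partial result `shift` positions earlier:
        # (a_prev, b_prev) then (a_cur, b_cur) -> (a_prev * a_cur, a_cur * b_prev + b_cur).
        # Positions with no earlier partner get the identity element (1, 0).
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        b = a * b_prev + b        # uses the current a before it is updated
        a = a * a_prev
        shift *= 2
    return b

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=37)   # per-step decay (input-dependent in Mamba)
b = rng.standard_normal(37)          # per-step input contribution
print(np.allclose(sequential_scan(a, b), parallel_scan(a, b)))  # True
```

Because a_t can differ at every step, this still works when the parameters depend on the input, which is exactly the case where the fixed-kernel convolution trick stops applying.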

The architecture stacks these blocks with expansion factor 2, SiLU activation, and LayerNorm. No positional encoding needed. The recurrence inherently provides position information. Two Mamba blocks per layer match the parameter count (12D²) of a standard Transformer's MHA + MLP.

The result: Mamba-3B matches Transformer-6B quality on language modeling. Mamba-2.8B hits 63.3% zero-shot accuracy versus Pythia-2.8B's 59.1%. 5x higher generation throughput. Linear scaling to million-length sequences. On DNA modeling at 1M sequence length, Mamba's quality improves with context while HyenaDNA degrades.

Mamba-2 and Mamba-3

Mamba-1 proved the concept. The follow-ups refined it.

Mamba-2 (Tri Dao & Albert Gu, May 2024) introduced the State Space Duality (SSD) framework, a mathematical proof that SSMs and attention are dual representations of the same underlying computation on structured matrices. The paper title says it plainly: "Transformers are SSMs."

The key insight is that a selective SSM can be written as a lower-triangular matrix multiplication y = M · x, where M encodes both the causal mask (like attention) and the state decay (like a recurrence). When the decay factors are all 1, this reduces exactly to causal linear attention. The SSM view computes it in O(n) via recurrence. The attention view computes the same thing in O(n²) via matrix multiplication. Same function, two algorithms.
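The duality is small enough to verify directly. This sketch assumes a single channel with a scalar per-step decay a_t and random B_t, C_t (toy sizes, not the paper's chunked algorithm), builds the lower-triangular matrix M explicitly, and compares it against the recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 16, 4                             # sequence length and state size (toy values)
a = rng.uniform(0.5, 1.0, size=L)        # per-step scalar decay
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
x = rng.standard_normal(L)

def ssm_recurrent(a, B, C, x):
    """O(L) view: carry a fixed-size state forward."""
    h = np.zeros(B.shape[1])
    y = np.empty(len(x))
    for t in range(len(x)):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_matrix(a, B, C, x):
    """O(L^2) view: y = M x with M[i, j] = (C_i . B_j) * a[j+1] * ... * a[i] for j <= i."""
    log_cum = np.cumsum(np.log(a))
    decay = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    M = decay * (C @ B.T)                # attention-like scores, causally masked and decayed
    return M @ x

y_rec = ssm_recurrent(a, B, C, x)
y_mat = ssm_matrix(a, B, C, x)
print(np.allclose(y_rec, y_mat))         # True

# With every decay equal to 1, M is just the causal mask applied to C B^T,
# which is causal linear attention with C as queries and B as keys.
y_no_decay = ssm_recurrent(np.ones(L), B, C, x)
y_lin_attn = np.tril(C @ B.T) @ x
print(np.allclose(y_no_decay, y_lin_attn))  # True
```

Mamba-2's chunked algorithm sits between the two views: matrix multiplications inside each chunk, a recurrence across chunks.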

Practically, Mamba-2 is 2-8x faster than Mamba-1 on training. It replaces the scan-based computation with chunkwise matrix multiplications that GPUs are optimized for. The implementation is about 30 lines of PyTorch. Larger state sizes (up to 16x bigger than Mamba-1) substantially improve retrieval tasks.

Mamba-3 (2025) attacked three specific weaknesses:

Trapezoidal discretization: Mamba-1/2 discretized the continuous system with a first-order rule (zero-order hold on A, paired with an Euler-style approximation for B). Mamba-3 upgrades to the trapezoidal rule, a second-order method: more accurate, better quality at the same state size.
Complex-valued states: Mamba-2's real-valued states provably cannot solve certain state-tracking tasks. Mamba-3 switches to complex-valued state spaces. Look at the numbers:

Task	Mamba-2	Mamba-3
Parity	~0.9% (near random)	100%
Modular Arithmetic	Fails	Solves

0.9% to 100%. That's not an improvement, that's a different model. Complex-valued SSMs turn out to be connected to Data-Dependent Rotary Position Embeddings (RoPE), which bridges SSM theory with a technique Transformer practitioners already use.

MIMO formulation: Multi-Input Multi-Output increases arithmetic intensity, trading compute for lower perplexity without increasing memory. You get better hardware utilization without paying for it in VRAM.
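The complex-valued point in the list above has a simple intuition: parity needs a state that flips, and a rotation can flip where a nonnegative real decay can only shrink. The sketch below is my own illustration of that idea, not Mamba-3's actual parameterization. The state rotates by π for every 1-bit, and the parity is read off the sign of the real part.

```python
import numpy as np

def parity_complex(bits):
    """Unit-modulus complex state: rotate by pi on a 1-bit, leave unchanged on a 0-bit."""
    h = 1.0 + 0.0j
    for bit in bits:
        h = np.exp(1j * np.pi * bit) * h    # input-dependent rotation, |h| stays 1
    return int(round((1.0 - h.real) / 2.0)) # h = +1 -> even (0), h = -1 -> odd (1)

def parity_reference(bits):
    return int(sum(bits) % 2)

rng = np.random.default_rng(0)
sequences = [rng.integers(0, 2, size=200).tolist() for _ in range(100)]
matches = sum(parity_complex(s) == parity_reference(s) for s in sequences)
print(matches, "of", len(sequences))        # 100 of 100
```

A real-valued state multiplied by a decay in [0, 1] never changes sign on its own, so it has no way to represent "flip on every 1."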

Production hybrid architectures

The theory is interesting. What matters is what ships. Six production models tell the story.

AI21 Jamba

The first production-scale Mamba deployment. Jamba interleaves Mamba layers with attention layers at a 1:7 attention-to-Mamba ratio (one attention layer in every eight layers), plus Mixture-of-Experts routing.

Spec	Value
Active parameters	12B (52B total MoE)
Context length	256K tokens
Attention cache at 256K	4 GB
Equivalent Transformer cache	128 GB (Llama-2 7B)

Read those last two rows again. 4 GB versus 128 GB. That's the difference between "runs on one 80GB GPU" and "needs a multi-node cluster." Jamba fits 140K tokens of context on a single A100.
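Both cache figures fall out of back-of-the-envelope arithmetic. The layer and head counts below are my assumptions about the two architectures (a dense Llama-2-7B-style model with 32 attention layers and 32 KV heads, and a Jamba-style hybrid with 4 attention layers and 8 KV heads), not numbers taken from this post, so treat the result as a sanity check.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values, stored for every attention layer, for every token."""
    per_token = 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_value
    return seq_len * per_token

GIB = 1024 ** 3
seq_len = 256 * 1024   # 256K tokens

# Assumed configs (hypothetical stand-ins): head dimension 128, 16-bit cache values.
dense = kv_cache_bytes(seq_len, n_attn_layers=32, n_kv_heads=32, head_dim=128)
hybrid = kv_cache_bytes(seq_len, n_attn_layers=4, n_kv_heads=8, head_dim=128)

print(dense / GIB)    # 128.0
print(hybrid / GIB)   # 4.0
```

Two levers do the work: far fewer layers that keep a cache at all, and fewer KV heads in the layers that do. The Mamba layers carry a fixed-size state, so they add nothing that grows with sequence length.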

Benchmarks: 87.1% HellaSwag, 67.4% MMLU, 59.9% GSM8K (chain-of-thought). 3x faster token generation than Mixtral on long-context tasks.

A surprising design choice: Jamba uses Mamba-1, not Mamba-2. AI21 found that in a hybrid setup, Mamba-1 + Attention outperformed Mamba-2 + Attention. The engineering reality doesn't always follow the paper chronology.

IBM Bamba-9B

A hybrid with 29 SSM layers and 3 attention layers, built on Mamba-2. Trained on 2.2T tokens (v1) and roughly 3T tokens (v2).

The inference numbers: 2.5x throughput improvement over standard Transformers in vLLM, 2x latency reduction. Quantized from 18 GB to 9 GB with minimal quality loss. Bamba-9B v2 outperforms Llama 3.1 8B on standard leaderboards, despite Llama training on roughly 5x more data. That's architectural efficiency winning over brute-force scaling.

The v2 training process was unusual: IBM trained two separate models to 3T tokens with different learning rate schedules, merged them using MergeKit weighted averaging, then annealed on 100B high-quality tokens. Training recipes matter as much as architecture choices.

NVIDIA Hymba-1.5B

Hymba does something different: parallel hybrid heads. Instead of interleaving Mamba and attention in separate layers (like Jamba), Hymba runs both in the same layer simultaneously. Attention and Mamba process the same input in parallel, then their outputs combine.

Other interesting choices: 128 learnable meta tokens prepended to every sequence (they absorb global information and reduce attention overhead), cross-layer KV cache sharing between consecutive attention layers, and full attention in only 3 of its layers. First, middle, last. That's it.

At 1.5B parameters, Hymba outperforms Llama-3.2-1B and uses 10x less KV cache memory on A100.

IBM Granite 4.0

IBM went aggressive with the Mamba ratio: 9 Mamba-2 blocks per 1 Transformer block in a 7B MoE model. The results justify the bet — 82.41% on HumanEval, 70%+ lower memory requirements than comparable Transformers, 2x faster inference. Apache 2.0 license, 12-language support.

IBM isn't shipping this as a research preview. It's a production model with SLAs. That tells you where enterprise AI thinks this is going.

Mistral Codestral Mamba

This one surprised me. Codestral Mamba is pure Mamba-2, no attention layers at all, with 7.28B parameters.

Benchmark	Codestral Mamba	CodeGemma 7B	CodeLlama 34B
HumanEval	75.0%	61.0%	31.1%
HumanEval C++	59.8%	49.1%	—
HumanEval JS	61.5%	52.2%	—

A 7B pure SSM beating a 34B Transformer at code generation. Code has enough structure and locality that the selective state mechanism captures what matters without global attention. If you're building a code-focused product, pure Mamba is a real option.

NVIDIA Nemotron-H

Replaces 92% of attention layers with Mamba-2 blocks. Up to 3x throughput over LLaMA-3.1 and Qwen-2.5 at comparable sizes. Across all six of these models, the same pattern: the ratio of attention to Mamba keeps shrinking, and quality holds.

When to use what

After staring at benchmarks and ablation studies for weeks, here's the decision framework I'd use:

Scenario	Architecture	Why
Long context (>32K)	Hybrid	Linear memory + attention quality
Code generation	Pure Mamba	Structured tasks don't need global attention
Streaming / real-time	Pure Mamba	Constant memory per step
Complex reasoning	Transformer or Hybrid	Attention excels at in-context learning
Memory-constrained deployment	Mamba or Hybrid	Linear scaling wins
Retrieval-heavy RAG	Hybrid (mandatory)	Attention is required for retrieval
Edge deployment (<2B params)	Hymba-style parallel	Best efficiency at small scale

The retrieval row deserves emphasis. A 2025 ablation study on hybrid models (RecurrentGemma, Jamba) found that removing attention layers causes retrieval accuracy to drop to 0%. Not "gets worse." Zero. The Mamba layers contribute nothing to retrieval. Hybrid architectures are really specialized module systems: Mamba handles the bulk of sequence processing, attention handles the precision recall.

Architecture	Best At	Worst At
Pure Transformer	In-context learning, retrieval, reasoning	Quadratic scaling, long context memory
Pure Mamba	Throughput, long sequences, structured tasks	Associative recall, retrieval
Hybrid (Interleaved)	Balance of quality and efficiency	Slightly more complex to train
Hybrid (Parallel heads)	Maximum efficiency per parameter	Newest approach, less battle-tested

One thing from 2025 research that doesn't get enough attention: learning rate choice plays an outsized role in recurrent model performance. Some of the negative SSM results in the literature may reflect suboptimal hyperparameter tuning rather than architectural limitations. If you're benchmarking Mamba against Transformers internally, make sure you're actually tuning both fairly.

The emerging consensus: start hybrid. Use a small ratio of attention layers (1-in-8 or 1-in-10). Only go pure Mamba if you've validated that your workload doesn't need retrieval. Only go pure Transformer if context length is permanently short and you need maximum in-context learning.
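As a sketch of what "1-in-8" means for a layer stack, here is a hypothetical helper (the function name and the choice to put the attention layer in the middle of each group are mine, not taken from any of the models above).

```python
def hybrid_layout(n_layers, attn_every):
    """Place one attention layer in every `attn_every` layers; the rest are Mamba."""
    return [
        "attention" if (i % attn_every) == attn_every // 2 else "mamba"
        for i in range(n_layers)
    ]

layout = hybrid_layout(n_layers=32, attn_every=8)
print(layout.count("attention"), layout.count("mamba"))   # 4 28
print(layout[:8])
```

Where the attention layers sit within each group is a tuning decision in its own right, so treat the positions here as placeholders.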

Common misconceptions

"Mamba can't do in-context learning." This was plausible in early 2024. It's not true in 2026. Jamba hits 67.4% MMLU and 59.9% GSM8K. Granite 4.0 scores 82.41% HumanEval. Hybrids addressed the early limitations, and even pure Mamba models keep improving through better state representations (Mamba-3's complex-valued states).

"SSMs are just RNNs with better marketing." No. The selective mechanism is a different thing from fixed gating. Mamba's parameters change with the input. The model decides per-token how much state to preserve or overwrite. The state dimension (N=16 by default) gives it far more representational capacity than scalar RNN gates. And Mamba-3 solves state-tracking tasks (Parity at 100%) that Mamba-2's real-valued states cannot. Call that marketing if you want, but the math disagrees.

"Mamba will fix hallucinations." It won't. OpenAI's 2025 hallucination framework (Kalai et al.) argues mathematically that hallucination is architecture-agnostic. The core bound, roughly: err ≥ 2 · err_iiv, where err is the model's generative error rate and err_iiv is its misclassification rate on the corresponding "Is-It-Valid" binary classification problem. Under binary evaluation (right/wrong), models are incentivized to guess rather than say "I don't know." This holds whether you use attention, SSMs, or anything else. Hallucination lives in the training objective, not the architecture.

"You need equal parts attention and Mamba in hybrids." Production models disagree. Jamba uses a 1-in-8 ratio. Granite 4.0 uses 1-in-10. Nemotron-H replaces 92% of attention layers. Sometimes just 3 attention layers total is enough for retrieval capability while Mamba handles everything else.

Practical implementation

If you want to start using hybrid models today, Jamba is the most accessible entry point. Here's an 8-bit quantized setup:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"]  # Preserve Mamba layer precision
)

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
)

Note the llm_int8_skip_modules=["mamba"]. Mamba layers are more sensitive to quantization than attention layers. Skipping them during 8-bit conversion preserves quality where it matters.

Dependencies:

pip install mamba-ssm "causal-conv1d>=1.2.0"
pip install "transformers>=4.40.0" bitsandbytes

Deployment checklist before you ship anything:

Verify CUDA 11.8+ compatibility (mamba-ssm requires it)
Benchmark with representative workloads at your target context length
Monitor memory usage — it should scale linearly with sequence length, not quadratically. If it doesn't, something is wrong
Compare against a Transformer baseline at the same parameter count. The throughput gain should be 2-5x depending on context length
Test retrieval-dependent features specifically. If your application relies on finding specific information in long contexts, a hybrid is mandatory
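For the memory-scaling item in the checklist, one way to turn "linear, not quadratic" into a number is to fit a slope on a log-log plot. This is a minimal sketch with synthetic measurements; in practice the values would come from your own profiling (for example torch.cuda.max_memory_allocated() at each sequence length, with the fixed weight footprint subtracted so it doesn't flatten the slope).

```python
import numpy as np

def scaling_exponent(seq_lens, measurements):
    """Slope of log(measurement) vs log(seq_len): ~1 is linear, ~2 is quadratic."""
    slope, _ = np.polyfit(np.log(seq_lens), np.log(measurements), deg=1)
    return slope

seq_lens = np.array([4_096, 8_192, 16_384, 32_768])

# Synthetic stand-ins for measured activation/cache memory in bytes.
linear_like = 3.0e5 * seq_lens
quadratic_like = 40.0 * seq_lens ** 2

print(round(scaling_exponent(seq_lens, linear_like), 2))     # 1.0
print(round(scaling_exponent(seq_lens, quadratic_like), 2))  # 2.0
```

An exponent drifting well above 1 on a hybrid usually means the attention layers are dominating at that context length, or that something is falling back to a slow path.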

For production inference, vLLM and llama.cpp both support Mamba-based models. Standard NVIDIA GPUs work fine.

What this means

I keep coming back to the bigger picture here. Transformers solved the long-range dependency problem that killed RNNs. Selective state spaces are solving the scaling problem that's slowly strangling attention. The Transformer's core assumption, that every token must attend to every other token, turned out to be sufficient but not necessary.

The same pattern plays out across deep learning. CNNs weren't the final word in computer vision. RNNs weren't the final word in sequence modeling. Transformers almost certainly aren't either. The question was never "will something better come along?" It was "what will it look like?"

Now we have an answer: it looks like a fixed-size state that learns what to remember and what to forget, processed in linear time, optionally augmented with a few attention layers for the tasks that genuinely need global token interaction.

The next time you're designing a system with long contexts, ask yourself: does every token need to attend to every other token? Or is selective state propagation enough?

For most workloads, the answer is shifting.

Sources

Vaswani, A., et al. "Attention is all you need." NeurIPS 2017.
Gu, A., & Dao, T. "Mamba: Linear-time sequence modeling with selective state spaces." 2023.
Dao, T., & Gu, A. "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality." ICML 2024.
Gu, A., et al. "Mamba-3: Improved sequence modeling using state space principles." 2025.
Lieber, O., et al. "Jamba: A hybrid transformer-Mamba language model." ICLR 2025.
Kalai, A.T., et al. "Why language models hallucinate." OpenAI, 2025.
Gu, A., Goel, K., & Ré, C. "Efficiently modeling long sequences with structured state spaces." ICLR 2022.
NVIDIA. "Hymba: A hybrid-head architecture for small language models." 2024.
IBM. "Granite 4.0: Hyper-efficient, high performance hybrid models." 2025.
Mistral AI. "Codestral Mamba." 2024.
IBM Research. "Meet Bamba." 2025.
Bick, A., et al. "Understanding the skill gap in recurrent language models." ICML 2025.
Grootendorst, M. "A visual guide to Mamba and state space models."
IBM. "What is Mamba?"

15 min read

VLA Models Demystified: How Robots Learned to See, Listen, and Act

Serendeep Rudraraju — Tue, 03 Feb 2026 12:45:21 GMT

What happens when you take a language model and give it a body?

It's not a thought experiment anymore. In 2025, a new architecture called VLA moved from academic papers into actual humanoid robots folding origami, sorting warehouse shelves, and working alongside humans at BMW factories. The answer to "can AI control physical things?" went from "theoretically, maybe" to "yes, and here's the API."

I've spent the last few months watching this field explode. Eight survey papers on arXiv in a single year. NVIDIA, Google, and a startup called Physical Intelligence all releasing competing models within months of each other. An open-source 450-million parameter model that runs on a MacBook and somehow keeps up with the billion-parameter behemoths.

What are VLA models?

VLA stands for Vision-Language-Action. The concept is simple enough: give a model an image of what the robot sees, a text instruction like "pick up the red cup," and have it output the actual motor commands to make that happen.

Input:
  - Camera feed of a table with objects
  - "Pick up the red cup and place it on the blue plate"

Output:
  - Joint positions, gripper commands
  - Continuous action sequences at 50Hz

Before VLAs, building a robot that could follow natural language meant stitching together separate systems. One model for vision. Another for language understanding. A third for motion planning. A fourth for low-level control. They'd pass information back and forth like a bad game of telephone, and things broke constantly.

VLAs collapse all of that into one model that learns the whole pipeline end-to-end. Show it enough examples of "instruction + camera image → successful action," and it figures out how to generalize.

The really interesting part isn't that it works. It's that the same model can work across different robots. Train on a robot arm, a wheeled robot, and a humanoid, and the model learns something transferable between them. OpenVLA was trained on 22 different robot types and 970,000 episodes. SmolVLA used data from 487 different community datasets.

Cross-embodiment transfer sounds like marketing speak until you realize it means not having to start from scratch every time someone builds a new robot.

The architecture that won: dual systems

If you look at the major 2025 VLA models (Helix from Figure AI, NVIDIA's GR00T N1, Google's Gemini Robotics) they all landed on the same basic structure. Two systems working together.

System 2 is the "thinking" part. It's a vision-language model, the same kind of thing that powers image understanding in ChatGPT or Gemini. It looks at the camera feed, reads the instruction, and builds an internal representation of "here's what I'm looking at, here's what I need to do." This runs slow, maybe 7-9 times per second. That's fine because thinking doesn't need to be fast.

System 1 is the "doing" part. It takes whatever representation S2 produced and translates it into actual motor commands. This needs to run fast. Helix outputs actions at 200Hz, meaning 200 control signals per second. You need that speed for smooth, precise movement. Try catching a ball at 7Hz and you'll see why.

graph TD
    subgraph S2["SYSTEM 2 — Vision-Language Model"]
        CAM[Camera Feed] --> SCENE[Scene Understanding]
        LANG[Language Instruction] --> PARSE[Language Parsing]
        SCENE --> REP[Internal Representation]
        PARSE --> REP
    end

    subgraph S1["SYSTEM 1 — Visuomotor Policy"]
        REP --> DECODE[Action Decoder]
        DECODE --> MOTOR[Motor Commands]
        MOTOR --> OUT["Output: 50–200Hz control"]
    end

This split borrows from cognitive psychology. Daniel Kahneman's "Thinking, Fast and Slow" describes human cognition as two systems: one slow and deliberate, one fast and automatic. VLA researchers took that literally.

The insight is that you don't need the full reasoning power of a language model to move a gripper two centimeters to the left. You need a fast, specialized controller that knows what "two centimeters left" means in the context S2 has established. So you train both systems end-to-end, and they learn to communicate efficiently.

Figure AI's S1 model is only 80 million parameters. That's tiny. The reason it works is that S2 does the heavy lifting of understanding, and S1 just needs to execute.
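The two-rate structure is easy to see in a toy control loop. This is a sketch with stand-in strings instead of real models, and the 200Hz and 8Hz rates are assumptions picked from the ranges quoted above. The slow system refreshes a shared representation occasionally, and the fast system acts on whatever the latest one is.

```python
def run_dual_system(duration_s=1.0, s1_hz=200, s2_hz=8):
    """Count how often each system runs when S1 ticks at s1_hz and S2 at s2_hz."""
    ticks = int(duration_s * s1_hz)
    ticks_per_s2 = s1_hz // s2_hz          # S1 steps between S2 refreshes
    latent, s2_calls, actions = None, 0, []
    for tick in range(ticks):
        if tick % ticks_per_s2 == 0:
            latent = f"plan@{tick}"        # stand-in for the VLM's representation
            s2_calls += 1
        actions.append((tick, latent))     # stand-in for S1's motor command
    return s2_calls, actions

s2_calls, actions = run_dual_system()
print(s2_calls, len(actions))   # 8 200
print(actions[30])              # (30, 'plan@25')
```

Every action between refreshes is conditioned on a slightly stale plan, which is why S1 also needs its own fast view of the current observation in a real system.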

The 2025 model landscape

The field went from "a few research demos" to "multiple production-ready options" in about eighteen months.

The closed-source heavyweights

Gemini Robotics (Google DeepMind) builds on Gemini 2.0. The demos are impressive: robots folding origami, manipulating playing cards, doing tasks that require genuine dexterity. In June 2025, they released an on-device version optimized to run locally on the robot with low latency. That matters because you don't want your robot waiting for a cloud API response when it's about to drop something.

Helix (Figure AI) was the first VLA to control a full humanoid upper body: arms, hands, torso, head, individual fingers, all at high frequency. They also demonstrated something I haven't seen elsewhere: two robots collaborating on a shared task, controlled by the same model. Figure cut ties with OpenAI in favor of Helix, which tells you something about how confident they are.

π0 (Physical Intelligence) uses a technique called flow-matching instead of the standard autoregressive approach. The result is smoother action generation at 50Hz. They trained on eight different robot types, and the cross-embodiment results are impressive. Physical Intelligence is now valued at $2.4 billion, which seems like a lot until you consider they might be building the operating system for physical AI.

GR00T N1 (NVIDIA) followed Helix's dual-system architecture but trained on a mix of real robot data, human videos, and synthetic data generated in simulation. The weights are available, which puts it somewhere between "open" and "closed."

The open-source options

This is where things get interesting for people who actually want to experiment.

OpenVLA came out of Stanford and collaborators in June 2024. Seven billion parameters, trained on 970,000 episodes across 22 robot embodiments. It outperforms Google's RT-2 (which has 55 billion parameters) by 16.5% on manipulation tasks. Apache 2.0 license. You can run it on a single GPU with 16GB+ VRAM.

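import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# OpenVLA ships custom modeling code, so trust_remote_code=True is required.
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Placeholder path: substitute a frame from your robot's camera.
observation_image = Image.open("observation.png")
instruction = "pick up the red cube and place it on the blue plate"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

inputs = processor(prompt, observation_image).to("cuda:0", dtype=torch.bfloat16)

# predict_action (not generate) decodes action tokens into a continuous action.
# unnorm_key picks the dataset statistics used to un-normalize it; "bridge_orig"
# is the BridgeData V2 key from the OpenVLA README, so swap in the one for your setup.
action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)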

SmolVLA (Hugging Face) is the one that surprised me. 450 million parameters, about 15x smaller than OpenVLA, and it matches or beats larger models on both simulation and real-world tasks. It runs on a MacBook. It was trained entirely on community-contributed datasets through LeRobot.

The fact that a model this small keeps up with the giants suggests we're nowhere near the efficiency ceiling. There's probably a lot of unnecessary complexity in the bigger models.

Model	Parameters	GPU Memory	Inference Speed	License
SmolVLA	450M	4GB	Real-time	Open
OpenVLA	7B	16GB+	2-5Hz	Apache 2.0
GR00T N1	2B	~24GB	Variable	Weights available
π0	Undisclosed	48GB+	50Hz	Partial

What VLAs still can't do well

I'd be doing you a disservice if I didn't talk about the limitations, because there are real ones.

Spatial reasoning is shaky. VLMs are trained on 2D images with text. They never learned to think in 3D. When a VLA needs to reason about depth, occlusion, or the physical relationship between objects in space, it often gets things wrong. There's active research on adding depth awareness, but it's not solved.

Memory is basically nonexistent. Each action decision is reactive to the current camera frame. The model doesn't really maintain a spatial history of "I already looked over there and it wasn't there." Some hierarchical approaches are starting to address this, but most VLAs are surprisingly forgetful.

Variable environments still trip them up. Change the lighting in a scene, add clutter, or introduce objects the model hasn't seen, and performance drops. Warehouses and labs work well because they're controlled. Your messy kitchen is a different story.

There's no standard benchmark. Different papers evaluate on different tasks with different metrics. LIBERO exists as a simulation benchmark with 130+ tasks, but comparing results across papers is frustrating. This is a problem the field needs to solve.

The sim-to-real gap persists. You can train in simulation cheaply and at scale, but what works in MuJoCo or Isaac Sim doesn't always transfer to physical robots. The gap is narrowing but it's not gone.

The connection to computer-use agents

Something I keep thinking about: VLA models and GUI agents like Anthropic's Computer Use or OpenAI's Operator are architectural cousins.

Both take visual input (camera feed or screenshot), combine it with a language instruction, and output actions (robot commands or mouse clicks). Both use vision-language models as their reasoning backbone. Both benefit from chain-of-thought prompting to handle multi-step tasks.

The VLA researchers are solving "how do you go from seeing and understanding to physically doing" in the robot domain. The GUI agent researchers are solving the same problem for digital interfaces. They're reading each other's papers, and the techniques transfer.

If you're interested in autonomous agents generally, not just robots, the VLA literature is worth following. The "see, understand, act" paradigm is the same whether you're picking up a cup or clicking a button.
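To make the "cousins" claim concrete, here is a minimal sketch of the shared shape. Every type and function name in it is hypothetical and chosen for illustration; the stub policies stand in for real models, and no actual VLA or GUI-agent library is involved.

```typescript
// Hypothetical types for illustration; no real VLA or GUI-agent library is used.
interface Observation {
  image: Uint8Array;   // camera frame or screenshot, as raw bytes
  instruction: string; // natural-language task
}

type RobotAction = { kind: 'joint_delta'; values: number[] };
type GuiAction = { kind: 'click'; x: number; y: number };

// A policy maps one observation to one action, whatever the action space is.
type Policy<A> = (obs: Observation) => A;

// Stub policies standing in for real models.
const robotPolicy: Policy<RobotAction> = () => ({ kind: 'joint_delta', values: [0, 0.1, 0] });
const guiPolicy: Policy<GuiAction> = () => ({ kind: 'click', x: 120, y: 48 });

// The control loop is identical for both: see, understand, act, repeat.
function runEpisode<A>(policy: Policy<A>, frames: Uint8Array[], instruction: string): A[] {
  return frames.map((image) => policy({ image, instruction }));
}

const frames = [new Uint8Array(4), new Uint8Array(4)];
const robotActions = runEpisode(robotPolicy, frames, 'Pick up the red cube');
const guiActions = runEpisode(guiPolicy, frames, 'Click the Submit button');
console.log(robotActions.length, guiActions[0].kind);
```

Only the action type differs between the two domains. The loop is also purely reactive, one frame in and one action out, which is the same memory limitation described earlier.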

Getting started

If you want to experiment:

Start with SmolVLA. It's small enough to run on modest hardware, well-documented, and integrated with Hugging Face's LeRobot library. The barrier to entry is low.

Try simulation first. MuJoCo is free. Isaac Sim has a free tier. Running a real robot is expensive and things break. Get your bearings in simulation.

Use LeRobot. Hugging Face built this library specifically to make VLA research accessible. It handles data loading, training, and evaluation. There's a free tutorial if you want the basics.

Join the community. The LeRobot Discord and OpenVLA GitHub are where people are actually building things and sharing what works.

Your situation	Where to start
Curious, no robot	SmolVLA + simulation
Have a robot arm	OpenVLA fine-tuning
Serious research	LeRobot + LIBERO benchmark
Just want to understand	Read the Wikipedia page and Helix blog post

What this means

Language models gave us machines that could talk. Vision models gave us machines that could see. VLA models are giving us machines that can touch and manipulate the physical world.

Google's robot can fold origami. Figure's humanoid can sort items alongside warehouse workers. BMW deployed VLA-powered robots in manufacturing in January 2025, not as a pilot, but permanently.

The question used to be whether AI could control physical things. Now the question is what we want it to control, and under what constraints.

I keep coming back to those robots working through the night at the BMW factory. There's something both exciting and unsettling about machines that can see a problem, understand what needs to be done, and just... do it. No human in the loop. The technology has crossed a line, and I'm not sure we've fully processed what that means.

But that's probably a topic for another post.

Sources

Wikipedia: Vision-language-action model - Comprehensive overview and history
Figure AI: Helix announcement - Dual-system architecture details
Hugging Face: SmolVLA - Open-source compact VLA
OpenVLA - Open-source 7B VLA model
arXiv: VLA Survey - Comprehensive review of 102 models
Google DeepMind: RT-2 - Original VLA breakthrough
LeRobot GitHub - Open-source robotics library
arXiv: Gemini Robotics - Google's 2025 VLA

JavaScript Date Is Broken. Here's What Replaces It

Serendeep Rudraraju — Fri, 23 Jan 2026 17:40:23 GMT

Have you ever scheduled a meeting for 3 PM, only to have a colleague across the Atlantic show up at the wrong time? Or built an event system that worked perfectly in development, then watched it produce mystifying bugs the moment users in different timezones touched it?

If so, you've run into one of JavaScript's longest-running embarrassments: the Date object.

The good news: after years of committee work, JavaScript is finally getting a proper date and time API. The TC39 Temporal proposal is the biggest improvement to JavaScript's handling of dates since the language was created. In this tutorial, I'll walk through building a timezone-aware event scheduler with React and TypeScript that shows why Temporal matters and how to use it today.

TL;DR

We're building a scheduler where users can create events in any timezone, view them in their local timezone, and sort them correctly regardless of origin. The insight that makes this work: Temporal separates the instant in time from its human representation, which eliminates entire categories of bugs that plague Date-based applications.

You can follow along with the code on GitHub or steal these patterns for your own projects. The Temporal API is available now via polyfill.

Prerequisites

Before we start, you'll need:

Familiarity with React and TypeScript
Node.js 18 or later
A basic understanding of what timezones are (though not necessarily how they work internally)

No prior Temporal experience required.

What's wrong with Date, exactly?

JavaScript was famously designed in ten days in 1995, and its Date object was modeled on Java's java.util.Date. It shows.

Here's why it's unsuitable for serious datetime work:

Mutability. Date objects can be modified in place. Call setMonth() and you've changed the original object. This makes Date objects dangerous to pass around.

Timezone confusion. A Date represents an instant in time, but its methods (getHours(), toString()) silently use the local timezone. There's no way to create a Date that "knows" it belongs to a specific timezone. You can't represent "3 PM in Tokyo" directly.

No duration arithmetic. Adding 30 days requires manual millisecond math. Adding "one month" is a question Date doesn't even attempt to answer.

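Parsing nightmares. Date.parse() only guarantees the ISO 8601 subset; every other format is implementation-dependent. Even the guaranteed subset has a trap: the date-only string "2026-01-20" parses as midnight UTC, while "2026-01-20T00:00" parses as midnight local time (and older engines disagreed on both). This is the bug that passes all your tests and breaks in production.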

No separation of concepts. A calendar date, a wall-clock time, and a specific moment in history are different things. Date conflates them all.

Consider this:

const meeting = new Date('2026-03-15T14:00');
console.log(meeting.toISOString());

What time is this meeting? Depends on which timezone your JavaScript runtime is configured for. The same code produces different results on different machines. This isn't a bug. This is how Date was designed.
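A small sketch of the same trap, using only the built-in Date. It assumes an engine that follows the current ECMAScript parsing rules, where a date-only string is read as UTC and a date-time string without an offset is read as local time.

```typescript
// Date-only ISO strings are interpreted as UTC...
const dateOnly = new Date('2026-01-20');
// ...but date-time strings without an offset are interpreted as local time.
const dateTime = new Date('2026-03-15T14:00');

console.log(dateOnly.toISOString()); // 2026-01-20T00:00:00.000Z on every machine
console.log(dateTime.getHours());    // 14 on every machine (local wall clock)
console.log(dateTime.toISOString()); // varies with the machine's timezone

// An explicit offset pins the string to a single instant.
const pinned = new Date('2026-03-15T14:00:00-04:00');
console.log(pinned.toISOString());   // 2026-03-15T18:00:00.000Z
```

The only string here that means the same instant everywhere is the one that carries its own offset.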

Enter Temporal: the mental model

Temporal introduces distinct types for distinct concepts:

Temporal.Instant - an exact moment in time (like a Unix timestamp, but with nanosecond precision)
Temporal.PlainDate - a calendar date with no time or timezone
Temporal.PlainTime - a wall-clock time with no date or timezone
Temporal.PlainDateTime - a date and time with no timezone
Temporal.ZonedDateTime - a date, time, and timezone together

graph TD
    subgraph "Temporal Type Hierarchy"
        PD["PlainDate<br/><i>2026-03-15</i>"]
        PT["PlainTime<br/><i>14:00:00</i>"]
        PDT["PlainDateTime<br/><i>2026-03-15T14:00</i>"]
        ZDT["ZonedDateTime<br/><i>2026-03-15T14:00[America/New_York]</i>"]
        INS["Instant<br/><i>absolute point on timeline</i>"]

        PD -->|"+ PlainTime"| PDT
        PT -->|"+ PlainDate"| PDT
        PDT -->|"+ timezone"| ZDT
        ZDT -->|".toInstant()"| INS
    end

This separation matters. When a user selects "March 15, 2026 at 2 PM" in a form, they're specifying a PlainDateTime. When they also select "America/New_York" as the timezone, that combination becomes a ZonedDateTime. And when you need to compare that event to one in Tokyo, both convert to Instant values on the same global timeline.

Temporal objects are immutable. Operations return new objects rather than modifying existing ones. One less thing to worry about.
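For contrast, the hazard that immutability removes can be reproduced with plain Date. The two helper names below are hypothetical; the sketch only shows how an innocent-looking helper changes the caller's object, and the defensive copy you need to avoid it.

```typescript
// Mutates the Date it was given: the caller's object changes too.
function addOneMonthInPlace(d: Date): Date {
  d.setMonth(d.getMonth() + 1);
  return d;
}

// Defensive version: copy first, then modify the copy.
function addOneMonthCopy(d: Date): Date {
  const copy = new Date(d.getTime());
  copy.setMonth(copy.getMonth() + 1);
  return copy;
}

const start = new Date(2026, 0, 15); // 15 January 2026, local time
const later = addOneMonthCopy(start);
const monthAfterCopy = start.getMonth();     // 0: the original is untouched

addOneMonthInPlace(start);
const monthAfterMutation = start.getMonth(); // 1: the original silently moved

console.log(later.getMonth(), monthAfterCopy, monthAfterMutation);
```

With Temporal there is no in-place variant to reach for by accident, so the defensive copy disappears.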

Project setup

Create a new React project with Vite and install the Temporal polyfill:

npm create vite@latest timezone-scheduler -- --template react-ts
cd timezone-scheduler
npm install @js-temporal/polyfill
npm install -D tailwindcss @tailwindcss/vite

The @js-temporal/polyfill gives you a complete Temporal implementation today. When browsers ship native support, you can drop the polyfill and your code keeps working.

Configure your vite.config.ts to include Tailwind:

import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';
import tailwindcss from '@tailwindcss/vite';

export default defineConfig({
  plugins: [react(), tailwindcss()],
});

Building the Temporal utils layer

Rather than scattering Temporal calls throughout the application, I put common operations in a utility layer. Easier to test, easier to change later.

The simplest function shows how Temporal handles timezone detection:

import { Temporal } from '@js-temporal/polyfill';

export function getLocalTimezone(): string {
  return Temporal.Now.timeZoneId();
}

This returns an IANA timezone identifier like "America/New_York" or "Asia/Tokyo". It replaces older workarounds such as parsing Date.toString() output or reading Intl.DateTimeFormat().resolvedOptions().timeZone with one purpose-built call.

Next, we need to create ZonedDateTime objects from user input. When someone fills out an event form, they provide a local datetime string (from an <input type="datetime-local">) and a timezone selection:

export function createZonedDateTime(
  dateTimeLocal: string,
  timezone: string
): Temporal.ZonedDateTime {
  const plainDateTime = Temporal.PlainDateTime.from(dateTimeLocal);
  return plainDateTime.toZonedDateTime(timezone);
}

Two steps: parse the "plain" datetime without timezone context, then attach the timezone they selected. The from() method accepts ISO 8601 strings like "2026-03-15T14:00".

Converting between timezones is where Temporal earns its keep:

export function convertTimezone(
  zonedDateTime: Temporal.ZonedDateTime,
  targetTimezone: string
): Temporal.ZonedDateTime {
  return zonedDateTime.withTimeZone(targetTimezone);
}

One method call. The withTimeZone method returns a new ZonedDateTime representing the same instant but displayed in a different timezone. The underlying moment doesn't change; only its human representation does. This is exactly what you want when showing a New York event to someone in London.

For relative times like "in 2 hours" or "3 days ago," Temporal provides duration arithmetic that actually works:

export function getRelativeTime(zonedDateTime: Temporal.ZonedDateTime): string {
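  // now() is a helper from this utils module (not shown here); it is assumed
  // to return the current time as a Temporal.ZonedDateTime.
  const currentTime = now();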
  const isPastTime = Temporal.ZonedDateTime.compare(zonedDateTime, currentTime) < 0;
  const duration = zonedDateTime.since(currentTime, {
    largestUnit: 'days',
  });

  const totalMinutes = Math.abs(
    duration.days * 24 * 60 +
      duration.hours * 60 +
      duration.minutes
  );

  const suffix = isPastTime ? 'ago' : 'from now';

  if (Math.abs(duration.days) >= 1) {
    const days = Math.abs(duration.days);
    return `${days} day${days !== 1 ? 's' : ''} ${suffix}`;
  }

  if (totalMinutes >= 60) {
    const hours = Math.floor(totalMinutes / 60);
    return `${hours} hour${hours !== 1 ? 's' : ''} ${suffix}`;
  }

  if (totalMinutes >= 1) {
    return `${totalMinutes} minute${totalMinutes !== 1 ? 's' : ''} ${suffix}`;
  }

  return 'now';
}

The since() method returns a Temporal.Duration with properties for days, hours, minutes, and so on. The largestUnit option controls how the duration is balanced. With Date, implementing this correctly means careful handling of daylight saving time transitions. Temporal handles those automatically.
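One way to keep this testable is to separate the formatting from the Temporal arithmetic. The sketch below is not the project's code: formatRelative is a hypothetical pure function that takes a signed minute count (negative for the past) and, as a simplifying assumption, treats a day as exactly 1,440 minutes.

```typescript
// Hypothetical pure formatter: negative input means the past.
function formatRelative(signedMinutes: number): string {
  const total = Math.abs(Math.trunc(signedMinutes));
  if (total < 1) return 'now';

  const suffix = signedMinutes < 0 ? 'ago' : 'from now';
  const phrase = (n: number, unit: string): string =>
    `${n} ${unit}${n !== 1 ? 's' : ''} ${suffix}`;

  // Assumption: a day is 1,440 minutes, which ignores DST-length days.
  if (total >= 1440) return phrase(Math.floor(total / 1440), 'day');
  if (total >= 60) return phrase(Math.floor(total / 60), 'hour');
  return phrase(total, 'minute');
}

console.log(formatRelative(-90));  // 1 hour ago
console.log(formatRelative(120));  // 2 hours from now
console.log(formatRelative(4320)); // 3 days from now
```

The Temporal side then only has to produce the minute count, and the string logic can be unit-tested without a clock.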

One more utility: displaying timezone offsets. Some timezones have non-integer hour offsets (India is UTC+5:30, Nepal is UTC+5:45), and Temporal gives you this directly:

export function getTimezoneOffset(timezone: string): string {
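  // now(timezone) is the same utils helper (not shown); it is assumed to
  // return the current time as a ZonedDateTime in the given timezone.
  const zdt = now(timezone);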
  const offsetNanoseconds = zdt.offsetNanoseconds;
  const totalMinutes = offsetNanoseconds / (1000 * 1000 * 1000 * 60);
  const hours = Math.trunc(totalMinutes / 60);
  const minutes = Math.abs(totalMinutes % 60);
  const sign = totalMinutes >= 0 ? '+' : '';

  if (minutes === 0) {
    return `UTC${sign}${hours}`;
  }
  return `UTC${sign}${hours}:${String(minutes).padStart(2, '0')}`;
}

Notice the use of offsetNanoseconds rather than hours. Temporal provides nanosecond precision throughout, and offsets are expressed in the smallest unit to avoid floating-point issues.
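The formatting step can also be isolated as a pure function of the nanosecond offset. This is a hedged variant rather than the project's implementation: formatUtcOffset is a hypothetical name, and it computes the sign separately so that an offset between -1 hour and 0 would still keep its minus sign.

```typescript
// Hypothetical helper: formats an offset in nanoseconds as "UTC+5:30" style text.
function formatUtcOffset(offsetNanoseconds: number): string {
  const totalMinutes = Math.round(offsetNanoseconds / 60e9);
  const sign = totalMinutes < 0 ? '-' : '+';
  const absMinutes = Math.abs(totalMinutes);
  const hours = Math.floor(absMinutes / 60);
  const minutes = absMinutes % 60;

  return minutes === 0
    ? `UTC${sign}${hours}`
    : `UTC${sign}${hours}:${String(minutes).padStart(2, '0')}`;
}

const HOUR_NS = 3600 * 1e9;
console.log(formatUtcOffset(5.5 * HOUR_NS));  // UTC+5:30 (India)
console.log(formatUtcOffset(5.75 * HOUR_NS)); // UTC+5:45 (Nepal)
console.log(formatUtcOffset(-5 * HOUR_NS));   // UTC-5
console.log(formatUtcOffset(-3.5 * HOUR_NS)); // UTC-3:30
```

In the real utility the input would come from zdt.offsetNanoseconds; everything after that is plain integer arithmetic.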

Data flow: from form input to sorted display

Here's how a timezone-aware event moves through the application, from creation to display:

graph LR
    A["User Input<br/><code>datetime-local</code> + timezone"] --> B["PlainDateTime.from()"]
    B --> C["toZonedDateTime(tz)"]
    C --> D["Store as ISO string<br/>+ IANA timezone ID"]
    D --> E["parseISOToZonedDateTime()"]
    E --> F{"Display in<br/>original tz?"}
    F -->|Yes| G["Show original ZDT"]
    F -->|No| H["withTimeZone(viewTz)"]
    H --> I["Show converted ZDT"]

    D --> J["Sort via<br/>ZonedDateTime.compare()"]
    J --> K["Sorted event list<br/>(by instant, not wall clock)"]

State management with Context

The scheduler needs to track events and the user's current view timezone. React Context with useReducer works well here.

Type definitions first:

export interface ScheduledEvent {
  id: string;
  title: string;
  description?: string;
  startTime: string; // ISO 8601 format with timezone offset
  timezone: string;  // IANA timezone identifier
  createdAt: string;
}

export interface EventsState {
  events: ScheduledEvent[];
  viewTimezone: string;
}

Events store their datetime as ISO strings with offsets, plus the original timezone identifier. This preserves everything needed to display the event correctly in any timezone.
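To see why the identifier is stored alongside the offset, here is a dependency-free sketch that derives a zone's offset at a given instant from the runtime's Intl data. The helper name offsetMinutes is hypothetical, and the approach (format the instant in the target zone, then read the parts back as if they were UTC) is a workaround for illustration, not how Temporal does it.

```typescript
// Hypothetical helper: offset of `timeZone` from UTC, in minutes, at `epochMs`.
function offsetMinutes(epochMs: number, timeZone: string): number {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour12: false,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
    hour: '2-digit',
    minute: '2-digit',
    second: '2-digit',
  }).formatToParts(new Date(epochMs));

  const get = (type: string): number =>
    Number(parts.find((p) => p.type === type)?.value);

  // Some engines print midnight as hour 24 when hour12 is false.
  const hour = get('hour') % 24;
  const asUtc = Date.UTC(get('year'), get('month') - 1, get('day'), hour, get('minute'), get('second'));
  return Math.round((asUtc - epochMs) / 60000);
}

const january = Date.UTC(2026, 0, 15, 12);
const july = Date.UTC(2026, 6, 15, 12);
console.log(offsetMinutes(january, 'America/New_York')); // -300
console.log(offsetMinutes(july, 'America/New_York'));    // -240
console.log(offsetMinutes(july, 'Asia/Tokyo'));          // 540
```

The same identifier yields two different offsets depending on the date, which is exactly the information a stored "-05:00" cannot reproduce.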

The context provider handles localStorage persistence with a pattern that prevents hydration bugs:

export function EventsProvider({ children }: { children: ReactNode }) {
  const [state, dispatch] = useReducer(eventsReducer, initialState);
  const [isLoaded, setIsLoaded] = useState(false);

  // Load events from localStorage on mount
  useEffect(() => {
    try {
      const stored = localStorage.getItem(STORAGE_KEY);
      if (stored) {
        const parsed = JSON.parse(stored) as unknown[];
        const events = parsed.filter(validateEvent);
        dispatch({ type: 'LOAD_EVENTS', payload: events });
      }
    } catch (error) {
      console.error('Failed to load events from localStorage:', error);
    }
    setIsLoaded(true);
  }, []);

  // Save events to localStorage whenever they change (after initial load)
  useEffect(() => {
    if (!isLoaded) return;
    try {
      localStorage.setItem(STORAGE_KEY, JSON.stringify(state.events));
    } catch (error) {
      console.error('Failed to save events to localStorage:', error);
    }
  }, [state.events, isLoaded]);
  // ...
}

The isLoaded flag prevents the second effect from running until hydration completes. Without it, the initial empty state would overwrite stored events. I've been bitten by this before.
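The reducer itself is not shown in this post. A minimal sketch of what it might look like follows; only the LOAD_EVENTS action appears in the provider code, so the other action names are assumptions made for illustration.

```typescript
interface ScheduledEvent {
  id: string;
  title: string;
  description?: string;
  startTime: string;
  timezone: string;
  createdAt: string;
}

interface EventsState {
  events: ScheduledEvent[];
  viewTimezone: string;
}

// LOAD_EVENTS comes from the provider code; the rest are hypothetical.
type EventsAction =
  | { type: 'LOAD_EVENTS'; payload: ScheduledEvent[] }
  | { type: 'ADD_EVENT'; payload: ScheduledEvent }
  | { type: 'DELETE_EVENT'; payload: string }
  | { type: 'SET_VIEW_TIMEZONE'; payload: string };

function eventsReducer(state: EventsState, action: EventsAction): EventsState {
  switch (action.type) {
    case 'LOAD_EVENTS':
      return { ...state, events: action.payload };
    case 'ADD_EVENT':
      return { ...state, events: [...state.events, action.payload] };
    case 'DELETE_EVENT':
      return { ...state, events: state.events.filter((e) => e.id !== action.payload) };
    case 'SET_VIEW_TIMEZONE':
      return { ...state, viewTimezone: action.payload };
    default:
      return state;
  }
}

const initialState: EventsState = { events: [], viewTimezone: 'UTC' };
const sample: ScheduledEvent = {
  id: '1',
  title: 'Standup',
  startTime: '2026-03-16T09:00:00-04:00',
  timezone: 'America/New_York',
  createdAt: '2026-03-01T00:00:00Z',
};
const afterAdd = eventsReducer(initialState, { type: 'ADD_EVENT', payload: sample });
const afterDelete = eventsReducer(afterAdd, { type: 'DELETE_EVENT', payload: '1' });
console.log(afterAdd.events.length, afterDelete.events.length);
```

Because every case returns a new state object, the save effect's dependency on state.events changes identity exactly when the list changes.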

The validation function keeps corrupted data from crashing the app:

function validateEvent(event: unknown): event is ScheduledEvent {
  return (
    typeof event === 'object' &&
    event !== null &&
    typeof (event as ScheduledEvent).id === 'string' &&
    typeof (event as ScheduledEvent).title === 'string' &&
    typeof (event as ScheduledEvent).startTime === 'string' &&
    typeof (event as ScheduledEvent).timezone === 'string'
  );
}

This type guard filters out malformed entries during hydration. TypeScript's type predicates make it both safe and ergonomic.
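A short usage sketch, with the guard repeated so the snippet is self-contained. The sample payload is made up, and the extra Array.isArray check is my addition: it covers the case where the stored JSON is valid but is not an array at all.

```typescript
interface ScheduledEvent {
  id: string;
  title: string;
  description?: string;
  startTime: string;
  timezone: string;
  createdAt: string;
}

function validateEvent(event: unknown): event is ScheduledEvent {
  return (
    typeof event === 'object' &&
    event !== null &&
    typeof (event as ScheduledEvent).id === 'string' &&
    typeof (event as ScheduledEvent).title === 'string' &&
    typeof (event as ScheduledEvent).startTime === 'string' &&
    typeof (event as ScheduledEvent).timezone === 'string'
  );
}

// Made-up stored payload: one good entry and three corrupted ones.
const raw = JSON.stringify([
  {
    id: '1',
    title: 'Standup',
    startTime: '2026-03-16T09:00:00-04:00',
    timezone: 'America/New_York',
    createdAt: '2026-03-01T00:00:00Z',
  },
  { id: 2, title: 'Numeric id', startTime: '2026-03-16T09:00:00Z', timezone: 'UTC' },
  null,
  'not an object',
]);

const parsed: unknown = JSON.parse(raw);
const events: ScheduledEvent[] = Array.isArray(parsed) ? parsed.filter(validateEvent) : [];
console.log(events.map((e) => e.title)); // [ 'Standup' ]
```

Note that the guard as written does not check createdAt or description, so those fields are only as trustworthy as whatever wrote them.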

The event card: dual timezone display

The event card is where the Temporal utilities come together. Each card can display its time in either the original timezone or the user's view timezone:

const originalZdt = parseISOToZonedDateTime(event.startTime, event.timezone);
const viewZdt = convertTimezone(originalZdt, state.viewTimezone);

const displayZdt = showInOriginalTz ? originalZdt : viewZdt;
const eventIsPast = isPast(originalZdt);

The component maintains both representations and switches between them based on user preference. Both values describe the same instant, so isPast would return the same answer for either; passing the original simply keeps the check independent of display state.

This solves a common problem: showing a meeting scheduled at "3 PM Tokyo" to someone in New York. They see "1 AM New York" (2 AM while New York is on daylight saving time) with a toggle to view the original. No manual offset calculations, no daylight saving time bugs.
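The Tokyo-to-New-York arithmetic can be checked without the polyfill by asking Intl for the wall-clock time of one instant in two zones. The wallClock helper is hypothetical and exists only for this check; in the app, withTimeZone does the equivalent work.

```typescript
// Hypothetical helper: "HH:MM" wall-clock time of an instant in a given zone.
function wallClock(epochMs: number, timeZone: string): string {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour12: false,
    hour: '2-digit',
    minute: '2-digit',
  }).formatToParts(new Date(epochMs));

  const get = (type: string): number =>
    Number(parts.find((p) => p.type === type)?.value);

  // Some engines print midnight as hour 24 when hour12 is false.
  const hh = String(get('hour') % 24).padStart(2, '0');
  const mm = String(get('minute')).padStart(2, '0');
  return `${hh}:${mm}`;
}

// 3 PM in Tokyo (UTC+9) is 06:00 UTC.
const winterMeeting = Date.UTC(2026, 0, 20, 6, 0);
const summerMeeting = Date.UTC(2026, 6, 20, 6, 0);

console.log(wallClock(winterMeeting, 'Asia/Tokyo'));       // 15:00
console.log(wallClock(winterMeeting, 'America/New_York')); // 01:00
console.log(wallClock(summerMeeting, 'America/New_York')); // 02:00
```

Tokyo does not observe daylight saving time, so the New York side of the conversion is the one that moves during the year.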

Sorting events across timezones

A list of events from different timezones needs to sort by actual occurrence, not by the numeric values of their local times. An event at 9 AM in Tokyo happens before an event at 9 AM in New York on the same calendar date.

const sortedEvents = useMemo(() => {
  return [...state.events].sort((a, b) => {
    const aZdt = parseISOToZonedDateTime(a.startTime, a.timezone);
    const bZdt = parseISOToZonedDateTime(b.startTime, b.timezone);
    return Temporal.ZonedDateTime.compare(aZdt, bZdt);
  });
}, [state.events]);

Temporal.ZonedDateTime.compare() compares by instant, not wall-clock time. Exactly right for sorting a mixed-timezone list.

With Date, you'd convert both to UTC milliseconds and compare those. Possible, but error-prone. With Temporal, the comparison is explicit and correct by default.
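For comparison, the Date-based fallback looks roughly like the sketch below. The event data is made up, and the approach works only because each stored string carries an explicit offset. It also shows what goes wrong when the strings are compared as text instead.

```typescript
interface StoredEvent {
  title: string;
  startTime: string; // ISO 8601 with an explicit offset
}

// Made-up events, all at 9 AM local time on 16 March 2026.
const events: StoredEvent[] = [
  { title: 'New York', startTime: '2026-03-16T09:00:00-04:00' },
  { title: 'Tokyo', startTime: '2026-03-16T09:00:00+09:00' },
  { title: 'London', startTime: '2026-03-16T09:00:00+00:00' },
];

// Correct: compare the instants as epoch milliseconds.
const byInstant = [...events].sort(
  (a, b) => Date.parse(a.startTime) - Date.parse(b.startTime),
);

// Incorrect: compare the strings, which orders by wall clock and offset text.
const byText = [...events].sort((a, b) =>
  a.startTime < b.startTime ? -1 : a.startTime > b.startTime ? 1 : 0,
);

console.log(byInstant.map((e) => e.title)); // [ 'Tokyo', 'London', 'New York' ]
console.log(byText.map((e) => e.title));    // [ 'London', 'Tokyo', 'New York' ]
```

The error-prone part is that both versions type-check and both look plausible in a code review; only one of them sorts by when things happen.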

What I learned

A few patterns worth noting:

Store timezone identifiers, not offsets. IANA identifiers like "America/New_York" encode daylight saving rules. UTC offsets like -05:00 don't. Store an offset and you lose the ability to correctly handle events that span a DST transition.

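Parse to Instant for storage, ZonedDateTime for display. An ISO string with only an offset (2026-03-15T14:00:00-04:00) identifies an instant, so Instant.from() accepts it. ZonedDateTime.from() additionally needs a bracketed timezone annotation (2026-03-15T14:00:00-04:00[America/New_York]) and throws without one. Use Instant.from() when you need the universal timeline, ZonedDateTime.from() when you need to preserve original timezone context.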

The polyfill is larger than you might expect. The @js-temporal/polyfill adds roughly 50KB minified. For many applications that's fine, but worth knowing.

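Timezone data isn't static. The IANA Time Zone Database gets updates several times per year as governments change their DST rules. The polyfill takes its timezone rules from the host's Intl implementation rather than shipping its own, so results depend on how current the browser or Node.js runtime is, and production applications may need an update strategy.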

Temporal.Duration can surprise you. A duration of "1 month" is ambiguous. January to February is 31 days; February to March is 28 or 29. Temporal handles this, but you should understand when you're working with "calendar" durations versus fixed durations.
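The month ambiguity is easy to reproduce with plain Date. In the sketch below, addMonthsClamped is a hypothetical helper that imitates the idea behind Temporal's default "constrain" overflow behaviour (clamp to the last valid day of the target month). It works on UTC calendar dates and drops the time of day, so treat it as an illustration only.

```typescript
const DAY_MS = 86400000;

function daysBetween(a: Date, b: Date): number {
  return Math.round((b.getTime() - a.getTime()) / DAY_MS);
}

const jan1 = new Date(Date.UTC(2026, 0, 1));
const feb1 = new Date(Date.UTC(2026, 1, 1));
const mar1 = new Date(Date.UTC(2026, 2, 1));
console.log(daysBetween(jan1, feb1), daysBetween(feb1, mar1)); // 31 28

// Date's own month arithmetic overflows: "31 February" rolls into March.
const jan31 = new Date(Date.UTC(2026, 0, 31));
const overflowed = new Date(jan31.getTime());
overflowed.setUTCMonth(overflowed.getUTCMonth() + 1);
console.log(overflowed.toISOString().slice(0, 10)); // 2026-03-03

// Hypothetical helper: clamp to the last day of the target month instead.
function addMonthsClamped(d: Date, months: number): Date {
  const year = d.getUTCFullYear();
  const month = d.getUTCMonth() + months;
  const lastDay = new Date(Date.UTC(year, month + 1, 0)).getUTCDate();
  return new Date(Date.UTC(year, month, Math.min(d.getUTCDate(), lastDay)));
}

console.log(addMonthsClamped(jan31, 1).toISOString().slice(0, 10)); // 2026-02-28
```

"One month" therefore has no fixed length in days, which is why calendar durations need a starting date before they can be converted to anything exact.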

Closing thoughts

The Temporal API introduces a new mental model for time in JavaScript, one that distinguishes between moments, calendar dates, wall-clock times, and the timezones that connect them.

For applications that deal seriously with time, this precision isn't academic. Scheduling systems, booking platforms, financial applications have all spent years working around Date's limitations. Temporal doesn't make these problems disappear, but it gives you tools that match the actual complexity.

The proposal reached Stage 3 in TC39, meaning the API is stable and browser implementation is underway. Use the polyfill today and your code will work unchanged when native support lands.

Time is hard. At least now our tools admit it.

Open Source Sunday: Open Source Health Tools That Don't Sell You Out

Serendeep Rudraraju — Sun, 18 Jan 2026 15:34:15 GMT

Every fitness app promises to help you "reach your goals." Few of them raise the more relevant question: where does your heart rate data go after the app syncs it to the cloud, and who profits from the intimate story your body tells?

The answer involves a data supply chain spanning advertising networks, insurance companies, and data brokers you have never heard of. This is not about paranoia or tinfoil hats. It is about understanding what happens to the most personal data you generate — and the growing ecosystem of tools that let you keep it entirely.

TL;DR

Open source health tools have matured dramatically in 2025. You can now track fitness, manage medications, monitor vital signs with medical-grade accuracy, and aggregate your complete medical records — all without sending data to corporate servers or paying monthly subscriptions.

This guide covers the best open source alternatives across five categories: wearables, fitness tracking, medication management, mental health, and medical records. Each tool includes an honest difficulty rating (1-5 stars) and a frank assessment of what you gain and what you give up.

The minimum viable setup costs nothing and shares zero data. The advanced setup gives you complete sovereignty over health information that commercial apps routinely sell.

Why This Matters Now

The Privacy Illusion

Most people believe their health app data is protected by HIPAA. It is not.

HIPAA applies only to "covered entities" — healthcare providers, insurers, and their business associates. Your Fitbit, Oura ring, period tracker, or meditation app? Not covered. The data these apps collect falls entirely outside federal health privacy law.

The regulatory gap has consequences. A BMJ analysis found that 79% of health apps share user data with third parties. Those third parties then share with "fourth parties" — a cascading data supply chain you never consented to join.

This is not theoretical. In 2023, the FTC ordered BetterHelp to pay $7.8 million after the company shared users' mental health data with Facebook, Snapchat, and other advertising platforms. The company had promised to keep user data private. It did not.

Security vulnerabilities compound the privacy problem. A 2025 analysis found an average of 44 critical vulnerabilities per Android healthcare app, with over 2,000 high-severity issues across the apps studied. More than 176 million patients have been affected by protected health information breaches historically.

The uncomfortable question: if you would not post your medication schedule, sleep patterns, and menstrual cycle on social media, why are you sharing this data with apps that have fewer legal obligations than your doctor?

The Subscription Trap

The economics of wearables have inverted. The device is no longer the product — your data is.

Consider the real cost of "free" and subsidized wearables:

Device	Upfront Cost	Subscription	5-Year Total
Whoop 4.0	"Free" with subscription	$199-300/year mandatory	~$1,200
Oura Ring 4	$349	$5.99/month ($72/year)	~$709
Fitbit (with Premium)	$100-300	$9.99/month ($120/year)	$700-900

Whoop requires a 12-month commitment minimum. Fitbit's advanced analytics sit behind the Premium paywall. And Fitbit users face a deadline: move to a Google account by February 2026 or lose access.

You are paying monthly rent for access to your own body's data.
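The five-year totals in the table follow from simple arithmetic, sketched below. The figures are the ones quoted in the table, with Oura's subscription rounded to $72 per year as the table does; the Plan type and function name are hypothetical.

```typescript
interface Plan {
  device: string;
  upfront: [number, number]; // [low, high] purchase price in USD
  yearly: [number, number];  // [low, high] subscription cost per year in USD
}

// Total cost of ownership over a number of years, as a [low, high] range.
function totalCost(plan: Plan, years: number): [number, number] {
  return [
    plan.upfront[0] + plan.yearly[0] * years,
    plan.upfront[1] + plan.yearly[1] * years,
  ];
}

const plans: Plan[] = [
  { device: 'Whoop 4.0', upfront: [0, 0], yearly: [199, 300] },
  { device: 'Oura Ring 4', upfront: [349, 349], yearly: [72, 72] },
  { device: 'Fitbit (with Premium)', upfront: [100, 300], yearly: [120, 120] },
];

for (const plan of plans) {
  const [low, high] = totalCost(plan, 5);
  console.log(plan.device, low === high ? `$${low}` : `$${low}-$${high}`);
}
```

Whoop's range works out to $995-$1,500, so the table's "~$1,200" is a midpoint estimate; the Oura and Fitbit rows match exactly at $709 and $700-$900.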

The 2025 Turning Point

Two developments have shifted the landscape.

On November 4, 2025, Senator Bill Cassidy introduced HIPRA — the Health Information Privacy Reform Act. Unlike HIPAA, HIPRA specifically addresses wearables and health apps. The legislation creates a new category called "Applicable Health Information" covering digital health metrics that fall outside traditional medical records. It requires consent before selling health data.

HIPRA signals that regulators are finally recognizing the gap between what consumers expect and what the law requires.

Late 2025 also saw major launches in the open source health ecosystem:

Open Wearables (December 2025): A unified API connecting 200+ wearable devices, MIT licensed
HealthyPi Move: An open source biometric monitor with medical-grade sensors, 329% crowdfunded
Gadgetbridge 0.88.0: Added Garmin support, expanding the universe of "liberated" wearables

These are not hobbyist experiments. They are production-grade tools that make data sovereignty practical.

The Open Source Health Stack

Before diving into specific tools, here is how the pieces fit together:

graph TD
    subgraph Phone["YOUR PHONE / COMPUTER"]
        Local["Data Stays Here (Local-First)"]
    end

    subgraph Hardware["Hardware (Wearable)"]
        HW1["PineTime"]
        HW2["HealthyPi"]
        HW3["ZSWatch"]
    end

    subgraph Apps["Apps (Tracking)"]
        A1["FitoTrack"]
        A2["Gadgetbridge"]
        A3["wger"]
    end

    subgraph Aggregator["Aggregator (Records)"]
        AG1["Fasten"]
        AG2["Open Wearables"]
    end

    Hardware --> Local
    Apps --> Local
    Aggregator --> Local

The key principle: data flows to your device, never from it to corporate servers.

Gadgetbridge illustrates this concretely. The app literally cannot send data anywhere — it has no network permission in its Android manifest. This is not a policy decision that could change. It is an architectural guarantee enforced by the operating system.

Category 1: Wearables & Companion Apps

Gadgetbridge — The Universal Liberation Tool

Gadgetbridge is an open source Android app that replaces the official companion apps for smartwatches and fitness bands. Instead of pairing your Amazfit to the Zepp app (which uploads your data to servers in China), you pair it to Gadgetbridge (which stores everything locally).

Supported devices:

Amazfit: Bip, GTR, GTS, T-Rex, Balance, Active, Falcon, Cheetah
Xiaomi: Mi Band 4-8, Smart Band series
Garmin: Partial support since v0.81.0
PineTime, Bangle.js, Casio, Fossil
And 50+ more

What you gain:

Zero network permission = the app cannot open a network connection, so it has no channel to leak data through
All data stored locally in exportable SQLite database
Works offline indefinitely
No account creation required

What you lose:

Cannot update watch firmware automatically (you must download files manually)
Some advanced features may be missing compared to official apps
Initial setup requires auth key extraction for newer Amazfit/Xiaomi devices

Difficulty rating: ⭐⭐ (2/5)

The app install itself is trivial — get it from F-Droid. The complication is that newer Amazfit and Xiaomi devices require "server-based pairing." You must pair with the official app once, extract an authentication key, then enter that key into Gadgetbridge. It sounds tedious. It takes about fifteen minutes, and you never touch the official app again.

Quick setup for Amazfit/Xiaomi devices:

Install Gadgetbridge from F-Droid
Install the official Zepp Life app temporarily
Create an account and pair your device normally
Use the Huami-token tool to extract your auth key
In Gadgetbridge, add your device and paste the key (prefix with 0x)
Uninstall Zepp Life

From this point forward, your health data never leaves your phone.

Source: Gadgetbridge Official

Gadgetbridge main interface — all your health data stays on your device

HealthyPi Move — Medical-Grade Open Hardware

HealthyPi Move is an open source biometric monitor in a watch form factor. Unlike consumer wearables that track steps and heart rate, HealthyPi measures eight vital signs with medical-grade sensors:

Single-lead ECG for heart rhythm analysis
PPG for heart rate, HRV, and SpO₂
EDA/GSR for stress and emotional response
Body temperature
Blood pressure trends (via finger-based PPG attachment)
6-axis IMU for activity tracking

The hardware is fully open (CERN-OHL-P v2 license). The firmware runs on Zephyr RTOS. The companion app is built with Flutter and supports Android, iOS, macOS, Windows, and Linux.

Why this matters: Medical-grade health monitoring has historically required either expensive professional equipment or consumer devices that send your most sensitive biometrics to corporate clouds. HealthyPi eliminates both constraints.

Specifications:

Nordic nRF5340 dual-core SoC
1.2" 390×390 AMOLED touchscreen
128 MB flash (10 days of processed data storage)
BLE 5.2 and USB-C connectivity
$249 one-time cost

What you lose:

Higher upfront cost than consumer wearables
Not FDA-approved for medical diagnosis (consumer device classification)
The form factor is functional rather than fashion-forward

Difficulty rating: ⭐⭐⭐ (3/5)

The device works out of the box, but understanding the medical-grade features (ECG analysis, blood pressure calibration) requires some learning.

Source: Crowd Supply - HealthyPi Move

HealthyPi Move — medical-grade biometrics in a fully open hardware package

Budget Option: PineTime

If you want to experiment with open source wearables without significant investment, Pine64's PineTime costs $27 and runs the open source InfiniTime firmware.

The feature set is basic: notifications, step counting, heart rate, timer, music control. But everything — hardware and software — is completely open. You can flash custom firmware, modify the watch face, and know exactly what code runs on your wrist.

Difficulty rating: ⭐⭐⭐ (3/5) — Best for patient early adopters comfortable with evolving firmware.

Category 2: Fitness & Activity Tracking

FitoTrack — The Clear Winner for GPS Activities

When a Lemmy user tested 49 open source health apps, FitoTrack emerged as the preferred choice for GPS-based fitness tracking.

The app handles running, cycling, and hiking with real-time tracking of speed, distance, and elevation. Routes display on OpenStreetMap. Workout history includes charts and statistics. Audio announcements can read your progress through headphones during workouts.

Why FitoTrack wins:

Minimal permissions (no notification access, no nearby devices permission)
Lighter weight than alternatives like OpenTracks
Better individual exercise view
GPLv3 licensed, no ads, no tracking
Works completely offline

Vs. OpenTracks: Both are solid choices. OpenTracks integrates better with Gadgetbridge for recording workouts via your wearable. FitoTrack is leaner and requires fewer permissions. If you have a smartwatch, consider OpenTracks. If you just want a phone-based tracker, FitoTrack.

Difficulty rating: ⭐ (1/5) — Install and go.

Source: Codeberg - FitoTrack

FitoTrack — GPS tracking, route mapping, workout statistics, and history — all offline and private

wger — Self-Hosted Workout & Nutrition Manager

FitoTrack handles outdoor activities. What about strength training, nutrition logging, and body measurements?

wger (pronounced "Vega") is a self-hosted fitness management platform. It handles workout planning with progression rules, nutrition tracking via the Open Food Facts database, body weight logging, progress photos, and multi-user support for families or gyms.

The REST API enables integrations with other tools. You can run it on a Raspberry Pi 4, a home server, or any machine with Docker.

Quick deployment:

# docker-compose.yml
version: '3'
services:
  wger:
    image: wger/server:latest
    ports:
      - "8000:8000"
    volumes:
      - wger-data:/home/wger/data
volumes:
  wger-data:

Run docker compose up -d (or docker-compose up -d on the legacy v1 CLI), and wger is available at http://localhost:8000.

If self-hosting sounds like too much friction, wger.de offers a hosted instance. The tradeoff is obvious: you trust them with your data instead of keeping it local.

Difficulty rating: ⭐⭐⭐ (3/5) — Docker knowledge helps but is not strictly required.

Source: GitHub - wger-project/wger

wger — workout planning, nutrition tracking, and progress monitoring in one self-hosted platform

Quick Mentions

Feeel: The 7-minute workout app. Open source, customizable workouts, no account required. Difficulty: ⭐ (1/5)

Flexify: Minimal strength training logger. No frills, just exercise tracking. Difficulty: ⭐ (1/5)

OpenTracks: Activity tracking with Gadgetbridge integration. Records workouts directly from your smartwatch. Difficulty: ⭐ (1/5)

OpenTracks — integrates with Gadgetbridge for wearable-based workout recording

Category 3: Medication Management

MedTimer — The Privacy-First Pill Reminder

Medication data is among the most sensitive health information. Your prescription list reveals conditions, treatments, and health history. Commercial medication apps often share this with insurers, advertisers, or data brokers.

MedTimer stores everything locally. It has no network capability and works offline indefinitely.

Features:

Unlimited medications with customizable reminder schedules
Stock tracking with refill alerts when supplies run low
Weekend mode: delay reminders to a later time on chosen days
Birth control pill support with scheduled breaks
Latest version: v1.21.4 (December 22, 2025)

The app does what medication reminders should do and nothing else. No accounts, no sync, no advertising, no telemetry.

Difficulty rating: ⭐ (1/5) — Just works.

Source: F-Droid - MedTimer

MedTimer — medication tracking, reminders, and stock management — zero data leaves your device

Alternatives

Simpill: Even more minimal than MedTimer. No trackers, no ads. You can block its internet access entirely and it still functions.

Daily Pill: Focused on single daily medication. Ideal if you just need one reminder.

OpenMedTracker: A hardware solution for patients with limited tech ability. Physical button interface with Raspberry Pi backend.

Category 4: Mental Health & Wellness

Mental health apps handle uniquely sensitive data. BetterHelp's $7.8 million FTC settlement demonstrates what can go wrong: promises of privacy, followed by data flowing to Facebook.

HealSphere — The New Contender

HealSphere launched in September 2025 as a modular mental health support platform. The security approach includes JWT-based authentication, encrypted data storage, and no cloud sync by design.

The project is actively seeking contributors. If you are interested in the intersection of mental health and privacy-preserving software, this is an opportunity for involvement.

Difficulty rating: ⭐⭐⭐ (3/5) — Self-hosting required.

Source: Open Source For You - HealSphere

if me — Community-Focused Mental Health Sharing

if me takes a different approach: it is a platform for sharing mental health experiences with trusted people — friends, family, therapists.

The project has 1.6k GitHub stars and an established community. It is web-based and requires deployment, making it suitable for users comfortable with basic server setup.

Difficulty rating: ⭐⭐⭐ (3/5)

Source: GitHub - ifmeorg/ifme

Meditation & Mindfulness

Medito: A 100% free meditation app with guided sessions, breathing exercises, and sleep content. Open source, no ads, no premium tier. What Headspace charges monthly for, Medito provides free.

Difficulty rating: ⭐ (1/5)

Category 5: Medical Records & Data Aggregation

Fasten — Your Medical History, Your Server

Fasten is a self-hosted electronic medical record aggregator. The premise is straightforward: your medical history belongs to you, not to a corporation.

"This is my medical history, I'm not willing to give it to some random multi-national corporation to data-mine and sell."

Fasten connects to 25,000+ healthcare providers using your existing patient portal accounts. It pulls records from different hospitals, clinics, and labs into a single local database. You control what gets shared.

This is the most ambitious project in the open source health space, and accordingly the most complex to set up. Expect to spend time configuring provider integrations.

Difficulty rating: ⭐⭐⭐⭐ (4/5) — Requires Docker and patience.

Source: GitHub - fastenhealth/fasten-onprem

Open Wearables — The Developer's Dream

Open Wearables, launched December 2025, is a unified API connecting 200+ wearable devices: Apple Health, Garmin, Fitbit, Oura, Whoop, Strava, Suunto, Polar.

For developers, this eliminates weeks of integration work per device. For advanced users, it enables building custom health dashboards that combine data from multiple sources.

The architecture is HIPAA-ready with end-to-end encryption and user consent management. MIT licensed. Self-hosted means no vendor lock-in.

Difficulty rating: ⭐⭐⭐ (3/5) — Docker deployment, developer-oriented but accessible.

Source: GitHub - the-momentum/open-wearables

The "I Just Want Simple" Recommendations

If the preceding sections feel overwhelming, here is a decision tree:

graph TD
    Q1{"Do you have an Amazfit,<br/>Xiaomi, or Garmin watch?"}
    Q2{"Do you run, cycle,<br/>or hike outdoors?"}
    Q3{"Do you need<br/>medication reminders?"}
    Q4{"Do you do<br/>strength training?"}
    Q5{"Do you want<br/>medical-grade biometrics?"}

    A1["Install Gadgetbridge ⭐⭐"]
    A2["Install FitoTrack ⭐"]
    A3["Install MedTimer ⭐"]
    A4["Try Flexify ⭐<br/>or wger ⭐⭐⭐"]
    A5["Order HealthyPi Move ⭐⭐⭐"]
    A6["Start with any ⭐ app above"]

    Q1 -->|YES| A1
    Q1 -->|NO| Q2
    Q2 -->|YES| A2
    Q2 -->|NO| Q3
    Q3 -->|YES| A3
    Q3 -->|NO| Q4
    Q4 -->|YES| A4
    Q4 -->|NO| Q5
    Q5 -->|YES| A5
    Q5 -->|NO| A6

The minimum viable setup:

FitoTrack for exercise tracking
MedTimer for medication reminders
Export data periodically to local backup

Total cost: $0. Total data shared with third parties: zero.

What You Give Up

An honest assessment requires acknowledging tradeoffs.

Features you might miss:

Social sharing (Strava leaderboards, workout communities)
AI coaching suggestions
Seamless cloud sync across devices
Automatic firmware updates
Polished onboarding experiences

The learning curve:

Gadgetbridge auth key extraction is not intuitive
Self-hosting requires basic Docker knowledge
Less hand-holding than commercial apps

Ecosystem fragmentation:

No single app does everything
Data portability between tools varies
You become your own IT department

The question to ask yourself: is the convenience of commercial apps worth sharing your health data with unknown third parties?

For an increasing number of people in 2025, the answer is no.

Conclusion

The healthcare data landscape is shifting. HIPRA legislation signals regulatory recognition of the gap between consumer expectations and actual protections. Open source projects have reached production quality. The tools exist.

Seventy-nine percent of health apps share your data with third parties. Subscription models lock your own body's data behind monthly paywalls. Data breaches have affected hundreds of millions of patients. The status quo assumes you will trade intimate health information for convenience.

Your heart rate, sleep patterns, medication schedules, menstrual cycles, and workout history tell an intimate story about who you are. This data reveals more about you than your browsing history, more than your location data, more than your purchase patterns.

The question is not whether you can trust corporations with this data. The question is: why would you, when you no longer have to?

Context-Free Grammars Weren't Invented in the 1950s

Serendeep Rudraraju — Tue, 13 Jan 2026 14:01:57 GMT

Every computer science student learns Backus-Naur Form. You write productions like <expr> ::= <term> | <expr> "+" <term>, build parsers, and move on. But most curricula never raise a relevant question: who proposed renaming it "Panini-Backus Form" in 1967, and why did the ACM publish that letter?

The answer spans 2,500 years across three continents, tracing a chain of independent discoveries that suggests something fundamental about the structure of language itself. This is not about cultural credit or historical trivia. It is about understanding the origins of the tools we use daily.

TL;DR

This article presents a technical history of formal grammar from Pāṇini (~500 BCE) through Post and Thue (1910s-20s) to Chomsky and Backus (1950s). It covers the 1967 proposal to rename BNF, explains how Post's canonical systems became the foundation for programming language specification, and documents working implementations of Pāṇini's 2,500-year-old grammar in Rust, OCaml, and Python.

The key insight: formal grammar was not invented in the 1950s. It was rediscovered multiple times, independently. Understanding this history illuminates why BNF works the way it does.

The 1967 Letter

In 1967, P.Z. Ingerman wrote a letter to Communications of the ACM proposing that Backus-Naur Form be renamed.

"Backus was not the first to use the form with which his name has become associated, although he did, indeed, discover it independently... I would like to suggest the name 'Panini-Backus Form' as being more desirable."

Dr. Alexander Wilhelmy had called Ingerman's attention to the work of Pāṇini, a Sanskrit grammarian who lived around 500 BCE. Upon examining Pāṇini's notation, Ingerman found a system "equivalent in its power to that of Backus, and has many similar properties."

The proposal was never adopted, but the comparison persisted in academic circles for good reason. This was not a loose analogy. Ingerman identified a genuine case of parallel invention, comparable to Newton and Leibniz independently developing calculus, or Darwin and Wallace arriving at evolution by natural selection.

The question follows naturally: what exactly did Pāṇini invent, and how does it compare to what Backus formalized 2,400 years later?

Who Was Pāṇini?

Pāṇini was a Sanskrit grammarian who lived in ancient India, likely in the 5th or 4th century BCE. His masterwork, the Aṣṭādhyāyī ("Eight Chapters"), contains 3,959 sutras (short, compressed rules) distributed across eight chapters and thirty-two sections.

Calling it a "grammar" undersells its nature. The Aṣṭādhyāyī is not a reference book listing word forms. It is a generative system that takes semantic inputs and produces valid Sanskrit expressions. It covers phonology, morphology, syntax, and semantics within a unified framework.

The technical sophistication includes:

Metarules: rules governing how other rules apply
Recursion: rules that reference themselves
Transformations: rules that modify intermediate forms
Auxiliary markers: terminal and non-terminal symbols

Consider an example. The sutra "इको यणचि" (iko yaṇ aci) encodes a phonological transformation:

Sutra:     इको यणचि (iko yaṇ aci)
Meaning:   i/u/ṛ/ḷ → y/v/r/l / before a vowel

Modern notation equivalent:
[i, u, ṛ, ḷ] → [y, v, r, l] / _ V

This is a context-sensitive rewrite rule compressed into three words. The entire grammar operates at this level of compression, which explains why scholars have spent millennia writing commentaries to unpack it.
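To make the rule concrete, here is a minimal Python sketch of the same rewrite applied at a word boundary. It assumes romanized input and a simplified vowel inventory, and the function name `apply_iko_yan_aci` is made up for illustration. It also ignores the other sutras that would take precedence in the full grammar (for example when both vowels are the same), so treat it as a picture of one rule, not of Pāṇini's system.

```python
# Illustrative sketch of [i, u, ṛ, ḷ] → [y, v, r, l] / _ V on romanized text.
# Assumption: simplified phoneme inventory; no competing sandhi rules are modeled.
IK_TO_YAN = {"i": "y", "u": "v", "ṛ": "r", "ḷ": "l"}
VOWELS = {"a", "ā", "i", "ī", "u", "ū", "ṛ", "ḷ", "e", "o"}


def apply_iko_yan_aci(left: str, right: str) -> str:
    """Join two words; a final i/u/ṛ/ḷ becomes y/v/r/l when a vowel follows."""
    if left and right and left[-1] in IK_TO_YAN and right[0] in VOWELS:
        return left[:-1] + IK_TO_YAN[left[-1]] + right
    # Context not satisfied: the rule does not fire, so just concatenate.
    return left + right


print(apply_iko_yan_aci("dadhi", "atra"))   # dadhyatra
print(apply_iko_yan_aci("madhu", "atra"))   # madhvatra
print(apply_iko_yan_aci("dadhi", "tatra"))  # dadhitatra (no vowel follows)
```

The context check (`right[0] in VOWELS`) is what makes the rule context-sensitive: the left-hand symbol alone does not decide whether the rewrite applies.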

Paul Kiparsky, a Stanford linguist and leading Pāṇini scholar, states:

"Pāṇini uses metarules, transformations, and recursions with such sophistication that his grammar has the computing power equivalent to a Turing machine."

This is not hyperbole. The Aṣṭādhyāyī is computationally complete. Given a meaning to express, it generates the correct Sanskrit form through systematic rule application.

A crucial distinction applies here: Pāṇini's rules are operative, not merely descriptive. BNF specifies whether a string is valid. Pāṇini's grammar specifies how to construct valid strings from meaning. This makes his system more sophisticated than BNF; it performs additional computational work.

The Missing Century: Post and Thue

Standard textbook accounts of BNF present a clean narrative: Chomsky formalized grammar types in 1956, Backus applied the ideas to ALGOL in 1959. This account omits a crucial link: the early 20th-century mathematicians who invented the formalism that Backus adapted.

Axel Thue was a Norwegian mathematician working on "word problems" in the 1910s. In 1914, he introduced systematic treatment of string rewriting: pairs of strings where one could be substituted for another. Though this appears simple, Thue systems turned out to be Turing-complete. In 1947, Emil Post and A.A. Markov independently proved that the word problem for Thue systems is undecidable.

Emil Post extended this work in the 1920s (published 1943). He developed "canonical systems," a general framework for string manipulation using production rules of the form g → h. Post designed these systems explicitly for algorithmic symbol manipulation, anticipating computational requirements decades before practical computers existed.

The direct lineage:

graph TD
    P["Pāṇini<br/>~500 BCE<br/><i>Generative grammar<br/>3,959 sutras</i>"]
    T["Axel Thue<br/>1914<br/><i>String rewriting systems</i>"]
    POST["Emil Post<br/>1920s (pub. 1943)<br/><i>Canonical systems<br/>production rules</i>"]
    C["Noam Chomsky<br/>1956<br/><i>Grammar hierarchy<br/>4 types</i>"]
    B["John Backus<br/>1959<br/><i>Metalinguistic formulas<br/>ALGOL syntax</i>"]
    BNF["Backus-Naur Form<br/>ALGOL 58/60"]
    ING["Ingerman's 1967 letter<br/><i>'Panini-Backus Form'</i>"]

    P -.->|"indirect, via<br/>19th-c. Indology"| C
    T --> POST
    POST --> B
    POST --> C
    C ---|"parallel work"| B
    B --> BNF
    P -.-> ING
    BNF --> ING

When Backus specified ALGOL's syntax, he did not invent a notation from scratch. He adapted Post's production-rule formalism for programming language specification. The surface symbols (::=, alternation with |, angle brackets for non-terminals) were settled by Backus and Naur during the ALGOL work, but the production rules they express derive from Post's framework.

Compare the formats:

Post production:    S$x → x$S

BNF production:     <expr> ::= <term> | <expr> "+" <term>

The notation differs, but the underlying concept is identical: rewrite rules that transform symbol strings.
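A small Python sketch can show what "rewrite rules that transform symbol strings" means in practice. It encodes the BNF production above as data and rewrites the leftmost non-terminal one step at a time. The `<term> ::= "x"` production and the function name `rewrite_leftmost` are assumptions added so the derivation can terminate; they are not part of the original example.

```python
# Grammar as data: non-terminal -> list of alternatives (each a list of symbols).
GRAMMAR = {
    "<expr>": [["<term>"], ["<expr>", "+", "<term>"]],
    "<term>": [["x"]],  # hypothetical terminal production, added for illustration
}


def rewrite_leftmost(form, choices):
    """Apply one production per entry in `choices` to the leftmost non-terminal."""
    steps = [list(form)]
    for choice in choices:
        idx = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        form = form[:idx] + GRAMMAR[form[idx]][choice] + form[idx + 1:]
        steps.append(list(form))
    return steps


steps = rewrite_leftmost(["<expr>"], [1, 0, 0, 0])
for step in steps:
    print(" ".join(step))
# <expr>
# <expr> + <term>
# <term> + <term>
# x + <term>
# x + x
```

Each step replaces one symbol with the right-hand side of a production, which is the same operation a Post production performs on a string.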

This matters because the actual history is not Chomsky → Backus. It is Post → Backus, with Chomsky working in parallel on natural language. The formal foundations of programming language syntax predate Chomsky's linguistic work.

Chomsky's 1956 Synthesis

Where does Chomsky fit in this lineage?

In September 1956, Noam Chomsky published "Three Models for the Description of Language" in IRE Transactions on Information Theory. This paper introduced the Chomsky hierarchy, a classification of formal grammars by generative power.

Type	Name	Recognizer	Practical Example
3	Regular	Finite automaton	Lexers, regex
2	Context-free	Pushdown automaton	Parsers, most of BNF
1	Context-sensitive	Linear-bounded automaton	Some natural language constructs
0	Unrestricted	Turing machine	General computation

Chomsky built this hierarchy explicitly on the work of Post, Thue, and Turing. His contribution was the classification itself: recognizing that different grammar types have different computational properties, and that these correspond to different automata classes.
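The gap between Type 3 and Type 2 is easy to see in code. The sketch below is an illustration, not anything from Chomsky's paper: it recognizes balanced parentheses with a counter (a stand-in for a pushdown stack with one symbol) and contrasts it with a regular expression that can only hard-code nesting up to a fixed depth. The names `balanced` and `regex_balanced` are made up for this example.

```python
import re


def balanced(s: str) -> bool:
    """Recognize balanced parentheses; the counter plays the role of a stack."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closer with no matching opener
                return False
        else:
            return False  # only parentheses are in the alphabet
    return depth == 0


# A regular expression has no memory for depth, so nesting must be spelled out.
# This pattern accepts nesting up to depth 2 and nothing deeper.
DEPTH_2 = re.compile(r"(\((\(\))*\))*")


def regex_balanced(s: str) -> bool:
    return DEPTH_2.fullmatch(s) is not None


print(balanced("(())"), regex_balanced("(())"))      # True True
print(balanced("((()))"), regex_balanced("((()))"))  # True False
```

Extending the pattern to depth 3 fixes that one input but fails at depth 4, which is the practical face of the finite-automaton limit.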

Chomsky was also aware of Pāṇini. In a 2001 speech in Kolkata, he stated:

"The first generative grammar in the modern sense was Panini's grammar."

The influence path was indirect. Chomsky did not read the Aṣṭādhyāyī in Sanskrit and extract formal techniques. The connection runs through 19th-century European Indology. When scholars like Franz Bopp and Ferdinand de Saussure studied Sanskrit grammar, they absorbed ideas about formal linguistic rules. These ideas influenced the structuralist tradition that Chomsky both built on and reacted against.

Frits Staal, a scholar of Indian logic, traces this connection: "The idea of formal rules in language, proposed by Ferdinand de Saussure in 1894 and developed by Noam Chomsky in 1957, has origins in the European exposure to the formal rules of Pāṇinian grammar."

The Pāṇini-Chomsky-Backus connection is therefore atmospheric rather than textual. The same ideas surfaced multiple times, possibly because they reflect something fundamental about linguistic structure.

Technical Comparison: Pāṇini vs BNF

How do these systems compare in concrete terms?

A Pāṇinian sutra typically follows this structure:

[condition] + [operation] + [result domain]

Example: अचो ञ्णिति (aco ñṇiti)
Meaning: "Before affixes marked with ñ or ṇ, the final vowel of the stem undergoes vṛddhi"

Pseudo-BNF attempt:
<stem> ::= <vṛddhi-vowel> <rest> / _ <ñṇ-affix>

The pseudo-BNF does not fully capture the semantics. Pāṇini's rule is operative: it transforms a stem by strengthening its vowel when the appropriate affix attaches. BNF merely validates whether the result conforms to the grammar.

Key differences:

Aspect	Pāṇini	BNF
Purpose	Generate valid forms from meaning	Describe/validate syntax
Direction	Operative (produces output)	Descriptive (accepts/rejects input)
Scope	Complete language (phonology → semantics)	Syntax only
Metarules	Sophisticated conflict resolution	Minimal
Compression	Extreme brevity via technical vocabulary	Explicit, verbose

The operative vs. descriptive distinction is fundamental. A BNF grammar fed to a parser generator yields a recognizer that returns accept/reject decisions. Pāṇini's grammar yields a generator that produces correct forms.

Modern implementations demonstrate computational tractability:

Gérard Huet's Sanskrit Heritage Engine (OCaml): a complete computational implementation based on Pāṇinian principles, with a web interface for Sanskrit text analysis
Arun Prasad's Rust implementation: over 2,000 Aṣṭādhyāyī rules encoded in Rust, with WebAssembly and Python bindings
sanskrit_parser (Python): available via pip, generates valid Sanskrit padas using Aṣṭādhyāyī rules

These are not historical curiosities. They are working software demonstrating that a 2,500-year-old formal system remains computationally viable.

The Nuance: What's Overstated

Responsible historiography requires acknowledging limits. Some claims about Pāṇini and computer science are exaggerated or false.

The NASA myth: claims that "NASA declared Sanskrit the best language for programming" or "Sanskrit is used at NASA" stem from a 1985 paper by Rick Briggs titled "Knowledge Representation in Sanskrit and Artificial Intelligence." Briggs argued that Sanskrit's grammatical structure could inform AI research on knowledge representation. This is a reasonable technical observation. He never claimed Sanskrit should be a programming language, and NASA has never used Sanskrit operationally. The viral claim is a distortion.

George Cardona's warning: Cardona, a leading scholar of Pāṇinian grammar, cautions against overstating influence:

"As far as I am able to discern upon rereading Saussure's Mémoire, however, it shows no direct influence of Paninian grammar."

The connection between Pāṇini and modern linguistics is atmospheric, not textual. European scholars absorbed ideas from the Sanskrit grammatical tradition without directly porting Pāṇinian techniques.

Rajpopat's 2022 thesis: Rishi Rajpopat's Cambridge PhD generated headlines for "solving a 2,500-year-old puzzle" about rule conflicts in the Aṣṭādhyāyī. The work represents genuine scholarship, but media coverage was sensationalized. Peter Scharf and other Sanskritists raised significant objections to Rajpopat's interpretation. The question of Pāṇinian rule conflict resolution remains contested.

What the evidence supports:

Pāṇini invented a formal grammar independently of Western traditions
The conceptual parallels to BNF are real and documented
Indirect influence on Western linguistics via 19th-century Indology is probable
Both systems are computationally powerful

What the evidence does not support: that Backus knew of Pāṇini, that Chomsky directly studied the Aṣṭādhyāyī, or that "Panini-Backus Form" would accurately imply direct influence.

Implications for Practitioners

Understanding the origins of BNF has practical relevance.

Parser design: knowing that BNF descends from Post's canonical systems (production rules designed for string manipulation) explains why parser generators work as they do. The notation is not arbitrary; it derives from a formalism mathematicians designed specifically for symbolic computation.

Language design: Pāṇini's conflict resolution techniques (the vipratiṣedha rules determining which rule applies when multiple candidates exist) anticipate operator precedence and disambiguation strategies in modern parsers. Designing a language with ambiguous grammar involves the same problems Pāṇini addressed 2,500 years ago.

Historical perspective: writing a grammar places you in a 2,500-year tradition of formally describing language. The notation feels natural because it reflects deep structural patterns that humans discovered independently, across millennia, on different continents.

The parallel invention of calculus by Newton and Leibniz suggests mathematical structures "waiting to be found." The parallel development of formal grammar by Pāṇini and the Post-Chomsky-Backus lineage suggests something similar about linguistic structure itself.

Conclusion

Ideas do not emerge from vacuums.

When Backus specified ALGOL, he drew on Emil Post's work on canonical systems. When Chomsky formalized the hierarchy, he synthesized decades of mathematical logic from Turing, Post, and Thue. Behind all of this stands a 2,500-year tradition of systematic grammar that reached its apex in Pāṇini's Aṣṭādhyāyī.

The 1967 proposal to rename BNF to "Panini-Backus Form" was never adopted. But Ingerman's point stands: "since there is clear evidence that Panini was the earlier independent inventor of the notation," the parallel merits acknowledgment.

When you next write a BNF production or debug a parser, consider that you are using notation independently invented at least twice, millennia apart, by people with no knowledge of each other's work.

What does that say about the structure of language itself? And what else might we rediscover?

Towards Autonomous Edge AI: Local LLM Inference, Efficient Quantization, and Hybrid Memory in Practice

Serendeep Rudraraju — Thu, 23 Oct 2025 20:09:04 GMT

What if your AI worked offline, kept your secrets, and actually remembered you, without ever flinching at a spotty network?

This post moves past the "API-everywhere" playbook. It lays out theory for a practical, fully on-device large language model (LLM) workflow: no cloud, no dependence, full privacy -- built for real-world deployments on low-memory consumer hardware (think 2GB and below). You'll see how quantization (GGML/GGUF), parameter-efficient tuning (QLoRA), and lightweight in-device memory (LightMem) combine into something robust and personal.

TL;DR

Train small LLMs (1-2B params) using QLoRA for efficient low-VRAM fine-tuning, then merge adapters and convert to GGUF for extreme size reduction.
Quantize strategically: Prefer Q4_K_M or Q3_K for sub-2GB operation; adjust --ctx-size (context tokens) to fit your RAM budget.
On-device memory matters: Use LightMem patterns for building meaningful per-device memory (not just context window stuffing).
Stay offline; add sync only if needed, and only as a dumb relay for end-to-end encrypted operations.

Practical code, RAM charts, and pipeline diagrams will follow once benchmarks are complete.

Why Local-first, Why Now?

Consumer devices (phones, small boards, ultraportables) are finally capable of real LLM inference. Recent advances in quantization (see Riddhiman Ghatak 2025; Hugging Face quantization guide), inference libraries (GGML; llama.cpp, Alex Razvant 2025), and rapid storage (GGUF, OriginsHQ) have converged. Meanwhile, on-device memory systems like LightMem (ZJU NLP) and new architectural work (EdgeInfinite, 2025) suggest it's possible to make agents that truly feel consistent while remaining 100% user-sovereign.

flowchart LR
    A[Base Model<br/>1-2B params] --> B[QLoRA<br/>Fine-tuning]
    B --> C[Merge<br/>Adapters]
    C --> D[Convert to<br/>GGUF]
    D --> E[Quantize<br/>Q4_K_M]
    E --> F[Deploy<br/>On-Device]
    F --> G[LightMem<br/>Agent Memory]

Core Stack Overview

Training: QLoRA for Practical, Data-Efficient Fine-Tuning

QLoRA (Quantized Low-Rank Adaptation) has changed fine-tuning economics. It lets you take a 4-bit quantized base model (using NF4 or FP4 quantization) and inject low-rank adapters, adapting powerful LLMs with as little as 6-8GB VRAM even for strong instruction-tuning (see Dettmers et al., 2023). For CPU-only devices, train elsewhere and deploy the merged model.

Tip: Don't skip the merge step before deployment: merging LoRA adapters into the base weights enables fully self-contained quantization downstream.

QLoRA code sketch (Python/HF/PEFT):

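import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training

base_id = "your-compact-1b-2b"  # placeholder: substitute a real model ID
tok = AutoTokenizer.from_pretrained(base_id, use_fast=True)
# Load the base model in 4-bit NF4 for training
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_cfg, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
peft_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_cfg)
# Train, then save only the adapter weights
model.save_pretrained("./my-qlora-adapter")

# Merge into an FP16 copy of the base so GGUF conversion sees plain, unquantized weights
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "./my-qlora-adapter").merge_and_unload()
merged.save_pretrained("./my-qlora-merged")
tok.save_pretrained("./my-qlora-merged")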

Quantization: From Hugging Face to GGUF for Inference

GGUF (GPT Generated Unified Format) is the compact, single-file format that powers llama.cpp and its derivatives (Shekhawat, 2025, Hardware Corner). GGUF supports a variety of quantization "presets," with blockwise mixed precision strategies (Q4_K_M and newer).

Typical workflow:

Convert merged weights to GGUF.
Select quant preset (Q4_K_M: balance, Q5_K_M: quality, Q3_K: smallest RAM).
Optionally, use importance matrix (imatrix/AWQ) for smarter precision allocation.

flowchart TD
    MW[Merged Weights<br/>FP16/BF16] --> CONV[convert_hf_to_gguf.py]
    CONV --> F16[GGUF F16]
    F16 --> IM[imatrix Calibration<br/>optional]
    IM --> Q[llama-quantize]
    F16 --> Q
    Q --> Q4[Q4_K_M<br/>Balanced]
    Q --> Q5[Q5_K_M<br/>Higher Quality]
    Q --> Q3[Q3_K<br/>Smallest]

Example terminal workflow:

python convert_hf_to_gguf.py ./my-qlora-merged \
    --outtype f16 \
    --outfile ./my-qlora-f16.gguf

# Calibrate with domain data (if needed); llama.cpp tools live in build/bin after the CMake build
./build/bin/llama-imatrix -m ./my-qlora-f16.gguf -f ./calibration.txt --chunks 512 -o ./my-qlora.imatrix.dat

# Quantize to Q4_K_M
./build/bin/llama-quantize --imatrix ./my-qlora.imatrix.dat \
  ./my-qlora-f16.gguf \
  ./my-qlora-q4_k_m.gguf \
  Q4_K_M

On 2GB machines, Q4_K_M or Q3_K_M are your best bets. If the model OOMs, reduce --ctx-size or try more aggressive quantization. Q5_K_M is viable if you can spare the memory. See recent practical guides and model cards.

Runtime: Edge Inference on CPU (No Cloud Required)

llama.cpp and similar runtimes let you run GGUF-quantized models on ARM, x86, and more, with CPU kernels optimized for hardware SIMD. Real-world users have shown 2B Q4_K_M models running comfortably in 1.5GB RSS with 8-20 tok/s on modern phone ARM big cores (Running LLMs on Edge Devices: A Step by Step Guide).

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
cmake --build build -j
./build/bin/llama-cli -m ./my-qlora-q4_k_m.gguf --threads 4 --ctx-size 1024 -n 128 -p "..."

Key tuning knobs: minimize --ctx-size, tune --threads to match physical cores, and experiment with memory mapping (--no-mmap, --mlock) depending on your OS.

More than RAG: LightMem for Real Agent Memory

The "memory" problem in local agents has two sides: you want coherence (recall old facts, preferences), but you have tight context and storage budgets. LightMem (ZJU NLP; see the paper) provides a blueprint for local-first, deterministic, and privacy-respecting memory.

flowchart TD
    INT[User Interaction] --> WAL2[WAL Event Log]
    WAL2 --> F[Facts<br/>triples]
    WAL2 --> EP[Episodes<br/>events]
    WAL2 --> SUM[Rolling<br/>Summaries]
    F --> EMB[On-device<br/>Embeddings]
    EP --> EMB
    SUM --> EMB
    EMB --> REC[Recall Scoring]
    REC -->|semantic similarity| SC[Score]
    REC -->|recency decay| SC
    REC -->|thread affinity| SC
    SC --> CTX[Inject into<br/>LLM Context]

How does it work?

Store interactions as WAL (Write-Ahead Log) events: facts (triples), events (episodic), and rolling summaries.
Generate embeddings for memory objects with a small on-device model (see Mnemosyne, 2025 for inspiration).
Efficient recall: combine semantic (embedding similarity), recency (timestamp decay), and thread affinity. This approach mirrors theoretical advances in memory-efficient, human-inspired architectures (EdgeInfinite, 2025, TechXplore, 2025).

Sample recall scoring (TypeScript):

function scoreMemory(sim: number, ts: number, sameThread: boolean, now = Date.now()) {
  const hours = (now - ts) / (3600 * 1000);
  const decay = Math.exp(-hours / 48); // half-life ≈ 33h
  const thread = sameThread ? 1 : 0;
  return 0.7 * sim + 0.2 * decay + 0.1 * thread;
}

Inject just enough context; prefer a strong, concise memory prelude over dumping logs; 5-7 memories of <150 tokens each is the sweet spot.

Deterministic state is key: WAL + pure reducers + guaranteed replay = crash-resistant, migration-friendly memory.
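The WAL-plus-reducer idea fits in a few lines of Python. This is a minimal sketch under stated assumptions: the event kinds (`fact.set`, `episode.add`), the state shape, and the function names are hypothetical, and LightMem's actual storage format may look quite different. The point is only that state is never stored directly; it is recomputed by folding a pure function over the log.

```python
import json
from functools import reduce


def reducer(state: dict, event: dict) -> dict:
    """Pure reducer: builds a new state and never mutates its inputs."""
    kind = event["kind"]
    if kind == "fact.set":
        facts = {**state["facts"], event["key"]: event["value"]}
        return {**state, "facts": facts}
    if kind == "episode.add":
        return {**state, "episodes": state["episodes"] + [event["text"]]}
    return state  # unknown kinds are skipped so older logs still replay


def replay(wal_lines):
    """Rebuild memory state from scratch by folding the reducer over the log."""
    initial = {"facts": {}, "episodes": []}
    return reduce(reducer, (json.loads(line) for line in wal_lines), initial)


# Each WAL entry is one JSON line, appended in order and never edited.
wal = [
    json.dumps({"kind": "fact.set", "key": "name", "value": "Ada"}),
    json.dumps({"kind": "episode.add", "text": "asked about quantization"}),
    json.dumps({"kind": "fact.set", "key": "name", "value": "Ada L."}),
]
state = replay(wal)
print(state)
```

Because `reducer` has no side effects, replaying the same log after a crash or on a new schema version always lands on the same state, which is what makes the memory crash-resistant and migration-friendly.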

Budget Breakdown: What Actually Fits in 2GB?

Practical RAM numbers from multiple recent benchmarks and community reports (Hardware Corner):

Model	Quant	Disk (GB)	Runtime RAM (GB)	Notes
1B	Q4_K_M	0.6-1.0	0.8-1.2	Leaves headroom for embeddings
2B	Q4_K_M	1.1-1.6	1.4-1.9	Stays under 2GB with ctx <= 1536
3B	Q3_K_M	1.6-2.0	2.0-2.4	Pushes limits, may OOM on mobile

Context (ctx) matters: Each token increases KV cache consumption. For <2GB, 1024-1536 tokens is safe.
Vector stores: For <10k embeddings, flat cosine search in float16/PQ works fine (<50MB).
Scheduling: Run voice/ASR/LLM in sequence if doing spoken interfaces.
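For the context point above, a back-of-the-envelope calculation helps when budgeting RAM. The sketch below uses the standard KV cache size formula for a decoder-only Transformer. The model shape (22 layers, 4 KV heads, head dimension 64) is an assumed 1B-class configuration chosen for illustration; read the real values from your model's config, and note that runtimes may add overhead on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """KV cache size: keys and values (2x) per layer, KV head, head dim, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


# Assumed 1B-class shape with an FP16 cache (2 bytes per element).
for ctx in (1024, 1536):
    mib = kv_cache_bytes(n_layers=22, n_kv_heads=4, head_dim=64, ctx_tokens=ctx) / 2**20
    print(f"ctx={ctx}: {mib:.1f} MiB")
# ctx=1024: 22.0 MiB
# ctx=1536: 33.0 MiB
```

The cache grows linearly with context length, so halving --ctx-size halves this term. Models without grouped-query attention have as many KV heads as attention heads, which makes the same context several times more expensive.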

Browser and Cross-Platform Notes

If you prefer browser-native, WebGPU is your path: ONNX Runtime Web, WebLLM (MLC), and custom Wasm backends can work wonders for 0.2-1B models in modern browsers. Always check for navigator.gpu and offer a Wasm fallback.

Security and Privacy

Default: fully offline; no PII leaves the device, ever.
Sensitive memory: Encrypt the memory WAL and facts at rest, with keys held in the OS keystore.
Sync (if used): E2E encrypt ops, not state; the relay can be dumb and untrusted.
Determinism: Seeded randomness, WAL replay, pure functional reductions.

Practical Workflow

Pick your base model: TinyLlama, Qwen, Phi, or Gemma class (1-2B params).
Fine-tune with QLoRA: Optimize with NF4/FP4, low-rank adapters.
Merge, convert to GGUF.
Quantize (Q4_K_M as your baseline); test context window at 1024-1536.
Bundle in LightMem-style memory ops with WAL persistence and on-device embeddings.
Deploy and test real-world speed, RAM, and stability; tune as needed.

Closing Thoughts

Offline LLMs are no longer theoretical. By combining QLoRA's tuning efficiency, GGUF's quantization, and LightMem's structured memory, developers can ship coherent, private AI on smartphones, tablets, and edge hardware. Detailed benchmarks, complete templates, and RAM flame graphs are coming in the follow-up.

When your AI works where you are, even offline, that's sovereignty over your own tools.

Building Truly Offline Apps Isn’t Magic: Local-First with Next.js, Flutter, and On-Device AI

Serendeep Rudraraju — Wed, 08 Oct 2025 15:38:36 GMT

Starting a new app, most of us wire up the backend before we even sketch the core screens -- then hope our offline mode "just works later." It rarely does. If you've ever patched cache misses, handled sync bugs, or watched your app freeze when Wi-Fi drops, you know why local-first development matters.

I decided to do it right this time, from the start. Here's how I set up local-first in Next.js and Flutter, built a sync layer that's optional (not required), and ran Gemma 3n for offline inference right on the device.

What Does "Local-First" Actually Mean?

Data on the device is always the source of truth. Your UI only ever talks to local storage. Everything else -- cloud sync, cross-device sharing, even AI inference -- is an add-on. If users lose connection, the app keeps working. If sync fails, data never disappears. If a user wants AI features, they get them instantly and privately, no backend calls.

Core requirements:

All app logic must work 100% offline (CRUD, search, inference)
Every mutation logs as an immutable operation (WAL pattern)
Deterministic reducers merge concurrent edits
Schema migrations never require cloud connectivity
Optional transport sync, but functionally no dependency on it

flowchart TD
    UI[UI Layer] --> LS[Local Storage]
    LS --> WAL[Write-Ahead Log]
    WAL --> R[Deterministic Reducer]
    R --> DS[Document Store]
    DS --> UI
    LS -.->|optional| SY[Sync Layer]
    SY -.-> RS[Remote Server]
    RS -.-> SY
    SY -.-> WAL

Next.js Implementation

1. Local Storage: IndexedDB

I use IndexedDB via Dexie. There are two stores:

ops: Write-ahead log (append-only)
docs: Materialized current state

2. Reducer Pattern

Every change is an operation object, appended to WAL, then folded with a deterministic reducer to update docs.

// IndexedDB setup
import Dexie from "dexie";
const db = new Dexie("localfirst");
db.version(1).stores({
  ops: "++id,op_id,lamport_ts",
  docs: "[coll+id],lamport_ts"
});

// Example op
const op = {
  op_id: crypto.randomUUID(),
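  // Wall-clock stand-in: a real Lamport timestamp is a logical counter,
  // incremented per local op and advanced past any remote timestamp seen.
  lamport_ts: Date.now(),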
  actor: "device_123",
  kind: "todo.edit",
  payload: { id: "a", title: "updated" }
};
await db.ops.add(op);
// Then update docs via reducer

Reducers must be deterministic and order-independent where possible. For basic docs, I use "last writer wins" with per-field timestamps; for lists, I use CRDT sequences when I want rich merges.
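
Here is a minimal sketch of the per-field last-writer-wins idea. The doc shape (`fields`, `stamps`) and the `applyOp` and `newer` names are assumptions for illustration, not the code from my app. Ties on `lamport_ts` break on `actor`, so every replica picks the same winner.

```javascript
// Hypothetical doc shape: { fields: {...}, stamps: { field: { ts, actor } } }
function newer(a, b) {
  // True when stamp a should win over stamp b (b may be undefined).
  if (!b) return true;
  if (a.ts !== b.ts) return a.ts > b.ts;
  return a.actor > b.actor; // deterministic tie-break
}

function applyOp(doc, op) {
  // Copy instead of mutating so the reducer stays a pure function.
  const next = { fields: { ...doc.fields }, stamps: { ...doc.stamps } };
  const stamp = { ts: op.lamport_ts, actor: op.actor };
  for (const [field, value] of Object.entries(op.payload)) {
    if (newer(stamp, next.stamps[field])) {
      next.fields[field] = value;
      next.stamps[field] = stamp;
    }
  }
  return next;
}

const empty = { fields: {}, stamps: {} };
const opA = { lamport_ts: 1, actor: "device_123", payload: { title: "draft", done: false } };
const opB = { lamport_ts: 2, actor: "device_456", payload: { title: "updated" } };

// Folding in either order yields the same materialized doc.
const ab = [opA, opB].reduce(applyOp, empty);
const ba = [opB, opA].reduce(applyOp, empty);
console.log(JSON.stringify(ab.fields) === JSON.stringify(ba.fields)); // true
```

Because each field carries its own stamp, an older op can still fill in fields the newer op never touched, which is what keeps the result independent of arrival order.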

3. Real-Time Reactivity

Hooks subscribe to batched WAL writes. Queries read straight from docs, so the UI is always current and responsive.

4. Schema Migrations

On version bump, replay ops from WAL, transform with new reducer, write updated docs, and checkpoint the migration to avoid repeated work.
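
The replay-and-checkpoint flow can be sketched with in-memory stand-ins. The `store` object, `migrate`, and `reducerV2` are hypothetical names, and a real version would run inside an IndexedDB transaction rather than against plain arrays and Maps.

```javascript
// In-memory stand-ins for the ops store, docs store, and checkpoint record (all hypothetical).
function migrate(store, targetVersion, reducer) {
  if (store.checkpoint >= targetVersion) return false; // already migrated, skip
  const docs = new Map();
  // Replay in a stable order so every device rebuilds identical docs.
  const ordered = [...store.ops].sort(
    (a, b) => a.lamport_ts - b.lamport_ts || a.op_id.localeCompare(b.op_id)
  );
  for (const op of ordered) {
    const key = op.payload.id;
    docs.set(key, reducer(docs.get(key), op));
  }
  store.docs = docs;
  store.checkpoint = targetVersion; // record completion last
  return true;
}

// v2 reducer: same ops as before, but titles are now stored trimmed.
const reducerV2 = (doc = {}, op) => ({
  ...doc,
  ...op.payload,
  title: (op.payload.title ?? doc.title ?? "").trim(),
});

const store = {
  checkpoint: 1,
  docs: new Map(),
  ops: [
    { op_id: "b", lamport_ts: 2, payload: { id: "a", title: " updated " } },
    { op_id: "a", lamport_ts: 1, payload: { id: "a", title: "first" } },
  ],
};

console.log(migrate(store, 2, reducerV2)); // true
console.log(store.docs.get("a").title);    // "updated"
console.log(migrate(store, 2, reducerV2)); // false
```

Writing the checkpoint last means a crash mid-migration simply reruns the replay on next launch.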

Flutter Implementation

1. Local Storage: Isar

On mobile, Isar fits perfectly:

Op and Doc collections
ACID transactions, fast lookups
Can migrate by replaying WAL and updating docs as needed

2. Operations and Reducers

Each user interaction generates a serialized Op. Reducers update the current Doc record for that entity.

@collection
class Op {
  Id id = Isar.autoIncrement;
  late String opId;
  late int lamportTs;
  late String kind;
  late String payloadJson; // JSON-encoded payload; Isar collections cannot store Map fields
}

Reducer functions match op kind and materialize fields, using per-field timestamps for LWW or sequence CRDT logic for ordered updates.

3. UI and Isolate Sync

UI code subscribes to changes in docs via Isar queries. WAL writes and state reduction can run in a background isolate for performance, so rendering is never blocked by DB work.

flowchart LR
    subgraph Main Isolate
        UI2[UI] --> Q[Isar Query Stream]
        Q --> UI2
    end
    subgraph Background Isolate
        W[WAL Writer] --> RED[Reducer]
        RED --> DOC[Doc Store]
    end
    DOC -.-> Q

Optional Sync Layer

If the user has multiple devices or wants backup, sync comes into play. My rule: if sync fails, users never notice. Everything keeps working.

How it works at a glance:

Client advertises vector clock of known ops
Pushes new ops in batches when online
Pulls missing ops from server (or other devices)
Server is just an append-only relay; no state reconciliation required

All actual "merging" is local -- code assumes nothing about server order or reliability.
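
To make the vector clock exchange concrete, here is a small sketch. It assumes each op carries a per-actor sequence number `seq`, which is not part of the op shape shown earlier, and that each actor's ops arrive without gaps. `vectorClock` and `missingFor` are illustrative names.

```javascript
// Build a vector clock: for each actor, the highest seq this replica holds.
// Assumes a hypothetical per-actor `seq` field with no gaps.
function vectorClock(ops) {
  const clock = {};
  for (const op of ops) {
    clock[op.actor] = Math.max(clock[op.actor] ?? 0, op.seq);
  }
  return clock;
}

// Ops the peer has not seen, given the clock it advertised.
function missingFor(known, ops) {
  return ops.filter((op) => op.seq > (known[op.actor] ?? 0));
}

const phoneOps = [
  { actor: "phone", seq: 1 },
  { actor: "phone", seq: 2 },
  { actor: "laptop", seq: 1 },
];
const relayOps = [
  { actor: "phone", seq: 1 },
  { actor: "laptop", seq: 1 },
  { actor: "laptop", seq: 2 },
];

const known = vectorClock(phoneOps); // { phone: 2, laptop: 1 }
console.log(missingFor(known, relayOps));                 // [ { actor: 'laptop', seq: 2 } ]
console.log(missingFor(vectorClock(relayOps), phoneOps)); // [ { actor: 'phone', seq: 2 } ]
```

The same two functions cover both directions: the relay uses the client's clock to decide what to send down, and the client uses the relay's clock to decide what to push up.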

Example Sync Client (Next.js)

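async function sync(endpoint: string) {
  const known = await indexedOpsVectorClock();
  const localOps = await getUnsentOps();
  const headers = { "Content-Type": "application/json" };

  // Push local ops; a failed push just leaves them unsent for the next attempt
  const pushRes = await fetch(endpoint + "/push", { method: "POST", headers, body: JSON.stringify(localOps) });
  if (!pushRes.ok) return;

  // Pull remote ops this client has not seen yet
  const res = await fetch(endpoint + "/pull", { method: "POST", headers, body: JSON.stringify({ known }) });
  if (!res.ok) return;
  const remoteOps = await res.json();
  // Append to WAL, replay as usual
}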

It's as stateless and boring as possible. No conflict dialogs, no server arbitration.
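
One detail worth sketching is the "append to WAL" step: a dumb relay may hand back ops the client already has, so the append should be idempotent. The `appendRemote` helper and the plain-array `wal` below are stand-ins I made up for illustration. The real store would be the ops table.

```javascript
// Append remote ops to a WAL, skipping any op_id already present.
// `wal` is a plain array standing in for the ops store (hypothetical).
function appendRemote(wal, remoteOps) {
  const seen = new Set(wal.map((op) => op.op_id));
  const added = [];
  for (const op of remoteOps) {
    if (seen.has(op.op_id)) continue; // duplicate delivery, ignore
    seen.add(op.op_id);
    wal.push(op);
    added.push(op);
  }
  return added; // only these need to go through the reducer
}

const wal = [{ op_id: "op-1", kind: "todo.edit" }];
const batch = [
  { op_id: "op-1", kind: "todo.edit" }, // already known locally
  { op_id: "op-2", kind: "todo.edit" },
];

console.log(appendRemote(wal, batch).length); // 1
console.log(appendRemote(wal, batch).length); // 0
console.log(wal.length);                      // 2
```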

Edge Inference with Gemma 3n

Most "AI features" these days are just SaaS vendors piping your data through a remote API. Here, the model runs on the device itself.

Packing Gemma 3n

For Next.js: Quantize the model, bundle it or fetch as-needed, load with ONNX Runtime for Web (WebGPU if possible).
For Flutter: Quantized .tflite file, run with TFLite Flutter plugin using NNAPI/CoreML delegates.

Workflow:

Download or bundle the model
Tokenize inputs locally
Run inference call directly on-device
Use result as part of normal app flow (summarizing, recommending, etc.)

Web Example:

import { InferenceSession } from "onnxruntime-web";
const session = await InferenceSession.create("gemma3n_quant.onnx");
// Prepare input tensors and call session.run()

Flutter Example:

final interpreter = await Interpreter.fromAsset("gemma3n_quant.tflite");
interpreter.run(input, output); // run() returns nothing; results are written into output
// Use output as needed

No network calls, no PII leaves the device, and inference latency no longer depends on network conditions.

Testing and Reliability

With local-first, you get new test strategies:

Use property-based tests to shuffle ops, replay on multiple platforms, assert convergence
Simulate random crashes (write half an op, crash, restart and verify state)
Schema migrations are tested by replaying millions of WAL steps with generated data
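
A convergence check along these lines can be sketched without any test framework. Everything here is illustrative: the whole-doc `lww` reducer, the op shape, and the seeded generator (a mulberry32-style PRNG, used so a failing shuffle can be reproduced from its seed).

```javascript
// Tiny LWW reducer over whole-doc writes; ties break on actor (assumed op shape).
function lww(doc, op) {
  const wins =
    !doc ||
    op.lamport_ts > doc.ts ||
    (op.lamport_ts === doc.ts && op.actor > doc.actor);
  return wins ? { ts: op.lamport_ts, actor: op.actor, title: op.title } : doc;
}

// Small seeded PRNG so every shuffle is reproducible.
function rng(seed) {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator.
function shuffle(items, random) {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

const ops = [
  { lamport_ts: 1, actor: "a", title: "one" },
  { lamport_ts: 2, actor: "b", title: "two" },
  { lamport_ts: 2, actor: "c", title: "three" }, // same timestamp as the op above
  { lamport_ts: 1, actor: "d", title: "four" },
];

const expected = JSON.stringify(ops.reduce(lww, undefined));
let converged = true;
for (let seed = 1; seed <= 200; seed++) {
  const result = shuffle(ops, rng(seed)).reduce(lww, undefined);
  if (JSON.stringify(result) !== expected) converged = false;
}
console.log(converged); // true
```

A property-based testing library would add generated ops and shrinking, but the core assertion stays the same: every ordering folds to one state.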

Conclusion

Local-first requires building your logic, storage, and even AI features around the principle that the user's device always comes first. Sync, sharing, and cloud inference are all useful -- but never required.

With Next.js, Flutter, optional WAL replication, and device-resident Gemma 3n, you can build apps that:

Don't lose data
Never block the UI
Run AI models offline
Sync only as a convenience

If you're tired of patching last-minute offline bugs or watching your AI features give up at the wrong time, building this way makes your app (and your weekends) much more dependable.

5 min read | Read on the blog | Buy me a coffee

Flutter, But Organized: A Starter Template That Won’t Make You Cry in Debug

Serendeep Rudraraju — Sat, 05 Jul 2025 20:06:26 GMT

Starting a new Flutter project can get messy, fast. You want to build something cool, but before you know it, you're knee-deep in tangled folders, mystery bugs, and "wait, where does this go?" headaches.

I know because I've been there. My first Flutter starter template started out with the best intentions, but after a few rounds of "just one more feature" and some quick fixes, it turned into a cluttered, unmaintainable mess. Every change made things worse. Eventually, I realized it would be easier (and way less stressful) to start from scratch than to keep pulling my hair out trying to fix that chaos.

So I did. And this time, I built it right.

Meet my new Flutter Starter Template, designed to keep your code clean, your state predictable, and your sanity intact.

Why Did I Need a New Starter?

My old setup was a mess:

Folders everywhere. UI, logic, data: all mixed up like spaghetti.
State all over the place. Sometimes Bloc, sometimes setState, sometimes... who knows.
Navigation roulette. Accidentally landing on the wrong page? Been there.
Networking pain. Error handling? Interceptors? Not even close.
No offline support. Lose Wi-Fi, lose your app.
Auth? JWT, but duct-taped together.
Environments? Changing from dev to prod meant hunting through files.
Testing? "It works on my machine" isn't a test.

I wanted something better. Here's what I built.

What's in the Box?

Clean Architecture: No More Spaghetti

Data, Domain, UI: each in its own lane. No more "where does this go?" Just clear, logical structure.

graph LR
    A[UI Layer] --> B[Domain Layer]
    B --> C[Data Layer]
    C --> D[(Local Storage)]
    C --> E[(Remote API)]

State Management: Bloc, Done Right

Predictable, testable, and no more "why did my widget just rebuild?" moments. Bloc keeps your state where it belongs.

Navigation: GoRouter + Route Guards

No more "Oops, wrong page!" surprises. GoRouter handles your routes, guards keep your users where they should be.

Networking: Dio with Superpowers

Custom interceptors, centralized error handling, and a single place to tweak your API calls. No more copy-paste code.

Offline-First Storage: Hive

Your app works even when Wi-Fi ghosts you. Data stays local, syncs when it can.

Auth Flow: JWT, Locked Down

Multi-Environment Setup: Flip a Flag

Dev, Staging, Prod: switch with a single flag. No more hunting through config files.

Theme System: Material 3, Light-Blue, Dark Mode

Modern look, easy to tweak, and dark mode built in.

Mock vs Real APIs: One Toggle

Switch from dev stubs to live data with a single toggle. Perfect for testing.

CI/CD Ready: GitHub Actions + Pre-Commit Hooks

Linting, formatting, and tests run automatically. Your code stays clean, your builds stay green.

Testing Suite: Unit, Widget, Integration

Tests out of the box. No more "I'll add tests later" guilt.

Responsive by Default

Phones, tablets, web: your app just works everywhere.

Detailed Docs: Like a Lego Set

Step-by-step docs walk you through everything. No guesswork, just building.

The Unconventional Bit: Why Is There a `package.json` in My Flutter Project?

If you're used to Dart and Flutter, you probably expect to see pubspec.yaml, not package.json. But open up my Flutter Starter Template, and there it is, right next to your pubspec.yaml and all the usual suspects.

So... why?

Because Flutter Devs Deserve Good Tooling, Too

The Dart ecosystem is great, but when it comes to developer tooling, especially around automation, hooks, and code quality, JavaScript has been playing this game a lot longer. There's a whole world of tools out there that just work better (or only work) with Node.

So I thought: why not steal the best parts?

What Does `package.json` Actually Do Here?

This has nothing to do with running JavaScript code in your Flutter app. It's about making your development workflow smoother, faster, and less painful.

Here's what you get:

Pre-commit hooks with Husky: Make sure your code is linted, formatted, and tested before every commit. Husky + Node scripts make it easy.
Commit message linting with Commitlint: Enforce conventional commit messages, so your git history is clean and readable.
Scriptable automation: Need to install hooks, run code generation, or clean up your workspace? Just run npm run setup or any of the other handy scripts.
CI/CD glue: Some CI tools expect a package.json for running scripts, even if your main codebase is Dart.

Example: The Power of Scripts

Check out these scripts from the template:

"scripts": {
  "setup": "flutter pub get && flutter pub run build_runner build --delete-conflicting-outputs && npm run hooks:install",
  "build:dev": "flutter build apk --flavor dev -t lib/main_dev.dart",
  "test": "flutter test",
  "lint": "flutter analyze && dart format --output=none --set-exit-if-changed .",
  "pre-commit": ".githooks/pre-commit",
  "hooks:install": "node scripts/setup-hooks.js"
}

You get one-liners for everything: setup, build, test, lint, install hooks. No more copy-pasting long commands or forgetting steps.

Why Not Just Use Dart for This?

You could, but:

The Node ecosystem has years of battle-testing for this kind of workflow automation.
Tools like Husky and Commitlint are the standard for git hooks and commit hygiene.
It's cross-platform and works out of the box for most devs (since Node is everywhere).

Does This Make My Flutter App a Node App?

Nope. Your app is still 100% Dart and Flutter. package.json is just there to make your development life easier. When you build your app, none of this ships to your users.

The Bottom Line

My first starter got so cluttered that fixing it felt impossible. So I started over, and this time, I built the kind of template I wish I'd had from the start: clean, well-structured, and with a few unconventional tricks (like package.json) to make your workflow smoother.

Want to see how it all fits together? Check out the repo and peek at the scripts.

Got Thoughts? Open an Issue!

No comments section here, but I'd love to hear what you think. Open an issue on GitHub if you have ideas, questions, or just want to rant about unconventional tooling.

Happy coding.

5 min read | Read on the blog | Buy me a coffee

Cognitive Debt: Or, How I Forgot Why My Code Works

Serendeep Rudraraju — Fri, 20 Jun 2025 16:36:57 GMT

I've been "vibe coding" for a while now. You know the drill—smash out a solution, copy-paste a Stack Overflow snippet, maybe ask ChatGPT to sprinkle some Markdown magic on my blog post, and call it a day. But recently, I stumbled across an article that finally put a name to the weird, nagging feeling I've had for ages: Cognitive Debt.

What Even Is Cognitive Debt?

"Cognitive Debt is where you forgo the thinking in order just to get the answers, but have no real idea of why the answers are what they are."
-- Artefacts Newsletter 247

Translation: you get the answer, but your brain is basically on airplane mode. You can ship code, write essays, and even pass interviews, but ask yourself why something works, and you're left staring into the existential void.

This isn't just me being dramatic, either. The folks at MIT Media Lab actually studied this in Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. The finding: using AI assistants is great for productivity, but it's like eating only instant noodles—quick, easy, but not exactly nourishing for your brain. The more you rely on AI for writing and critical thinking, the more your own skills start to atrophy. Oops.

graph LR
    A[Problem] --> B{Use AI?}
    B -->|Yes, every time| C[Quick Answer]
    C --> D[Skill Atrophy]
    D --> E[More AI Dependence]
    E --> B
    B -->|Struggle first| F[Slower Answer]
    F --> G[Deeper Understanding]
    G --> H[Stronger Skills]
    H --> I[Smarter AI Use]

How Much Cognitive Debt Am I In?

Short answer: probably a lot. I've been racking up cognitive debt for everything from "how do I center a div" to "why does this distributed system not explode." It's easy to justify—deadlines, context switching, or just plain laziness. But now that I know there's a name for it, I can't unsee it.

So, How Do I Pay It Back?

Here's the irony: my first instinct was to ask ChatGPT how to pay back cognitive debt. But that feels like asking a credit card company how to get out of debt. So, I'm going to do something wild: actually think about it, and document my process here.

My Not-So-Scientific Plan

Notice the Debt:
Catch myself when I'm about to autopilot through a problem. Pause. Ask "do I actually get this?"
Write It Down:
Blog about what I'm learning, what I'm unlearning, and all the dumb mistakes in between.
Struggle On Purpose:
Try to solve stuff on my own before running to AI for help. (Yes, this will be painful.)
Revisit Old Solutions:
Go back to code I wrote months ago and see if I can explain it to my past self without crying.
Use AI as a Tool, Not a Crutch:
Let AI help me brainstorm or double-check, but not do the heavy lifting for me.

Baby Steps

For example, I usually write a blog post and then ask ChatGPT 4.1 to convert it to Markdown. Doesn't sound like much, but these little dependencies add up. Time to start weaning myself off—one blog post at a time.

I'll be documenting this whole experiment here. If you're also drowning in cognitive debt, or just want to watch me stumble through this, stick around. Maybe we'll figure it out together.

Date: 20-06-2025

Did what I know best—rolled up my sleeves and started debugging. I spent the last week building a new Flutter starter template. I'd made one before, but it got so cluttered and impossible to work with that starting from scratch just made more sense. (If you want the full story, read the blog post here.)

This time, I didn't just "vibe code" the whole thing. I used Claude to help sketch out the folder and file structure, and leaned on it a bit for setting up CI/CD workflows and git hooks.

And for something a little unconventional: I used npm (actually pnpm, because I like the way it works better, but don't tell anyone) to help run and set up the project. It made things way easier than I expected.

Date: 05-07-2025

4 min read | Read on the blog | Buy me a coffee

JavaScript Dates Don’t Have to Be a Nightmare Anymore: Meet Temporal

Serendeep Rudraraju — Thu, 29 May 2025 16:03:56 GMT

Working with JavaScript's built-in Date object has always felt like defusing a bomb with oven mitts on. Weird bugs. Time zone confusion. Dates that change when you least expect it. If you've ever tried to build anything with dates in JS, you know exactly what I mean.

The good news: the Temporal API is here to fix this mess. It's a new way to handle dates and times in JavaScript—one that actually makes sense.

Why Did We Need Something New?

The old Date object has been around since the language's earliest days, but it's full of gotchas:

Dates that mutate themselves.
You create a date, then—surprise!—it gets changed somewhere else in your code because Date objects are mutable.
Time zone headaches.
Only supports UTC and your local time. Good luck building a global app with just those two options.
Parsing roulette.
Different browsers, different results. Sometimes the same date string means different things depending on which engine runs it.
Locked to one calendar.
Only Gregorian. Not useful if your users need Japanese, Islamic, or Hebrew calendars.
Millisecond ceiling.
Milliseconds are fine for most things... until you need more detail.
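
One concrete instance of the UTC-versus-local confusion, using only the standard `Date` API: an ISO date-only string is parsed as UTC midnight, while a date-time string with no offset is parsed as local time. The checks below hold in any time zone.

```javascript
// Date-only ISO strings are parsed as UTC midnight...
const dateOnly = new Date("2025-05-29");
console.log(dateOnly.getTime() === Date.UTC(2025, 4, 29)); // true

// ...while a date-time string with no offset is parsed as local time.
const dateTime = new Date("2025-05-29T00:00:00");
console.log(dateTime.getTime() === new Date(2025, 4, 29).getTime()); // true

// The two differ by the machine's UTC offset, so local getters can
// report May 28 for `dateOnly` on machines west of UTC.
const offsetMinutes = (dateTime.getTime() - dateOnly.getTime()) / 60000;
console.log(offsetMinutes === dateTime.getTimezoneOffset()); // true
```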

Temporal addresses each of these directly:

Immutable objects.
Temporal objects never change once created. Every operation returns a new instance.
Full time zone support.
Handles all IANA time zones. Daylight saving transitions? Covered.
Consistent parsing.
Follows ISO 8601, so you get the same result everywhere.
Multiple calendar systems.
Japanese, Islamic, Hebrew—built right in.
Nanosecond precision.
For when you need to get really, really specific.

What's in the Temporal Toolbox?

Temporal introduces a whole set of new types, each designed for a specific job:

graph TD
    Temporal[Temporal API] --> PlainDate["PlainDate\n(just a date)"]
    Temporal --> PlainTime["PlainTime\n(just a time)"]
    Temporal --> PlainDateTime["PlainDateTime\n(date + time, no zone)"]
    Temporal --> ZonedDateTime["ZonedDateTime\n(date + time + zone)"]
    Temporal --> Instant["Instant\n(exact UTC moment)"]
    Temporal --> Duration["Duration\n(length of time)"]
    Temporal --> Now["Now\n(current date/time)"]

    style ZonedDateTime fill:#2d5a27,stroke:#4a9,color:#fff
    style Instant fill:#2d4a5a,stroke:#49a,color:#fff

Temporal.PlainDate
Just a date. No time, no time zone. Like "2025-05-29."
Temporal.PlainTime
Just a time. No date, no time zone. Like "17:30:00."
Temporal.PlainDateTime
Date and time, but still no time zone. Like "2025-05-29T17:30:00."
Temporal.ZonedDateTime
The workhorse. Date, time, and time zone all together. Like "2025-05-29T14:00:00-07:00[America/Los_Angeles]."
Temporal.Instant
An exact moment in time, always in UTC.
Temporal.Duration
A length of time. "3 days," "5 hours," "30 minutes"—that kind of thing.
Temporal.Now
Quick way to get the current date and time.

What Makes Temporal So Much Better?

Immutability: Dates That Don't Change Behind Your Back

With the old Date, you could accidentally change a date in one place and break something somewhere else. I've done it. You've probably done it too.

With Temporal, that can't happen. Every change gives you a new object. The original stays put.

Example:

// Old Date - Mutable
let eventDate = new Date('2025-12-24T10:00:00Z');
let anotherRef = eventDate;
anotherRef.setUTCHours(12); // Oops, eventDate changed too!
console.log(eventDate.toISOString()); // "2025-12-24T12:00:00.000Z"

Now with Temporal:

// Temporal - Immutable
const eventDateTime = Temporal.Instant.from('2025-12-24T10:00:00Z');
const laterEventTime = eventDateTime.add({ hours: 2 });
console.log(eventDateTime.toString()); // "2025-12-24T10:00:00Z"
console.log(laterEventTime.toString()); // "2025-12-24T12:00:00Z"

No more accidental mutations. The original stays exactly as you created it.

Time Zones: No More Guesswork

Time zones are the worst part of date handling. With the old Date, you're stuck with UTC or local time. Temporal gives you two tools:

Temporal.Instant
A single, exact moment in time. Always UTC.
Temporal.ZonedDateTime
That same moment, but in a specific time zone. Handles daylight saving and all the weird edge cases.

Example:

const moment = Temporal.Instant.from('2025-11-02T05:30:00Z');

// New York time
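const nyTime = moment.toZonedDateTimeISO('America/New_York');
console.log(nyTime.toString());
// "2025-11-02T01:30:00-04:00[America/New_York]" (still EDT; clocks fall back 30 minutes later)

// London time
const londonTime = moment.toZonedDateTimeISO('Europe/London');
console.log(londonTime.toString());
// "2025-11-02T05:30:00+00:00[Europe/London]"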

No more guessing. Temporal figures out daylight saving for you.

Calendars: Beyond Gregorian

Not everyone uses the Gregorian calendar. Temporal lets you work with others, like Japanese or Hebrew.

Example:

const gregorianDate = Temporal.PlainDate.from('2025-05-29');
console.log(gregorianDate.calendarId); // "iso8601"

// Japanese calendar
const reiwaDate = new Temporal.PlainDate(2025, 5, 29, 'japanese');
// Output depends on your environment

If your app needs to show dates in different calendars, Temporal handles it natively.

Nanosecond Precision: For When Milliseconds Aren't Enough

Sometimes, milliseconds just don't cut it. Temporal gives you nanosecond precision—useful for things like financial trades or scientific data where every fraction counts.

Real-World Examples

Here's how you'd use Temporal in practice.

Get the current date and time (with time zone):

const now = Temporal.Now.zonedDateTimeISO();
console.log(now.toString());
// "2025-05-29T17:46:23.123456789-07:00[America/Los_Angeles]"

Get the current instant (UTC):

const nowInstant = Temporal.Now.instant();
console.log(nowInstant.toString());
// "2025-05-30T00:46:23.123456789Z"

Add time to a date (immutably):

const date = Temporal.PlainDate.from('2025-05-29');
const newDate = date.add({ days: 5, months: 1 });
console.log(date.toString()); // "2025-05-29"
console.log(newDate.toString()); // "2025-07-04" (months are added before days)

Work with time zones:

const appointmentNY = Temporal.ZonedDateTime.from({
  year: 2025,
  month: 11,
  day: 5,
  hour: 10,
  timeZone: 'America/New_York'
});
console.log(`Appointment in NY: ${appointmentNY.toString()}`);

const appointmentBerlin = appointmentNY.withTimeZone('Europe/Berlin');
console.log(`Same appointment in Berlin: ${appointmentBerlin.toString()}`);

Calculate durations:

const start = Temporal.ZonedDateTime.from('2025-01-15T10:00:00[Europe/London]');
const end = Temporal.ZonedDateTime.from('2025-03-20T14:30:00[Europe/London]');
const duration = end.since(start, { largestUnit: 'month' });
console.log(duration.toString()); // "P2M5DT4H30M"
console.log(`Months: ${duration.months}`);
console.log(`Days: ${duration.days}`);

Want to Try Temporal? Here's How

As of May 2025, not every browser supports Temporal yet. But you don't have to wait. There's a polyfill you can use right now.

Install it:
```
npm install @js-temporal/polyfill
```

Import it at the top of your JS file:

import { Temporal } from '@js-temporal/polyfill';

That's it. Now you can use Temporal everywhere.

Thinking in Temporal: A Few Tips

Switching to Temporal means changing how you think about dates and times:

Be specific.
Are you working with a date, a time, a time zone, or an exact instant? Pick the right type for each case.
Trust immutability.
Your objects won't change out from under you.
Use time zones for anything users see.
Store instants for tracking events internally.
Use durations for math.
No more fiddling with milliseconds and manual arithmetic.

Want to Learn More?

Check these out:

MDN Web Docs -- Great docs and examples.
TC39 Proposal -- For the nitty-gritty details.
Temporal Cookbook -- Real-world recipes and tips.

I've been burned by JavaScript dates more times than I can count. Temporal finally makes working with dates feel sane. Give it a try. Your future self will thank you.

5 min read | Read on the blog | Buy me a coffee

Installing Flutter on Arch: A Choose-Your-Own-Adventure Saga

Serendeep Rudraraju — Sat, 10 May 2025 09:18:33 GMT

Installing Flutter on Arch Linux is one of those tasks that sounds straightforward until you actually try it. The internet is overflowing with guides, but most are either fossilized relics from 2019 or mysteriously don't work because Flutter evolves faster than anyone can keep up with.

If you're using bash, you'll find plenty of step-by-step resources. But if you're a fish shell user? Good luck—most guides are so outdated, you'd think they were written on papyrus.

Fear not. Here are the three main ways to get Flutter up and running on Arch, ranked from "grandma-friendly" to "why am I doing this to myself?":

The Three Paths to Flutter Enlightenment

The Easy Way:
Download Android Studio and let it do the heavy lifting. For everything else, consult Stack Overflow and hope for the best. It's like hiring movers instead of carrying the couch yourself.
The Weird Way:
Install Flutter from the AUR. Prepare for a wild ride of dependency errors, cryptic logs, and the existential dread of "why doesn't this work for me?" I never got this to work reliably, but maybe you're luckier (or braver).
The Reliable Way (My Way):
This is the method I trust—tried, tested, and works 10/10 times. If you want a setup that won't break every time Flutter sneezes out a new update, read on.

flowchart LR
    A[Install Flutter on Arch] --> B[Easy Way\nAndroid Studio]
    A --> C[Weird Way\nAUR Package]
    A --> D[Reliable Way\nManual SDK Setup]
    B --> E[GUI does everything]
    C --> F[Dependency roulette]
    D --> G[Full control,\nactually works]

Prerequisites: The Boring but Necessary Stuff

I roll with Java 21 because my apps like living on the edge with the latest Flutter and Gradle. Check the Gradle compatibility matrix if you're feeling nerdy.

Install Java 21 (jdk21-openjdk lives in the official repos, so yay or pacman both work):

yay -S jdk21-openjdk
archlinux-java set java-21-openjdk

No need to set JAVA_HOME by hand: archlinux-java already makes this JDK the system default. Move along.

Getting Started: The Step-by-Step

Step 1: Download the Flutter SDK

Head to the official Flutter install page and grab the latest SDK.

Flutter suggests a ~/development folder, but I prefer ~/Android—it's neater, and doubles as my ANDROID_HOME.

Extract the SDK like so:

mkdir -p ~/Android
tar -xf ~/Downloads/flutter_linux_*-stable.tar.xz -C ~/Android/

Step 2: Command Line Tools—Because GUIs Are for Mortals

Download the Android command line tools.

Extract them into this oddly specific directory:

mkdir -p ~/Android/cmdline-tools/latest
unzip ~/Downloads/commandlinetools-linux-*.zip -d /tmp/android_cmdline_temp
mv /tmp/android_cmdline_temp/cmdline-tools/* ~/Android/cmdline-tools/latest/
rm -rf /tmp/android_cmdline_temp

Yes, the directory structure is weird. No, I don't make the rules.

Step 3: Set Up Your Shell Profile

For fish users (the cool kids):

# Set the ANDROID_HOME environment variable
set -gx ANDROID_HOME "$HOME/Android"

# Add Android-related directories to the PATH
set -gx PATH "$ANDROID_HOME/flutter/bin" $PATH
set -gx PATH "$ANDROID_HOME/cmdline-tools/latest/bin" $PATH
set -gx PATH "$ANDROID_HOME/platform-tools" $PATH
set -gx PATH "$ANDROID_HOME/emulator" $PATH

For bash/zsh users (the classics):

export ANDROID_HOME="$HOME/Android"
export PATH="$ANDROID_HOME/flutter/bin:$PATH"
export PATH="$ANDROID_HOME/cmdline-tools/latest/bin:$PATH"
export PATH="$ANDROID_HOME/platform-tools:$PATH"
export PATH="$ANDROID_HOME/emulator:$PATH"

Pro tip: Make sure you're not accidentally overwriting your $PATH—append/prepend as needed!

Step 4: Relaunch and Test

Restart your terminal so your shell can soak in those new variables.

Install the essential Android components (grab a coffee, this takes a while):

yes | sdkmanager \
  "platform-tools" \
  "emulator" \
  "platforms;android-35" \
  "build-tools;35.0.0" \
  "system-images;android-35;google_apis_playstore;x86_64"

Next, get the Android licenses out of the way:

yes | flutter doctor --android-licenses

Once that's done, run:

flutter doctor -v

If something fails (usually build tools), don't panic—Google and the Arch Wiki are your best friends.

Directory Structure: What You Should See

Your ~/Android folder should look something like this:

~/Android/
├── flutter/
├── cmdline-tools/
│   └── latest/
├── platform-tools/
├── emulator/
└── (other stuff)

Quality of Life Tweaks (a.k.a. "Save Your Sanity")

Problem:
adb sometimes refuses to detect your device unless it's in MTP mode. Annoying, right?

Solution:
Create a symbolic link so adb is always where it needs to be:

sudo ln -s ~/Android/platform-tools/adb /usr/bin/adb

Now your phone should show up without needing to play USB mode roulette.

Final Thoughts

Setting up Flutter on Arch is a bit like assembling a spaceship from spare parts—frustrating, occasionally confusing, but satisfying when it finally takes off. If you hit a snag, remember: you're not alone, and there's always another guide (or meme) out there to help.

Happy coding—and may your flutter doctor always return green checkmarks!

4 min read | Read on the blog | Buy me a coffee

Choosing a Distro: More Stressful Than Naming Your Firstborn?

Serendeep Rudraraju — Mon, 05 May 2025 19:48:21 GMT

Picking a Linux distro can feel like naming your first child. Except, instead of a shortlist of "Dave" or "David", you're staring at a list of 600+ options, each with a name that sounds like a Pokémon or a sci-fi planet. And just when you think you've made up your mind, someone pops up to tell you why their choice is better.

Welcome to Linux. The land of endless options—and even more opinions.

The Distro Jungle

Imagine walking into a candy store the size of a football field. Every treat is wrapped in shiny paper, but some have mysterious fillings. Some are classics, some are new, and a few are so strange you wonder if anyone actually eats them. That's Linux. There are distros for everyone—students, businesses, tinkerers, even entire countries (yes, North Korea has its own).

Why so many? Because Linux is open-source. Anyone with a keyboard and a dream can whip up their own flavor. That's how you end up with everything from Ubuntu (the friendly neighbor) to Gentoo (the one who insists on building their own furniture from scratch).

The Paradox of Choice

Choice is supposed to be good, right? But with Linux, it's easy to get stuck in analysis paralysis. You start out thinking, "I just want something that works." Next thing you know, you're comparing package managers, reading heated debates about "rolling release" vs. "LTS," and wondering if "Pacman" is a game or a way to install software.

It's like online dating, but every profile is written in bash scripts and everyone's profile picture is a penguin.

Distro Hopping: The Never-Ending Quest

Here's a secret: most Linux users don't settle down with their first distro. They hop. A lot. One week it's Ubuntu, the next it's Manjaro, then maybe Fedora. Sometimes it's the search for the "perfect" fit. Sometimes it's just boredom. Sometimes it's because someone on Reddit posted a screenshot (staring right at you r/unixporn) that looks cooler than your current setup.

It's the tech version of redecorating your house every weekend because you saw a new color on Pinterest. Some say it's a waste of time. Others call it a rite of passage. Either way, it's almost unavoidable.

What Actually Sets Distros Apart?

It goes well beyond the wallpaper. Here's what really matters:

Package Manager: How you install stuff. APT, DNF, Pacman. Think Target vs. Walmart vs. a secret underground market.
Release Cycle: Some distros are rock-solid and rarely change. Others are always on the bleeding edge—exciting, but sometimes risky.
Desktop Environment: GNOME, KDE, Cinnamon, XFCE. Like picking your room's style, but you can swap it out with a single command (and maybe a little cursing).
Philosophy: Some distros are all about stability. Others chase the latest features. A few are for people who like a challenge (looking at you, Gentoo).
Community: Some have bustling forums and wikis. Others are ghost towns.
Pre-installed Software: Some come with everything but the kitchen sink. Others hand you a screwdriver and say, "Build it yourself."
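
To make the package manager difference concrete, here is a minimal Python sketch that checks which of the three managers named above is on your PATH and builds the matching install command. The function names (`detect_package_manager`, `install_command`) are made up for this illustration, and the sketch only prints the command rather than running it.

```python
import shutil

# Install command prefix for each package manager mentioned above.
INSTALL_COMMANDS = {
    "apt": ["sudo", "apt", "install"],    # Debian, Ubuntu, Mint
    "dnf": ["sudo", "dnf", "install"],    # Fedora and relatives
    "pacman": ["sudo", "pacman", "-S"],   # Arch and relatives
}


def detect_package_manager(which=shutil.which):
    """Return the first known package manager found on PATH, or None."""
    for name in INSTALL_COMMANDS:
        if which(name):
            return name
    return None


def install_command(manager, package):
    """Build (but do not run) the install command for one package."""
    return INSTALL_COMMANDS[manager] + [package]


if __name__ == "__main__":
    manager = detect_package_manager()
    if manager is None:
        print("No known package manager found on PATH")
    else:
        # "htop" is just an example package name.
        print(" ".join(install_command(manager, "htop")))
```

Same job, three dialects. Once you know which one your distro speaks, most of the "how do I install X" confusion goes away.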

Family Resemblance

Despite all the names, most distros belong to a few big families:

graph TD
    Linux[Linux Kernel] --> Debian
    Linux --> RedHat[Red Hat]
    Linux --> Arch
    Linux --> Independent[The Eccentrics]

    Debian --> Ubuntu
    Debian --> MXLinux[MX Linux]
    Ubuntu --> Mint[Linux Mint]
    Ubuntu --> PopOS[Pop!_OS]

    RedHat --> Fedora
    RedHat --> CentOS
    RedHat --> AlmaLinux

    Arch --> Manjaro
    Arch --> Garuda
    Arch --> EndeavourOS

    Independent --> SUSE
    Independent --> Gentoo
    Independent --> Slackware

Debian-based: Ubuntu, Linux Mint, Pop!_OS. The friendly, reliable cousins.
Red Hat-based: Fedora, CentOS, AlmaLinux. The business crowd.
Arch-based: Arch, Manjaro, Garuda. The DIY enthusiasts who think IKEA is too easy.
The Eccentrics: SUSE, Gentoo, Slackware. The quirky uncles who live off the grid.
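
If you'd rather poke at the family tree than squint at it, here is a small Python sketch that stores the same parent links as the diagram above and walks a distro back to the kernel. The `PARENT` table and the `lineage` helper are illustrative names, and the data is only what the diagram shows, not a complete genealogy.

```python
# Parent of each node, copied from the diagram above.
PARENT = {
    "Debian": "Linux Kernel",
    "Red Hat": "Linux Kernel",
    "Arch": "Linux Kernel",
    "The Eccentrics": "Linux Kernel",
    "Ubuntu": "Debian",
    "MX Linux": "Debian",
    "Linux Mint": "Ubuntu",
    "Pop!_OS": "Ubuntu",
    "Fedora": "Red Hat",
    "CentOS": "Red Hat",
    "AlmaLinux": "Red Hat",
    "Manjaro": "Arch",
    "Garuda": "Arch",
    "EndeavourOS": "Arch",
    "SUSE": "The Eccentrics",
    "Gentoo": "The Eccentrics",
    "Slackware": "The Eccentrics",
}


def lineage(distro):
    """Return the chain from the kernel down to the given distro."""
    chain = [distro]
    # Follow parent links until we reach a node with no recorded parent.
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return list(reversed(chain))


if __name__ == "__main__":
    print(" -> ".join(lineage("Linux Mint")))
    # Linux Kernel -> Debian -> Ubuntu -> Linux Mint
```

A name that isn't in the table simply comes back on its own, which is a fair description of how some distros feel anyway.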

But don't just take my word for it. Check out this family tree—it's like a genealogy chart for Linux, and it's wild how everything connects:

[Figure: Linux Distribution Timeline]

Want to see just how tangled the Linux family really is? That chart shows decades of distros branching, merging, and sometimes disappearing entirely. It's a little overwhelming, but also kind of beautiful—like a map of all the places you could go. For most people, the differences between the big, user-friendly distros are smaller than the forums would have you believe. It's like arguing over which bottled water tastes best.

Feeling Overwhelmed? Here's Help:

Brodie Robertson: Distro reviews, Linux news, and practical tips.
Chris Titus Tech: Linux tweaks, troubleshooting, and setup guides.
TechHut: Desktop environment showcases and beginner-friendly tutorials.
DistroTube: Deep dives into Linux philosophy, window managers, and customization.

A Few Fun Comparisons

Stability vs. Rolling Release: Debian is your minivan—safe, steady, maybe a little boring. Arch is a sports car—fast, fun, but you might end up on the side of the road Googling "kernel panic."
Distro Hopping: Like changing your room's paint color every weekend.
Personality Types: Ubuntu is the friendly neighbor. Arch is the ambitious DIY project. Gentoo is the one who takes apart IKEA furniture just for the fun of putting it back together.
Tribalism: People defend their favorite distro like parents at a school play. Expect debates.

Final Thoughts

The best Linux distro is the one that helps you get things done—or at least lets you enjoy the ride. Don't overthink it. Pick one. Give it a spin. If it doesn't fit, try another. With Linux, the only thing more predictable than change is the itch to try something new next weekend.

After finally settling on Arch Linux and using it daily for almost a year, I'll admit—there were days I wanted to pull my hair out, especially when I tried to "rice" Hyprland. But that's half the fun. It's like building your own custom bike: you'll get grease on your hands, you'll probably lose a few screws along the way, but in the end, you're riding something that's truly yours. That's the magic of Linux—the freedom to shape your desktop exactly how you want, even if it means taking a few detours (and maybe a few deep breaths) along the way.

Happy hopping!

