What Actually Works: The Hardware Compatibility Filter in Neural Architecture (2023–2025)

Published 2025-12-06 · Read on Substack

TL;DR

The Scroll Map

Here’s the path this article takes:


Introduction

In “Neural Architecture Design as a Compositional Language,” we established that architecture evolved from hand-crafted hierarchies to modular component libraries. The Transformer validated this: uniform blocks, structured composition rules, and emergent specialization through training.

We ended with three open questions. What primitives belong in the toolkit beyond attention and MLPs? What composition operators enable scaling? How do we train compositional systems effectively?

Two years of production deployments provide answers, but not through the lens most researchers expected. The organizing principle of the last 24 months isn’t expressiveness or theoretical elegance. It is hardware compatibility.

By “hardware compatibility,” we mean a concrete set of properties: operations that can be expressed as dense matrix multiplications on tensor cores, memory access patterns that respect the SRAM/HBM hierarchy, and enough regular parallelism to saturate modern GPUs. GPU architectures—optimized for these dense, parallel primitives—create a filter. Techniques that map naturally to them tend to win; techniques that fight them rarely survive contact with production, regardless of benchmark results on small models.

Hardware is not the only determinant of adoption—data availability, optimization effort, and product fit all matter—but when we compare architectures across labs in 2023–2025, hardware alignment is the strongest common constraint that shows up repeatedly in both open-weight releases and closed-model behavior.

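This post examines the 2023–2025 era through three perspectives: the compatibility filter (Act I), conditional execution (Act II), and the infrastructure lag pattern (Act III).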

We argue for three claims, using production-style evidence and public model documentation:

  1. Hardware compatibility is a better early predictor of adoption than small-scale benchmarks.

  2. Architecture innovation is slowing relative to training innovation; for the foreseeable future on today’s GPU-centric stacks, the leverage is likely in optimization and data, not entirely new primitives.

  3. Pure SSMs have not yet scaled to frontier models; in practice, hybrids appear to be the effective ceiling today (see Appendix, Table 3).

All three are falsifiable. For example, a technique that is clearly misaligned with GPU primitives but still becomes standard at frontier scale would directly challenge the hardware compatibility thesis; a pure SSM frontier model would falsify our current ceiling claim. As of December 2025, we have not seen such counterexamples in the public record.

What doesn’t work: Speculation without production validation, marketing claims without aggregate data, and techniques that ignore deployment constraints.

What does work: Hardware-compatible primitives, permissive compositional operators, and training methods that exploit—not fight—modular structure.


Act I: The Compatibility Filter

How hardware constraints shaped what survived

Scope & Limitations

This act (and the broader analysis) focuses on large language models trained and deployed between 2023 and 2025 on GPU-centric infrastructure, using public papers, model cards, and system documentation as primary evidence. Closed-source models are included only to the extent that their internals can be inferred from public disclosures and third-party analyses; we do not claim completeness over proprietary experiments. All heuristics here (e.g., hardware compatibility as a filter, 12/24/36-month adoption lags) are empirical patterns observed in this window, not guarantees, and may not generalize to future hardware or domains outside production-scale LLMs. For definitions of key terms, see the Glossary and Key Concepts section; for a deeper look at how hardware and kernels interact with these ideas, see the Hardware and Kernel Primer section.

1. The Attention Bottleneck

By early 2023, the Transformer architecture faced a crisis, but it wasn’t about intelligence. It was about memory. The quadratic cost of attention (O(n²)) is often cited as a computational problem, but in practice, it is a memory movement problem.

At 128K context length, the Key-Value (KV) cache for a 70B model grows to hundreds of gigabytes, exceeding the HBM (High Bandwidth Memory) capacity of even an H100 GPU. The bottleneck wasn’t doing the math; it was moving the matrices from HBM to the compute units (SRAM) and back.
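To make the scale concrete, the arithmetic can be sketched as follows. The model shape used here (80 layers, 64 attention heads of dimension 128, fp16 storage, batch size 1, one K/V head per query head) is an assumption chosen to resemble a 70B-class model with full multi-head attention; it is not taken from any specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache K and V for every layer and every cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len * batch_size


# Hypothetical 70B-class shape with full multi-head attention (assumed values).
cache = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=128 * 1024)
cache_gib = cache / 2**30  # 320.0 GiB, versus 80 GB of HBM on a single H100
```

Under these assumptions a single 128K-token sequence already needs several GPUs' worth of memory just for the cache, before weights or activations are counted.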

Naive solutions failed because they tried to approximate the math (sparse attention patterns, low-rank approximations) rather than solve the memory movement. They broke the one thing researchers wouldn’t compromise: the exactness of the attention mechanism. The solution that won didn’t change the math at all. It changed the memory access pattern.

2. Flash Attention: Hardware as Design Constraint

Flash Attention (Dao et al., 2022; 2023) is one of the cleanest case studies for the Hardware Compatibility Thesis. Published in May 2022, it did not propose a new neural primitive. Instead, it proposed a tiling algorithm that respects the GPU’s memory hierarchy.

Standard attention materializes the full N × N attention matrix in HBM, reading and writing gigabytes of intermediate data on every forward pass over a long sequence. Flash Attention tiles the computation so that blocks of the matrix are loaded into SRAM (fast, on-chip memory), computed, and discarded without ever writing the full matrix to HBM.

The Mechanism: By keeping the computation in SRAM, Flash Attention reduces memory reads/writes by roughly an order of magnitude and delivers 2–4× speedups on attention while cutting memory usage by ~10–20× at long sequences (see Appendix, Table 4). It turns a memory-bound operation into a compute-bound one, which is exactly what GPUs are designed to handle.
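The core trick that makes tiling exact is an "online" softmax: keys and values are visited block by block while only a running maximum, a running denominator, and a running weighted sum are kept. The sketch below shows this for a single query in plain Python. It is an illustration of the idea under simplifying assumptions (one query, no masking, toy sizes, hypothetical function names), not the actual Flash Attention kernel, which also tiles over queries and runs in GPU SRAM.

```python
import math


def attention_naive(q, keys, values):
    """Reference: compute every score for one query, then a full softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, but K/V are visited block by block (online softmax).

    Only a running max, a running denominator, and a running weighted sum
    are kept, so the full row of scores is never stored at once.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy inputs (arbitrary values): 5 keys/values, head dimension 4.
q = [0.5, -1.0, 2.0, 0.1]
keys = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
values = [[float(i), float(i * i)] for i in range(5)]
```

Both functions return the same vector up to floating-point rounding, which is the point: the math is unchanged, only the order of memory accesses differs.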

The Adoption Timeline (The ≈24-Month Lag): The adoption of Flash Attention reveals a predictable infrastructure lag pattern that governs the entire field:

It took roughly 24 months for a technique that required kernel-level changes to move from “proven research” to “standard practice.” This lag is not accidental; it is the cost of infrastructure integration. We will see this pattern repeat with MoE.

3. The Compatibility Winners

The winners of 2023–2025 all share one trait: they reduce memory bandwidth pressure without requiring new, exotic compute kernels.

Grouped Query Attention (GQA): GQA is the dominant attention variant of 2024 in our audit, appearing in a large majority of open-weight and documented models, including Llama 3, Mistral 7B, and Qwen2.5 (see Appendix, Table 5). Its mechanism is simple: share Key and Value heads across multiple Query heads.
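The head-sharing rule can be sketched in a few lines. The grouping convention below (consecutive query heads share one K/V head) and the function names are assumptions for illustration; specific model implementations may lay out heads differently.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the shared K/V head used by a given query head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_reduction(n_q_heads, n_kv_heads):
    """How many times smaller the KV cache is than with one K/V head per query head."""
    return n_q_heads / n_kv_heads


# Hypothetical configuration: 8 query heads sharing 2 K/V heads.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
# mapping == [0, 0, 0, 0, 1, 1, 1, 1]; the KV cache shrinks by 4x
```

Because only the number of stored K/V heads changes, the attention computation itself still consists of the same dense matrix multiplications.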

Multi-Head Latent Attention (MLA): Pioneered by DeepSeek (V2/V3), MLA takes compression further. It projects the KV cache into a low-rank latent vector.

MoE sits in the same compatibility bucket: when well-engineered, large MoE deployments deliver approximately 2–3× higher throughput than dense models of similar quality, not the 5–7× often claimed in marketing, and their advantage shrinks toward ≈1.3–1.5× at very long contexts as routing and KV/cache bandwidth dominate (Appendix, Table 2). The techniques that stick are those that respect the hardware, even when their real-world gains are less dramatic than the launch slides.

Contrast: The Failure of KANs. Kolmogorov-Arnold Networks (KANs) generated immense hype in early 2024 by proposing learnable activation functions (splines) on edges rather than fixed activation functions on nodes.

4. SSMs: The Compatibility Boundary

State Space Models (SSMs) like Mamba (Gu et al., 2021; Gu & Dao, 2023) represent the most significant architectural challenge to the Transformer. They promise linear scaling (O(n)) by replacing the attention mechanism with a recurrent state update.

The Mechanism: Mamba compresses context into a fixed-size state, updating it sequentially as it processes tokens. This eliminates the KV cache entirely, theoretically solving the memory bottleneck for infinite contexts.

The Friction: Pure SSMs push against the GPU’s greatest strength: parallelism. Transformers process all tokens in a sequence simultaneously during training, using causally masked matrix multiplications. SSMs, by definition, have a sequential dependency: state t depends on state t − 1. While Mamba introduces a “parallel scan” algorithm to mitigate this, the fundamental operation is still less straightforward to parallelize than the brute-force matmuls of a Transformer.
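The difference between the sequential dependency and a parallel scan can be illustrated on a scalar linear recurrence h_t = a_t · h_{t−1} + b_t. The sketch below is a simplified stand-in, assuming scalar states and a textbook Hillis-Steele-style scan; it is not Mamba's actual kernel, and the function names are hypothetical.

```python
def sequential_scan(a, b, h0=0.0):
    """h_t = a_t * h_{t-1} + b_t, computed one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine maps h -> a*h + b (left is applied first)."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_l * a_r, a_r * b_l + b_r)


def parallel_scan(a, b, h0=0.0):
    """Inclusive scan in O(log n) rounds; updates within a round are independent."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        # On a GPU, every position in this comprehension could run in parallel.
        elems = [combine(elems[i - step], elems[i]) if i >= step else elems[i]
                 for i in range(len(elems))]
        step *= 2
    return [a_i * h0 + b_i for a_i, b_i in elems]
```

The scan works because composing affine maps is associative, but each round still needs data from other positions, which is a less regular access pattern than a single large matrix multiplication.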

The Evidence (The Ceiling):

The Nuance: Hybrids exist. Jamba (AI21) and Qwen3-Next use SSM layers mixed with Attention layers.

5. External Memory (RAG): The Compatibility Shortcut

While researchers fought over O(n²) vs O(n) architectures, the industry solved the context problem with a database. Retrieval-Augmented Generation (RAG) is the ultimate compatibility hack.

Act I Synthesis: Across 2023–2025, the pattern is consistent. Flash Attention (maps to the memory hierarchy) took about 24 months to become standard (Appendix, Table 4). GQA (maps to bandwidth constraints) diffused in roughly 12 months as a weight-only change (Appendix, Tables 4 and 5). Pure SSMs (which push against parallelism) currently top out at 7B with no public frontier-scale models (Appendix, Table 3). KANs (incompatible compute paths) have no documented large-scale production deployments.

We are not designing neural networks in a vacuum. We are discovering which computations GPUs can efficiently express. Hardware acts as a ruthless filter on which ideas survive into production today.

In Act II, we will see how this filter shaped the most important composition operator of the decade: the shift from static graphs to learned, conditional execution.


Act II: The Conditional Execution Revolution

How learned routing became a first-class primitive

The defining shift of 2023–2025 was not a new layer type. It was a change in the fundamental contract of computation. In the previous era, neural networks were static graphs: every input token triggered the exact same sequence of floating-point operations. In the current era, computation is conditional—the model learns where to spend compute, not just what layers exist.

This isn’t just about Mixture-of-Experts (MoE). It is a convergence across attention, routing, and adaptation. The architecture increasingly specifies how to choose what to compute for a given token, task, or context.

1. The Conditional Execution Convergence

If we look closely at the successful techniques of the last two years, a unifying pattern emerges: they all introduce some form of learned selectivity over compute or parameters:

These are not identical mechanisms, but they share a common composition operator: conditional execution. Some (like MoE and sparse attention) route tokens or positions to different compute paths at runtime; others (like LoRA) condition which parameters are active for a given task or domain.

Why did this happen now? Because GPU infrastructure finally matured to support it at scale. For years, “sparsity” was a dirty word in high-performance computing because it meant irregular memory access and thread divergence—two things GPUs hate. But with the arrival of specialized kernels and serving stacks for block-sparse operations (e.g., Triton/MegaBlocks-style MoE kernels, FasterMoE, Aurora, FlashInfer, and MoE-aware DeepSpeed/Megatron/FasterTransformer), the penalty for conditional execution dropped below the capacity and throughput gains it unlocks (Appendix, Table 2).

The thesis of Act II is simple: Conditional execution is to the 2020s what residual connections were to the 2010s—the composition operator that enables the next order of magnitude in scaling.

2. MoE: The Production Validation

For years, Mixture-of-Experts was the “technology of the future”—from Shazeer et al.’s sparsely-gated MoE (2017) through Switch Transformers (Fedus et al., 2021) and GLaM (Du et al., 2021)—always promising, but difficult to train and painful to serve. In 2024–2025, that future finally arrived. MoE is no longer an exotic research trick; it has become one of the primary strategies for scaling frontier models, especially when total parameter counts exceed what dense training and serving can economically support (Appendix, Table 1).

The Honest Numbers: Let’s strip away the marketing claims. Vendors often promise “5–7× speedups” with MoE. The reality, based on a meta-analysis of production-style deployments and system papers, is more grounded but still transformative (Appendix, Table 2):

The Evidence: The list of successes is now long and growing. A few representative examples:

The Narrative Climax: The Meta Pivot. The clearest public signal of MoE’s arrival came in April 2025. Until then, Meta had pushed dense scaling hard, and Llama 3.1’s 405B dense model was a monument to that strategy. With Llama 4, Meta introduced its first MoE family:

Meta’s pivot does not mean dense models are obsolete—dense Llama variants and many closed models still exist—but it is a strong signal that at frontier scale, MoE is now considered a first-class option rather than a risky experiment (Appendix, Table 1). The Dense Era has not ended, but it now shares the stage with sparse, conditionally-executed capacity.

4. Hybrid Architectures: Principled Heterogeneity

To solve the Context-Length Trap, we are seeing the emergence of Hybrid Architectures.

Qwen3-Next is the prime example. It doesn’t use just one primitive. It uses a three-dimensional hybrid approach:

  1. Linear Attention for the infinite context backbone.

  2. Gated Attention for high-fidelity retrieval.

  3. MoE for feed-forward capacity.

The Principle: Route computation to the primitive that handles it best.

While elegant, hybrids remain niche. The complexity cost of maintaining three different inference kernels usually outweighs the benefits for general-purpose models. Today, they are the solution for extreme requirements (256K+ contexts and ultra-long workflows), not yet the default for general reasoning.

5. What Didn’t Work: The Failure Mechanism

To understand why conditional execution won, we must look at what failed.

Neural ODEs (2018): Promised continuous depth and infinite adaptability.

Capsule Networks: Proposed “routing-by-agreement” to solve spatial hierarchies.

The Lesson: Compositionality that survives is permissive, not prescriptive. MoE provides a pool of experts and a router, but it doesn’t tell the model what the experts should be. It lets gradient descent discover the structure. Capsule Networks and Neural ODEs tried to design the structure explicitly, and they did not cross the production chasm.

Act II Synthesis: Conditional execution won because it is the most general way to scale capacity without worsening the compute constraint. It maps to the GPU’s ability to handle block-sparse operations and conditional parameter usage. Llama 4’s pivot confirms MoE as a standard frontier tool, not just a research prototype.

In Act III, we will explore the final piece of the puzzle: why these transitions take so long, and where the field goes next.


Act III: The Infrastructure Lag Pattern

Why timing matters: From research to production

Flash Attention (Dao et al., 2022; 2023) was published in May 2022. It became standard practice in May 2024. Roughly two years. Switch Transformers (Fedus et al., 2021) and closely related MoE work (Lepikhin et al., 2020; Du et al., 2021) were published around 2020–2021. They reached mainstream production usage only around 2024. Roughly three years for MoE-style serving infrastructure.

Why the lag? In a field that moves at the speed of light, why does it take 24–36 months for a proven, hardware-compatible technique to reach production?

The answer is what we call the Infrastructure Lag Pattern. Techniques do not succeed when they are published. They succeed when the infrastructure makes them cheaper to adopt than to ignore.

1. The Adoption Dynamics Model

Looking across GQA, Flash Attention, and MoE, we see a recurring pattern: adoption timelines track integration cost (Appendix, Table 4). This is not a law of nature, but a useful heuristic for techniques similar to those we analyzed.

Integration cost is not the only factor—vendor priorities, open-source implementations, and hype all matter—but it is the most consistent constraint we see across 2023–2025 production case studies.

The Prediction (Heuristic, Not Law): If a new technique requires kernel changes (like KANs would), do not expect broad production usage for roughly two years. If it requires serving changes, expect closer to three. If it fundamentally fights GPU primitives, treat it as research-only for current production GPU stacks.
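Written as a lookup, the heuristic is deliberately crude. The category names below are hypothetical labels for the integration levels described above, and the month values are the approximate 12 / ~24 / ~36 buckets from this post, not measured constants.

```python
# Approximate lags taken from the 12 / ~24 / ~36-month buckets in this post.
EXPECTED_LAG_MONTHS = {
    "weight_only": 12,               # e.g. GQA: lands within a retraining cycle
    "kernel": 24,                    # e.g. Flash Attention
    "serving": 36,                   # e.g. MoE routing and serving stacks
    "fights_gpu_primitives": None,   # treated as research-only on current GPUs
}


def expected_lag_months(integration_level):
    """Return the rough adoption lag in months, or None for research-only."""
    if integration_level not in EXPECTED_LAG_MONTHS:
        raise ValueError(f"unknown integration level: {integration_level!r}")
    return EXPECTED_LAG_MONTHS[integration_level]
```

The useful part is the question it forces: before estimating a timeline, decide which layer of the stack a technique actually has to change.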

This model would be falsified by, for example, a kernel-level change that becomes standard in under a year, or a serving-level change that diffuses in a single retraining cycle. As of late 2025, our three highlighted case studies (GQA, Flash, MoE) line up cleanly with the 12 / ~24 / ~36-month buckets in Appendix, Table 4.

2. Training Innovation: The Next Frontier

This brings us to our final and most controversial claim: Architecture innovation is slowing relative to training and optimization.

The last widely adopted fundamental primitive (Attention) was introduced in 2017. Since then, we have seen refinements (Flash, GQA, MLA), composition operators (MoE), and hybrids (attention + SSM), but no clearly dominant replacement at frontier scale. The hardware constraints have narrowed the space of acceptable operations tightly around dense matrix multiplication, making truly novel primitives harder to push through the Hardware Compatibility Filter (Appendix, Tables 2, 3, and 5).

That does not mean architectural work has stopped—SSM-style blocks, hybrids, and long-context variants are very active research areas—but the center of gravity has shifted. For the foreseeable future on today’s GPU-centric stacks, the leverage is likely to come from Training Innovation on top of this relatively stable container.

A Case Study: DeepSeek-R1 (January 2025). One of the strongest recent examples comes from DeepSeek-R1 (DeepSeek-AI, 2025). It achieved state-of-the-art reasoning performance (79.8% on AIME 2024, 97.3% on MATH-500), rivaling OpenAI’s o1-class models.

This is not singular “proof,” but it strongly reinforces the thesis: The architecture is the container; training is the contents. We have spent 8 years optimizing the container. It is efficient, scalable, and hardware-compatible. Now we are finally learning how to fill it.

Where Innovation Moves:

Training innovations have their own lag pattern—new optimizers and RL schemes still need to be implemented, debugged, and validated—but they typically integrate faster than core architectural overhauls. We are shifting from “what to compute” to “how to train what we already compute.”



Conclusion

The story of 2023–2025 is not one of unbridled invention, but of ruthless selection.

The Hardware Compatibility Thesis explains much of the difference between techniques that survive and those that do not. Flash Attention won because it respected the memory hierarchy. MoE won because it mapped to conditional execution. Pure SSMs have, so far, hit a ceiling around 7B because they fight parallelism (Appendix, Table 3). KANs remain research curiosities because their spline-heavy compute paths do not align with tensor core strengths.

For the senior practitioner, this offers a clear heuristic:

  1. Predictive: When evaluating a new paper, ask: “Does this map to GPU primitives?” If no, treat it as a research bet rather than a near-term production candidate.

  2. Strategic: Architecture progress is real but slower; invest a growing share of your R&D budget in training innovation (data curation, RL, optimization) instead of searching for a wholesale “next Transformer.”

  3. Tactical: Adopt Flash Attention and GQA wherever possible. Use MoE for scaling beyond ~10B parameters if your infrastructure budget can support it. Treat pure SSMs as research-only for frontier models until we see public evidence above 7B (Appendix, Table 3).

We are not designing neural networks in a vacuum. We are discovering which computations silicon can express efficiently. The ones that map naturally to the hardware survive at scale. The ones that don’t risk becoming historical footnotes on today’s GPU-centric stacks, regardless of their theoretical elegance.

The architecture wars are not literally “over,” but for production LLMs on 2023–2025 hardware, they have entered a slower phase. The next competitive frontier is how we train and use the architectures we already have.

Critique:



Appendix: The Evidence

For the skeptical practitioner, we present the aggregated evidence underlying the claims in this post. For terminology and hardware background, see the Glossary and Key Concepts and Hardware and Kernel Primer sections.

Table 1: Major Lab Frontier Model Releases (2023–2025)

Explicit model releases by major labs, ordered roughly chronologically. Architectures are summarized from public documentation as of Dec 4, 2025.

Insight: From 2023 to late 2025, frontier lines clearly cluster into two patterns: (1) Dense / sparse transformers with ever‑better kernels (Flash Attention, GQA, long context, sparse patterns), and (2) MoE transformers (Llama 4, Mixtral) that trade implementation complexity for large capacity at similar active compute. Pure SSM models still do not appear as standalone frontier flagships; SSM/Mamba‑like blocks show up only as components inside hybrids or efficiency‑oriented models outside this list.

Table 2: The MoE Reality Check

Marketing claims vs. production-style measurements from short to long context (up to 64K–128K tokens). Summarizes Extended Dataset 1 (MoE Speedup Meta-Analysis), aggregated from Switch, Mixtral, DeepSeek-V3, deployment-inefficiency work, and Aurora-style system papers.

Insight: MoE is primarily a throughput and capacity optimizer, not a magic fix for latency or memory. It buys “more model for the same serving budget,” especially at moderate context, but still runs into bandwidth and KV-cache limits at frontier context lengths.


Table 3: Pure SSM / Mamba Evidence Gap

Where pure Mamba/SSM models have been demonstrated vs. where evidence stops (as of Dec 2025). Summarizes ssm_benchmark_findings.md.

Insight: Pure SSMs are clearly viable up to 7B and can be very efficient, but in the public releases we examined there is no evidence of pure SSMs at 13B or frontier scales; above 7B, SSM-style blocks only appear inside hybrid architectures (Jamba, Falcon-H1, Qwen3-Next), not as end-to-end replacements for attention.


Table 4: Infrastructure Lag Timeline

Time from publication to standard production practice, grouped by integration complexity. Derived from flash_attention_timeline.md, MoE deployment studies, and GQA adoption data.

Insight: Integration cost determines adoption speed. Weight-only changes land within a retraining cycle; kernel changes take ~2 years to become ubiquitous; distributed system changes (e.g., MoE routing and large-scale sparse attention) take closer to 3 years.


Table 5: GQA / MLA / Efficient Attention Adoption

Snapshot of efficient attention mechanisms across major 2024–2025 model families, distilled from major_lab_architecture_audit.md.

Insight: By late 2025, GQA-style efficient attention (plus MLA and sparse variants) dominates over vanilla MHA in the large models we audited. This is the weight-only, hardware-aligned optimization that diffuses fastest and forms the “background assumption” for 2024–2025 architectures.


Table 6: Training Innovation Evidence

Selected examples where training methods, rather than new primitives, drove major capability or efficiency gains.

Insight: The most dramatic recent gains in reasoning and efficiency often come from training recipes and optimization choices layered on top of relatively stable architectural primitives (MoE, efficient attention), reinforcing the view that, on today’s hardware, training innovation is where much of the near-term leverage lies.


Glossary and Key Concepts

This glossary collects the main acronyms and concepts used across the three Acts and the Appendix, with short but detailed explanations aimed at readers who know modern deep learning but may not live inside LLM infra day-to-day.


Attention and Memory

Attention (Self-Attention)

Mechanism that lets each token in a sequence attend to every other token, computing a weighted sum of their representations. In standard Transformer self-attention, each position produces Query (Q), Key (K), and Value (V) vectors; attention weights are computed as a softmax over the scaled scores Q·Kᵀ/√d (with d the per-head dimension) and used to mix the V vectors. This gives models flexible, content-based access to the entire context, at the cost of O(n²) memory and compute in the sequence length.

Multi-Head Attention (MHA)

Original Transformer attention variant in which several independent attention “heads” run in parallel, each with its own Q/K/V projections and slightly different focus. Outputs from all heads are concatenated and linearly projected. MHA improves expressiveness but scales memory and compute linearly in the number of heads and quadratically in sequence length.

Grouped-Query Attention (GQA)

An efficient attention variant where many Query heads share a smaller number of Key/Value heads. For example, 8 Query heads might share 1 Key and 1 Value head. This reduces the size of the KV cache (and memory bandwidth) without significantly degrading quality, especially for large models. GQA is a “weight-only” change—no new kernels are required—so it spreads quickly once integrated into a model family.

Multi-Query Attention (MQA)

An earlier efficiency variant in which all Query heads share a single Key and Value head. This maximizes KV sharing but can be less flexible than GQA. Some early Gemini versions used MQA, with later models and other families converging more on GQA-like patterns that balance efficiency and expressiveness.

Multi-Head Latent Attention (MLA)

An attention mechanism introduced by DeepSeek in which the KV cache is projected into a low-rank latent space before being stored. Instead of keeping full-resolution K/V vectors for every token, MLA maintains a compressed latent representation and attends over that; this significantly reduces memory footprint and bandwidth. MLA still uses dense matrix multiplications and fits well on GPUs, but changes the storage and retrieval pattern to support long contexts (e.g., 128K) and large MoE models.
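The storage pattern can be sketched with toy matrices: cache only a small latent per token and expand it to K and V when attention is computed. Everything concrete here is an assumption for illustration: the sizes, the weight values, and the helper names are made up, and the sketch omits per-head structure and the separate positional component used in DeepSeek's published design.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy sizes (assumptions): hidden size 4, latent size 2, K and V of size 4 each.
W_down = [[0.5, 0.0, 0.5, 0.0],
          [0.0, 0.5, 0.0, 0.5]]                                # hidden -> latent
W_up_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]     # latent -> K
W_up_v = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 1.0]]      # latent -> V


def cache_entry(hidden):
    """What gets stored per token: only the low-rank latent."""
    return matvec(W_down, hidden)


def expand(latent):
    """Recover K and V from the cached latent at attention time."""
    return matvec(W_up_k, latent), matvec(W_up_v, latent)


hidden = [1.0, 2.0, 3.0, 4.0]
latent = cache_entry(hidden)   # 2 numbers cached instead of 8 (K plus V)
k, v = expand(latent)
```

All three steps are dense matrix products, which is why this kind of compression stays GPU-friendly while shrinking what has to live in the cache.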

KV Cache (Key-Value Cache)

During autoregressive generation, models cache the Key and Value tensors from previous tokens so they don’t have to recompute them at each step. For long contexts and large models, this cache dominates memory usage—often hundreds of GB at 128K context for 70B+ models if stored naïvely. Techniques like GQA, MLA, and FlashAttention-style kernels largely exist to make the KV cache more manageable.

FlashAttention

An IO-aware attention algorithm that reorders computation to respect GPU memory hierarchies. Instead of materializing the full n×n attention matrix in HBM (off-chip memory), FlashAttention tiles the computation into blocks that fit in SRAM (on-chip cache), computes attention over those blocks, and discards them without ever writing the full matrix to HBM. The result is 2–4× speedups and 10–20× memory savings at long sequences, turning attention from a memory-bound to a compute-bound operation and enabling much longer contexts.

Sliding-Window Attention (SWA)

An attention pattern where each token attends only to a local window of nearby tokens, rather than the entire sequence. This reduces the effective complexity from O(n²) to roughly O(n·w), where w is the window size, and is useful for long contexts where many long-range interactions are unnecessary. Mistral 7B is a canonical example of GQA + SWA in production.
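A small counting sketch shows where the O(n·w) figure comes from. It assumes a causal window in which each token sees itself and the w − 1 tokens before it; the function names are hypothetical.

```python
def window_positions(i, window):
    """Positions token i may attend to under a causal sliding window."""
    return range(max(0, i - window + 1), i + 1)


def attended_pairs(n, window=None):
    """Count (query, key) pairs; window=None means full causal attention."""
    if window is None:
        return n * (n + 1) // 2
    return sum(len(window_positions(i, window)) for i in range(n))
```

For example, with 8 tokens full causal attention covers 36 pairs, while a window of 3 covers 21; the gap widens linearly rather than quadratically as the sequence grows.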

Sparse Attention / DeepSeek Sparse Attention (DSA)

Sparse attention restricts which positions can attend to which others, either via fixed patterns (e.g., blocks, dilations) or learned structures, to reduce computation and memory. DeepSeek’s DSA is a specific form of sparse attention that reduces complexity from O(L²) to O(L·k) by attending only to k selected key positions per query, enabling long contexts (e.g., 128K) in DeepSeek-V3.2 with more predictable memory use.

RoPE (Rotary Position Embeddings)

A positional encoding method that rotates query/key vectors in a way that encodes relative positions while preserving certain algebraic properties of the dot product. RoPE supports extrapolation to longer contexts and has become a de facto standard in modern LLMs (Llama, Mistral, Qwen, etc.).
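The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions by position-dependent angles; the interleaved pairing and the base of 10000 are common conventions assumed here, and real implementations often use a different but equivalent layout.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Arbitrary toy query/key vectors of even dimension.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
score_near = dot(rope(q, 5), rope(k, 3))     # positions 5 and 3, offset 2
score_far = dot(rope(q, 12), rope(k, 10))    # positions 12 and 10, same offset
```

The two scores agree up to rounding because the dot product of two rotated vectors depends only on the difference between their rotation angles.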


Mixture-of-Experts and Conditional Execution

Mixture-of-Experts (MoE)

An architectural pattern in which certain layers (usually the feed-forward / MLP blocks) contain multiple “experts,” and a small router network selects a subset of experts (e.g., top-1 or top-2) for each token. This makes the layer’s parameters “sparse”: the total parameter count can be very large, but only a small fraction is active per token. MoE is a form of conditional execution—different tokens route to different compute paths—enabling large capacity at lower per-token compute cost, at the price of more complex training and serving.

Experts and Router

In MoE, each expert is usually a feed-forward network (FFN) or similar module. A router (often a small linear layer per token) produces a distribution over experts, and a sparse selection (top-k or expert-choice) determines which experts process each token. Good routing is crucial for performance and load-balancing; poor routing leads to “collapsed experts” and low utilization.
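A minimal top-k routing sketch, using scalar functions as stand-ins for expert networks. The renormalization of gate weights over the selected experts is one common convention and is an assumption here, as are the function names; in a real layer the router logits would come from a learned linear projection of the token.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate)] for the k highest-probability experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(gate * experts[i](x) for i, gate in route_top_k(router_logits, k))


# Toy scalar "experts" (hypothetical stand-ins for feed-forward networks).
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
```

With four experts and k=2, half of the expert parameters are never touched for a given token, which is the source of the capacity-versus-compute trade described above.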

Conditional Execution

An umbrella term for mechanisms that allow the model to dynamically select which computations or parameters to use, conditioned on the input or context. This includes:

FasterMoE / Aurora / MoE System Stacks

System-level implementations and optimizations for serving MoE models efficiently. These frameworks design block-sparse kernels, expert sharding strategies, and communication patterns to keep GPUs well utilized. They are a key part of why MoE became practical in production: without efficient system support, MoE can be 3–15× slower than dense models despite fewer active parameters.


State Space Models (SSMs) and Mamba

State Space Models (SSMs)

Models that represent sequences via latent “state” variables evolving over time according to linear or nonlinear state-space equations. Instead of attending over the full sequence, SSMs update a recurrent hidden state as tokens come in, offering nominal O(n) complexity and fixed memory in sequence length. In practice, they can be very efficient for certain long-sequence tasks but have historically struggled with in-context learning and arbitrary long-range interactions compared to attention.

Mamba

A specific SSM variant that introduces “selective state spaces”—mechanisms that allow the model to dynamically control which parts of the state get updated. Mamba and its variants have shown strong efficiency and competitive performance at small-to-medium scales (up to 7B), including Falcon Mamba 7B. However, as of late 2025, no pure Mamba model has been publicly demonstrated at 13B+ or frontier scale; above 7B, SSM-style blocks mainly appear inside hybrids with attention.

Hybrid SSM+Attention Architectures (Jamba, Falcon-H1, Qwen3-Next)

Architectures that mix SSM-like blocks with attention layers to get the best of both worlds: SSMs for efficient local/sequential processing and attention for flexible global dependencies and in-context learning. Examples include:

These hybrids show how SSMs can be powerful components without replacing attention outright.


Optimization, Training, and RL

SAM (Sharpness-Aware Minimization)

An optimization technique that modifies the gradient update to favor flatter regions of the loss landscape. Instead of minimizing the loss at the current parameters, SAM approximately minimizes the worst-case loss within a small neighborhood in parameter space. This tends to produce models that generalize better and are more robust, but each update needs two forward-backward passes (one to find the perturbation, one to compute the update gradient), so a vanilla SAM step costs roughly twice a standard optimizer step. In the LLM context, SAM has been used as a drop-in optimizer improvement in some large training runs.
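A minimal sketch of one SAM update on a toy quadratic loss may help. The loss, the helper names `grad` and `sam_step`, and the values of `lr` and `rho` are assumptions chosen for illustration; a real training loop would get gradients from autodiff.

```python
import math


def grad(w):
    # Gradient of the toy loss L(w) = sum(w_i ** 2).
    return [2.0 * wi for wi in w]


def sam_step(w, lr=0.1, rho=0.05):
    # 1) Gradient at the current point gives the ascent direction.
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid divide-by-zero
    # 2) Step to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # 3) Second gradient, taken at the perturbed point.
    g_adv = grad(w_adv)
    # 4) Apply that gradient to the ORIGINAL weights.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(w)
print(w)
```

Each call to `sam_step` evaluates `grad` twice, which is where the extra per-step cost comes from.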

Progressive Depth / Curriculum Training

A training strategy in which a model starts training as a shallower network (fewer layers) and depth is increased as training progresses, or deeper layers are gradually “unfrozen.” This can lead to faster convergence (often ≈10–20% wall-clock savings compared to training the full-depth network from scratch) while reaching similar or better final performance. It is a training-level innovation that uses the same architecture more efficiently.

Routing-Aware Optimization (for MoE)

Training methods that explicitly optimize the MoE router and expert usage: auxiliary losses to balance load, penalties for collapsed experts, and specialized temperature schedules or regularizers. These methods improve expert utilization and stability, which in turn improves effective throughput and quality. ST-MoE-style work and DeepSeek’s MoE stacks are examples of routing-aware optimization in action.
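As one example of an auxiliary balancing loss, the sketch below follows the form used in the Switch Transformer paper: the number of experts times the sum over experts of (fraction of tokens routed to the expert) × (mean router probability for the expert). The function names and the toy logits are assumptions, and other MoE stacks use different formulations.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balance_loss(router_logits):
    """router_logits: one list of per-expert logits for each token."""
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose top-1 choice is expert i.
    counts = [0] * n_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / n_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]

    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))


balanced = load_balance_loss([[5.0, 0.0], [0.0, 5.0]])
collapsed = load_balance_loss([[5.0, 0.0], [5.0, 0.0]])
print(balanced, collapsed)
```

With perfectly even routing the loss sits at its minimum of 1.0. When every token picks the same expert it rises, which gives the optimizer a gradient that pushes the router back toward balance.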

GRPO (Group Relative Policy Optimization)

A reinforcement learning algorithm used in DeepSeek-R1, among others, to train reasoning behavior. GRPO is a policy-gradient variant that samples a group of responses for each prompt and scores each response relative to the rest of its group, so responses that beat the group average are reinforced and the rest are discouraged, without a separately learned value model. It allows the model to discover multi-step reasoning strategies by rewarding useful trajectories, even when supervised labels are sparse or noisy.
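The group-relative part can be sketched in a few lines: rewards for several sampled responses to the same prompt are normalized by the group's mean and standard deviation. This covers only the advantage computation, not the full policy update; the function name, the `eps` constant and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two judged correct, two incorrect.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)
```

Responses above the group mean get a positive advantage and the others get a negative one, so the baseline comes from the group itself.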


Retrieval and Memory-Augmented Methods

RAG (Retrieval-Augmented Generation)

A design pattern in which a language model is paired with an external retrieval system (often a vector database or search index). At query time, relevant documents or snippets are retrieved based on the input, and those retrieved chunks are fed into the model’s context. This effectively gives the model access to a much larger “memory” than its fixed context window, without changing its architecture: the GPU still runs the same Transformer-style computations, but over retrieved text instead of a monolithic long prompt.
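The retrieve-then-prompt flow can be sketched with a toy retriever. Here bag-of-words cosine similarity stands in for a learned embedding model and vector database; `embed`, `retrieve`, `build_prompt` and the sample documents are all hypothetical names and data for illustration.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for a learned embedding model: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, docs, k=1):
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, docs, k=1):
    # Retrieved chunks are simply prepended to the model's context.
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "tensor cores accelerate dense matrix multiplication",
    "capsule networks use routing by agreement",
    "retrieval systems index documents in a vector database",
]
query = "what do tensor cores accelerate"
print(build_prompt(query, docs))
```

Nothing about the model changes here: the retriever only decides which text ends up in the prompt.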


Infrastructure and Hardware

GPU Primitives / Tensor Cores

Modern GPUs (e.g., NVIDIA A100/H100) are optimized around dense matrix multiplications (GEMMs) and tensor-core operations. A “hardware-compatible” neural primitive is one that can be implemented as regular, batched GEMMs that keep tensor cores busy. Operations that require irregular scatter/gather, small non-vectorized ops, or sequential dependencies tend to underutilize the hardware and are harder to scale.

HBM and SRAM (Memory Hierarchy)

HBM (High Bandwidth Memory) is large but slower off-chip memory on GPUs, while SRAM (on-chip cache/registers) is small but extremely fast. Efficient kernels minimize data movement between HBM and SRAM by reusing data in SRAM as much as possible. Both FlashAttention and MLA are designed explicitly around this hierarchy.
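A back-of-the-envelope way to see why data reuse matters is arithmetic intensity: FLOPs performed per byte moved to or from HBM. The sketch below uses an idealized model in which each operand is read once and each result written once, with 2-byte elements; both function names are illustrative and no specific GPU's numbers are assumed.

```python
def gemm_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], idealized traffic."""
    flops = 2 * m * n * k                               # multiply + add
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_intensity(bytes_per_elem=2):
    """One FLOP per element; read one value, write one value."""
    return 1 / (2 * bytes_per_elem)


print(gemm_intensity(4096, 4096, 4096))   # grows with matrix size
print(elementwise_intensity())            # constant, and very low
```

A large GEMM does hundreds of FLOPs per byte moved, while an elementwise op does a fraction of one. That is why unfused elementwise chains end up limited by memory bandwidth rather than compute.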

Block-Sparse Operations

Sparse computations where nonzero elements cluster in blocks rather than arbitrary positions, making them amenable to efficient GPU implementations. MoE and many sparse attention variants are implemented as block-sparse matmuls to preserve parallelism and compatibility with GPU primitives.
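A minimal pure-Python sketch of the idea: store only the nonzero blocks, and run a small dense matmul for each one. The dictionary layout and function names are assumptions for illustration; production kernels use packed formats and do this on tensor cores.

```python
def matmul(a, b):
    # Plain dense matmul on nested lists.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def block_sparse_matmul(blocks, x, block_size, n_block_rows):
    """blocks: {(block_row, block_col): dense block_size x block_size matrix}
    x: dense matrix whose row count is (number of block columns) * block_size.
    """
    n_cols = len(x[0])
    out = [[0.0] * n_cols for _ in range(n_block_rows * block_size)]
    for (br, bc), block in blocks.items():
        # Each stored block multiplies one contiguous tile of x.
        x_tile = x[bc * block_size:(bc + 1) * block_size]
        prod = matmul(block, x_tile)
        for i in range(block_size):
            for j in range(n_cols):
                out[br * block_size + i][j] += prod[i][j]
    return out


# A 4x4 matrix with only its two diagonal 2x2 blocks stored.
blocks = {(0, 0): [[1, 0], [0, 1]], (1, 1): [[2, 0], [0, 2]]}
x = [[1], [2], [3], [4]]
result = block_sparse_matmul(blocks, x, block_size=2, n_block_rows=2)
print(result)
```

The work inside each block is an ordinary dense matmul, so the sparsity pattern decides which blocks run, not how they run.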


Other Architectures and Research Directions

KANs (Kolmogorov-Arnold Networks)

Architectures that replace standard linear layers and fixed activations with spline-based, learnable functions on edges, motivated by the Kolmogorov-Arnold representation theorem. They offer a more flexible function class in principle but involve spline evaluations that do not map cleanly to dense GEMMs, making them less hardware-friendly. Through late 2025, KANs have generated research interest and hype but have not been demonstrated at frontier LLM scale.

Neural ODEs (Neural Ordinary Differential Equations)

Models that interpret layer depth as a continuous variable and use ODE solvers to evolve hidden states. This offers theoretical elegance and continuous-depth flexibility, but ODE solvers are iterative and sequential, conflicting with GPUs’ need for large, regular batches of parallel work. Neural ODEs have influenced theory and some niche applications but have not become a mainstream architecture for large-scale LLMs.
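The sequential dependency is easy to see in even the simplest solver. Below is a fixed-step Euler integrator standing in for the adaptive solvers typically used with Neural ODEs; the dynamics function `f` is a hand-written toy rather than a learned network, and all names are illustrative.

```python
import math


def euler_integrate(f, h0, t0, t1, steps):
    """Evolve dh/dt = f(h, t) from t0 to t1 with fixed-step Euler."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        # Each step needs the previous step's h: no parallelism over steps.
        h = h + dt * f(h, t)
        t += dt
    return h


# Toy dynamics dh/dt = -h, whose exact solution at t=1 is exp(-1).
approx = euler_integrate(lambda h, t: -h, h0=1.0, t0=0.0, t1=1.0, steps=1000)
print(approx, math.exp(-1.0))
```

More accuracy means more solver steps, and every added step lengthens a chain of dependent operations instead of widening a batch.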

Capsule Networks

Architectures that group neurons into “capsules” and use routing-by-agreement mechanisms to decide how lower-level capsules contribute to higher-level ones. They aim to capture part-whole relationships and invariant representations. However, the routing logic is prescriptive and computationally heavy, and despite early excitement, capsule networks have not been adopted in large-scale production LLMs; they mostly remain a historical example of a compositional idea that did not cross the infra chasm.


Hardware and Kernel Primer

A mental model for how hardware, kernels, and frameworks shape what “actually works.”

This primer gives a quick, practical view of how GPU hardware, kernels, and frameworks fit together, and why they create the Hardware Compatibility Filter and Infrastructure Lag Pattern described in the main article. For definitions of specific acronyms and components (HBM, SRAM, GEMM, tensor cores, etc.), see the Glossary and Key Concepts section.


1. The Three-Layer Stack

At a high level, modern LLM infrastructure looks like a three-layer stack:

  1. Hardware – GPUs, tensor cores, HBM/SRAM, interconnects (PCIe, NVLink).

  2. Kernels and Libraries – CUDA/Triton kernels, cuBLAS/cuDNN, FlashAttention, block-sparse matmuls.

  3. Frameworks and Models – PyTorch/JAX operators, Transformer/MoE/SSM blocks, full model architectures.

The key idea: models only run fast if their operations can be expressed as a small number of highly optimized kernels that fit the hardware well. Every time you invent a new primitive that needs a new kernel, you pay an integration and validation tax across this stack.


2. What a Kernel Actually Is

In this context, a kernel is not just an algorithm on paper; it is the concrete, low-level implementation of that algorithm on GPUs:

Writing a good kernel is hard. Making it production-grade is harder: you need tests, fallbacks, debugging hooks, and framework integration. This is why “kernel-level changes” show a ~24-month lag from paper → library → framework → production (see Appendix, Table 4).


3. How Hardware Shapes Compatibility

Modern GPUs (A100, H100, etc.) are built around three key properties:

  1. Dense matrix multiplications (GEMMs) on tensor cores.

  2. Hierarchical memory – small, fast SRAM on-chip + large, slower HBM off-chip.

  3. Massive parallelism – thousands of threads executing similar instructions in lockstep.

A technique is hardware-compatible if:

Techniques that require irregular scatter/gather, heavy branching, or inherently sequential computation (e.g., iterative solvers) tend to underutilize the hardware and are much harder to push to frontier scale.


4. Concrete Examples

4.1 FlashAttention: Same Math, Better Kernel

This is why FlashAttention could become a near-universal drop-in: it is a kernel-level improvement that respects GPU primitives while keeping the higher-level abstraction identical. The price was ~24 months of integration work across libraries and frameworks (Appendix, Table 4).
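The "same math" part rests on a streaming form of softmax. The sketch below handles a single query with scalar values and checks that block-by-block processing with a running maximum matches the all-at-once computation. It is a simplified illustration of the online-softmax idea, not the FlashAttention kernel itself; function names and inputs are assumptions.

```python
import math


def attention_reference(scores, values):
    # Standard softmax-weighted sum, with all scores available at once.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_streaming(scores, values, block=2):
    # Process scores block by block, keeping only running statistics.
    m = float("-inf")    # running max of scores seen so far
    denom = 0.0          # running softmax denominator
    acc = 0.0            # running weighted sum of values
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)    # rescale earlier partial sums
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - m_new)
            denom += w
            acc += w * v
        m = m_new
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.0, 0.1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
ref = attention_reference(scores, values)
stream = attention_streaming(scores, values)
print(ref, stream)
```

Because only `m`, `denom` and `acc` persist between blocks, the full score row never has to be materialized, which is what lets a kernel keep its working set in fast on-chip memory.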

4.2 GQA and MLA: Weight-Only vs Compressed KV

In both cases, the operations remain “tensor-core friendly,” which is why they spread quickly once shown to work.

4.3 MoE: Conditional Compute Needs Kernel + System Help

Mixture-of-Experts introduces block-sparse compute and conditional routing:

Efficient MoE requires:

This is why MoE falls into the Level 3 (serving + distributed system) category and shows a ~36-month lag from Switch → Mixtral/DeepSeek/Llama 4 (Appendix, Table 4).
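One piece of that kernel and system work is turning per-token routing decisions into per-expert batches. A minimal top-1 dispatch sketch follows; the function name and router scores are hypothetical, and real systems add capacity limits and cross-device communication on top.

```python
from collections import defaultdict


def dispatch_top1(router_scores):
    """Group token indices by their highest-scoring expert."""
    groups = defaultdict(list)
    for token_idx, scores in enumerate(router_scores):
        expert = max(range(len(scores)), key=scores.__getitem__)
        groups[expert].append(token_idx)
    return dict(groups)


# Three tokens, two experts.
router_scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
groups = dispatch_top1(router_scores)
print(groups)
```

Each expert can then process its group as one dense batch. The group sizes depend on the input, so load imbalance shows up directly as uneven batch sizes across experts.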

4.4 SSMs vs Attention: Parallelism Gap

State Space Models (SSMs) like Mamba use sequential state updates:

On GPUs:

The result in 2023–2025:

This illustrates how parallelism and kernel complexity can limit which architectures make it to frontier scale on today’s GPU stacks.


5. How This Creates Infrastructure Lag

Putting it together, the Infrastructure Lag Pattern emerges naturally from these layers:

Architectural ideas that align with existing kernels and hardware (GQA, RoPE, RMSNorm) diffuse quickly. Those that require entirely new kernels or system designs (FlashAttention, MoE, SSMs at scale) move more slowly. Those that fight GPU primitives (e.g., heavy spline evaluations in KANs, iterative solvers in Neural ODEs) face the steepest uphill battle and, so far, have not crossed into frontier-scale production.

This is the practical substrate of the Hardware Compatibility Filter: it is not just about mathematical elegance, but about how gracefully a technique flows through hardware, kernels, frameworks, and serving systems.


Bibliography and References

This bibliography collects the main external sources cited in the article “What Actually Works: The Hardware Compatibility Filter in Neural Architecture (2023–2025)”.


Primary sources (papers, technical reports, official model cards/docs) back quantitative and architectural claims. Secondary sources (e.g., Wikipedia, third-party analyses) are used only for high-level summaries and inferred details of closed models and are explicitly labeled as such.

1. Mixture-of-Experts and Conditional Computation


2. Flash Attention and Infrastructure Lag


3. State Space Models and Hybrid Architectures


4. Training and Optimization Techniques


5. Frontier Model Documentation and Model Cards


6. Production Deployment, RAG, and Serving Systems


7. Modular and Compositional Learning


8. Additional 2025 Research Directions (Speculative / Emerging)

These references inform forward-looking observations and are not treated as production-standard techniques.
