The way large language models are built has changed more in the last 18 months than in the previous five years combined. It is no longer just about making models bigger. The frontier has shifted toward smarter training paradigms, sparse architectures, and the ability to reason at inference time. This guide covers all of it — from the transformer foundation every LLM still runs on, to the reinforcement learning techniques reshaping what these models can do.
Whether you are a developer trying to choose the right model for a production system, or an engineer who wants to understand what is actually happening under the hood, this is the reference you need.
Table of Contents
- The Transformer Foundation
- Core Transformer Components
- How Attention Works: The Key Innovation
- The 2025 Shift: Reasoning Models and RLVR
- Mixture of Experts: The Efficiency Revolution
- Modern Attention Variants
- Frontier Models: Architecture Deep Dive
- Scaling Laws: Old and New
- Choosing the Right Architecture
- What Comes Next
The Transformer Foundation
Every major LLM you can name today — GPT, Claude, Gemini, LLaMA, DeepSeek — is built on the transformer architecture introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. at Google. That is not a coincidence. The transformer solved two problems that had been holding back language models for years.
The problem with what came before. Recurrent Neural Networks (RNNs) and LSTMs processed text sequentially — one word at a time, in order. This created two major limitations. First, training could not be parallelized across the sequence, so it was extremely slow. Second, the models struggled to “remember” context from many steps back, a limitation rooted in the vanishing gradient problem. Ask an RNN to interpret the word “it” at the end of a long paragraph, and it might have already forgotten the subject it refers to.
What transformers did differently. Instead of sequential processing, transformers process all tokens in a sequence simultaneously and let every token attend to every other token through a mechanism called self-attention. This allows the model to directly connect distant parts of text in a single step — and because the computation is parallel, training is orders of magnitude faster.
The practical result was that transformer models could be trained at scales that were simply not feasible before, leading directly to the emergence of GPT-3, BERT, and everything that followed.
Core Transformer Components
A decoder-only transformer (the type used by most modern chat models) is made up of a handful of fundamental components, stacked repeatedly:
Embedding Layer. Raw text is first tokenized — broken into subword units called tokens — and then converted into dense numerical vectors. These embeddings capture semantic relationships between words. “King” and “Queen” end up closer together in this vector space than “King” and “table.”
Positional Encoding. Self-attention treats all tokens equally regardless of position. Positional encodings inject information about the order of tokens so the model knows that “dog bites man” and “man bites dog” are different sentences. Most modern models use rotary schemes like Rotary Position Embedding (RoPE), which encode relative position and are covered later.
Multi-Head Attention. The core mechanism. Rather than computing one set of attention weights, the model computes many in parallel — each “head” can learn to attend to different relationships (syntactic structure in one head, semantic similarity in another). The outputs are concatenated and projected.
Feed-Forward Network (FFN). After attention, each token position is passed through a two-layer feed-forward network independently. This is where the majority of a model’s parameters live, and it is the layer that Mixture of Experts architectures replace with specialized expert networks.
Layer Normalization and Residual Connections. These are stability mechanisms. Residual connections (borrowed from ResNet) allow gradients to flow more easily through deep networks during training. Layer normalization ensures activations stay in a healthy range.
A transformer block is just these components combined. Modern LLMs stack anywhere from 32 to 128 of these blocks on top of each other.
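To make the stacking concrete, here is a minimal sketch of one pre-norm decoder block in PyTorch. The dimensions are illustrative placeholders, and production models add details omitted here (RoPE, grouped-query attention, RMSNorm, and so on).

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        # Causal mask: each token may only attend to itself and earlier positions
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=mask)[0]   # residual around attention
        x = x + self.ffn(self.norm2(x))                 # residual around the FFN
        return x

x = torch.randn(1, 16, 512)
out = TransformerBlock()(x)   # same shape as the input: (1, 16, 512)
```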
How Attention Works: The Key Innovation
Self-attention is the heart of why transformers work. Here is the intuition.
For every token in a sequence, the attention mechanism asks: “Given what I know about all the other tokens in this sequence, which ones are most relevant to understanding this token?” The answer is computed by turning each token into three vectors — a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). The dot product of a Query against all Keys produces attention scores, which are normalized and used to create a weighted sum of Values.
The result is that every token gets a representation that incorporates context from the entire sequence, weighted by relevance.
# Simplified self-attention (conceptual, runnable NumPy version)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, W_query, W_key, W_value):
    # tokens: (seq_len, d_model); each W_*: (d_model, d_k)
    Q = tokens @ W_query   # What am I looking for?
    K = tokens @ W_key     # What do I contain?
    V = tokens @ W_value   # What information do I carry?
    # Score: how much should token i attend to token j?
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    # Weighted combination of values
    return scores @ V
The “multi-head” part means this computation runs in parallel across many independent subspaces. One head might learn to track grammatical dependencies; another might track coreference (what “it” refers to). The outputs are concatenated and mixed back together.
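As a rough sketch of how the heads fit together, the following builds on the single-head function above; the per-head weight lists and the output projection W_out are illustrative placeholders.

```python
# Builds on self_attention and numpy from the snippet above; shapes are illustrative
def multi_head_attention(tokens, W_q_heads, W_k_heads, W_v_heads, W_out):
    # Run single-head attention once per head, each in its own subspace
    head_outputs = [
        self_attention(tokens, W_q, W_k, W_v)
        for W_q, W_k, W_v in zip(W_q_heads, W_k_heads, W_v_heads)
    ]
    # Concatenate the heads and mix them back together with an output projection
    return np.concatenate(head_outputs, axis=-1) @ W_out
```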
The One Limitation: Quadratic Complexity
Standard attention has a cost that scales as O(n²) with sequence length — computing scores for every pair of tokens. For a 4,096-token sequence, that is manageable. For a 1 million-token context window, it is prohibitively expensive. This is why so much of the 2025 architectural innovation has been about making attention more efficient, which we cover in the Modern Attention Variants section.
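A quick back-of-the-envelope calculation makes the gap concrete:

```python
# Number of pairwise attention scores per head, per layer
for n in (4_096, 1_000_000):
    print(f"{n:>9,} tokens -> {n * n:,} scores")
# 4,096 tokens: ~16.8 million scores; 1,000,000 tokens: ~1 trillion (about 60,000x more)
```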
The 2025 Shift: Reasoning Models and RLVR
The single biggest architectural development of 2025 was not a new attention mechanism or a larger model. It was a new training paradigm: Reinforcement Learning from Verifiable Rewards (RLVR).
What Changed
For most of LLM history, the dominant scaling law was simple: bigger model + more data + more compute = better performance. This held from GPT-2 through GPT-4. By 2024, however, returns from simply scaling pretraining were diminishing, and the cost was becoming astronomical.
RLVR offered a different path. Instead of only training a model to predict the next token (which teaches it what answers look like), you train it to actually solve problems using reinforcement learning — rewarding it when it gets correct answers on verifiable tasks like math and code.
DeepSeek-R1: The Proof of Concept
In January 2025, DeepSeek released DeepSeek-R1, demonstrating that reasoning capabilities in LLMs can be developed through reinforcement learning without requiring human-labeled reasoning trajectories. The paper, published in Nature in September 2025, showed that a model trained with pure RL (DeepSeek-R1-Zero) spontaneously develops chain-of-thought reasoning, self-verification, and backtracking behaviors — nobody explicitly taught it to “show its work.”
The RL algorithm used is called GRPO (Group Relative Policy Optimization), a simplified variant of PPO that eliminates the need for a separate value/critic model. The reward is binary and verifiable: did the final answer match the ground truth?
RLVR Training Loop (simplified):
1. Present model with a math problem
2. Model generates a response (with or without reasoning steps)
3. Verifier checks: is the final answer correct?
4. If correct → reward +1 (reinforce the behavior)
5. If wrong → reward -1 (discourage this behavior)
6. Repeat across millions of examples
Key insight: The model learns that writing out reasoning
steps leads to more correct answers — so it learns to reason.
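The group-relative part of GRPO can be sketched in a few lines, assuming the binary rewards from the loop above; the full objective also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
import numpy as np

def grpo_advantages(rewards):
    # rewards: one verifiable reward per sampled response to the *same* prompt
    r = np.asarray(rewards, dtype=float)
    # Each response's advantage is its reward relative to the rest of the group,
    # which is what lets GRPO drop the separate value/critic model used by PPO
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four sampled answers to one math problem, only the last two correct
print(grpo_advantages([-1, -1, +1, +1]))  # -> roughly [-1, -1, +1, +1]
```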
DeepSeek-R1-Zero started with an AIME 2024 benchmark score of 15.6% and reached 71.0% through RL training alone. This was a watershed moment: reasoning behavior emerged from reward signals, not from human demonstration.
The full DeepSeek-R1 model (which adds cold-start supervised data and two RL stages) achieved performance comparable to OpenAI o1 across math, code, and general reasoning tasks — and it was open-sourced.
Test-Time Compute: Thinking Longer to Think Better
A related discovery is that you can trade inference compute for accuracy. The same model can use varying amounts of reasoning tokens before producing a final answer. Harder problems benefit from longer thinking chains. This creates a new scaling axis that did not exist before: you do not need a bigger model, you need the same model thinking for longer.
This is the mechanism behind OpenAI’s o-series models, DeepSeek-R1, and the “thinking mode” features in Qwen3 and Claude. It represents a fundamental shift from “bigger at training time” to “smarter at inference time.”
Mixture of Experts: The Efficiency Revolution
The second major architectural trend of 2025 was the mainstream adoption of Mixture of Experts (MoE). By mid-2025, nearly every major open-weight model — DeepSeek V3, LLaMA 4, Qwen3, Mistral 3 Large — had moved to MoE or MoE variants. Sebastian Raschka’s comprehensive architecture comparison (July 2025) confirms MoE as the dominant trend among frontier open-weight models.
The Core Idea
In a standard transformer, every token passes through the same feed-forward network (FFN) in each layer. In an MoE model, the FFN is replaced by a collection of “expert” FFNs, with a learned router that dynamically selects which experts process each token. Only a small fraction of experts are activated per token, keeping computational costs low even as total model capacity grows dramatically.
Standard dense model (hypothetical 671B dense):
Token → FFN (all 671B parameters activated)
Cost: high compute, high memory

MoE model (DeepSeek V3 example):
Token → Router → selects 8 of 256 routed experts + 1 shared expert
Only ~37B parameters activated per token
Total parameters: 671B
Effective compute cost: roughly 94% lower per token than the dense equivalent
How Routing Works
The router assigns each token a score for every expert and selects the top-k. The outputs of the selected experts are combined using a weighted sum, where weights are proportional to the router scores. A key engineering challenge is load balancing — without intervention, the router tends to send most tokens to a small number of favorite experts. An auxiliary loss term during training penalizes uneven expert utilization to encourage balanced routing.
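Here is a minimal sketch of top-k routing for a single token, assuming a simple linear router; real implementations add load-balancing losses, expert capacity limits, and batched dispatch.

```python
import numpy as np

def moe_layer(token, W_router, experts, shared_expert=None, k=8):
    # token: (d_model,); W_router: (d_model, n_experts); experts: list of FFN callables
    logits = token @ W_router
    top_k = np.argsort(logits)[-k:]              # indices of the k highest-scoring experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                     # softmax over the selected experts only
    output = sum(w * experts[i](token) for w, i in zip(weights, top_k))
    if shared_expert is not None:                # always-active shared expert (DeepSeek-style)
        output = output + shared_expert(token)
    return output
```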
Design Variations in 2025 Models
Different models have made different architectural choices, reflecting active experimentation with the trade-offs:
DeepSeek V3/R1 uses MoE in nearly every transformer block (except the first 3 dense layers for training stability), with 256 routed experts plus 1 always-active shared expert per layer. Each token activates 8 routed experts plus the shared one. The shared expert captures common patterns across all inputs while routed experts specialize. Multi-head Latent Attention (MLA) is used to compress the KV cache, reducing memory footprint at long contexts.
LLaMA 4 Maverick (released April 2025) uses alternating dense and MoE layers — a design Meta chose for training stability. It has 128 routed experts per MoE layer plus a shared expert, with each token activating just 1 routed expert plus the shared one. This gives 17B active parameters against 400B total. Scout, the smaller LLaMA 4 model, uses 16 experts and achieves a 10 million token context window — the longest of any publicly available model.
Qwen3 235B-A22B (released May 2025 by Alibaba) uses 128 experts per layer, activating 8 per token, with no shared expert — a design choice possibly motivated by having enough experts that shared representation is less necessary. Qwen3’s flagship innovation is a unified “thinking” and “non-thinking” mode within a single model, eliminating the need to switch between a chat model and a reasoning model.
Mistral 3 Large (released December 2025) marks Mistral’s return to MoE after years of dense-only releases. It is a 675B parameter model with 39B active parameters, inspired in part by the DeepSeek V3 architecture.
The Key Trade-Off
MoE models have more total parameters than dense models of equivalent computational cost, but all of those parameters must be held in memory even though most are inactive for any given token. This is why, for example, the LLaMA 4 Scout model — despite having only 17B active parameters — will not fit on consumer hardware with 64GB of RAM without aggressive quantization: all 109B total parameters must be loaded.
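A rough weights-only estimate (ignoring the KV cache and activations) illustrates the point:

```python
# Weight memory alone, in GB: billions of params x bytes per param
def weight_memory_gb(total_params_billions, bytes_per_param):
    return total_params_billions * bytes_per_param   # 1e9 params x N bytes = N GB

print(weight_memory_gb(109, 2.0))  # ~218 GB in BF16
print(weight_memory_gb(109, 1.0))  # ~109 GB at 8-bit
print(weight_memory_gb(109, 0.5))  # ~55 GB at 4-bit, the only regime that approaches 64 GB
```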
Modern Attention Variants
The standard multi-head attention mechanism is powerful but expensive. Several architectural innovations from 2024–2025 have made long-context inference practical.
Grouped Query Attention (GQA)
GQA, which became standard in LLaMA 2 and many subsequent models, reduces the number of key and value heads while keeping the number of query heads the same. Instead of every query head having its own K/V head, groups of query heads share a single K/V head. This reduces the size of the KV cache (the memory required to store past keys and values during generation) by a factor equal to the grouping ratio.
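A rough estimate shows where the savings come from; the layer and head counts below are illustrative, not taken from any specific model.

```python
# Rough KV-cache size: 2 (K and V) x layers x KV heads x head_dim x cached tokens x bytes
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# Hypothetical 64-layer model with 128-dim heads at a 128K-token context (BF16 cache)
print(kv_cache_gb(64, 64, 128, 128_000))  # MHA, 64 KV heads: ~268 GB
print(kv_cache_gb(64,  8, 128, 128_000))  # GQA,  8 KV heads: ~34 GB (8x smaller)
```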
Multi-head Latent Attention (MLA)
Introduced in DeepSeek V2 and used in V3 and R1, MLA compresses K and V representations into a low-rank “latent” vector before storing them in the cache. At inference time, the keys and values are reconstructed from this compressed representation. The result is a 50–70% reduction in KV cache memory compared to standard multi-head attention, with empirical performance that matches or exceeds standard attention on most benchmarks.
# MLA conceptual sketch (uses numpy and softmax from the snippet above)
def multi_head_latent_attention(x, W_q, W_down_kv, W_up_k, W_up_v):
    # Compress the hidden states into a small low-rank latent; the latent is
    # what gets cached, instead of the full keys and values
    kv_latent = x @ W_down_kv
    # Reconstruct keys and values from the latent at attention time
    K = kv_latent @ W_up_k
    V = kv_latent @ W_up_v
    # Standard attention from here
    Q = x @ W_q
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V
Sparse Attention
DeepSeek V3.2 (September 2025) introduced a sparse attention mechanism that avoids computing scores for every token pair: each token attends to only a small, selected subset of the sequence rather than to every previous token. This reduces the computational cost from O(n²) toward roughly O(n × k), where k is the number of attended positions, making very long contexts tractable. MiniMax-M1 and Qwen3-Next also adopted linear attention variants in 2025, though MiniMax subsequently reverted to standard attention in their M2 release, citing production reliability challenges.
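As an illustration of the general idea (not DeepSeek's exact selection mechanism), here is a minimal sketch of one classic sparse pattern: a local window plus strided anchor positions.

```python
import numpy as np

def sparse_attention_mask(n, window=128, stride=512):
    # mask[i, j] = True means token i may attend to token j
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i                    # decoder-style: never attend to future tokens
    local = (i - j) < window           # a window of recent neighbours
    strided = (j % stride) == 0        # a sparse set of distant "anchor" positions
    return causal & (local | strided)

mask = sparse_attention_mask(4096)
print(mask.mean())  # fraction of token pairs actually scored: a few percent instead of ~50%
```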
Rotary Position Embeddings (RoPE)
RoPE has become the near-universal standard for position encoding in modern LLMs. Instead of adding fixed positional vectors to token embeddings, RoPE rotates the query and key vectors by an angle proportional to position before computing attention scores. The key advantage is that relative distances are naturally encoded: the dot product between a query at position i and a key at position j depends only on their difference (i − j), not their absolute positions. This makes it easier to extend to longer contexts than the model was trained on.
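A minimal NumPy sketch of the rotation; the exact pairing of dimensions varies between implementations.

```python
import numpy as np

def rope(x, positions, base=10_000.0):
    # x: (seq_len, d) query or key vectors, d even; positions: (seq_len,) token indices
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # one frequency per dimension pair
    angles = positions[:, None] * inv_freq[None, :]     # rotation angle grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x_even * cos - x_odd * sin           # rotate each (even, odd) pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Because rotations compose, the score between rope(q)[i] and rope(k)[j]
# depends only on the offset i - j, not on the absolute positions.
```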
Frontier Models: Architecture Deep Dive
DeepSeek V3 and R1
Built on a 671B parameter MoE architecture with 37B active parameters per token, DeepSeek V3 was pretrained on 14.8 trillion tokens using FP8 mixed-precision on 2048 H800 GPUs at a documented compute cost of approximately $5.5M for the final run — significantly cheaper than comparable frontier models. DeepSeek-R1 is the same architecture with RLVR post-training applied, achieving performance comparable to OpenAI o1 on math, code, and reasoning benchmarks. Both models are open-weight under MIT license.
The key architectural innovations are MLA (Multi-head Latent Attention), FP8 training precision, and the shared expert MoE design. DeepSeek V3.2 added sparse attention for improved long-context efficiency.
LLaMA 4 (Meta)
Released April 2025, LLaMA 4 is Meta’s first natively multimodal and first MoE model family. Scout (17B active / 109B total, 16 experts) fits on a single H100 and supports a 10 million token context. Maverick (17B active / 400B total, 128 experts) targets higher capability on a single H100 host. Behemoth, still in preview, has 288B active parameters and nearly 2 trillion total parameters. LLaMA 4 models are trained with FP8 precision on over 30 trillion tokens covering 200 languages.
Meta’s design choice of alternating dense and MoE layers differs from DeepSeek’s approach of MoE in nearly every block. The trade-off is between training stability (favoring dense layers early) and maximum parameter efficiency (favoring MoE everywhere).
Qwen3 (Alibaba)
Qwen3 (released May 2025) spans a family of dense models from 0.6B to 32B alongside two MoE models (30B-A3B and 235B-A22B). The flagship 235B-A22B has 22B active parameters and uses 128 experts with top-8 routing. The defining innovation is the unified thinking/non-thinking mode: a single model switches between fast responses and extended chain-of-thought reasoning based on the task or user preference, with a configurable “thinking budget” that allocates more or fewer reasoning tokens per query. All Qwen3 models are released under Apache 2.0 and support 119 languages.
Claude (Anthropic)
Anthropic has not publicly disclosed architectural details or parameter counts for Claude models. What is known is that training incorporates Constitutional AI — a method where the model critiques and revises its own outputs according to a set of principles before human feedback is applied, reducing the volume of human annotation required while improving alignment. Claude models support extended thinking with tool use, meaning the model can reason across multiple tool calls during inference.
Gemini (Google)
Gemini’s key differentiator is native multimodality: unlike models that add vision capabilities via a separate encoder bridged to a language model, Gemini is trained on text, images, audio, and video from pretraining, using a unified token space across modalities. The 2.5 series introduced a “Deep Think” reasoning mode. Gemini 2.5 Pro supports a 1 million token context using ring attention and sparse attention optimizations to make quadratic attention tractable at that scale.
Scaling Laws: Old and New
The Chinchilla Era (pre-2025)
The Chinchilla paper (Hoffmann et al., 2022) established the dominant scaling law: for a fixed compute budget, you should scale model parameters and training tokens in roughly equal proportion. Bigger models are not always better — an undertrained large model loses to a well-trained smaller one.
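In concrete numbers, the paper's recipe works out to roughly 20 training tokens per parameter. A hedged sketch using the common C ≈ 6·N·D approximation for training FLOPs:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter, with training
# compute approximated as C ~ 6 * N * D FLOPs (N params, D tokens)
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

n, d = chinchilla_optimal(5.8e23)   # roughly Chinchilla's own training budget
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")  # ~70B and ~1.4T
```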
The Post-RLVR Era
RLVR introduced a new dimension to scaling. The question is no longer just “how big is the model and how many tokens did it see?” but also “how much RL compute was applied post-training, and how much inference compute is allocated at test time?” As Sebastian Raschka documented in his 2025 year-in-review, the return on investment from RLVR post-training often exceeds the return from equivalent pretraining compute. DeepSeek V3’s $5.5M pretraining produced a GPT-4-class model; applying RLVR on top produced an o1-class reasoning model for a fraction of additional cost.
The emerging framework for thinking about LLM capability has four axes: pretraining scale, pretraining data quality, post-training (RLVR) compute, and test-time compute. Optimizing all four — rather than just the first two — is now the frontier of research.
Choosing the Right Architecture
Dense vs. MoE
Dense models (parameters fully activated per token) are simpler to train, easier to deploy, and have more predictable memory requirements. MoE models offer higher capability per unit of compute but require storing all parameters in memory during inference. For self-hosted deployments with tight memory constraints, a well-trained dense model often makes more practical sense than an MoE model with equivalent active parameters.
Reasoning vs. Standard Models
Reasoning models (those trained with RLVR and capable of extended thinking) are significantly more capable on complex problems but slower and more expensive to run. For tasks like simple Q&A, summarization, or straightforward code generation, a standard model will give faster, cheaper results with comparable quality. For multi-step math, complex debugging, or ambiguous reasoning tasks, a reasoning model is worth the cost.
Context Window Considerations
Most real-world applications do not need 1 million tokens. If your task involves documents under ~50 pages, a model with a 128K context is sufficient and cheaper to run. Very long context models trade efficiency for flexibility: LLaMA 4 Scout's 10M context is remarkable, but inference cost still scales with the amount of context you actually send.
Model Selection Reference
| Use Case | Recommended Options |
|---|---|
| Complex reasoning, math, code | DeepSeek R1, o3, Claude with extended thinking |
| General assistant, balanced cost/quality | Claude Sonnet, Gemini 2.5 Pro, Qwen3-32B |
| Long document analysis (>200K tokens) | Gemini 2.5 Pro (1M context), LLaMA 4 Scout |
| Cost-sensitive high-volume production | DeepSeek V3, Qwen3-A22B, Mistral 3 Large |
| Open-source / self-hosted | LLaMA 4, Qwen3, DeepSeek V3 (Qwen3 Apache 2.0, DeepSeek MIT, LLaMA 4 community license) |
| On-device / edge deployment | Qwen3-0.6B to 4B, Llama 3.2 1B/3B |
| Fine-tuning for a domain | LLaMA 4, Qwen3, Mistral (best tooling support) |
What Comes Next
Hybrid Architectures
Several 2025 models experimented with combining transformer attention with state-space models (SSMs) like Mamba. NVIDIA’s Nemotron 3 (December 2025) uses a Transformer-Mamba hybrid. The appeal is that SSMs can handle very long sequences with constant (not quadratic) memory, but they are harder to train at scale than transformers. Expect more hybrid experimentation in 2026.
Smaller Models Getting More Capable
One of the clearest trends in 2025 was capability compression: smaller models reaching the performance bar that only larger models could achieve a year earlier. Distillation from reasoning models (DeepSeek released 1.5B through 70B models distilled from R1) and improved training pipelines are driving this. The implication for developers is that cost-effective deployment is becoming more accessible, not less.
Reasoning as a Standard Feature
The boundary between “reasoning models” and “standard models” is dissolving. Qwen3 integrates both in a single model. Claude and Gemini have added thinking modes to their main model families. By 2026, extended thinking will likely be a standard feature rather than a distinguishing one, with models adapting the depth of their reasoning dynamically based on task complexity.
The Architecture Foundation Stays
Despite all the innovation, decoder-only transformers remain the dominant architecture for generation tasks. Self-attention is still the core mechanism. Pre-training followed by post-training is still the standard recipe. The innovation is happening at the margins — in how we train, how we route computation, and how we scale inference — not in replacing the fundamental building blocks.
Key Takeaways
The most important things to take away from the 2025 LLM architecture landscape:
RLVR changed the training calculus. Smarter post-training often beats simply scaling up pretraining. DeepSeek-R1’s emergence behavior — developing chain-of-thought reasoning from pure RL rewards — was the research story of the year.
MoE is now mainstream. Nearly every major open-weight frontier model uses sparse expert routing. This is not a niche efficiency technique anymore; it is the standard approach for building large-parameter, low-active-compute models.
Transformers are not going anywhere. Despite experimentation with SSMs, linear attention, and hybrid architectures, the decoder-only transformer is still the dominant paradigm, and there is no clear successor on the immediate horizon.
Test-time compute is a new scaling axis. The ability to allocate more inference compute to harder problems — and less to easy ones — is reshaping how developers should think about model selection and deployment cost.
Open-source caught up. LLaMA 4, Qwen3, and DeepSeek V3/R1 are genuinely competitive with frontier closed models on most benchmarks and are available as open weights for research, fine-tuning, and deployment, subject to their respective licenses.
Further Reading
- Attention Is All You Need (Vaswani et al., 2017) — The original transformer paper
- DeepSeek-R1 Technical Paper (Nature, 2025) — RLVR reasoning via pure RL
- DeepSeek-V3 Technical Report (arXiv) — MLA and MoE architecture details
- Qwen3 Technical Report (arXiv) — Hybrid thinking/non-thinking unified model
- The Big LLM Architecture Comparison — Sebastian Raschka — The most thorough independent architecture comparison for 2025 open models
- State of LLMs 2025 — Sebastian Raschka — Year-in-review covering RLVR, GRPO, and scaling trends
- LLaMA 4 Official Blog (Meta) — Official MoE and multimodal design decisions