GEMMA 4 ARCHITECTURE

Complete technical comparison of Google's Gemma 4 model family — released April 2, 2026 — Apache 2.0

Gemma 4 is Google DeepMind's fourth-generation open model family, released under Apache 2.0. It is the first generation where the architecture itself -- not just scale or training data -- becomes the primary axis of differentiation across variants. The family spans four models (31B, 26B-A4B, E4B, E2B) that share a common structural DNA but diverge in how they allocate capacity, reflecting a design philosophy where the same building blocks are composed differently for server, edge, and on-device deployment. The smaller E-series variants (E2B, E4B) support all four modalities -- text, image, video, and audio -- making them true any-to-any models.

The Shared Skeleton

Every Gemma 4 model is a decoder-only transformer with hybrid sliding/full attention. Layers alternate between local sliding-window attention and global full attention in a fixed ratio (5:1 or 4:1). This is not new -- Gemma 3 and Mistral pioneered it. What is new is that Gemma 4 makes these two layer types structurally different:

  • Sliding layers (local): head_dim=256, more KV heads, standard RoPE (theta=10K), full rotation. Optimized for fine-grained local patterns within a 512-1024 token window.
  • Full layers (global): head_dim=512, fewer KV heads, p-RoPE (theta=1M, partial=0.25), K=V weight sharing. Optimized for long-range semantic attention across the entire 256K context.

This dual-config approach means that every 6th layer operates with a completely different attention geometry -- wider heads, fewer KV groups, and only 25% of dimensions carrying positional information. The transition from Gemma 3 to Gemma 4 can be summarized as moving from "same attention, different window" to "different attention, different window."
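
The alternation can be sketched in a few lines. This is a hypothetical illustration (the names `SLIDING`, `FULL`, and `layer_types` are not from the Gemma 4 code), assuming the 5:1 ratio places a full-attention layer at every 6th position, with the last layer forced to full attention as in the published config:

```python
# Per-type attention geometry (values from the Gemma 4 31B config)
SLIDING = {"head_dim": 256, "kv_heads": 16, "rope_theta": 10_000.0,
           "partial_rotary_factor": 1.0}                 # full rotation
FULL = {"head_dim": 512, "kv_heads": 4, "rope_theta": 1_000_000.0,
        "partial_rotary_factor": 0.25}                   # p-RoPE, K=V sharing

def layer_types(n_layers, period=6):
    # 5:1 ratio -> every 6th layer is full attention, the rest slide
    types = ["full_attention" if (i + 1) % period == 0 else "sliding_attention"
             for i in range(n_layers)]
    types[-1] = "full_attention"  # last layer forced to full (see config)
    return types

types = layer_types(60)
print(types.count("sliding_attention"), types.count("full_attention"))
```

For the 31B's 60 layers this yields 50 sliding and 10 full layers, matching the per-block breakdown later in this page.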

Four Targets, Four Compositions

The family spans four deployment regimes, each making distinct architectural trade-offs within this shared framework:

31B Server · Dense
The flagship dense model (Arena score: 1452). 60 layers of GQA + GeGLU with the full Gemma 4 attention design. No compression tricks -- every layer computes fresh K, V, and FFN. Image-Text-to-Text with a 27-layer ViT vision encoder using 2D RoPE and learned spatial embeddings. Supports variable image token budgets (70, 140, 280, 560, 1120 tokens).
26B-A4B Server · MoE
The efficiency variant (Arena score: 1441 with only 4B active). 26B total parameters, but only ~4B active per token. Every layer runs a dense GeGLU FFN (hidden=2,112) in parallel with a 128-expert MoE (top-8, hidden=704 each) -- their outputs are summed. This is architecturally unusual: most MoE models replace the FFN with experts, but Gemma 4 keeps both, giving each layer always-on dense capacity plus sparse expert specialization.
E4B Edge · Any-to-Any
The mid-size on-device model. ~8B total parameters with ~4.5B effective, 128K context. Shares the E-series architecture with per-layer input embeddings and KV cache sharing. Supports all four modalities (text, image, video with audio, and standalone audio), using the same ViT vision encoder and USM Conformer audio encoder as E2B. Not analyzed on this page (its config had not been published at the time of writing).
E2B On-device · Any-to-Any
The most architecturally novel variant. ~5.1B total parameters but only ~2.3B effective -- the gap comes from a per-layer input embedding table that maps each token to a unique 256-dim gated residual for every layer. 20 of 35 layers share KV caches from earlier layers, and KV-shared layers compensate with 2x wider MLPs. Includes both a Conformer audio encoder (USM-style, 12 layers) and a ViT vision encoder, making it a true any-to-any model supporting text, image, video, and audio.

What Changed from Gemma 3

Compared to Gemma 3 27B, the architectural delta is substantial. The comparison below focuses on the 31B (the most direct counterpart), but most of these innovations propagate to the other variants as well:

  • Per-type attention geometry -- Gemma 3 used identical head_dim/kv_heads for all layers. Gemma 4 splits: 256/16kv for sliding, 512/4kv for full. This is the single biggest structural change.
  • p-RoPE replaces linear scaling -- Full attention layers now rotate only 25% of dimensions (proportional RoPE) instead of applying 8x linear frequency scaling. This is grounded in the Oxford/DeepMind finding that low-frequency channels carry semantic (not positional) information.
  • K=V weight sharing -- Full attention layers eliminate the V projection entirely, reusing key states as values. Combined with fewer KV heads (4 vs 16), this dramatically cuts per-layer parameters.
  • V-norm (value normalization) -- All variants apply RMSNorm to values (without learned scale), a stabilization technique absent in Gemma 3.
  • Logit soft-capping -- Output logits are bounded via tanh(x/30)*30, preventing extreme values during generation.
  • Vision encoder upgrade -- SigLIP (Gemma 3) is replaced by a ViT with 2D RoPE and learned 2D positional embeddings, yielding 280 tokens per image (vs 256) through 3x3 pooling (vs 4x4).
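
The K=V sharing and KV-head reduction combine into a concrete saving. A rough back-of-envelope sketch (using only the dimensions listed above):

```python
dim = 5376
# Gemma 3 full layer: separate K and V projections, 16 kv heads, head_dim 128
g3_kv = 2 * dim * 16 * 128
# Gemma 4 full layer: K projection only (K=V sharing), 4 kv heads, head_dim 512
g4_kv = 1 * dim * 4 * 512
print(g3_kv / 1e6, g4_kv / 1e6)  # K/V projection params halved per full layer
```

Despite the 4x wider heads, the 4x fewer KV heads plus the eliminated V projection halve the K/V projection parameters of each full layer.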
Model Overview
Gemma 4 31B IT (Dense · Vision)
  Total Params: ~31B · Active Params: 31B · Context: 256K tokens · Hidden Size: 5,376 · Layers: 60 (5:1 sliding/full)
Gemma 4 26B-A4B IT (MoE · Vision)
  Total Params: ~26B · Active Params: ~4B · Context: 256K tokens · Hidden Size: 2,816 · Layers: 30 (5:1 sliding/full)
Gemma 4 E2B IT (Dense · Vision · Audio)
  Total Params: ~5B (~2B effective) · Active Params: ~2B eff · Context: 128K tokens · Hidden Size: 1,536 · Layers: 35 (4:1 sliding/full)
Gemma 3 27B IT (previous generation, for comparison; Dense · Vision)
  Total Params: 27B · Active Params: 27B · Context: 128K tokens · Hidden Size: 5,376 · Layers: 62 (5:1 sliding/full)
Parameter Comparison
Parameter            Gemma 4 31B            Gemma 4 26B-A4B        Gemma 4 E2B            Gemma 3 27B
Total Params         ~31B                   ~26B                   ~5B                    27B
Active Params        31B                    ~4B                    ~2B eff                27B
Context              256K                   256K                   128K                   128K
Hidden Size          5,376                  2,816                  1,536                  5,376
Layers               60                     30                     35                     62
Layer Pattern        5:1 sliding/full       5:1 sliding/full       4:1 sliding/full       5:1 sliding/full
Attention            GQA                    GQA                    GQA                    GQA
Q Heads              32                     16                     8                      32
KV Heads (sliding)   16                     8                      1                      16
KV Heads (full)      4                      2                      1                      16
Head Dim (sliding)   256                    256                    256                    128
Head Dim (full)      512                    512                    512                    128
FFN Type             GeGLU                  MoE + GeGLU            GeGLU                  GeGLU
FFN Hidden           21,504                 2,112 + MoE(704)       6,144                  21,504
MoE Experts          --                     128 (top-8)            --                     --
Vocab                262,144                262,144                262,144                262,208
RoPE (sliding)       theta=10K              theta=10K              theta=10K              theta=10K
RoPE (full)          theta=1M, 25% partial  theta=1M, 25% partial  theta=1M, 25% partial  theta=1M, linear 8x
QK Norm              RMSNorm                RMSNorm                RMSNorm                RMSNorm
V Norm               RMSNorm (no scale)     RMSNorm (no scale)     RMSNorm (no scale)     --
K=V Sharing          yes (full layers)      yes (full layers)      --                     --
KV Shared Layers     --                     --                     20 of 35               --
Per-Layer Input      --                     --                     dim=256                --
Logit Cap            30.0                   30.0                   30.0                   --
Vision Encoder       ViT 27L, d=1152        ViT 27L, d=1152        ViT 16L, d=768         SigLIP 27L, d=1152
Audio Encoder        --                     --                     Conformer 12L          --
Tie Weights          yes                    yes                    yes                    yes
Benchmarks

Source: Hugging Face blog. Instruction-tuned variants. E4B included from blog (architecture page covers E2B, 26B-A4B, 31B in detail).

Benchmark                    31B      26B-A4B   E4B      E2B
Arena Score (LMArena est.)   1452     1441      --       --
MMLU Pro                     85.2%    82.6%     69.4%    60.0%
GPQA Diamond                 84.3%    82.3%     58.6%    43.4%
AIME 2026 (no tools)         89.2%    88.3%     42.5%    37.5%
LiveCodeBench v6             80.0%    77.1%     52.0%    44.0%
Codeforces ELO               2150     1718      940      633
MMMU Pro (vision)            76.9%    73.8%     52.6%    44.2%
MRCR v2 8-needle 128K        66.4%    44.1%     25.4%    19.1%
CoVoST (audio)               --       --        35.54    33.47
FLEURS (audio)               --       --        0.08     0.09

The 26B-A4B MoE achieves over 99% of the 31B's Arena score (1441 vs 1452) with only ~4B active parameters -- a nearly 8x reduction in active compute. The E2B posts competitive audio scores despite being a 2.3B-effective model.

Per-Block Parameter Estimates

Computed from: Q=dim×q_heads×head_dim, K/V=dim×kv_heads×head_dim, O=q_heads×head_dim×dim. GeGLU FFN=3×dim×hidden (gate+up+down). V proj=0 when K=V. MoE: per_expert=3×dim×expert_hidden, router=dim×num_experts.
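
The formulas above can be checked mechanically. A minimal sketch (function names are mine, not from any Gemma codebase) reproducing the 31B per-block numbers:

```python
def geglu_params(dim, hidden):
    # gate + up + down projections
    return 3 * dim * hidden

def attn_params(dim, q_heads, kv_heads, head_dim, k_eq_v=False):
    q = dim * q_heads * head_dim
    k = dim * kv_heads * head_dim
    v = 0 if k_eq_v else dim * kv_heads * head_dim  # V proj eliminated under K=V
    o = q_heads * head_dim * dim
    return q + k + v + o

# Gemma 4 31B: sliding block (head_dim=256, 16 kv heads)
sliding = attn_params(5376, 32, 16, 256) + geglu_params(5376, 21504)
# Full block (head_dim=512, 4 kv heads, K=V sharing)
full = attn_params(5376, 32, 4, 512, k_eq_v=True) + geglu_params(5376, 21504)

total = 50 * sliding + 10 * full + 1.4e9  # plus embedding table
print(f"{sliding/1e6:.1f}M sliding, {full/1e6:.1f}M full, ~{total/1e9:.1f}B total")
```

This reproduces the 478.9M/534.0M block totals and the ~30.7B overall figure in the tables below.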

Gemma 4 31B — 60 layers (50 sliding + 10 full)
Component         Sliding Block   Full Block   Formula
Q proj            44.0M           88.1M        5376 × 32 × dh
K proj            22.0M           11.0M        5376 × kv × dh
V proj            22.0M           0 (K=V)      eliminated
O proj            44.0M           88.1M        q × dh × 5376
Attention total   132.1M          187.2M
GeGLU FFN         346.8M          346.8M       3 × 5376 × 21504
Block total       478.9M          534.0M
50 × 478.9M + 10 × 534.0M + 1.4B embed = ~30.7B total
Gemma 4 26B-A4B — 30 layers (25 sliding + 5 full)
Component              Sliding Block   Full Block   Formula
Q proj                 11.5M           23.1M        2816 × 16 × dh
K proj                 5.8M            2.9M         2816 × kv × dh
V proj                 5.8M            0 (K=V)
O proj                 11.5M           23.1M
Attention total        34.6M           49.0M
Dense GeGLU            17.8M           17.8M        3 × 2816 × 2112
MoE experts (128×)     761.3M          761.3M       128 × 3 × 2816 × 704
MoE active (top-8)     47.6M           47.6M        8 × 5.9M/expert
Router                 0.4M            0.4M         2816 × 128
FFN total capacity     779.5M          779.5M       dense + all experts + router
FFN active/token       65.8M           65.8M        dense + top-8 + router
Block total capacity   814.1M          828.5M
Block active/token     100.4M          114.8M
Total capacity: 25 × 814M + 5 × 829M + 0.7B embed = ~25.2B
Active per token: 25 × 100M + 5 × 115M + 0.7B embed = ~3.8B
Gemma 4 E2B — 35 layers (28 sliding + 7 full)
Component               Sliding Block   Full Block   Formula
Q proj                  3.1M            6.3M         1536 × 8 × dh
K proj                  0.4M            0.8M         1536 × 1 × dh
V proj                  0.4M            0.8M         no K=V on E2B
O proj                  3.1M            6.3M
Attention total         7.1M            14.2M
GeGLU FFN               28.3M           28.3M        3 × 1536 × 6144
GeGLU FFN (2× wide)     56.6M           56.6M        KV-shared layers only
Block total             35.4M           42.5M        standard layers
Block total (2× wide)   63.7M           70.8M        KV-shared layers
15 standard blocks + 20 double-wide blocks + 0.4B embed + 2.3B per-layer embed = ~5.1B total
Effective (excl. per-layer embed table): ~2.3B
Gemma 3 27B — 62 uniform layers (previous generation)
Component         Per Block   Formula
Q proj            22.0M       5376 × 32 × 128
K proj            11.0M       5376 × 16 × 128
V proj            11.0M       5376 × 16 × 128
O proj            22.0M       32 × 128 × 5376
Attention total   66.1M
GeGLU FFN         346.8M      3 × 5376 × 21504
Block total       412.9M
62 × 412.9M + 1.4B embed = ~27.0B total
GPU Memory Requirements

Estimates: weight memory + KV cache (FP16). KV formula: 2 × kv_heads × head_dim × 2 bytes per token per layer, summed across sliding and full layers. E2B benefits from KV sharing (20 of 35 layers reuse cache).
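
The KV formula can be applied directly. A small sketch (helper name is mine) checking the 31B entry at 128K context:

```python
def kv_cache_bytes(layer_groups, ctx_tokens, dtype_bytes=2):
    # 2 (K and V) × kv_heads × head_dim × dtype_bytes per token per layer
    per_token = sum(n_layers * 2 * kv_heads * head_dim * dtype_bytes
                    for n_layers, kv_heads, head_dim in layer_groups)
    return per_token * ctx_tokens

# Gemma 4 31B: 50 sliding layers (16 kv × 256) + 10 full layers (4 kv × 512)
gib = kv_cache_bytes([(50, 16, 256), (10, 4, 512)], 128 * 1024) / 2**30
print(f"{gib:.0f} GB")  # matches the ~110 GB table entry below
```

Note the sliding layers dominate the cache: 50 × 16 × 256 dwarfs 10 × 4 × 512, which is why the full-layer KV reductions matter less for memory than the sliding window itself.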

Weight Memory (all params loaded)
Precision   31B       26B-A4B   E2B       Gemma 3 27B
FP16        62.0 GB   52.0 GB   10.2 GB   54.0 GB
INT8        31.0 GB   26.0 GB   5.1 GB    27.0 GB
INT4        15.5 GB   13.0 GB   2.6 GB    13.5 GB

KV Cache (FP16, batch=1)
Context   31B       26B-A4B   E2B       Gemma 3 27B
4K        3.4 GB    0.9 GB    0.07 GB   1.9 GB
32K       27.5 GB   6.9 GB    0.6 GB    15.5 GB
128K      110 GB    27.5 GB   2.3 GB    62 GB
256K      220 GB    55 GB     --        --

31B: 50 sliding layers (16 kv_heads × d=256) + 10 full layers (4 kv_heads × d=512). 26B-A4B: 25 sliding (8 × 256) + 5 full (2 × 512). E2B: 15 unique KV layers after sharing (1 × 256/512). Gemma 3: 62 uniform layers (16 × 128).

Total VRAM & GPU Recommendation (batch=1)
Scenario         31B                        26B-A4B                      E2B                       Gemma 3 27B
FP16, 4K ctx     65.4 GB (1× H100 80GB)     52.9 GB (1× H100 80GB)       10.3 GB (RTX 4070 12GB)   55.9 GB (1× H100 80GB)
INT4, 4K ctx     18.9 GB (RTX 4090 24GB)    13.9 GB (RTX 4070 Ti 16GB)   2.7 GB (any GPU)          15.4 GB (RTX 4090 24GB)
INT4, 128K ctx   125.5 GB (2× H100 160GB)   40.5 GB (A6000 48GB)         4.9 GB (RTX 4060 8GB)     75.5 GB (1× H100 80GB)

The 26B-A4B MoE loads all 26B parameters into VRAM (all experts must be resident), but its KV cache is 4x smaller than the 31B due to fewer KV heads. At INT4 + 128K context, it fits in a single A6000 -- impossible for the 31B. E2B with KV sharing runs full 128K context on a laptop GPU.

Key Architectural Innovations in Gemma 4
Deep Dive: Why partial_rotary_factor=0.25?
Based on: "Round and Round We Go! What makes Rotary Positional Encodings useful?" Barbero, Vitvitskyi, Perivolaropoulos, Pascanu, Veličković (Oxford & Google DeepMind, 2024)
The Problem
Standard RoPE rotates all head dimensions. High-frequency components create positional heads (tracking token positions), while low-frequency components carry semantic information (content meaning). For long context, these slow-rotating low-frequency channels can break -- they complete full rotations across the sequence, losing the semantic signal they encode.
The Insight
Not all RoPE dimensions are equal. The paper proves that high frequencies build positional heads robustly, while low frequencies are "most invariant" to position -- they encode semantic attention patterns. For long context, increasing theta alone (e.g. 10K → 1M) isn't enough because the lowest frequencies still eventually break.
p-RoPE Solution
Proportional RoPE (p-RoPE) simply sets the (1-p) lowest frequency dimensions to zero -- making them position-independent (NoPE). With p=0.25, only the top 25% of frequencies are rotated (positional), while 75% become pure semantic channels that never break regardless of context length.
Gemma 4's Implementation
Gemma 4 applies p-RoPE on full attention layers only:
Sliding layers: standard RoPE, theta=10,000, all dims rotated
Full layers: p-RoPE, theta=1,000,000, partial=0.25
This gives sliding layers fine-grained local positioning, while full layers get robust long-range semantic attention with only 25% positional capacity -- exactly the p-RoPE prescription.
[Figure: RoPE frequency spectrum and the p-RoPE fix.
A. RoPE frequency spectrum (per head): each head_dim has d/2 frequency pairs. Low frequencies (dims near 0) rotate slowly and carry semantic, position-invariant content; medium frequencies mix semantic and positional signal and are vulnerable at long context; high frequencies (dims near d/2) rotate fast and build positional heads.
B. The problem: at short context (4K tokens) a low-frequency wave completes only a partial rotation, so the signal stays intact; at long context (256K tokens) it completes full rotations and the signal is destroyed.
C. The p-RoPE fix: standard RoPE (Gemma 3) rotates all dimensions (100%); p-RoPE with p=0.25 (Gemma 4) zeroes 75% of them (cos=1, sin=0, i.e. NoPE), keeping 25% rotated as positional heads. The position-free dimensions provide semantic attention that never breaks, at any context length (4K to 256K+).
Key insight: p-RoPE gives global attention layers robust semantic capacity at any context length, while sliding layers keep full positional precision for local patterns (standard RoPE, theta=10K).]
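
A minimal sketch of how partial rotation could be realized, assuming the standard Hugging Face convention where inv_freq[0] is the fastest-rotating frequency pair (the function name is mine, not from the Gemma 4 code):

```python
import torch

def p_rope_inv_freq(head_dim, theta, partial_rotary_factor):
    # Standard RoPE inverse frequencies: inv_freq[0] rotates fastest,
    # later entries rotate ever more slowly (assumed HF ordering).
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # p-RoPE: zero the (1 - p) lowest frequencies -> cos=1, sin=0 (NoPE)
    n_rotated = int(partial_rotary_factor * inv_freq.numel())
    inv_freq[n_rotated:] = 0.0
    return inv_freq

freqs = p_rope_inv_freq(512, 1_000_000.0, 0.25)
print((freqs > 0).sum().item(), "of", freqs.numel(), "frequency pairs rotated")
```

With head_dim=512 and p=0.25, only 64 of 256 frequency pairs (128 of 512 dimensions) remain rotated; the rest become position-independent semantic channels.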
RoPE Strategy Evolution: Gemma 3 → Gemma 4
Aspect                Gemma 3 (27B)               Gemma 4 (31B)              Impact
Sliding RoPE          theta=10K, full rotation    theta=10K, full rotation   Same -- local attention unchanged
Full RoPE theta       theta=1M                    theta=1M                   Same base frequency
Full RoPE scaling     linear 8x                   partial=0.25 (p-RoPE)      p-RoPE replaces linear scaling
Rotated dims (full)   128 of 128 (100%)           128 of 512 (25%)           75% of dims are position-free
Semantic capacity     all dims position-coupled   384 dims pure semantic     Robust long-context semantics
Max context           128K                        256K                       2x context with p-RoPE
Validation Perplexity (from paper)

The paper validates p-RoPE on Gemma 7B-scale models. Lower is better.

Encoding                 Wiki     PlanV2   Properties
NoPE                     4.8594   6.6429   Semantic only, no position
RoPE (theta=10K)         4.4627   6.4429   Standard
RoPE (theta=500K)        4.4485   6.4593   High theta
0.25-RoPE                4.4592   6.4683   = Gemma 4's setting
0.75-RoPE (inverted)     4.4537   6.4562   More rotation
0.25-RoPE (full model)   4.5302   6.5111   p-RoPE on all layers
0.75-RoPE (full model)   4.4414   6.4422   Best overall perplexity

Gemma 4 uses 0.25-RoPE on full attention layers only (not all layers), combining the best of both: standard RoPE for local sliding attention + p-RoPE for global full attention.

Code-Verified Architecture Details

Verified against transformers/models/gemma4/modeling_gemma4.py. Code snippets are exact quotes.

1. Attention Scaling = 1.0 (not 1/√d)
Gemma 4 does NOT use the standard 1/√head_dim scaling. Instead, scaling is fixed to 1.0 and the QK norms (with learned scale) absorb the normalization function.
self.scaling = 1.0
2. K=V Sharing: Keys and Values Diverge Through Norms
When attention_k_eq_v=True, the V projection is eliminated (v_proj=None). The key tensor is cloned as the value before normalization. Then K gets k_norm (with learned scale) + RoPE, while V gets v_norm (without scale, no RoPE). So K and V start identical but diverge through different normalizations.
# v_proj is None when attention_k_eq_v=True
value_states = self.v_proj(hidden_states).view(hidden_shape) \
    if self.v_proj is not None else key_states

# K path: scaled norm + RoPE
key_states = self.k_norm(key_states)          # with_scale=True
key_states = apply_rotary_pos_emb(key_states, cos, sin)

# V path: unscaled norm, NO RoPE
value_states = self.v_norm(value_states)      # with_scale=False
[Diagram: the W_k (K proj) output is cloned into two paths. K path: k_norm (with_scale=True), then RoPE. V path: v_norm (with_scale=False), no RoPE. Same source, different paths.]
3. RMSNorm: with_scale vs without
Gemma 4 introduces a with_scale flag. Q/K norms have learnable scale (multiplicative weight), V norm does not. The weight is initialized to ones (standard RMSNorm), unlike Gemma 2/3's (1 + weight) parameterization.
class Gemma4RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6, with_scale=True):
        super().__init__()
        self.eps = eps
        self.with_scale = with_scale
        if self.with_scale:
            self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, hidden_states):
        normed = self._norm(hidden_states.float())
        if self.with_scale:
            normed = normed * self.weight.float()
        return normed.type_as(hidden_states)
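
A self-contained stand-in (the quoted class omits the `_norm` helper, so standard RMS normalization is assumed here) showing that the unscaled variant used for V-norm has no learnable weight:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    # Illustration only: mirrors the quoted Gemma4RMSNorm, with an assumed
    # standard _norm implementation.
    def __init__(self, dim, eps=1e-6, with_scale=True):
        super().__init__()
        self.eps = eps
        self.with_scale = with_scale
        if with_scale:
            self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        normed = self._norm(x.float())
        if self.with_scale:
            normed = normed * self.weight.float()
        return normed.type_as(x)

v_norm = RMSNorm(8, with_scale=False)   # V-norm flavor: no learnable weight
out = v_norm(torch.randn(2, 8))
print(out.pow(2).mean(-1))              # ≈ 1.0 per row: unit RMS, nothing learned
```

The `torch.ones` initialization means a freshly initialized scaled norm behaves identically to the unscaled one; they only diverge during training.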
4. Embedding Scaling (learnable buffer)
Token embeddings are multiplied by √hidden_size. This is a registered buffer (not a gradient-tracked parameter), initialized from the config. The per-layer embedding uses √per_layer_dim instead.
class Gemma4TextScaledWordEmbedding(nn.Embedding):
    def __init__(self, num_embeddings, embedding_dim,
                 padding_idx, embed_scale=1.0):
        super().__init__(num_embeddings, embedding_dim, padding_idx)
        self.register_buffer(
            "embed_scale", torch.tensor(embed_scale))

    def forward(self, input_ids):
        return super().forward(input_ids) * self.embed_scale

# Main embedding: scale = sqrt(5376) ≈ 73.3
# Per-layer embedding: scale = sqrt(256) = 16.0
5. Per-Layer Input: Third Residual Block
The per-layer input is NOT injected during attention or FFN. It's a third residual block applied after both. A gating linear projects hidden states to the per-layer dim, applies activation, multiplies element-wise with the per-layer embedding slice, projects back, norms, then adds as residual.
# After attention + FFN residuals:
if self.hidden_size_per_layer_input:
    residual = hidden_states
    hidden_states = self.per_layer_input_gate(hidden_states)  # d → 256
    hidden_states = self.act_fn(hidden_states)
    hidden_states = hidden_states * per_layer_input  # gated element-wise
    hidden_states = self.per_layer_projection(hidden_states)  # 256 → d
    hidden_states = self.post_per_layer_input_norm(hidden_states)
    hidden_states = residual + hidden_states
[Diagram: hidden_states (after attn+FFN) -> per_layer_input_gate (d -> 256) -> act_fn -> element-wise multiply with this layer's 256-dim per-layer embedding slice -> per_layer_projection (256 -> d) -> RMSNorm -> added to the residual.]
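
As a back-of-envelope check, the per-layer embedding table alone accounts for E2B's total-vs-effective gap (using the vocab, layer count, and per-layer dim stated above):

```python
vocab, n_layers, per_layer_dim = 262_144, 35, 256
table = vocab * n_layers * per_layer_dim
print(f"~{table / 1e9:.2f}B parameters in the per-layer embedding table")
```

That is ~2.35B parameters, matching the ~2.3B gap between E2B's ~5.1B total and ~2.3B effective counts: the table is looked up per token, not multiplied through, so it contributes storage but almost no compute.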
6. MoE: Parallel Dense + Routed with 3 Post-Norms
The dense MLP and MoE branch run in parallel from the same pre-MLP residual. Each output gets its own post-norm, then they're summed and the sum goes through a third post-norm before the residual add. MoE layers have 7 RMSNorm modules (vs 4 for standard layers).
# Dense path (standard MLP)
hidden_states = self.pre_feedforward_layernorm(residual)
hidden_states = self.mlp(hidden_states)

if self.enable_moe_block:
    hidden_states_1 = self.post_feedforward_layernorm_1(
        hidden_states)

    # MoE path (parallel, from pre-MLP residual)
    hidden_states_flat = residual.reshape(-1, residual.shape[-1])
    _, top_k_weights, top_k_index = self.router(
        hidden_states_flat)
    hidden_states_2 = self.pre_feedforward_layernorm_2(
        hidden_states_flat)
    hidden_states_2 = self.experts(
        hidden_states_2, top_k_index, top_k_weights)
    hidden_states_2 = self.post_feedforward_layernorm_2(
        hidden_states_2)

    # Sum both paths, then final post-norm
    hidden_states = hidden_states_1 + hidden_states_2
    hidden_states = self.post_feedforward_layernorm(
        hidden_states)

hidden_states = residual + hidden_states
[Diagram: from the pre-MLP residual, the DENSE path (pre_feedforward_layernorm -> GeGLU MLP -> post_feedforward_layernorm_1) and the MOE path (router -> pre_feedforward_layernorm_2 -> 128 experts -> post_feedforward_layernorm_2) run in parallel; their sum passes through post_feedforward_layernorm (the third, MoE-only post-norm) before the residual add.]
7. Router: Norm → Scale → Softmax → TopK → Per-Expert Scale
The router applies RMSNorm (without scale), multiplies by a learned per-dim scale × 1/√hidden_size, projects to num_experts, softmax, selects top-K, normalizes weights, then applies per-expert learned scales.
def forward(self, hidden_states):
    hidden_states = self.norm(hidden_states)        # RMSNorm, no scale
    hidden_states = hidden_states * self.scale * self.scalar_root_size

    expert_scores = self.proj(hidden_states)        # Linear → num_experts
    router_probs = softmax(expert_scores, dim=-1)

    top_k_weights, top_k_index = torch.topk(
        router_probs, k=self.config.top_k_experts)

    top_k_weights /= top_k_weights.sum(dim=-1, keepdim=True)
    top_k_weights = top_k_weights * self.per_expert_scale[top_k_index]

    return router_probs, top_k_weights, top_k_index
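
The top-k selection and renormalization step can be seen in isolation (a sketch that omits the norm/scale preamble and the per-expert scales):

```python
import torch

# 4 tokens routed over 128 experts, top-8 selection as in 26B-A4B
router_probs = torch.softmax(torch.randn(4, 128), dim=-1)     # (tokens, experts)
top_k_weights, top_k_index = torch.topk(router_probs, k=8)
# renormalize so the selected experts' weights sum to 1 per token
top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)
print(top_k_weights.sum(dim=-1))   # each row sums to 1
```

The per-expert scales applied afterwards break this sum-to-one property intentionally, letting the model learn a global gain per expert.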
8. Layer Scalar (buffer, not parameter)
Each decoder layer multiplies its output by a per-layer scalar. This is a register_buffer (loaded from checkpoint, not trained via gradient), initialized to 1.0. Applied after all residual blocks (attention + FFN + optional per-layer input).
hidden_states *= self.layer_scalar  # buffer, not nn.Parameter
9. Logit Soft-Capping
Applied after the LM head projection. The config default is None (disabled), but published model checkpoints set it to 30.0. Bounds logits to [-30, 30] smoothly.
if self.config.final_logit_softcapping is not None:
    logits = logits / self.config.final_logit_softcapping
    logits = torch.tanh(logits)
    logits = logits * self.config.final_logit_softcapping
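
Numerically, the cap is a near-identity for typical logits and saturates smoothly for extreme ones (a standalone sketch of the same transform):

```python
import math

def soft_cap(x, cap=30.0):
    # tanh(x / cap) * cap: smooth, bounded to (-cap, cap)
    return math.tanh(x / cap) * cap

for x in (1.0, 30.0, 300.0):
    print(x, round(soft_cap(x), 3))
```

A logit of 1.0 passes through almost unchanged, 30.0 compresses to ~22.8, and 300.0 saturates just below 30.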
10. Proportional RoPE Config
Full attention layers use rope_type: "proportional" which sets the lowest 75% of frequency dimensions to zero (cos=1, sin=0), leaving those dimensions position-independent. Sliding layers use standard default RoPE. This is configured per-layer-type in the config, not in the modeling code.
# From Gemma4TextConfig:
rope_parameters = {
    "sliding_attention": {
        "rope_type": "default",
        "rope_theta": 10_000.0
    },
    "full_attention": {
        "rope_type": "proportional",
        "partial_rotary_factor": 0.25,
        "rope_theta": 1_000_000.0
    }
}

# Last layer forced to full_attention:
if self.layer_types[-1] != "full_attention":
    self.layer_types[-1] = "full_attention"
[Diagram: sliding layers (head_dim=256, theta=10K) rotate all 256 dims; full layers (head_dim=512, theta=1M, partial=0.25) rotate 128 of 512 dims and zero the remaining 384 (cos=1, sin=0, NoPE). Gemma 3 full layers: 128/128 rotated (100%). Gemma 4 full layers: 128/512 rotated (25%); 75% of dims are pure semantic channels that never break at long context.]
11. KV Cache Sharing (E2B/E4B)
Shared layers skip K/V projection entirely and load cached KV from the last non-shared layer of the same attention type (sliding or full). The query is still computed fresh. The stored KV includes the full sequence length for reuse.
# In shared layer forward:
if self.is_kv_shared_layer and past_key_values is not None:
    key_states, value_states = \
        past_key_values.shared_layers[self.kv_shared_layer_index]

# In non-shared layer: store KV for later reuse
if self.store_full_length_kv:
    past_key_values.shared_layers[self.layer_idx] = \
        key_states, value_states
Architecture Diagrams

[Interactive layer-stack diagrams for Gemma 4 31B IT, Gemma 4 26B-A4B IT, Gemma 4 E2B IT, and the previous-generation Gemma 3 27B IT are omitted from this text version.]