Nous Research Proposes Lighthouse Attention: A Training-Only, Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context
Training large language models on long sequences has a well-known problem: attention is expensive. The scaled dot-product attention (SDPA) at the core of every transformer scales quadratically, Θ(N²), in both compute and memory with sequence length N. FlashAttention addressed this through IO-aware tiling that avoids materializing the full N×N attention matrix in high-bandwidth memory, reducing the memory footprint substantially, but the underlying Θ(N²) compute scaling remains. Researchers at Nous Research have introduced a new method called Lighthouse Attention that addresses this bottleneck specifically at pretraining time, achieving a 1.40× to 1.69× end-to-end wall-clock speedup against a cuDNN-backed SDPA baseline, with matching or lower final training loss.
The core problem with existing sparse attention methods
To understand why Lighthouse works the way it does, it helps to know what existing sparse attention methods do. Most prior work (NSA, HISA, DSA, MoBA) makes the same two design choices. First, they pool only the key and value side while leaving queries at full resolution (asymmetric compression). Second, their selection logic lives inside a custom attention kernel, which means teams can't reuse the optimized dense-attention kernels that modern GPU tensor cores are built around.
There is also a concern specific to training that inference-only sparse methods don't face. An inference-time sparse method is evaluated only against its dense backbone and is at best about as good as that backbone. A training-time sparse method faces a harder test: once training is done, will the resulting weights still produce a competent dense-attention model at inference? Lighthouse treats that question as its central correctness criterion.
Lighthouse takes a different approach on both design choices. It pools queries, keys, and values symmetrically across a multi-level pyramid, and it places selection entirely outside the attention kernel. After selection, the system gathers the selected entries into a contiguous, dense sub-sequence and runs stock FlashAttention on it, the same kernel used by the dense baseline.
https://arxiv.org/pdf/2605.06554
How the four-stage pipeline works
A Lighthouse attention layer wraps around, but does not modify, scaled dot-product attention. The pipeline has four stages.
In the first stage, average pooling constructs an L-level pyramid from Q, K, and V. With pooling factor p, level ℓ of the pyramid has N/p^ℓ tokens, each summarizing p^ℓ base positions. Crucially, the same pooling applies to all three projections, producing coherent (Q^(ℓ), K^(ℓ), V^(ℓ)) triples at every level. Total pyramid construction costs Θ(N) time and memory.
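In code, the symmetric pyramid is a few lines of pooling. Below is a minimal PyTorch sketch, assuming a (batch, heads, N, d) tensor layout; the function names are illustrative, not from the paper's implementation:

```python
import torch
import torch.nn.functional as F

def avg_pool_tokens(t, p):
    # t: (B, H, N, d) -> (B, H, N//p, d), averaging p consecutive tokens
    b, h, n, d = t.shape
    pooled = F.avg_pool1d(t.reshape(b * h, n, d).transpose(1, 2), kernel_size=p)
    return pooled.transpose(1, 2).reshape(b, h, n // p, d)

def build_pyramid(q, k, v, levels=3, p=4):
    """L-level symmetric pyramid: the SAME average pooling is applied to
    Q, K, and V, so level l holds coherent (Q, K, V) triples of N/p**l
    entries, each summarizing p**l base positions. Total cost Theta(N)."""
    pyramid = [(q, k, v)]
    for _ in range(1, levels):
        q, k, v = (avg_pool_tokens(t, p) for t in (q, k, v))
        pyramid.append((q, k, v))
    return pyramid
```

The point of the symmetry is that each level is a self-contained attention problem, not just a compressed KV memory.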
In the second stage, a parameter-free scorer assigns each pyramid entry two scalar scores using per-head ℓ₂ norms: one as a query score (∥Q^(ℓ)_i∥₂) and one as a key score (∥K^(ℓ)_i∥₂). Coarser levels inherit scores from finer ones via max-pooling, so a coarse span picks up the importance of its strongest token. A fused chunked-bitonic top-k kernel then selects k entries jointly across all pyramid levels. One design detail worth noting: the coarsest pyramid level is always retained in full; it is cheap and guarantees at least one contributor at every base position, and the remaining selection budget is spent on finer levels. In addition, the chunked-bitonic design produces a stratified top-k rather than a strict global top-k: the score stream is partitioned into fixed-size chunks, each maintaining an in-register top-m buffer, so if the k globally highest-scoring entries clustered in a single chunk, some would be replaced by lower-scoring entries from other chunks. The result is more balanced attention coverage across the sequence, avoiding selection collapse onto a narrow span.
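The parameter-free scoring with max-pooling inheritance can be sketched the same way. This is an illustrative reconstruction, not the paper's fused kernel; it assumes the pyramid layout from the sketch above:

```python
import torch

def norm_scores(pyramid, p=4):
    """Parameter-free scorer sketch: per-head l2 norms give each base-level
    entry a query score and a key score; coarser levels inherit via
    max-pooling so a coarse span carries its strongest token's importance."""
    q_scores = [pyramid[0][0].norm(dim=-1)]   # (B, H, N) at the base level
    k_scores = [pyramid[0][1].norm(dim=-1)]
    for _ in pyramid[1:]:
        # group p consecutive scores and keep the maximum of each group
        q_scores.append(q_scores[-1].unflatten(-1, (-1, p)).amax(-1))
        k_scores.append(k_scores[-1].unflatten(-1, (-1, p)).amax(-1))
    return q_scores, k_scores
```

Because the scorer has no parameters, there is nothing to train and nothing that can collapse.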
The top-k step is discrete and non-differentiable: no straight-through estimator, no Gumbel softmax. Selection indices carry no gradient. Gradients flow only through the gathered Q, K, V entries into W_Q, W_K, W_V, so the projections learn to produce values that are useful when selected rather than scores that are good at selecting.
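This gradient routing is easy to verify in plain PyTorch; a minimal self-contained check with illustrative stand-in tensors:

```python
import torch

# Top-k indices are discrete and carry no gradient, but gathering with
# them still routes gradients into the selected values themselves.
scores = torch.arange(8.0)                   # stand-in for l2-norm scores
v = torch.randn(8, 4, requires_grad=True)    # stand-in for pooled V entries
idx = scores.topk(3).indices                 # non-differentiable selection
out = v[idx].sum()                           # downstream use of gathered V
out.backward()
# Selected rows receive gradient 1 per element; unselected rows get none.
assert torch.equal(v.grad[5:], torch.ones(3, 4))
assert torch.equal(v.grad[:5], torch.zeros(5, 4))
```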
In the third stage, the selected entries are gathered into a contiguous sub-sequence of length S = N/p^(L−1) + (L−1)·p·k and passed to standard FlashAttention. At N = 1,000,000 with L = 4, p = 4, k = 4,096, S ≈ 65,000, far smaller than N. A critical property of the gather step is that it guarantees no "holes" or empty regions in the assembled sub-sequence. This matters precisely because Lighthouse also compresses queries: a gap in the sequence would mean those missing tokens have no gradient path in the backward pass and could cause training instabilities. Asymmetric methods that leave queries at full resolution don't face this problem, but Lighthouse's symmetric design requires that the gathered sub-sequence stays fully dense.
In the fourth stage, each output entry is scattered back to the p^ℓ base positions it represents via a deterministic integer-atomic scatter kernel, with a shift of p^ℓ − 1 to preserve causality. The per-position fan-in is bounded by L regardless of k.
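One plausible reading of the scatter-back stage, sketched on CPU for clarity. The real implementation is a deterministic integer-atomic GPU kernel; the averaging of overlapping contributions here is an assumption, as is the exact placement of the causal shift:

```python
import torch

def scatter_back(outputs, levels, starts, N, p):
    """Sketch: write each pyramid-level output back over the p**l base
    positions it summarizes, shifted forward by p**l - 1 so nothing lands
    before the last token the entry covers (causality). Fan-in per base
    position is bounded by the number of pyramid levels."""
    d = outputs.shape[-1]
    out = torch.zeros(N, d)
    fan_in = torch.zeros(N, dtype=torch.long)
    for o, lvl, s in zip(outputs, levels.tolist(), starts.tolist()):
        span = p ** lvl
        lo = s + span - 1                  # causal shift of p**l - 1
        hi = min(lo + span, N)
        out[lo:hi] += o
        fan_in[lo:hi] += 1
    return out / fan_in.clamp(min=1).unsqueeze(-1).float()
```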
Why symmetric pooling changes the compute
Pooling queries alongside keys and values changes the computational character of the attention call from O(N·S·d) to O(S²·d) at training time. Because S ≪ N at long contexts, this is what produces the latency advantage. Benchmarked on a single NVIDIA B200 at 512K context (bfloat16, B=1, H=8, head dimension 128, L=3, p=4, sparsity ≈ 1:64), Lighthouse is 21× faster on the forward pass and 17.3× faster on the combined forward+backward pass relative to cuDNN-backed SDPA.
From an asymptotic standpoint, setting L = log_p(N/k) gives a gathered sub-sequence size of S = Θ(k log N), which makes the dense FlashAttention call cost Θ(k² log² N · d), polylogarithmic in N at fixed k. Combined with the linear-cost stages (pyramid construction, scoring, scatter-back), total per-layer compute is Θ(N·d) at bounded k, the same asymptotic class as linear attention and SSMs, while preserving softmax attention's recall properties on the selected sub-sequence.
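The arithmetic behind these asymptotics can be checked directly; a small sketch plugging L = log_p(N/k) (rounded to an integer level count) into the paper's S formula:

```python
import math

def gathered_length(N, p, k):
    """Sub-sequence length when L is set near log_p(N/k):
    S = N / p**(L-1) + (L-1) * p * k  (formula from the paper)."""
    L = round(math.log(N / k, p))
    return N // p ** (L - 1) + (L - 1) * p * k

# With N=1M, p=4, k=4096 this gives L=4 and
# S = 15,625 + 49,152 = 64,777, i.e. ~65K vs 1M base positions.
# Quadrupling N adds one pyramid level, so S grows by roughly p*k,
# not by 4x: the growth is ~k*log N.
```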
Inference is a different constraint. Autoregressive decoding presents one query at a time, which violates the assumption that all queries co-occur in a single forward pass. Lighthouse is a training-only method, and the symmetric pooling design cannot be used directly at inference.
The two-stage training recipe and recoverability
The experimental setup used a 530M-parameter Llama-3-style decoder (d_model=1024, 30 layers, 8 heads, head dimension 128, FFN width 1536, byte-level tokenizer), trained on C4 at 98,304-token context with AdamW at learning rate 2×10⁻³, β₁=0.9, β₂=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, and FSDP. One implementation detail that matters for practitioners: of the 30 layers, layers {0, 1, 28, 29} retain dense SDPA throughout; only the other 26 layers use Lighthouse. The inner attention call inside those 26 Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.
The training approach is two-stage. Stage 1 trains with Lighthouse selection enabled for the majority of the step budget. Stage 2 resumes the Stage 1 checkpoint under dense SDPA (same optimizer state, same dataloader) for a short tail. If Stage 1 had hollowed out the model's dense-attention capability, Stage 2 recovery would fail.
It doesn't fail. Tested at a total budget of 16,000 steps (~50.3B tokens), three split points (10k+6k, 11k+5k, 12k+4k) were evaluated against a dense-from-scratch SDPA baseline. At each resume point the training loss spikes transiently by 1.12–1.57 nats as the model is first run through attention it was not trained against, then recovers within roughly 1,000–1,500 SDPA steps and crosses below the dense baseline. By step 16,000, all three resumed Lighthouse runs reach final losses of 0.6980–0.7102, against the dense baseline's 0.7237, while spending 22.5h to 27.0h of wall-clock compared to 37.9h for dense-SDPA-from-scratch at the same token budget.
Ablations and throughput
The full ablation grid covers scorer type, pooling factor p, number of pyramid levels L, and top-k budget k. Key findings: the projection-norm scorer is within ~0.01 of the dilated softmax-attention scorer in either direction (no uniform winner) but is roughly 9% cheaper in B200-hours, because it skips the attention pass over the pyramid entirely. Shallower pyramids (L=3) consistently outperform deeper ones (L=4, L=5) at matched budgets. Smaller k values produce lower post-resume loss across the tested range; the lowest-loss configuration in the grid is L=3, p=2, k=1536 with the dilated scorer, reaching a final loss of 0.6825. The researchers attribute this counter-intuitive result to hierarchical selection acting as a regularizer at this token-budget scale.
Stage-1 throughput across the ablation grid ranges from 84,000 to 126,000 tokens/s/GPU against roughly 46,000 for dense SDPA. The projection-norm scorer at L=3, p=4, k=1536 tops the range at 126,000 tokens/s/GPU by skipping the dilated-attention pass entirely.
Long-context retrieval
To complement the loss-based recoverability results, the research team ran a simplified Needle-in-a-Haystack (NIAH) evaluation: a single passkey digit hidden in random alphanumeric filler at depths of 0–100% across context lengths of 4K to 96K tokens, with retrieval scored as a one-token argmax over the ten digit tokens (random chance: 10%). Four Lighthouse configurations (varying k ∈ {1536, 2048} and scorer ∈ {dilated, norm} at L=3, p=4) were tested against the dense-SDPA-from-scratch baseline. Three of the four Lighthouse runs match or beat the dense baseline's mean retrieval rate of 0.72: k=2048 dilated reaches 0.76, k=1536 dilated reaches 0.73, and k=2048 norm matches the baseline at 0.72. Only k=1536 norm dips, to 0.65. A pattern emerges across the grid: larger k is the dominant axis for retrieval performance, and the norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication is that the optimal configuration depends on whether the downstream task is loss-driven or retrieval-driven.
Context parallelism scaling
For sequences beyond ~100K tokens, Lighthouse runs under context parallelism (CP). Pyramid pooling, scoring, and top-k run shard-locally on each rank with no inter-rank communication, since the coarsest pool window (e.g., 64 tokens) is orders of magnitude smaller than the shard size. The gathered sub-sequence is dense, so it participates in standard ring attention without sparse-aware collectives, something sparse-index-based methods cannot do without engineering specific to the sparse structure. Context parallelism introduces roughly 10% per-rank throughput overhead from ring rotation, but the Lighthouse vs. SDPA speedup ratio is preserved. The method scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) with no changes to the inner attention kernel.
Marktechpost's Visual Explainer
01 / The Problem
Why Long-Context Training Is Expensive
Every transformer uses scaled dot-product attention (SDPA), which computes a score between every token and every other token in the sequence. As sequence length N grows, this cost scales as Θ(N²) in both compute and memory: the cost doubles for every ~1.4× increase in context.
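The "~1.4× context doubles the cost" rule of thumb is just (√2)² = 2; a quick check with an illustrative FLOP-count function:

```python
def sdpa_flops(N, d=128):
    # Theta(N^2) attention cost: N*N score entries, each costing ~2*d FLOPs
    return 2 * N * N * d

base = sdpa_flops(4096)
longer = sdpa_flops(int(4096 * 2 ** 0.5))   # ~1.414x longer context
assert 1.99 < longer / base < 2.01          # ~2x the compute
```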
FlashAttention reduced this by using IO-aware tiling that avoids ever materializing the full N×N attention matrix in high-bandwidth memory, cutting the memory footprint substantially. But the underlying Θ(N²) compute scaling is unchanged; the wall is still there.
Θ(N²) SDPA compute & memory scaling
1M-token context targeted by frontier models
32 B200 GPUs needed for 1M-token training
The consequence: teams either train at shorter contexts than they want, or spend enormous compute budgets on attention alone. Lighthouse Attention is a method that wraps around standard SDPA during pretraining to reduce this cost, then gets removed so the final model is a normal dense-attention model at inference.
02 / Prior Work
What Existing Sparse Attention Gets Wrong
Several methods already try to reduce the attention cost by attending to only a subset of tokens. But most share two design choices that create problems for pretraining.
⚠ Problem 1: Asymmetry
Methods like NSA, HISA, and InfLLM-v2 pool only keys and values but leave queries at full resolution. The hierarchy becomes a compressed memory rather than a true multi-scale representation. It also means the dense attention call stays O(N·S·d) instead of shrinking further.
⚠ Problem 2: Kernel Entanglement
Methods like NSA, DSA, HISA, and MoBA embed selection logic inside a custom attention kernel. This means they cannot reuse the optimized FlashAttention kernels that GPU tensor cores are built around. Every sparse method ships its own forward and backward kernels.
The hardest problem: An inference-only sparse method is automatically about as good as its dense backbone. A training-time sparse method must answer a harder question: once training is done, will the resulting weights still work as a competent dense-attention model at inference? Most methods don't test this.
Lighthouse Attention treats this recoverability question as its central correctness criterion.
03 / The Method
Lighthouse Attention: Core Idea
Lighthouse is a selection-based hierarchical attention that wraps around, but does not modify, the attention kernel. It adds a pre-processing step that selects a small subset of tokens, runs stock FlashAttention on just that subset, and scatters the output back. At the end of training, you disable Lighthouse and keep the dense model.
Two key design differences from prior work:
✓ Queries, keys, and values are all pooled symmetrically (not just keys/values)
✓ Selection sits outside the attention kernel: FlashAttention runs on a normal dense sub-sequence
21× faster forward pass vs SDPA at 512K context
17.3× faster forward+backward at 512K context
1.69× end-to-end pretraining wall-clock speedup
The method introduces no new learnable parameters and no auxiliary losses. The scoring function is parameter-free, and the top-k selection step is deliberately non-differentiable: no straight-through estimator or Gumbel softmax.
04 / Architecture
The Four-Stage Pipeline
A Lighthouse attention layer replaces the standard SDPA call with four stages. Stages 1 and 4 are custom kernels; stages 2 and 3 are standard PyTorch operations fused by torch.compile.
1
Pyramid Pool
Average-pool Q, K, and V symmetrically into an L-level pyramid with pooling factor p. Level ℓ has N/p^ℓ tokens, each summarizing p^ℓ base positions. Total cost: Θ(N). Crucially, the coarsest level is always retained in full to guarantee at least one contributor per base position.
2
Score + Top-K Selection
Each pyramid entry gets two scalar scores from its per-head ℓ₂ norm: one as a query score, one as a key score. A fused chunked-bitonic top-k kernel selects k entries jointly across all pyramid levels. This step is non-differentiable; indices carry no gradient.
3
Dense Gather + FlashAttention
Selected (Q, K, V) triples are gathered into a contiguous sub-sequence of length S = N/p^(L−1) + (L−1)·p·k, then passed to stock FlashAttention. No custom sparse kernel. The gathered sequence has no holes, which is critical because queries are also compressed.
4
Scatter-Back
Each output entry is scattered back to the p^ℓ base positions it represents via an integer-atomic scatter kernel. The output is fully dense. Per-position fan-in is bounded by L regardless of k.
05 / Key Design Choice
Why Symmetric Q/K/V Pooling Matters
Most prior hierarchical methods pool only K and V while leaving Q at full resolution. Lighthouse pools all three. This is not cosmetic: it changes the math of the attention call.
| Method | Query side | Attention cost |
|---|---|---|
| NSA, HISA, InfLLM-v2 | Full resolution (N) | O(N·S·d) |
| Lighthouse | Pooled (S) | O(S²·d) |
Because S ≪ N at long contexts, O(S²·d) is dramatically cheaper than O(N·S·d). At N = 1,000,000 with L=4, p=4, k=4096, S ≈ 65,000.
The no-holes guarantee: Compressing queries means every query position must have a gradient path. Lighthouse guarantees no gaps in the gathered sub-sequence, which prevents the training instabilities that would arise from tokens with missing gradients. Asymmetric methods that leave Q at full resolution don't face this problem.
At bounded k, setting L = log_p(N/k) gives total per-layer compute of Θ(N·d), the same asymptotic class as linear attention and SSMs, but with softmax attention's recall properties on the selected sub-sequence.
06 / Gradient Flow
The top-k step is discrete. Lighthouse deliberately does not approximate it with a straight-through estimator or Gumbel softmax. This is a conscious design choice.
What does NOT get gradients
The selection indices and the scoring function. The ℓ₂-norm scorer is never trained; it has no parameters and receives no gradient signal.
What DOES get gradients
Gradients flow through scatter-back → FlashAttention → gather into the gathered Q̃, K̃, Ṽ and on into W_Q, W_K, W_V.
The result: the projection matrices learn to supply values that are useful when selected, not scores that are good at selecting. This avoids the optimization problems (scorer collapse, scorer/attention misalignment, auxiliary-loss tuning) that learnable selectors in NSA and DSA are prone to.
Complexity comparison across attention families (per-layer compute at bounded k):
07 / Two-Stage Training
The central claim of Lighthouse is that sparse training does not break the model's ability to use dense attention at inference. The two-stage recipe is how that claim is validated.
1
Stage 1: Lighthouse pretraining
Train for the majority of the step budget with Lighthouse selection active. This is the fast stage: ~2× higher throughput than dense SDPA.
2
Stage 2: Dense SDPA resumption
Resume the Stage 1 checkpoint under standard dense SDPA with the same optimizer state and dataloader. The loss spikes transiently by 1.12–1.57 nats, then recovers within ~1,000–1,500 SDPA steps and crosses below the dense baseline.
Tested at 16,000 total steps (~50.3B tokens) on a 530M Llama-3-style model (d_model=1024, 30 layers, H=8, head dim 128, FFN 1536, byte-level tokenizer, C4 dataset, 98,304-token context) across three split points:
| Split | B200-Hours | Tok/s (k) | Final Loss |
|---|---|---|---|
| Dense SDPA baseline | 303.2 | 45.6 | 0.7237 |
| LH 12k + SDPA 4k | 214.7 | 74.7 | 0.7102 |
| LH 11k + SDPA 5k | 219.6 | 75.4 | 0.7001 |
| LH 10k + SDPA 6k | 228.0 | 75.0 | 0.6980 |
All three Lighthouse runs beat the dense baseline at matched token budgets.
08 / Implementation Detail
Not All Layers Use Lighthouse
An important detail for practitioners: in the 30-layer experimental model, layers {0, 1, 28, 29} retain dense SDPA throughout. Only the remaining 26 layers use Lighthouse. The inner attention call inside those Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.
This means Lighthouse is a partial replacement, not a full model-wide substitution. Keeping dense attention in the first and last layers is a practical stabilization choice; these boundary layers often carry disproportionate importance for model behavior.
Optimizer setup: AdamW, lr 2×10⁻³, β₁=0.9, β₂=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, FSDP only.
Chunked-bitonic top-k: The kernel produces a stratified top-k, not a strict global top-k. The score stream is partitioned into fixed-size chunks; each chunk maintains an in-register buffer. If the globally highest-scoring entries cluster in a single chunk, some are replaced by lower-scoring entries from other chunks, ensuring every region of the sequence contributes tokens and preventing attention from collapsing onto a narrow span.
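A simplified, kernel-free version of this stratified selection (illustrative only; the real implementation is a fused chunked-bitonic GPU kernel running in registers):

```python
import torch

def stratified_topk(scores, k, chunk_size):
    """Stratified top-k sketch: split the score stream into fixed-size
    chunks and take a fixed per-chunk budget m = k // num_chunks from
    each, so one hot region cannot absorb the whole selection budget."""
    chunks = scores.view(-1, chunk_size)          # (num_chunks, chunk_size)
    m = k // chunks.shape[0]                      # per-chunk budget
    local = chunks.topk(m, dim=-1).indices        # top-m within each chunk
    offsets = torch.arange(chunks.shape[0]).unsqueeze(1) * chunk_size
    return (local + offsets).flatten()            # global, stratified indices
```

Even if all the highest scores sit in one chunk, every chunk still contributes m entries, which is exactly the balanced-coverage behavior described above.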
S = N / p^(L-1) + (L-1) * p * k
# Example: N=1M, L=4, p=4, k=4096
# S = 1,000,000/64 + 3*4*4096
# S = 15,625 + 49,152 ≈ 65,000 (vs 1,000,000 for full attention)
09 / Ablations
What the Hyperparameter Sweep Shows
The full ablation grid varied scorer type, pooling factor p, pyramid levels L, and top-k budget k. All configurations used the 10k+6k split at 98K context.
| Config | Scorer | B200-Hours | Tok/s (k) | Final Loss |
|---|---|---|---|---|
| SDPA baseline | — | 303.2 | 45.6 | 0.7237 |
| L=3, p=2, k=1536 | Dilated | 203.9 | 93.9 | 0.6825 |
| L=3, p=4, k=1536 | Dilated | 197.2 | 99.5 | 0.6881 |
| L=3, p=4, k=1536 | Norm | 179.6 | 126.0 | 0.6946 |
| L=3, p=2, k=4096 | Dilated | 215.7 | 83.5 | 0.6951 |
Key findings from the sweep:
- Smaller k → better loss (counter-intuitive)
- Shallower L=3 beats L=4 and L=5
- Norm scorer: 9% cheaper, similar quality
- Every config beats the dense baseline
The counter-intuitive finding on k: loss decreases monotonically as k shrinks from 4,096 to 1,536. The authors attribute this to hierarchical selection acting as a regularizer at the 50.3B-token budget. Whether this reverses at larger budgets is left to future work.
10 / Retrieval Evaluation
Needle-in-a-Haystack Results
Beyond training loss, the paper evaluates long-context retrieval using a simplified Needle-in-a-Haystack (NIAH) test: a single passkey digit hidden in random alphanumeric filler at depths of 0–100% across context lengths of 4K–96K tokens. Retrieval is scored as a one-token argmax over the ten digit tokens. Random chance is 10%.
| Configuration | Mean Retrieval Rate | vs Baseline |
|---|---|---|
| Dense SDPA baseline | 0.72 | — |
| k=2048, Dilated scorer | 0.76 | +0.04 |
| k=1536, Dilated scorer | 0.73 | +0.01 |
| k=2048, Norm scorer | 0.72 | Matches |
| k=1536, Norm scorer | 0.65 | −0.07 |
Three of the four Lighthouse configurations match or beat the dense-from-scratch baseline on retrieval. The norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication: if your downstream task is retrieval-heavy, use a larger k and the dilated scorer. If optimizing for loss and throughput, the norm scorer with k=1536 is the better trade-off.
11 / Scaling
Context Parallelism at 1M Tokens
For sequences beyond ~100K tokens, the 530M model OOMs on a single B200 regardless of attention method (activations + gradients + optimizer state). Lighthouse extends cleanly to multi-GPU context parallelism (CP).
1
Shard-local pre-attention
Each rank holds a contiguous slice of the sequence. Pyramid pooling, scoring, and top-k all run shard-locally. The coarsest pool window (e.g., 64 tokens) is far smaller than the shard size (N/W ≈ 128K at N=1M, W=8), so no inter-rank communication is needed at this stage.
2
Standard ring attention
The gathered sub-sequence is dense, so it participates in standard ring attention with no sparse-aware collectives. KV shards rotate through the ring as in a fully dense long-context run. Sparse-index-based methods cannot do this: ring rotation requires a contiguous tensor, which their sparse outputs are not.
~10% ring-rotation overhead in CP vs single-device
1M-token training context achieved
4 nodes × 8 GPUs, CP degree 8
The Lighthouse vs. SDPA speedup ratio is fully preserved under matched CP geometry, carrying the advantage cleanly into the 1M-token regime.
12 / Limitations & Resources
Limitations and Open Directions
Key limitation: Symmetric Q/K/V pooling presumes all queries co-occur in a single forward pass. Autoregressive decoding presents one query at a time, which violates that assumption. Lighthouse is a training-only method and relies on the dense-SDPA resumption to produce an inference-ready model. The gathered sub-sequence cost is Θ(S²·d): sub-quadratic in N at fixed k, but not strictly linear. Regimes where k must scale with N remain uncharacterized.
Open directions from the paper:
- Asymmetric sparse resumption (targeting DSA / NSA / MoBA)
- Per-layer / per-head adaptive k
- Vision, audio, and video pyramid extensions
- Serving integration (continuous batching, KV-cache)
Paper
arXiv:2605.06554, "Long Context Pre-Training with Lighthouse Attention", Peng, Ghosh, Quesnelle (Nous Research)
Code
github.com/ighoshsubho/lighthouse-attention: a patch on upstream torchtitan plus two new files
Nous Research's Lighthouse Attention pools Q, K, and V symmetrically across a multi-level pyramid, unlike NSA and HISA which pool only K and V, cutting the attention call from O(N·S·d) to O(S²·d) and making the expensive step stock FlashAttention on a small dense sub-sequence.
It is a training-only method: a brief dense-SDPA resumption at the end converts the checkpoint into a normal full-attention model that matches or beats dense-from-scratch at the same token budget (final loss 0.6980–0.7102 vs. 0.7237 baseline, 16k steps, ~50.3B tokens).
At 512K context on a single B200, Lighthouse is 21× faster on the forward pass and 17.3× faster on forward+backward vs. cuDNN SDPA, translating to a 1.40×–1.69× end-to-end pretraining wall-clock speedup.
The top-k selection step is deliberately non-differentiable (no straight-through estimator, no Gumbel softmax), so the projection matrices learn to produce values that are useful when selected, not to game a learnable scorer.
It scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) under context parallelism with no changes to the inner attention kernel, because the gathered sub-sequence is dense and participates in standard ring attention.