Attention Mechanisms Compared: Standard, Linear, and Flash

Attention is the core building block of modern Transformers. But as sequence lengths grow, the original attention mechanism becomes a bottleneck. This post breaks down how attention works, why it’s expensive, and two major approaches to making it faster: Linear Attention and Flash Attention.


1. Standard (Softmax) Attention

Given an input sequence of $N$ tokens, we project each token into three vectors:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $X \in \mathbb{R}^{N \times d_{\text{model}}}$ and $Q, K, V \in \mathbb{R}^{N \times d}$.

  • Q (Query): what information this token is looking for
  • K (Key): what information this token offers for matching
  • V (Value): the actual content this token passes along when attended to

Attention is then computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

Step by step

  1. Compute $QK^\top$, an $N \times N$ matrix of pairwise similarity scores.
  2. Apply softmax row-wise to normalize scores into attention weights.
  3. Multiply by $V$ to get the weighted output.
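The three steps above can be sketched in a few lines of NumPy. This is a minimal, unbatched, single-head version for illustration; real implementations add batching, multiple heads, and masking:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K, V: arrays of shape (N, d). Returns an (N, d) output.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # step 1: (N, N) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 2: row-wise softmax
    return weights @ V                              # step 3: weighted combination

# Toy usage: 4 tokens, head dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = softmax_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that the `(N, N)` `weights` matrix is fully materialized here, which is exactly the cost discussed next.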

Cost

  • Time: $O(N^2 d)$, dominated by the $N \times N$ matrix multiply.
  • Memory: $O(N^2)$, since the full attention matrix must be stored.

This quadratic scaling is the fundamental bottleneck. Double the sequence length, and you pay 4× in compute and memory.


2. Linear Attention

Linear attention, introduced by Katharopoulos et al. (2020), removes the softmax and replaces it with a kernel feature map $\phi$, such as $\phi(x) = \text{elu}(x) + 1$:

$$\text{Attention}(Q, K, V) \approx \phi(Q) \cdot \phi(K)^\top \cdot V$$

(A row-wise normalizing denominator, which factorizes the same way, is omitted here for clarity.)

Without softmax, we can exploit the associativity of matrix multiplication:

$$\underbrace{(\phi(Q) \cdot \phi(K)^\top)}_{N \times N} \cdot V \quad = \quad \phi(Q) \cdot \underbrace{(\phi(K)^\top \cdot V)}_{d \times d}$$

Why this changes everything

  • Left grouping (standard order): Build $N \times N$ first → $O(N^2 d)$.
  • Right grouping (the trick): Compute $\phi(K)^\top V$ first, which is $(d \times N) \times (N \times d) = d \times d$. Then multiply $\phi(Q)$ ($N \times d$) by this small $d \times d$ matrix → $O(Nd^2)$.

Since $d$ (typically 64 or 128) is much smaller than $N$ (thousands or more), $Nd^2 \ll N^2 d$, giving us linear complexity in sequence length.
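The two groupings can be checked numerically. A minimal sketch (the normalizing denominator is again omitted; `phi` is the $\text{elu}(x)+1$ feature map from Katharopoulos et al.):

```python
import numpy as np

def phi(x):
    # elu(x) + 1: keeps features positive, so dot products stay non-negative
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
N, d = 1024, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Left grouping: materializes an N x N matrix -> O(N^2 d)
left = (phi(Q) @ phi(K).T) @ V

# Right grouping: only a d x d intermediate -> O(N d^2)
right = phi(Q) @ (phi(K).T @ V)

print(np.allclose(left, right))  # True -- same result, very different cost
```

Both orders produce the same output up to floating-point rounding, but the right grouping never touches an $N \times N$ object.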

Why can’t standard attention do this?

Softmax is a nonlinear, row-wise operation that sits between $QK^\top$ and $V$. It forces us to fully materialize the $N \times N$ matrix before we can proceed, so there’s no way to reorder the computation around it.

Trade-off

Linear attention is truly $O(N)$, but removing softmax means losing its sharp, peaked attention distribution. Softmax concentrates weight on the most relevant tokens; kernel approximations produce smoother distributions. This can hurt model quality, especially on tasks requiring precise retrieval from long contexts.


3. Flash Attention

Flash Attention, introduced by Dao et al. (2022), takes a completely different approach. It computes exact softmax attention (the same math as standard attention) but optimizes how it runs on GPU hardware.

The insight: it’s a memory problem, not a compute problem

Modern GPUs have two memory tiers:

  • HBM (High Bandwidth Memory): large but relatively slow
  • SRAM (on-chip): very fast but tiny

Standard attention writes the full $N \times N$ matrix to HBM, then reads it back. This memory I/O is the real bottleneck, not the FLOPs.

How Flash Attention works

Instead of materializing the full $N \times N$ attention matrix:

  1. Tile the $Q$, $K$, $V$ matrices into small blocks that fit in SRAM.
  2. Compute attention block by block on-chip.
  3. Use an online softmax trick to accumulate the correct normalized result without ever storing the full matrix.
  4. Write only the final output back to HBM.
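The online-softmax accumulation in step 3 is the subtle part. Below is a sketch of the numerics only, in NumPy: it processes $K$ and $V$ in tiles, maintaining running row maxima and normalizers so the full $N \times N$ score matrix never exists. (The actual speedup comes from running this loop inside GPU SRAM, which NumPy cannot express.)

```python
import numpy as np

def online_softmax_attention(Q, K, V, block=4):
    """Tiled attention with an online softmax, matching standard attention exactly."""
    N, d = Q.shape
    out = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)   # running row-wise max of scores
    l = np.zeros((N, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        s = Q @ Kb.T / np.sqrt(d)              # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)              # rescale previous accumulators
        p = np.exp(s - m_new)                  # tile's unnormalized weights
        l = l * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ Vb
        m = m_new
    return out / l

# Agrees with standard softmax attention up to float rounding
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
ref = np.exp(Q @ K.T / np.sqrt(8))
ref = (ref / ref.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(online_softmax_attention(Q, K, V), ref))  # True
```

The rescaling by `exp(m - m_new)` is what lets each tile be folded in without knowing the global row maximum in advance, which is why the result is exact rather than approximate.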

Cost

  • Time: Still $O(N^2 d)$ in FLOPs. The math hasn’t changed.
  • Memory: $O(N)$, since the $N \times N$ matrix is never materialized.
  • Wall-clock: Much faster due to reduced HBM reads/writes.

Side-by-side Comparison

|                     | Standard Attention | Linear Attention     | Flash Attention                       |
|---------------------|--------------------|----------------------|---------------------------------------|
| Complexity          | $O(N^2 d)$         | $O(N d^2)$           | $O(N^2 d)$                            |
| Memory              | $O(N^2)$           | $O(N d)$             | $O(N)$                                |
| Exact softmax?      | Yes                | No (kernel approx.)  | Yes                                   |
| Speedup source      | Baseline           | Algorithmic (math)   | Hardware (I/O-aware)                  |
| Quality impact      | Baseline           | Some degradation     | Identical to baseline                 |
| Retraining needed?  | —                  | Yes                  | No (drop-in replacement)              |
| Long-seq scaling    | Hits a wall fast   | Scales well          | Better than standard, still quadratic |

When to use what?

  • Flash Attention is the default choice today. It’s a free speedup with zero quality loss and requires no architecture changes. Most major frameworks (PyTorch, JAX) support it natively.

  • Linear Attention is the research frontier for extremely long sequences (100K+ tokens) where even Flash Attention’s quadratic cost becomes prohibitive. Active variants like Gated Linear Attention (GLA) and RetNet are narrowing the quality gap.

  • In practice, these approaches are not mutually exclusive. Flash Linear Attention (Yang et al., 2023) applies Flash Attention’s I/O-aware tiling to linear attention’s algorithm, combining the best of both.


References

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. “Attention Is All You Need.” NeurIPS, 2017. arXiv:1706.03762

  2. Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.” ICML, 2020. arXiv:2006.16236

  3. Dao, T., Fu, D.Y., Ermon, S., Rudra, A., & Ré, C. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS, 2022. arXiv:2205.14135

  4. Dao, T. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” ICLR, 2024. arXiv:2307.08691

  5. Yang, S., Wang, B., Shen, Y., Panda, R., & Kim, Y. “Gated Linear Attention Transformers with Hardware-Efficient Training.” arXiv, 2023. arXiv:2312.06635


Written By

Jiufeng Li

Build yourself and believe in yourself.
