Attention Mechanisms Compared: Standard, Linear, and Flash

Attention is the core building block of modern Transformers. But as sequence lengths grow, the original attention mechanism becomes a bottleneck. This post breaks down how attention works, why it’s expensive, and two major approaches to making it faster: Linear Attention and Flash Attention.


1. Standard (Softmax) Attention

Given an input sequence of $N$ tokens, we project each token into three vectors:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $X \in \mathbb{R}^{N \times d_{\text{model}}}$ and $Q, K, V \in \mathbb{R}^{N \times d}$.

  • Q (Query): what information this token is looking for
  • K (Key): what information this token offers for matching
  • V (Value): the actual content this token passes along when attended to

Attention is then computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

Step by step

  1. Compute $QK^\top$, an $N \times N$ matrix of pairwise similarity scores.
  2. Apply softmax row-wise to normalize scores into attention weights.
  3. Multiply by $V$ to get the weighted output.
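The three steps above can be sketched in a few lines of NumPy. This is a minimal, unbatched, single-head version for illustration; real implementations add batching, multiple heads, and masking:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K, V: arrays of shape (N, d). Returns an (N, d) output.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # step 1: (N, N) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 2: row-wise softmax
    return weights @ V                              # step 3: weighted combination

# Toy usage: 4 tokens, head dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = softmax_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that the `(N, N)` `weights` matrix is fully materialized here, which is exactly the cost discussed next.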

Cost

  • Time: $O(N^2 d)$, dominated by the $N \times N$ matrix multiply.
  • Memory: $O(N^2)$, since the full attention matrix must be stored.

This quadratic scaling is the fundamental bottleneck. Double the sequence length, and you pay 4× in compute and memory.


2. Linear Attention

Linear attention, introduced by Katharopoulos et al. (2020), removes the softmax and replaces it with a kernel feature map $\phi$, such as $\phi(x) = \text{elu}(x) + 1$:

$$\text{Attention}(Q, K, V) \approx \phi(Q) \cdot \phi(K)^\top \cdot V$$

(A row-wise normalizing denominator, which factorizes the same way, is omitted here for clarity.)

Without softmax, we can exploit the associativity of matrix multiplication:

$$\underbrace{(\phi(Q) \cdot \phi(K)^\top)}_{N \times N} \cdot V \quad = \quad \phi(Q) \cdot \underbrace{(\phi(K)^\top \cdot V)}_{d \times d}$$

Why this changes everything

  • Left grouping (standard order): Build $N \times N$ first → $O(N^2 d)$.
  • Right grouping (the trick): Compute $\phi(K)^\top V$ first, which is $(d \times N) \times (N \times d) = d \times d$. Then multiply $\phi(Q)$ ($N \times d$) by this small $d \times d$ matrix → $O(Nd^2)$.

Since $d$ (typically 64 or 128) is much smaller than $N$ (thousands or more), $Nd^2 \ll N^2 d$, giving us linear complexity in sequence length.
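The two groupings can be checked numerically. A minimal sketch (the normalizing denominator is again omitted; `phi` is the $\text{elu}(x)+1$ feature map from Katharopoulos et al.):

```python
import numpy as np

def phi(x):
    # elu(x) + 1: keeps features positive, so dot products stay non-negative
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
N, d = 1024, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Left grouping: materializes an N x N matrix -> O(N^2 d)
left = (phi(Q) @ phi(K).T) @ V

# Right grouping: only a d x d intermediate -> O(N d^2)
right = phi(Q) @ (phi(K).T @ V)

print(np.allclose(left, right))  # True -- same result, very different cost
```

Both orders produce the same output up to floating-point rounding, but the right grouping never touches an $N \times N$ object.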

Why can’t standard attention do this?

Softmax is a nonlinear, row-wise operation that sits between $QK^\top$ and $V$. It forces us to fully materialize the $N \times N$ matrix before we can proceed, so there’s no way to reorder the computation around it.

Trade-off

Linear attention is truly $O(N)$, but removing softmax means losing its sharp, peaked attention distribution. Softmax concentrates weight on the most relevant tokens; kernel approximations produce smoother distributions. This can hurt model quality, especially on tasks requiring precise retrieval from long contexts.


3. Flash Attention

Flash Attention, introduced by Dao et al. (2022), takes a completely different approach. It computes exact softmax attention (the same math as standard attention) but optimizes how it runs on GPU hardware.

The insight: it’s a memory problem, not a compute problem

Modern GPUs have two memory tiers:

  • HBM (High Bandwidth Memory): large but relatively slow
  • SRAM (on-chip): very fast but tiny

Standard attention writes the full $N \times N$ matrix to HBM, then reads it back. This memory I/O is the real bottleneck, not the FLOPs.

How Flash Attention works

Instead of materializing the full $N \times N$ attention matrix:

  1. Tile the $Q$, $K$, $V$ matrices into small blocks that fit in SRAM.
  2. Compute attention block by block on-chip.
  3. Use an online softmax trick to accumulate the correct normalized result without ever storing the full matrix.
  4. Write only the final output back to HBM.
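The online-softmax accumulation in step 3 is the subtle part. Below is a sketch of the numerics only, in NumPy: it processes $K$ and $V$ in tiles, maintaining running row maxima and normalizers so the full $N \times N$ score matrix never exists. (The actual speedup comes from running this loop inside GPU SRAM, which NumPy cannot express.)

```python
import numpy as np

def online_softmax_attention(Q, K, V, block=4):
    """Tiled attention with an online softmax, matching standard attention exactly."""
    N, d = Q.shape
    out = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)   # running row-wise max of scores
    l = np.zeros((N, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        s = Q @ Kb.T / np.sqrt(d)              # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)              # rescale previous accumulators
        p = np.exp(s - m_new)                  # tile's unnormalized weights
        l = l * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ Vb
        m = m_new
    return out / l

# Agrees with standard softmax attention up to float rounding
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
ref = np.exp(Q @ K.T / np.sqrt(8))
ref = (ref / ref.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(online_softmax_attention(Q, K, V), ref))  # True
```

The rescaling by `exp(m - m_new)` is what lets each tile be folded in without knowing the global row maximum in advance, which is why the result is exact rather than approximate.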

Cost

  • Time: Still $O(N^2 d)$ in FLOPs. The math hasn’t changed.
  • Memory: $O(N)$, since the $N \times N$ matrix is never materialized.
  • Wall-clock: Much faster due to reduced HBM reads/writes.

Side-by-side Comparison

|                     | Standard Attention | Linear Attention     | Flash Attention                       |
|---------------------|--------------------|----------------------|---------------------------------------|
| Complexity          | $O(N^2 d)$         | $O(N d^2)$           | $O(N^2 d)$                            |
| Memory              | $O(N^2)$           | $O(N d)$             | $O(N)$                                |
| Exact softmax?      | Yes                | No (kernel approx.)  | Yes                                   |
| Speedup source      | Baseline           | Algorithmic (math)   | Hardware (I/O-aware)                  |
| Quality impact      | Baseline           | Some degradation     | Identical to baseline                 |
| Retraining needed?  | —                  | Yes                  | No (drop-in replacement)              |
| Long-seq scaling    | Hits a wall fast   | Scales well          | Better than standard, still quadratic |

When to use what?

  • Flash Attention is the default choice today. It’s a free speedup with zero quality loss and requires no architecture changes. Most major frameworks (PyTorch, JAX) support it natively.

  • Linear Attention is the research frontier for extremely long sequences (100K+ tokens) where even Flash Attention’s quadratic cost becomes prohibitive. Active variants like Gated Linear Attention (GLA) and RetNet are narrowing the quality gap.

  • In practice, these approaches are not mutually exclusive. Flash Linear Attention (Yang et al., 2023) applies Flash Attention’s I/O-aware tiling to linear attention’s algorithm, combining the best of both.


References

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. “Attention Is All You Need.” NeurIPS, 2017. arXiv:1706.03762

  2. Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.” ICML, 2020. arXiv:2006.16236

  3. Dao, T., Fu, D.Y., Ermon, S., Rudra, A., & Ré, C. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS, 2022. arXiv:2205.14135

  4. Dao, T. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” ICLR, 2024. arXiv:2307.08691

  5. Yang, S., Wang, B., Shen, Y., Panda, R., & Kim, Y. “Gated Linear Attention Transformers with Hardware-Efficient Training.” arXiv, 2023. arXiv:2312.06635


Written By

Jiufeng Li

Build yourself and believe in yourself.
