2 posts found
Kimi's Attention Residuals paper proposes replacing fixed residual connections with learned softmax …
A deep dive comparing standard softmax attention, linear attention, and Flash Attention: their math, …