> "In our attention design, a transformer block with softmax attention follows every seven transnormer blocks (Qin et al., 2022a) with lightning attention."
Alright, so that's one softmax block per eight blocks: 87.5% linear attention + 12.5% full attention.
TBH I find the terminology around "linear attention" rather confusing.
"Softmax attention" is an information routing mechanism: when token `k` is being computed, it can receive information from tokens 1..k, but it has to be crammed through a channel of a fixed size.
"Linear attention", on the other hand, is just a 'register bank' of a fixed size available to each layer. It's not real attention, it's attention only in the sense it's compatible with layer-at-once computation.