KVQuant: attention-aware extensions to KV cache vector quantization (paper + code)

syedMohib44 · May 19, 2026, 9:16am

Hi everyone,

Sharing my paper, KVQuant: five structure-aware extensions to KV cache compression, built on top of TurboQuant (Zandieh et al., 2025).

TurboQuant rotates KV vectors, then applies Lloyd-Max quantization, which gives near-optimal MSE with provable bounds. But it treats every token the same and ignores that quantization error has exploitable structure.
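For readers who have not seen the rotate-then-quantize idea, here is a minimal NumPy sketch of it. It is not the TurboQuant reference code: the helper names (`random_rotation`, `lloyd_max`, `quantize`), the synthetic keys, and the choice to fit one empirical codebook on the rotated coordinates are all assumptions made for illustration.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def lloyd_max(samples, bits, iters=30):
    """Fit a 1-D Lloyd-Max codebook with 2**bits levels to the samples."""
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2      # nearest-level boundaries
        idx = np.searchsorted(edges, samples)       # assign samples to cells
        for k in range(n_levels):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()          # centroid update
    return levels

def quantize(x, levels):
    """Return codebook indices and dequantized values for x."""
    edges = (levels[:-1] + levels[1:]) / 2
    idx = np.searchsorted(edges, x)
    return idx.astype(np.uint8), levels[idx]

rng = np.random.default_rng(0)
dim, n_tokens = 64, 512
# Synthetic keys with uneven channel scales (assumption, not real model data).
keys = rng.standard_normal((n_tokens, dim)) * rng.uniform(0.1, 3.0, size=dim)

rot = random_rotation(dim, rng)
rotated = keys @ rot                                # rotation spreads energy across coordinates

def mse_at(bits):
    levels = lloyd_max(rotated.ravel(), bits)
    _, dequantized = quantize(rotated, levels)
    recon = dequantized @ rot.T                     # undo the rotation
    return float(np.mean((keys - recon) ** 2))

mse2, mse3 = mse_at(2), mse_at(3)
print(f"2-bit MSE: {mse2:.4f}  3-bit MSE: {mse3:.4f}")
```

Note that every token gets the same codebook and the same bit width here, which is the uniform treatment the extensions below try to improve on.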

**Five extensions:**

- Attention-weighted bit assignment: 47–70% lower weighted distortion
- Delta compression: 1.1–2.2x lower MSE on correlated streams
- Adaptive bit allocation: EMA tracker, promotes/demotes during generation
- Low-rank error correction: rank-4 SVD recovers 96% of 2-bit PPL loss
- Product quantization: 2-bit storage matching 3-bit scalar quality
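To make the low-rank error correction item concrete, the sketch below quantizes a key matrix to 2 bits, takes a rank-4 truncated SVD of the quantization residual, and adds it back. It is an illustration under assumptions, not the paper's implementation: the uniform min-max quantizer is a stand-in for the real one, the data is synthetic, and `uniform_quantize` and `low_rank_correction` are hypothetical helper names.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Per-tensor uniform min-max quantization (stand-in for the real quantizer)."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / step) * step + lo

def low_rank_correction(residual, rank):
    """Factors of the best rank-`rank` approximation of the residual (truncated SVD)."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]        # shapes (n, rank) and (rank, d)

rng = np.random.default_rng(0)
n_tokens, dim, rank = 256, 64, 4
# Synthetic keys: a few dominant directions plus noise (assumption, not model data).
keys = (rng.standard_normal((n_tokens, 4)) @ rng.standard_normal((4, dim))
        + 0.1 * rng.standard_normal((n_tokens, dim)))

quantized = uniform_quantize(keys, bits=2)
residual = keys - quantized                         # quantization error matrix
left, right = low_rank_correction(residual, rank)
corrected = quantized + left @ right                # add the rank-4 error estimate back

err_before = float(np.linalg.norm(residual))
err_after = float(np.linalg.norm(keys - corrected))
extra_values = rank * (n_tokens + dim)              # side information stored alongside the codes
print(f"error before: {err_before:.3f}  after: {err_after:.3f}  extra values: {extra_values}")
```

Because the truncated SVD is the best rank-4 approximation in Frobenius norm, the corrected error can never exceed the uncorrected one. How much it helps depends on how much of the error lies in a few directions, which this toy data does not measure for a real model.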

**Key result:** 2-bit + rank-4 correction on gpt2-medium drops dPPL (the increase in perplexity) from +173 to +5.95. PQ (M=16, b=8) produces coherent generation where 2-bit scalar quantization completely collapses.
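A quick check of why PQ with M=16, b=8 counts as 2-bit storage. The sketch assumes the usual PQ notation (M subspaces, b bits per subspace code) and that PQ is applied to each 64-dimensional head vector of gpt2-medium (1024 hidden size / 16 heads). The post does not state either, so treat both as assumptions, along with the fp16 codebook size.

```python
def pq_bits_per_dim(vector_dim, num_subspaces, bits_per_code):
    """Average storage cost in bits per coordinate for product quantization."""
    if vector_dim % num_subspaces:
        raise ValueError("vector_dim must be divisible by num_subspaces")
    return num_subspaces * bits_per_code / vector_dim

def pq_codebook_bytes(vector_dim, num_subspaces, bits_per_code, bytes_per_value=2):
    """Size of all sub-codebooks, assuming fp16 centroids (2 bytes per value)."""
    sub_dim = vector_dim // num_subspaces
    return num_subspaces * (2 ** bits_per_code) * sub_dim * bytes_per_value

HEAD_DIM = 64  # assumption: one gpt2-medium attention head (1024 / 16)

bits = pq_bits_per_dim(HEAD_DIM, num_subspaces=16, bits_per_code=8)
codebook = pq_codebook_bytes(HEAD_DIM, num_subspaces=16, bits_per_code=8)
print(bits)      # 2.0 bits per coordinate
print(codebook)  # 32768 bytes of codebook, shared by every vector that uses it
```

Under these assumptions each 4-dimensional sub-vector is stored as one 8-bit code, so the per-coordinate cost matches 2-bit scalar quantization, while each code selects from 256 joint centroids instead of 4 levels per coordinate.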

Paper: OSF
