KVQuant: attention-aware extensions to KV cache vector quantization (paper + code)

Hi everyone,

Sharing my paper, KVQuant: five structure-aware extensions to KV cache compression built on top of TurboQuant (Zandieh et al., 2025).

TurboQuant rotates KV vectors, then applies Lloyd-Max quantization, giving near-optimal MSE with provable bounds. But it treats every token the same and ignores that quantization error has exploitable structure.
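
For anyone unfamiliar with the rotate-then-quantize idea, here is a minimal NumPy sketch. It is a simplified illustration, not TurboQuant's actual implementation: the function names are hypothetical, the rotation is a plain QR-based random orthogonal matrix, and the Lloyd-Max codebook is fit empirically to synthetic Gaussian data rather than derived analytically.

```python
import numpy as np

def random_rotation(d, rng):
    # QR of a Gaussian matrix yields a random orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs

def lloyd_max(samples, bits, iters=50):
    # 1-D Lloyd-Max: alternate nearest-level assignment and centroid update.
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()
    return np.sort(levels)

def quantize(x, rot, levels):
    # Rotate, then snap every coordinate to its nearest codebook level.
    y = x @ rot
    return np.abs(y[..., None] - levels).argmin(axis=-1).astype(np.uint8)

def dequantize(codes, rot, levels):
    # Look up the levels and undo the rotation.
    return levels[codes] @ rot.T

rng = np.random.default_rng(0)
d = 64
keys = rng.standard_normal((512, d))   # synthetic stand-in for cached keys
rot = random_rotation(d, rng)
levels = lloyd_max((keys @ rot).ravel(), bits=3)
codes = quantize(keys, rot, levels)
recon = dequantize(codes, rot, levels)
mse = np.mean((keys - recon) ** 2)
print(f"3-bit reconstruction MSE: {mse:.4f}")
```

Because the rotation is orthogonal, the error measured in the rotated space equals the error in the original space, which is why a single scalar codebook can be shared across all coordinates.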

**Five extensions:**

- Attention-weighted bit assignment: 47–70% lower weighted distortion

- Delta compression: 1.1–2.2x lower MSE on correlated streams

- Adaptive bit allocation: EMA tracker, promotes/demotes during generation

- Low-rank error correction: rank-4 SVD recovers 96% of 2-bit PPL loss

- Product quantization: 2-bit storage matching 3-bit scalar quality
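
To make the low-rank error correction item concrete, here is a rough sketch of the general idea: take the quantization residual, keep its top-4 singular directions, and add them back at read time. This is my own illustration under assumptions, not the code from the repo. The helper name `low_rank_correction` is hypothetical, and the 2-bit quantizer is a crude fixed-level stand-in.

```python
import numpy as np

def low_rank_correction(original, quantized, rank=4):
    # Truncated SVD of the residual: the best rank-`rank` approximation
    # in the Frobenius norm (Eckart-Young).
    residual = original - quantized
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

rng = np.random.default_rng(0)
tokens, d = 256, 64
kv = rng.standard_normal((tokens, d))          # synthetic KV block

# Crude 2-bit scalar quantizer with four fixed levels (stand-in only).
levels = np.array([-1.5, -0.5, 0.5, 1.5])
kv_hat = levels[np.abs(kv[..., None] - levels).argmin(axis=-1)]

left, right = low_rank_correction(kv, kv_hat, rank=4)
corrected = kv_hat + left @ right              # apply correction on read

err_before = np.linalg.norm(kv - kv_hat)
err_after = np.linalg.norm(kv - corrected)
print(f"residual norm: {err_before:.2f} -> {err_after:.2f}")
```

On this random data the residual is close to white noise, so the gain is small. Any larger gain depends on the residual of real KV tensors having low-rank structure. The extra storage here is `rank * (tokens + d)` values per block.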

**Key result:** 2-bit + rank-4 correction on gpt2-medium drops dPPL from +173 to +5.95. PQ (M=16, b=8) produces coherent generation where 2-bit scalar quantization completely collapses.
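
A quick sanity check of why PQ with M=16, b=8 counts as 2-bit storage. This assumes PQ is applied per attention head to 64-dimensional vectors (gpt2-medium has hidden size 1024 and 16 heads) and ignores codebook overhead. The helper name is hypothetical.

```python
def pq_bits_per_dim(dim, num_subvectors, bits_per_code):
    # Each of the M subvectors is stored as a single b-bit codebook index.
    if dim % num_subvectors != 0:
        raise ValueError("dim must be divisible by num_subvectors")
    return num_subvectors * bits_per_code / dim

head_dim = 1024 // 16          # gpt2-medium: hidden size / number of heads
sub_dim = head_dim // 16       # dimensions covered by each PQ code
bits = pq_bits_per_dim(head_dim, num_subvectors=16, bits_per_code=8)
print(sub_dim, bits)           # 4 dims per code, 2.0 bits per coordinate
```

Under that assumption each 8-bit code jointly describes 4 coordinates with 256 centroids, whereas 2-bit scalar quantization gives each coordinate only 4 levels independently.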

Paper: OSF

Code: GitHub, syedMohib44/kvquant (attention-aware KV cache quantization for LLM inference)
