KVBoost – a full inference engine for HF causal LMs: KV reuse, FlashAttention-2, AWQ streaming, and speculative decoding

iampoppyxx · May 22, 2026, 11:18am

Hey guys, I am building KVBoost (pythongiant/KVBoost on GitHub): faster LLM inference with chunk-level KV cache reuse.

What’s in the engine

1. Cross-request KV cache reuse

Prompts are split into fixed-size chunks and content-addressed by hash. On a
cache hit, stored K/V tensors are loaded instead of recomputed. CacheBlend
seam repair selectively recomputes the ~15% most-deviated tokens at chunk
boundaries, so stitched K/V produces output quality identical to a full prefill.
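
To make the content-addressing idea concrete, here is a minimal sketch of chunk-level hashing. It is not KVBoost's actual code: the chunk size, the helper names `chunk_keys` and `reuse_ratio`, and the choice to key each chunk by its own token IDs alone are all assumptions for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the example is easy to follow


def chunk_keys(token_ids, chunk_size=CHUNK_SIZE):
    """Return one SHA-256 key per full chunk of token IDs.

    Assumption: a chunk is keyed by its own tokens only, so the same text
    produces the same key wherever it sits on a chunk boundary.
    """
    keys = []
    for start in range(0, len(token_ids) - chunk_size + 1, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        payload = ",".join(map(str, chunk)).encode("utf-8")
        keys.append(hashlib.sha256(payload).hexdigest())
    return keys


def reuse_ratio(token_ids, cache, chunk_size=CHUNK_SIZE):
    """Fraction of prompt tokens covered by chunks already present in `cache`."""
    if not token_ids:
        return 0.0
    hits = sum(1 for key in chunk_keys(token_ids, chunk_size) if key in cache)
    return hits * chunk_size / len(token_ids)


cache = set()                         # stands in for the K/V tensor store
turn1 = [1, 2, 3, 4, 5, 6, 7, 8]      # first turn: nothing cached yet
cold = reuse_ratio(turn1, cache)
cache.update(chunk_keys(turn1))       # "store" the chunks computed in turn 1
turn2 = turn1 + [9, 10, 11, 12]       # second turn extends the conversation
warm = reuse_ratio(turn2, cache)
print(f"cold reuse: {cold:.0%}, warm reuse: {warm:.0%}")
```

In a real engine the cache would map each key to stored K/V tensors rather than just recording membership, and the trailing partial chunk would always be recomputed.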

On a 500-conversation ShareGPT replay (Qwen2.5-3B, RTX 4060 8GB):

Turn	Baseline TTFT	KVBoost TTFT	KV reuse
1	18.8 ms	17.4 ms	35.7%
3	35.2 ms	20.6 ms	99.2%
8	121.6 ms	26.5 ms	99.6%

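KVBoost TTFT stays nearly flat (17.4 ms at turn 1 to 26.5 ms at turn 8) while the baseline grows to 121.6 ms, about 4.6× slower at turn 8. No measurable accuracy loss (99.2% WARM = 99.2% COLD).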

2. Custom FlashAttention-2 CUDA kernel

A tiled-softmax kernel that never materializes the N×N attention matrix in
HBM, cutting attention memory from O(N²) to O(N) during KV encoding. Supports
float16/bfloat16, head dims 64/96/128, any sequence length, and causal
masking. Covers Volta through Hopper (sm_70–sm_90). Falls back to
torch.nn.functional.scaled_dot_product_attention if the kernel is not compiled.

pip install 'kvboost[cuda]'   # builds the kernel
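
For readers unfamiliar with tiled softmax, the following pure-Python sketch shows the online-softmax recurrence that FlashAttention-style kernels rely on. It is not the KVBoost CUDA kernel: it handles a single query row, omits the 1/sqrt(d) scaling and causal masking, and the function names are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention_row(q, keys, values):
    """Reference: materialize every score, then apply softmax and weight the values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention_row(q, keys, values, tile=2):
    """Process keys/values tile by tile, keeping only a running max, sum, and accumulator."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [dot(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new running maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention_row(q, keys, values)
tiled = tiled_attention_row(q, keys, values, tile=2)
print(max(abs(a - b) for a, b in zip(reference, tiled)))
```

The tiled version only ever holds one tile of scores at a time, which is why the full score matrix never has to be written to memory.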

3. AWQ layer streaming — run models bigger than VRAM

Streams INT4 layer weights from pinned host RAM into two CUDA staging slots,
overlapping PCIe transfer with compute. Embeddings, layernorms, and a
configurable number of head/tail decoder layers stay resident; the rest are
DMA’d on demand.

Qwen2.5-32B-Instruct-AWQ on an RTX 3060 12GB (~19GB packed weights):

Peak VRAM: 9.58 GB
Steady-state: 1.40 tok/s
No OOM. Fully coherent output.

from kvboost import KVBoost
from kvboost.streaming import StreamingConfig

engine = KVBoost.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct-AWQ",
    streaming_config=StreamingConfig(keep_first_k=9, keep_last_k=9),
)
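
The resident/streamed split and the two-slot rotation can be sketched as a simple schedule. This is only an illustration under assumptions: the helper names are hypothetical, the 64-layer count is assumed for the example, and real transfers would use pinned memory and CUDA streams rather than a Python list.

```python
def plan_layers(num_layers, keep_first_k, keep_last_k):
    """Split decoder layer indices into resident and streamed lists."""
    if keep_first_k + keep_last_k > num_layers:
        raise ValueError("cannot keep more layers than the model has")
    resident = set(range(keep_first_k)) | set(range(num_layers - keep_last_k, num_layers))
    streamed = [i for i in range(num_layers) if i not in resident]
    return sorted(resident), streamed


def staging_schedule(streamed):
    """Alternate streamed layers between two staging slots (double buffering).

    At each step one layer is computed from its slot while the next streamed
    layer is copied into the other slot, so transfer overlaps compute.
    """
    steps = []
    for n, layer in enumerate(streamed):
        has_next = n + 1 < len(streamed)
        steps.append({
            "compute": layer,
            "slot": n % 2,
            "prefetch": streamed[n + 1] if has_next else None,
            "prefetch_slot": (n + 1) % 2 if has_next else None,
        })
    return steps


# Assumed 64-layer decoder with the keep_first_k=9 / keep_last_k=9 setting above.
resident, streamed = plan_layers(64, keep_first_k=9, keep_last_k=9)
steps = staging_schedule(streamed)
print(len(resident), "resident,", len(streamed), "streamed")
print(steps[0])
```

A real decode loop would also prefetch the first streamed layer for the next token while the last layers of the current token run; the sketch stops at the end of one pass for simplicity.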

4. Speculative decoding stacked on streaming

A small resident draft model proposes K tokens; the streamed target verifies
them in a single multi-token forward — the same DMA cycle, but yielding
multiple tokens per round.

Qwen2.5-32B target + 1.5B draft, RTX 3060 12GB, gamma=5:

Mode	tok/s (decode)
Streaming only	0.91
+ Speculative (γ=5)	2.79

3.07× decode speedup. Acceptance rate 40%, avg 3.0 committed tokens per round.
Greedy mode is bit-for-bit identical to non-speculative greedy.
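
The greedy variant of this scheme can be sketched with toy "models" that map a context to a next token. The function names and toy models below are hypothetical, and the per-position loop stands in for what a real engine would do in one batched target forward.

```python
def speculative_round(context, draft_next, target_next, gamma):
    """One greedy speculative round; returns the tokens committed this round."""
    # The draft model proposes gamma tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(gamma):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # The target checks each proposed position in order.
    committed = []
    ctx = list(context)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            committed.append(expected)  # replace the first mismatch and stop
            return committed
        committed.append(tok)
        ctx.append(tok)
    committed.append(target_next(ctx))  # all accepted: one extra target token
    return committed


def generate_speculative(context, draft_next, target_next, gamma, n):
    out = list(context)
    while len(out) - len(context) < n:
        out.extend(speculative_round(out, draft_next, target_next, gamma))
    return out[len(context):len(context) + n]


def generate_greedy(context, target_next, n):
    out = list(context)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(context):]


def target_next(ctx):
    return (ctx[-1] + 1) % 10          # toy target: count upward modulo 10


def draft_next(ctx):
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10   # toy draft: wrong after a 3


first_round = speculative_round([0], draft_next, target_next, gamma=5)
spec = generate_speculative([0], draft_next, target_next, gamma=5, n=12)
plain = generate_greedy([0], target_next, n=12)
print(first_round, spec == plain)
```

Because every committed token is either confirmed or supplied by the target, the output matches plain greedy decoding. If acceptance rate is counted as accepted over proposed tokens, the reported figures are consistent with this scheme: 0.40 × 5 = 2 accepted drafts plus one target token gives 3.0 committed tokens per round.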

5. OpenAI-compatible server

Async prefix-grouped batching: requests sharing a prompt prefix are dispatched
as a single batch, loading shared K/V once and broadcasting zero-copy. Drop-in
for the OpenAI SDK, LangChain, LlamaIndex, Instructor, and the Vercel AI SDK.

kvboost-server --model Qwen/Qwen2.5-3B --port 8000 \
    --recompute-strategy cacheblend \
    --kv-cache-bits 8 \
    --batch-window-ms 20
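
As a rough illustration of prefix grouping, the sketch below buckets the requests that arrived within one batch window by their leading tokens. It is an assumption-laden toy: `group_by_prefix` and the fixed `prefix_len` are hypothetical, and the real server may well group on cached chunk hashes instead.

```python
from collections import defaultdict


def group_by_prefix(requests, prefix_len):
    """Group request IDs by their first `prefix_len` tokens.

    `requests` maps a request ID to its token list. Requests shorter than
    `prefix_len` each get their own singleton group.
    """
    groups = defaultdict(list)
    for req_id, tokens in requests.items():
        if len(tokens) >= prefix_len:
            key = ("prefix", tuple(tokens[:prefix_len]))
        else:
            key = ("short", req_id)
        groups[key].append(req_id)
    return list(groups.values())


# Requests collected during one batch window (token IDs are made up).
requests = {
    "a": [1, 2, 3, 4, 9],
    "b": [1, 2, 3, 4, 7, 7],
    "c": [5, 6, 7, 8, 1],
    "d": [1, 2],
}
batches = group_by_prefix(requests, prefix_len=4)
print(batches)
```

Each resulting group could then load the shared prefix K/V once and run its members as a single batch.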

All four optimizations compose — AWQ streaming + speculative decoding + KV
reuse + FlashAttention all work together through the same endpoint.

Quick start

from kvboost import KVBoost

engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
engine.warm("You are a helpful assistant.")

result = engine.generate("Your prompt here", max_new_tokens=256)
print(f"TTFT: {result.ttft_ms:.1f} ms | KV reuse: {result.kv_reuse_ratio:.0%}")

pip install kvboost              # CPU / MPS
pip install 'kvboost[cuda]'      # + FlashAttention-2 kernel
pip install 'kvboost[server]'    # + OpenAI-compatible server
pip install 'kvboost[streaming]' # + AWQ layer streaming

Repo: pythongiant/KVBoost on GitHub

Would love feedback from anyone running multi-turn agents, RAG pipelines, or
trying to squeeze large models onto consumer GPUs — those are the workloads
this was built for.
