KVBoost – a full inference engine for HF causal LMs: KV reuse, FlashAttention-2, AWQ streaming, and speculative decoding

Hey guys, I am building kvboost. GitHub - pythongiant/KVBoost: Make LLM inference faster with chunk-level KV cache reuse · GitHub

What’s in the engine

1. Cross-request KV cache reuse

Prompts are split into fixed-size chunks and content-addressed by hash. On a
cache hit, stored K/V tensors are loaded instead of recomputed. CacheBlend
seam repair
selectively recomputes the ~15% most-deviated tokens at chunk
boundaries, so stitched K/V produces output quality identical to a full prefill.

On a 500-conversation ShareGPT replay (Qwen2.5-3B, RTX 4060 8GB):

Turn Baseline TTFT KVBoost TTFT KV reuse
1 18.8 ms 17.4 ms 35.7%
3 35.2 ms 20.6 ms 99.2%
8 121.6 ms 26.5 ms 99.6%

TTFT stays flat. No measurable accuracy loss (99.2% WARM = 99.2% COLD).


2. Custom FlashAttention-2 CUDA kernel

A tiled-softmax kernel that reduces HBM memory traffic from O(N²) to O(N)
during KV encoding. Supports float16/bfloat16, head dims 64/96/128, any
sequence length, and causal masking. Covers Volta through Hopper (sm_70–sm_90).
Falls back gracefully to torch.nn.functional.scaled_dot_product_attention
if not compiled.

pip install 'kvboost[cuda]'   # builds the kernel

3. AWQ layer streaming — run models bigger than VRAM

Streams INT4 layer weights from pinned host RAM into two CUDA staging slots,
overlapping PCIe transfer with compute. Embeddings, layernorms, and a
configurable number of head/tail decoder layers stay resident; the rest are
DMA’d on demand.

Qwen2.5-32B-Instruct-AWQ on an RTX 3060 12GB (~19GB packed weights):

  • Peak VRAM: 9.58 GB
  • Steady-state: 1.40 tok/s
  • No OOM. Fully coherent output.
from kvboost import KVBoost
from kvboost.streaming import StreamingConfig

engine = KVBoost.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct-AWQ",
    streaming_config=StreamingConfig(keep_first_k=9, keep_last_k=9),
)

4. Speculative decoding stacked on streaming

A small resident draft model proposes K tokens; the streamed target verifies
them in a single multi-token forward — the same DMA cycle, but yielding
multiple tokens per round.

Qwen2.5-32B target + 1.5B draft, RTX 3060 12GB, gamma=5:

Mode tok/s (decode)
Streaming only 0.91
+ Speculative (γ=5) 2.79

3.07× decode speedup. Acceptance rate 40%, avg 3.0 committed tokens per round.
Greedy mode is bit-for-bit identical to non-speculative greedy.


5. OpenAI-compatible server

Async prefix-grouped batching: requests sharing a prompt prefix are dispatched
as a single batch, loading shared K/V once and broadcasting zero-copy. Drop-in
for the OpenAI SDK, LangChain, LlamaIndex, Instructor, and the Vercel AI SDK.

kvboost-server --model Qwen/Qwen2.5-3B --port 8000 \
    --recompute-strategy cacheblend \
    --kv-cache-bits 8 \
    --batch-window-ms 20

All four optimizations compose — AWQ streaming + speculative decoding + KV
reuse + FlashAttention all work together through the same endpoint.


Quick start

from kvboost import KVBoost

engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
engine.warm("You are a helpful assistant.")

result = engine.generate("Your prompt here", max_new_tokens=256)
print(f"TTFT: {result.ttft_ms:.1f} ms | KV reuse: {result.kv_reuse_ratio:.0%}")
pip install kvboost              # CPU / MPS
pip install 'kvboost[cuda]'      # + FlashAttention-2 kernel
pip install 'kvboost[server]'    # + OpenAI-compatible server
pip install 'kvboost[streaming]' # + AWQ layer streaming

Repo: GitHub - pythongiant/KVBoost: Make LLM inference faster with chunk-level KV cache reuse · GitHub

Would love feedback from anyone running multi-turn agents, RAG pipelines, or
trying to squeeze large models onto consumer GPUs — those are the workloads
this was built for.

1 Like