Hey guys, I am building KVBoost (GitHub: pythongiant/KVBoost), a library that makes LLM inference faster with chunk-level KV cache reuse.
What’s in the engine
1. Cross-request KV cache reuse
Prompts are split into fixed-size chunks and content-addressed by hash. On a
cache hit, stored K/V tensors are loaded instead of recomputed. CacheBlend
seam repair selectively recomputes the ~15% most-deviated tokens at chunk
boundaries, so stitched K/V matches the output quality of a full prefill in the
benchmark below.
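To make the content-addressing idea concrete, here is a minimal sketch of chunk hashing and lookup. It is not KVBoost's actual code: `CHUNK_SIZE`, `chunk_keys`, and `reuse_ratio` are hypothetical names, the chunk size is tiny for readability, and strings stand in for real K/V tensors.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical; a real engine would use much larger chunks


def chunk_keys(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[Tuple[str, List[int]]]:
    """Split token IDs into full fixed-size chunks and hash each chunk's content."""
    full = len(token_ids) - len(token_ids) % chunk_size  # ignore the trailing partial chunk
    keys = []
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        digest = hashlib.sha256(",".join(map(str, chunk)).encode()).hexdigest()
        keys.append((digest, chunk))
    return keys


def reuse_ratio(token_ids: List[int], cache: Dict[str, str]) -> float:
    """Fraction of full chunks found in the cache; misses are stored for next time."""
    keys = chunk_keys(token_ids)
    if not keys:
        return 0.0
    hits = 0
    for digest, chunk in keys:
        if digest in cache:
            hits += 1
        else:
            cache[digest] = f"<K/V for {chunk}>"  # placeholder for real tensors
    return hits / len(keys)


cache: Dict[str, str] = {}
turn1 = [1, 2, 3, 4, 5, 6, 7, 8]
turn2 = turn1 + [9, 10, 11, 12]  # a follow-up turn that extends the conversation
first = reuse_ratio(turn1, cache)   # cold cache: nothing to reuse
second = reuse_ratio(turn2, cache)  # two of three chunks were seen before
print(first, second)
```

Because the key depends only on chunk content, a later turn that repeats earlier text hits the cache for every chunk it shares with previous requests.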
On a 500-conversation ShareGPT replay (Qwen2.5-3B, RTX 4060 8GB):
| Turn | Baseline TTFT | KVBoost TTFT | KV reuse |
|---|---|---|---|
| 1 | 18.8 ms | 17.4 ms | 35.7% |
| 3 | 35.2 ms | 20.6 ms | 99.2% |
| 8 | 121.6 ms | 26.5 ms | 99.6% |
KVBoost TTFT stays nearly flat (17.4 ms to 26.5 ms) while baseline TTFT grows about 6.5× (18.8 ms to 121.6 ms). No measurable accuracy loss (99.2% WARM = 99.2% COLD).
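The "recompute the ~15% most-deviated tokens" step in seam repair can be pictured as a top-k selection. The sketch below assumes a per-token deviation score is already available (how KVBoost computes it is not shown here), and `select_tokens_to_recompute` is a hypothetical helper, not the library's API.

```python
from typing import List


def select_tokens_to_recompute(deviation: List[float], ratio: float = 0.15) -> List[int]:
    """Return the positions of the highest-deviation tokens, in sequence order."""
    if not deviation:
        return []
    k = max(1, round(ratio * len(deviation)))  # always recompute at least one token
    ranked = sorted(range(len(deviation)), key=lambda i: deviation[i], reverse=True)
    return sorted(ranked[:k])


# Made-up scores for 20 tokens; three positions clearly stand out.
deviation = [0.01, 0.02, 0.90, 0.03, 0.01, 0.02, 0.75, 0.04, 0.01, 0.02,
             0.03, 0.01, 0.02, 0.60, 0.01, 0.02, 0.01, 0.03, 0.02, 0.01]
picked = select_tokens_to_recompute(deviation)
print(picked)
```

Only the selected positions would get fresh K/V; every other token keeps its cached tensors.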
2. Custom FlashAttention-2 CUDA kernel
A tiled-softmax kernel that never materializes the N×N attention matrix in
HBM, cutting the extra memory needed for attention from O(N²) to O(N)
during KV encoding. Supports float16/bfloat16, head dims 64/96/128, any
sequence length, and causal masking. Covers Volta through Hopper (sm_70–sm_90).
Falls back gracefully to torch.nn.functional.scaled_dot_product_attention
if not compiled.
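For anyone unfamiliar with the tiled-softmax trick, here is a plain-Python sketch of the running-max rescaling it relies on, checked against an ordinary softmax. This only illustrates the math for a single query with scalar values. It is not the CUDA kernel, and the function names are made up for the example.

```python
import math
from typing import List


def softmax_weighted_sum(scores: List[float], values: List[float]) -> float:
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)


def tiled_softmax_weighted_sum(scores: List[float], values: List[float], tile: int = 2) -> float:
    """Same result, but scores are visited one tile at a time with O(1) running state."""
    running_max = -math.inf
    denom = 0.0  # running softmax normalizer
    acc = 0.0    # running weighted sum of values
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        scale = math.exp(running_max - new_max)  # rescale earlier partial results
        denom *= scale
        acc *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
full = softmax_weighted_sum(scores, values)
tiled = tiled_softmax_weighted_sum(scores, values, tile=2)
print(abs(full - tiled) < 1e-12)
```

The tiled version never holds more than one tile of scores at a time, which is why the full score matrix does not need to be written out.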
pip install 'kvboost[cuda]' # builds the kernel
3. AWQ layer streaming — run models bigger than VRAM
Streams INT4 layer weights from pinned host RAM into two CUDA staging slots,
overlapping PCIe transfer with compute. Embeddings, layernorms, and a
configurable number of head/tail decoder layers stay resident; the rest are
DMA’d on demand.
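A rough way to picture the layout: the first and last few decoder layers are pinned in VRAM, and the middle layers alternate between the two staging slots so one slot can be filled while the other is being used for compute. The sketch below is only a scheduling illustration under that assumption. `placement` is a hypothetical helper, and the 8-layer model is a toy.

```python
from typing import List, Tuple


def placement(num_layers: int, keep_first_k: int, keep_last_k: int) -> List[Tuple[int, str]]:
    """Label each decoder layer as resident or assigned to staging slot 0 or 1."""
    plan = []
    streamed_seen = 0
    for i in range(num_layers):
        if i < keep_first_k or i >= num_layers - keep_last_k:
            plan.append((i, "resident"))
        else:
            # Alternate slots so the next layer can copy in while this one computes.
            plan.append((i, f"slot{streamed_seen % 2}"))
            streamed_seen += 1
    return plan


plan = placement(num_layers=8, keep_first_k=2, keep_last_k=2)
for layer, where in plan:
    print(layer, where)
```

Raising `keep_first_k` or `keep_last_k` trades VRAM for fewer PCIe transfers per token.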
Qwen2.5-32B-Instruct-AWQ on an RTX 3060 12GB (~19GB packed weights):
- Peak VRAM: 9.58 GB
- Steady-state: 1.40 tok/s
- No OOM. Fully coherent output.
from kvboost import KVBoost
from kvboost.streaming import StreamingConfig
engine = KVBoost.from_pretrained(
"Qwen/Qwen2.5-32B-Instruct-AWQ",
streaming_config=StreamingConfig(keep_first_k=9, keep_last_k=9),
)
4. Speculative decoding stacked on streaming
A small resident draft model proposes γ tokens; the streamed target verifies
them in a single multi-token forward pass, so each DMA cycle over the streamed
layers yields multiple tokens instead of one.
Qwen2.5-32B target + 1.5B draft, RTX 3060 12GB, γ=5:
| Mode | tok/s (decode) |
|---|---|
| Streaming only | 0.91 |
| + Speculative (γ=5) | 2.79 |
3.07× decode speedup. Acceptance rate 40%, avg 3.0 committed tokens per round.
Greedy mode is bit-for-bit identical to non-speculative greedy.
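The greedy equivalence follows from how verification works: a drafted token is kept only if it equals the target's own greedy choice, and the first mismatch is replaced by the target's token. Below is a toy sketch with stand-in "models" (plain functions over token lists, all hypothetical). A real engine would score all drafted positions in one forward pass rather than calling the target per position.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_round(prefix: List[int], draft: NextToken, target: NextToken, gamma: int) -> List[int]:
    """Draft proposes gamma tokens; keep the matching prefix plus one target token."""
    proposal, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)
    committed, ctx = [], list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if expected != tok:
            committed.append(expected)  # target's token replaces the first mismatch
            return committed
        committed.append(tok)
        ctx.append(tok)
    committed.append(target(ctx))  # bonus token when every draft token is accepted
    return committed


def greedy(prefix: List[int], target: NextToken, n: int) -> List[int]:
    out = list(prefix)
    for _ in range(n):
        out.append(target(out))
    return out[len(prefix):]


def speculative(prefix: List[int], draft: NextToken, target: NextToken, gamma: int, n: int) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n:
        out.extend(speculative_round(out, draft, target, gamma))
    return out[len(prefix):len(prefix) + n]


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong whenever the last token is 3


first_round = speculative_round([0], toy_draft, toy_target, gamma=5)
same = speculative([0], toy_draft, toy_target, 5, 12) == greedy([0], toy_target, 12)
print(first_round, same)
```

One way to read the reported numbers: with γ=5 and a 40% acceptance rate, about 2 drafted tokens are kept per round, plus 1 token from the target, which lines up with the 3.0 average.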
5. OpenAI-compatible server
Async prefix-grouped batching: requests sharing a prompt prefix are dispatched
as a single batch, loading shared K/V once and broadcasting zero-copy. Drop-in
for the OpenAI SDK, LangChain, LlamaIndex, Instructor, and the Vercel AI SDK.
kvboost-server --model Qwen/Qwen2.5-3B --port 8000 \
--recompute-strategy cacheblend \
--kv-cache-bits 8 \
--batch-window-ms 20
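The prefix grouping can be sketched as bucketing requests that arrive in the same batch window by their leading tokens. This is a simplified illustration with a fixed prefix length. `group_by_prefix` is a hypothetical helper, and the real server's grouping logic may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_prefix(requests: List[List[int]], prefix_len: int) -> Dict[Tuple[int, ...], List[int]]:
    """Map each shared prefix (first prefix_len tokens) to the request indices that start with it."""
    groups: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for idx, tokens in enumerate(requests):
        groups[tuple(tokens[:prefix_len])].append(idx)
    return dict(groups)


# Four tokenized requests collected within one batch window (made-up token IDs).
requests = [
    [1, 2, 3, 4, 10],
    [1, 2, 3, 4, 11, 12],
    [7, 8, 9, 9, 13],
    [1, 2, 3, 4, 14],
]
batches = group_by_prefix(requests, prefix_len=4)
print(batches)
```

Each group would then load the K/V for its shared prefix once and run the differing suffixes as a single batch.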
All four optimizations compose — AWQ streaming + speculative decoding + KV
reuse + FlashAttention all work together through the same endpoint.
Quick start
from kvboost import KVBoost
engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
engine.warm("You are a helpful assistant.")
result = engine.generate("Your prompt here", max_new_tokens=256)
print(f"TTFT: {result.ttft_ms:.1f} ms | KV reuse: {result.kv_reuse_ratio:.0%}")
pip install kvboost # CPU / MPS
pip install 'kvboost[cuda]' # + FlashAttention-2 kernel
pip install 'kvboost[server]' # + OpenAI-compatible server
pip install 'kvboost[streaming]' # + AWQ layer streaming
Repo: pythongiant/KVBoost on GitHub
Would love feedback from anyone running multi-turn agents, RAG pipelines, or
trying to squeeze large models onto consumer GPUs — those are the workloads
this was built for.