
[KVCache] FlashInfer


Key Contributions

Block-Sparse and Composable Formats for KV-Cache

Unique KV-Cache: tokens unique to a single request -> kept in L2 cache or VRAM (global memory)
Shared KV-Cache: frequently used tokens, e.g. a shared prefix -> kept in shared memory
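
The split between shared and unique KV-cache can be illustrated with a small sketch. It assumes a paged KV-cache where each request holds a list of page ids; the function names `split_shared_unique` and `to_bsr` are hypothetical and are not FlashInfer's API. The unique part is stored in a CSR/BSR-style `indptr`/`indices` layout, which is one common way to describe a block-sparse structure.

```python
def split_shared_unique(page_tables):
    """Split per-request page lists into a common prefix and unique suffixes.

    page_tables: list of lists of page ids, one list per request.
    Returns (shared_prefix, unique_suffixes).
    """
    if not page_tables:
        return [], []
    prefix = []
    # Walk the page lists in lockstep and stop at the first mismatch.
    for pages in zip(*page_tables):
        if all(p == pages[0] for p in pages):
            prefix.append(pages[0])
        else:
            break
    suffixes = [table[len(prefix):] for table in page_tables]
    return prefix, suffixes


def to_bsr(page_lists):
    """Flatten per-request page lists into (indptr, indices).

    Request i owns indices[indptr[i]:indptr[i + 1]].
    """
    indptr = [0]
    indices = []
    for pages in page_lists:
        indices.extend(pages)
        indptr.append(len(indices))
    return indptr, indices


if __name__ == "__main__":
    tables = [[0, 1, 5], [0, 1, 6, 7], [0, 1]]
    shared, unique = split_shared_unique(tables)
    print("shared prefix pages:", shared)
    print("unique pages (indptr, indices):", to_bsr(unique))
```

In this sketch the shared prefix pages are stored once and can be processed as one dense block for all requests, while the unique pages stay in the sparse per-request layout.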

Dynamic Load-Balanced Scheduling

Group work into chunks of similar size so that the load is balanced.
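
The idea can be sketched as follows. This is a minimal illustration and not necessarily the paper's exact algorithm: it assumes each request's KV length is cut into chunks of at most `chunk_size` tokens, and that each chunk is then handed greedily to the worker with the least accumulated load. The names `balance`, `num_workers`, and `chunk_size` are hypothetical.

```python
import heapq


def balance(kv_lens, num_workers, chunk_size):
    """Assign KV chunks to workers so that per-worker loads stay close.

    kv_lens: KV length (in tokens) of each request.
    Returns plan, where plan[w] is a list of (request, start, end) chunks.
    """
    # Cut every request into chunks of similar size.
    chunks = []
    for req, n in enumerate(kv_lens):
        for start in range(0, n, chunk_size):
            chunks.append((req, start, min(start + chunk_size, n)))

    # Place the largest chunks first (greedy longest-first heuristic).
    chunks.sort(key=lambda c: c[2] - c[1], reverse=True)

    # Min-heap of (current load, worker id).
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    plan = [[] for _ in range(num_workers)]
    for chunk in chunks:
        load, w = heapq.heappop(heap)
        plan[w].append(chunk)
        heapq.heappush(heap, (load + chunk[2] - chunk[1], w))
    return plan


if __name__ == "__main__":
    for w, work in enumerate(balance([10, 3, 3], num_workers=2, chunk_size=4)):
        print(w, work, sum(end - start for _, start, end in work))
```

Because no chunk is longer than `chunk_size`, the gap between the most and least loaded worker in this greedy scheme never exceeds `chunk_size`.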

Reference

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
