[KVCache] FlashInfer
Key Contributions
Block-Sparse and Composable Formats for KV-Cache
Unique KV-Cache: tokens unique to a single request -> served from L2 cache or VRAM (global memory)
Shared KV-Cache: frequently reused tokens, e.g. a shared prefix -> served from shared memory
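A minimal sketch of how a block-sparse index over the KV-cache could be composed from a shared part and a unique part. This is an illustration in plain Python, not FlashInfer's actual data structure or API: the class `BlockSparseKV`, its fields, and the helper `kv_slots` are hypothetical names, and the block size of 4 is arbitrary. The index follows the usual block-sparse-row idea (`indptr` delimits each row's blocks, `indices` names the blocks in a global pool).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockSparseKV:
    """BSR-style index over a pool of fixed-size KV blocks (hypothetical layout)."""
    block_size: int            # tokens per block
    indptr: List[int]          # indptr[i]:indptr[i+1] delimits the blocks of row i
    indices: List[int]         # block ids in the global pool
    last_block_len: List[int]  # number of valid tokens in each row's final block

    def token_slots(self, row: int) -> List[int]:
        """Return the flat pool slot of every cached token belonging to `row`."""
        blocks = self.indices[self.indptr[row]:self.indptr[row + 1]]
        slots = []
        for j, block in enumerate(blocks):
            is_last = j == len(blocks) - 1
            n = self.last_block_len[row] if is_last else self.block_size
            slots.extend(block * self.block_size + t for t in range(n))
        return slots


# Shared part: one row (prefix group 0) covering 8 prefix tokens in blocks 0 and 1.
shared = BlockSparseKV(block_size=4, indptr=[0, 2], indices=[0, 1],
                       last_block_len=[4])
# Unique part: request 0 owns block 2 (3 tokens); request 1 owns blocks 3 and 4.
unique = BlockSparseKV(block_size=4, indptr=[0, 1, 3], indices=[2, 3, 4],
                       last_block_len=[3, 1])
# Both requests reuse the same prefix group.
request_to_group = [0, 0]


def kv_slots(req: int) -> List[int]:
    """Compose the two formats: shared prefix slots followed by unique slots."""
    return shared.token_slots(request_to_group[req]) + unique.token_slots(req)
```

Calling `kv_slots(0)` and `kv_slots(1)` returns the same leading prefix slots for both requests, followed by each request's own slots. Keeping the two parts in separate indices is what lets a kernel treat the shared part differently from the unique part.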
Dynamic Load-Balanced Scheduling
Group work into chunks of similar size so that the load is balanced
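The idea of balancing work by chunk size can be sketched as follows. This is a simplified illustration, not the scheduler from the paper: it assumes the cost of a chunk is proportional to its KV length, splits long requests into pieces of at most `max_chunk` tokens, and uses a standard greedy longest-first heuristic to hand each piece to the least-loaded worker. The function names and the value 256 are hypothetical.

```python
import heapq


def split_into_chunks(kv_lens, max_chunk):
    """Split each request's KV length into pieces of at most `max_chunk` tokens.

    Each chunk is a tuple (request id, start offset, length).
    """
    chunks = []
    for req, n in enumerate(kv_lens):
        for start in range(0, n, max_chunk):
            chunks.append((req, start, min(max_chunk, n - start)))
    return chunks


def assign_chunks(chunks, num_workers):
    """Greedy longest-first assignment: each chunk goes to the least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    plan = [[] for _ in range(num_workers)]
    for chunk in sorted(chunks, key=lambda c: c[2], reverse=True):
        load, w = heapq.heappop(heap)
        plan[w].append(chunk)
        heapq.heappush(heap, (load + chunk[2], w))
    return plan


kv_lens = [1000, 40, 300, 60]  # KV length per request (made-up numbers)
plan = assign_chunks(split_into_chunks(kv_lens, max_chunk=256), num_workers=4)
loads = [sum(length for _, _, length in work) for work in plan]
```

With one request per worker, the slowest worker would process all 1000 tokens of the longest request. After splitting and regrouping, the largest entry in `loads` is well below that, at the cost of a later step that merges the partial results of a split request.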
Reference
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Discussion