[KVCache] PagedAttention
Previous Work
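Memory Fragmentation
internal fragmentation: each request reserves KV cache space for its maximum token length, and the part it never fills is wasted.
external fragmentation: free memory is scattered in pieces that are not contiguous enough to fit a new request.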
Example:
request A: max tokens=2048
request B: max tokens=512
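The two kinds of fragmentation can be made concrete with a little arithmetic. This is only a rough sketch: the reserved sizes come from the example above, while the `used` token counts and the `free_chunks` sizes are made-up numbers chosen for illustration, and `fits_contiguously` is a hypothetical helper.

```python
# Hypothetical numbers: the "used" counts and free chunk sizes are
# illustrative assumptions, not measurements from the paper.
reserved = {"A": 2048, "B": 512}   # slots reserved up front (max tokens)
used = {"A": 300, "B": 100}        # slots actually filled (assumed)

# Internal fragmentation: reserved slots that are never filled.
internal_waste = {name: reserved[name] - used[name] for name in reserved}
utilization = sum(used.values()) / sum(reserved.values())


def fits_contiguously(free_chunks: list[int], request_size: int) -> bool:
    """A contiguous allocator needs one single free chunk that is large enough."""
    return any(chunk >= request_size for chunk in free_chunks)


# External fragmentation: enough free slots in total, but split into pieces.
free_chunks = [1024, 1024, 512]    # assumed sizes of scattered free regions
total_free = sum(free_chunks)
can_place_a = fits_contiguously(free_chunks, reserved["A"])

print(internal_waste)
print(f"utilization = {utilization:.1%}")
print(f"total free = {total_free}, request A fits contiguously: {can_place_a}")
```

Under these assumed numbers, most of the reserved space is never used, and a request of size 2048 cannot be placed even though 2560 slots are free in total.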
Key Contributions
block table translation
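each block table entry stores:
- physical block number: index of the physical KV block that the logical block maps to
- filled: number of filled token slots in that block, from 0 to the block size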
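The translation step can be sketched as follows. This is a minimal illustration, not vLLM's actual code: `BLOCK_SIZE = 16`, the `BlockTableEntry` class, and the `translate` function are assumed names and values introduced only for this example.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16  # assumed number of token slots per block


@dataclass
class BlockTableEntry:
    physical_block: int  # index of the physical block
    filled: int          # number of filled slots, 0..BLOCK_SIZE


def translate(block_table: list[BlockTableEntry], token_pos: int) -> tuple[int, int]:
    """Map a logical token position to (physical block number, offset in block)."""
    if token_pos < 0:
        raise IndexError("token position must be non-negative")
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    entry = block_table[logical_block]  # raises IndexError past the last block
    if offset >= entry.filled:
        raise IndexError("slot has not been written yet")
    return entry.physical_block, offset


# Logical blocks 0, 1, 2 live in physical blocks 7, 1, 3 (not contiguous).
table = [
    BlockTableEntry(physical_block=7, filled=16),
    BlockTableEntry(physical_block=1, filled=16),
    BlockTableEntry(physical_block=3, filled=5),
]

print(translate(table, 0))   # (7, 0)
print(translate(table, 20))  # (1, 4)
print(translate(table, 36))  # (3, 4)
```

The request sees a contiguous sequence of token positions, while the physical blocks behind it can sit anywhere in memory.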
Two requests at the same time
- each request has its own block table, and their logical blocks map to different physical blocks
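A rough sketch of two requests drawing blocks on demand from one shared pool. `BlockPool` and `Request` are hypothetical classes written for this note, and `BLOCK_SIZE = 4` is an assumed small value that keeps the example short.

```python
from collections import deque

BLOCK_SIZE = 4  # assumed small block size to keep the example readable


class BlockPool:
    """Hypothetical free list of physical block numbers."""

    def __init__(self, num_blocks: int) -> None:
        self.free = deque(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free physical blocks")
        return self.free.popleft()


class Request:
    """Hypothetical request that grows its block table one token at a time."""

    def __init__(self, pool: BlockPool) -> None:
        self.pool = pool
        self.block_table: list[int] = []  # logical block index -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is taken only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1


pool = BlockPool(num_blocks=8)
req_a, req_b = Request(pool), Request(pool)

for _ in range(5):
    req_a.append_token()
for _ in range(3):
    req_b.append_token()
for _ in range(4):
    req_a.append_token()

print(req_a.block_table)  # [0, 1, 3]
print(req_b.block_table)  # [2]
```

Neither request reserves space for its maximum length up front, and request A ends up with non-contiguous physical blocks because B took one in between.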
shared prefix
sequences with a common prefix can share the KV cache at block granularity, tracked by a reference count per physical block
- Block7: ref count=2
- Block1: ref count=2->1 (the sequences diverge at the last token, so the block is copied on write)
- Block3: ref count=1
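The reference counts listed above can be reproduced with a small copy-on-write sketch. `RefCountedPool`, `fork`, and `write_last_block` are hypothetical names for this note, and copying the actual KV data is left out. The block numbers 7, 1, 3 are simply chosen to match the example.

```python
class RefCountedPool:
    """Hypothetical pool that tracks a reference count per physical block."""

    def __init__(self, free_blocks: list[int]) -> None:
        self.free = list(free_blocks)
        self.ref_count: dict[int, int] = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free physical blocks")
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_count[block] += 1

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


def fork(pool: RefCountedPool, block_table: list[int]) -> list[int]:
    """Create a second sequence that shares every block of the prefix."""
    for block in block_table:
        pool.share(block)
    return list(block_table)


def write_last_block(pool: RefCountedPool, block_table: list[int]) -> int:
    """Return the block to write into, copying first if it is shared."""
    last = block_table[-1]
    if pool.ref_count[last] == 1:
        return last                  # sole owner: write in place
    new_block = pool.allocate()      # the KV data copy itself is omitted here
    pool.release(last)
    block_table[-1] = new_block
    return new_block


pool = RefCountedPool(free_blocks=[7, 1, 3])
seq_a = [pool.allocate(), pool.allocate()]  # [7, 1]
seq_b = fork(pool, seq_a)                   # shares blocks 7 and 1
write_last_block(pool, seq_b)               # seq_b diverges at the last token

print(seq_a, seq_b)    # [7, 1] [7, 3]
print(pool.ref_count)  # {7: 2, 1: 1, 3: 1}
```

Only the block where the sequences diverge is duplicated; the full prefix block stays shared.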
Reference
Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussion