
[KVCache] PagedAttention


Previous Work

Memory Fragmentation

internal fragmentation: memory is reserved for a request's max token size, but the slots the request never fills are wasted.
external fragmentation: free memory is scattered, so no contiguous region is large enough for a new request.

Example (each request reserves contiguous space for its max tokens up front):
request A: max tokens = 2048
request B: max tokens = 512
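
To make the internal fragmentation concrete, here is a small arithmetic sketch. Only the max token values come from the example above; the actual lengths (300 and 200 tokens) and the helper name `wasted_slots` are assumptions made up for illustration.

```python
def wasted_slots(reserved: int, used: int) -> int:
    """KV-cache slots that were reserved up front but never filled."""
    if used > reserved:
        raise ValueError("used cannot exceed reserved")
    return reserved - used


# (reserved max tokens, tokens actually produced).
# The second number in each pair is a hypothetical value, not from the paper.
requests = {"A": (2048, 300), "B": (512, 200)}

waste = {name: wasted_slots(r, u) for name, (r, u) in requests.items()}
total_reserved = sum(r for r, _ in requests.values())
waste_ratio = sum(waste.values()) / total_reserved

print(waste)
print(f"{waste_ratio:.1%} of the reserved slots are never used")
```

Under these assumed lengths, most of the reserved memory stays idle, which is the waste that block-level allocation is meant to avoid.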

Key Contributions

block table translation

Each block table entry stores:

  • physical block number: index of the physical block that holds the KV cache
  • filled: number of filled token slots, from 0 to the block size
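
The translation step can be sketched as a lookup from a token position to a (physical block, offset) pair. This is a minimal sketch, not vLLM's implementation: the names `BlockTableEntry` and `translate`, the block size of 16, and the physical block numbers are all assumptions for illustration.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16  # assumed number of token slots per block


@dataclass
class BlockTableEntry:
    physical_block: int  # index of the physical block
    filled: int          # number of filled slots, 0..BLOCK_SIZE


def translate(block_table, token_index, block_size=BLOCK_SIZE):
    """Map a logical token position to (physical block, offset in block)."""
    logical_block, offset = divmod(token_index, block_size)
    if logical_block >= len(block_table):
        raise IndexError("token lies beyond the allocated blocks")
    entry = block_table[logical_block]
    if offset >= entry.filled:
        raise IndexError("slot has not been filled yet")
    return entry.physical_block, offset


# Logical blocks 0, 1, 2 live in non-contiguous physical blocks 7, 1, 3.
table = [
    BlockTableEntry(physical_block=7, filled=16),
    BlockTableEntry(physical_block=1, filled=16),
    BlockTableEntry(physical_block=3, filled=5),
]

print(translate(table, 0))   # first token of the sequence
print(translate(table, 20))  # fifth slot of logical block 1
```

Because the lookup goes through the table, the physical blocks do not need to be contiguous in memory.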

Two requests at the same time

  • each request has its own block table, so the two requests are mapped to different physical blocks
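
A minimal sketch of how two concurrent requests could draw blocks from one shared free pool. The class `BlockAllocator`, the helper `admit`, the pool size of 8, and the request lengths are hypothetical names and values chosen for illustration.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()


def admit(allocator: BlockAllocator, num_tokens: int, block_size: int = 16):
    """Build a per-request list of physical blocks covering num_tokens."""
    blocks_needed = -(-num_tokens // block_size)  # ceiling division
    return [allocator.allocate() for _ in range(blocks_needed)]


allocator = BlockAllocator(num_blocks=8)
table_a = admit(allocator, num_tokens=40)  # request A needs 3 blocks
table_b = admit(allocator, num_tokens=20)  # request B needs 2 blocks

print(table_a, table_b, allocator.free)
```

Each request only takes the blocks it currently needs, and the remaining blocks stay available for later tokens or other requests.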

shared prefix

Requests with a common prefix can share the KV cache at block granularity, tracked by reference counts.

  1. Block7: ref count = 2
  2. Block1: ref count = 2 -> 1 (the sequences diverge at the last token, so the block is no longer shared)
  3. Block3: ref count = 1
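
The reference counts above can be reproduced with a small copy-on-write sketch. This is an illustration under assumptions, not the paper's code: `RefCountedBlocks` and its methods are hypothetical, the free list is seeded with blocks 7, 1, 3 only to match the numbers in the list, and the actual copying of KV data is omitted.

```python
class RefCountedBlocks:
    """Tracks how many sequences reference each physical block."""

    def __init__(self, free_blocks):
        self.free = list(free_blocks)
        self.ref_count = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        """Let another sequence reference an existing block."""
        self.ref_count[block] += 1
        return block

    def write(self, block: int) -> int:
        """Return the block to write into, copying first if it is shared."""
        if self.ref_count[block] == 1:
            return block  # sole owner, write in place
        self.ref_count[block] -= 1
        return self.allocate()  # a real system would also copy the KV data


pool = RefCountedBlocks(free_blocks=[7, 1, 3])
seq_a = [pool.allocate(), pool.allocate()]  # sequence A holds blocks 7 and 1
seq_b = [pool.share(b) for b in seq_a]      # sequence B shares the same prefix
seq_b[-1] = pool.write(seq_b[-1])           # B diverges in its last block

print(seq_a, seq_b, pool.ref_count)
```

Only the block where the sequences diverge is duplicated; the fully shared prefix block keeps a reference count of 2.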

Reference

Efficient Memory Management for Large Language Model Serving with PagedAttention
