iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
💸

What Happens When the Cache Hit Rate Changes by 1%?

に公開
2

In the section "2.3.14 Caching" of Systems Performance, there is the following description:

The performance difference between a 98% and 99% hit rate is much greater than that between a 10% and 11% hit rate. Because of the difference in speed between a cache hit and a cache miss—that is, the speed difference between the two storage tiers involved—this results in a non-linear profile.

https://www.oreilly.co.jp/books/9784814400072/

A diagram like the one below is also included, and looking at this figure, it certainly seems that a 1% difference becomes more significant as the hit rate increases.

However, I didn't quite have an intuitive grasp of it, so I decided to run an experiment.

Experiment Code

package main

import (
	"fmt"
	"time"
)

func main() {
	// Set the number of cache entries
	cacheCount := 99
	cache := make(map[int]bool, cacheCount)

	// Initialize the cache
	for i := 0; i < cacheCount; i++ {
		cache[i] = true
	}

	hits := 0
	misses := 0

	// Start measurement
	start := time.Now()

	for i := 0; i < 100; i++ {
		for j := 0; j < 100; j++ {
			if _, isHit := cache[j]; isHit {
				hits++
			} else {
				misses++
				time.Sleep(100 * time.Microsecond) // Simulate delay on cache miss
			}
		}
	}

	// Calculate processing end time
	duration := time.Since(start)

	// Display results
	fmt.Printf("Cache hit rate: %.2f%%\n", float64(hits)/float64(hits+misses)*100)
	fmt.Printf("Throughput: %.f operations/sec\n", float64(10000)/duration.Seconds())
}

https://go.dev/play/p/zL3BXBRtHMs

  • Caches numbers from 0 up to cacheCount - 1.
  • If not present in the cache, it sleeps for 100μs to simulate access to a disk or similar storage.
  • Performs cache hit processing for 0 to 99, 100 times each.
    • A total of 10,000 operations.

Results

The results of testing while varying cacheCount were as follows:

Cache Hit Rate (%) Throughput (ops/sec)
10 11111
11 11236
50 20000
51 20408
98 500000
99 1000000

It is even easier to understand when shown as a bar chart:

Looking at these results, I realized that:

  • Saying "99 hits out of 100"
  • Versus "98 hits out of 100"

doesn't quite make the difference clear. However, if you phrase it as:

  • Waiting 10 seconds only once in 100 operations
    • 99% cache hit rate
  • Waiting 10 seconds only once in 50 operations
    • 98% cache hit rate

then it becomes obvious that while the former finishes 100 operations in just over 10 seconds, the latter takes just over 20 seconds. You can clearly see that the number of operations per second differs by nearly a factor of two[1].

I suppose people who monitor performance daily would grasp this intuitively without needing to rephrase it, but I found it quite interesting.

Addendum: What happens if the speed difference between storage tiers is small?

As mentioned in the quote at the beginning of this article, the non-linear profile is caused by the "difference in speed between the two storage tiers involved." So, what would happen if those speeds were closer?
(In the original code, the difference is extreme because it compares memory/cache line access with a 100μs sleep.)

I tested this by modifying the code to simulate storage tier speed ratios of 1:1000, 1:100, and 1:10.

	for i := 0; i < 100; i++ {
		for j := 0; j < 100; j++ {
			if _, isHit := cache[j]; isHit {
				hits++
+				// Sleep even on a cache hit (this code is for a 1:100 tier ratio)
+				time.Sleep(1 * time.Microsecond)
			} else {
				misses++
				time.Sleep(100 * time.Microsecond) // Simulate delay on cache miss
			}
		}
	}
Cache Hit Rate (%) 1:1000 Tier Ratio 1:100 Tier Ratio 1:10 Tier Ratio
10 11110 11099 10989
11 11235 11222 11099
50 19980 19802 18182
51 20387 20198 18484
98 476644 335570 84746
99 909918 502513 91743

We can see that as the speed difference between storage tiers becomes smaller, the throughput improvement at 98% → 99% also decreases.
Additionally, when the hit rate is around 50%, the throughput doesn't change significantly even with different tier ratios.
Conversely, if you can only achieve a hit rate of about 50%, choosing a storage tier with a 1:10 ratio might be acceptable, which can be a useful factor in cost-performance decisions.

Addendum 2: Cache Hit Rate in Real Environments

In real-world environments, it is crucial to understand "what kind of workload the cache hit rate is composed of."

For example, the following situation could be considered for a web service database:

  • During daytime access, small-scale data reads and writes account for the majority.
  • During nighttime batch processing, large amounts of data, different from those accessed during the day, are read and written.

In this case, while a relatively high cache hit rate can be obtained during daytime processing, the hit rate becomes lower during nighttime batch processing, and the overall measured cache hit rate reflects the influence of these different workloads.

Due to these characteristics, it may be realistically difficult to aim for an improvement in the cache hit rate without changing the workload. Furthermore, trying to force an improvement could lead to over-provisioning of resources.

In addition, there are many different types of caches.
(In the database example mentioned, there are multiple caches with different purposes.)

Caches are utilized at various layers, including CPU, disk, and network, and each cache has different benchmark hit rates and tuning methods, making it a very deep topic...

脚注
  1. Since the processing time is not zero even when a cache hit occurs, I wrote "just over." ↩︎

Discussion

wataberyotawataberyota

若干おせっかい気味+まとまりのないコメントですが・・・

この記事を読んで「ウチのシステムのキャッシュヒット率が98%だから、なんとしても99%まで上げないとダメだ!」と誤解してしまう人がいることを若干懸念しました。
この記事の趣旨は「キャッシュヒット率を1%でもUPできるのであれば、それがパフォーマンス改善に大きく寄与する点を示すこと」で、「キャッシュヒット率98%はNGと主張する」ものではない認識です。

近年の多くのシステムにおけるキャッシュヒット率に関わる動作は、

  • ほとんどすべての処理で、キャッシュヒット率がほぼ100%
  • 実行頻度が低い処理のうち、いくつかについては、キャッシュヒット率が0%
    のようなものであり、これらの平均値が「98%」なり「95%」なりという数値になっているためです。
    このような状況であるため、多くの場合、「キャッシュヒット率は十分に高く」、「これ以上キャッシュヒット率を高めることは現実的に難しい」はずです。
hmarui66hmarui66

実稼働システムにおいて、キャッシュヒット率がどんなワークロードによって構成されているかは、重要な観点ですね。

その辺り、記事に追記しようと思います。
ご指摘ありがとうございました!