Implementing Profile-guided Optimization (PGO) in Production (Part 1)

Introduction

Are you all using Go 1.21 yet?

In Go 1.21, several packages leveraging generics were released as standard packages, such as the slices package, which simplifies slice operations. Additionally, Profile-guided Optimization (PGO) has reached General Availability (GA).

In this blog post, I would like to introduce PGO, which was released as GA in Go 1.21!

I am planning to actually introduce PGO into our production environment. Before doing so, I investigated whether it would be effective by introducing it in a local environment that mimics production!

I have summarized these findings in this first part. If I am able to successfully introduce it into the production environment, I will summarize those results in a follow-up post!

What you will learn from this article

  • What Profile-guided Optimization (PGO) is
  • How to implement PGO
  • The level of improvement expected when introducing PGO to a web service
  • Points to consider when implementing PGO

What is Profile-guided Optimization (PGO)?

First, what exactly is PGO? Detailed information can be found in the official Go blog post, Profile-guided optimization in Go 1.21.

Based on that post, here is my personal summary of the concept:

  • The Go compiler performs several optimizations to ensure the built binary performs as well as possible (inlining, escape analysis, etc.).
  • While improvements have been made with every release, it is not an easy task.
  • The compiler performs optimizations without information on how the code is actually used in a production environment.
  • Providing profiles from the production environment during the build process to enable better optimizations is called PGO.

With PGO enabled, the compiler appears to optimize code that it would normally leave alone, for example by inlining hot functions that exceed the usual inlining budget (Reference: https://github.com/golang/go/blob/2da8a55584aa65ce1b67431bb8ecebf66229d462/src/cmd/compile/internal/inline/inl.go#L310)

How much improvement can be expected?

The official documentation includes the following statement:

As of Go 1.21, benchmarks for a representative set of Go programs show that building with PGO improves performance by around 2-7%.

We can expect an improvement of around 2-7% 🎉

How to use PGO?

You can use PGO by following these steps:

  1. Release a binary to the production environment without PGO enabled.
  2. Collect profiles from the production environment.
  3. Release a new binary based on the collected profiles.
  4. Return to step 2.

As indicated by the instruction to return to step 2, PGO is not a one-time optimization but one that needs to be performed continuously.

Trying it out with the blog's example

First, I will try out the example from the Profile-guided optimization in Go 1.21 blog post.

Please refer to the aforementioned blog for the detailed execution commands; here, I will simply introduce the key points.

In the Go blog, a service that converts Markdown files to HTML is used as the target for PGO. After starting that service locally, profiles are collected using net/http/pprof. A PGO-enabled binary is generated based on the collected profiles, and performance is compared against a binary without PGO using benchmarks.
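
As a rough sketch of the profile collection step (not the exact commands from the Go blog), a CPU profile can be downloaded from the /debug/pprof/profile endpoint that net/http/pprof registers. The helper names profileURL and fetchCPUProfile, the 1-second duration, and the in-process httptest server standing in for the real service are all assumptions made so that the example runs on its own.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"os"
	"path/filepath"
)

// profileURL builds the URL of the CPU profile endpoint; seconds is how long
// the server samples before it responds. (Hypothetical helper.)
func profileURL(baseURL string, seconds int) string {
	return fmt.Sprintf("%s/debug/pprof/profile?seconds=%d", baseURL, seconds)
}

// fetchCPUProfile downloads a CPU profile and saves it to path.
// (Hypothetical helper.)
func fetchCPUProfile(baseURL string, seconds int, path string) error {
	resp, err := http.Get(profileURL(baseURL, seconds))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

func main() {
	// Stand-in for the real service: an in-process server that only exposes
	// the pprof handlers. In practice baseURL would point at the running
	// service, and load would be applied while the profile is collected.
	srv := httptest.NewServer(http.DefaultServeMux)
	defer srv.Close()

	out := filepath.Join(os.TempDir(), "cpu.pprof")
	if err := fetchCPUProfile(srv.URL, 1, out); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %s", out)
}
```

A profile taken from an idle stand-in server like this is of course not useful for PGO; the point is only the shape of the request.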


Here are the results from my own attempt:

                    No PGO          With PGO        vs base
Time per operation  172.3µs ± 0%    172.9µs ± 1%    +0.36% (p=0.027 n=40)

What I want to draw attention to is the vs base column. It shows how the time per operation of the PGO-enabled binary changed relative to the binary without PGO; a positive value means the PGO-enabled binary was slower.

The result was +0.36%, meaning the performance actually worsened with PGO enabled 😞
However, since the absolute difference is only 0.6µs, it could be interpreted as being within the margin of error. In the case of the blog's example, PGO did not seem to be effective.
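
For reference, the vs base figure is a relative change against the No PGO value. The small sketch below recomputes it from the rounded numbers in the table; relativeChange is a hypothetical helper, not part of any tool. With the rounded inputs it gives about +0.35% rather than the reported +0.36%, presumably because the reported value was computed from unrounded measurements.

```go
package main

import "fmt"

// relativeChange returns how much newVal differs from base, as a percentage
// of base. For timings, a positive result means the new binary is slower.
func relativeChange(base, newVal float64) float64 {
	return (newVal - base) / base * 100
}

func main() {
	// Rounded values from the table above, in microseconds.
	fmt.Printf("%+.2f%%\n", relativeChange(172.3, 172.9))
}
```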

Trying it out with our company's server

Now, here is the real deal. I would like to apply PGO to our company's server.

Since jumping straight into the production environment would be time-consuming and difficult, I will first attempt to reproduce an environment as close to production as possible locally and introduce PGO there.

There were four steps to implement PGO:

  1. Release a binary to the production environment without PGO enabled.
  2. Collect profiles from the production environment.
  3. Release a new binary based on the collected profiles.
  4. Return to step 2.

For validation, I will apply these four steps to our company's server as follows:

  1. Release a binary to the production environment without PGO enabled.
    • Rewrite the code so that the same binary as in production can be used in the local environment.
  2. Collect profiles from the production environment.
    • Investigate the top 3 most frequently called APIs.
    • Use locust + boomer (explained below) to apply load and capture profiles.
  3. Release a new binary based on the collected profiles.
    • Build a new binary using the collected profiles.
  4. Return to step 2.
    • Instead of returning to step 2, compare performance using benchmarks.

Step 1: Release a binary to the production environment without PGO enabled

I will omit the details for this step, as it simply requires being able to build and run the same binary as production in a local environment.

Step 2: Collect profiles from the production environment

For this step, I attempted to collect profiles by applying a load as close to the production environment as possible.

First, I investigated the top 3 most frequently called APIs. I will refer to them as API 1, API 2, and API 3 (with smaller numbers indicating more frequent calls). Furthermore, I examined the following items for these APIs:

  • How many times are they called in total?
  • How often are they called with and without a session?
  • How many requests per second do they receive?

By comparing these investigation results with the number of running servers, I estimated the ratio and load to apply to each API for a single server in the local environment.
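
The estimation itself is simple arithmetic. A minimal sketch follows, assuming the load is spread evenly across servers; the request rates and server count are made-up numbers for illustration, not our production figures, and perServerRPS and weights are hypothetical helpers.

```go
package main

import "fmt"

// perServerRPS divides each API's fleet-wide requests per second by the
// number of running servers. It returns an empty map if servers <= 0.
func perServerRPS(totalRPS map[string]float64, servers int) map[string]float64 {
	out := make(map[string]float64, len(totalRPS))
	if servers <= 0 {
		return out
	}
	for api, rps := range totalRPS {
		out[api] = rps / float64(servers)
	}
	return out
}

// weights returns each API's share of the combined load as a fraction.
func weights(rps map[string]float64) map[string]float64 {
	var total float64
	for _, v := range rps {
		total += v
	}
	out := make(map[string]float64, len(rps))
	if total == 0 {
		return out
	}
	for api, v := range rps {
		out[api] = v / total
	}
	return out
}

func main() {
	// Hypothetical fleet-wide request rates and server count.
	total := map[string]float64{"API 1": 600, "API 2": 300, "API 3": 100}
	local := perServerRPS(total, 20)
	share := weights(local)
	for _, api := range []string{"API 1", "API 2", "API 3"} {
		fmt.Printf("%s: %.1f req/s, %.0f%% of load\n", api, local[api], share[api]*100)
	}
}
```

The per-server rate can then be used as the target load, and the shares as the relative weight of each API in the load scenario.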


Once the estimation was complete, I used locust and boomer to apply load and collect profiles. Personally, I liked boomer because it allows writing load scenarios in Go and enables profile collection without modifying the server-side code.

Start locust with the following command:

 locust --master -f dummy.py

Then, capture profiles while applying load with boomer:

 go build -o boomer main.go && ./boomer --max-rps 10 -cpu-profile cpu.pprof -cpu-profile-duration 60s

As for the duration of the profile capture, I chose 60 seconds for this validation.

Step 3: Release a new binary based on the collected profiles

To enable PGO, you pass the -pgo=auto flag to the go build command.
With -pgo=auto, the build uses the default.pgo file in the main package's directory if it exists. In Go 1.21, -pgo=auto is the default, so placing the profile there as default.pgo is enough.

It is also possible to specify the path to the profile explicitly. For more details on PGO-enabled builds, please refer to this page.
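
To check that a build really picked up a profile, one option is to look at the build settings embedded in the binary. The sketch below assumes the toolchain records the profile path under the key "-pgo" (the same setting that go version -m should show for the binary); pgoProfile is a hypothetical helper.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// pgoProfile reports the profile path recorded in the running binary's build
// settings, if any. Assumption: the setting key is "-pgo".
func pgoProfile() (string, bool) {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return "", false
	}
	for _, s := range info.Settings {
		if s.Key == "-pgo" && s.Value != "" {
			return s.Value, true
		}
	}
	return "", false
}

func main() {
	if path, ok := pgoProfile(); ok {
		fmt.Println("built with PGO profile:", path)
	} else {
		fmt.Println("no PGO profile recorded in build info")
	}
}
```

Logging this once at startup would make it easy to tell PGO and non-PGO binaries apart when comparing them.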

Step 4: Compare performance using benchmarks

Finally, I ran benchmarks to compare performance.

For these benchmarks, I created code that simply calls the API a specified number of times. Ideally, the benchmarks should have been written based on the frequency and ratios investigated in Step 2, but I simplified it to just calling them a specified number of times.
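
A minimal sketch of that kind of benchmark is shown below. It is not our actual benchmark code: the trivial handler behind httptest stands in for the real API, and benchmarkGET and runAgainstStandIn are hypothetical names. testing.Benchmark is used only so that the example runs as a plain program.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

// benchmarkGET issues b.N sequential GET requests to url.
func benchmarkGET(b *testing.B, url string) {
	for i := 0; i < b.N; i++ {
		resp, err := http.Get(url)
		if err != nil {
			b.Fatal(err)
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}

// runAgainstStandIn benchmarks a trivial in-process handler that stands in
// for the real API server.
func runAgainstStandIn() testing.BenchmarkResult {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()
	return testing.Benchmark(func(b *testing.B) { benchmarkGET(b, srv.URL) })
}

func main() {
	// Prints the iteration count and the time per operation.
	fmt.Println(runAgainstStandIn())
}
```

For a real comparison, the same loop would normally live in a Benchmark function in a _test.go file, be run several times against each binary with go test -bench, and the two sets of results compared.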


Here are the results:

Test Case             No PGO          With PGO        vs base
API 1 (with session)  6.106ms ± 17%   5.447ms ± 38%   -10.79% (p=0.021, n=40)
API 1 (no session)    1.793ms ± 23%   1.799ms ± 19%   ~ (p=0.844, n=40)
API 2 (with session)  2.625ms ± 8%    2.674ms ± 33%   ~ (p=0.589, n=40)
API 2 (no session)    2.110ms ± 32%   2.205ms ± 26%   ~ (p=0.663, n=40)
API 3 (with session)  6.208ms ± 5%    5.715ms ± 5%    -7.95% (p=0.028, n=40)
API 3 (no session)    1.704ms ± 41%   2.137ms ± 32%   ~ (p=0.677, n=40)

Looking purely at the results, introducing PGO improved latency by about 8-11% for API 1 and API 3 when sessions are used 🎉🎉🎉 For the other cases, marked ~, no statistically significant difference was detected.

Discussion

Although I have listed the simple numerical results, I found the outcome to be very interesting. Based on these results, I think the following points deserve further discussion:

  1. Why was no improvement observed from PGO when sessions were not used?
  2. The official documentation suggests an expected performance improvement of 2-7%, so why are the measured figures of 7.95% and 10.79% higher?
  3. Which parts are actually being accelerated in API 1 and API 3, where improvements were observed?

I would like to investigate these points a bit further!

Notes on PGO

Finally, I would like to mention two important notes regarding PGO:

  • Source stability
  • Iterative stability

Source stability

There may be slight differences between the code used to collect profiles and the code built using those collected profiles. Consequently, some parts of the code built using the collected profiles will not be optimized by PGO. However, since this is localized and most functions will still benefit from PGO optimizations, it should not be a major issue.

That said, significant refactoring, function renaming, or moving functions across packages might prevent PGO from optimizing effectively, so I believe it is important to continuously collect profiles.

Iterative stability

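Once PGO has optimized a function, the next profile is collected from the already-optimized binary, so that function may look different in the new profile (for example, less hot). This could make performance swing in unexpected ways between successive builds. It seems Go applies PGO conservatively to avoid such issues. This property is called iterative stability.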

Summary

  • As a preliminary investigation before actual production deployment, I reproduced an environment close to production locally and introduced PGO.
  • For two out of the top 3 most frequently called APIs, I observed a latency improvement of about 8-11% when sessions were used.
  • There are still points that need further consideration and investigation, such as iterative stability, which I hope to look into when I have more time.