Python Boto3のexponential backoffの振る舞いを試す

Boto3などのAWS SDKはAPI失敗時にリトライしてくれる機能がある。

https://docs.aws.amazon.com/ja_jp/sdkref/latest/guide/feature-retry-behavior.html
その際にリトライ回数に応じてリトライ間の待ち秒数を長くしていく "exponential backoff" というアルゴリズムが使われている。

以下のロジックで待ち秒数が変わるらしい。
再試行可能なエラーの場合、再試行遅延は台形型エクスポネンシャルバックオフを使用して計算されます。は、ジッターで切り捨てられたバイナリエクスポネンシャルバックオフSDKsを使用します。次のアルゴリズムは、i リクエストに対する応答の休止時間（秒単位）がどのように定義されているかを示しています。
seconds_to_sleep_i = min(b*r^i, MAX_BACKOFF)
前述のアルゴリズムでは、以下の値が適用されます。

b = random number within the range of: 0 <= b <= 1

r = 2

MAX_BACKOFF = 20 seconds ほとんどの では SDKs。確認のために、特定のSDKガイドまたはソースコードを参照してください。
boto3側の説明にも。
Retry attempts on several HTTP status codes, including 429, 500, 502, 503, 504, and 509.

Any retry attempt will include an exponential backoff by a base factor of 2.
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#legacy-retry-mode
なんとなくわかりそうだが、念の為実際に動かして試してみる

koh-sh

Claudeと相談しながらスクリプトを作った。

スクリプト

import boto3
from botocore.config import Config
import logging
import time


class RetryHandler(logging.Handler):
    def __init__(self):
        super().__init__()
        self.retry_count = 0

    def emit(self, record):
        msg = record.getMessage()
        if 'Retry needed, action of:' in msg:
            self.retry_count += 1
            # "Retry needed, action of: X.XXX" から数値を抽出
            sleep_time = float(msg.split('action of: ')[1])
            print(f"リトライ {self.retry_count}: {sleep_time:.3f} 秒間スリープ")


def setup_logging():
    # 全てのログを無効化
    for logger_name in ['botocore', 'urllib3', 'urllib3.connectionpool']:
        logging.getLogger(logger_name).setLevel(logging.CRITICAL)

    # retryhandlerのログだけを有効化
    retry_logger = logging.getLogger('botocore.retryhandler')
    retry_logger.setLevel(logging.DEBUG)

    handler = RetryHandler()
    retry_logger.addHandler(handler)

    return handler


def create_client():
    config = Config(
        connect_timeout=0.001,
        retries=dict(
            max_attempts=10,
        )
    )

    return boto3.client(
        's3',
        endpoint_url='http://localhost:54321',
        config=config
    )


def measure_retry_times():
    handler = setup_logging()
    client = create_client()
    print("リトライの計測を開始します...\n")
    start_time = time.time()

    try:
        response = client.list_buckets()
    except Exception as e:
        pass
    finally:
        total_time = time.time() - start_time
        print(f"\n計測完了: 合計実行時間 {total_time:.2f}秒")
        if handler.retry_count == 0:
            print("リトライは発生しませんでした")


if __name__ == '__main__':
    measure_retry_times()

ざっくり流れを説明すると

APIのend pointを http://localhost:54321 に設定。かつtimeoutを極端に短くする。
- このend pointはリクエスト受け付けしていないのでエラーになる
log levelをDEBUGにすると DEBUG:botocore.retryhandler:Retry needed, action of: 15.229222111676261 のように待ち時間が出力される
loggingをカスタマイズして不要な出力を消しつつ、上記の待ち時間の部分を取得してprintする

koh-sh

実行結果は以下のようになります。

% python retry.py
リトライの計測を開始します...

リトライ 1: 0.932 秒間スリープ
リトライ 2: 0.452 秒間スリープ
リトライ 3: 0.762 秒間スリープ
リトライ 4: 2.498 秒間スリープ
リトライ 5: 9.523 秒間スリープ
リトライ 6: 25.979 秒間スリープ
リトライ 7: 23.023 秒間スリープ
リトライ 8: 90.973 秒間スリープ
リトライ 9: 161.616 秒間スリープ
リトライ 10: 101.076 秒間スリープ

計測完了: 合計実行時間 416.93秒

乱数によって秒数が変わりつつも、リトライを重ねるごとに秒数が増えていくのが確認できた。

koh-sh

client生成部分の引数からconfigの部分を削除するとデフォルトの挙動を確認できる

--- a/retry.py
+++ b/retry.py
@@ -44,7 +44,6 @@ def create_client():
     return boto3.client(
         's3',
         endpoint_url='http://localhost:54321',
-        config=config
     )

% python retry.py
リトライの計測を開始します...

リトライ 1: 0.572 秒間スリープ
リトライ 2: 1.146 秒間スリープ
リトライ 3: 0.110 秒間スリープ
リトライ 4: 1.752 秒間スリープ

計測完了: 合計実行時間 3.62秒

koh-sh

boto3はデフォルトだとlegacyモードで動く。
standardモードで試すには、ログの出力のされ方が変わるので少し修正が必要

--- a/retry.py
+++ b/retry.py
@@ -11,10 +11,10 @@ class RetryHandler(logging.Handler):
 
     def emit(self, record):
         msg = record.getMessage()
-        if 'Retry needed, action of:' in msg:
+        if 'retrying request after delay of:' in msg:
             self.retry_count += 1
-            # "Retry needed, action of: X.XXX" から数値を抽出
-            sleep_time = float(msg.split('action of: ')[1])
+            # "retrying request after delay of: X.XXX" から数値を抽出
+            sleep_time = float(msg.split('delay of: ')[1])
             print(f"リトライ {self.retry_count}: {sleep_time:.3f} 秒間スリープ")
 
 
@@ -23,8 +23,8 @@ def setup_logging():
     for logger_name in ['botocore', 'urllib3', 'urllib3.connectionpool']:
         logging.getLogger(logger_name).setLevel(logging.CRITICAL)
 
-    # retryhandlerのログだけを有効化
-    retry_logger = logging.getLogger('botocore.retryhandler')
+    # botocoreのリトライログのみを有効化
+    retry_logger = logging.getLogger('botocore.retries.standard')
     retry_logger.setLevel(logging.DEBUG)
 
     handler = RetryHandler()
@@ -38,6 +38,7 @@ def create_client():
         connect_timeout=0.001,
         retries=dict(
             max_attempts=10,
+            mode="standard"
         )
     )

% python retry.py
リトライの計測を開始します...

リトライ 1: 0.114 秒間スリープ
リトライ 2: 1.878 秒間スリープ
リトライ 3: 1.514 秒間スリープ
リトライ 4: 7.344 秒間スリープ
リトライ 5: 6.158 秒間スリープ
リトライ 6: 13.427 秒間スリープ
リトライ 7: 0.658 秒間スリープ
リトライ 8: 1.482 秒間スリープ
リトライ 9: 9.904 秒間スリープ
リトライ 10: 17.276 秒間スリープ

計測完了: 合計実行時間 59.83秒

standardモードだとMAX_BACKOFF = 20 secondsが採用されるので、20秒以上の待ち時間になることはなさそう。

koh-sh

適当にエラーを返すwebサーバ作った方が見やすかった。

webサーバ (Go)

package main

import (
    "log"
    "net/http"
    "time"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // アクセスログを出力
        log.Printf(
            "[%s] %s %s from %s",
            time.Now().Format("2006-01-02 15:04:05"),
            r.Method,
            r.URL.Path,
            r.RemoteAddr,
        )

        // 429 Too Many Requestsを返す
        w.WriteHeader(http.StatusTooManyRequests)
        w.Write([]byte("Rate limit exceeded"))
    })

    log.Println("Server starting on :8080")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        log.Fatal(err)
    }
}

@@ -43,7 +43,7 @@ def create_client():

     return boto3.client(
         's3',
-        endpoint_url='http://localhost:54321',
+        endpoint_url='http://localhost:8080',
         config=config
     )

% go run ./main.go
2024/12/20 02:02:22 Server starting on :8080
2024/12/20 02:02:48 [2024-12-20 02:02:48] GET / from [::1]:55496
2024/12/20 02:02:49 [2024-12-20 02:02:49] GET / from [::1]:55496
2024/12/20 02:02:51 [2024-12-20 02:02:51] GET / from [::1]:55496
2024/12/20 02:02:54 [2024-12-20 02:02:54] GET / from [::1]:55496
2024/12/20 02:03:00 [2024-12-20 02:03:00] GET / from [::1]:55496
2024/12/20 02:03:02 [2024-12-20 02:03:02] GET / from [::1]:55496
2024/12/20 02:03:06 [2024-12-20 02:03:06] GET / from [::1]:55496
2024/12/20 02:03:55 [2024-12-20 02:03:55] GET / from [::1]:55496
2024/12/20 02:05:17 [2024-12-20 02:05:17] GET / from [::1]:55496
2024/12/20 02:07:41 [2024-12-20 02:07:41] GET / from [::1]:55496
2024/12/20 02:09:21 [2024-12-20 02:09:21] GET / from [::1]:55496

% python retry.py
リトライの計測を開始します...

リトライ 1: 0.694 秒間スリープ
リトライ 2: 1.711 秒間スリープ
リトライ 3: 3.282 秒間スリープ
リトライ 4: 5.940 秒間スリープ
リトライ 5: 1.606 秒間スリープ
リトライ 6: 3.939 秒間スリープ
リトライ 7: 49.351 秒間スリープ
リトライ 8: 82.183 秒間スリープ
リトライ 9: 144.045 秒間スリープ
リトライ 10: 99.679 秒間スリープ

計測完了: 合計実行時間 392.53秒

このスクラップは2024/12/20にクローズされました