
Workaround for the CloudWatch PutLogEvents "cannot span more than 24 hours" error


Summary

At BoostDraft, we had been using AWS Logging .NET to store error logs in AWS CloudWatch and Elasticsearch, but during a routine log review we noticed that a large volume of error logs was being recorded. The cause was that batches being sent contained log events spanning more than 24 hours, which triggered errors due to an AWS restriction.
Our investigation confirmed that error logs were being generated endlessly inside a while loop, so we made two fixes:

  • Discard the log events in the sending batch when an exception occurs
  • Filter out logs older than 24 hours before sending

With these fixes, log noise decreased, CloudWatch performance improved, and error tracking became more effective.

Background

At BoostDraft, we use AWS Logging .NET to capture errors and exceptions from our products and store them in AWS CloudWatch. We also forward these logs to Elasticsearch for analysis, following the approach outlined in this gist.
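For context, the wiring looks roughly like the following minimal sketch, assuming the AWS.Logger.AspNetCore package provides the logging-builder extension; the log group name and region are placeholders, and the Elasticsearch forwarding from the gist is omitted:

```csharp
using AWS.Logger;                       // AWSLoggerConfig
using Microsoft.Extensions.Logging;

// Minimal sketch: route Microsoft.Extensions.Logging output to CloudWatch Logs
// via AWS Logging .NET. "MyApp-Errors" and the region are placeholders.
using var loggerFactory = LoggerFactory.Create(logging =>
{
    logging.AddAWSProvider(new AWSLoggerConfig("MyApp-Errors")
    {
        Region = "us-east-1"
    });
});

var logger = loggerFactory.CreateLogger("Example");
logger.LogError("Example error that ends up in CloudWatch");
```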

During a routine log review, we noticed a flood of error logs in Elasticsearch. The majority of these errors stemmed from the following message:

An error occurred (InvalidParameterException) when calling the PutLogEvents operation: The batch of log events in a single PutLogEvents request cannot span more than 24 hours

This is a restriction imposed by AWS, as documented here:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs/client/put_log_events.html#put-log-events
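In practice, the rule means the oldest and newest timestamps in a single PutLogEvents request must be no more than 24 hours apart. A quick way to express the check (a hypothetical helper, using only the .NET CloudWatch Logs SDK types):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Amazon.CloudWatchLogs.Model;

static class BatchChecks
{
    // True if the oldest and newest events in the batch are more than 24 hours
    // apart, i.e. the batch would violate the PutLogEvents span rule.
    public static bool SpansMoreThan24Hours(IReadOnlyCollection<InputLogEvent> batch)
    {
        if (batch.Count < 2)
            return false;

        DateTime oldest = batch.Min(e => e.Timestamp);
        DateTime newest = batch.Max(e => e.Timestamp);
        return newest - oldest > TimeSpan.FromHours(24);
    }
}
```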

Identifying the Root Cause

Since AWS Logging .NET is open-source, we were able to investigate the issue directly in the source code:
https://github.com/aws/aws-logging-dotnet

After tracing the code, we identified a potential issue in this line:
https://github.com/aws/aws-logging-dotnet/blob/ec3931ea9bb1f854be209ebabf032772d2e7d626/src/AWS.Logger.Core/Core/AWSLoggerCore.cs#L409

This line sits inside a while loop, which we suspected was causing the repeated error logs. To confirm, we manually threw a general exception inside the try block and observed that LogLibraryServiceError(ex) kept getting called indefinitely.
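Stripped down, the loop has roughly this shape (paraphrased for illustration; LogLibraryServiceError is the method named in the linked source, the rest is simplified):

```csharp
// Paraphrased shape of the background send loop in AWSLoggerCore.
while (!token.IsCancellationRequested)
{
    try
    {
        // Flush the pending batch to CloudWatch.
        await SendMessages(token).ConfigureAwait(false);
    }
    catch (Exception ex)
    {
        // If CloudWatch keeps rejecting the same batch (e.g. because it spans
        // more than 24 hours), this handler fires on every iteration while the
        // offending events are never removed, so the error is logged forever.
        LogLibraryServiceError(ex);
    }
}
```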

Implementing a Fix

The key to resolving this issue was preventing logs from being resent indefinitely. The error message itself suggested that the issue was caused by log events in the batch spanning more than 24 hours.

We identified the log batching logic in this part of the code:
https://github.com/aws/aws-logging-dotnet/blob/ec3931ea9bb1f854be209ebabf032772d2e7d626/src/AWS.Logger.Core/Core/AWSLoggerCore.cs#L28

To address the issue, we implemented two key fixes:

  1. Drop log events in the sending batch if an exception occurs – If AWS rejects a batch due to this 24-hour rule, we discard the batch instead of repeatedly attempting to send it. This prevents an infinite loop of error logs.
  2. Filter out log events older than 24 hours before sending – Before sending each batch, we remove any log events that exceed the 24-hour limit, ensuring compliance with AWS’s restrictions.

Final Code Fix:
https://github.com/boostdraft/aws-logging-dotnet/pull/1/files
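For illustration, here is a simplified, self-contained sketch of both fixes. The actual change is in the pull request above; SendBatchAsync and its surrounding plumbing are hypothetical, and only the CloudWatch Logs SDK types are real:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Amazon.CloudWatchLogs;
using Amazon.CloudWatchLogs.Model;

static class LogSender
{
    public static async Task SendBatchAsync(
        IAmazonCloudWatchLogs client, PutLogEventsRequest request, CancellationToken token)
    {
        // Fix 2: filter out events older than 24 hours before sending, so the
        // batch can never span more than CloudWatch's 24-hour limit.
        var cutoff = DateTime.UtcNow.AddHours(-24);
        request.LogEvents.RemoveAll(e => e.Timestamp < cutoff);

        if (request.LogEvents.Count == 0)
            return;

        try
        {
            await client.PutLogEventsAsync(request, token).ConfigureAwait(false);
        }
        catch (InvalidParameterException)
        {
            // Fix 1: if CloudWatch still rejects the batch, drop it instead of
            // retrying the same invalid payload forever; the caller logs the
            // failure once and moves on.
            request.LogEvents.Clear();
            throw;
        }
    }
}
```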

Trade-offs

While this solution effectively stops excessive error logs, there is a minor downside: some older logs may be lost. However, the number of dropped logs should be minimal. Given the alternative—an overwhelming flood of duplicate error logs—this trade-off is well worth it.

By making these changes, we significantly reduced log noise, improved CloudWatch performance, and maintained meaningful error tracking. 🚀
