
Workaround for the CloudWatch PutLogEvents "cannot span more than 24 hours" error


Summary

At BoostDraft, we had been using AWS Logging .NET to store error logs in AWS CloudWatch and Elasticsearch, but during a routine log review we noticed that a large volume of error logs was being recorded. The cause was that batches being sent contained log events spanning more than 24 hours, which triggered errors due to an AWS restriction.
Our investigation confirmed that error logs were being generated endlessly inside a while loop, so we made two fixes:

  • Discard the log events in the sending batch when an exception occurs
  • Filter out logs older than 24 hours before sending

With these fixes, log noise decreased, CloudWatch performance improved, and error tracking became more effective.

Background

At BoostDraft, we use AWS Logging .NET to capture errors and exceptions from our products and store them in AWS CloudWatch. We also forward these logs to Elasticsearch for analysis, following the approach outlined in this gist.
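For context, the wiring looks roughly like the following minimal sketch, assuming the AWS.Logger.AspNetCore package provides the logging-builder extension; the log group name and region are placeholders, and the Elasticsearch forwarding from the gist is omitted:

```csharp
using AWS.Logger;                       // AWSLoggerConfig
using Microsoft.Extensions.Logging;

// Minimal sketch: route Microsoft.Extensions.Logging output to CloudWatch Logs
// via AWS Logging .NET. "MyApp-Errors" and the region are placeholders.
using var loggerFactory = LoggerFactory.Create(logging =>
{
    logging.AddAWSProvider(new AWSLoggerConfig("MyApp-Errors")
    {
        Region = "us-east-1"
    });
});

var logger = loggerFactory.CreateLogger("Example");
logger.LogError("Example error that ends up in CloudWatch");
```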

During a routine log review, we noticed a flood of error logs in Elasticsearch. The majority of these errors stemmed from the following message:

An error occurred (InvalidParameterException) when calling the PutLogEvents operation: The batch of log events in a single PutLogEvents request cannot span more than 24 hours

This is a restriction imposed by AWS, as documented here:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs/client/put_log_events.html#put-log-events
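In practice, the rule means the oldest and newest timestamps in a single PutLogEvents request must be no more than 24 hours apart. A quick way to express the check (a hypothetical helper, using only the .NET CloudWatch Logs SDK types):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Amazon.CloudWatchLogs.Model;

static class BatchChecks
{
    // True if the oldest and newest events in the batch are more than 24 hours
    // apart, i.e. the batch would violate the PutLogEvents span rule.
    public static bool SpansMoreThan24Hours(IReadOnlyCollection<InputLogEvent> batch)
    {
        if (batch.Count < 2)
            return false;

        DateTime oldest = batch.Min(e => e.Timestamp);
        DateTime newest = batch.Max(e => e.Timestamp);
        return newest - oldest > TimeSpan.FromHours(24);
    }
}
```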

Identifying the Root Cause

Since AWS Logging .NET is open-source, we were able to investigate the issue directly in the source code:
https://github.com/aws/aws-logging-dotnet

After tracing the code, we identified a potential issue in this line:
https://github.com/aws/aws-logging-dotnet/blob/ec3931ea9bb1f854be209ebabf032772d2e7d626/src/AWS.Logger.Core/Core/AWSLoggerCore.cs#L409

This line sits inside a while loop, which we suspected was causing the repeated error logs. To confirm, we manually threw a general exception inside the try block and observed that LogLibraryServiceError(ex) kept getting called indefinitely.
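Stripped down, the loop has roughly this shape (paraphrased for illustration; LogLibraryServiceError is the method named in the linked source, the rest is simplified):

```csharp
// Paraphrased shape of the background send loop in AWSLoggerCore.
while (!token.IsCancellationRequested)
{
    try
    {
        // Flush the pending batch to CloudWatch.
        await SendMessages(token).ConfigureAwait(false);
    }
    catch (Exception ex)
    {
        // If CloudWatch keeps rejecting the same batch (e.g. because it spans
        // more than 24 hours), this handler fires on every iteration while the
        // offending events are never removed, so the error is logged forever.
        LogLibraryServiceError(ex);
    }
}
```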

Implementing a Fix

The key to resolving this issue was preventing logs from being resent indefinitely. The error message itself suggested that the issue was caused by log events in the batch spanning more than 24 hours.

We identified the log batching logic in this part of the code:
https://github.com/aws/aws-logging-dotnet/blob/ec3931ea9bb1f854be209ebabf032772d2e7d626/src/AWS.Logger.Core/Core/AWSLoggerCore.cs#L28

To address the issue, we implemented two key fixes:

  1. Drop log events in the sending batch if an exception occurs – If AWS rejects a batch due to this 24-hour rule, we discard the batch instead of repeatedly attempting to send it. This prevents an infinite loop of error logs.
  2. Filter out log events older than 24 hours before sending – Before sending each batch, we remove any log events that exceed the 24-hour limit, ensuring compliance with AWS’s restrictions.

Final Code Fix:
https://github.com/boostdraft/aws-logging-dotnet/pull/1/files
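For illustration, here is a simplified, self-contained sketch of both fixes. The actual change is in the pull request above; SendBatchAsync and its surrounding plumbing are hypothetical, and only the CloudWatch Logs SDK types are real:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Amazon.CloudWatchLogs;
using Amazon.CloudWatchLogs.Model;

static class LogSender
{
    public static async Task SendBatchAsync(
        IAmazonCloudWatchLogs client, PutLogEventsRequest request, CancellationToken token)
    {
        // Fix 2: filter out events older than 24 hours before sending, so the
        // batch can never span more than CloudWatch's 24-hour limit.
        var cutoff = DateTime.UtcNow.AddHours(-24);
        request.LogEvents.RemoveAll(e => e.Timestamp < cutoff);

        if (request.LogEvents.Count == 0)
            return;

        try
        {
            await client.PutLogEventsAsync(request, token).ConfigureAwait(false);
        }
        catch (InvalidParameterException)
        {
            // Fix 1: if CloudWatch still rejects the batch, drop it instead of
            // retrying the same invalid payload forever; the caller logs the
            // failure once and moves on.
            request.LogEvents.Clear();
            throw;
        }
    }
}
```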

Trade-offs

While this solution effectively stops excessive error logs, there is a minor downside: some older logs may be lost. However, the number of dropped logs should be minimal. Given the alternative—an overwhelming flood of duplicate error logs—this trade-off is well worth it.

By making these changes, we significantly reduced log noise, improved CloudWatch performance, and maintained meaningful error tracking. 🚀
