Automating CloudWatch Logs Error Classification with Claude: SRE Design Using Bedrock + Lambda
Let AI Handle ERROR Meaning Classification
Sifting through every single ERROR line streaming into CloudWatch Logs to judge whether it's a transient error or an incident is the most exhausting task for SREs. Rule-based filters can only catch pre-defined patterns, and they become obsolete every time a new exception message appears unless someone manually updates them.
AWS's native Log Anomaly Detection, on the other hand, is good at telling you that logs of the same shape have burst, but it does not interpret what the log content means. Semantic classification and a suggested initial response are quicker to get from Claude via Bedrock. Here, I share a design that applies the same philosophy as my previous article on classifying terraform plan output to logs streaming in real time.
Why Native Anomaly Detection Isn't Enough
CloudWatch Logs Log Anomaly Detection, which became GA in 2024, learns a baseline for each log group based on the past two weeks of logs. It detects five types of anomalies, such as changes in pattern frequency, new patterns, and token variations. This is sufficiently robust for capturing when "a log of the same shape appears more often / less often / for the first time."
However, what is problematic in operational settings is not the baseline anomalies themselves, but "logs that say ERROR but whose initial response cannot be determined without reading the content." Here are three examples:
ERROR ConnectionRefused: dial tcp 10.0.5.32:6379: connect: connection refused
ERROR NullPointerException at com.example.UserService.findById(UserService.java:42)
ERROR AccessDeniedException: User: arn:aws:sts::... is not authorized to perform: s3:GetObject
All three lines start with ERROR. Log anomaly detection catches these as separate patterns, but it doesn't tell us "which one humans should act on." The first is a Redis transient failure that is retryable; the second is an application code defect; the third is potentially an IAM misconfiguration or a breach. The teams responsible (Infrastructure, Development, Security) are all different.
Attempting to handle this "semantic sorting" with rule-based systems leads to a mountain of regular expressions. It requires maintenance every time a new error message emerges. LLMs like Claude are perfect for this "semantic interpretation of free-format strings" domain.
With this in mind, I am building a design that adds a Claude-based semantic classification layer after the native anomaly detection.
Overall Pipeline
The basic architecture is as follows.
CloudWatch Logs (Log Group)
  └─ Subscription Filter (ERROR lines only)
       └─ Lambda (Python)
            ├─ De-duplication via fingerprint (DynamoDB)
            ├─ Bedrock InvokeModel (Claude Haiku)
            └─ Notify classification results to Discord/Slack
The fingerprint here refers to a template string where variable parts of the log content (IPs, timestamps, request IDs, etc.) are replaced with <PLACEHOLDER>. Since even repeated errors of the same type can be collapsed into a single fingerprint, it is used as a key in DynamoDB for de-duplication. The specific generation method will be discussed in the prompt design section later.
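To make the shape of a fingerprint concrete, here is a minimal rule-based sketch. It is an illustration only: in this design the fingerprint is generated by Claude, and the three substitution rules and the function names `rough_fingerprint` and `dedup_key` are assumptions made up for this example.

```python
import hashlib
import re

# Hypothetical substitution rules, for illustration only. Real logs need many more.
_RULES = [
    # IPv4 address with an optional port
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),
    # UUID-style request IDs
    re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I),
    # Any remaining standalone integers
    re.compile(r"\b\d+\b"),
]


def rough_fingerprint(message: str) -> str:
    """Replace variable parts of a log line with <PLACEHOLDER>."""
    for pattern in _RULES:
        message = pattern.sub("<PLACEHOLDER>", message)
    return message


def dedup_key(fingerprint: str) -> str:
    """Hash the fingerprint so it can be used as a fixed-length DynamoDB key."""
    return hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()


a = rough_fingerprint("ERROR ConnectionRefused: dial tcp 10.0.5.32:6379: connect: connection refused")
b = rough_fingerprint("ERROR ConnectionRefused: dial tcp 10.0.7.9:6379: connect: connection refused")
print(a)
print(dedup_key(a) == dedup_key(b))
```

Two lines that differ only in the IP address collapse to the same key, which is the property the de-duplication step relies on.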
The subscription filter streams only lines containing ERROR to Lambda. Because the fingerprint is produced by Claude, the Lambda first sends each line to Claude on Bedrock, then checks DynamoDB to see whether an error with the same fingerprint has already been notified within the last N minutes. Only if it has not are the classification category and initial response suggestion sent to Discord.
Why filter roughly at the subscription filter stage? There are two reasons.
- LLM calls are significantly more expensive than native filters: Each line costs about $0.0001 with Haiku (approx. 200–500 tokens). Sending all 1 million lines that a log group produces per day would cost about $100 per day, or roughly $3,000 per month. Native filters are essentially free.
- A layer is needed to suppress notification noise early: Classifying DEBUG/INFO logs is of little value. Precisely because we perform semantic classification in the back end, we narrow down to only the "levels that potentially require human attention" in the front end.
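The arithmetic behind the first reason can be checked with a few lines of Python. The per-line cost and the 1 million lines per day are the figures quoted above. The 1% ERROR ratio and the function name `monthly_llm_cost` are assumptions for illustration.

```python
def monthly_llm_cost(lines_per_day: int, cost_per_line_usd: float, days: int = 30) -> float:
    """Estimate the monthly LLM bill if every given line is sent to the model."""
    return lines_per_day * cost_per_line_usd * days


# Every line in the log group goes to the LLM
all_lines = monthly_llm_cost(1_000_000, 0.0001)

# Only ERROR lines go to the LLM (assumed here to be 1% of all lines)
errors_only = monthly_llm_cost(10_000, 0.0001)

print(f"all lines:   ${all_lines:,.0f}/month")
print(f"ERROR only:  ${errors_only:,.0f}/month")
```

Under these assumptions the subscription filter reduces the bill by two orders of magnitude before any prompt tuning is attempted.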
Narrowing Down to 5 Classification Categories
We fix the categories to five, corresponding one-to-one with "what to do next" in operations.
| Category | Meaning | Direction of Initial Response |
|---|---|---|
| transient | Transient network/timeout issues | Ignore if it recovers naturally via retry. Escalate if persistent. |
| dependency | Abnormalities in external dependencies (DB/API/SaaS) | Check status pages of dependencies, consider circuit breaking. |
| application | Application code defects (exceptions, assertions) | File a ticket for the dev team, collect reproduction conditions. |
| security | Authentication failures, insufficient permissions, unexpected operations | Escalate immediately to security personnel. |
| unknown | None of the above | Human reviews, corrects classification, and adds to known-failures. |
Why five categories? If you subdivide into 10, the LLM's classification stability drops—the same log might be app-bug one day and runtime-error the next, losing credibility. Conversely, a binary "Critical/Minor" choice leaves different responders mixed under "Critical," forcing humans to read them anyway. A granularity that matches the responders (yourself, dev, security) provides the right operational resolution.
The unknown category exists to expose cases the LLM is not confident about instead of forcing them into one of the other four. It is better to bring misclassifications out into the open for human correction than to hide them under a misleading label like application.
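One way to enforce the fixed category set on the Lambda side is to normalize whatever the model returns. This is a sketch under the assumption that the result is already a parsed dict. `normalize_result` is a hypothetical helper, and downgrading an unrecognized confidence value to low is a design choice of this example, not something the design prescribes.

```python
VALID_CATEGORIES = {"transient", "dependency", "application", "security", "unknown"}
VALID_CONFIDENCE = {"high", "medium", "low"}


def normalize_result(result: dict) -> dict:
    """Coerce category/confidence into the allowed values without mutating the input."""
    normalized = dict(result)
    category = str(result.get("category", "")).strip().lower()
    confidence = str(result.get("confidence", "")).strip().lower()
    if category not in VALID_CATEGORIES:
        # An invented label such as "runtime-error" is surfaced as unknown
        category = "unknown"
    if confidence not in VALID_CONFIDENCE:
        confidence = "low"
    normalized["category"] = category
    normalized["confidence"] = confidence
    return normalized


print(normalize_result({"category": "runtime-error", "confidence": "high"}))
```

Mapping unexpected labels to unknown keeps the "expose it to a human" behavior even when the model drifts from the five categories.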
Prompt Design
The prompt itself is stored in a file within the repository, which the Lambda reads at startup. Embedding prompts directly into Lambda code requires a new deployment every time you make an improvement.
The essence of the prompt is as follows:
You are an SRE classifying production application error logs.
The input is a single log message and its originating log group / service name.
Please return one of the following categories:
- transient: Transient anomalies like network glitches, timeouts, or rate limits that may recover via retry.
- dependency: Abnormalities in external dependencies like DBs, external APIs, or SaaS.
- application: Defects in your own code (NullPointerException, unhandled exception, assertion violations, etc.).
- security: Authentication failures, insufficient permissions, or unexpected operations (IAM AccessDenied, SSL verification failures, CORS rejections, etc.).
- unknown: Cannot confidently fit into any of the above.
Reference: Known errors that have occurred in this service are recorded below.
If the error has the same meaning, please return the fingerprint as `known:<id>`.
{{known_failures_excerpt}}
Output only a single JSON line. No explanations or code fences.
{
"category": "<category>",
"summary": "<Summary in one Japanese sentence>",
"first_response": "<Action to take next in 1-2 Japanese sentences>",
"fingerprint": "<Templated log. Replace variable parts with <PLACEHOLDER>>",
"confidence": "<high|medium|low>"
}
Input:
service: {{service}}
log_group: {{log_group}}
message: {{message}}
There are three key design points:
- Passing known_failures to context: Including known-failures.md (or a summary of recent incidents on S3) in the prompt improves accuracy for known patterns. Since this varies by project, it is practical to switch content based on the service name.
- Templatizing the fingerprint: Claude generates the fingerprint that we discussed earlier. If you try to write this with rules, the regex for variable parts grows infinitely, so letting Claude handle it at the same semantic layer is faster.
- Mandating confidence: This allows for branching logic, such as changing the background color of Discord notifications for low confidence items, or using it in conjunction with the unknown category. By adding an operational layer between "trusting the LLM completely" and "doubting everything," we manage the output effectively.
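The confidence-based branching in the third point could look like the following sketch. The color values and the `needs_review` flag are assumptions for illustration (Discord embeds accept an integer RGB color), and `notification_style` is a hypothetical helper name.

```python
# Hypothetical colors: green / yellow / red as integer RGB values
COLOR_BY_CONFIDENCE = {"high": 0x2ECC71, "medium": 0xF1C40F, "low": 0xE74C3C}


def notification_style(category: str, confidence: str) -> dict:
    """Decide how a notification is displayed based on category and confidence."""
    # Low confidence and unknown are treated the same way: ask a human to look
    needs_review = confidence == "low" or category == "unknown"
    return {
        "color": COLOR_BY_CONFIDENCE.get(confidence, COLOR_BY_CONFIDENCE["low"]),
        "needs_review": needs_review,
    }


print(notification_style("application", "high"))
print(notification_style("unknown", "medium"))
```

Keeping this mapping in one small function makes it easy to adjust once the team has seen how often low-confidence results turn out to be wrong.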
Lambda Sketch
The core of the design is the Lambda classify function and the deduplication logic, shown below as a sketch. What I have validated is limited to unit tests of the classification part; end-to-end production runs from the subscription filter are not covered. For production deployment, you must separately design minimal IAM permissions and DynamoDB capacity.
import base64
import gzip
import hashlib
import json
import os
import time

import boto3

bedrock = boto3.client("bedrock-runtime")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ["DEDUP_TABLE"])

MODEL_ID = "anthropic.claude-haiku-4-5-20251001-v1:0"
DEDUP_WINDOW_SEC = 600  # Do not notify for the same type of error within 10 minutes

with open("prompt.md", encoding="utf-8") as f:
    PROMPT_TEMPLATE = f.read()


def handler(event, _context):
    """Entry point. load_known_failures() and notify() are assumed to be defined elsewhere."""
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    if payload.get("messageType") != "DATA_MESSAGE":
        return  # Skip CONTROL_MESSAGE payloads that carry no application logs
    log_group = payload["logGroup"]
    service = log_group.split("/")[-1]
    for log_event in payload["logEvents"]:
        message = log_event["message"]
        result = classify(service, log_group, message)
        if result is None:
            continue
        if is_recently_seen(result["fingerprint"]):
            continue
        notify(result, service, log_group, message)
        record_seen(result["fingerprint"])


def classify(service, log_group, message):
    """Ask Claude for a classification; return None if the reply is not usable."""
    prompt = (
        PROMPT_TEMPLATE
        .replace("{{known_failures_excerpt}}", load_known_failures(service))
        .replace("{{service}}", service)
        .replace("{{log_group}}", log_group)
        .replace("{{message}}", message[:2000])  # Untrusted log text is substituted last
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 400,
        "messages": [{"role": "user", "content": prompt}],
    }
    resp = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    text = json.loads(resp["body"].read())["content"][0]["text"].strip()
    try:
        result = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(result, dict) or not isinstance(result.get("fingerprint"), str):
        return None  # Valid JSON, but not the shape the prompt asked for
    return result


def is_recently_seen(fingerprint):
    """Return True if this fingerprint was notified within the dedup window."""
    key = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
    item = table.get_item(Key={"fp": key}).get("Item")
    if not item:
        return False
    return time.time() - int(item["ts"]) < DEDUP_WINDOW_SEC


def record_seen(fingerprint):
    """Store the notification time for this fingerprint."""
    key = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
    table.put_item(Item={"fp": key, "ts": int(time.time())})
The key is to return None and halt notification if classify() fails (e.g., due to JSON parsing failure). If broken Bedrock responses are sent to Discord, the ops team may judge them as "noise," switch off the filter, and cause genuine anomalies to be buried. Treat LLM output as something that will eventually break.
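A slightly more defensive parser, in the same spirit, might tolerate a stray code fence around the JSON and reject replies that lack the expected keys. This is a sketch, not part of the validated code: `parse_model_output` and `strip_fence` are hypothetical helpers, and the required keys are taken from the JSON schema in the prompt.

```python
import json

REQUIRED_KEYS = {"category", "summary", "first_response", "fingerprint", "confidence"}
FENCE = "`" * 3  # Markdown code fence, built this way to keep this example readable


def strip_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence if the model added one anyway."""
    text = text.strip()
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.startswith("json"):
            text = text[len("json"):]
    if text.endswith(FENCE):
        text = text[:-len(FENCE)]
    return text.strip()


def parse_model_output(text: str):
    """Return the parsed dict, or None when the reply cannot be trusted."""
    try:
        data = json.loads(strip_fence(text))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


print(parse_model_output("Sure! Here is the classification you asked for."))
```

Anything that returns None here is dropped silently from notifications, so it is worth counting those cases in a metric to notice when the prompt or model starts misbehaving.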
I truncate the message to the first 2000 characters with message[:2000] to keep Bedrock token usage down. Signals that influence classification (exception classes, error messages, root causes) are usually concentrated at the start of an ERROR line, and including the entire tail end can make the deduplication fingerprint unstable. Prepare separate handling for exceptionally long logs.
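For unit tests, it helps to build an event in the same gzip + base64 envelope that handler() decodes from event["awslogs"]["data"]. The sketch below is an assumption about how such a fixture could be written. `make_awslogs_event` and `decode_awslogs_event` are hypothetical helper names, and the log stream name, IDs, and timestamps are dummy values.

```python
import base64
import gzip
import json


def make_awslogs_event(log_group: str, messages: list) -> dict:
    """Build a fake subscription-filter event for local tests."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": log_group,
        "logStream": "test-stream",  # dummy value
        "logEvents": [
            {"id": str(i), "timestamp": 0, "message": m}
            for i, m in enumerate(messages)
        ],
    }
    compressed = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(compressed).decode("ascii")}}


def decode_awslogs_event(event: dict) -> dict:
    """Reverse of make_awslogs_event; same decoding steps as the handler."""
    return json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))


event = make_awslogs_event("/ecs/payment-service", ["ERROR NullPointerException at ..."])
print(decode_awslogs_event(event)["logEvents"][0]["message"])
```

With a fixture like this, classify(), notify(), and the DynamoDB calls can be replaced by test doubles so the handler loop is exercised without touching AWS.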
Terraform Sketch
The minimal configuration for connecting a subscription filter and Lambda using Terraform looks like this. I assume that the Lambda code and the DynamoDB table itself are defined in a separate module.
resource "aws_cloudwatch_log_subscription_filter" "app_errors" {
  name            = "app-errors-to-classifier"
  log_group_name  = aws_cloudwatch_log_group.app.name
  filter_pattern  = "ERROR"
  destination_arn = aws_lambda_function.log_classifier.arn

  # CloudWatch Logs must already be allowed to invoke the function when the filter is created
  depends_on = [aws_lambda_permission.allow_logs]
}

resource "aws_lambda_permission" "allow_logs" {
  statement_id  = "AllowExecutionFromCWLogs"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.log_classifier.function_name
  principal     = "logs.amazonaws.com"
  source_arn    = "${aws_cloudwatch_log_group.app.arn}:*"
}

resource "aws_iam_role_policy" "classifier_inline" {
  role = aws_iam_role.log_classifier.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["bedrock:InvokeModel"]
        Resource = "arn:aws:bedrock:*::foundation-model/anthropic.claude-haiku-*"
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:GetItem", "dynamodb:PutItem"]
        Resource = aws_dynamodb_table.dedup.arn
      }
    ]
  })
}
The IAM resource specification uses a wildcard for the Bedrock model name. This is intended to allow for Claude version upgrades without modifying Terraform, but it is not recommended to widen this to anthropic.claude-*. The reason is to prevent accidents where code intended for Haiku unexpectedly hits Opus, causing costs to spike. Operations such as "promoting classification for the security category to Sonnet" should be handled by explicitly granting permission in separate statements for better safety.
Notification Format Design
Format the notification sent to Discord/Slack as follows. Presenting the suggested response as if it were an instruction has a side effect: human operating rules end up resting on LLM advice. The notification should therefore state clearly that the response is a proposal, not a decision.
[application] payment-service: NullPointerException at UserService.findById
Summary: Null reference occurred in a path where userId was not found
Classification Basis (confidence: high)
Primary Response (Proposal): Create a ticket for the dev team. Check for requests where userId is "" as a reproduction condition
log_group: /ecs/payment-service
fingerprint: NullPointerException at <CLASS>.<METHOD>(<FILE>:<LINE>)
I explicitly label it as "Primary Response (Proposal)" to prevent humans from abdicating responsibility during the early stages of operation by saying, "I did it because Claude said so." It is a starting point, and I embed the assumption that the final decision is made by the human who receives it.
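Rendering that format from the classification result could be sketched as follows. `format_notification` is a hypothetical helper. Using the first 80 characters of the raw message as the headline is an assumption of this example, since the JSON schema in the prompt has no headline field.

```python
def format_notification(result: dict, service: str, log_group: str, message: str) -> str:
    """Render the plain-text notification body from a classification result."""
    headline = message.strip()[:80]  # assumed: short head of the raw log line
    lines = [
        f"[{result['category']}] {service}: {headline}",
        f"Summary: {result['summary']}",
        f"Classification Basis (confidence: {result['confidence']})",
        # Labeled as a proposal so the reader keeps ownership of the decision
        f"Primary Response (Proposal): {result['first_response']}",
        f"log_group: {log_group}",
        f"fingerprint: {result['fingerprint']}",
    ]
    return "\n".join(lines)


sample = {
    "category": "application",
    "summary": "Null reference occurred in a path where userId was not found",
    "first_response": "Create a ticket for the dev team.",
    "fingerprint": "NullPointerException at <CLASS>.<METHOD>(<FILE>:<LINE>)",
    "confidence": "high",
}
print(format_notification(sample, "payment-service", "/ecs/payment-service",
                          "NullPointerException at UserService.findById"))
```

The same string can be sent as the content of a Discord or Slack webhook message; the webhook call itself is left out here.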
A Feedback Loop to Turn Misclassifications into Assets
LLM classification will inevitably fail. Build a mechanism from the start to accumulate records where humans have made corrections when it fails.
Follow this operational flow:
- Add "✅ Correct" and "✏️ Correction" reactions to Discord notifications (same for Slack).
- In case of a correction, leave a one-line comment explaining the correct category and reason.
- A separate Lambda picks up the reaction and appends the JSON to `corrections/` in S3.
- Aggregate correction logs weekly and update `known-failures.md` and `known_failures_excerpt` in the prompt.
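The weekly aggregation step could be sketched as below. The correction record fields (`fingerprint`, `corrected_category`, `reason`) and the function name `build_known_failures_excerpt` are assumptions for illustration. The actual JSON written to S3 is not specified in this design.

```python
import json
from collections import Counter


def build_known_failures_excerpt(correction_lines, limit: int = 20) -> str:
    """Turn JSON correction records (one per line) into a prompt excerpt."""
    counts = Counter()
    reasons = {}
    for line in correction_lines:
        record = json.loads(line)
        key = (record["fingerprint"], record["corrected_category"])
        counts[key] += 1
        reasons[key] = record.get("reason", "")  # keep the most recent reason
    rows = []
    # Most frequently corrected patterns first, capped to keep the prompt short
    for (fingerprint, category), n in counts.most_common(limit):
        rows.append(f"- [{category}] {fingerprint} (corrected {n}x): {reasons[(fingerprint, category)]}")
    return "\n".join(rows)


corrections = [
    json.dumps({"fingerprint": "Timeout calling <URL>", "corrected_category": "dependency", "reason": "payment API outage"}),
    json.dumps({"fingerprint": "Timeout calling <URL>", "corrected_category": "dependency", "reason": "same vendor again"}),
    json.dumps({"fingerprint": "AccessDenied for <ARN>", "corrected_category": "security", "reason": "role rotated"}),
]
print(build_known_failures_excerpt(corrections))
```

The returned text is what would be substituted for the known-failures placeholder in the prompt, so the cap on the number of rows also bounds the token cost of each call.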
The key to this design is "lowering the cost of correction." If you create a UI where a single reaction completes a correction, operators will make corrections rather than "ignoring it because it is tedious." If correction data does not accumulate, the LLM operation will rot.
Why not perform fine-tuning every time? Simply appending "real-world examples corrected by humans recently" to the prompt is sufficient for classification tasks on the scale of Haiku. Fine-tuning introduces an update cycle longer than a week, making it impossible to keep up with known-failures additions.
The Boundary Between "Leaving it to Claude" and "Human Decision Making"
In this design, we leave the following to Claude:
- Categorizing the meaning of log bodies into one of five categories.
- Templating the fingerprint that groups similar logs.
- Drafting candidates for primary response in Japanese.
Meanwhile, we leave the following to humans:
- Judging category corrections and updating `known-failures`: what constitutes a "known" issue depends on business context.
- Final decision for incident declaration: whether to trigger on-call immediately upon receiving a `security` notification or just run a preliminary investigation is a decision for humans.
- Refinement policy for the prompt: design decisions such as adding new categories or changing classification granularity.
Comparing this boundary with the terraform plan classification article, a common outline emerges for areas that AI can take over. The more a task is "monotonous for humans, unmaintainable if written with regular expressions, but close to mechanical when interpreting meaning," the greater the benefit of leaving it to Claude. Conversely, judgments involving "business context, ethics, and accountability" are better kept away from Claude to keep operations sharp.
Do not leave anomaly detection entirely to native features; instead, add a meaning-classification layer on top. This is a design that allows SREs to stop reading ERROR logs one by one and instead dedicate time to their original work of making corrections and policy decisions.