
A Practical Guide to Choosing Between SQS, EventBridge, and Step Functions


Introduction

When you set out to implement asynchronous processing on AWS, the number of options can be confusing:

  • SQS (Simple Queue Service)
  • EventBridge
  • Step Functions

"They all pass messages around, so what's the difference?" is a question almost everyone runs into when starting out with AWS.

In my work, I use these three together within the same system. In this article, I will introduce criteria for choosing which one to use in practice, along with concrete examples.

Understanding the differences in 3 lines

| Service | In a nutshell | Suitable scenarios |
| --- | --- | --- |
| SQS | A queue that ensures job delivery | 1-to-1 asynchronous job execution |
| EventBridge | A bus that intelligently routes events | 1-to-many event delivery & decoupling |
| Step Functions | A workflow that controls sequences of steps | Processes involving branching, retries, and waits |

SQS: Choose this if you want to "make sure every job gets processed"

Typical use case

API Gateway → Lambda (Reception) → SQS → Lambda (Processing)

In this pattern, the API responds to the request immediately with 202 Accepted, and the actual processing runs in the background.
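The reception side of this pattern can be sketched as follows. This is a minimal illustration, assuming a Python Lambda and a boto3-style SQS client passed in as `sqs`; the function name `accept_request`, the `QUEUE_URL` value, and the `jobId`/`payload` message shape are hypothetical and not from the original system.

```python
import json
import uuid

# Hypothetical queue URL; a real function would read this from configuration.
QUEUE_URL = "https://sqs.ap-northeast-1.amazonaws.com/123456789012/job-queue"


def accept_request(event, sqs, queue_url=QUEUE_URL):
    """Enqueue the request body and return 202 without doing the real work.

    `sqs` is expected to behave like boto3.client("sqs"); only
    send_message(QueueUrl=..., MessageBody=...) is used.
    """
    job_id = str(uuid.uuid4())
    payload = json.loads(event.get("body") or "{}")
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"jobId": job_id, "payload": payload}),
    )
    # The caller receives an ID it can use to look up the result later.
    return {"statusCode": 202, "body": json.dumps({"jobId": job_id})}
```

Passing the client in as an argument keeps the handler easy to unit-test with a stub instead of a real queue.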

Criteria for choosing SQS

  • 1-to-1 relationship between sender and receiver
  • Ordering guarantee is required (FIFO queue)
  • Want to retry if processing fails (Visibility timeout + Dead Letter Queue)
  • Want to control throughput (Batch size, concurrency limits)

Implementation points

How to choose between FIFO vs Standard

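| Aspect | Standard | FIFO |
| --- | --- | --- |
| Throughput | Almost unlimited | 300 msg/sec (3,000 with batching); more with high-throughput mode |
| Ordering guarantee | None (best effort) | Yes (per MessageGroupId) |
| De-duplication | None (at-least-once delivery, so consumers should be idempotent) | Yes (for 5 minutes) |
| Use cases | Log collection, notifications | Payment processing, order processing |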

Practical advice: Start with Standard if you're unsure. FIFO is only necessary in cases where "it's a problem if requests from the same user are not processed in order."
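To make "in order per user" concrete, here is a small sketch of the parameters a FIFO send would carry. `MessageGroupId` and `MessageDeduplicationId` are real SQS `send_message` parameters; the helper name `build_fifo_message`, the `user-<id>` grouping scheme, and the hash-based de-duplication ID are assumptions chosen for illustration.

```python
import hashlib
import json


def build_fifo_message(queue_url, user_id, body):
    """Build keyword arguments for SQS send_message on a FIFO queue."""
    encoded = json.dumps(body, sort_keys=True)
    # Include the user in the hash so identical bodies from different
    # users are not treated as duplicates of each other.
    dedup_source = f"{user_id}:{encoded}".encode()
    return {
        "QueueUrl": queue_url,
        "MessageBody": encoded,
        # Messages sharing a MessageGroupId are delivered in order.
        "MessageGroupId": f"user-{user_id}",
        # A repeated ID within the 5-minute window is dropped as a duplicate.
        "MessageDeduplicationId": hashlib.sha256(dedup_source).hexdigest(),
    }
```

Grouping by user keeps ordering where it matters while still letting different users' messages be processed in parallel.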

Always set up a Dead Letter Queue (DLQ)

# Terraform example
resource "aws_sqs_queue" "job_queue" {
  name = "job-queue"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.job_dlq.arn
    maxReceiveCount     = 3
  })
}

resource "aws_sqs_queue" "job_dlq" {
  name = "job-queue-dlq"
}

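If you don't set up a DLQ, a message that fails processing becomes visible again after every visibility timeout and keeps being retried until the message retention period expires (4 days by default), at which point it is silently deleted. When a failure occurs in production, having the failed messages parked in a DLQ makes recovery far easier.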

Lambda event source mapping configuration

resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn                   = aws_sqs_queue.job_queue.arn
  function_name                      = aws_lambda_function.processor.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5
  function_response_types            = ["ReportBatchItemFailures"]
}

With ReportBatchItemFailures set, the handler reports the IDs of the messages that failed, and only those messages are retried when part of a batch fails. Without it, a single failure causes the entire batch to be retried, so messages that already succeeded are processed again.
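On the Lambda side, the setting only takes effect if the handler returns the failed message IDs in a `batchItemFailures` list. A minimal sketch, assuming a Python handler; `process_job` and its validation rule are hypothetical stand-ins for the real processing:

```python
import json


def process_job(job):
    """Hypothetical business logic; raises when the job is invalid."""
    if "file" not in job:
        raise ValueError("missing 'file'")


def handler(event, context=None):
    failures = []
    for record in event["Records"]:
        try:
            process_job(json.loads(record["body"]))
        except Exception:
            # Report only this message; the others are deleted from the queue.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": failures}
```

Because Standard queues deliver at least once, `process_job` should still be idempotent even with partial batch responses enabled.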

EventBridge: Choose this if "the sender doesn't need to know who receives it"

Typical use case

Lambda (Order Processing) → EventBridge → Lambda (Email Notification)
                                        → Lambda (Inventory Update)
                                        → Lambda (Analysis Log Recording)

This pattern fans out a single event to multiple Lambda functions.

Criteria for choosing EventBridge

  • Want to deliver a single event to multiple services
  • Want to decouple the sender and the receiver
  • Want to route based on event content
  • Possibility of adding more receivers in the future

The decisive difference from SQS

SQS: The sender knows "who to send it to" (specifies the queue URL).
EventBridge: The sender only issues "what happened" as an event. Who receives it is determined by rules.

This distinction matters a great deal for design. With EventBridge, adding a new process means adding a rule and a target; the code that publishes the event does not have to change.
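From the sender's point of view, publishing looks roughly like this. The sketch assumes a boto3-style EventBridge client passed in as `events`; `put_events` and the `Source`/`DetailType`/`Detail`/`EventBusName` fields are the real API shape, while the function name and the `orderId`/`amount` detail fields are hypothetical.

```python
import json


def publish_order_completed(events, order_id, amount, bus_name="default"):
    """Publish "what happened"; `events` behaves like boto3.client("events")."""
    entry = {
        "Source": "myapp.orders",
        "DetailType": "OrderCompleted",
        # Detail must be a JSON string, not a dict.
        "Detail": json.dumps({"orderId": order_id, "amount": amount}),
        "EventBusName": bus_name,
    }
    response = events.put_events(Entries=[entry])
    # put_events can reject entries without raising, so check the count.
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError("EventBridge rejected the event")
    return entry
```

Note that nothing in this function names a receiver: who reacts to `OrderCompleted` is decided entirely by the rules on the bus.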

Tips for rule design

{
  "source": ["myapp.orders"],
  "detail-type": ["OrderCompleted"],
  "detail": {
    "amount": [{ "numeric": [">", 10000] }]
  }
}

EventBridge rules are written as JSON event patterns. The example above matches "order completed events where the amount exceeds 10,000 yen."
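To build intuition for how such a pattern is evaluated, the following is a deliberately simplified local model of the matching logic. It is not the EventBridge engine: it only handles exact-value lists, nested objects, and the `numeric` operator, and the names `matches` and `ORDER_PATTERN` are made up for this sketch.

```python
import operator

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "=": operator.eq}

ORDER_PATTERN = {
    "source": ["myapp.orders"],
    "detail-type": ["OrderCompleted"],
    "detail": {"amount": [{"numeric": [">", 10000]}]},
}


def _leaf_matches(value, candidates):
    """A leaf matches if ANY candidate in the list matches (OR semantics)."""
    for cand in candidates:
        if isinstance(cand, dict) and "numeric" in cand:
            spec = cand["numeric"]  # e.g. [">", 10000] or [">", 0, "<=", 5]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            pairs = zip(spec[0::2], spec[1::2])
            if is_number and all(_OPS[op](value, bound) for op, bound in pairs):
                return True
        elif value == cand:
            return True
    return False


def matches(pattern, event):
    """Every key in the pattern must match (AND semantics across keys)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not _leaf_matches(event[key], expected):
            return False
    return True
```

A helper like this can be handy in unit tests for catching obvious pattern mistakes, but the real service supports more operators, so the final check should always be done against EventBridge itself.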

Practical design guidelines:

  • source should be the service name (myapp.orders).
  • detail-type should be the event type (OrderCompleted).
  • Filter within the detail object.

Standardizing these naming conventions within your team makes it easier to manage as events increase.

Step Functions: Choose this if "you want to control the flow of processing"

Typical use case

Step Functions
  ├─ Step 1: CSV file validation (Lambda)
  ├─ Step 2: Determine validation result (Choice state)
  │    ├─ OK → Step 3: Bulk registration to database (Lambda)
  │    └─ Failed → Step 3': Error notification (Lambda)
  └─ Step 4: Completion notification (Lambda)

Criteria for choosing Step Functions

  • Want to execute multiple Lambdas in sequence
  • There is conditional branching (Want to separate processing based on success/failure)
  • Want fine-grained control over retries and timeouts
  • Want to visualize the progress of processing
  • There is a step that waits for human approval

Don't force everything into SQS + Lambda

You might think, "Can't I just chain Lambdas with SQS and not need Step Functions?" While technically possible:

| Aspect | SQS chaining | Step Functions |
| --- | --- | --- |
| Branching | Implemented manually inside Lambda | Declaratively defined via Choice state |
| Error handling | DLQ + custom retry logic | Declaratively defined via Retry / Catch |
| Visualization | Tracing through CloudWatch Logs | Execution state visible in console |
| Timeout | Managed manually | Set in one place via TimeoutSeconds |

A practical rule of thumb is to consider Step Functions once you identify a process flow with 3 or more steps.
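As an illustration of the declarative error handling and timeout from the comparison above, a single Task state might look like this. `Retry`, `Catch`, `TimeoutSeconds`, `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` are Amazon States Language fields; the ARN, the state names, and the concrete numbers are placeholders. The helper `retry_delays` is a hypothetical function that only computes the basic backoff schedule and ignores optional fields such as `MaxDelaySeconds`.

```python
# Hypothetical Task state; the ARN and target state names are placeholders.
BULK_REGISTER_STATE = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:bulk-register",
    "TimeoutSeconds": 300,
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    # Anything still failing after the retries (including a timeout) is routed here.
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyError"}],
    "Next": "NotifyCompletion",
}


def retry_delays(policy):
    """Seconds waited before each retry: IntervalSeconds * BackoffRate ** n."""
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** attempt
        for attempt in range(policy["MaxAttempts"])
    ]
```

With the values above, `retry_delays(BULK_REGISTER_STATE["Retry"][0])` gives waits of 2, 4, and 8 seconds, which is the kind of logic that would otherwise have to be hand-written in an SQS chain.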

How to choose between Express vs Standard

| Aspect | Standard | Express |
| --- | --- | --- |
| Max execution time | 1 year | 5 minutes |
| Execution guarantee | Exactly-once | At-least-once (asynchronous) / At-most-once (synchronous) |
| Pricing | Charged per state transition | Charged by number of executions + duration |
| Use cases | Long-running workflows, approval flows | High-volume, short-lived processing |

Standard is suitable for batch processing like CSV validation → DB registration, while Express is suitable for API request processing pipelines.

Combination patterns

In practice, it is common to combine all three.

Pattern 1: API → SQS → Lambda (Basic asynchronous processing)

Client → API Gateway → Lambda → SQS → Lambda (Worker)

                         └─ Returns 202 Accepted immediately

The simplest pattern. Simply separating "reception" from "processing" stabilizes API response times.

Pattern 2: Lambda → EventBridge → Multiple Lambdas (Event-driven)

Lambda → EventBridge ─┬─ Rule A → Lambda (Notification)
                      ├─ Rule B → Lambda (Log Recording)
                      └─ Rule C → SQS → Lambda (Heavy processing)

Placing SQS after EventBridge is also a common pattern. It uses EventBridge for routing while utilizing SQS for buffering and retry control.
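One detail worth knowing for Rule C: when EventBridge delivers to an SQS target without an input transformer, the SQS message body is the whole event envelope as JSON, so the worker has to unwrap it. A minimal sketch, with the hypothetical helper name `extract_events`:

```python
import json


def extract_events(sqs_event):
    """Return (detail_type, detail) pairs from an SQS-triggered Lambda event.

    Assumes each record body is an unmodified EventBridge envelope,
    i.e. no input transformer is configured on the rule target.
    """
    results = []
    for record in sqs_event["Records"]:
        envelope = json.loads(record["body"])
        results.append((envelope["detail-type"], envelope["detail"]))
    return results
```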

Pattern 3: Control everything with Step Functions

Step Functions
  ├─ Lambda (Pre-processing)
  ├─ Choice (Conditional branching)
  ├─ Parallel (Parallel processing)
  │    ├─ Lambda A
  │    └─ Lambda B
  ├─ Send to SQS (For large data)
  └─ Lambda (Post-processing/Notification)

By sending messages to SQS from within Step Functions, you can hand heavy data processing off to worker Lambdas.
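The "Send to SQS" step does not need a Lambda: Step Functions has a service integration for it. A sketch of such a Task state, built as a Python dict; the `arn:aws:states:::sqs:sendMessage` resource and the `QueueUrl`/`MessageBody` parameters are the real integration, while the helper name `sqs_send_state` and the `$.chunk` input path are assumptions.

```python
def sqs_send_state(queue_url, next_state):
    """Build a Task state that sends part of the state input to SQS."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage",
        "Parameters": {
            "QueueUrl": queue_url,
            # The ".$" suffix means the value is resolved from the state
            # input; "$.chunk" is a hypothetical field holding the data.
            "MessageBody.$": "$.chunk",
        },
        "Next": next_state,
    }
```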

Decision flowchart

When in doubt, work through this flow:

Do you want to make the processing asynchronous?
  ├─ Yes → Is there only 1 receiver?
  │          ├─ Yes → SQS
  │          └─ No  → EventBridge
  └─ No  → Do you need control over multiple steps?
              ├─ Yes → Step Functions
              └─ No  → Synchronous processing (Direct Lambda call) is fine

There are exceptions, of course, but as a rule of thumb this should cover roughly 80% of your initial decisions.
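The same flow written as a function, purely as a restatement of the chart above (the function and parameter names are made up for this sketch):

```python
def choose_service(asynchronous, single_receiver=True, multi_step=False):
    """Mirror the decision flowchart; returns the suggested starting point."""
    if asynchronous:
        return "SQS" if single_receiver else "EventBridge"
    if multi_step:
        return "Step Functions"
    return "Direct Lambda call"
```

For example, `choose_service(asynchronous=True, single_receiver=False)` suggests EventBridge.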

Summary

| Service | Best for | Keywords |
| --- | --- | --- |
| SQS | 1-to-1 async jobs | Queue, retry, DLQ, flow control |
| EventBridge | 1-to-many event delivery | Routing, decoupling, fan-out |
| Step Functions | Controlling multiple steps | Branching, visualization, orchestration |

In practice, you will often use them in combination rather than choosing just one. Understand the strengths of each and use them where they fit best.

