Translated by AI
A Practical Guide to Choosing Between SQS, EventBridge, and Step Functions
Introduction
When you set out to implement asynchronous processing on AWS, the number of options can be confusing:
- SQS (Simple Queue Service)
- EventBridge
- Step Functions
Anyone starting out with AWS eventually asks: "They're all services for passing messages, right? What's the difference?"
In my work, I use all three together within the same system. This article presents the criteria I use in practice to choose between them, along with concrete examples.
Understanding the differences in 3 lines
| Service | In a nutshell | Suitable scenarios |
|---|---|---|
| SQS | A queue that ensures job delivery | 1-to-1 asynchronous job execution |
| EventBridge | A bus that intelligently routes events | 1-to-many event delivery & decoupling |
| Step Functions | A workflow that controls sequences of steps | Processes involving branching, retries, and waits |
SQS: Choose this if you want to "make sure each job reliably gets processed"
Typical use case
API Gateway → Lambda (Reception) → SQS → Lambda (Processing)
In this pattern, the API immediately responds to the request with 202 Accepted, and the actual processing is performed in the background.
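The reception side of this pattern can be sketched as follows. This is a minimal illustration, not the author's implementation: the `handle_request` name, the `jobId` field, and the injected `enqueue` callable are all assumptions made so the logic can be read (and run) without AWS credentials.

```python
import json
import uuid
from typing import Callable


def handle_request(event: dict, enqueue: Callable[[str], None]) -> dict:
    """Reception Lambda: validate the body, enqueue the job, answer 202 right away."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    job_id = str(uuid.uuid4())  # hypothetical identifier the client can use later
    enqueue(json.dumps({"jobId": job_id, "payload": payload}))

    # 202 Accepted: the job was queued, not yet processed.
    return {"statusCode": 202, "body": json.dumps({"jobId": job_id})}


# In a real Lambda, `enqueue` would typically wrap boto3, for example:
#   sqs = boto3.client("sqs")
#   enqueue = lambda body: sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
```

Passing the sender in as a parameter keeps the handler testable; the heavy work stays in the Lambda on the other side of the queue.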
Criteria for choosing SQS
- 1-to-1 relationship between sender and receiver
- Ordering guarantee is required (FIFO queue)
- Want to retry if processing fails (Visibility timeout + Dead Letter Queue)
- Want to control throughput (Batch size, concurrency limits)
Implementation points
How to choose between FIFO vs Standard
| | Standard | FIFO |
|---|---|---|
| Throughput | Almost unlimited | 300 msg/sec (3,000 with batching) |
| Ordering guarantee | None | Yes (per MessageGroupId) |
| De-duplication | None | Yes (for 5 minutes) |
| Use cases | Log collection, notifications | Payment processing, order processing |
Practical advice: Start with Standard if you're unsure. FIFO is only necessary in cases where "it's a problem if requests from the same user are not processed in order."
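To make the "per MessageGroupId" ordering and the 5-minute de-duplication concrete, here is a small sketch of the arguments a sender might pass to `send_message` for a FIFO queue. The `build_fifo_message` helper and the choice of a content hash as the de-duplication ID are assumptions for illustration; `MessageGroupId` and `MessageDeduplicationId` are the actual SQS parameter names.

```python
import hashlib
import json


def build_fifo_message(queue_url: str, user_id: str, order: dict) -> dict:
    """Build keyword arguments for sqs.send_message on a FIFO queue."""
    body = json.dumps(order, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Messages sharing a group ID are delivered in order;
        # different groups can be processed in parallel.
        "MessageGroupId": user_id,
        # An identical ID within the de-duplication window is treated as a duplicate.
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }


# Usage (requires boto3 and a queue whose name ends in ".fifo"):
#   boto3.client("sqs").send_message(**build_fifo_message(url, "user-1", order))
```

Using the user ID as the group ID matches the case described above: requests from the same user stay ordered, while different users do not block each other.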
Always set up a Dead Letter Queue (DLQ)
# Terraform example
resource "aws_sqs_queue" "job_queue" {
  name = "job-queue"

  # After 3 failed receives, move the message to the DLQ
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.job_dlq.arn
    maxReceiveCount     = 3
  })
}

resource "aws_sqs_queue" "job_dlq" {
  name = "job-queue-dlq"
}
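If you don't set up a DLQ, a message that keeps failing is redelivered again and again until the queue's retention period expires (4 days by default), after which it is deleted without a trace. When a failure occurs in production, recovery is far easier with a DLQ, because the failed messages are kept where you can inspect and reprocess them.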
Lambda event source mapping configuration
resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn                   = aws_sqs_queue.job_queue.arn
  function_name                      = aws_lambda_function.processor.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5
  function_response_types            = ["ReportBatchItemFailures"]
}
By setting ReportBatchItemFailures, the function can report which messages failed, so only those are retried when a batch partially fails. Without it, a single failure causes the entire batch to be retried, and messages that already succeeded are processed again.
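On the function side, partial batch responses require the handler to return the IDs of the failed messages. A minimal sketch, assuming a hypothetical `process` function as the business logic (the `batchItemFailures` / `itemIdentifier` response shape is the one Lambda expects for SQS event sources):

```python
import json


def process(job: dict) -> None:
    """Hypothetical business logic; raises when the job cannot be handled."""
    if "file" not in job:
        raise ValueError("missing 'file'")


def handler(event: dict, context: object = None) -> dict:
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible again; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty list signals that the whole batch succeeded. Note that the handler must catch exceptions per record; an exception that escapes the handler still fails the entire batch.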
EventBridge: Choose this if "the sender doesn't need to know who receives it"
Typical use case
Lambda (Order Processing) → EventBridge → Lambda (Email Notification)
                                        → Lambda (Inventory Update)
                                        → Lambda (Analysis Log Recording)
This pattern fans out a single event to multiple Lambda functions.
Criteria for choosing EventBridge
- Want to deliver a single event to multiple services
- Want to decouple the sender and the receiver
- Want to route based on event content
- Possibility of adding more receivers in the future
Definitive difference from SQS
SQS: The sender knows "who to send it to" (specifies the queue URL).
EventBridge: The sender only issues "what happened" as an event. Who receives it is determined by rules.
This distinction matters a great deal for design. With EventBridge, adding a new process means adding a rule and a target; the existing sender code does not need to change.
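The sender's side of this contract can be sketched as follows. The `build_order_completed_entry` helper and the `orderId` / `amount` fields are assumptions for illustration; `Source`, `DetailType`, `Detail`, and `EventBusName` are the entry fields accepted by the EventBridge `PutEvents` API.

```python
import json


def build_order_completed_entry(order_id: str, amount: int) -> dict:
    """Describe what happened; say nothing about who will consume it."""
    return {
        "Source": "myapp.orders",
        "DetailType": "OrderCompleted",
        "Detail": json.dumps({"orderId": order_id, "amount": amount}),  # must be a JSON string
        "EventBusName": "default",
    }


# Usage (requires boto3):
#   entry = build_order_completed_entry("o-123", 12000)
#   boto3.client("events").put_events(Entries=[entry])
```

Nothing in the entry names a queue or a function, which is what lets receivers be added later purely through rules.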
Tips for rule design
{
  "source": ["myapp.orders"],
  "detail-type": ["OrderCompleted"],
  "detail": {
    "amount": [{ "numeric": [">", 10000] }]
  }
}
EventBridge rules are written in JSON patterns. The example above is a rule that matches "order completed events where the amount exceeds 10,000 yen."
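To build intuition for how such a pattern is evaluated, here is a deliberately simplified local matcher. It is an illustration only, not EventBridge's actual implementation: it assumes just two features (lists of exact values and the `numeric` comparison) and ignores everything else the real pattern language supports.

```python
import operator

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "=": operator.eq}


def _value_matches(candidates: list, value) -> bool:
    """A field matches if ANY candidate in the list matches."""
    for cand in candidates:
        if isinstance(cand, dict) and "numeric" in cand:
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            spec = cand["numeric"]  # e.g. [">", 0, "<=", 5]
            if all(_OPS[op](value, bound) for op, bound in zip(spec[0::2], spec[1::2])):
                return True
        elif cand == value:
            return True
    return False


def matches(pattern: dict, event: dict) -> bool:
    """ALL fields named in the pattern must match; extra event fields are ignored."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not _value_matches(expected, event[key]):
            return False
    return True
```

The two comments capture the rule of thumb: values inside one list are OR-ed, while separate fields are AND-ed.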
Practical design guidelines:
- source should be the service name (myapp.orders).
- detail-type should be the event type (OrderCompleted).
- Filter within the detail object.
Standardizing these naming conventions within your team makes it easier to manage as events increase.
Step Functions: Choose this if "you want to control the flow of processing"
Typical use case
Step Functions
├─ Step 1: CSV file validation (Lambda)
├─ Step 2: Determine validation result (Choice state)
│ ├─ OK → Step 3: Bulk registration to database (Lambda)
│ └─ NG → Step 3': Error notification (Lambda)
└─ Step 4: Completion notification (Lambda)
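The flow above could be expressed in Amazon States Language roughly as follows, written here as a Python dict that serializes to the JSON definition. The state names, the `$.valid` field, and the Lambda ARNs are placeholders, and routing both branches into the completion step is an assumption about the diagram.

```python
import json

# Placeholder ARN prefix; replace REGION and ACCOUNT_ID with real values.
LAMBDA = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"

definition = {
    "StartAt": "ValidateCsv",
    "States": {
        "ValidateCsv": {"Type": "Task", "Resource": LAMBDA + "validate-csv",
                        "Next": "CheckResult"},
        # The Choice state branches on the validation result, with no Lambda code.
        "CheckResult": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.valid", "BooleanEquals": True,
                         "Next": "BulkRegister"}],
            "Default": "NotifyError",
        },
        "BulkRegister": {"Type": "Task", "Resource": LAMBDA + "bulk-register",
                         "Next": "NotifyCompletion"},
        "NotifyError": {"Type": "Task", "Resource": LAMBDA + "notify-error",
                        "Next": "NotifyCompletion"},
        "NotifyCompletion": {"Type": "Task", "Resource": LAMBDA + "notify-completion",
                             "End": True},
    },
}


def transition_targets(defn: dict) -> set:
    """Collect every state name referenced by StartAt, Next, Default, or Choices."""
    targets = {defn["StartAt"]}
    for state in defn["States"].values():
        targets.update(state[k] for k in ("Next", "Default") if k in state)
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return targets


asl_json = json.dumps(definition, indent=2)  # what you would hand to Step Functions
```

The small `transition_targets` helper is only a sanity check that no transition points at an undefined state, which is an easy mistake to make when editing definitions by hand.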
Criteria for choosing Step Functions
- Want to execute multiple Lambdas in sequence
- There is conditional branching (Want to separate processing based on success/failure)
- Want fine-grained control over retries and timeouts
- Want to visualize the progress of processing
- There is a step that waits for human approval
Don't try too hard with SQS + Lambda
You might think, "Can't I just chain Lambdas with SQS and not need Step Functions?" While technically possible:
| | SQS Chaining | Step Functions |
|---|---|---|
| Branching | Implemented manually inside Lambda | Declaratively defined via Choice state |
| Error Handling | DLQ + custom retry logic | Declaratively defined via Retry / Catch |
| Visualization | Traced by hand through CloudWatch Logs | Execution state visible in the console |
| Timeout | Managed manually | Set once via TimeoutSeconds |
A practical rule of thumb is to consider Step Functions once you identify a process flow with 3 or more steps.
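As an illustration of the declarative error handling in the table above, a single Task state might carry its retry, catch, and timeout settings like this. The state names and the ARN are placeholders; `Retry`, `Catch`, `TimeoutSeconds`, `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` are Amazon States Language fields. The `retry_delays` helper is an assumption added only to show the resulting wait times (it ignores optional settings such as a maximum delay or jitter).

```python
task_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:bulk-register",  # placeholder
    "TimeoutSeconds": 60,
    "Retry": [{
        "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }],
    # If retries are exhausted, route to an error-handling state instead of failing.
    "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error",
               "Next": "NotifyError"}],
    "Next": "NotifyCompletion",
}


def retry_delays(retrier: dict) -> list:
    """Seconds waited before each retry: IntervalSeconds * BackoffRate ** n."""
    interval = retrier.get("IntervalSeconds", 1)
    rate = retrier.get("BackoffRate", 2.0)
    attempts = retrier.get("MaxAttempts", 3)
    return [interval * rate ** n for n in range(attempts)]


delays = retry_delays(task_state["Retry"][0])  # [2.0, 4.0, 8.0]
```

Achieving the same behavior with chained SQS queues would mean writing the backoff and the fallback routing by hand in each Lambda.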
How to choose between Express vs Standard
| | Standard | Express |
|---|---|---|
| Max execution time | 1 year | 5 minutes |
| Execution guarantee | Exactly-once | At-least-once (asynchronous) / at-most-once (synchronous) |
| Pricing | Charged per state transition | Charged by execution time + count |
| Use cases | Long-running workflows, approval flows | High-speed processing of large data |
Standard is suitable for batch processing like CSV validation → DB registration, while Express is suitable for API request processing pipelines.
Combination patterns
In practice, it is common to combine all three.
Pattern 1: API → SQS → Lambda (Basic asynchronous processing)
Client → API Gateway → Lambda → SQS → Lambda (Worker)
                       │
                       └─ Returns 202 Accepted immediately
The simplest pattern. Simply separating "reception" from "processing" stabilizes API response times.
Pattern 2: Lambda → EventBridge → Multiple Lambdas (Event-driven)
Lambda → EventBridge ─┬─ Rule A → Lambda (Notification)
                      ├─ Rule B → Lambda (Log Recording)
                      └─ Rule C → SQS → Lambda (Heavy processing)
Placing SQS after EventBridge is also a common pattern. It uses EventBridge for routing while utilizing SQS for buffering and retry control.
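One detail worth knowing for this pattern: when a rule targets an SQS queue without an input transformer, the message body is the whole EventBridge event envelope as JSON, so the worker has to unwrap it. A minimal sketch, with `extract_details` as a hypothetical helper name:

```python
import json


def extract_details(sqs_event: dict, detail_type: str = "OrderCompleted") -> list:
    """Unwrap EventBridge envelopes delivered through SQS and return their details."""
    details = []
    for record in sqs_event.get("Records", []):
        envelope = json.loads(record["body"])  # the full EventBridge event
        if envelope.get("detail-type") == detail_type:
            details.append(envelope["detail"])  # already a dict, not a JSON string
    return details
```

In a real worker you would likely combine this with per-message failure reporting so that one bad envelope does not force the whole batch to be retried.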
Pattern 3: Control everything with Step Functions
Step Functions
├─ Lambda (Pre-processing)
├─ Choice (Conditional branching)
├─ Parallel (Parallel processing)
│ ├─ Lambda A
│ └─ Lambda B
├─ Send to SQS (For large data)
└─ Lambda (Post-processing/Notification)
By sending messages to SQS from within Step Functions, you can hand heavy data processing off to worker Lambdas while the workflow keeps control of the overall flow.
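Step Functions can send to SQS directly through its service integration, without a Lambda in between. A sketch of such a state, written as a Python dict that serializes to the JSON definition; the queue URL and the `PostProcess` state name are placeholders, and passing the entire state input as the message body is an assumption about what the worker needs.

```python
import json

send_to_sqs_state = {
    "Type": "Task",
    # Optimized SQS integration: no Lambda function is needed to enqueue.
    "Resource": "arn:aws:states:::sqs:sendMessage",
    "Parameters": {
        "QueueUrl": "https://sqs.REGION.amazonaws.com/ACCOUNT_ID/job-queue",  # placeholder
        # The ".$" suffix means the value is a path into the state input.
        "MessageBody.$": "$",
    },
    "Next": "PostProcess",
}

state_json = json.dumps(send_to_sqs_state, indent=2)
```

This variant is fire-and-forget: the workflow moves on as soon as the message is accepted by the queue, not when the worker finishes.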
Decision flowchart
When in doubt, walk through this flow:
Do you want to make the processing asynchronous?
├─ Yes → Is there only 1 receiver?
│ ├─ Yes → SQS
│ └─ No → EventBridge
└─ No → Do you need control over multiple steps?
├─ Yes → Step Functions
└─ No → Synchronous processing (Direct Lambda call) is fine
Of course, there are exceptions, but this will cover 80% of your initial decisions.
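The same flow, written as a small function for readers who prefer code to diagrams. The function and parameter names are made up for this sketch; the branches mirror the flowchart one to one.

```python
def choose_service(asynchronous: bool, single_receiver: bool = True,
                   multi_step: bool = False) -> str:
    """Encode the decision flowchart as nested conditions."""
    if asynchronous:
        # Asynchronous: the number of receivers decides.
        return "SQS" if single_receiver else "EventBridge"
    if multi_step:
        return "Step Functions"
    return "Direct Lambda call"


# Example: an order event consumed by three independent services.
choice = choose_service(asynchronous=True, single_receiver=False)  # "EventBridge"
```

Like the flowchart, this only gives a starting point; the combination patterns above show that real systems often end up using more than one of the three.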
Summary
| Service | Best for | Keywords |
|---|---|---|
| SQS | 1-to-1 async jobs | Queue, retry, DLQ, flow control |
| EventBridge | 1-to-many event delivery | Routing, decoupling, fan-out |
| Step Functions | Controlling multiple steps | Branching, visualization, orchestration |
In practice, you will often use them in combination rather than choosing just one. Understand the strengths of each and use them where they fit best.