
8 Key Quality Attributes for System Design


Introduction

In design reviews and incident response scenarios, terms like "improving availability" or "low maintainability" frequently come up. I have used them somewhat loosely, but I didn't fully grasp their precise meanings or their differences from other related terms. To organize my own understanding, I have summarized eight quality attributes ending in "-ity" commonly used in system design.


Glossary

① Availability

The percentage of time that a system is up and available for use.

It is expressed by the following formula:

Availability = MTBF / (MTBF + MTTR)

MTBF (Mean Time Between Failures): The average time the system runs normally between one failure and the next
MTTR (Mean Time To Repair): The average time from the occurrence of a failure until service is restored

In SLAs, this is often expressed as "99.9%" (three nines) or "99.99%" (four nines).

| Availability | Annual Downtime |
| --- | --- |
| 99.9% | Approx. 8.8 hours |
| 99.99% | Approx. 53 minutes |
| 99.999% | Approx. 5.3 minutes |
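
The formula and the downtime table can be reproduced with a few lines of Go. This is a minimal sketch: the function names `Availability` and `AnnualDowntime` and the figures in `main` are illustrative assumptions, and a year is taken to be 365 days.

```go
package main

import (
	"fmt"
	"time"
)

// Availability returns MTBF / (MTBF + MTTR) as a fraction between 0 and 1.
// The result is NaN if both arguments are zero.
func Availability(mtbf, mttr time.Duration) float64 {
	return float64(mtbf) / float64(mtbf+mttr)
}

// AnnualDowntime returns the expected downtime over a 365-day year
// for a given availability fraction (for example 0.999).
func AnnualDowntime(availability float64) time.Duration {
	const year = 365 * 24 * time.Hour
	return time.Duration((1 - availability) * float64(year))
}

func main() {
	// Hypothetical figures: one failure every 999 hours, one hour to repair.
	a := Availability(999*time.Hour, 1*time.Hour)
	fmt.Printf("availability: %.3f%%\n", a*100)
	fmt.Printf("annual downtime: %.1f hours\n", AnnualDowntime(a).Hours())
}
```

Halving MTTR in this model raises availability just as effectively as doubling MTBF does, which is why fast recovery appears among the examples that follow.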

Examples:

  • Eliminating single points of failure with a multi-AZ configuration
  • Removing failed nodes using a load balancer + health checks
  • Shortening MTTR through failure detection and automated recovery
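
The second example, removing failed nodes with health checks, might look roughly like the sketch below. `Backend`, `Balancer`, and the addresses are hypothetical names made up for illustration. The `Healthy` flag is assumed to be updated by a periodic health check that is not shown, and the type is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend is a hypothetical node behind the load balancer.
type Backend struct {
	Addr    string
	Healthy bool // assumed to be updated by a periodic health check (not shown)
}

// Balancer picks backends in round-robin order, skipping unhealthy ones.
type Balancer struct {
	backends []*Backend
	next     int
}

var ErrNoHealthyBackend = errors.New("no healthy backend")

// Pick returns the next healthy backend, or an error if none is healthy.
func (b *Balancer) Pick() (*Backend, error) {
	for i := 0; i < len(b.backends); i++ {
		candidate := b.backends[(b.next+i)%len(b.backends)]
		if candidate.Healthy {
			b.next = (b.next + i + 1) % len(b.backends)
			return candidate, nil
		}
	}
	return nil, ErrNoHealthyBackend
}

func main() {
	lb := &Balancer{backends: []*Backend{
		{Addr: "10.0.0.1", Healthy: true},
		{Addr: "10.0.0.2", Healthy: false}, // failed its health check
		{Addr: "10.0.0.3", Healthy: true},
	}}
	for i := 0; i < 4; i++ {
		b, err := lb.Pick()
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println("routing to", b.Addr)
	}
}
```

Requests keep flowing to the remaining nodes, so a single node failure does not show up as downtime for users.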

Often confused with: Reliability

Availability is the "percentage of time operational," while Reliability is the "ability to continue operating correctly." Even if failures occur frequently, if recovery is fast, Availability can be kept high, though Reliability would be low.


② Reliability

The ability of a system to continue operating as expected; in other words, how resistant it is to failure.

The longer the MTBF (Mean Time Between Failures), the higher the reliability.

Examples:

  • API retry logic and idempotency (ensuring results remain consistent even if the same request is sent multiple times)
  • Preventing cascading failures with the circuit breaker pattern
  • Data integrity checks during writes
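
Retries and idempotency work as a pair: a retry is only safe if repeating the request cannot apply it twice. The sketch below assumes a hypothetical `PaymentService` that remembers results by idempotency key. The `failNext` flag simulates a response that is lost after the charge has already been applied.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// PaymentService is a hypothetical service that deduplicates requests
// by idempotency key, so a retried request is not applied twice.
type PaymentService struct {
	mu       sync.Mutex
	seen     map[string]int // idempotency key -> balance after the charge
	balance  int
	failNext bool // simulates a response lost after the charge succeeded
}

func NewPaymentService(balance int) *PaymentService {
	return &PaymentService{seen: make(map[string]int), balance: balance}
}

var ErrTimeout = errors.New("timeout")

// Charge subtracts amount once per key and returns the resulting balance.
func (s *PaymentService) Charge(key string, amount int) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if result, ok := s.seen[key]; ok {
		return result, nil // replay the stored result instead of charging again
	}
	s.balance -= amount
	s.seen[key] = s.balance
	if s.failNext {
		s.failNext = false
		return 0, ErrTimeout // the charge happened, but the caller never learns it
	}
	return s.balance, nil
}

// ChargeWithRetry retries up to attempts times, always reusing the same key.
func ChargeWithRetry(s *PaymentService, key string, amount, attempts int) (int, error) {
	lastErr := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		balance, err := s.Charge(key, amount)
		if err == nil {
			return balance, nil
		}
		lastErr = err
	}
	return 0, lastErr
}

func main() {
	svc := NewPaymentService(100)
	svc.failNext = true // the first response is lost
	balance, err := ChargeWithRetry(svc, "order-42", 30, 3)
	fmt.Println(balance, err)
}
```

Without the key lookup, the retry after the lost response would charge the account a second time. A production client would normally also wait between attempts (for example with exponential backoff), which is omitted here to keep the sketch short.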

Often confused with: Availability

| | Availability | Reliability |
| --- | --- | --- |
| Question | Is the system available for use? | Is the system operating correctly? |
| Metric | Uptime (%) | MTBF |
| Improvement | Redundancy / Auto-recovery | Reducing bugs / Fault-tolerant design |

③ Durability

The ability to ensure data is not lost. Data persistence.

While Availability refers to whether a system is "usable," Durability refers to whether data "remains intact."

Examples:

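  • Amazon S3 is designed for "99.999999999%" (eleven nines) of Durability. In the S3 Standard storage class, data is stored redundantly across multiple AZs.
  • Database WAL (Write-Ahead Logging) ensures that committed data is not lost in the event of a crash, because each change is written to a log on disk before it is applied.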
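
The idea behind write-ahead logging can be sketched with a toy key-value store: append the change to a log, force it to disk, and only then update the in-memory state. The `Store` type and its `key=value` line format are assumptions made for this illustration (keys and values must not contain `=` or newlines). Real databases add checkpoints, checksums, and log truncation.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Store is a toy key-value store: every write is appended to a log file
// and flushed to disk before the in-memory map is updated.
type Store struct {
	log  *os.File
	data map[string]string
}

// Open replays the existing log (if any) to rebuild the in-memory state.
func Open(path string) (*Store, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	s := &Store{log: f, data: make(map[string]string)}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if key, value, ok := strings.Cut(scanner.Text(), "="); ok {
			s.data[key] = value // later entries overwrite earlier ones
		}
	}
	if err := scanner.Err(); err != nil {
		f.Close()
		return nil, err
	}
	return s, nil
}

// Put logs the write, forces it to stable storage, then applies it.
func (s *Store) Put(key, value string) error {
	if _, err := fmt.Fprintf(s.log, "%s=%s\n", key, value); err != nil {
		return err
	}
	if err := s.log.Sync(); err != nil { // fsync before acknowledging the write
		return err
	}
	s.data[key] = value
	return nil
}

func (s *Store) Get(key string) (string, bool) {
	v, ok := s.data[key]
	return v, ok
}

func (s *Store) Close() error { return s.log.Close() }

func main() {
	path := filepath.Join(os.TempDir(), "wal-demo.log")
	defer os.Remove(path)

	s, err := Open(path)
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	if err := s.Put("user:1", "alice"); err != nil {
		fmt.Println("put:", err)
		return
	}
	s.Close() // stands in for a crash: the in-memory map is gone

	recovered, err := Open(path)
	if err != nil {
		fmt.Println("reopen:", err)
		return
	}
	defer recovered.Close()
	v, _ := recovered.Get("user:1")
	fmt.Println("recovered:", v)
}
```

Because `Put` returns only after `Sync` succeeds, any write the caller saw acknowledged can be rebuilt from the log after a restart.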

Often confused with: Availability

Even if an S3 bucket is temporarily inaccessible (reduced Availability), the data itself is not lost (Durability is maintained). Availability and Durability are independent characteristics.


④ Scalability

The ability of a system to expand in response to increased load.

There are two main methods for expansion:

| Method | Description | Example |
| --- | --- | --- |
| Scale-out (Horizontal) | Increasing the number of servers | Adding EC2 instances |
| Scale-up (Vertical) | Increasing server specifications | Changing instance types |

To build a system that scales out easily, stateless design is crucial. If session information is held in a server's memory, a request routed to a different server after scale-out cannot find its session, so session data is kept in an external store such as Redis.
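
A minimal sketch of the stateless approach follows. The `SessionStore` interface stands for the external store, and `memoryStore` is an in-process stand-in used only so the example runs without a Redis server. All type and method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionStore abstracts the external session store (Redis in the text above).
type SessionStore interface {
	Get(sessionID string) (string, bool)
	Set(sessionID, user string)
}

// memoryStore is an in-process stand-in so the sketch runs without Redis.
type memoryStore struct {
	mu sync.RWMutex
	m  map[string]string
}

func newMemoryStore() *memoryStore {
	return &memoryStore{m: make(map[string]string)}
}

func (s *memoryStore) Get(sessionID string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	user, ok := s.m[sessionID]
	return user, ok
}

func (s *memoryStore) Set(sessionID, user string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[sessionID] = user
}

// Server holds no session state of its own, so any instance can serve any request.
type Server struct {
	name  string
	store SessionStore
}

func (s *Server) Login(sessionID, user string) {
	s.store.Set(sessionID, user)
}

func (s *Server) WhoAmI(sessionID string) string {
	user, ok := s.store.Get(sessionID)
	if !ok {
		return "anonymous"
	}
	return user
}

func main() {
	shared := newMemoryStore()
	a := &Server{name: "server-a", store: shared}
	b := &Server{name: "server-b", store: shared} // added during scale-out

	a.Login("session-1", "alice")
	// The next request lands on a different instance and still finds the session.
	fmt.Println(b.name, "sees", b.WhoAmI("session-1"))
}
```

If each `Server` kept its own map instead of sharing the store, the request handled by `server-b` would see an anonymous user.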

Examples:

  • Auto Scaling to automatically adjust instance counts based on traffic
  • Distributing read load with database read replicas

Often confused with: Elasticity

Scalability refers to "having the capacity to expand," while Elasticity refers to "automatically expanding and contracting based on load." Auto Scaling is an implementation example of Elasticity.
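
The "expanding and contracting" part of Elasticity can be illustrated with a simple proportional rule: scale the instance count by the ratio of the observed load metric to its target, then clamp the result. This is a simplified sketch under that assumption, not the algorithm any particular Auto Scaling service uses, and `DesiredInstances` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// DesiredInstances scales the current count in proportion to how far the
// observed metric is from its target, then clamps the result to
// [minInstances, maxInstances]. A non-positive target leaves the count unchanged.
func DesiredInstances(current int, observed, target float64, minInstances, maxInstances int) int {
	if target <= 0 {
		return current
	}
	desired := int(math.Ceil(float64(current) * observed / target))
	if desired < minInstances {
		return minInstances
	}
	if desired > maxInstances {
		return maxInstances
	}
	return desired
}

func main() {
	// Hypothetical policy: keep average CPU near 60%, between 2 and 10 instances.
	fmt.Println(DesiredInstances(4, 90, 60, 2, 10)) // load above target: scale out
	fmt.Println(DesiredInstances(4, 15, 60, 2, 10)) // load below target: scale in
}
```

A system can be scalable without being elastic: if an operator has to run this calculation by hand and add servers manually, the capacity to expand exists but the automatic adjustment does not.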


⑤ Maintainability

The ability to easily modify, update, and operate a system.

Maintainability also contributes to a shorter MTTR (time to recover from a failure): if code is easy to read and the impact of a change is contained, bug fixes can be shipped quickly.

Examples:

  • Keeping functions and classes small and clarifying responsibilities
  • Communicating the intent of the code through documentation and naming
  • Organizing dependencies to localize the impact of changes

Often confused with: Extensibility

Maintainability is "how easy it is to fix existing code," while Extensibility is "how easy it is to add new features." They are similar but look at the system from different perspectives.


⑥ Observability

The ability to observe the internal state of a system from the outside.

There are three pillars of Observability:

| Pillar | Description | Example Tools |
| --- | --- | --- |
| Logs | Recording events | CloudWatch Logs / Datadog |
| Metrics | Numerical time-series data | CloudWatch Metrics / Prometheus |
| Traces | Processing paths of requests | AWS X-Ray / Jaeger |

Often confused with: Monitoring

Monitoring is a mechanism to "detect known problems." An example would be issuing an alert when CPU usage exceeds 80%.

Observability refers to the state where "unknown problems can also be investigated." When a failure occurs, a system with high Observability allows you to combine logs, metrics, and traces to track "why it happened."
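
One small building block of this is attaching a trace ID to every log line, so that all events belonging to a single request can be pulled out later. The sketch below uses Go's standard `log/slog` package (Go 1.21 or later). The handler `handleOrder`, the `trace_id` key, and the in-memory buffer are assumptions for illustration. A real service would write to stdout or a log collector and would usually take the ID from a tracing library.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
	"strings"
)

// handleOrder logs each step of a hypothetical request with its trace ID,
// so that all lines belonging to one request can be found later.
func handleOrder(logger *slog.Logger, traceID string, stock int) {
	reqLog := logger.With("trace_id", traceID)
	reqLog.Info("order received")
	if stock == 0 {
		reqLog.Error("order failed", "reason", "out of stock")
		return
	}
	reqLog.Info("order completed", "remaining_stock", stock-1)
}

// messagesForTrace returns the messages of all JSON log lines with the given trace ID.
func messagesForTrace(logs, traceID string) []string {
	var msgs []string
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		var entry struct {
			Msg     string `json:"msg"`
			TraceID string `json:"trace_id"`
		}
		if json.Unmarshal(scanner.Bytes(), &entry) == nil && entry.TraceID == traceID {
			msgs = append(msgs, entry.Msg)
		}
	}
	return msgs
}

// runDemo handles two requests and returns the raw JSON log output.
func runDemo() string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	handleOrder(logger, "trace-a", 3)
	handleOrder(logger, "trace-b", 0)
	return buf.String()
}

func main() {
	logs := runDemo()
	fmt.Print(logs)
	// Reconstruct what happened to one specific request.
	fmt.Println(messagesForTrace(logs, "trace-b"))
}
```

An alert on the error rate would only say that something failed. The trace ID lets you follow one failing request step by step, which is the kind of question Observability is meant to answer.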


⑦ Testability

The ability to easily test a system.

A design with high testability also tends to improve Maintainability, because code that is easy to test is usually small, loosely coupled, and safe to change.

Examples of designs that increase testability:

  • Dependency Injection (DI): By passing external dependencies (DB/external APIs) via interfaces, they can be swapped for mocks during testing.
  • Separation of side effects: By separating business logic from I/O processes, the logic portion can be unit-tested.
  • Small functions: The more a single function focuses on a single task, the simpler the test cases become.
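
The Dependency Injection point can be sketched as follows. `UserRepository`, `Greeter`, and `fakeRepo` are hypothetical names. The production implementation of the interface, which would query a database, is intentionally left out.

```go
package main

import (
	"errors"
	"fmt"
)

// UserRepository is the external dependency the business logic needs.
type UserRepository interface {
	FindName(id int) (string, error)
}

// Greeter contains only logic; it receives its repository from outside.
type Greeter struct {
	repo UserRepository
}

func NewGreeter(repo UserRepository) *Greeter {
	return &Greeter{repo: repo}
}

// Greet builds a greeting for the user with the given ID.
func (g *Greeter) Greet(id int) (string, error) {
	name, err := g.repo.FindName(id)
	if err != nil {
		return "", fmt.Errorf("greet user %d: %w", id, err)
	}
	return "Hello, " + name + "!", nil
}

var ErrNotFound = errors.New("user not found")

// fakeRepo is a test double backed by a map instead of a database.
type fakeRepo map[int]string

func (f fakeRepo) FindName(id int) (string, error) {
	name, ok := f[id]
	if !ok {
		return "", ErrNotFound
	}
	return name, nil
}

func main() {
	g := NewGreeter(fakeRepo{1: "Alice"})

	msg, err := g.Greet(1)
	fmt.Println(msg, err)

	_, err = g.Greet(2)
	fmt.Println(errors.Is(err, ErrNotFound))
}
```

Because `Greeter` depends on the interface rather than on a concrete database client, both the success path and the error path can be exercised in a unit test without any external service.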

⑧ Extensibility

The ability to add new features without changing existing code.

This corresponds to the OCP (Open/Closed Principle) of the SOLID principles: "Software entities should be open for extension, but closed for modification."

Example:

// Low extensibility: The function must be modified every time a new notification method is added
// (sendEmail and sendSlack are assumed to be defined elsewhere)
func Notify(method string, message string) {
    if method == "email" {
        sendEmail(message)
    } else if method == "slack" {
        sendSlack(message)
    }
    // Every new method needs another branch here
}

// High extensibility: A new notification method is added by writing a new type
// that implements Notifier; SendNotification itself does not change
type Notifier interface {
    Notify(message string) error
}

func SendNotification(n Notifier, message string) error {
    return n.Notify(message)
}
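
To make the interface version concrete, the sketch below adds two implementations. `EmailNotifier`, `SMSNotifier`, and the `deliver` helper are hypothetical; they print to standard output instead of calling a real mail server or SMS gateway.

```go
package main

import (
	"errors"
	"fmt"
)

// Notifier and SendNotification are repeated from the example above.
type Notifier interface {
	Notify(message string) error
}

func SendNotification(n Notifier, message string) error {
	return n.Notify(message)
}

// deliver is a stand-in for a real delivery mechanism; it only prints.
func deliver(channel, target, message string) error {
	if message == "" {
		return errors.New("empty message")
	}
	fmt.Printf("[%s] to %s: %s\n", channel, target, message)
	return nil
}

// EmailNotifier is one existing notification method.
type EmailNotifier struct{ To string }

func (e EmailNotifier) Notify(message string) error {
	return deliver("email", e.To, message)
}

// SMSNotifier is added later. Neither SendNotification nor EmailNotifier
// has to be modified to support it.
type SMSNotifier struct{ Number string }

func (s SMSNotifier) Notify(message string) error {
	return deliver("sms", s.Number, message)
}

func main() {
	notifiers := []Notifier{
		EmailNotifier{To: "ops@example.com"},
		SMSNotifier{Number: "+81-00-0000-0000"},
	}
	for _, n := range notifiers {
		if err := SendNotification(n, "disk usage above 90%"); err != nil {
			fmt.Println("notify failed:", err)
		}
	}
}
```

Adding `SMSNotifier` touched no existing code, which is what "open for extension, closed for modification" means in practice.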

Often confused with: Maintainability

| | Maintainability | Extensibility |
| --- | --- | --- |
| Question | Is it easy to fix existing code? | Is it easy to add new features? |
| Focus | Bug fixes / Refactoring | Feature additions / Specification changes |

Trade-offs

Quality attributes often conflict with one another. Since you cannot maximize all of them at once, the essence of design is deciding which to prioritize based on the system's requirements.

| Trade-off | Description |
| --- | --- |
| Availability ↑ vs Consistency ↓ | When data is replicated across multiple nodes for redundancy, a network partition forces a choice between rejecting some requests and serving possibly stale data (CAP theorem) |
| Scalability ↑ vs Maintainability ↓ | Distributed systems scale horizontally easily but increase complexity and become harder to maintain |
| Extensibility ↑ vs Maintainability ↓ | Increasing abstraction for extensibility can make the code more complex and harder to read |
| Observability ↑ vs Performance ↓ | Detailed logging and tracing increase I/O costs and can impact performance |

Conclusion

| Term | Japanese Translation | Definition |
| --- | --- | --- |
| Availability | 可用性 | Percentage of time the system is operational |
| Reliability | 信頼性 | Ability to continue operating as expected |
| Durability | 耐久性 | Ability to ensure data is not lost |
| Scalability | スケーラビリティ | Ability to expand in response to load increase |
| Maintainability | 保守性 | Ability to easily modify and update |
| Observability | オブザーバビリティ | Ability to observe internal state from outside |
| Testability | テスト容易性 | Ability to test easily |
| Extensibility | 拡張性 | Ability to add features without changing existing code |

When these terms come up in design discussions, asking "what are we prioritizing?" and "what are we trading off?" will make the conversation more precise.
