iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🍃

Exploring the Performance Impact of Retries in Amazon Aurora DSQL

に公開

Summary of what I investigated

When a conflict occurs in Amazon Aurora DSQL, an exception is thrown for all transactions except the one that successfully committed first, necessitating retry logic on the application side. This article covers a quick investigation into the performance impact when actual contention occurs and retries are required.

Hypothetical Scenario

Specifically, I assumed a case where multiple threads acquire an exclusive lock on the same record and update its value. In a Java Servlet API-based web server, each HTTP request is executed as an independent thread. When these HTTP requests perform operations like updating the balance of the same point account or updating the stock quantity of the same product, cases similar to this scenario can occur.

I kept the test implementation extremely simple to facilitate a quick investigation and to highlight the difference in processing time based on different methods.

  1. First, INSERT a new record with an initial value of 0. ... Process ⓪
  2. UPDATE that record 10 times, adding 1 to the value each time. ... Processes ①–⑪
  3. Retrieve the initial value and the result of the 10 additions from the record originally inserted. ... Process ⑫

The scenario involves causing contention by making processes ②–⑪ parallel using multi-threading. The application issuing the SQL that triggers contention is running on CloudShell in the same region as the Aurora DSQL endpoint.

At the time of testing, the Aurora DSQL endpoint was outside the VPC.

case0 - Case without contention

Before looking at the case where contention occurs, I first investigated a case without contention to serve as a baseline. Specifically, this is the case where processes ②–⑪ are executed sequentially in a single thread.

Processing Time (ms)
#01 289
#02 290
#03 290
#04 294
#05 295
#06 296
#07 298
#08 302
#09 305
#10 310

I sorted the results of 12 executions in ascending order by processing time and removed the first and last rows. The average value was approximately 297 milliseconds.

case1 - Case with contention (no optimization)

This is a case where contention occurs. Specifically, processes ②–⑪ are all executed simultaneously using multi-threading. When an exception is thrown due to contention, the thread immediately executes a retry process.

Processing Time (ms)
#01 324
#02 330
#03 332
#04 351
#05 364
#06 367
#07 390
#08 407
#09 413
#10 424

I sorted the results of 12 executions in ascending order by processing time and removed the first and last rows. The average value was approximately 370 milliseconds. Although it takes about 1.2 times as long as the case without contention, no significant processing delay occurred. However, unless the retry count was set to 20 or more, there were cases where the retry limit was reached before the process could succeed, suggesting that a considerable number of retries were occurring. It is easy to imagine that the system would be under significant load if traffic volume were high and cases requiring retries were to accumulate.

case2 - Case with contention (Backoff setting added)

This is a case where contention occurs, but an interval is set for the execution of the retry process, and it is configured to introduce jitter.

Processing Time (ms)
#01 360
#02 369
#03 375
#04 378
#05 383
#06 394
#07 406
#08 409
#09 433
#10 433

I sorted the results of 12 executions in ascending order by processing time and removed the first and last rows. The average value was approximately 394 milliseconds. The processing time is about 1.06 times that of the case without contention, and it is almost the same as the case without the Backoff setting. Since the process succeeded in most cases even when the retry limit was set to 10, the number of retries was halved compared to the case without the Backoff setting, and the system load was significantly reduced.

case3 - Case with contention (Data model change)

This is a proposal where the application was modified, including the data structure, to eliminate the logic of "modifying the same record" which causes contention. Subsequently, processes ①–⑪ are all executed simultaneously in multiple threads. If an exception is thrown due to contention, the thread immediately executes a retry process.

  1. First, INSERT a new record with an initial value of 0. ... Process ⓪
  2. INSERT a new record specifying the value to be added and the target record. ... Processes ①–⑪
  3. Retrieve the initial value and the result of the 10 additions from the record originally inserted. ... Process ⑫

    UPDATE is gone, and the update process is now just INSERT
Processing Time (ms)
#01 67
#02 87
#03 88
#04 91
#05 91
#06 94
#07 99
#08 110
#09 115
#10 202

I sorted the results of 12 executions in ascending order by processing time and removed the first and last rows. The average value was approximately 104 milliseconds. This achieved about 3 times the processing performance of the case without contention, proving that bold changes to the data structure are also effective in extracting the performance of Aurora DSQL.

Exceptions due to contention also occur with INSERT INTO

When inserting a new record, it seems like contention shouldn't occur because no exclusive locks are acquired and the record to be updated is different from other transactions. However, in reality, exceptions such as the following may occur (though not always).

Caused by: org.springframework.dao.CannotAcquireLockException: PreparedStatementCallback; SQL [INSERT INTO e_table (vid, value) VALUES (?, ?)]; ERROR: schema has been updated by another transaction, please retry: (OC001)
  Position: 13
        at org.springframework.jdbc.support.SQLStateSQLExceptionTranslator.doTranslate(SQLStateSQLExceptionTranslator.java:128)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:107)
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:116)
        at org.springframework.jdbc.core.JdbcTemplate.translateException(JdbcTemplate.java:1556)
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:677)
        at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:972)
        at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:1016)
        at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:1026)
        at ... (omitted below)

The same CannotAcquireLockException as when trying to acquire an exclusive lock on the same record is thrown, but there are subtle differences such as a different SQLState.

Item Exception occurring in Case 1, Case 2 Exception occurring in Case 3
Exception Class CannotAcquireLockException CannotAcquireLockException
SQLState OC000 OC001
Meaning of error The changes conflicted with another transaction. The schema has been updated by another transaction.

It seems the INSERT statement itself is not the culprit; rather, it appears the error occurs because the results of the CREATE TABLE statement executed for testing have not yet been reflected in the computing resources handling each parallel connection. Since it is unclear how many seconds must pass after a schema change for it to be safe, it seems necessary to implement applications on the premise that this type of error could occur at any time.

The User Guide also mentions that OC001, along with OC000, can be resolved by retrying, so it is likely best to start by retrying.
https://docs.aws.amazon.com/aurora-dsql/latest/userguide/troubleshooting.html#troubleshooting-occ

Summary

I conducted a brief investigation into how much performance impact occurs in cases where contention actually happens and retries are required.

The conclusions are as follows:

  • If exclusive control is necessary, serializing execution using a single thread to avoid contention is advantageous for performance.
  • Frequent retries increase the system load, making it necessary to implement Backoff to reduce the retry count. If configured correctly, there is no significant performance hit compared to the no-Backoff case.
  • Revising data structures and application logic to eliminate the need for exclusive control and enable parallel processing allows for maximum utilization of Aurora DSQL's high scalability.
  • It is safer to implement retry logic across a broader scope, anticipating that retryable exceptions might arise from unexpected SQL statements.
Case Average Processing Time (ms) Description
Case0 297 Executed in a single thread to avoid contention
Case1 370 Retry immediately when contention occurs
Case2 394 Retry with jittered intervals when contention occurs
Case3 104 Modified to use INSERT instead of UPDATE

Discussion