Translated by AI
From the Despair of Cloud SQL Failure and Corruption to a Pseudo-Blue/Green Deployment via Read Replica Promotion
Introduction
Nice to meet you! This is my first post on Zenn.
Currently, I am operating a medical system on GCP (Google Cloud Platform).
Since it is a system used in medical settings, a high level of availability is required. However, I have encountered several heart-stopping situations during Cloud SQL maintenance and releases.
I would like to talk about those failure stories and the operational workarounds I arrived at after much trial and error.
1. Troubles Encountered: The "Death" That Follows Column Changes
During Cloud SQL column changes or large-scale data updates (UPDATE), I encountered the following phenomena.
① CPU hits 100% during the next day's user hours
A column change was applied during a release. Even though it appeared to complete normally at the time, there were several occasions where CPU usage suddenly spiked to 100% during the next day's peak user hours, causing response times to degrade severely (in the worst cases, it stayed pinned at 100% and the system became completely unusable).
- Action: In this case, restarting the instance temporarily resolved the issue.

When this happens, the CPU stays stuck at 100% if left alone, so until the next release I've simply been restarting repeatedly to buy time (sigh).
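For reference, that stopgap restart can be scripted from the CLI. The following is a minimal sketch, assuming a hypothetical instance name `prod-db`; it only prints the command for review rather than executing it:

```shell
# Hypothetical instance name -- substitute your own.
INSTANCE="prod-db"

# A restart clears the pinned-at-100% CPU state (until the next occurrence).
# --quiet skips the interactive confirmation prompt.
RESTART_CMD="gcloud sql instances restart ${INSTANCE} --quiet"

# This sketch prints the command for review instead of running it.
echo "${RESTART_CMD}"
```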
② DB Corruption and PITR (Point-in-Time Recovery) Failure
Even more serious was a release involving a large-scale UPDATE. The DB became unstable and eventually unreachable, and restarting did not bring it back.
- What I tried: Attempted recovery to a specific point in time using PITR (Point-in-Time Recovery), which is a standard GCP feature.
- Result: No matter which point in time I restored to, the instance came up in a corrupted state, so I concluded that the instance itself was broken.
- Final solution: Fortunately, there was a dump file exported beforehand (this was done by an incredibly talented and cautious person; if it were me, I probably wouldn't have done it lol). I managed to recover by rebuilding a new Cloud SQL instance and restoring the data.

A total stalemate; this happens about once a year at most.
2. Why did it break? (Analysis of the cause)
The investigation suggested that the allocated storage capacity of Cloud SQL might be involved, or something along those lines.
When executing large-scale DDL (such as column changes) or massive UPDATEs, it seems that internally both the "data before changes" and the "data after changes" are held temporarily, or the logs can suddenly bloat (this part is a bit vague, but in any case, extra storage appears to be consumed temporarily).
Learning:
With the default settings I had felt that "the storage capacity is unnecessarily large; I don't need that much," but that headroom may exist precisely because a buffer is needed for such work areas.
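To check how much headroom an instance actually has, the allocated disk size and the auto-resize setting can be read from the instance description. A sketch, again with a hypothetical instance name, that prints the commands rather than running them:

```shell
# Hypothetical instance name -- substitute your own.
INSTANCE="prod-db"

# Allocated disk size in GB -- the number that looks "unnecessarily large"
# may be the buffer that large DDL/UPDATE operations rely on.
SIZE_CMD="gcloud sql instances describe ${INSTANCE} --format='value(settings.dataDiskSizeGb)'"

# Whether the disk grows automatically if that buffer runs out mid-operation.
RESIZE_CMD="gcloud sql instances describe ${INSTANCE} --format='value(settings.storageAutoResize)'"

# Print for review instead of executing in this sketch.
echo "${SIZE_CMD}"
echo "${RESIZE_CMD}"
```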
3. Transition of Countermeasures: Toward Faster and Safer Operations
Phase 1: Obtaining a dump just before release
Following the failure in ②, I made it a rule to always take a dump using `gcloud sql export` immediately before release operations.
- Challenge: Depending on the scale of the instance, exporting alone could take over an hour. In release operations where I want to minimize downtime, this waiting time was a major bottleneck.
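As a sketch, the pre-release dump might look like the following (instance, bucket, and database names are all placeholders; the script only prints the command). Cloud SQL's serverless export flag `--offload` runs the export on a temporary instance, which may also ease the load on the primary during that hour-long wait:

```shell
# Placeholders -- substitute your own instance, bucket, and database.
INSTANCE="prod-db"
BUCKET="gs://my-backup-bucket"
DATABASE="app_db"
STAMP="$(date +%Y%m%d-%H%M%S)"

# Pre-release dump to Cloud Storage; --offload performs a serverless export
# so the long-running export does not burden the primary instance.
EXPORT_CMD="gcloud sql export sql ${INSTANCE} ${BUCKET}/pre-release-${STAMP}.sql.gz --database=${DATABASE} --offload"

# Print for review instead of executing in this sketch.
echo "${EXPORT_CMD}"
```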
Phase 2: Switching by promoting a read replica (Current)
The method I am currently using is to use a read replica as a sacrificial substitute.
Operational flow:
- Create a read replica of the production DB just before the release operation.
- Execute the release (DB changes) on the production DB.
- If the production DB breaks or becomes unstable, promote the prepared read replica.
- Switch the application's connection destination to the newly promoted DB.
With this approach, even in the worst-case scenario, it has become possible to recover in a few minutes without waiting for "new instance construction + long restoration."
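The flow above can be sketched with the CLI as follows (instance names are placeholders; the script only prints the commands for review). Note that promotion is one-way: once promoted, the replica stops replicating and becomes a standalone writable instance:

```shell
# Placeholders -- substitute your own instance names.
PRIMARY="prod-db"
REPLICA="prod-db-standby"

# 1. Create a read replica of the production DB just before the release.
CREATE_CMD="gcloud sql instances create ${REPLICA} --master-instance-name=${PRIMARY}"

# 2. The release (DB changes) runs against the primary here.

# 3. If the primary breaks, promote the replica to a standalone instance.
PROMOTE_CMD="gcloud sql instances promote-replica ${REPLICA} --quiet"

# 4. Look up the promoted instance's IP to switch the app's connection.
IP_CMD="gcloud sql instances describe ${REPLICA} --format='value(ipAddresses[0].ipAddress)'"

# Print the commands for review instead of executing them in this sketch.
printf '%s\n' "${CREATE_CMD}" "${PROMOTE_CMD}" "${IP_CMD}"
```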
4. Future Outlook: Toward the Ideal Blue/Green
While the current method is excellent as "insurance for when things break," there is still room for improvement in operations.
- Ideal: Achieving releases including database changes without downtime.
- Challenge: Cloud SQL does not (currently) provide a standard "Blue/Green Deployment" feature that can be enabled with a single click.
In the future, I want to pursue more advanced Blue/Green environment construction, such as utilizing the Database Migration Service (DMS) or switching at the proxy layer (ProxySQL). I heard that AWS Aurora has a feature called Aurora Blue/Green Deployments that can be switched with a single button (handling write control, switching after synchronization, etc.), so I want to try that too! If anyone knows a good way to do this on GCP, please let me know!
Conclusion
Precisely because it is a medical system environment where "downtime is not allowed," I become extra cautious with DB operations (although it's common for things to go terribly wrong due to operational mistakes anyway). If there are others out there struggling with Cloud SQL behavior or release methods, I hope this serves as a helpful example.
If you have insights like "there's a better way!", please let me know in the comments!
I hope this article helps someone who is in trouble at their site!
- This article is a repost of content originally posted on Qiita.