Translated by AI
[Terraform] Enforcing GitOps by Automatically Detecting Configuration Drift from Manual Changes
For teams managing infrastructure with Terraform, one of the most persistent headaches is "Configuration Drift."
"We opened an SG directly from the AWS Console for emergency response."
"Someone changed the resource size before we knew it."
These manual changes create a discrepancy between tfstate and reality, which can lead to unexpected errors or failures during the next deployment.
In this article, I will explain a mechanism for automatically detecting this "drift" every day using GitHub Actions and notifying Slack.
Challenge: "Configuration Drift" Progressing Out of Sight
The ideal of IaC (Infrastructure as Code) is that "the code is the source of truth," but reality is not that simple.
Especially in organizations where the team size has grown or there is no dedicated SRE, discrepancies between the code and reality tend to occur for the following reasons:
- Temptation of manual changes: Making changes from the console during incident response, etc.
- External factors: Changes by auto-scaling or other tools.
- Missing apply: Running terraform plan but forgetting to run terraform apply.
Impact of Manual Changes
For example, suppose you need to open a port urgently in a production environment.
Modifying the Terraform code and running the CI is the correct procedure, but when in a rush, it's easy to think, "I'll just change it in the console for now and fix it later."
However, that "later" is usually forgotten.
A month later, when you deploy an unrelated feature, Terraform sees that the code still says the port is closed. It closes the port again without anyone intending it, and the service goes down.
Solution: Daily Drift Detection
The most reliable and low-cost countermeasure is to "automatically run terraform plan once a day and raise an alert if there are differences."
This way, any manual change is detected within a day, so it can be addressed before it causes an accident.
Implementation Steps with GitHub Actions
You can implement this with a single short GitHub Actions workflow file.
```yaml
name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 0 * * *' # 9:00 AM JST
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init
      - name: Check for Drift
        id: plan
        # -detailed-exitcode returns exit code 2 if there are differences
        run: terraform plan -detailed-exitcode
        continue-on-error: true
      - name: Slack Notification
        if: steps.plan.outputs.exitcode == '2'
        uses: 8398a7/action-slack@v3
        with:
          status: custom
          fields: repo,workflow
          custom_payload: |
            {
              "attachments": [{
                "color": "danger",
                "title": "⚠️ Drift Detected",
                "text": "There is a discrepancy between the environment and the code. Please check immediately."
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```
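One detail in the workflow above is easy to miss: GitHub Actions evaluates schedule cron expressions in UTC, which is why '0 0 * * *' fires at 9:00 AM JST. If you want a different local time, the conversion can be sketched as below. The function names are hypothetical helpers for illustration, and the hour is assumed to be a plain integer from 0 to 23 without a leading zero.

```shell
#!/usr/bin/env bash
# Convert a JST wall-clock hour (0-23) into the UTC hour that a cron entry needs.
# JST is UTC+9 and has no daylight saving time, so the offset is constant.
jst_hour_to_utc() {
  echo $(( ($1 - 9 + 24) % 24 ))
}

# Build a daily cron expression for a given JST hour and optional minute.
daily_cron_for_jst() {
  local hour=$1 minute=${2:-0}
  echo "$minute $(jst_hour_to_utc "$hour") * * *"
}

daily_cron_for_jst 9   # prints: 0 0 * * *
```

For JST hours earlier than 9, the resulting UTC hour falls on the previous calendar day, which does not matter for a job that runs every day.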
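The key point is the -detailed-exitcode option of terraform plan.
Without it, plan returns 0 on success whether or not there are changes. With it, plan returns 0 when there are no changes, 1 on error, and 2 when the plan succeeded and there are differences.
The hashicorp/setup-terraform wrapper exposes this value as steps.plan.outputs.exitcode, so a GitHub Actions if condition can pick it up and trigger the notification. Note that continue-on-error: true also lets a failed plan (exit code 1) pass silently, so this workflow alerts on drift but not on plan errors.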
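If you would rather handle the exit code in a script step, or run the same check locally, the logic can be sketched as follows. With -detailed-exitcode, terraform plan exits with 0 (no changes), 1 (error), or 2 (changes present). The names classify_plan_exit, check_drift, and TERRAFORM_BIN are hypothetical and are not part of Terraform or of the workflow above. The sketch also assumes terraform init has already been run in the current directory.

```shell
#!/usr/bin/env bash
# Map the exit code of `terraform plan -detailed-exitcode` to a label.
#   0 = plan succeeded, no changes
#   1 = plan failed
#   2 = plan succeeded, changes present (drift)
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a plan in the current directory and print the label.
# TERRAFORM_BIN is a hypothetical override so that the function can be
# exercised without a real Terraform installation.
check_drift() {
  local rc=0
  "${TERRAFORM_BIN:-terraform}" plan -detailed-exitcode -input=false -no-color \
    >/dev/null 2>&1 || rc=$?
  classify_plan_exit "$rc"
}

# Example usage:
#   status=$(check_drift)
#   [ "$status" = "drift" ] && echo "Drift detected"
```

Keeping "error" separate from "drift" makes it possible to alert on a broken plan as well, instead of treating every non-zero exit code the same way.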
Summary
By introducing drift detection, you can gain the following benefits:
| Benefit | Description |
|---|---|
| Accident Prevention | Prevent unexpected reversions during deployment |
| Improved Reliability | Maintain a state where "the code is always correct" |
| Mindset Shift | Instill an awareness in the team that "manual changes will be detected" |
Introducing the detection mechanism described in this article is a simple way to strengthen your team's "IaC discipline." To go further and prevent drift from occurring in the first place, consider moving to a full GitOps flow with a tool such as Atlantis.
Further Detailed Practice Guide
The detailed article on the Shineos Tech Blog includes the complete implementation guide, with the code above, and explains the following in depth:
- Pull-request-based Terraform operations using Atlantis (GitOps)
- Permission design for revoking AWS write permissions from developers
- "IaC Pitfalls" and countermeasures common in startups
Please see this article for details.