[Terraform] Enforcing GitOps by Automatically Detecting Configuration Drift from Manual Changes

For teams managing infrastructure with Terraform, one of the most persistent headaches is "Configuration Drift."

"We opened a security group (SG) directly from the AWS Console during an incident."
"Someone changed a resource size without anyone noticing."

These manual changes create a discrepancy between tfstate and reality, which can lead to unexpected errors or failures during the next deployment.

In this article, I explain a mechanism that uses GitHub Actions to check for this "drift" automatically every day and post a notification to Slack.

Challenge: "Configuration Drift" That Builds Up Out of Sight

The ideal of IaC (Infrastructure as Code) is that "the code is the source of truth," but reality is not that simple.
Especially in organizations where the team size has grown or there is no dedicated SRE, discrepancies between the code and reality tend to occur for the following reasons:

  1. Temptation of manual changes: Making changes from the console during incident response, etc.
  2. External factors: Changes by auto-scaling or other tools.
  3. Missed apply: Running terraform plan but forgetting to run terraform apply.

Impact of Manual Changes

For example, suppose you need to open a port urgently in a production environment.
Modifying the Terraform code and running it through CI is the correct procedure, but when in a rush, it's easy to think, "I'll just change it in the console for now and fix it later."

However, that "later" is usually forgotten.
A month later, when you deploy an unrelated feature, Terraform sees that the port is closed according to the code and reverts the manual change. The port is closed unintentionally, and the service goes down.

Solution: Daily Drift Detection

A reliable, low-cost countermeasure is to "automatically run terraform plan every night and raise an alert if there are differences."
This allows any manual changes to be detected by the following morning, making it possible to address them before they result in an accident.
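As a rough sketch of that nightly check, independent of any particular CI system, the shell function below runs whatever plan command it is given and reports one of three outcomes. The name check_drift and its messages are hypothetical and not from the original article. The only Terraform behavior assumed is that terraform plan -detailed-exitcode exits with 2 when the plan succeeds and changes are pending.

```shell
#!/usr/bin/env bash
# check_drift: run the given plan command and classify its exit status.
# Hypothetical helper for illustration; not part of Terraform itself.
check_drift() {
  local status=0
  # Plan output is discarded here; a real job would likely keep it for the alert.
  "$@" >/dev/null 2>&1 || status=$?
  case "$status" in
    0) echo "no drift" ;;
    2) echo "drift detected"; return 2 ;;
    *) echo "plan failed (exit ${status})"; return 1 ;;
  esac
}

# Intended use from a nightly job (needs terraform and cloud credentials):
#   check_drift terraform plan -detailed-exitcode || echo "send an alert here"
```

Treating a failed plan separately from drift matters: an expired credential should not be reported as "no drift."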

Implementation Steps with GitHub Actions

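You can implement this with a short YAML workflow in GitHub Actions. The workflow below assumes that the state backend and cloud provider credentials are already available to the job; that setup is omitted here.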

name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 0 * * *' # 9:00 AM JST

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      
      - name: Terraform Init
        run: terraform init
        
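      - name: Check for Drift
        id: plan
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
        # steps.plan.outputs.exitcode is set by the setup-terraform wrapper (enabled by default).
        run: terraform plan -detailed-exitcode
        # Keep the job running so that later steps can inspect the exit code.
        continue-on-error: true

      - name: Fail on Plan Error
        # Without this step, a failed plan (exit code 1) would pass silently.
        if: steps.plan.outputs.exitcode == '1'
        run: exit 1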
      
      - name: Slack Notification
        if: steps.plan.outputs.exitcode == '2'
        uses: 8398a7/action-slack@v3
        with:
          status: custom
          fields: repo,workflow
          custom_payload: |
            {
              "attachments": [{
                "color": "danger",
                "title": "⚠️ Drift Detected",
                "text": "There is a discrepancy between the environment and the code. Please check immediately."
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

The key point is the terraform plan -detailed-exitcode option.
By default, plan exits with 0 on success whether or not there are changes. With this option, it exits with 0 when there are no changes, 1 on error, and 2 when the plan succeeded and "there are differences."
By picking up that 2 with a GitHub Actions if condition, you can easily build the notification logic.
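If you would rather not depend on the setup-terraform wrapper to expose the exit code, one option is to record it yourself as a step output. The sketch below is one way to do that under stated assumptions: the function name record_plan_exitcode is hypothetical, and GITHUB_OUTPUT is the step-output file path that GitHub Actions provides to run steps.

```shell
#!/usr/bin/env bash
# record_plan_exitcode: run the given plan command and write its exit status
# to the step-output file so later steps can read steps.<id>.outputs.exitcode.
# Hypothetical helper; assumes GITHUB_OUTPUT is set, as it is inside a run step.
record_plan_exitcode() {
  local code=0
  "$@" || code=$?
  echo "exitcode=${code}" >> "$GITHUB_OUTPUT"
  # Exit code 1 means the plan itself failed; surface it instead of hiding it.
  if [ "$code" -eq 1 ]; then
    return 1
  fi
  return 0
}

# Hypothetical use inside the "Check for Drift" step:
#   record_plan_exitcode terraform plan -detailed-exitcode
```

Because the function returns 0 for both "no changes" and "changes present," the step only fails on a real plan error, and the Slack step's if condition can still test for '2'.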

Summary

By introducing drift detection, you can gain the following benefits:

| Benefit | Description |
| --- | --- |
| Accident Prevention | Prevent unexpected reversions during deployment |
| Improved Reliability | Maintain a state where "the code is always correct" |
| Mindset Shift | Instill an awareness in the team that "manual changes will be detected" |

Simply introducing the detection mechanism described in this article can noticeably improve your team's "IaC discipline." To go further and prevent drift from occurring in the first place, consider moving to a full GitOps flow with a tool like Atlantis.

Further Detailed Practice Guide

The detailed article on the Shineos Tech Blog includes a complete implementation guide with the code above, plus in-depth explanations of the following:

  • Pull-request-based Terraform operations using Atlantis (GitOps)
  • Permission design for revoking AWS write permissions from developers
  • Common "IaC pitfalls" in startups and how to counter them

Please see this article for details.

https://blog.shineos.com/posts/2026-01-07-terraform-iac-drift-automation
