GitHub Self-hosted Runners - 54% faster CI with just 13% of the cost
Hi, I'm Duc, a member of SODA's SRE team.
Here is how we speed up our CI workflows while also reducing the cost of GitHub Actions.
Background
Our team focuses on maintaining the availability and performance of the snkrdunk.com website and the SNKRDUNK application.
Besides that, we are also responsible for controlling the cost of our cloud infrastructure and other monitoring and development tools.
Every month, we review the billing for our infrastructure and tools, such as AWS and GitHub.
We noticed that the GitHub bill was quite high, especially the GitHub Actions part.
For context, here is a comparison of our GitHub Actions usage vs. our production workload:
- Fargate: ~$7,000 per month with a Savings Plan, serving an average of about 5,000 requests per second
- GitHub Actions: the budget is capped at $9,000 per month to avoid unexpected costs, but it regularly crosses this threshold, so we have to raise the budget to keep our CI/CD pipelines from stopping altogether. Some months the GitHub bill reached as high as $18,000
That's quite insane!
What is GitHub Actions
GitHub Actions has become increasingly popular since its release in 2019.
GitHub provides two types of runners: GitHub-hosted runners (which run on Azure, since Microsoft acquired GitHub in 2018) and self-hosted runners.
GitHub-hosted runners come with pre-installed software on Ubuntu and are very easy to use with all kinds of programming languages, but they come at a fairly high cost.
For comparison, for an x86_64 runner with 2 cores, 8GB of memory, and 75GB of disk (only 30GB usable):
- GitHub-hosted: $0.008 per minute (rounded up to the nearest whole minute)
- Self-hosted (on-demand EC2 instance, N. Virginia region, m7i-flex.large): $0.0958 (EC2) + $0.0083 (EBS) + $0.005 (IPv4) = $0.1091 per hour ≈ $0.0018 per minute
That's a 77% price difference! The savings are even bigger if you use Spot Instances.
So, how do we do it?
The all-in-one solution
After a quick search on Google, 2 repositories caught my eye.
- The Runs-on project: https://github.com/runs-on
- The Philips Labs AWS self-hosted runner project: https://github.com/github-aws-runners/terraform-aws-github-runner
Both are all-in-one solutions, very well-documented, and easy to deploy.
Since our IaC is Terraform, I went with the Philips Labs Terraform module.
Note: Runs-on requires a license fee ($300/year) for commercial use.
The architecture of this module is as follows.
The catch
So all you have to do is set the parameters for the Terraform module. It should be easy, right?
After setting up the module, I ran terraform apply, registered a GitHub App in the Organization settings, and the dummy job was picked up as expected.
That was a good sign, so I switched all the test workflows to the self-hosted runners. After a few days, I ran into some issues.
The cost of the NAT Gateway was too high
When I checked the billing, I was surprised at how high the cost was.
Checking the Cost Explorer with the tag I allocated for this project, I found that the NAT Gateway was the main culprit.
As you can see in the image above, the cost of the NAT Gateway was about 10 times higher than the EC2 cost.
Initially, I placed all the runner instances in private subnets, so whenever they need to communicate with the internet, the traffic goes through the NAT Gateway:
- On startup, the runners must download the GitHub Actions runner binary and other software required to run GitHub workflows
- git checkout
- Docker image pulls from docker.io
Although AWS does not charge for data coming into its network, NAT Gateways charge for every byte of data they process, regardless of direction.
To avoid the weird errors that occurred when reusing a runner instance, I used ephemeral runners: every job gets a fresh instance, which is terminated when the job completes. That also means all the steps above run over and over again, generating a significant amount of traffic through the NAT Gateways.
To minimize internet communication as much as possible, I made a few changes to the architecture.
- Build and use a custom AMI instead of setting everything up from scratch. This reduces the time it takes a runner to get ready for a job. The AMI includes all the software (C libraries for cgo, Go migration tools, etc.) and Docker images (Golang, MySQL, Redis, etc.) that our workflows use. (The Philips Labs module also provides a sample Packer template for building custom AMIs, which is a good starting point.)
- To reduce the number of image pulls from Docker Hub and avoid its rate limit, we switched to AWS ECR Public with a pull-through cache over PrivateLink.
Another alternative is using public subnets instead of private ones, so each instance gets a public IP and can reach the internet on its own, removing the need for a NAT Gateway. But this also means your runner instances are exposed to the internet during their lifetime, so make sure you harden your security group to prevent unwanted access (for example, only outbound rules and no inbound rules, as sketched below).
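For reference, a security group created with the AWS CLI starts with no inbound rules and a default allow-all outbound rule, which is exactly the posture you want here (a sketch; the name and VPC ID are placeholders):

# Creates a security group with no inbound rules; the default egress rule allows all outbound traffic
aws ec2 create-security-group \
  --group-name gh-runner-egress-only \
  --description "GitHub Actions runner (egress only)" \
  --vpc-id vpc-0123456789abcdef0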
Result: Daily cost reduced from $800 to ~$100
The Docker Hub rate limit
Our CI workflows require several Docker images, but Docker Hub has a very strict pull rate limit.
(We do have a Team plan on Docker Hub, but we prefer not to add secrets to our CI workflows if we can avoid it.)
This is not a problem on GitHub-hosted runners, because every job gets a new machine with a new IPv4 address. But with the NAT Gateway, all of our instances reach Docker Hub through a limited set of IPs, so the chance of getting 429 (rate limit) errors from Docker Hub is pretty high.
So we switched to ECR Public + pull-through cache + an AWS VPC endpoint.
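For reference, the pull-through cache rule itself is a one-liner with the AWS CLI (a sketch; the repository prefix, region, and account ID are placeholders):

# Mirror ECR Public into the private registry so pulls can go over the VPC endpoint
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix ecr-public \
  --upstream-registry-url public.ecr.aws \
  --region us-east-1

# Images are then pulled through the private registry instead of the upstream
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/ecr-public/docker/library/golang:1.24.3-bullseye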
ECR Public hosts almost all of the popular Docker repositories (Golang, Redis, MySQL, etc.), so you should be able to find the repositories you need quite easily.
Our developers had a minor concern that the images from ECR Public might differ from the ones on Docker Hub, so we used the docker scout command to confirm they are identical.
docker scout compare --to golang:1.24.3-bullseye public.ecr.aws/docker/library/golang:1.24.3-bullseye
i New version 1.18.0 available (installed version is 1.17.0) at https://github.com/docker/scout-cli
! 'docker scout compare' is experimental and its behavior might change in the future
✓ SBOM obtained from attestation, 298 packages found
✓ Provenance obtained from attestation
✓ SBOM of image already cached, 298 packages indexed
## Overview
│ Analyzed Image │ Comparison Image
────────────────────┼───────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────
Target │ public.ecr.aws/docker/library/golang:1.24.3-bullseye │ golang:1.24.3-bullseye
digest │ cd43396a4113 │ cd43396a4113
tag │ 1.24.3-bullseye │ 1.24.3-bullseye
platform │ linux/arm64/v8 │ linux/arm64/v8
provenance │ https://github.com/docker-library/golang.git │ https://github.com/docker-library/golang.git
│ 6f5593131e9bccda9a4e83f858427d4d0d16b58d │ 6f5593131e9bccda9a4e83f858427d4d0d16b58d
vulnerabilities │ 0C 1H 3M 124L │ 0C 1H 3M 124L
│ │
size │ 280 MB │ 280 MB
packages │ 298 │ 298
│ │
Base image │ buildpack-deps:4724dfb3ebb274c6a19aee36c125858295ad91950e78a195b71f229228a6aaeb │ buildpack-deps:4724dfb3ebb274c6a19aee36c125858295ad91950e78a195b71f229228a6aaeb
tags │ also known as │ also known as
│ • bullseye-scm │ • bullseye-scm
│ • oldstable-scm │ • oldstable-scm
vulnerabilities │ 0C 1H 3M 63L │ 0C 1H 3M 63L
## Environment Variables
GOLANG_VERSION=1.24.3
GOPATH=/go
GOTOOLCHAIN=local
PATH=/go/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
## Packages and Vulnerabilities
298 packages unchanged
As you can see, all the packages that were installed inside both images are identical.
The self-hosted runners regularly terminated due to the spot instances being reclaimed from AWS
To optimize cost, we use Spot Instances in a US region (the cheapest compared to other regions). But since these are Spot Instances, they can be terminated at any time when AWS needs the capacity back for other customers.
But there is a trick we can use to reduce the termination rate.
By default, the EC2 Fleet API uses the lowest-price strategy when allocating Spot Instances. The available allocation strategies are:
- lowest-price -> the default
- diversified
- capacity-optimized -> lowest interruption rate
- price-capacity-optimized -> best balance of price and interruption rate
The interruption rate is not the same across AZs: an Availability Zone with more spare capacity should have a lower interruption rate. You can change the allocation strategy with the module's instance_allocation_strategy parameter.
As you can see from the picture above, the runner termination rate improved a lot after we switched to the capacity-optimized strategy.
(Since running on EC2 is already significantly cheaper than GitHub-hosted, we think paying a bit more for Spot capacity is acceptable, and it also gives our developers a better experience since they don't need to retry as many jobs in their PRs.)
The job queue time was quite significant compared to GitHub-hosted Runners
There were 2 reasons for this.
- The preparation steps (Docker installation, Actions runner binary setup) take at least a minute when a runner boots up. This can be resolved by using the custom AMIs described above.
- The EC2 quota limit.
Normally, we use the Tokyo region for our production workloads, so we had never paid much attention to the quota limits of other regions.
With the first reason fixed by the custom AMIs, a runner should be ready to pick up a job within a minute or so, but it often took much longer than that.
Looking at the scale-up Lambda function's logs, we found several errors when calling the EC2 Fleet API: MaxSpotInstanceCountExceeded. This means our Spot requests failed because we had hit an AWS quota. What's interesting is that when we checked the quota in the AWS console, the limit was a seemingly arbitrary number (648), even though we had never requested an increase.
It turns out the Spot Instance quota is a kind of soft limit: AWS continuously monitors customer usage and raises it incrementally as needed. In our case, usage jumped suddenly when we switched all the tests to self-hosted runners, so this process couldn't keep up, causing the limit errors.
After requesting a reasonable limit, the queue time dropped significantly.
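If you run into the same error, the Service Quotas API lets you check and raise the limit without waiting for AWS to adjust it on its own (a sketch; L-34B43A08 should be the quota code for "All Standard Spot Instance Requests", but double-check the code and pick a desired value that matches your peak usage):

# Check the current Spot Instance vCPU quota in the runners' region
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-34B43A08 \
  --region us-east-1

# Request an increase that covers the peak number of concurrent runner vCPUs
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-34B43A08 \
  --desired-value 2000 \
  --region us-east-1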
Some jobs get stuck in the "queued" state forever
This is quite a headache.
Sometimes, in workflows with several parallel jobs, only some of the jobs get stuck in the queued state. The GitHub console shows nothing, so we have no idea whether a job displayed as queued has been assigned to an EC2 instance or not. And since there is no instance ID, we cannot tell which instance has the problem.
Considering the architecture, we checked all of the following logs:
- GitHub Apps
- API Gateway access log
- Webhook Lambda log
- Scale-up Lambda log
We realized that the scale-up Lambda log contains entries like this
{
"level": "INFO",
"message": "Created instance(s): i-06486eac2256681ca",
"timestamp": "2025-07-01T10:51:09.772Z",
"service": "runners-scale-up",
"sampling_rate": 0,
"xray_trace_id": "1-6863bd99-21c88f07e1a5a3e8328e7200",
"region": "us-east-1",
"environment": "gh-ci-x64-2core-cpu-optimized",
"module": "runners",
"aws-request-id": "cd33581c-5b7d-575e-b3c2-38c1e4a99eec",
"function-name": "gh-ci-x64-2core-cpu-optimized-scale-up",
"runner": {
"type": "Org",
"owner": "org-owner",
"namePrefix": ""
},
"github": {
"event": "workflow_job",
"workflow_job_id": "45123342099"
}
}
Looking at the GitHub console, we can see that the job ID is the last part of the URL:
https://github.com/{org}/{repository}/actions/runs/{workflow_run_id}/job/{job_id}
So with the job ID, we can check whether a job has been assigned to an instance or not.
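For example, a quick way to inspect a single job, using the job ID from the log entry above:

# Shows whether the job has actually been assigned to a runner
gh api /repos/{org}/{repository}/actions/jobs/45123342099 \
  --jq '{status: .status, runner_name: .runner_name, started_at: .started_at}'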
But when a job got stuck, even with the job ID in hand, we found that the instance for that job had been created successfully and was running normally!
It turns out this is a bug in the Actions runner binary.
Somehow, the runner instance does pick up the job but is unable to reach GitHub to push logs and report its status, so the job stays in the queued state forever. This bug is still unresolved at the time of writing.
(I assume there is a connectivity problem between AWS and Azure, or GitHub has some kind of internal rate limit.)
For stuck jobs, our team members can manually cancel and rerun the workflow from the console. But checking nearly a hundred jobs to see whether they are really stuck or just waiting for their turn to run is very frustrating; no one wants to do that.
Finally, we came up with a simple solution.
Using a GitHub-hosted runner, we run a scheduled workflow every 15 minutes that checks whether any workflow run has jobs stuck in the queued state for more than 15 minutes. If there is one, it cancels that workflow run and reruns it.
# Runs every 15 minutes, as described above
on:
  schedule:
    - cron: "*/15 * * * *"

jobs:
retry-workflows:
runs-on: ubuntu-24.04
name: Retry Queued Workflows
steps:
- name: Check and retry queued workflows
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
QUEUED_RUNS=$(gh api --method GET /repos/{org}/{repository}/actions/runs -F status=queued --jq '.workflow_runs[] | .id')
CURRENT_TIME=$(date +%s)
for run_id in $QUEUED_RUNS; do
QUEUED_JOBS=$(gh api --method GET /repos/{org}/{repository}/actions/runs/"$run_id"/jobs --jq '.jobs[] | select(.status=="queued") | .id')
for job_id in $QUEUED_JOBS; do
# Get the created_at timestamp for the run
CREATED_AT=$(gh api --method GET /repos/{org}/{repository}/actions/jobs/"$job_id" --jq '.created_at')
CREATED_TIME=$(date -d "$CREATED_AT" +%s)
# Calculate how long the workflow has been queued (in minutes)
QUEUED_MINUTES=$(( ("$CURRENT_TIME" - "$CREATED_TIME") / 60 ))
echo "The job_id $job_id in the workflow $run_id has been queued for $QUEUED_MINUTES minutes"
# Only retry if queued time is between 15 and 120 minutes
if [ "$QUEUED_MINUTES" -ge 15 ] && [ "$QUEUED_MINUTES" -le 120 ] ; then
echo "Processing workflow run $run_id"
gh run cancel "$run_id"
sleep 5
for i in {1..5}; do
if gh run rerun "$run_id"; then
break
fi
echo "Retry $run_id attempt $i failed. Waiting 5 seconds before next attempt..."
sleep 5
done
break
fi
done
done
With this, we no longer need to cancel and retry stuck jobs manually.
The temporary partition size (tmpfs /tmp)
We use Amazon Linux 2023 as the base image for our custom AMI, but this Linux distribution from AWS comes with a caveat.
The /tmp partition is a tmpfs file system, limited to 50% of RAM and a maximum of one million inodes.
If you plan to use /tmp for caching, be careful: it can fill up very quickly and cause your workflows to fail.
You can use /var/tmp (which is on EBS) as a replacement.
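You can check the limits on a running instance and, if you really need a bigger /tmp, remount it (a sketch; the size and inode values are arbitrary examples):

# Check the size and inode limits of the tmpfs-backed /tmp
df -h /tmp
df -i /tmp

# Option 1: point temp-file-heavy tools at the EBS-backed /var/tmp
export TMPDIR=/var/tmp

# Option 2: remount /tmp with a larger size and inode limit
sudo mount -o remount,size=16G,nr_inodes=2m /tmp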
Optimize workflows for faster CI runtime
Right-sizing runner instances based on each job's requirements
With the price per runner significantly reduced by using EC2 instances, we can afford to scale up instances for better runtimes.
But not every job can take advantage of more CPU and memory. For example, our Go tests are split into serial and parallel suites, and only the parallel jobs run faster on instances with more CPUs.
With EC2, we can use CloudWatch to gather resource metrics and scale instances up as needed.
This is also possible if you are using GitHub-hosted runners: there is a GitHub Action that collects the relevant metrics while a runner executes a job and publishes the results to the workflow summary.
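For example, adding a telemetry action as the first step of a job makes CPU, memory, and I/O charts appear in the workflow summary (a sketch assuming the community catchpoint/workflow-telemetry-action, which may not be the exact action referred to above):

jobs:
  test:
    runs-on: ubuntu-24.04
    steps:
      # Collects CPU, memory, network, and disk metrics for the rest of the job
      # and publishes charts to the workflow run summary
      - uses: catchpoint/workflow-telemetry-action@v2
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - run: go test ./...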
After checking the workflow telemetry results, we also realized that some of our jobs are CPU-intensive, which means that instead of the m instance family (CPU-to-memory ratio of 1:4), we can use the c family (ratio of 1:2) for better price/performance.
Meanwhile, GitHub-hosted runners only offer one general-purpose instance type, with roughly the same CPU-to-memory ratio as the EC2 m family.
This is just a simple example. If your workflows can take advantage of other instance types (network-optimized, memory-optimized, GPU instances, etc.), switch to those for even better runtimes.
Switching cache backend from GitHub cache to S3
GitHub Actions has its own caching solution for sharing files between jobs and workflows to reduce runtime.
But each repository gets a maximum of 10GB. Our repositories are fairly complex; 10GB proved not to be enough, and our cache entries were often evicted by GitHub.
Since we moved to self-hosted runners on EC2, using S3 as the cache storage makes more sense:
- Virtually unlimited storage
- Better download and upload speeds via an S3 Gateway Endpoint (300~400MB/s vs. 50~100MB/s with the GitHub Actions cache)
The runs-on project mentioned above provides a drop-in replacement for the actions/cache action.
You just need to add two environment variables, AWS_REGION and RUNS_ON_S3_BUCKET_CACHE, and make sure the EC2 runners have the proper permissions to access the S3 bucket.
The other parameters are the same as actions/cache, and it automatically falls back to the GitHub Actions cache if it can't use S3.
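A minimal usage example (a sketch; the bucket name, region, and cache key are placeholders, and the version to pin should be checked against the runs-on/cache releases):

- name: Cache Go modules
  uses: runs-on/cache@v4
  env:
    AWS_REGION: us-east-1
    RUNS_ON_S3_BUCKET_CACHE: my-ci-cache-bucket
  with:
    path: ~/go/pkg/mod
    key: go-mod-${{ runner.os }}-${{ hashFiles('**/go.sum') }}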
Fix the caching problems between steps
This isn't really specific to self-hosted runners, but here are a few tricks we used to optimize our workflows.
Our test workflow has the following steps:
- Check whether the changes contain files that require running tests (such as *.go, *.sql)
- Divide the tests into several parallel jobs based on tags
- Inside each job:
  - Check out the source code
  - Prepare Docker containers (go mod download, MySQL, Redis, etc.)
  - Run the migration
  - Run the tests
The test workflow looks like this.
And inside each job, steps like this
As you can see, the Setup backends step consumes about 3m30s, and it is the same for every parallel job.
For jobs that only take 5 or 6 minutes in total, that's about half of the runtime!
So here is what we did to speed up the whole workflow.
Caching the MySQL container
We use the official MySQL image, but our database needs some extra configuration for Japanese language support, so we have to build a custom image before running the tests.
#4 [mysql 1/2] FROM docker.io/library/mysql:8.0.36
#4 DONE 0.2s
#5 [mysql 2/2] RUN microdnf install -y glibc-locale-source && localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
#5 1.177 Downloading metadata...
#5 16.20 Downloading metadata...
#5 28.49 Downloading metadata...
#5 29.00 Downloading metadata...
#5 32.86 Package Repository Size
#5 32.86 Installing:
#5 32.86 glibc-gconv-extra-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 1.6 MB
#5 32.86 glibc-locale-source-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 4.4 MB
#5 32.86 Upgrading:
#5 32.86 glibc-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 2.3 MB
#5 32.86 replacing glibc-2.28-236.0.1.el8_9.12.x86_64
#5 32.86 glibc-common-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 1.1 MB
#5 32.86 replacing glibc-common-2.28-236.0.1.el8_9.12.x86_64
#5 32.86 glibc-minimal-langpack-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 76.5 kB
#5 32.86 replacing glibc-minimal-langpack-2.28-236.0.1.el8_9.12.x86_64
#5 32.86 Transaction Summary:
#5 32.86 Installing: 2 packages
#5 32.86 Reinstalling: 0 packages
#5 32.86 Upgrading: 3 packages
#5 32.86 Obsoleting: 0 packages
#5 32.86 Removing: 0 packages
#5 32.86 Downgrading: 0 packages
#5 32.86 Downloading packages...
#5 32.98 Running transaction test...
#5 33.33 Updating: glibc-common;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 33.49 Updating: glibc-minimal-langpack;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 33.50 Updating: glibc;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 33.75 Installing: glibc-gconv-extra;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 33.93 Installing: glibc-locale-source;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 34.31 Cleanup: glibc;2.28-236.0.1.el8_9.12;x86_64;installed
#5 34.32 Cleanup: glibc-minimal-langpack;2.28-236.0.1.el8_9.12;x86_64;installed
#5 34.33 Cleanup: glibc-common;2.28-236.0.1.el8_9.12;x86_64;installed
#5 34.52 Complete.
#5 DONE 36.3s
#6 [mysql] exporting to image
#6 exporting layers
#6 exporting layers 0.7s done
#6 writing image sha256:bc72bb57206fdb5aeee0e8bd8652e186861312b6abf67f42841c76142cb6fa64 done
#6 naming to docker.io/library/snkrdunkcom-mysql done
#6 DONE 0.7s
#7 [mysql] resolving provenance for metadata file
#7 DONE 0.0s
The build process takes ~30s. But our MySQL configuration rarely changes (a MySQL version bump, perhaps), so we don't need to rebuild it over and over in every job.
So we added a MySQL Build step before running all the tests:
- Check the hash of the etc/docker/mysql folder, which contains the custom Dockerfile and other MySQL config files.
- Using that hash as the cache key, check whether a built image already exists in the cache.
- If not, run the build command.
- Save the MySQL image to disk, archive the file, and push it to the cache (S3).
mysql-build:
runs-on: self-hosted-linux-x64-4core-cpu-optimized
name: Building MySQL image
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
sparse-checkout: |
etc/docker/mysql
docker-compose.ci.yml
- name: Get mysql folder hash
run: echo "MYSQL_IMAGE_HASH=$(git ls-files -s etc/docker/mysql | git hash-object --stdin)" >> "$GITHUB_ENV"
- name: Check if cache exists
id: cache-hit-check
uses: runs-on/cache/restore@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
env:
RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
with:
path: /tmp/docker-build/mysql
lookup-only: true
key: test-${{ runner.os }}-${{ runner.arch }}-snkrdunkcom-mysql-${{ env.MYSQL_IMAGE_HASH }}
- name: Build mysql image
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
run: |
cp etc/docker/.env.default etc/docker/.env
docker compose -f docker-compose.ci.yml build mysql
- name: Cache preparation
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
run: |
mkdir -p /tmp/docker-build/mysql
docker save -o /tmp/docker-build/mysql/snkrdunkcom-mysql.tar snkrdunkcom-mysql
- name: Saving mysql image
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
id: save-mysql-image
uses: runs-on/cache/save@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
env:
RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
with:
path: /tmp/docker-build/mysql
key: test-${{ runner.os }}-${{ runner.arch }}-snkrdunkcom-mysql-${{ env.MYSQL_IMAGE_HASH }}
Then, in each test job, download the MySQL image archive from S3 and load it:
docker load < /tmp/docker-build/mysql/snkrdunkcom-mysql.tar
Since the loaded image has the same name referenced in the Docker Compose file, the MySQL image won't be built again.
So instead of building the custom MySQL image, which costs 30 seconds of runtime, we now only need ~1 second to download the archive (~350MB) from S3 and ~5 seconds to load it.
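The restore side in each test job looks roughly like this (a sketch, reusing the same hash and cache key as the build job above):

- name: Get mysql folder hash
  run: echo "MYSQL_IMAGE_HASH=$(git ls-files -s etc/docker/mysql | git hash-object --stdin)" >> "$GITHUB_ENV"
- name: Restore mysql image
  uses: runs-on/cache/restore@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
  env:
    RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
  with:
    path: /tmp/docker-build/mysql
    key: test-${{ runner.os }}-${{ runner.arch }}-snkrdunkcom-mysql-${{ env.MYSQL_IMAGE_HASH }}
- name: Load mysql image
  run: docker load < /tmp/docker-build/mysql/snkrdunkcom-mysql.tar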
Caching the migrated database
To run the tests, the migration command must be run first to prepare the schema. Our service, which started in 2018, now requires more than 550 migration steps to be executed.
This step takes >150 seconds in each job!
But not every change contains a database migration.
So we added a job before all the test jobs just to run the migration process:
- Check the hash of the migrations folder.
- Using that hash as the cache key, check whether a migrated database already exists in the cache.
- If not, spin up the MySQL container and run the migration command.
- Stop the MySQL container, archive the database files, and push them to the cache (S3).
db-migrate:
runs-on: self-hosted-linux-x64-4core-cpu-optimized
name: Database migration
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Get migration hash
run: echo "MIGRATION_HASH=$(git ls-files -s migrations | git hash-object --stdin)" >> "$GITHUB_ENV"
- name: Check if migration cache exists
id: cache-hit-check
uses: runs-on/cache/restore@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
env:
RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
with:
path: /var/tmp/db_data
lookup-only: true
key: test-${{ runner.os }}-${{ runner.arch }}-db-migration-${{ env.MIGRATION_HASH }}
- name: Setup db
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
run: |
cp etc/docker/.env.default etc/docker/.env
docker compose -f docker-compose.ci.yml up -d mysql
docker run --network snkrdunkcom_default jwilder/dockerize:v0.9.3 -wait tcp://mysql:3306 -timeout 3m
docker compose exec mysql mysql -uroot -psnkrdunk -e 'SET GLOBAL default_collation_for_utf8mb4=utf8mb4_general_ci'
make migrate-up DB_NAME=snkrdunk_test
- name: Cache preparation
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
run: |
docker compose -f docker-compose.ci.yml down
sudo chmod -R 775 /var/tmp/db_data/
- name: Saving DB migrated data for test
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
id: save-migrated-db-data
uses: runs-on/cache/save@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
env:
RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
with:
path: /var/tmp/db_data
key: test-${{ runner.os }}-${{ runner.arch }}-db-migration-${{ env.MIGRATION_HASH }}
With the migrated data already in the database, instead of running the migration command (150 seconds) inside each job every time, we now only need less than a second to download the MySQL data files (~31MB).
That's a huge runtime saving!
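For this to work, the MySQL data directory has to live on the host path that gets cached. A rough sketch of what the mysql service in docker-compose.ci.yml might look like (an assumption for illustration, not our exact file):

services:
  mysql:
    build: ./etc/docker/mysql
    image: snkrdunkcom-mysql
    volumes:
      # Bind-mount the data directory so the migrated schema can be archived and cached in S3
      - /var/tmp/db_data:/var/lib/mysql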
Caching the go mod download
This is just running go mod download once before the test jobs, then restoring the volume containing the downloaded Go modules so that the Golang container doesn't have to run go mod download again.
There's nothing fancy here, so I won't go into detail.
The results
In the end, our workflow was transformed from
to
Here are the results, as monitored in Datadog via CI Pipeline Visibility:
Our CI runtime is significantly reduced 🎉
Based on the numbers from GitHub Actions Performance Metrics, our jobs now run 20% faster on average, and up to 54% faster in the best case.
And the best part: after migrating to EC2 self-hosted runners, our latest month of GitHub Actions usage, which would have cost $24,700, now costs just $3,000 💸
That's 87.5% cost savings!
Conclusion
- Self-hosted is much cheaper than GitHub-hosted, even with On-demand instances.
- When using Spot Instances, remember to change the default allocation strategy for a better termination rate.
- You control everything and can tailor your runners to optimize your CI/CD pipeline.
- CI observability is very useful. We do have another blog post about DataDog CI Pipeline Visibility here: https://zenn.dev/team_soda/articles/b10194a91dbd34
- Be aware of stuck 'queued' jobs; they can be very annoying.
Final thought
Sure, migrating to self-hosted runners comes with some maintenance cost, but for us, the benefits of faster CI/CD pipelines and, especially, a much smaller GitHub bill are well worth it.
Recently, we started using AI agents to write code and push it to GitHub automatically, so we expect the number of jobs running on GitHub Actions to skyrocket from now on. With this migration, the impact on our bill will only grow over time.
