GitHub Self-hosted Runners - 54% faster CI with just 13% of the cost
Hi, I'm Duc, a member of SODA's SRE team.
Here is how we speed up our CI workflows while also reducing the cost of GitHub Actions.
Background
Our team focuses on maintaining the availability and performance of the snkrdunk.com website and the SNKRDUNK application.
Besides that, we are also responsible for controlling the cost of our cloud infrastructure and other monitoring and development tools.
Every month, we review the billing for our infrastructure and tools, such as AWS and GitHub.
We noticed that the GitHub bill was quite high, especially the GitHub Actions part.
For context, here is a comparison of our GitHub Actions usage vs. our production workload:
- Fargate: ~$7,000 per month with a Savings Plan, serving an average of about 5,000 requests per second
- GitHub Actions: the budget is capped at $9,000 per month to avoid unexpected costs, but it regularly crosses this threshold, so we have to raise the budget to keep our CI/CD pipelines from stopping altogether. Some months the GitHub bill reached as high as $18,000
That's quite insane!
What is GitHub Actions
GitHub Actions has become increasingly popular since its release in 2019.
GitHub provides two types of runners: GitHub-hosted runners (which run on Azure, since Microsoft acquired GitHub in 2018) and self-hosted runners.
GitHub-hosted runners come with pre-installed software on Ubuntu and are very easy to use with all kinds of programming languages, but they come at a fairly high cost.
For comparison, for an x86_64 runner with 2 cores, 8GB of memory, and 75GB of disk (only 30GB usable):
- GitHub-hosted: $0.008 per minute (rounded up to the nearest whole minute)
- Self-hosted (on-demand EC2 instance, N. Virginia region, m7i-flex.large): $0.0958 (EC2) + $0.0083 (EBS) + $0.005 (IPv4) = $0.1091 per hour ≈ $0.0018 per minute
That's a 77% price difference! The savings are even bigger if you use Spot Instances.
So, how do we do it?
The all-in-one solution
After a quick search on Google, 2 repositories caught my eye.
- The Runs-on project: https://github.com/runs-on
- The Philips Labs AWS self-hosted runner project: https://github.com/github-aws-runners/terraform-aws-github-runner
Both are all-in-one solutions, very well-documented, and easy to deploy.
Since our IaC is Terraform, I went with the Philips Labs Terraform module.
Note: Runs-on requires a license fee ($300/year) for commercial use.
The architecture of this module is as follows.
The catch
So all you have to do is set the parameters for the Terraform module. It should be easy, right?
After setting up the module, I ran terraform apply, registered a GitHub App in the Organization settings, and the dummy job was picked up as expected.
That was a good sign, so I switched all the test workflows to the self-hosted runners. After a few days, I ran into some issues.
The cost of the NAT Gateway was too high
When I checked the billing, I was surprised at how high the cost was.
Checking the Cost Explorer with the tag I allocated for this project, I found that the NAT Gateway was the main culprit.
As you can see in the image above, the cost of the NAT Gateway was about 10 times higher than the EC2 cost.
Initially, I placed all the runner instances in private subnets, so whenever they need to communicate with the internet, the traffic goes through the NAT Gateway:
- On startup, the runners must download the GitHub Actions runner binary and other software required to run GitHub workflows
- git checkout
- Docker image pulls from docker.io
Although AWS does not charge for data coming into its network, NAT Gateways charge for every byte of data they process, regardless of direction.
To avoid the weird errors that occurred when reusing a runner instance, I used ephemeral runners: every job gets a fresh instance, which is terminated when the job completes. That also means all the steps above run over and over again, generating a significant amount of traffic through the NAT Gateways.
To minimize internet communication as much as possible, I made a few changes to the architecture.
- Build and use a custom AMI instead of setting everything up from scratch. This reduces the time it takes a runner to get ready for a job. The AMI includes all the software (C libraries for cgo, Go migration tools, etc.) and Docker images (Golang, MySQL, Redis, etc.) that our workflows use. (The Philips Labs module also provides a sample Packer template for building custom AMIs, which is a good starting point.)
- To reduce the number of image pulls from Docker Hub and avoid its rate limit, we switched to AWS ECR Public with a pull-through cache over PrivateLink.
Another alternative is using public subnets instead of private ones, so each instance gets a public IP and can reach the internet on its own, removing the need for a NAT Gateway. But this also means your runner instances are exposed to the internet during their lifetime, so make sure you harden your security group to prevent unwanted access (for example, only outbound rules and no inbound rules, as sketched below).
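For reference, a security group created with the AWS CLI starts with no inbound rules and a default allow-all outbound rule, which is exactly the posture you want here (a sketch; the name and VPC ID are placeholders):

# Creates a security group with no inbound rules; the default egress rule allows all outbound traffic
aws ec2 create-security-group \
  --group-name gh-runner-egress-only \
  --description "GitHub Actions runner (egress only)" \
  --vpc-id vpc-0123456789abcdef0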
Result: Daily cost reduced from $800 to ~$100
The Docker Hub rate limit
Our CI workflows require several Docker images, but Docker Hub has a very strict pull rate limit.
(We do have a Team plan on Docker Hub, but we prefer not to add secrets to our CI workflows if we can avoid it.)
This is not a problem on GitHub-hosted runners, because every job gets a new machine with a new IPv4 address. But with the NAT Gateway, all of our instances reach Docker Hub through a limited set of IPs, so the chance of getting 429 (rate limit) errors from Docker Hub is pretty high.
So we switched to ECR Public + pull-through cache + an AWS VPC endpoint.
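For reference, the pull-through cache rule itself is a one-liner with the AWS CLI (a sketch; the repository prefix, region, and account ID are placeholders):

# Mirror ECR Public into the private registry so pulls can go over the VPC endpoint
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix ecr-public \
  --upstream-registry-url public.ecr.aws \
  --region us-east-1

# Images are then pulled through the private registry instead of the upstream
docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/ecr-public/docker/library/golang:1.24.3-bullseye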
ECR Public hosts almost all of the popular Docker repositories (Golang, Redis, MySQL, etc.), so you should be able to find the repositories you need quite easily.
Our developers had a minor concern that the images from ECR Public might differ from the ones on Docker Hub, so we used the docker scout command to confirm they are identical.
docker scout compare --to golang:1.24.3-bullseye public.ecr.aws/docker/library/golang:1.24.3-bullseye
i New version 1.18.0 available (installed version is 1.17.0) at https://github.com/docker/scout-cli
! 'docker scout compare' is experimental and its behavior might change in the future
✓ SBOM obtained from attestation, 298 packages found
✓ Provenance obtained from attestation
✓ SBOM of image already cached, 298 packages indexed
## Overview
│ Analyzed Image │ Comparison Image
────────────────────┼───────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────
Target │ public.ecr.aws/docker/library/golang:1.24.3-bullseye │ golang:1.24.3-bullseye
digest │ cd43396a4113 │ cd43396a4113
tag │ 1.24.3-bullseye │ 1.24.3-bullseye
platform │ linux/arm64/v8 │ linux/arm64/v8
provenance │ https://github.com/docker-library/golang.git │ https://github.com/docker-library/golang.git
│ 6f5593131e9bccda9a4e83f858427d4d0d16b58d │ 6f5593131e9bccda9a4e83f858427d4d0d16b58d
vulnerabilities │ 0C 1H 3M 124L │ 0C 1H 3M 124L
│ │
size │ 280 MB │ 280 MB
packages │ 298 │ 298
│ │
Base image │ buildpack-deps:4724dfb3ebb274c6a19aee36c125858295ad91950e78a195b71f229228a6aaeb │ buildpack-deps:4724dfb3ebb274c6a19aee36c125858295ad91950e78a195b71f229228a6aaeb
tags │ also known as │ also known as
│ • bullseye-scm │ • bullseye-scm
│ • oldstable-scm │ • oldstable-scm
vulnerabilities │ 0C 1H 3M 63L │ 0C 1H 3M 63L
## Environment Variables
GOLANG_VERSION=1.24.3
GOPATH=/go
GOTOOLCHAIN=local
PATH=/go/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
## Packages and Vulnerabilities
298 packages unchanged
As you can see, all the packages that were installed inside both images are identical.
The self-hosted runners regularly terminated due to the spot instances being reclaimed from AWS
To optimize cost, we use Spot Instances in a US region (the cheapest compared to other regions). But since these are Spot Instances, they can be terminated at any time when AWS needs the capacity back for other customers.
But there is a trick we can use to reduce the termination rate.
By default, the EC2 Fleet API uses the lowest-price strategy when allocating Spot Instances. The available allocation strategies are:
- lowest-price -> the default
- diversified
- capacity-optimized -> lowest interruption rate
- price-capacity-optimized -> best balance of price and interruption rate
The interruption rate is not the same across AZs: an Availability Zone with more spare capacity should have a lower interruption rate. You can change the allocation strategy with the module's instance_allocation_strategy parameter.
As you can see from the picture above, the runner termination rate improved a lot after we switched to the capacity-optimized strategy.
(Since running on EC2 is already significantly cheaper than GitHub-hosted, we think paying a bit more for Spot capacity is acceptable, and it also gives our developers a better experience since they don't need to retry as many jobs in their PRs.)
The job queue time was quite significant compared to GitHub-hosted Runners
There were 2 reasons for this.
- The preparation steps (Docker installation, Actions runner binary setup) take at least a minute when a runner boots up. This can be resolved by using the custom AMIs described above.
- The EC2 quota limit.
Normally, we use the Tokyo region for our production workloads, so we had never paid much attention to the quota limits of other regions.
With the first reason fixed by the custom AMIs, a runner should be ready to pick up a job within a minute or so, but it often took much longer than that.
Looking at the scale-up Lambda function's logs, we found several errors when calling the EC2 Fleet API: MaxSpotInstanceCountExceeded. This means our Spot requests failed because we had hit an AWS quota. What's interesting is that when we checked the quota in the AWS console, the limit was a seemingly arbitrary number (648), even though we had never requested an increase.
It turns out the Spot Instance quota is a kind of soft limit: AWS continuously monitors customer usage and raises it incrementally as needed. In our case, usage jumped suddenly when we switched all the tests to self-hosted runners, so this process couldn't keep up, causing the limit errors.
After requesting a reasonable limit, the queue time dropped significantly.
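If you run into the same error, the Service Quotas API lets you check and raise the limit without waiting for AWS to adjust it on its own (a sketch; L-34B43A08 should be the quota code for "All Standard Spot Instance Requests", but double-check the code and pick a desired value that matches your peak usage):

# Check the current Spot Instance vCPU quota in the runners' region
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-34B43A08 \
  --region us-east-1

# Request an increase that covers the peak number of concurrent runner vCPUs
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-34B43A08 \
  --desired-value 2000 \
  --region us-east-1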
Some jobs get stuck in the "queued" state forever
This is quite a headache.
Sometimes, in workflows with several parallel jobs, only some of the jobs get stuck in the queued state. The GitHub console shows nothing, so we have no idea whether a job displayed as queued has been assigned to an EC2 instance or not. And since there is no instance ID, we cannot tell which instance has the problem.
Considering the architecture, we checked all of the following logs:
- GitHub Apps
- API Gateway access log
- Webhook Lambda log
- Scale-up Lambda log
We realized that the scale-up Lambda log contains entries like this
{
"level": "INFO",
"message": "Created instance(s): i-06486eac2256681ca",
"timestamp": "2025-07-01T10:51:09.772Z",
"service": "runners-scale-up",
"sampling_rate": 0,
"xray_trace_id": "1-6863bd99-21c88f07e1a5a3e8328e7200",
"region": "us-east-1",
"environment": "gh-ci-x64-2core-cpu-optimized",
"module": "runners",
"aws-request-id": "cd33581c-5b7d-575e-b3c2-38c1e4a99eec",
"function-name": "gh-ci-x64-2core-cpu-optimized-scale-up",
"runner": {
"type": "Org",
"owner": "org-owner",
"namePrefix": ""
},
"github": {
"event": "workflow_job",
"workflow_job_id": "45123342099"
}
}
Looking at the GitHub console, we can see that the job ID is the last part of the URL:
https://github.com/{org}/{repository}/actions/runs/{workflow_run_id}/job/{job_id}
So with the job ID, we can check whether a job has been assigned to an instance or not.
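For example, a quick way to inspect a single job, using the job ID from the log entry above:

# Shows whether the job has actually been assigned to a runner
gh api /repos/{org}/{repository}/actions/jobs/45123342099 \
  --jq '{status: .status, runner_name: .runner_name, started_at: .started_at}'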
But when a job got stuck, even with the job ID in hand, we found that the instance for that job had been created successfully and was running normally!
It turns out this is a bug in the Actions runner binary.
Somehow, the runner instance does pick up the job but is unable to reach GitHub to push logs and report its status, so the job stays in the queued state forever. This bug is still unresolved at the time of writing.
(I assume there is a connectivity problem between AWS and Azure, or GitHub has some kind of internal rate limit.)
For stuck jobs, our team members can manually cancel and rerun the workflow from the console. But checking nearly a hundred jobs to see whether they are really stuck or just waiting for their turn to run is very frustrating; no one wants to do that.
Finally, we came up with a simple solution.
Using a GitHub-hosted runner, we run a scheduled workflow every 15 minutes that checks whether any workflow run has jobs stuck in the queued state for more than 15 minutes. If there is one, it cancels that workflow run and reruns it.
# Runs every 15 minutes, as described above
on:
  schedule:
    - cron: "*/15 * * * *"

jobs:
retry-workflows:
runs-on: ubuntu-24.04
name: Retry Queued Workflows
steps:
- name: Check and retry queued workflows
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
QUEUED_RUNS=$(gh api --method GET /repos/{org}/{repository}/actions/runs -F status=queued --jq '.workflow_runs[] | .id')
CURRENT_TIME=$(date +%s)
for run_id in $QUEUED_RUNS; do
QUEUED_JOBS=$(gh api --method GET /repos/{org}/{repository}/actions/runs/"$run_id"/jobs --jq '.jobs[] | select(.status=="queued") | .id')
for job_id in $QUEUED_JOBS; do
# Get the created_at timestamp for the run
CREATED_AT=$(gh api --method GET /repos/{org}/{repository}/actions/jobs/"$job_id" --jq '.created_at')
CREATED_TIME=$(date -d "$CREATED_AT" +%s)
# Calculate how long the workflow has been queued (in minutes)
QUEUED_MINUTES=$(( ("$CURRENT_TIME" - "$CREATED_TIME") / 60 ))
echo "The job_id $job_id in the workflow $run_id has been queued for $QUEUED_MINUTES minutes"
# Only retry if queued time is between 15 and 120 minutes
if [ "$QUEUED_MINUTES" -ge 15 ] && [ "$QUEUED_MINUTES" -le 120 ] ; then
echo "Processing workflow run $run_id"
gh run cancel "$run_id"
sleep 5
for i in {1..5}; do
if gh run rerun "$run_id"; then
break
fi
echo "Retry $run_id attempt $i failed. Waiting 5 seconds before next attempt..."
sleep 5
done
break
fi
done
done
With this, we no longer need to cancel and retry stuck jobs manually.
The temporary partition size (tmpfs /tmp)
We use Amazon Linux 2023 as the base image for our custom AMI, but this Linux distribution from AWS comes with a caveat.
The /tmp partition is a tmpfs file system, limited to 50% of RAM and a maximum of one million inodes.
If you plan to use /tmp for caching, be careful: it can fill up very quickly and cause your workflows to fail.
You can use /var/tmp (which is on EBS) as a replacement.
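You can check the limits on a running instance and, if you really need a bigger /tmp, remount it (a sketch; the size and inode values are arbitrary examples):

# Check the size and inode limits of the tmpfs-backed /tmp
df -h /tmp
df -i /tmp

# Option 1: point temp-file-heavy tools at the EBS-backed /var/tmp
export TMPDIR=/var/tmp

# Option 2: remount /tmp with a larger size and inode limit
sudo mount -o remount,size=16G,nr_inodes=2m /tmp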
Optimize workflows for faster CI runtime
Right-sizing runner instances based on each job's requirements
With the price per runner significantly reduced by using EC2 instances, we can afford to scale up instances for better runtimes.
But not every job can take advantage of more CPU and memory. For example, our Go tests are split into serial and parallel suites, and only the parallel jobs run faster on instances with more CPUs.
With EC2, we can use CloudWatch to gather resource metrics and scale instances up as needed.
This is also possible if you are using GitHub-hosted runners: there is a GitHub Action that collects the relevant metrics while a runner executes a job and publishes the results to the workflow summary.
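For example, adding a telemetry action as the first step of a job makes CPU, memory, and I/O charts appear in the workflow summary (a sketch assuming the community catchpoint/workflow-telemetry-action, which may not be the exact action referred to above):

jobs:
  test:
    runs-on: ubuntu-24.04
    steps:
      # Collects CPU, memory, network, and disk metrics for the rest of the job
      # and publishes charts to the workflow run summary
      - uses: catchpoint/workflow-telemetry-action@v2
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - run: go test ./...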
After checking the workflow telemetry results, we also realized that some of our jobs are CPU-intensive, which means that instead of the m instance family (CPU-to-memory ratio of 1:4), we can use the c family (ratio of 1:2) for better price/performance.
Meanwhile, GitHub-hosted runners only offer one general-purpose instance type, with roughly the same CPU-to-memory ratio as the EC2 m family.
This is just a simple example. If your workflows can take advantage of other instance types (network-optimized, memory-optimized, GPU instances, etc.), switch to those for even better runtimes.
Switching cache backend from GitHub cache to S3
GitHub Actions has its own caching solution for sharing files between jobs and workflows to reduce runtime.
But each repository gets a maximum of 10GB. Our repositories are fairly complex; 10GB proved not to be enough, and our cache entries were often evicted by GitHub.
Since we moved to self-hosted runners on EC2, using S3 as the cache storage makes more sense:
- Virtually unlimited storage
- Better download and upload speeds via an S3 Gateway Endpoint (300~400MB/s vs. 50~100MB/s with the GitHub Actions cache)
The runs-on project mentioned above provides a drop-in replacement for the actions/cache action.
You just need to add two environment variables, AWS_REGION and RUNS_ON_S3_BUCKET_CACHE, and make sure the EC2 runners have the proper permissions to access the S3 bucket.
The other parameters are the same as actions/cache, and it automatically falls back to the GitHub Actions cache if it can't use S3.
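A minimal usage example (a sketch; the bucket name, region, and cache key are placeholders, and the version to pin should be checked against the runs-on/cache releases):

- name: Cache Go modules
  uses: runs-on/cache@v4
  env:
    AWS_REGION: us-east-1
    RUNS_ON_S3_BUCKET_CACHE: my-ci-cache-bucket
  with:
    path: ~/go/pkg/mod
    key: go-mod-${{ runner.os }}-${{ hashFiles('**/go.sum') }}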
Fix the caching problems between steps
This isn't really specific to self-hosted runners, but here are a few tricks we used to optimize our workflows.
Our test workflow has the following steps:
- Check whether the changes contain files that require running tests (such as *.go, *.sql)
- Divide the tests into several parallel jobs based on tags
- Inside each job:
  - Check out the source code
  - Prepare Docker containers (go mod download, MySQL, Redis, etc.)
  - Run the migration
  - Run the tests
The test workflow looks like this.
And inside each job, steps like this
As you can see, the Setup backends step consumes about 3m30s, and it is the same for every parallel job.
For jobs that only take 5 or 6 minutes in total, that's about half of the runtime!
So here is what we did to speed up the whole workflow.
Caching the MySQL container
We use the official MySQL image, but our database needs some extra configuration for Japanese language support, so we have to build a custom image before running the tests.
#4 [mysql 1/2] FROM docker.io/library/mysql:8.0.36
#4 DONE 0.2s
#5 [mysql 2/2] RUN microdnf install -y glibc-locale-source && localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
#5 1.177 Downloading metadata...
#5 16.20 Downloading metadata...
#5 28.49 Downloading metadata...
#5 29.00 Downloading metadata...
#5 32.86 Package Repository Size
#5 32.86 Installing:
#5 32.86 glibc-gconv-extra-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 1.6 MB
#5 32.86 glibc-locale-source-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 4.4 MB
#5 32.86 Upgrading:
#5 32.86 glibc-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 2.3 MB
#5 32.86 replacing glibc-2.28-236.0.1.el8_9.12.x86_64
#5 32.86 glibc-common-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 1.1 MB
#5 32.86 replacing glibc-common-2.28-236.0.1.el8_9.12.x86_64
#5 32.86 glibc-minimal-langpack-2.28-251.0.3.el8_10.16.x86_64 ol8_baseos_latest 76.5 kB
#5 32.86 replacing glibc-minimal-langpack-2.28-236.0.1.el8_9.12.x86_64
#5 32.86 Transaction Summary:
#5 32.86 Installing: 2 packages
#5 32.86 Reinstalling: 0 packages
#5 32.86 Upgrading: 3 packages
#5 32.86 Obsoleting: 0 packages
#5 32.86 Removing: 0 packages
#5 32.86 Downgrading: 0 packages
#5 32.86 Downloading packages...
#5 32.98 Running transaction test...
#5 33.33 Updating: glibc-common;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 33.49 Updating: glibc-minimal-langpack;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 33.50 Updating: glibc;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 33.75 Installing: glibc-gconv-extra;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 33.93 Installing: glibc-locale-source;2.28-251.0.3.el8_10.16;x86_64;ol8_baseos_latest
#5 34.31 Cleanup: glibc;2.28-236.0.1.el8_9.12;x86_64;installed
#5 34.32 Cleanup: glibc-minimal-langpack;2.28-236.0.1.el8_9.12;x86_64;installed
#5 34.33 Cleanup: glibc-common;2.28-236.0.1.el8_9.12;x86_64;installed
#5 34.52 Complete.
#5 DONE 36.3s
#6 [mysql] exporting to image
#6 exporting layers
#6 exporting layers 0.7s done
#6 writing image sha256:bc72bb57206fdb5aeee0e8bd8652e186861312b6abf67f42841c76142cb6fa64 done
#6 naming to docker.io/library/snkrdunkcom-mysql done
#6 DONE 0.7s
#7 [mysql] resolving provenance for metadata file
#7 DONE 0.0s
The build process takes ~30s. But our MySQL configuration rarely changes (a MySQL version bump, perhaps), so we don't need to rebuild it over and over in every job.
So we added a MySQL Build step before running all the tests:
- Check the hash of the etc/docker/mysql folder, which contains the custom Dockerfile and other MySQL config files.
- Using that hash as the cache key, check whether a built image already exists in the cache.
- If not, run the build command.
- Save the MySQL image to disk, archive the file, and push it to the cache (S3).
mysql-build:
runs-on: self-hosted-linux-x64-4core-cpu-optimized
name: Building MySQL image
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
sparse-checkout: |
etc/docker/mysql
docker-compose.ci.yml
- name: Get mysql folder hash
run: echo "MYSQL_IMAGE_HASH=$(git ls-files -s etc/docker/mysql | git hash-object --stdin)" >> "$GITHUB_ENV"
- name: Check if cache exists
id: cache-hit-check
uses: runs-on/cache/restore@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
env:
RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
with:
path: /tmp/docker-build/mysql
lookup-only: true
key: test-${{ runner.os }}-${{ runner.arch }}-snkrdunkcom-mysql-${{ env.MYSQL_IMAGE_HASH }}
- name: Build mysql image
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
run: |
cp etc/docker/.env.default etc/docker/.env
docker compose -f docker-compose.ci.yml build mysql
- name: Cache preparation
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
run: |
mkdir -p /tmp/docker-build/mysql
docker save -o /tmp/docker-build/mysql/snkrdunkcom-mysql.tar snkrdunkcom-mysql
- name: Saving mysql image
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
id: save-mysql-image
uses: runs-on/cache/save@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
env:
RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
with:
path: /tmp/docker-build/mysql
key: test-${{ runner.os }}-${{ runner.arch }}-snkrdunkcom-mysql-${{ env.MYSQL_IMAGE_HASH }}
Then, in each test job, download the MySQL image archive from S3 and load it:
docker load < /tmp/docker-build/mysql/snkrdunkcom-mysql.tar
Since the loaded image has the same name referenced in the Docker Compose file, the MySQL image won't be built again.
So instead of building the custom MySQL image, which costs 30 seconds of runtime, we now only need ~1 second to download the archive (~350MB) from S3 and ~5 seconds to load it.
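The restore side in each test job looks roughly like this (a sketch, reusing the same hash and cache key as the build job above):

- name: Get mysql folder hash
  run: echo "MYSQL_IMAGE_HASH=$(git ls-files -s etc/docker/mysql | git hash-object --stdin)" >> "$GITHUB_ENV"
- name: Restore mysql image
  uses: runs-on/cache/restore@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
  env:
    RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
  with:
    path: /tmp/docker-build/mysql
    key: test-${{ runner.os }}-${{ runner.arch }}-snkrdunkcom-mysql-${{ env.MYSQL_IMAGE_HASH }}
- name: Load mysql image
  run: docker load < /tmp/docker-build/mysql/snkrdunkcom-mysql.tar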
Caching the migrated database
To run the tests, the migration command must be run first to prepare the schema. Our service, which started in 2018, now requires more than 550 migration steps to be executed.
This step takes >150 seconds in each job!
But not every change contains a database migration.
So we added a job before all the test jobs just to run the migration process:
- Check the hash of the migrations folder.
- Using that hash as the cache key, check whether a migrated database already exists in the cache.
- If not, spin up the MySQL container and run the migration command.
- Stop the MySQL container, archive the database files, and push them to the cache (S3).
db-migrate:
runs-on: self-hosted-linux-x64-4core-cpu-optimized
name: Database migration
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Get migration hash
run: echo "MIGRATION_HASH=$(git ls-files -s migrations | git hash-object --stdin)" >> "$GITHUB_ENV"
- name: Check if migration cache exists
id: cache-hit-check
uses: runs-on/cache/restore@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
env:
RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
with:
path: /var/tmp/db_data
lookup-only: true
key: test-${{ runner.os }}-${{ runner.arch }}-db-migration-${{ env.MIGRATION_HASH }}
- name: Setup db
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
run: |
cp etc/docker/.env.default etc/docker/.env
docker compose -f docker-compose.ci.yml up -d mysql
docker run --network snkrdunkcom_default jwilder/dockerize:v0.9.3 -wait tcp://mysql:3306 -timeout 3m
docker compose exec mysql mysql -uroot -psnkrdunk -e 'SET GLOBAL default_collation_for_utf8mb4=utf8mb4_general_ci'
make migrate-up DB_NAME=snkrdunk_test
- name: Cache preparation
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
run: |
docker compose -f docker-compose.ci.yml down
sudo chmod -R 775 /var/tmp/db_data/
- name: Saving DB migrated data for test
if: ${{ steps.cache-hit-check.outputs.cache-hit != 'true' }}
id: save-migrated-db-data
uses: runs-on/cache/save@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
env:
RUNS_ON_S3_BUCKET_CACHE: dummy-bucket
with:
path: /var/tmp/db_data
key: test-${{ runner.os }}-${{ runner.arch }}-db-migration-${{ env.MIGRATION_HASH }}
With the migrated data already in the database, instead of running the migration command (150 seconds) inside each job every time, we now only need less than a second to download the MySQL data files (~31MB).
That's a huge runtime saving!
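For this to work, the MySQL data directory has to live on the host path that gets cached. A rough sketch of what the mysql service in docker-compose.ci.yml might look like (an assumption for illustration, not our exact file):

services:
  mysql:
    build: ./etc/docker/mysql
    image: snkrdunkcom-mysql
    volumes:
      # Bind-mount the data directory so the migrated schema can be archived and cached in S3
      - /var/tmp/db_data:/var/lib/mysql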
Caching the go mod download
This is just running go mod download once before the test jobs, then restoring the volume containing the downloaded Go modules so that the Golang container doesn't have to run go mod download again.
There's nothing fancy here, so I won't go into detail.
The results
In the end, our workflow was transformed from
to
Here are the results, as monitored in Datadog via CI Pipeline Visibility:
Our CI runtime is significantly reduced 🎉
Based on the numbers from GitHub Actions Performance Metrics, our jobs now run 20% faster on average, and up to 54% faster in the best case.
And the best part: after migrating to EC2 self-hosted runners, our latest month of GitHub Actions usage, which would have cost $24,700, now costs just $3,000 💸
That's 87.5% cost savings!
Conclusion
- Self-hosted is much cheaper than GitHub-hosted, even with On-demand instances.
- When using Spot Instances, remember to change the default allocation strategy for a better termination rate.
- You control everything and can tailor your runners to optimize your CI/CD pipeline.
- CI observability is very useful. We do have another blog post about DataDog CI Pipeline Visibility here: https://zenn.dev/team_soda/articles/b10194a91dbd34
- Be aware of stuck 'queued' jobs; they can be very annoying.
Final thought
Sure, migrating to self-hosted runners comes with some maintenance cost, but for us, the benefits of faster CI/CD pipelines and, especially, a much smaller GitHub bill are well worth it.
Recently, we started using AI agents to write code and push it to GitHub automatically, so we expect the number of jobs running on GitHub Actions to skyrocket from now on. With this migration, the impact on our bill will only grow over time.
