Translated by AI
Why Achieving Over 5 Million Yen in Annual Cost Savings Feels Strangely Empty
Table of Contents
Introduction
What I Did and the Results
Improving SHAP Value Processing
Background
Issue 1: Using lambda expressions inside for-loops
Issue 2: Converting int to str
Solution
Execution Time
Calculating Cost Avoidance
Skipping Preprocessing in Model Creation
Background
Solution
Calculating Cost Reduction
Why It Feels Empty
Closing Thoughts
Introduction
By optimizing SHAP value processing, I successfully reduced processing time by over 99.99% and suppressed AWS compute costs by more than 4.4 million yen per year.
I also enabled skipping pre-processing steps that created heavy intermediate tables during every batch run, reducing Athena scan fees by over 600,000 yen per year.
This should be highly impactful and meaningful engineering from a business perspective.
Yet, why?
Why does it feel so strangely plain?
Why is there such a lack of a "mission accomplished" feeling?
Why does it feel so empty...
So, partly to sort out this feeling of emptiness, I decided to document what I did.
What I Did and the Results
As mentioned earlier, I did the following two things:
- Optimized SHAP value processing, reducing processing time by over 99.99% and suppressing AWS compute costs by more than 4.4 million yen per year.
- Enabled skipping pre-processing that created heavy intermediate tables during every batch run, reducing Athena scan fees by over 600,000 yen per year.
As a result, I achieved a cumulative cost reduction of over 5 million yen per year.

I will now break down the background and specific measures for each.
Improving SHAP Value Processing
Background
This is about cost avoidance.
By cost avoidance here, I mean preventing costs that would have originally been incurred.
A new SHAP value processing step was being added, but since the implementation used in other features was high-cost, I optimized it so it could handle large-scale models.
SHAP values are metrics that fairly quantify how much each feature contributes to a model's prediction. They are used to explain why a machine learning model output a specific value.
The background of this effort involved overfitting. Training errors were extremely small, but generalization errors were over 20%—and in some cases, more than 45%. Therefore, SHAP was introduced as one of the tasks to isolate the cause of the increased generalization error. (In reality, it wasn't overfitting but data leakage, but that's a story for another time.)
Issue 1: Using lambda expressions inside for-loops
Since other models already employed SHAP processing, I tried applying that same logic, but the logs stopped in the middle of the SHAP process. No error was thrown, but the process didn't seem to be moving at all.
Upon investigation, I found it was stuck at the following process:
```python
for i in Series_median.index:
    if Series.loc[:, i].max() == Series_median[i]:
        Series.loc[:, i] = Series.loc[:, i].apply(lambda x: 1 if x == Series_median[i] else 0)
    else:
        ...
```
This is preprocessing for SHAP interpretation, but the problem wasn't the logic itself, but the implementation.
The above uses a lambda expression to return 1 if the condition is met and 0 otherwise, calling a Python function for every element in the Series.
Implementing lambda with apply can reduce the amount of code, but internally it calls a Python function for every element, which prevents vectorization.
As a result, the overhead of function calls multiplied by the number of iterations in the for loop, making it a likely bottleneck for big data.
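As a rough illustration of that overhead, the following sketch (with an illustrative column size, not the article's data) compares `apply` with a lambda against a vectorized `np.where` on the same column:

```python
import time

import numpy as np
import pandas as pd

# Toy column; the real runs here had ~150 such columns over far more rows.
s = pd.Series(np.random.rand(1_000_000))
threshold = s.median()

t0 = time.perf_counter()
slow = s.apply(lambda x: 1 if x == threshold else 0)  # one Python call per element
t_apply = time.perf_counter() - t0

t0 = time.perf_counter()
fast = np.where(s == threshold, 1, 0)  # single vectorized pass
t_numpy = time.perf_counter() - t0

assert (slow.to_numpy() == fast).all()  # identical results
print(f"apply: {t_apply:.4f}s  np.where: {t_numpy:.4f}s")
```

The two produce identical results; only the per-element Python function calls differ, and that difference compounds across every loop iteration.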
...There were three such for loops in the existing SHAP value processing, causing it to take an incredibly long time.
Issue 2: Converting int to str
Following the previous example, I looked for other slow areas and found this.
When calculating SHAP values, the positive and negative signs are crucial.
Specifically, we look at the magnitude of the feature itself and its impact on the prediction. We check the degree of influence by dividing the features into four quadrants—++, +-, -+, and --—for instance, if a feature has a large magnitude and a positive impact on the prediction, it's ++.
To calculate this, the existing SHAP value processing converted int to str to identify these four quadrants as shown below:
```python
Series = Series.astype(str)
shap_values = shap_values.astype(str)
answer = Series + shap_values
...
for i in Series:
    df_shap.loc[df_shap["fea_name"] == i, "fea+_shap+"] = len(answer[answer[i] == "11"])
...
```
This was quite an anti-pattern because:
- The cost of converting int to str is high.
- str increases memory usage.
- Vectorized operations don't work on str, causing a fallback to Python-level processing.
- Being Python-level processing, the overhead piles up.
It had become a rather egregious piece of processing.
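The memory side of this anti-pattern is easy to see in a toy sketch (sizes are illustrative, not the article's data):

```python
import numpy as np
import pandas as pd

# Toy 0/1 column standing in for one indicator column
s = pd.Series(np.random.randint(0, 2, 1_000_000))

int_bytes = s.memory_usage(deep=True)
str_bytes = s.astype(str).memory_usage(deep=True)

# astype(str) yields an object column of individual Python str objects,
# far heavier than packed int64 values
print(f"int64: {int_bytes:,} bytes  str: {str_bytes:,} bytes")
```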
Solution
So, I refactored it. First, for Issue 1, I changed it as follows:
```python
for i in X_median.index:
    if X.loc[:, i].max() == X_median[i]:
        X.loc[:, i] = np.where(X.loc[:, i] == X.loc[:, i].max(), 1, 0)
    else:
        ...
```
The only point I focused on was: "Eliminate lambda processing and unify with NumPy to maximize vectorized operations."
Vectorized operations mean that instead of processing one element at a time with a for-loop, the same calculation can be written for the entire NumPy array at once.
Previously, it took an enormous amount of time because a Python function was being called using lambda for each element. With the above process, the operation X.loc[:, i] == X.loc[:, i].max() is applied to the entire series at once, so it finishes in an instant.
For Issue 2, I similarly ensured that vectorized operations could be applied, and I devised a way to effectively represent the four quadrants like ++.
```python
df_shap["fea+_shap+"] = 0
df_shap["fea+_shap-"] = 0
df_shap["fea-_shap+"] = 0
df_shap["fea-_shap-"] = 0

for i in Z_median.index:
    # Generate a unique numeric code representing the 4 states (0, 1, 2, 3)
    # from two 0/1 columns:
    # (Z=1, S=1) -> 3 (++)
    # (Z=1, S=0) -> 2 (+-)
    # (Z=0, S=1) -> 1 (-+)
    # (Z=0, S=0) -> 0 (--)
    combined_code = Z[i] * 2 + shap_values_df[i]
    counts = combined_code.value_counts()
    row_index = df_shap["fea_name"] == i
    df_shap.loc[row_index, "fea+_shap+"] = counts.get(3, 0)
    df_shap.loc[row_index, "fea+_shap-"] = counts.get(2, 0)
    df_shap.loc[row_index, "fea-_shap+"] = counts.get(1, 0)
    df_shap.loc[row_index, "fea-_shap-"] = counts.get(0, 0)
```
In addition to ensuring everything is completed with vectorized operations, I introduced a bitwise-style calculation combined_code = Z[i] * 2 + shap_values_df[i] instead of using strings like 11 to represent ++.
As a result, independent values are returned—such as 3 for ++ and 2 for +-—allowing the contribution to be calculated as shown below:
| Z | SHAP | combined_code |
|---|---|---|
| + | + | 3 |
| + | - | 2 |
| - | + | 1 |
| - | - | 0 |
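The counting trick can be seen in isolation in a small sketch (the 0/1 columns here are hypothetical stand-ins for a single feature's indicator columns):

```python
import pandas as pd

# z = 1 when the feature value is large, s = 1 when its SHAP value is positive
z = pd.Series([1, 1, 0, 0, 1])
s = pd.Series([1, 0, 1, 0, 1])

combined_code = z * 2 + s            # 3=++, 2=+-, 1=-+, 0=--
counts = combined_code.value_counts()

# .get(code, 0) stays robust even when a quadrant never occurs
quadrants = {name: counts.get(code, 0)
             for name, code in [("++", 3), ("+-", 2), ("-+", 1), ("--", 0)]}
print(quadrants)
```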
When I thought of this process, I truly felt, "I'm so glad I studied algorithms." You never know what will come in handy.
Execution Time
This model calculates the contact rate for specific media, and it typically trains for about 48 media outlets. Since there are nearly 150 features for each medium, the total number of features processed through the for loops is 7,200, which is quite large-scale.
The execution times for each step before and after the optimization are summarized below.
| | Before Optimization | After Optimization |
|---|---|---|
| 1st for-loop | 17 seconds per feature | 0.5 seconds for 150 features |
| 2nd for-loop | Unmeasurable (process froze) | 5.6 seconds for 150 features |
| 3rd for-loop | Unmeasurable | 15 seconds for 150 features |
| SHAP calculation | Unmeasurable | 0.015 seconds for 150 features |
| Total | Unmeasurable | 30 seconds per medium |
Before the optimization, the process froze due to swap thrashing in the middle of the first for-loop, so the second and subsequent loops could not be measured. Looking at the values after the optimization, it seems it would have taken at least 30 seconds per loop even in the best-case scenario...
Calculating Cost Avoidance
To calculate the minimum cost avoidance, let's assume that the execution time for all steps before the modification, including the SHAP value calculation, took the same amount of time as the first for-loop.
The time required for one phase per medium is 17 s/feature × 150 features = 2,550 seconds, so across all 48 media a single phase takes 2,550 s × 48 = 122,400 s ≈ 34 hours.
Since each phase takes 34 hours, the total processing time for all four phases is 34 h × 4 = 136 hours.
This means that calculating SHAP values for all media alone would take more than five and a half days.
These are batch processes executed regularly at the beginning of every month. Additionally, when code changes occur, it is necessary to execute them in an environment that copies the production environment for test debugging. Since this happens about 2 to 6 times a month, we can assume they run at least 3 times every month.
The batch process runs on AWS EC2, and the instance used is an r5.24xlarge, a monster-class high-spec machine costing $6.05 per hour. Therefore, the annual compute cost is: $6.05/h × 136 h × 3 runs/month × 12 months ≈ $29,621, roughly 4.44 million yen at about 150 yen/USD.
In contrast, the same scenario after the modification takes 1,440 seconds for 48 media, which is 0.4 hours: $6.05/h × 0.4 h × 3 × 12 ≈ $87 per year, roughly 13,000 yen.
Therefore, the minimum annual cost avoidance is: 4.44 million - 0.01 million ≈ 4.43 million yen per year.
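The estimate can be reproduced in a few lines (the ~150 yen/USD rate is my assumption, chosen because it matches the yen totals quoted):

```python
# Reproducing the minimum cost-avoidance estimate
SECONDS_PER_LOOP = 17 * 150                       # 17 s/feature x 150 features
HOURS_PER_PHASE = SECONDS_PER_LOOP * 48 / 3600    # 48 media -> 34 h per phase
HOURS_BEFORE = HOURS_PER_PHASE * 4                # 4 phases -> 136 h per run
HOURS_AFTER = 1440 / 3600                         # 0.4 h per run after the fix

USD_PER_HOUR = 6.05   # r5.24xlarge on-demand rate
RUNS_PER_YEAR = 3 * 12
YEN_PER_USD = 150     # assumed exchange rate

def annual_yen(hours_per_run: float) -> float:
    return hours_per_run * USD_PER_HOUR * RUNS_PER_YEAR * YEN_PER_USD

avoided = annual_yen(HOURS_BEFORE) - annual_yen(HOURS_AFTER)
print(f"{avoided:,.0f} yen/year avoided")
```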
That's over 4.4 million yen in cost avoidance. Hmm, wonderful.
The following diagram illustrates this:

Hmm, I can't really tell. I guess it means cost avoidance on a literally different scale.
Skipping Preprocessing in Model Creation
Background
Next is cost reduction. As mentioned earlier, this model is a batch process and has both production and test environments. Below is the architecture diagram.

The production environment runs at least once a month, and the test environment runs at least twice. Since it is a large-scale model that trains on billions of records, the Athena scan fees are enormous.
It costs about 55,000 yen per execution.
However, test debugging is indispensable for ongoing work such as pipeline automation, the data leakage fix, and minor refactoring.
As such, the execution cost of test debugging was a major bottleneck.
Solution
In a previous effort, I found that the bulk of the Athena scan fees came from creating intermediate tables during preprocessing, rather than fetching feature data. That previous effort is documented here:
I thought if I could skip the creation of those intermediate tables, I could achieve significant cost reductions. Upon investigation, I discovered that the intermediate tables were created using CTAS (Create Table As Select) and physically stored on S3.
So, I modified the code to skip the creation of these preprocessing intermediate tables when a specific string is passed as an argument. Since the production environment runs without this argument, I can narrow down the impact scope to the test environment only, as shown below.
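A minimal sketch of such a switch, assuming an argparse-based batch entry point (the flag and step names here are hypothetical, not the actual code):

```python
import argparse

def run_pipeline(skip_preprocess: bool) -> list[str]:
    """Return the pipeline steps that would execute (names are hypothetical)."""
    steps = []
    if not skip_preprocess:
        # Production path: materialize intermediate tables on S3 via CTAS
        steps.append("create_intermediate_tables_ctas")
    steps.append("fetch_features")
    steps.append("train_model")
    return steps

parser = argparse.ArgumentParser()
parser.add_argument("--skip-preprocess", action="store_true",
                    help="Skip CTAS intermediate-table creation (test env only)")

# Simulate the test-environment invocation; production passes no flag
test_args = parser.parse_args(["--skip-preprocess"])
prod_args = parser.parse_args([])
print(run_pipeline(test_args.skip_preprocess))
print(run_pipeline(prod_args.skip_preprocess))
```

Because the flag defaults to off, the production path is untouched unless the argument is passed explicitly, which keeps the blast radius limited to test runs.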

Calculating Cost Reduction
After the modification, I performed test debugging, checked the Athena scan amount, and compared it with the results before the modification as follows. The fee calculation method is Athena's standard pricing: scanned bytes ÷ 10^12 × $5 per TB, converted to yen (at roughly 150 yen/USD).
| | Before Modification | After Modification |
|---|---|---|
| Scan Amount | 72,314,455,573,227 bytes (≈ 72.3 TB) | 1,347,775,563,955 bytes (≈ 1.35 TB) |
| Fee | Approx. 55,000 yen | Approx. 1,010 yen |
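The fee figures can be reproduced from the scan amounts, assuming Athena's $5-per-TB pricing with decimal terabytes and an assumed ~150 yen/USD:

```python
USD_PER_TB = 5.0      # Athena's standard scan pricing
YEN_PER_USD = 150.0   # assumed exchange rate

def scan_fee_yen(bytes_scanned: int) -> float:
    return bytes_scanned / 1e12 * USD_PER_TB * YEN_PER_USD

before = scan_fee_yen(72_314_455_573_227)   # full run
after = scan_fee_yen(1_347_775_563_955)     # preprocessing skipped
print(f"before: {before:,.0f} yen, after: {after:,.0f} yen")
```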
Therefore, approximately 54,000 yen in costs can be saved per re-execution. To calculate the minimum cost reduction, assuming one regular execution and two test debug runs as before, the reduction effect can be expected for the second test debug run.
The minimum annual cost reduction is: approximately 54,000 yen × 1 saved run × 12 months ≈ 648,000 yen per year.
Illustrated below:

Compared to the cost avoidance measures, the visual clarity in the diagram increased by about 1%, but the amount doesn't seem like much, does it?
However, last month's test debug fees were about 340,000 yen for that month, meaning roughly six full executions were performed. Since refactoring will continue, if we can reduce costs for five runs every month: approximately 54,000 yen × 5 runs × 12 months ≈ 3.24 million yen per year.
This results in a very large cost reduction.
Illustrated again:

...The ratio doesn't change, so the appearance won't change either, right?
However, since the numerical values increase, combined with cost avoidance, I believe I contributed to at least 5 million yen per year, and realistically over 7.6 million yen in cost reduction.

Why It Feels Empty
Looking at the numbers alone, I think this modification was a sufficiently impactful achievement. Yet, for some reason, that "I really did it" feeling didn't well up, and a sense of emptiness remained.
After thinking about it a lot, it suddenly dawned on me.
Nobody praised me...
This initiative wasn't a task to "create something new," but rather a task to "cut away problems that were supposed to happen."
Since I cut them away, it's natural that no problems occurred, but infrastructure improvement tasks like MLOps are often unconsciously perceived as "supposed to work as a matter of course." As a result, it ends up looking as if "nothing happening = nothing was done."
I believe the plainness and emptiness I felt this time didn't come because the achievement was small, but rather from the temperature difference caused by the lack of reaction from those around me—despite the fact that I took responsibility for modifying the core of the process and produced a massive result of over 5 million yen in annual cost reduction.
Praise and recognition are important, aren't they? When I have juniors, I intend to praise them plenty.
Closing Thoughts
This modification was an effort aimed at structurally reducing risks and running costs for the continuous operation of the model.
Personally, I work as an engineer with the goal of "creating business value through engineering," so I believe this outcome is an ideal way to deliver value.
While I mainly calculated the minimum cost reduction in this article,
- The long execution times I eliminated included risks such as re-executions caused by network errors or OOM (Out Of Memory) interruptions, which have also been suppressed.
- Test debugging, where the model is run partway through training, occurs frequently.
- Regular executions also fail about once every three months, causing the production environment to be run twice in a month.
And so on—it was anticipated that much higher costs would originally be incurred, and I believe the core of this modification was being able to prevent them.
It's not a one-off cost reduction but a continuous one through system improvements, so there's a certain beauty in how it becomes more effective the longer this model is used (if I do say so myself).
Although the actual cost reduction cannot be calculated, I was able to perform engineering that makes me dream—thinking it might even reach tens of millions per year.
Nevertheless, MLOps is a job where "working as intended is the norm," and since the work itself is relatively plain, there isn't much praise, which makes it sad that the feeling of having done something truly amazing doesn't quite sink in.
However, that is precisely why there is value in verbalizing and documenting it like this, and I believe it's important for showcasing my hard work and results to those around me.
I've come to realize that promoting my own achievements to get praised and using them as leverage for unit price increases, raises, and promotions is also part of my job. Since I've learned mindsets from this improvement that can be horizontally applied to other large-scale batch processing and MLOps projects, I will continue to devote myself and promote my work to be praised as much as possible.
Discussion
It might help to put some thought into how you present this, for example by adding a graph of the cost trend so it can be understood without reading the text.
Reading the article, it's clear you did real "nailed it" work, but probably not many people will actually dig into the content to understand and evaluate it. After all, it's already finished, and that takes effort.
I felt that if there were about that much explanation up front (5-6 slides' worth), followed by the body of this article, it would be more likely to be read and appreciated.
Thank you for pointing that out!
I've added graphs and architecture diagrams so the cost changes come across intuitively.
If there's anything else that could be improved, please let me know!
I'm an outsider, so I can't speak to the technical details, but...
I suspect the real source of the emptiness is less that "the result was a correction rather than a creation" and more that premises that should have been built in at the design stage were never implemented.
(1) At first, what matters is that it (somehow) works.
(2) Once in operation, you want to avoid the risk of it stopping (so is refactoring itself a risk?).
It would be good if organizations had a mechanism that evaluates and encourages improvement (or concept debugging)... but that's a hard problem.
Thank you for the comment!
You're right, this is a perspective best considered at the design stage rather than left to later refactoring.
This time I joined at the operations stage, but if I'm involved from the design stage in the future, I'll make a point of building it in up front!
Read as a technical war story this is very entertaining, and I think the improvements themselves are excellent.
The criticism of the organization also feels healthy, since it comes from a desire to make things better.
But when the article leans toward "please praise me," the technical success gets dragged along by the emotion, which felt like a bit of a waste.
In practice, I gather this was caught and fixed before production impact, so it also reads as a "bug fix" prevented by advance verification, in the same category as a variable mix-up or a faulty conditional.
What concerned me most in the article is that it never becomes clear where this problem was introduced.
If it was a mistake at the implementation stage, it can simply be chalked up to experience for next time; but if it was an oversight at the design stage, it's something to feel a real sense of alarm about.
I'd recommend identifying where the problem crept in and tying that to recurrence prevention.
Thank you for the comment!
It's very helpful to have it read calmly from a technical standpoint.
As a supplement, let me share a little background on the article's structure and expression.
First, regarding the intent of this output: in the SES business model I work in, results tend to be taken for granted at the client site, while the company I belong to has little visibility into my day-to-day work.
So, partly in the context of career building and rate negotiations, I place importance on verbalizing and making visible to the outside that I can deliver valuable results reproducibly.
That said, simply listing achievements dryly doesn't convey who I am and makes for a dull article, so I deliberately structured it to include emotional expression as well.
Also, this is not a model I designed and implemented from scratch; I joined mid-stream as MLOps on a model already running in production.
In machine learning, you cannot verify whether something has value until you build a working model, so I think it's common for early stages to prioritize speed.
My own role is to be responsible for operating such models without breaking them and improving them continuously.
So I have no intention of judging or criticizing individuals or past decisions; the focus is purely on "how value was raised through operations."
The point you raised about where the problem crept in is very important; in practice we have sorted it out and addressed it from a recurrence-prevention standpoint, but I omitted the details to prioritize readability and entertainment value as an article.
I'd like to make use of your perspective in future output as well.
Thank you again for the thoughtful comment.
Thank you for the thorough follow-up.
It deepened my understanding of the background.
Hearing your story, it came across that you have been delivering solid value in the field, and I think that accumulation is what led to this article.
On the other hand, in the field, "keeping the ordinary ordinary" tends to be the baseline, and I feel that gratitude from customers only comes back in words not when you deliver the expected 100 points, but when you exceed expectations with 120.
This is neither good nor bad; it's something like the air of a professional workplace.
I think this improvement was a "100 points as a first step" within that premise, and if you keep building on it from here, it should lead to greater value, and with it, recognition.
Also, while the person in charge may personally say "that helped, thanks," gratitude is almost never voiced in meetings.
Meetings inevitably focus on issues and next actions, so results tend to be treated as a given.
As for making results visible, I think quantitatively weaving cost reductions and improvement effects into your business reports to your own company would convey a lot by itself, and an article like this one felt like a very good complement to that.
I look forward to your future posts.
That was a fun read!
You'd want 20-30% of the cost savings paid back as a bonus, right? (It's not like you increased the costs on purpose.)
Thank you for the comment!
Getting a share of it would be the best 🤣
I'll keep delivering results and work hard to earn that kind of return!