Why I, a Manufacturing Engineer, Won a Machine Learning Competition by Partnering with Claude


"With all the feature names masked, what on earth are we supposed to do?"

It was the morning that the final project of our training course, a machine learning competition, began, and the colleague sitting next to me was grumbling. There were over 100 features, all named in the feature_001 format. The target variable was a value that changed over time. The evaluation metric was RMSE. We had half a day. As I listened to my colleague, I was thinking about something else: this data looked like patterns I had seen many times in our factory.

Half a day later, I placed 1st out of 20 colleagues. Claude wrote almost all the code. What I did was make three decisions based on on-site intuition. Yet, those were the three things that determined the result.

Target Audience: Engineers who handle manufacturing site data daily but think machine learning is for experts. Those who have started using AI tools but aren't sure how much to trust them.


I've seen this data on the factory floor

The data structure was as follows:

  • Features: Over 100. All names masked (feature_001 format).
  • Target Variable: Continuous values changing over time.
  • Task: Regression (predicting the target variable for the period that follows the training data).
  • Evaluation Metric: RMSE (lower is better).
  • Leaderboard: Nearly 300 entries, including past trainees and instructors.

When I plotted the target variable against time, I knew immediately: the standard deviation in the first half of the period was about twice that of the second half. Something had changed.

In a manufacturing setting, this is a familiar structure. The startup period after equipment changeovers, the unstable period after scheduled maintenance, quality variations immediately after changing raw material lots—using data from periods where conditions have changed to predict stable periods lowers accuracy. Even with masked column names, the patterns shown by the time-series graph overlapped with what I'd seen on the factory floor many times.
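The check described above can be sketched in a few lines. This is only an illustration, not the code from the competition: the synthetic series, the midpoint split, and the helper name half_std_ratio are all assumptions made for the example.

```python
import numpy as np


def half_std_ratio(y):
    """Ratio of the first-half standard deviation to the second-half one."""
    y = np.asarray(y, dtype=float)
    mid = len(y) // 2
    return y[:mid].std(ddof=1) / y[mid:].std(ddof=1)


# Synthetic stand-in for the masked target: a noisy first half, a calmer second half.
rng = np.random.default_rng(0)
y_demo = np.concatenate([
    rng.normal(0.0, 2.0, 200),  # assumed "unstable" period
    rng.normal(0.0, 1.0, 200),  # assumed "stable" period
])

std_ratio = half_std_ratio(y_demo)
print(f"first-half std / second-half std = {std_ratio:.2f}")
```

A ratio well above 1 is the numerical counterpart of what the plot shows visually. On real data, the split point would come from looking at the graph rather than from assuming the midpoint.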

"This is a 'discard period' in the first half, and the second half is the real deal." The moment that hypothesis formed, even without column names, I felt I could compete in half a day.


A 5-phase brainstorming log

During the session, I recorded my collaboration with Claude in real-time in a file named COMPETITION_LOG.md. Looking back, it was divided into five phases.

| Phase | Main Contributor | Content | Achievement |
|---|---|---|---|
| 1 | Claude | Adopted TimeSeriesSplit | Preserved time-series order |
| 1 | Myself | Decided on sample_weight strategy (from observing Y distribution) | Axis for subsequent improvements |
| 2 | Claude | Parallel comparison of methods | PLS 0.6378, RF 0.5938 |
| 3 | Claude + Myself | Visualization of train size vs CV RMSE (Claude) -> Method selection (Myself) | RF n=31 visualization -> Adopted LASSO |
| 4 | Claude | Alpha range exploration | alpha=0.070821 confirmed |
| 5 | Myself -> Claude | Proposal/Implementation of LASSO -> PLS 2-stage pipeline | CV RMSE 0.5355 |

Phase 1: Adopting Time-Series CV and Focusing on the Second Half

The first prompt was simple:

"I want you to explore the optimal method by comparing multiple regression models while respecting the TimeSeries structure. Please also exhaustively optimize the hyperparameters."

Claude adopted TimeSeriesSplit. It was a reasonable choice to preserve the time-series structure, a judgment stemming from Claude's knowledge.
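For readers who have not used it, here is a minimal sketch of what TimeSeriesSplit does. The 12-row toy array and the choice of n_splits=3 are assumptions for illustration; the article does not record the settings Claude actually used.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered rows of toy data; row order is assumed to be chronological.
X_toy = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))

for i, (train_idx, val_idx) in enumerate(folds):
    # Every validation block lies strictly after its training block.
    print(f"fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

Each fold trains only on rows that come before its validation rows, so the model is never scored on data older than what it was fitted on.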

In parallel, I was doing something else. I plotted the target variable over time and observed the difference in distribution between the first and second half. The first half was scattered, and the second half was calm. "I should use sample_weight to emphasize the second half." This was my own judgment, derived from observing the Y distribution. This strategy would become the axis for all subsequent improvements.

Then, Claude proposed:

"As for feature engineering, I recommend adding time-series lag features."

I added them and reran the evaluation. CV RMSE did not improve. Claude had no way of knowing which of the masked features carried time-series significance. I judged this a miss and discarded the proposal around 11:00 AM, with five and a half hours remaining. It was Claude's first miss.

Phases 2–3: Claude for comparison, Myself for selection

From here, I had Claude run the method comparison. I had it evaluate PLS, Random Forest, ElasticNet, and LASSO in parallel, and the results came back.

PLS(n=3):  CV RMSE 0.6378
RF(tuned): CV RMSE 0.5938
ElasticNet / LASSO : 0.59 range

Looking strictly at the numbers, RF was the best. This is where I had to decide which one to adopt.

Observing the leaderboard, I sensed a trend. Models with extremely high CV accuracy showed signs of collapsing in real-world performance. Looking at the distribution of nearly 300 submission histories, the top performers weren't flashy models, but rather modestly narrowed-down systems. It wasn't an explicit basis, but a trend I felt from the 'atmosphere' of the leaderboard.

In a situation with 100 features of unknown meaning, complex models like RF have too much room for overfitting. "I'll go with LASSO, which is robust to noise and can automatically prune variables." It was my decision, combining the trend I felt from the leaderboard with my intuition. Claude can explain the characteristics of methods deductively, but it cannot "read the atmosphere of the leaderboard."

Phase 4: Claude dials in the alpha

Following the decision to adopt LASSO, I left the alpha search to Claude. It searched the range from 0.001 to 1.0 exhaustively and reported that CV RMSE was minimized at alpha=0.070821. During this time, I didn't write a single line of code.
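The log does not record how the search was implemented, so the following is only one plausible shape for it: a log-spaced grid over the same 0.001 to 1.0 range, scored with time-series CV. The grid density, the synthetic data, and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in: 300 time-ordered rows, 30 features, 2 of them informative.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 30))
y_syn = 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=300)

alphas = np.logspace(-3, 0, 30)  # 0.001 ... 1.0, evenly spaced on a log scale

search = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": alphas},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_syn, y_syn)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(f"best alpha = {best_alpha:.6f}, CV RMSE = {best_rmse:.4f}")
```

A log-spaced grid is a common choice for regularization strength because the useful values tend to span several orders of magnitude.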

Phase 5: I proposed based on on-site intuition

In Phase 5, I thought: rather than using the variables remaining after LASSO directly for prediction, it should be better to compress them one step further with PLS. PLS was a method I often used in on-site multivariate analysis.

"After excluding features with zero coefficients in LASSO, how about a 2-stage approach where we perform PLS regression on the remaining variables?"

"That is a valid approach. LASSO functions as a feature selector, and PLS absorbs the collinearity between remaining variables. I will implement it."

It took only a few minutes to go from this response to generated code.

from sklearn.linear_model import Lasso
from sklearn.cross_decomposition import PLSRegression
import numpy as np

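# X_train_w: winsorized training features (see decision ③ below).
# X_test_w is assumed to be the test features after the same preprocessing.

# Exponential weighting toward the most recent rows (my decision, see ① below)
n = len(X_train_w)
sample_weights = np.exp(np.linspace(0, 1, n))

# Stage 1: feature selection with LASSO
lasso = Lasso(alpha=0.070821)
lasso.fit(X_train_w, y_train, sample_weight=sample_weights)
selected = X_train_w.columns[lasso.coef_ != 0]

# Stage 2: PLS regression on the surviving features
# (PLSRegression.fit has no sample_weight argument, so this stage is unweighted)
pls = PLSRegression(n_components=3)
pls.fit(X_train_w[selected], y_train)
y_pred = np.ravel(pls.predict(X_test_w[selected]))  # flatten to 1-D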

CV RMSE 0.5355—it was within the 1st place range.


Claude missed a second time—The trap of ensembles

When I reached a CV RMSE of 0.5355, I had less than two hours remaining. I asked Claude for the next move.

"Stacking or Blending is effective for further improving accuracy."

I implemented it and submitted. The CV RMSE dropped even further. However, when I looked at the actual leaderboard score, my heart stopped: it had gotten worse by 0.02.

I had 90 minutes left. For a moment, I wavered between simply pressing on and having Claude analyze the cause. Claude's analysis was this:

"There is a possibility that feature_321 is noise specific to the training data. The ensemble may have amplified the impact of that noise."

I was able to remain calm because I had the experience of seeing patterns many times on the factory floor where "accuracy is high in CV but fails in practice." This is overfitting. Go back.—I made that decision and reverted to the configuration from before the ensemble.

I was able to revert because I didn't have much time. If I had more room, I might have made another move and gotten stuck in the swamp. This was the second time Claude missed.


Three decisions driven by manufacturing engineer intuition

① Making the shape of sample_weight "exponential"

How should I implement the strategy decided in Phase 1 to "emphasize the second half"? This was subtle, but effective.

At first, I tried a linear approach (a straight line from 0.5 at the start to 1.5 at the end). It wasn't bad, but I wanted to apply a bit more weight to the very end of the data, which was my primary focus. When I changed it to an exponential curve, the CV RMSE stabilized.

sample_weights = np.exp(np.linspace(0, 1, n))  # Exponential increase from start to end

In a manufacturing setting, this is the same as the intuition of placing the most stable, recent operating data at the center of gravity for prediction. Equipment conditions change over time. Rather than treating old data and new data equally, giving more weight to the most recent data improves prediction accuracy—the same idea I had repeated on the factory floor worked here as well, just in a different form.
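To make the difference in shape concrete, the two weighting schemes can be put side by side. This is an illustrative sketch: the row count n=1000 and the rescaling to mean 1 (so that only the shape is compared) are assumptions added for the example.

```python
import numpy as np

n = 1000  # assumed number of training rows, for illustration only

linear_w = np.linspace(0.5, 1.5, n)       # the linear ramp tried first
exp_w = np.exp(np.linspace(0, 1, n))      # the exponential ramp adopted in the end

# Rescale both to mean 1 so that only the shape of the curve is compared.
linear_rel = linear_w / linear_w.mean()
exp_rel = exp_w / exp_w.mean()

print(f"linear: first={linear_rel[0]:.3f}  middle={linear_rel[n // 2]:.3f}  last={linear_rel[-1]:.3f}")
print(f"exp   : first={exp_rel[0]:.3f}  middle={exp_rel[n // 2]:.3f}  last={exp_rel[-1]:.3f}")
```

Relative to the average weight, the last row gets roughly 1.58 under the exponential curve versus 1.5 under the linear ramp, while the middle of the series gets slightly less. The change is modest, which fits the description of it as a small extra push toward the very end of the data.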

② Excluding Time-series features

"If the distribution differs between the first and second half, time-series features must be noise specific to the training period."—As I watched the graph, that hypothesis formed. I excluded time-series features, including feature_321, from training.

This was not something Claude proposed. The intuition I developed from repeatedly deciding to "exclude data from periods like post-maintenance or immediately after equipment changeovers from analysis" on the factory floor worked the same way on data with masked variable names.

③ Switching from IQR to winsorize

I initially wrote the outlier processing using the IQR method. Through brainstorming with Claude, I was taught that "winsorize has less information loss because it aligns outliers to boundary values instead of setting them to zero," so I switched to that.

from scipy.stats.mstats import winsorize
X_train_w = X_train.apply(lambda col: winsorize(col, limits=[0.05, 0.05]))

This was a suggestion from Claude that I adopted. In terms of supplementing knowledge, it was the most straightforward use of AI.
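One practical detail the one-liner above leaves open is how to treat the test data. A possible approach, shown here as a sketch, is to learn clipping bounds from the training data only and apply the same bounds to both sets. Note that this swaps in quantile-based clipping with pandas, which is close to but not identical to scipy's winsorize; the helper names fit_clip_bounds and apply_clip and the demo frames are assumptions.

```python
import numpy as np
import pandas as pd


def fit_clip_bounds(X_train, lower=0.05, upper=0.95):
    """Per-column quantile bounds learned from the training data only."""
    return X_train.quantile(lower), X_train.quantile(upper)


def apply_clip(X, bounds):
    """Clip every column of X to the (lower, upper) bounds of that column."""
    lo, hi = bounds
    return X.clip(lower=lo, upper=hi, axis=1)


# Synthetic stand-in frames; the test set is deliberately given a wider spread.
rng = np.random.default_rng(0)
cols = ["feature_001", "feature_002", "feature_003"]
X_train_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=cols)
X_test_demo = pd.DataFrame(rng.normal(size=(50, 3)) * 5, columns=cols)

bounds = fit_clip_bounds(X_train_demo)
X_train_clipped = apply_clip(X_train_demo, bounds)
X_test_clipped = apply_clip(X_test_demo, bounds)

print(X_test_clipped.describe().loc[["min", "max"]])
```

Reusing the training bounds keeps the test features on the scale the model was fitted on, and it avoids computing any statistic from the test set itself.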


Results—Numbers on the leaderboard

Final submission: submission_lasso_ab.csv (Time-series exclusion + winsorize + exponential weight + LASSO → PLS)

CV RMSE: 0.5355 → Final Score (RMSE): 0.5689

As soon as my score was reflected on the leaderboard screen, I was 1st out of 20 trainee colleagues. I also placed near the top of a leaderboard with nearly 300 entries, including past trainees and instructors.

Seeing that, my colleague next to me laughed and asked, "Is it true you hardly wrote the code?" It was true. Since I kept a brainstorming log with Claude, I could look back at exactly what I didn't write line by line.


Organizing the division of roles

| Domain | Claude | Myself |
|---|---|---|
| Parallel method comparison / numerical calculation | Fast / Comprehensive | |
| Method selection (what to adopt) | Suggests candidates and characteristics | Final decision |
| Code implementation | Fast / Accurate | Hardly write anything |
| Reading graph patterns | | Reading with site intuition |
| Basis for judgment | Limited to within the data | Bringing in from outside the data |

Claude missed the direction twice. The first was the lag feature proposal, and the second was the ensemble proposal. Both stemmed from Claude not possessing the context of "what this data represents."

I only made three decisions—weighting, time-series exclusion, and outlier processing. Claude wrote all the other code. Nevertheless, those three things were what determined the results.


To manufacturing engineers—3 steps you can try starting tomorrow

Unlike the competition data, your site's data has column names: temperature, pressure, takt time, equipment number, lot number. Only the engineers on site know what those really mean. Decisions like "don't include data from this period" or "data from this equipment is special" aren't written in the column names.

If you were to try this approach with your site's data, it would look like this:

  1. Always plot time-series graphs: Line up the target variable by time and look for intervals where the distribution changes. Once you find one, use sample_weight to emphasize the interval you trust.
  2. Split CV with TimeSeriesSplit: Manufacturing data often has time-series structure. Avoid random CV here, as it leaks future information into training.
  3. Prune features with LASSO: The more sensors you have, the more variables there are that contribute nothing. LASSO prunes them automatically.

You can let Claude write the code. What you possess is the power to read the "meaning of the interval."

The ability to read context is a true weapon possessed by engineers who have worked on the floor. Not being able to write code is not a weakness—rather, because you can hand over the time spent writing code to AI, you are in a position to focus on making judgments.
