Translated by AI
Refactoring DQ Validation Code: Using AI to Fix Redundant Code Without a Senior Engineer
Introduction
I am currently participating in a project as an engineer dispatched from an SES (System Engineering Service) company.
Naturally, there is no senior colleague to guide me step by step through on-the-job training (OJT).
I believe the standard path to growth as an engineer is to experiment repeatedly while studying the code of highly skilled seniors. That is difficult in my current environment, however, and I am struggling with how to sharpen my skills.
Therefore, I decided to consult with "AI-sensei," who has likely learned a vast amount of high-quality code from senior engineers.
While I felt it was a good approach with many insights, there were also many instances where I thought, "I've never seen code like this before," so I strongly feel that I want opinions from those active in the industry.
If you find yourself thinking, "I would write it this way" while reading, I would be very happy if you could let me know in the comments!
Learnings
- It seems the basic methods aren't necessarily bad
- Serial processing is common in production environments
- I was able to experience software design principles like DRY and Fail Fast first-hand
What I asked the AI
#1
What I am doing
- Performing calculations on existing integer columns, then filtering based on them.
- Creating and adding new columns because I want to keep the calculation results from before filtering.

What I am concerned about
- It looks long and redundant.
- It's hard to tell what's happening at a glance because columns are added to the same DataFrame one after another.
- Maybe there's a more efficient way that I don't know about.

```python
process_df["A"] = abs(process_df["B"] - process_df["C"])
process_df["D"] = abs(process_df["E"] - process_df["F"])
process_df["G"] = process_df["A"] <= 0.1
process_df["H"] = process_df["B"] <= 0.1
```
Conclusion
- While there are suggestions for improvement, there seem to be no issues with the performance or the processing logic itself.
- Using methods like `sub()` or `abs()` is more robust against `NaN` because you can specify a `fill_value`.
  - Since subsequent processes treat `NaN` as specific values, maybe the current approach is fine?
- Column additions can be consolidated using `df.assign`.
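To make the `NaN` point above concrete, here is a minimal sketch (with made-up sample data) of the difference between plain `-` and `Series.sub` with `fill_value`: plain subtraction propagates a `NaN` from either operand, while `fill_value` substitutes a value for the missing side before subtracting.

```python
import pandas as pd

df = pd.DataFrame({"B": [1.0, None, 3.0], "C": [0.5, 2.0, None]})

# Plain subtraction: any NaN operand yields NaN in the result.
plain = (df["B"] - df["C"]).abs()

# sub() with fill_value=0.0 treats a missing operand as 0.0,
# so a result is NaN only when both operands are missing.
filled = df["B"].sub(df["C"], fill_value=0.0).abs()

print(plain.tolist())   # [0.5, nan, nan]
print(filled.tolist())  # [0.5, 2.0, 3.0]
```

Whether substituting 0.0 is correct depends on what the downstream filter should do with missing values, which is the trade-off the conclusion above hints at.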
Solution
- Use `df.assign`.
- If you expect more steps to be added in the future, modularize the calculations into functions and chain them with `df.pipe`.

```python
process_df = process_df.assign(
    A=(process_df["B"] - process_df["C"]).abs(),
    D=(process_df["E"] - process_df["F"]).abs(),
    G=lambda d: d["A"] <= 0.1,
    H=lambda d: d["B"] <= 0.1,
)
```
```python
def add_diff_cols(df):
    df2 = df.copy()
    df2["A"] = (df2["B"] - df2["C"]).abs()
    df2["D"] = (df2["E"] - df2["F"]).abs()
    return df2

def add_flags(df, thr=0.1):
    df2 = df.copy()
    df2["G"] = df2["A"] <= thr
    df2["H"] = df2["B"] <= thr
    return df2

process_df = (
    process_df
    .pipe(add_diff_cols)
    .pipe(add_flags, thr=0.1)
)
```
#2
What I am doing
- Outputting logs before and after changes to visually verify that column modifications were performed correctly.
- The available columns differ between existing and new data, but I want to standardize the columns used in subsequent processes (to minimize the number of places that need modification).
- Separating the intermediate DataFrame from the output DataFrame to inspect what values the intermediate DataFrame holds.

What I am concerned about
- Visual inspection is prone to human error.
- The code is vertically long and difficult to read.
- I wonder if there is actually any benefit to separating the intermediate and output DataFrames.
```python
print("======== adjusted_process_df: After adjustment ========")
# In reality, column names are long, so they are listed vertically rather than horizontally
display(adjusted_process_df[[
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K",
]])

# Since subsequent processes use L and M, copy J and K to L and M
output_df = adjusted_process_df.copy()
output_df["L"] = output_df["J"]
output_df["M"] = output_df["K"]

print("======== output_df: DataFrame for saving the adjusted CSV ========")
display(output_df[[
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "L",
    "M",
]])
```
Conclusion
- This violates the DRY principle, so it's not ideal.
  - Don't Repeat Yourself: do not write the same logic in multiple places.
- In a production environment, separating DataFrames is redundant; serial processing is preferable.
  - Re-runnability and reproducibility must be guaranteed.
  - Keep in mind that `copy()` requires more memory.
  - Even with serial processing, you can achieve the same level of visibility by logging.
  - If concerned, saving in a lightweight format like Parquet is an option.
- For simple logic, run it in a test environment, and if there are no issues, remove the log output in the production environment.
  - In this case, if you're worried, you can simply write a process to compare the columns.
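As a minimal sketch of the "visibility by logging" idea above (the `log_state` helper and the sample columns are my own, not from the AI's answer): a small pipe-friendly step can record the state of the DataFrame at each point in a serial chain, instead of keeping a separate intermediate DataFrame around.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_state(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Pipe-friendly logging step: records shape and columns, returns df unchanged."""
    logger.info("%s: shape=%s columns=%s", label, df.shape, list(df.columns))
    return df

df = pd.DataFrame({"J": [1, 2], "K": [3, 4]})
df = (
    df
    .pipe(log_state, "before rename")
    .rename(columns={"J": "L", "K": "M"})
    .pipe(log_state, "after rename")
)

# If an inspectable snapshot is still wanted, Parquet is lighter than CSV:
# df.to_parquet("snapshot.parquet")  # requires pyarrow or fastparquet
```

Because `log_state` returns the DataFrame unchanged, it can be inserted or removed between any two pipeline steps without affecting the result.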
Solution
-
Store overlapping columns in a variable and reuse them, then add the specific columns required.
- The goal is to make it so that if a problem occurs, it can be fixed by modifying only one place.
-
Change to serial processing.
- To align with production environment implementation standards.
-
Since the task was to copy existing columns to new ones, handle it with
rename.- Following the Fail Fast principle, validate the columns beforehand so it throws an error if they are missing.
-
Store overlapping columns in a variable and reuse them, then add the specific columns required.
```python
base_cols = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
rename_map = {"J": "L", "K": "M"}
old_cols = ["J", "K"]
new_cols = ["L", "M"]
before_cols = base_cols + old_cols
after_cols = base_cols + new_cols

# Verify column existence. If J or K are missing, an error is raised.
# If they exist, it performs the equivalent of a copy, eliminating the need for visual inspection.
# It's fine if L or M are overwritten.
expected = set(rename_map.keys())
actual = set(adjusted_process_df.columns)
missing = sorted(expected - actual)
if missing:
    raise KeyError(f"Columns to rename do not exist: missing={missing}")

print("======== adjusted_process_df: before ========")
display(adjusted_process_df[before_cols])

print("======== adjusted_process_df: after ========")
adjusted_process_df = adjusted_process_df.rename(columns=rename_map)
display(adjusted_process_df[after_cols])
```
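One caveat worth noting about the `rename` suggestion (illustrated with made-up data): `rename` moves the columns, so J and K no longer exist afterwards, whereas the original `copy`-and-assign version kept both old and new names. If any downstream step still reads J or K, `assign` is the closer equivalent.

```python
import pandas as pd

df = pd.DataFrame({"J": [1, 2], "K": [3, 4]})

# rename: J and K are replaced by L and M; the originals are gone.
renamed = df.rename(columns={"J": "L", "K": "M"})
print(list(renamed.columns))  # ['L', 'M']

# assign: L and M are added as copies; J and K are kept.
copied = df.assign(L=df["J"], M=df["K"])
print(list(copied.columns))  # ['J', 'K', 'L', 'M']
```

If nothing downstream depends on the old names, `rename` is the cleaner choice, since it avoids carrying duplicate data.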
Final Thoughts
I am truly glad that AI has advanced so much (cliché as it sounds).
However, to be honest, I wish I could have grown under a talented supervisor (cliché, I know).
Setting aside programming, I wonder how I should catch up on upstream processes and design through self-study...
If you've read this article and have any guidance or advice like "here's how you should do it," I would be very happy.