Refactoring DQ Validation Code: Using AI to Fix Redundant Code Without a Senior Engineer


Introduction

I am currently working on a client project, but I belong to an SES (System Engineering Service) company.
Naturally, there is no senior colleague to guide me step-by-step through on-the-job training (OJT).

I believe the standard path to growth as an engineer is to study the code of highly skilled seniors and experiment repeatedly. That is difficult in my current environment, however, and I am struggling with how to sharpen my skills.

Therefore, I decided to consult with "AI-sensei," who has likely learned a vast amount of high-quality code from senior engineers.

While I felt the approach was good and full of insights, there were also many moments where I thought, "I've never seen code like this before," so I would really like to hear from engineers active in the industry.

If you find yourself thinking, "I would write it this way" while reading, I would be very happy if you could let me know in the comments!

Learnings

  • It seems the basic methods aren't necessarily bad
  • Serial processing is common in production environments
  • I was able to experience software design principles like DRY and Fail Fast first-hand

What I asked the AI

#1

  • What I am doing
    • Performing calculations on existing integer columns, then filtering based on them.
    • Creating and adding new columns because I want to keep the calculation results before filtering.
  • What I am concerned about
    • It looks long and redundant.
    • I feel it's hard to tell what's happening at a glance because I'm adding columns to the same DataFrame consecutively.
    • Maybe there's a more efficient way that I don't know about.
before.py
process_df["A"] = abs(process_df["B"] - process_df["C"])
process_df["D"] = abs(process_df["E"] - process_df["F"])
process_df["G"] = process_df["A"] <= 0.1
process_df["H"] = process_df["B"] <= 0.1
  • Conclusion
    • While there are suggestions for improvement, there seem to be no issues with the performance or the processing logic itself.
    • Using methods like sub() or abs() is more robust against NaN because you can specify a fill_value.
      • Since subsequent processes treat NaN as specific values, maybe the current approach is fine?
    • Column additions can be consolidated using df.assign.
  • Solution
    • Use df.assign.
    • If you expect more steps to be added in the future, modularize the calculations into functions and use df.pipe.
after_simple.py
process_df = process_df.assign(
    A=(process_df["B"] - process_df["C"]).abs(),
    D=(process_df["E"] - process_df["F"]).abs(),
    G=lambda d: d["A"] <= 0.1,
    H=lambda d: d["B"] <= 0.1,
)
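One point worth noting about the assign version above: the lambdas for G and H work because pandas evaluates assign keywords left to right, so a later callable can reference a column created earlier in the same call. A minimal check with hypothetical values:

```python
import pandas as pd

# Hypothetical data; the point is only the evaluation order inside assign().
df = pd.DataFrame({"B": [1.0, 2.0], "C": [0.9375, 1.5]})

df = df.assign(
    A=lambda d: (d["B"] - d["C"]).abs(),  # created first
    G=lambda d: d["A"] <= 0.1,            # can already see the new "A"
)
```

This left-to-right behavior requires pandas 0.23+ (it relies on ordered keyword arguments from Python 3.6).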
after_func.py
def add_diff_cols(df):
    df2 = df.copy()
    df2["A"] = (df2["B"] - df2["C"]).abs()
    df2["D"] = (df2["E"] - df2["F"]).abs()
    return df2

def add_flags(df, thr=0.1):
    df2 = df.copy()
    df2["G"] = df2["A"] <= thr
    df2["H"] = df2["B"] <= thr
    return df2

process_df = (
    process_df
    .pipe(add_diff_cols)
    .pipe(add_flags, thr=0.1)
)
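To make the fill_value point from the conclusion concrete, here is a small sketch (data is made up) of how plain subtraction propagates NaN while sub() with fill_value substitutes a default first:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row of "C" is missing.
df = pd.DataFrame({"B": [1.0, 3.0], "C": [0.5, np.nan]})

# Plain subtraction: NaN propagates, so the second row becomes NaN.
plain = (df["B"] - df["C"]).abs()

# sub() with fill_value=0 treats the missing value as 0: |3.0 - 0| = 3.0.
filled = df["B"].sub(df["C"], fill_value=0).abs()
```

Whether fill_value=0 is the right default depends on how downstream processes are supposed to treat missing data; if they already handle NaN explicitly, the plain subtraction may indeed be fine.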

#2

  • What I am doing
    • Outputting logs before and after changes to visually verify if column modifications were performed correctly.
    • The available columns differ between existing and new data, but I want to standardize the columns used in subsequent processes (to minimize modification points).
    • Separating the intermediate DataFrame and the output DataFrame to inspect what values the intermediate DataFrame holds.
  • What I am concerned about
    • Visual inspection is prone to human error.
    • The code is vertically long and difficult to read.
    • I wonder if there is actually any benefit to separating the intermediate and output DataFrames.
before.py
print("======== adjusted_process_df: After adjustment ========")
# In reality, column names are long, so they are listed vertically rather than horizontally
display(adjusted_process_df[[
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K"
]])

# Since subsequent processes use L and M, copy J and K to L and M
output_df = adjusted_process_df.copy()
output_df["L"] = output_df["J"]
output_df["M"] = output_df["K"]
print("======== output_df: Dataframe for saving adjusted CSV ========")
display(output_df[[
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "L",
    "M"
]])
  • Conclusion
    • This violates the DRY principle, so it's not ideal.
      • Don't Repeat Yourself: Do not write the same logic in multiple places.
    • In a production environment, separating DataFrames is redundant; serial processing is preferable.
      • Re-runnability and reproducibility must be guaranteed.
      • Consider that using copy() requires more memory.
      • Even with serial processing, you can achieve the same level of visibility by logging.
      • If concerned, saving in a lightweight format like Parquet is an option.
    • For simple logic, run it in a test environment, and if there are no issues, remove log outputs in the production environment.
      • In this case, if you're worried, you can just write a process to compare the columns.
  • Solution
    • Store overlapping columns in a variable and reuse them, then add the specific columns required.
      • The goal is to make it so that if a problem occurs, it can be fixed by modifying only one place.
    • Change to serial processing.
      • To align with production environment implementation standards.
    • Since the task was simply to make existing columns available under new names, handle it with rename (note that, unlike a copy, the old names disappear).
      • Following the Fail Fast principle, validate the columns beforehand so it throws an error if they are missing.
after.py
base_cols = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
rename_map = {"J": "L", "K": "M"}
old_cols = list(rename_map.keys())
new_cols = list(rename_map.values())
before_cols = base_cols + old_cols
after_cols = base_cols + new_cols

# Verify column existence. If J or K are missing, an error is raised.
# If they exist, it performs the equivalent of a copy, eliminating the need for visual inspection.
# It's fine if L or M are overwritten.
expected = set(rename_map.keys())
actual   = set(adjusted_process_df.columns)
missing  = sorted(expected - actual)
if missing:
    raise KeyError(f"Columns to rename do not exist: missing={missing}")

print("======== adjusted_process_df: before ========")
display(adjusted_process_df[before_cols])
print("======== adjusted_process_df: after ========")
adjusted_process_df = adjusted_process_df.rename(columns=rename_map)
display(adjusted_process_df[after_cols])
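The conclusion also suggested replacing display() with logging for serial processing, plus an optional lightweight checkpoint. A sketch of what that could look like (the label strings and the checkpoint path are placeholders, not from the original project):

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)


def log_columns(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Log shape and columns instead of display(); returns df unchanged so it fits in a pipe chain."""
    logger.info("%s: shape=%s columns=%s", label, df.shape, list(df.columns))
    return df


# Hypothetical minimal frame standing in for adjusted_process_df.
df = pd.DataFrame({"J": [1], "K": [2]})

df = (
    df
    .pipe(log_columns, "before rename")
    .rename(columns={"J": "L", "K": "M"})
    .pipe(log_columns, "after rename")
)

# Optional checkpoint in a lightweight columnar format (requires pyarrow or fastparquet):
# df.to_parquet("checkpoint.parquet")
```

Because log_columns returns the frame unchanged, it can be dropped from the chain in production without touching any other line.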

Final Thoughts

I am truly glad that AI has advanced so much (cliché as it sounds).
However, to be honest, I wish I could have grown under a talented supervisor (cliché, I know).

Setting aside programming, I wonder how I should catch up on upstream processes and design through self-study...

If you've read this article and have any guidance or advice like "here's how you should do it," I would be very happy.
