
How I Ended Up Redoing E2E Test Evidence Collection with AI 5 Times


Premise

I am involved in the maintenance and development of a web service that has been in operation for nearly 20 years. In a legacy environment where a PHP admin panel (EUC-JP) and a Rails API coexist, I tried to have Claude Code collect evidence (proof of operation confirmation) for a bug-fix PR using automated browser operations.

To get straight to the point, I had to redo it five times. What I realized through that process is the fact that "AI can operate a browser, but it doesn't understand what it needs to prove."

Failure 1: Screenshots were taken, but they weren't "evidence"

I asked Claude Code to collect evidence using agent-browser. It took seven screenshots of various pages in the admin panel and reported, "Task completed."

What was needed was a flow showing the causal relationship: "State before operation → Execution of operation → DB state after operation." AI can perform the act of "taking a screenshot." However, it doesn't think about what should be captured, and at which point, in order to prove the claim.

Failure 2: Passing tests don't count as evidence

Next, I had it create Playwright tests. All tests passed. However, when I checked the test reports, there were three problems:

  • Preconditions for the test subject were not documented
  • There were no checkpoints verifying intermediate states
  • There were no screenshots of the screen to support the operation results

Passing a test is different from establishing evidence. I needed to record the preconditions via test.info().attach() so that a third party could tell under what conditions, what was verified, and with what result.
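To keep these attachments consistent across tests, I find it helps to build the attachment body from a small pure helper. The following is a minimal sketch; the field names and the `Preconditions` interface are my own assumptions, while `test.info().attach()` is Playwright's standard API for attaching evidence to the report.

```typescript
// Hypothetical shape for the conditions a third party needs to see
interface Preconditions {
  target: string;         // which account/record the test operates on
  currentSetting: string; // state before the operation
  expected: string;       // what the operation should change it to
}

// Format the preconditions into a plain-text attachment body
function formatPreconditions(p: Preconditions): string {
  return [
    `Target: ${p.target}`,
    `Current Setting: ${p.currentSetting}`,
    `Expected Result: ${p.expected}`,
  ].join("\n");
}

// Inside a Playwright test, this would be attached to the report:
//   await test.info().attach("Preconditions", {
//     body: formatPreconditions({
//       target: "Test Account",
//       currentSetting: "Pattern A",
//       expected: "Change to Pattern B",
//     }),
//     contentType: "text/plain",
//   });
```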

Failure 3: Running away from errors

An error occurred on the details page of the admin panel. Claude Code then suggested a workaround: "Let's record the DB values in a text file as evidence."

I instructed it, "Wait a second, investigate the code."

As a result, two bugs were found: a parameter error in PHP's in_array() and an uninitialized $id variable. AI tends to jump to the "easy detour." Unless a human corrects the direction by saying "investigate the root cause," essential problems will be overlooked.

Failure 4: Missing contradictions in business logic

After I directly changed a certain status in the database, a contradictory display appeared on the screen. The AI reported it exactly as it was, without noticing anything wrong, so I had to point it out:

"Isn't it strange that this display appears in this state?"

The cause was that the table referenced by the UI was different from what was expected. Domain knowledge is necessary to account for the "business consistency" of test data. Since AI lacks this knowledge, it overlooks such contradictions.

Failure 5: Back to square one due to context overflow

In a single day, session interruptions occurred more than 8 times. Browser sessions (login states) were also lost. I had to redo the same "Login → Page Transition → Screenshot" flow from the beginning many times. Tool selection also fluctuated, such as switching from agent-browser to Playwright halfway through.

Long-running tasks like E2E testing are not a good fit for context limits.

How I finally finished it

I narrowed it down to three Playwright tests.

```typescript
import { test } from "@playwright/test";

// Test 1: Clearly specify preconditions as attachments and capture operation results
test("Setting changes are correctly reflected", async ({ page }, testInfo) => {
  // Attach preconditions to the evidence
  await testInfo.attach("Preconditions", {
    body: "Target: Test Account\nCurrent Setting: Pattern A\nExpected Result: Change to Pattern B",
    contentType: "text/plain",
  });

  // Capture the state before the operation
  await page.goto("/admin/target-page");
  await page.screenshot({ path: "before-change.png" });

  // Execute the setting change
  await page.click('[data-action="submit"]');

  // Wait for the result to render, then capture the state after the operation
  await page.waitForLoadState();
  await page.screenshot({ path: "after-change.png" });
});
```
  • Test 1 -- Specify preconditions as attachments and capture the state before and after the operation
  • Test 2 -- Highlight and capture the "waiting for processing" status in the admin panel's change history
  • Test 3 -- Verify the current settings and the state after changes on the admin panel
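For Test 2, the "waiting for processing" row had to stand out in the screenshot, or the capture would be just another page shot. A sketch of the approach: build an inline style with a small helper, inject it via Playwright's `locator.evaluate()`, then screenshot. The selector, label text, and file name below are assumptions.

```typescript
// Pure helper: the inline outline style used to make the target row
// visually stand out in the evidence screenshot
function highlightStyle(color: string = "red", width: number = 3): string {
  return `outline: ${width}px solid ${color}; outline-offset: 2px;`;
}

// In the Playwright test (selector and text are hypothetical):
//   const row = page.locator("tr", { hasText: "waiting for processing" });
//   await row.evaluate(
//     (el, style) => el.setAttribute("style", style),
//     highlightStyle(),
//   );
//   await page.screenshot({ path: "pending-status.png", fullPage: true });
```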

I also had to fix two bugs on the admin panel side along the way. Ultimately, it came down to a workflow where a human defines the "template" for the evidence and then has the AI execute it.

What I learned

I will summarize the lessons learned from these five failures.

1. Define "what to prove" first

Use a template for instructions: "State before operation → Operation → Change after operation." Simply asking an AI to "take evidence" only results in a pile of screen captures.

2. Always make it record preconditions

Have it attach information that a third party can understand, such as test account status and configuration values. Whether a test passed or not does not constitute proof on its own.

3. Humans must supplement domain knowledge

An AI cannot make judgments like "this display is strange in this state." The correctness of business logic can only be checked by a human who understands that business.

4. Make it chase the root cause, not a workaround

When an error occurs, the instruction should be "investigate the cause" rather than "provide an alternative." Left to their own devices, AIs tend to try to bypass problems via the shortest path.

5. Design intermediate saves for long-running tasks

Save browser operation scripts frequently to prepare for context overflow. In tasks where the sequence of events is crucial, such as E2E testing, a session timeout can be fatal.
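One concrete way to checkpoint the login state is Playwright's `storageState`, which persists cookies and localStorage to a file that a later session can restore. The checkpoint file layout below is my own convention, not something from the original setup.

```typescript
// Naming convention for checkpoint files, one per completed step
function checkpointPath(step: number): string {
  return `evidence/checkpoints/step-${String(step).padStart(2, "0")}.json`;
}

// In a Playwright script, after completing the login step:
//   await context.storageState({ path: checkpointPath(1) });
//
// When a session is interrupted, resume from the checkpoint instead of
// redoing "Login → Page Transition" from the beginning:
//   const context = await browser.newContext({
//     storageState: checkpointPath(1),
//   });
```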

Retrospective

Entrusting browser operations to an AI was practical enough in itself. However, designing "what to capture" and "what to prove" is a human's job, and skipping that part only leads to more rework.

Collecting evidence was fundamentally about "designing the proof," not "automating the operations."
