More Than Just Generating Test Cases: Introducing manual-bb-test-harness

TL;DR

  • I created an OSS called manual-bb-test-harness.
  • manual-bb-test-harness is designed to structure QA design knowledge and release decision-making for black-box testing.
  • Specifically, it handles the flow of reading specifications, extracting perspectives, assigning risks, breaking down into manual test cases, estimating effort, and finally making a Go/No-Go decision.
  • If manual work becomes a bottleneck in the testing phase, overall delivery efficiency falls short of the development speed that AI makes possible.
  • My ideal is to strengthen test code and reduce manual testing itself.
  • However, in environments with low maturity, there is still value in using traditional test design knowledge as a template.

Introduction

I have officially released an OSS called manual-bb-test-harness. It is licensed under MIT.

https://github.com/RNA4219/manual-bb-test-harness

I suspect that even if you read the README and design documents, it might be difficult to tell whether this is "a test case generation tool," "a collection of prompts for QA," or "something for automated testing."

The reason is that what I attempted to handle with this OSS is not "test cases" themselves, but the judgments test engineers make before creating test cases.

For example, there are situations before a release where QAs or developers think about things like this:

  • What should we check to say this feature is sufficiently tested?
  • Besides the happy path, which unhappy paths should we look at?
  • Is it okay to decide on expected results that aren't written in the specifications?
  • Even if automated tests pass, where should we look manually?
  • What has to remain unresolved for us to stop a release?
  • Is it acceptable to release conditionally if something is still outstanding?

manual-bb-test-harness was created to organize these types of judgments in order.

Therefore, this is not a "tool that has AI create test cases." I often explain it that way myself, but that is just me being lazy and oversimplifying.

It is a tool for organizing "what we view as dangerous," "the basis for saying OK," and "how much we need to check before releasing" before writing test cases.

If you look at it this way, it becomes much easier to read.

To be a bit more precise, it is a harness that connects the judgment axes of test engineers when designing manual black-box tests, from input, perspectives, risks, cases, and effort to quality gates and Go/No-Go decisions.

In this article, I will unpack the contents from the author's perspective.

Why I Created It

The motivation is quite simple: as development speed increases due to AI-driven processes, the flow of humans creating test cases and executing them bit by bit in the testing phase becomes a bottleneck.

The problem I see is that if the testing phase keeps running at its old pace, the overall efficiency of development drops. Now that generative AI has lowered the barrier to writing code, the testing phase itself has become the speed bottleneck, so we should work on removing it.

Implementation is faster, and drafts for test code are produced quickly. AI can also significantly assist in summarizing specifications and identifying perspectives.

In that state, if only the final testing phase is stuck in manual case creation and procedure execution, things get clogged. When bugs appear near the deadline, it leads to rescheduling the entire delivery. The procedures increase, making it even more cumbersome.

Furthermore, if I were to speak purely about ideals, I want to reduce manual testing itself as much as possible.

Ideally, we should invest heavily in test code, increase high-quality coverage through integration testing, and expand the range that can be checked automatically. If humans are going to trace the same procedures every time, it is too slow unless it is recorded as code, run in CI, and detected every time there is a change.

On the other hand, there are still things that humans must ask at the end:

  • Is it really okay to determine the expected result based on these specifications?
  • Which state transitions are dangerous?
  • Which permission differences are easy to miss?
  • Which risks are lowered by the evidence from automated tests?
  • Which unresolved risks can we accept?
  • From what point should we stop the release?

I believe that significant points that require human judgment remain here.

However, these judgments are prone to becoming localized to specific individuals. Experienced people think about them naturally, but they often proceed without being articulated.

In reality, not every team can suddenly achieve high automation maturity:

  • Test code is thin.
  • Coverage is not being monitored.
  • Acceptance criteria are vague.
  • Test design is siloed to specific people.
  • Test case creation and execution are outsourced.
  • Yet, there are few people who can explain what is dangerous.

In teams with such low maturity, traditional test design knowledge is still very useful.

Equivalence partitioning, boundary value analysis, decision tables, state transitions, risk-based testing. These are not just old knowledge; they serve as a scaffold for reducing omissions in places where automation is not yet sufficient.

manual-bb-test-harness is an OSS for that purpose.

To be very honest, if you are going to have an outsourcing partner create test cases and just run the manual execution, I think there are cases where it is better to build a framework of perspectives and risks using a harness like this, and have junior engineers perform the tests to gain experience.

Test execution is also an opportunity to acquire domain knowledge that isn't in the specifications, as well as an intuitive sense of "something feels wrong" when touching the actual device. When junior engineers handle the device with a framework of perspectives and risks in mind, they are more likely to develop an intuition for "why is the specification like this?" or "wouldn't this break if I did that?"

Meanwhile, the ideal is to strengthen automated tests and connect specifications, test code, and release decisions earlier in the process. However, in teams that haven't reached that point yet, it is better to externalize the minimum judgment axes than to run manual QA without any form or structure.

This OSS allows you to treat the following judgments, which experienced members perform mentally during manual QA, as a reproducible flow of deliverables:

  • What to check.
  • What specifications or acceptance criteria to use as a basis.
  • Which unhappy paths to prioritize.
  • Where the release risk is considered high.
  • How to handle cases where there is no basis for the expected result.
  • Which results qualify as "Go."
  • Which unresolved risks result in "No-Go."

In this article, I will call the flow of these deliverables an artifact chain. In short, it means "retaining intermediate judgments in a confirmable way rather than just as notes."

This tool feels difficult because it requires two kinds of specialized knowledge.

One is knowledge of test engineering: concepts such as equivalence partitioning, boundary value analysis, decision tables, state transitions, experience-based testing, and risk-based testing, as covered in the JSTQB Foundation Level.

The other is knowledge of harness engineering. It is not just about a single output, but about normalizing inputs, connecting intermediate deliverables as types, creating verifiable contracts, and finally carrying them to a decision.

If you don't look at both of these at the same time, manual-bb-test-harness will just look like "some kind of QA text generation tool."

So, What Exactly Is This Tool?

To be precise, manual-bb-test-harness is a harness to transform manual black-box QA from a process based on individual intuition into an explainable, verifiable release decision-making process.

It is easy to understand if you think of the "harness" here as a framework for accepting inputs, organizing intermediate judgments, and ultimately reaching a release decision.

The general flow is as follows:

Specifications, Acceptance Criteria, Business Rules, Changes, Existing Test Results
  ↓
First, organize the prerequisites of the target functionality
  ↓
Break down what needs to be checked by perspective
  ↓
Identify check perspectives
  ↓
Estimate where the risks are
  ↓
Translate important items into manual test cases
  ↓
Estimate the effort required for execution
  ↓
Decide whether it is safe to release
  ↓
Summarize in a format that can be explained to stakeholders

The goal of this tool is to be in a state where you can explain the grounds for why it is okay to release, why you should stop, and what is still unknown. This is the most critical aspect of judgment as QA.

What Does It Mean to "Create Unhappy Path Tests"?

Understanding unhappy path testing (error handling testing) is crucial when looking at this OSS. Unhappy path testing verifies whether a system can perform appropriate error handling without causing fatal failures when encountering unexpected inputs, human errors, or unforeseen external factors.

Fatal failures here include data corruption, information leakage, incorrect state transitions, double execution, and irrecoverable failures.

Read the specifications. Touch the actual device. Look at prerequisites not written in the acceptance criteria. Suspect that it might break if you do something based on perspectives like state, permissions, history, boundaries, exceptions, and recovery.

Add those doubts as tests.

This is what creating unhappy path tests means. The mental picture is roughly the same for exploratory testing and for unhappy path testing.

[Figure: Image of unhappy path testing]

Therefore, if you leave test generation to AI without QA knowledge, the results will naturally be weak in areas that are not explicitly stated. Even if it can generate happy paths or obvious input errors, "ways to break things" such as state transitions, permission differences, history dependencies, double execution, and interruption/resumption tend to be thinly covered.

Thus, manual-bb-test-harness has a knowledge harness for unhappy paths pre-installed.

For example, suppose the normal path is as follows:

If the order has not yet shipped, the user who placed it can cancel it.

This normal path has several prerequisites:

  • The order is before shipping.
  • The operator is the user who placed the order.
  • The order exists.
  • The payment status is in a cancellable state.
  • Communication is successful.
  • The same operation is performed only once.

In unhappy path testing, you look at the error handling when these prerequisites are broken.

  • What happens if it is after shipping?
  • What happens if the operator is not the user who placed the order?
  • What happens if the order does not exist?
  • What happens if it is in a non-refundable state?
  • What happens if communication is lost?
  • What happens if it is executed twice?

What is important here is not to break everything at once.

If you break multiple conditions that might lead to similar errors at the same time, you will not know what caused the failure. You must carefully create invalid_single_fault data, meaning you break only one prerequisite.
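As a rough illustration of the single-fault idea, the sketch below derives one data set per broken prerequisite from a known-good record for the cancel-order example. The field names, values, and helper functions are hypothetical and are not taken from the harness's schema; the only point is that each derived record differs from the valid one in exactly one field.

```python
from copy import deepcopy

# Hypothetical known-good record for the cancel-order example.
# Field names are illustrative, not the harness's actual schema.
VALID_ORDER = {
    "order_exists": True,
    "shipping_status": "before_shipping",
    "operator": "owner",
    "payment_status": "cancellable",
    "network": "ok",
    "submit_count": 1,
}

# One entry per prerequisite: the field to break and the broken value.
SINGLE_FAULTS = {
    "after_shipping": ("shipping_status", "shipped"),
    "not_owner": ("operator", "other_user"),
    "order_missing": ("order_exists", False),
    "non_refundable": ("payment_status", "non_refundable"),
    "network_lost": ("network", "lost"),
    "double_submit": ("submit_count", 2),
}


def make_single_fault_data(valid, faults):
    """Return one record per fault, each differing from `valid` in one field."""
    data_sets = {}
    for name, (field_name, broken_value) in faults.items():
        record = deepcopy(valid)
        record[field_name] = broken_value
        data_sets[name] = record
    return data_sets


def changed_fields(valid, record):
    """List the fields whose values differ from the valid record."""
    return [key for key in valid if valid[key] != record[key]]


if __name__ == "__main__":
    for name, record in make_single_fault_data(VALID_ORDER, SINGLE_FAULTS).items():
        # Each line shows the fault name and the single field that was broken.
        print(name, changed_fields(VALID_ORDER, record))
```

Because every record breaks only one prerequisite, a failing test points directly at the prerequisite that caused it.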

manual-bb-test-harness treats this concept as a data layer:

canonical_valid       Data that succeeds normally
invalid_single_fault  Data that breaks only one prerequisite
boundary3             min-1 / min / min+1, max-1 / max / max+1
rule_combo            Important combinations of business rules
state_seed            Initial state to observe state transitions
history_seed          Data that depends on past operations or history

I treat test data here as a carrier of coverage.
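To make the boundary3 row concrete, here is a minimal sketch that expands an inclusive integer range into the six values listed in the table. The function name, the return shape, and the quantity range of 1 to 99 are assumptions for illustration only; the harness itself may represent this layer differently.

```python
def boundary3(minimum, maximum):
    """Expand an inclusive integer range into min-1 / min / min+1 and max-1 / max / max+1.

    Returns (value, in_range) pairs so the expected validity travels with the data.
    """
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    values = [minimum - 1, minimum, minimum + 1, maximum - 1, maximum, maximum + 1]
    return [(value, minimum <= value <= maximum) for value in values]


if __name__ == "__main__":
    # Hypothetical rule: order quantity must be between 1 and 99.
    for value, in_range in boundary3(1, 99):
        print(value, "valid" if in_range else "invalid")
```

Whether an out-of-range value should be rejected with a particular message is still an expected result, so it needs a basis such as a specification or acceptance criterion.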

Don't Arbitrarily Determine Expected Results

A critical safety feature emphasized in this OSS is that it does not treat expected results without a basis as formal test cases with procedures.

This is quite important.

While not limited to test cases, when you ask AI to create something, it can sometimes produce artifacts that seem plausible at first glance.

However, in testing, "plausible-looking expected results" are dangerous.

Essentially, expected results require a basis.

Bases can include, for example:

  • Specifications
  • Acceptance criteria
  • Business rules
  • API contracts
  • State transition tables
  • Comparison with older versions
  • Approved existing behavior
  • Logs or DB diffs

In manual-bb-test-harness, if you are creating a formal test case with procedures, you must always provide the basis for the expected result.

The basis here refers to materials that can explain "why you can say that result is correct." Specifications, acceptance criteria, business rules, API contracts, and existing approved behavior fall under this category. If there is no basis, we do not force it into a fixed test case.

In such cases, we either leave it as [Confirm Required], move it to notes for exploratory verification, or treat it as an issue requiring specification clarification first.
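The rule can be pictured as a small check that runs before a draft becomes a formal case. This is a sketch under assumptions: the `CaseDraft` type, its fields, and the label strings are hypothetical and do not come from the repository. Choosing between an exploratory note and a specification issue is a human decision, so the sketch only separates "has a basis" from "needs confirmation".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseDraft:
    """Hypothetical draft of a test case before it is accepted as formal."""
    title: str
    expected_result: str
    basis: Optional[str] = None  # e.g. "AC-3" or "API contract"; None means no basis found


def classify(draft: CaseDraft) -> str:
    """Accept a draft as a formal case only if its expected result has a basis."""
    if draft.basis is not None and draft.basis.strip():
        return "formal_case"
    # No basis: do not fix the expected result; flag it for confirmation instead.
    return "confirm_required"


if __name__ == "__main__":
    drafts = [
        CaseDraft("Cancel before shipping", "Order becomes cancelled", basis="AC-3"),
        CaseDraft("Cancel during payment retry", "Order stays pending"),
    ]
    for draft in drafts:
        print(draft.title, "->", classify(draft))
```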

For readers who are not QA specialists, this can be explained as follows:

This tool is not intended to let AI fabricate correct answers, but to distinguish between what has a basis for the correct answer and what still requires verification.

How to Handle Risks and Priority

In actual QA, gathering materials is difficult, and tasks often involve time constraints and require long-term concentration, so you cannot verify everything with the same depth.

Therefore, you need to decide where to start.

manual-bb-test-harness does not just look at what seems "important," but evaluates risk across multiple axes:

  • Is the impact high?
  • Is it likely to occur?
  • Is it difficult to detect?
  • Is the change scope wide?
  • Are there external dependencies or network differences?
  • Does it involve permissions or personal information?
  • How effective is the evidence from automated tests?

Based on this evaluation, we assign a priority from P0 to P3.
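One way to picture a multi-axis evaluation is a simple additive score. Everything numeric below is an assumption made for illustration: the 0 to 3 scale, the idea that automated test evidence subtracts from the total, and the thresholds for P0 to P3 are hypothetical and are not the harness's actual policy.

```python
# Hypothetical axis names mirroring the questions above.
RISK_AXES = (
    "impact",
    "likelihood",
    "detection_difficulty",
    "change_scope",
    "external_dependency",
    "sensitive_data",
)


def assign_priority(scores, automation_evidence=0):
    """Map per-axis scores (0..3 each) to a priority from P0 to P3.

    `automation_evidence` (0..3) lowers the total, reflecting risk that
    automated tests already cover. Thresholds are illustrative only.
    """
    for axis in RISK_AXES:
        if not 0 <= scores[axis] <= 3:
            raise ValueError(f"{axis} must be between 0 and 3")
    if not 0 <= automation_evidence <= 3:
        raise ValueError("automation_evidence must be between 0 and 3")

    total = sum(scores[axis] for axis in RISK_AXES) - automation_evidence
    if total >= 13:
        return "P0"
    if total >= 9:
        return "P1"
    if total >= 5:
        return "P2"
    return "P3"


if __name__ == "__main__":
    cancel_order = {
        "impact": 3,
        "likelihood": 2,
        "detection_difficulty": 2,
        "change_scope": 1,
        "external_dependency": 2,
        "sensitive_data": 2,
    }
    print(assign_priority(cancel_order, automation_evidence=1))
```

A real policy would likely weight the axes differently; the useful part is that the reasoning behind a priority is written down and reviewable.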

Reaching Quality Gates is the Core

Many test support tools end at the point of creating test cases.

manual-bb-test-harness does not stop there.

Finally, it proceeds to gate judgment.

For gate judgment, we combine materials such as the following at a minimum:

  • Are the specifications sufficient?
  • Are the evidence and traceability sufficient?
  • Is there evidence from automated tests?
  • What are the manual results for P0/P1?
  • Are there any remaining defects?
  • Is there an agreement with the release owner on residual risks?

Then, we decide on go, conditional_go, or no_go.

What is important here is not to decide release feasibility based solely on test coverage.

Even if coverage is high, if items that should be verified with the highest priority have failed, you should not release.

If significant prerequisites remain unverified, you should not release.

Also, if the remaining risks exceed the permissible range, and there is no person responsible for accepting those risks, nor any approval for proceeding exceptionally, you should stop the release.

On the other hand, even if some medium-level risks remain, if the responsible person, the deadline for response, the method to suppress impact, the rollback means, and the approval are clear, you may be able to release conditionally.

This is a mindset of elevating QA deliverables from "tested" to "we can release under these conditions / we should stop under these conditions."
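The rules in this section can be sketched as a small decision function. The types, field names, and risk levels below are hypothetical and only encode the conditions described above; the real gate_decision artifact may carry different fields. Note that coverage is deliberately not an input.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResidualRisk:
    """Hypothetical record of a risk that remains open at release time."""
    description: str
    level: str  # "high", "medium", or "low"
    owner: str = ""
    deadline: str = ""
    mitigation: str = ""
    rollback: str = ""
    approved: bool = False

    def is_accepted(self) -> bool:
        """A risk counts as accepted only when every condition is explicit."""
        return bool(
            self.owner and self.deadline and self.mitigation
            and self.rollback and self.approved
        )


@dataclass
class GateInput:
    failed_p0_cases: int = 0
    unverified_prerequisites: int = 0
    residual_risks: List[ResidualRisk] = field(default_factory=list)


def decide_gate(gate: GateInput) -> str:
    """Return "go", "conditional_go", or "no_go" from the gate inputs."""
    # Failed highest-priority checks or unverified key prerequisites stop the release.
    if gate.failed_p0_cases > 0 or gate.unverified_prerequisites > 0:
        return "no_go"
    open_risks = [r for r in gate.residual_risks if r.level in ("high", "medium")]
    # A remaining risk with no owner or approval also stops the release.
    if any(not risk.is_accepted() for risk in open_risks):
        return "no_go"
    # Accepted risks allow a release, but only under the stated conditions.
    if open_risks:
        return "conditional_go"
    return "go"


if __name__ == "__main__":
    risk = ResidualRisk(
        "Refund email may be delayed", "medium",
        owner="release owner", deadline="next sprint",
        mitigation="manual resend", rollback="feature flag off", approved=True,
    )
    print(decide_gate(GateInput(residual_risks=[risk])))
```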

What Exists as Harness Engineering?

So far, this has been about test engineering.

The other axis is harness engineering.

If you take the word "harness" to mean simply a "test runner," this repository becomes difficult to understand.

Here, "harness" has a broader meaning.

It is a framework for accepting inputs, normalizing them, connecting the artifacts at each stage as types, making them verifiable, and carrying them to the final judgment.

manual-bb-test-harness has the following structure:

  • Skill body
  • Design policy separated into references
  • Artifact contract
  • JSON schema
  • Examples
  • Goldens
  • Evaluation rubric
  • CLI
  • Validator
  • Release bundle
  • Assistance for integration with TestRail / Xray / Notion

This is a structure for putting output into operation, not just generating text.

For example, feature_spec, test_model, observation_set, risk_register, manual_case_set, effort_plan, gate_decision, and release_brief are each treated as artifacts.

The point of making them artifacts is that they can be reviewed, verified, tracked for differences, and integrated with external tools.

Without this, AI output becomes one-off text.

When harnessed, it can be integrated into the team process.
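As a small illustration of what treating outputs as a connected chain buys you, the sketch below checks whether a later artifact was produced while an earlier one is missing. The artifact names come from the list above, but treating them as a strictly linear order, and the `chain_gaps` helper itself, are assumptions for illustration; the repository's own validator may work differently.

```python
# Artifact names as listed in this article; the linear order is an assumption.
ARTIFACT_CHAIN = [
    "feature_spec",
    "test_model",
    "observation_set",
    "risk_register",
    "manual_case_set",
    "effort_plan",
    "gate_decision",
    "release_brief",
]


def chain_gaps(produced):
    """Return artifacts that exist even though an earlier artifact is missing."""
    gaps = []
    missing_seen = False
    for name in ARTIFACT_CHAIN:
        if name in produced:
            if missing_seen:
                gaps.append(name)
        else:
            missing_seen = True
    return gaps


if __name__ == "__main__":
    # Test cases were written without an observation set or risk register.
    produced = {"feature_spec", "test_model", "manual_case_set"}
    print(chain_gaps(produced))
```

A one-off text output cannot be checked this way; a chain of named artifacts can.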

What is Not Easily Visible as OSS?

The value of this repository is not visible in the amount of code or the flashiness of the UI.

Rather, the value lies in plain designs like the following (I am writing internal parameters as they are here):

  • Keep SKILL.md short and offload detailed policies to references
  • Define artifact contracts
  • Sync schema, examples, goldens, and rubrics
  • Prohibit expected results without an oracle
  • Include mobile differences in the coverage model, not in device names
  • Do not decide Gates based on coverage alone
  • Separate blocked, degraded, and ok
  • Produce output all the way through to a release brief

It is difficult to convey why these matter unless you are constantly doing QA design or tool operation.

However, accidents in practice usually happen around these areas:

  • Deciding on expected results while specifications were still vague
  • Only verifying cases where things go well
  • Not checking behavior after state changes
  • Not checking differences in behavior due to permissions or roles
  • Assuming that because there are automatic tests, it is enough
  • High test coverage, but significant risks remained
  • No decision maker for exceptional release procedures
  • Could not explain "who decided it was okay to release, and based on what" after the release

manual-bb-test-harness has a structure to reduce these types of failures.
Well, doing this reduces risk, but it does not mean there will be zero bugs...

Simple Workflow Explanation

1. It is not a tool for increasing test cases

The purpose of this tool is to use specifications, change contents, existing test results, etc., to organize:

  • What should be verified
  • Where the risks are
  • What conditions make it acceptable to release

2. It formalizes the thought process of experienced engineers

Experienced test engineers do not start by writing test cases immediately.

First, they break down what needs to be verified:

  • Screen and operation flows
  • State changes
  • Business rules
  • Input data
  • Permissions and roles
  • Impacts on existing features

Based on that, they estimate where the risks lie, verify the basis for expected results, and then distill the important items into test cases.

This tool allows you to follow that same sequence every time.

3. Do not treat the unknown as known

If there is no basis for an expected result, the tool does not let the AI fabricate a plausible-sounding correct answer.

Instead, it classifies items into:

  • Those requiring specification clarification
  • Those to be verified through exploration
  • Those that are dangerous and must be resolved before release

This acts as a safety device for quality judgment.

4. Finally, lead to release decisions

The output of this tool does not end with test cases.

By combining the results of automated tests, manual verification results for high-priority items, remaining bugs, and unresolved risks, it provides materials to judge whether to:

  • Proceed with the release
  • Proceed with conditions
  • Not proceed

Where Should Developers Look?

If you are a developer, a good way to look at it is this:

It is a tool to support whether you can say it is acceptable to release, rather than whether it is implemented.

Developers know the inside of the implementation.

Precisely because of that, omissions in behavior from an external perspective can occur.

For example:

  • Operations after a state change
  • Operations by users with different permissions
  • Double execution
  • Interrupt and resume
  • External integration failure
  • Regression of existing flows
  • Differences in mobile permissions / background / offline handling

These are things that the people writing the code might overlook. manual-bb-test-harness does not deny the implementer's viewpoint, but rather adds the perspectives of black-box QA and release judgment to it.

Where Should QA Engineers Look?

If you are a QA engineer, a good way to look at it is this:

It is a tool that turns perspective-finding, which was previously based on intuition and experience, into a reviewable sequence.

It is especially valuable for facilitating the sharing of ideas between less experienced members and experienced ones.

Experienced members naturally think, "For this feature, it's dangerous if we don't look at state transitions."

However, less experienced people might only create test cases for normal operations on the screen.

In this harness, things to be verified are divided into several perspectives:

  • Operation flow
  • State changes
  • Business rules
  • Input data
  • Permissions and roles
  • Impacts on existing features
  • Differences in usage environments and devices

By dividing them this way, it becomes easier to discuss "what is missing" during reviews. Also, because items without a basis for expected results are not treated as formal test cases, it becomes easier to notice insufficient specifications.

What Should You Do First When Trying It?

It is better not to try to put it into a large-scale operation immediately.

First, choose one feature with medium risk. It is good to run through the entire process once: organizing specifications, verification perspectives, risks, test cases, effort estimation, and the summary for release judgment.

Things you should look at initially are:

  • What is missing in specifications or acceptance criteria?
  • Which acceptance criteria can be used as the basis for expected results?
  • Which expected results become "Confirm Required"?
  • Which verification perspectives should be prioritized?
  • How much effort is required if it is distilled into manual tests?
  • What must be resolved before release judgment to avoid danger?
  • Is the final judgment note in a format that conveys the situation to stakeholders?

Even this one trial will reveal quite a few weaknesses in the product or development process.

Are the specifications weak? Are the acceptance criteria weak? Is the organization of state transitions weak? Is the organization of permissions weak? Is the evidence from automated tests weak? Is the person responsible for release judgment ambiguous?

Being able to bring that to light is the value of this harness.

One-Line Summary

manual-bb-test-harness is a tool for structuring QA design knowledge for black-box testing and release judgment.

Main Documents in the Repo Used as Reference

  • manual-bb-test-harness/README.md
  • manual-bb-test-harness/docs/human-readme.md
  • manual-bb-test-harness/BLUEPRINT.md
  • manual-bb-test-harness/GUARDRAILS.md
  • manual-bb-test-harness/skills/manual-bb-test-harness/SKILL.md
  • manual-bb-test-harness/skills/manual-bb-test-harness/references/case-design-policy.md
  • manual-bb-test-harness/skills/manual-bb-test-harness/references/risk-and-gate-policy.md
