GenAI-Powered E2E Testing: Research Notes on AutoE2E
Introduction
A study proposes automatically generating end-to-end (E2E) tests using generative AI, so these notes dig into the details.
The code used in the study is public; it appears to navigate through screens to build a list of functionalities and a state transition diagram.
On the other hand, as far as I can tell, the code for generating the test cases themselves is not public.
(Someone who reached the same conclusion has opened an issue.)
Overview
The workflow of AutoE2E is as follows:
Starting from the base URL specified in the config, it accesses the actual pages, extracts the Actions that can be performed via buttons and forms, uses an LLM to predict and record which functionalities exist on which screens (States), and produces a list of functionalities plus a state transition diagram.
Simple Flow
- Build a State by collecting operations such as buttons and form processing (hereafter referred to as Actions) from the starting URL.
- Add it to the crawling queue.
- Repeat the following until the crawling queue is empty:
  - Retrieve a State from the queue.
  - Create the State's context using an LLM.
  - Perform the following for all Actions within the State:
    - Determine whether the Action is a critical, irreversible one.
    - If it is not a critical Action:
      - If it is a Form Action, fill in the values required for execution using an LLM.
      - Actually execute the Action.
      - Build a State from the list of Actions on the screen shown after the Action executes.
      - If it is a new State, add it to the crawling queue and to the state transition information, and skip functionality extraction.
    - If functionality extraction is to be performed:
      - Extract functionalities from the current State and Action, and add them to ActionFunctionDB and FunctionDB as necessary.
      - Extract functionalities from the current State and Action together with the previous State and Action, and add those to the DBs as well.
      - Update the FunctionDB scores.
      - Determine whether this is the final Action.
- Create a graph with States as nodes and Actions as edges.
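The loop above can be sketched as a breadth-first crawl over States, with Actions as edges. This is a minimal illustration, not the repository's actual code; the function names (`get_actions`, `execute`, `is_critical`, `extract_functions`) are assumptions standing in for the real crawler, LLM calls, and DB writes.

```python
from collections import deque

def crawl(start_state, get_actions, execute, is_critical, extract_functions):
    """BFS over States; each executed Action yields a transition edge."""
    queue = deque([start_state])
    seen = {start_state}
    edges = []  # (state, action, next_state) transitions for the graph
    while queue:
        state = queue.popleft()
        for action in get_actions(state):
            if is_critical(action):  # skip irreversible actions entirely
                continue
            next_state = execute(state, action)
            edges.append((state, action, next_state))
            if next_state not in seen:
                # New State: enqueue it and skip functionality extraction.
                seen.add(next_state)
                queue.append(next_state)
            else:
                extract_functions(state, action)
    return edges
```

A toy run with two States that link to each other visits both, records two edges, and extracts functionalities only for the Action that leads back to an already-seen State.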
Data Important for Analysis
MongoDB
MongoDB is used to manage the following two collections:
- FunctionDB
- ActionFunctionDB
FunctionDB
Stores functions and their likelihood scores.
| Attribute | Description |
|---|---|
| app | Application name |
| text | Function name |
| embedding | The result of embedding `text` with `OpenAIEmbeddings(model="text-embedding-3-large")` |
| score | Cumulative likelihood score of the function |
| final | Flag indicating that it has been judged as "terminal" in this state×action observation |
| executable | Flag indicating that at least one executable Action is associated with the function |
ActionFunctionDB
Maintains scores for functions occurring in actions within each State.
| Attribute | Description |
|---|---|
| app | Application name |
| url | URL of the State |
| state | state_id representing the State (`State.get_id(BY_ACTIONS)`) |
| prev_state | state_id of the previous State |
| action | Element ID |
| prev_action | ID of the previous Action |
| test_id | element.test_id of the Action |
| depth | Depth of the operation |
| type | SINGLE/DOUBLE (Score calculation using previous State information) |
| rank_score | Geometric score of the rank returned by the LLM for the corresponding function (under that State×Action condition) |
| func_pointer | ID of the document returned when inserting into FunctionDB |
| final | Flag indicating that it has been judged as "terminal" in this state×action observation |
| should_execute | Seems to be always True |
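To make the two schemas concrete, here are illustrative documents shaped after the tables above. The field names follow the tables; the values (app name, IDs, scores) are made up for a hypothetical shop application, not taken from the repository.

```python
# Illustrative FunctionDB document (one functionality and its scores).
function_doc = {
    "app": "shop-demo",
    "text": "add item to cart",
    "embedding": [0.012, -0.034],  # text-embedding-3-large vector, truncated here
    "score": 1.25,                 # cumulative likelihood score
    "final": False,
    "executable": True,
}

# Illustrative ActionFunctionDB document (one State x Action observation).
action_function_doc = {
    "app": "shop-demo",
    "url": "https://shop.example/cart",
    "state": "state-3f2a",         # State.get_id(BY_ACTIONS)
    "prev_state": "state-91bc",
    "action": "add-to-cart-btn",   # element ID
    "prev_action": "search-submit",
    "test_id": "add-to-cart-btn",  # element.test_id
    "depth": 2,
    "type": "DOUBLE",              # score computed using previous-State info
    "rank_score": 0.5,             # geometric score of the LLM's rank
    "func_pointer": "65f0c0ffee",  # _id returned by the FunctionDB insert
    "final": False,
    "should_execute": True,
}
```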
Main Data Storage Classes
Action
Action represents button clicks and form actions.
| Property | Type | Description |
|---|---|---|
| element | Element | Element of the Action |
| action_type | ActionType | FormActionType/ClickActionType |
| should_execute | bool | |
| parent_state_id | str | Parent state_id |
State
State is a class that stores information about the state (screen).
| Property | Type | Description |
|---|---|---|
| unique_id | str | Unique ID (uuid4) |
| evaluator | StateIdEvaluator | BY_UNIQUE/BY_URL/BY_DOM/BY_ACTIONS |
| url | str | URL of the page |
| dom | str | DOM of the page (driver.page_source) |
| actions | list[Action] | List of actions that the state has |
| crawl_path | CrawlPath | Information representing the States and Actions through which the State transitioned |
| context | str | Context of the State |
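A minimal sketch of the two storage classes, following the property tables above. The real repository classes carry more behavior (e.g. `State.get_id` with an evaluator, the `Element` wrapper); strings stand in for those types here.

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class Action:
    element: str               # stands in for the Element wrapper
    action_type: str           # FormActionType / ClickActionType
    should_execute: bool = True
    parent_state_id: str = ""  # state_id of the parent State

@dataclass
class State:
    url: str
    dom: str                   # driver.page_source
    actions: list = field(default_factory=list)
    crawl_path: list = field(default_factory=list)  # States/Actions traversed
    context: str = ""          # LLM-generated context of the State
    unique_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```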
Where LLM is Used
Creating State Context Using LLM
In extract_state_context, a screen capture of the State, the context of the previous State, and the HTML of the Action element are used to create the context of the State with the following prompt.
The result of this function is a State context such as "a page displaying search results for a product query."
System Prompt:
CONTEXT_EXTRACTION_SYSTEM_PROMPT
Given the provided information about a webpage, your task is to provide a brief and abstract description of the webpage's primary purpose or function.
Output Guidelines:
* Brevity: Keep the description concise (aim for 1-2 sentences).
* Abstraction: Avoid specific details or variable names. Use general terms to describe the content and function. (Example: Instead of "a page showing results for searching for a TV," say "a page displaying search results for a product query.")
* Focus on Purpose: Prioritize describing the main intent of the page. What is it designed for the user to do or learn?
* No Extra Explanations: Just provide the context. Avoid adding commentary or assumptions.
User Prompt
CONTEXT_EXTRACTION_USER_PROMPT
The description of the website is: {description}
The previous state was: {previous_state}
The previous action was: {previous_action}
description: "None"
previous_state: Context of the previous State
previous_action: outerHTML of the element possessed by the Action executed in the previous State
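Assembling the chat payload for this call might look roughly like the following. This is a sketch under assumptions: the helper name is hypothetical, and the real call also attaches a screenshot of the State as an image part, which is omitted here.

```python
CONTEXT_EXTRACTION_USER_PROMPT = (
    "The description of the website is: {description}\n"
    "The previous state was: {previous_state}\n"
    "The previous action was: {previous_action}"
)

def build_context_messages(system_prompt, previous_state, previous_action,
                           description="None"):
    """Build the system/user messages for the context-extraction call."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": CONTEXT_EXTRACTION_USER_PROMPT.format(
            description=description,
            previous_state=previous_state,
            previous_action=previous_action,
        )},
    ]
```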
Checking if it's a critical Action
is_action_critical determines from the Action's outerHTML whether it is a critical Action whose effects are irreversible, such as account deletion or making a purchase.
System Prompt
CRITICAL_ACTION_SYSTEM_PROMPT
Given an element in a web application, your task is to determine if the element is a critical action.
A critical action is an action that its effects are irreversible, such as deleting an account or making a purchase.
Please return a boolean value indicating if the element is a critical action. The boolean should be in Python format (True or False).
Just return the boolean and no further explanation.
User Prompt
The outerHTML of the Action's element
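Since the prompt asks for a bare Python boolean, parsing the reply is simple; one reasonable sketch (the fallback policy is my assumption, not the repository's) is to treat any unexpected reply as critical so the crawler errs on the side of not executing:

```python
def parse_critical(reply: str) -> bool:
    """Parse the LLM's 'True'/'False' reply; treat anything else as critical."""
    text = reply.strip()
    if text in ("True", "False"):
        return text == "True"
    return True  # unexpected reply: assume irreversible, do not execute
```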
Filling information required for execution using LLM for Form Actions
create_form_filling_values extracts form information required for submission using the Action's outerHTML. The output result is in JSON format.
System Prompt
FORM_VALUE_SYSTEM_PROMPT
Given a form element in a web application, your task is to generate a set of values so that the form can be submitted successfully.
The format for your response should be a JSON where the keys are the data-testid attributes of the input elements and the values are the values that should be filled in.
If the elements are radios or checkboxes, the values should be booleans.
If the elements are selects, the values should be the value attribute of the selected option.
Your response should be parsable by json.loads. Just include your response in the JSON, no additional information is needed. Avoid formatting the JSON for markdown or any other format.
User Prompt
The outerHTML of the Action's element
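The reply is a JSON object keyed by `data-testid`, so a parsing sketch only needs `json.loads`; stripping an accidental markdown fence (which the prompt explicitly asks the model to avoid, but models sometimes emit anyway) is a defensive assumption on my part:

```python
import json

def parse_form_values(reply: str) -> dict:
    """Parse the LLM's JSON reply mapping data-testid -> value to fill in."""
    cleaned = reply.strip()
    # Defensively strip a markdown fence if the model added one anyway.
    cleaned = cleaned.removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)
```

Booleans toggle radios/checkboxes, and strings fill text inputs or select the matching option's `value` attribute.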
Extracting Functionalities
extract_action_functionalities uses the current State's context and the Action's outerHTML (the previous one can also be used if necessary) to retrieve up to five candidate functionalities (e.g., "add item to cart") associated with that Action.
The result is expected to be a JSON like the following:
[
  {
    "probability": (0.0 to 1.0) likelihood that the functionality exists,
    "feature": concise description of the user action (e.g., "add item to cart")
  },
  ... up to 5 items, sorted by probability in descending order
]
System Prompt
FUNCTIONALITY_EXTRACTION_SYSTEM_PROMPT
Given a webpage's purpose and content (webpage_context), the outerHTML of an action element (action_element), and optionally the user's last action that led to this state, your task is to infer the most likely functionalities associated with that action element.
These functionalities should be user-centric actions that produce measurable outcomes within the application, are testable through E2E testing, and are essential to the presence of the action element.
Output Format:
Your response is enclosed in two tags:
<Reasoning>:
- An enumerated list of at most five functionalities potentially connected to the element.
- For each functionality, answer the following questions concisely:
1. Would developers write E2E test cases for this in the real world? It should be non-navigational, not menu-related, and not validation.
2. Is the functionality a final user goal in itself or is it always a step in doing something else?
3. Is this overly abstract/vague? If so, break it down into more testable sub-functionalities.
- Avoid repeating the questions in your responses every time.
<Response>:
- A JSON array of objects, each containing:
- probability: (0.0 to 1.0) Likelihood of this functionality exists.
- feature: A concise description of the user action (e.g., "add item to cart").
- Sorted by probability in descending order.
- Parsable by `json.loads`.
- Can be an empty array if no valid functionalities are found.
User Prompt
{
"webpage_context": webpage_context,
"action_element": action_element,
"previous_action": previous_action
}
webpage_context: Context of the State
action_element: outerHTML of the Action
previous_action: outerHTML of the previous Action, if any
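Because the reply interleaves a `<Reasoning>` section with the `<Response>` JSON array, the caller has to pull the array out before parsing. A minimal sketch (the regex is my assumption about how the tagged output is separated):

```python
import json
import re

def parse_functionalities(reply: str) -> list[dict]:
    """Extract and parse the JSON array inside the <Response> tag."""
    match = re.search(r"<Response>:?\s*(\[.*\])", reply, re.DOTALL)
    if not match:
        return []  # no valid functionalities found
    items = json.loads(match.group(1))
    # The prompt asks for descending probability; enforce it defensively.
    return sorted(items, key=lambda f: f["probability"], reverse=True)
```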
Adding to ActionFunctionDB and FunctionDB based on extracted functionalities as needed
insert_functionalities inserts into or updates FunctionDB depending on whether the functionality already exists. Because functionality names are produced by an LLM, as described above, they are not deterministic; the name is therefore vectorized with OpenAIEmbeddings, similar candidates are retrieved from FunctionDB, and an LLM judges whether they refer to the same functionality.
When an existing match is found, a name that merges the existing and current names is generated and registered as the functionality name.
The LLM returns a response like the following:
{
  "match": true if any functionality in the list matches the base functionality,
  "match_index": array of indices of the matched functionalities (included only on a match),
  "combined_text": a concise description of the matched functionality, with redundant words omitted (included only on a match)
}
System Prompt
SIMILARITY_SYSTEM_PROMPT
Given a description of a software feature and a list of other software feature descriptions, your task is to determine if the initial feature matches any features in the list.
Output format:
Your analysis is enclosed in two tags:
<Reasoning>:
- For each item in the list, argue why the base feature and the feature in the list are or aren't describing the same action being performed in the app.
- Are they exactly or semantically equivalent?
- If they are different, how are they different?
- Avoid repeating the questions in your responses every time.
- Your analyses should be short and concise.
<Response>:
- A JSON object containing the following keys:
- match: true If any feature in the list matches the base feature, false if not.
- match_index: An array of indices of matched features in the list. Only include this key if there is a matching feature.
- combined_text: If the features match, a concise description of that feature. Only include this key if there is a matching feature. You can omit some of the redundant words to keep this sentence simple.
- Parsable by `json.loads`.
User Prompt
f'Base feature:\n{base_functionality}\nThe list of functionalities:\n{functionalities}'
base_functionality: Functionality name
functionalities: List of functionality names similar to base_functionality, separated by newlines.
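The candidate lookup that feeds this prompt can be sketched as cosine similarity between the new functionality's embedding and the stored FunctionDB embeddings. This is an illustration only: the threshold and the in-memory filtering are assumptions, and the repository presumably queries MongoDB rather than scanning a Python list.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similar_candidates(query_vec, docs, threshold=0.8):
    """Return stored functionality docs whose embedding is close to query_vec."""
    return [d for d in docs if cosine(query_vec, d["embedding"]) >= threshold]
```

The surviving candidates are then handed to the similarity prompt above for the final same-or-not judgment.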
Determining if it's the final Action
mark_final_functionalities calls an LLM that returns boolean values indicating, for each candidate functionality, whether the current State's Action is the final action in the chain, and then updates the final flags in FunctionDB and ActionFunctionDB as necessary.
System Prompt
FINALITY_SYSTEM_PROMPT
Given the context of a webpage, an action element, and a list of features and scenarios, your task is to determine whether the action is the final action in the chain of actions for performing each of the features.
Output format:
Your analysis is enclosed in two tags:
<Reasoning>:
- For each feature in the list, argue why executing the action would or would not conclude the feature.
- Avoid repeating the description of the feature.
- Your analyses should be short and concise.
<Response>:
- An array of Python booleans, where index i is True if the action concludes feature i.
User Prompt
f'The context of the webpage is: {context}\nThe action element is: {clean_children_html(action_element)}\nThe list of functionalities:\n{functionalities}'
context: Context of the current State
action_element: outerHTML of the Action
functionalities: List of candidate functionalities linked to the current State and Action, separated by newlines.
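Since this prompt asks for an array of Python booleans (`[True, False, ...]`) rather than JSON, a parsing sketch can use `ast.literal_eval` on the `<Response>` section; as before, the extraction regex is my assumption:

```python
import ast
import re

def parse_finality(reply: str) -> list[bool]:
    """Extract the Python boolean array from the <Response> tag."""
    match = re.search(r"<Response>:?\s*(\[.*?\])", reply, re.DOTALL)
    if not match:
        return []
    return ast.literal_eval(match.group(1))  # e.g. [True, False, True]
```

Index i of the result then drives the final-flag update for functionality i.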
Summary
As I understand it, the goal is to use general-purpose crawling together with LLMs to comprehensively build state transition diagrams and functionality lists for an application.
The ideas of creating State context from screenshots and previous States/Actions, the functionality extraction method, and the refinement via similar-functionality matching and combined_text updates all seem like highly applicable concepts.
On the other hand, considering issues such as user authentication, the volume of LLM usage, and the actual generation of test cases, there are doubts about whether the code on GitHub can be used as is. Therefore, it should probably be used primarily as a reference.
