Translated by AI
Building a Property-Based Testing Library (Stateful Testing Edition)
This is a continuation of the previous article, "I Tried Making a Property-Based Testing Library". I recently implemented stateful testing, so I would like to introduce the knowledge I gained during that process.
Please note that this article is not intended for learning stateful testing itself. For those who want to learn about stateful testing, I highly recommend reading Practical Property-Based Testing.
Disclaimer
The explanations in this article are based on my work, kiri-check. The stateful testing in kiri-check is modeled after major libraries like ScalaCheck, so there shouldn't be many significant differences from other libraries.
I have tried to minimize elements unique to kiri-check as much as possible, but due to differences in the development language (Dart), it is difficult to completely mimic other libraries. When trying out stateful testing, please prioritize the documentation of the library you are using.
What is Stateful Testing?
As the name implies, stateful testing is testing that "has state." Even within property-based testing, basic tests that use randomly generated data are called stateless tests because they do not maintain state. Stateless testing is suitable for processes that can be completed within a single test, while stateful testing is suitable for processes that need to account for complex state transitions.
A widely used method for performing stateful testing involves preparing two implementations—a model representing the desired behavior and the actual system in operation—and comparing their states before and after performing arbitrary operations. This method is sometimes called model-based testing. While it is more complex and labor-intensive compared to stateless property testing, it allows for simulations that are closer to real-world operations.
Support for stateful testing varies by library. Most major libraries seem to support it. Among the several PBT libraries for Dart, it appears that kiri-check is the only one that implements stateful testing.
Sample Code
Here is sample code for kiri-check. This code is included in the repository. It is also explained in the Quick Start section of the documentation, so please feel free to read that as well.
import 'package:kiri_check/kiri_check.dart';
import 'package:kiri_check/stateful_test.dart';

// Model implementation
final class CounterModel {
  int count = 0;

  void reset() {
    count = 0;
  }

  void increment() {
    count++;
  }

  void decrement() {
    count--;
  }
}

// Actual system implementation
final class CounterSystem {
  // In this sample, assume data is held in JSON
  Map<String, int> data = {'count': 0};

  int get count => data['count']!;

  set count(int value) {
    data['count'] = value;
  }

  void reset() {
    data['count'] = 0;
  }

  void increment() {
    data['count'] = data['count']! + 1;
  }

  void decrement() {
    data['count'] = data['count']! - 1;
  }
}

// Test specification
final class CounterBehavior extends Behavior<CounterModel, CounterSystem> {
  @override
  CounterModel initialState() {
    return CounterModel();
  }

  @override
  CounterSystem createSystem(CounterModel s) {
    return CounterSystem();
  }

  @override
  List<Command<CounterModel, CounterSystem>> generateCommands(CounterModel s) {
    return [
      Action0(
        'reset',
        nextState: (s) => s.reset(),
        run: (system) {
          system.reset();
          return system.count;
        },
        postcondition: (s, count) => count == 0,
      ),
      Action0(
        'increment',
        nextState: (s) => s.increment(),
        run: (system) {
          system.increment();
          return system.count;
        },
        postcondition: (s, count) => s.count + 1 == count,
      ),
      Action0(
        'decrement',
        nextState: (s) => s.decrement(),
        run: (system) {
          system.decrement();
          return system.count;
        },
        postcondition: (s, count) => s.count - 1 == count,
      ),
    ];
  }

  @override
  void destroySystem(CounterSystem system) {}
}

void main() {
  property('counter', () {
    // Execute the stateful test
    runBehavior(CounterBehavior());
  });
}
The code above is a test for a simple counter. This counter has an integer property representing the count value. The available operations are increment, decrement, and reset (back to 0).
In this instance, I have intentionally made the specification of the actual system more complex by storing the count value as JSON-convertible data, in order to differentiate the implementation of the model from that of the actual system.
This code defines three classes:
- CounterModel: The model
- CounterSystem: The actual system
- CounterBehavior: The test specification, which includes various callbacks
CounterModel is the model representing the counter's specification. Since the model serves as the baseline for stateful testing, implementing it with a focus on the accuracy of the specification is the highest priority. Performance and extensibility are secondary; a simple implementation that is as error-resistant as possible is ideal.
CounterSystem is the implementation of the actual system. In this sample, because it is a simple counter, the difference from the model is minimal.
CounterBehavior is responsible for creating the model and the actual system, as well as generating the commands to be executed. While the class name Behavior is specific to kiri-check, it consolidates APIs typically used in other libraries into one place.
The main methods are shown below:
- initialState: Creates the model in its initial state.
- createSystem: Creates the actual system. It takes the model created by initialState as an argument.
- generateCommands: Generates a list of commands to be executed.
Commands
A "command" is a unit of processing for the model or the actual system. In kiri-check, the command that allows users to specify processing is the Action command. Action can use Arbitraries to generate random values. Action0 is a command used for processing that does not require random values.
In this test, we prepare three commands: increment, decrement, and reset. Since they are all almost the same, let's just look at the increment command.
Action0(
  'increment', // Description of the command content
  nextState: (s) => s.increment(),
  run: (system) {
    system.increment();
    return system.count;
  },
  postcondition: (s, count) => s.count + 1 == count,
),
The arguments are as follows. Although not used this time, there is also a precondition argument representing the command's precondition.
- nextState: Updates the state of the model.
- run: Performs processing on the actual system.
- postcondition: The postcondition checked after run executes.
The flow of command callbacks is: precondition check (precondition) -> actual-system processing (run) -> postcondition check (postcondition) -> model update (nextState). The execution order has its quirks, so I will discuss it in detail in the Execution Model section.
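Schematically, one command step looks roughly like the following plain-Dart sketch. This is illustrative pseudocode of the callback order, not kiri-check's actual internals; the names Model, System, and runIncrementStep are hypothetical.

```dart
// Illustrative sketch of the per-command callback order:
// precondition -> run -> postcondition -> nextState.
// Not actual kiri-check internals; all names here are made up.
class Model {
  int count = 0;
}

class System {
  int count = 0;
}

/// Runs one "increment" step. Returns false when a check fails,
/// which in a real framework would trigger shrinking.
bool runIncrementStep(Model model, System system) {
  // 1. precondition: always true for increment in this sketch.
  // 2. run: operate on the real system and capture its observable result.
  system.count++;
  final result = system.count;
  // 3. postcondition: compares against the model BEFORE it is updated.
  if (model.count + 1 != result) return false;
  // 4. nextState: only now is the model advanced.
  model.count++;
  return true;
}

void main() {
  final m = Model();
  final s = System();
  print(runIncrementStep(m, s)); // true: model and system agree
  s.count = 10; // desynchronize the system to force a failure
  print(runIncrementStep(m, s)); // false: postcondition violated
}
```

Note that the postcondition sees the model from before the step; this matches the "quirk" in the execution order discussed later.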
Execution
To run the test, execute dart test.
dart test example/stateful_counter.dart
Setting KiriCheck.verbosity will display the execution process.
void main() {
  // Display details
  KiriCheck.verbosity = Verbosity.verbose;

  property('counter', () {
    runBehavior(CounterBehavior());
  });
}
Example output:
...
Cycle 10
Generate commands...
Create state: CounterModel
Create system: CounterSystem
Step 1: reset
Step 2: decrement
Step 3: increment
Step 4: increment
Step 5: increment
Step 6: decrement
Step 7: increment
Step 8: reset
Step 9: reset
Step 10: decrement
...
By default, the cycle of executing 50 random commands in succession is repeated 100 times.
Shrinking
Just like with stateless tests, shrinking is performed in stateful testing when a test fails. Let's try introducing a bug into the actual system to see how it works. We'll modify CounterSystem.decrement so that "the count value stops decreasing if it is greater than 5."
void decrement() {
  if (data['count']! > 5) {
    return;
  }
  data['count'] = data['count']! - 1;
}
When you run the test after this change, the failed command sequence is displayed at the end:
Falsifying example sequence:
Step 1: increment
Step 2: increment
Step 3: increment
Step 4: decrement
Step 5: increment
Step 6: increment
Step 7: increment
Step 8: increment
Step 9: increment
Step 10: increment
Step 11: decrement
This command sequence is not the original pattern that first failed, but rather a reduced version of the original produced through shrinking. The original is shown below:
Original command sequence where the error occurred
Cycle 8
Generate commands...
Create state: CounterModel
Create system: CounterSystem
Step 1: increment
Step 2: decrement
Step 3: decrement
Step 4: increment
Step 5: decrement
Step 6: decrement
Step 7: reset
Step 8: increment
Step 9: decrement
Step 10: decrement
Step 11: reset
Step 12: increment
Step 13: increment
Step 14: increment
Step 15: decrement
Step 16: reset
Step 17: decrement
Step 18: increment
Step 19: increment
Step 20: reset
Step 21: increment
Step 22: decrement
Step 23: decrement
Step 24: decrement
Step 25: decrement
Step 26: decrement
Step 27: decrement
Step 28: reset
Step 29: increment
Step 30: decrement
Step 31: reset
Step 32: decrement
Step 33: increment
Step 34: increment
Step 35: reset
Step 36: increment
Step 37: increment
Step 38: increment
Step 39: reset
Step 40: reset
Step 41: decrement
Step 42: reset
Step 43: increment
Step 44: increment
Step 45: increment
Step 46: increment
Step 47: increment
Step 48: increment
Step 49: decrement
Error: postcondition is not satisfied
While the original failed at step 49, it has been shortened to 11 steps by shrinking. All reset commands in the original were cut, and the failure is reproduced using only increment and decrement commands.
In the shrunk command sequence, the count value increases continuously from step 5 to step 10, making it clear that the count value exceeded 5 during this sequence. Since the count value at step 11 was 8, it did not reach the point of identifying the exact minimum value that triggers the bug, but reducing the number of steps to about 1/5 still makes debugging much easier.
Execution Model
In general, stateful testing execution is divided into two phases. The first phase involves generating the commands, and the second phase involves executing various processes, including the commands. Some callbacks are shared between both phases, but their behavior after being called differs.
All the callbacks mentioned in this section are processes that the user should define.
Command Generation Phase
In the first phase, the commands to be used are randomly generated (execution takes place in the next phase). At this stage, only the model is generated. The actual system is not created.
The flow of the command generation phase is shown below:
First, Behavior.initialState() creates the model. The created model is treated as the initial state. Next, Behavior.initializePrecondition(State) is called to check the initial conditions of the created model. If the return value is false, the test fails.
Next, Behavior.generateCommands(State) is called to generate a list of candidates for the commands to be executed in the next phase. After that, it enters the command selection loop.
In the command selection loop, a command is randomly selected from the list. Command.precondition(State) is executed for the selected command. If the return value is true, it is adopted; if false, it is rejected, and the selection moves on to the next command.
Once the precondition check is passed, Command.nextState(State) is executed to update the state of the model according to the command. If the required number of commands has not been reached, the selection continues.
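The selection loop above can be sketched in plain Dart roughly as follows. This is a toy illustration, not kiri-check's implementation; the Command class, the selectCommands function, and the int-valued model are assumptions made for brevity.

```dart
import 'dart:math';

// Toy sketch of the command generation phase: only the model is touched,
// and the real system does not exist yet. Names are illustrative.
class Command {
  Command(this.name, {required this.precondition, required this.nextState});
  final String name;
  final bool Function(int model) precondition;
  final int Function(int model) nextState;
}

/// Randomly selects [steps] commands, rejecting any whose precondition
/// fails against the current model state, and advancing the model as
/// each command is adopted.
List<Command> selectCommands(
    List<Command> candidates, int initialModel, int steps, Random random) {
  var model = initialModel;
  final selected = <Command>[];
  while (selected.length < steps) {
    final cmd = candidates[random.nextInt(candidates.length)];
    if (!cmd.precondition(model)) continue; // rejected; pick again
    model = cmd.nextState(model); // advance the model only
    selected.add(cmd);
  }
  return selected;
}

void main() {
  final commands = [
    Command('increment', precondition: (m) => true, nextState: (m) => m + 1),
    Command('decrement', precondition: (m) => m > 0, nextState: (m) => m - 1),
  ];
  final seq = selectCommands(commands, 0, 20, Random(42));
  print(seq.map((c) => c.name).toList());
}
```

Because preconditions are checked against the evolving model, decrement is never chosen while the model counter is 0 in this sketch.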
Execution Phase
In the execution phase, the commands generated in the previous phase are applied to the actual system. Since the model is recreated, the information generated during the command generation phase is not carried over. Basically, shrinking also follows the same flow as the execution phase.
The flow of the execution phase is shown below:
The process of generating the model up to Behavior.initializePrecondition(State) is the same as in the previous phase. In the execution phase, Behavior.createSystem(State) then creates the actual system. The generated model is passed as an argument, and the actual system corresponding to that model is created.
From here, the command execution loop for the actual system begins. For each command execution, Command.precondition(State) is called first to check the precondition. However, unlike the command generation phase, if the return value is false, shrinking begins. It is not immediately considered a failure.
If the precondition is met, Command.run(System) is called to perform the process on the actual system. For commands using Arbitraries (Action), random values are generated here. If an error occurs during processing, shrinking begins. run can return an arbitrary value, which is then passed to the postcondition called next.
Command.postcondition(State, Result) checks the postcondition. If the return value is false, shrinking begins. The arguments for postcondition are the model from the state immediately preceding the run execution and the return value of run. If the return value of the postcondition is true, Command.nextState(State) is called to update the model's state.
A basic implementation of postcondition using these arguments would be: "model of the previous state + difference to the next state = arbitrary value representing the next state of the actual system (return value of run)". For example, for a command that increments a counter value by 1, run would return the counter value of the actual system after the increase, and postcondition would be "counter value of the model before increase + 1 == counter value of the actual system after increase".
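As a standalone illustration of this pattern (a hypothetical helper, not part of kiri-check's API):

```dart
// Hypothetical standalone version of the increment postcondition,
// following the pattern "model count before the step + 1 == counter
// value the real system reports after run".
bool incrementPostcondition(int modelCountBefore, int systemCountAfterRun) {
  return modelCountBefore + 1 == systemCountAfterRun;
}

void main() {
  print(incrementPostcondition(0, 1)); // true: system moved as expected
  print(incrementPostcondition(3, 3)); // false: system did not move
}
```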
One thing to note here is the timing when postcondition is called. Intuitively, it might seem like the order would be run -> nextState -> postcondition: one might expect to update the model state in nextState, synchronizing the number of operations between the model and the actual system, before calling postcondition. However, in kiri-check and other typical stateful testing frameworks, postcondition is called immediately after run, and the arguments passed are the run return value and the model from one step prior, which is a slightly indirect flow. Furthermore, the actual system itself is not passed.
The reason for this somewhat roundabout method is that the subject of stateful testing is the model. The focus should be on the model implementation, and ensuring the model represents the correct specification is paramount. If the model is properly implemented, flaws in the actual system will be automatically exposed. That is both the strength and the difficulty of stateful testing. It requires the cost of preparing an implementation separate from the actual system, and model design is more complex than in stateless testing.
If no error occurs in nextState, it proceeds to the next command execution. The execution phase ends when all commands have finished or when shrinking is complete. Finally, destroySystem(System) is called to perform cleanup for the actual system, and the test ends.
Shrinking
When an error occurs in the execution phase, shrinking begins. The goal of shrinking in stateful testing is to discover the minimum combination of commands that produces the error. If a command uses random values, those values are also shrunk to find the minimum values.
In kiri-check, shrinking is performed in three phases. Other libraries generally follow a similar approach.
In the first phase, the original command sequence is split into multiple subsequences. Since it simply splits them by position and count, it includes subsequences that do not include the first command or the last command. Each subsequence is executed individually, and one failed subsequence is selected to proceed to the next phase. In the counter example, a subsequence starting from step 36 was selected because an error occurred there.
In the next phase, individual commands are removed one by one from the selected subsequence to test them. In the counter example, the "increment," "decrement," and "reset" commands are each removed to check which commands are necessary (or unnecessary) for the error to occur. Through this process, it was determined that the presence or absence of the reset command had no effect on the occurrence of the error, so it was cut.
In the final phase, the random values generated by the commands are shrunk (just as in stateless testing). In the counter example, no random values were used, so nothing is executed in this phase.
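The first of these phases can be sketched as follows. This is a toy illustration under assumed names (splitIntoSubsequences, findFailingSubsequence, stillFails); kiri-check's actual splitting strategy may differ.

```dart
import 'dart:math';

// Toy sketch of shrinking phase 1: split the failing command sequence
// into contiguous subsequences and keep one that still fails.
List<List<T>> splitIntoSubsequences<T>(List<T> sequence, int parts) {
  final chunkSize = (sequence.length / parts).ceil();
  final chunks = <List<T>>[];
  for (var i = 0; i < sequence.length; i += chunkSize) {
    chunks.add(sequence.sublist(i, min(i + chunkSize, sequence.length)));
  }
  return chunks;
}

/// Replays each subsequence via [stillFails] (which reports whether the
/// bug reproduces) and returns the first one that still fails, or null.
List<T>? findFailingSubsequence<T>(
    List<T> sequence, int parts, bool Function(List<T>) stillFails) {
  for (final chunk in splitIntoSubsequences(sequence, parts)) {
    if (stillFails(chunk)) return chunk; // shrinking continues from here
  }
  return null; // no smaller reproduction found at this granularity
}

void main() {
  // Counter with the seeded bug: decrement is ignored when count > 5.
  bool fails(List<String> cmds) {
    var model = 0, system = 0;
    for (final c in cmds) {
      if (c == 'inc') {
        model++;
        system++;
      } else {
        model--;
        if (system <= 5) system--; // buggy decrement
      }
      if (model != system) return true; // postcondition violated
    }
    return false;
  }

  final original = [...List.filled(14, 'inc'), 'dec', 'inc'];
  final chunk = findFailingSubsequence(original, 2, fails);
  print(chunk); // a shorter subsequence that still reproduces the bug
}
```

The later phases (dropping individual commands, then shrinking random values) operate on whichever subsequence survives this step.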
Summary
Property-based testing, whether stateless or stateful, can easily seem like a black box to users. You don't know what values are being tested (unless you explicitly display them), and the range of control varies by library. Stateful testing, in particular, can feel like magic if you don't know the algorithm.
However, these magic-like algorithms are surprisingly simple once understood, and they are not that difficult. The basic APIs are well-established, and it should be relatively easy to implement one by referring to other libraries. Why not try building your own as a summer research project?
I would be happy if you could give me a badge; it will go toward my coffee fund and motivation. I'd also love to hear your feedback if you try using kiri-check.
Reference Links
- Practical Property-Based Testing
- kiri-check: Stateful Testing
- Introduction to Property-Based Testing
- Stateful PBT with gopter (gopter development seems to have mostly stopped. Is rapid the leading choice now?)
- Stateful Testing in Scala (English)