ProofData for organizing intermediate definitions and properties
A Japanese version of this article is also available.
This article introduces a pattern for organizing definitions and proofs using type classes to efficiently reuse results about intermediate data. While the technique itself may not be novel, I hope that documenting and sharing these techniques makes software verification more practical and accessible for real-world applications.
Motivation
I have been working on lean-regex, an NFA-based regular expression engine implemented in Lean 4 with correctness proofs in mind. So far, I have been able to verify that the regex search function returns `.some` only when there is a match in the haystack (theorem). However, I haven't verified that the function correctly reports the matched positions, because such reporting depends on capture groups, which I haven't formalized yet.
Therefore, my next step is to formalize the semantics of capture groups and prove that the implementation is correct with respect to that semantics. However, I was reluctant to do so: it's a lot more work, and I was afraid that the current organization of my proofs wouldn't scale well to more complicated properties.
Before diving into the problem I faced, let me give a brief overview of the implementation of lean-regex. The library compiles a regular expression (regex) into a non-deterministic finite automaton (NFA), which allows us to perform a linear-time search using a regex. This compilation is performed incrementally through a function called `pushRegex`. Specifically, `pushRegex nfa next e` extends the given NFA `nfa`, producing a new NFA whose transitions from the start state lead to the state `next` if and only if the input string matches the regex `e`.
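To make the rest of the discussion concrete, here is a minimal sketch of the types involved. These definitions are illustrative assumptions, not the actual lean-regex ones: the real `Expr` and `NFA` carry more structure, and the real `pushRegex` returns the new NFA bundled with a proof about it (which is why `result.val` appears in a later snippet).

```lean
-- Illustrative sketches only; the real lean-regex definitions are richer.
inductive Expr where
  | char (c : Char)
  | alternate (e₁ e₂ : Expr)
  -- other constructors elided

inductive Node where
  | done                          -- designated accepting state
  | char (c : Char) (next : Nat)  -- consume `c`, then go to `next`
  | split (next₁ next₂ : Nat)     -- ε-transitions to two states

structure NFA where
  nodes : Array Node
  start : Nat
```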
For example, let's consider how a regex of the form `e₁ | e₂` (alternation) is compiled. We start with a single-state NFA called `done`. This NFA has a designated accepting state, and reaching this state during the regex search signifies a successful match.
To compile `e₁ | e₂`, we call `pushRegex done 0 (e₁ | e₂)`. This process unfolds recursively:
- First, we construct the NFA for `e₁` by appending its states to `done`, resulting in `nfa₁ := pushRegex done 0 e₁`.
- Next, we append the states for `e₂`, forming `nfa₂ := pushRegex nfa₁ 0 e₂`.
- Finally, we add a split node to enable ε-transitions (transitions on empty input) to the starting states of `e₁` and `e₂`, producing the final NFA `nfa' := pushNode nfa₂ (.split nfa₁.start nfa₂.start)`, as sketched in the code below.
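Continuing the illustrative definitions above, the alternation branch of `pushRegex` could look roughly like this. This is a simplified sketch, not the actual lean-regex code; in particular, `pushNode` is assumed to append a node and point the start state at it.

```lean
-- Append a node and make it the new start state (illustrative).
def pushNode (nfa : NFA) (node : Node) : NFA :=
  { nodes := nfa.nodes.push node, start := nfa.nodes.size }

def pushRegex (nfa : NFA) (next : Nat) (e : Expr) : NFA :=
  match e with
  | .alternate e₁ e₂ =>
    let nfa₁ := pushRegex nfa next e₁   -- step 1: states for e₁
    let nfa₂ := pushRegex nfa₁ next e₂  -- step 2: states for e₂
    pushNode nfa₂ (.split nfa₁.start nfa₂.start)  -- step 3: split node
  | _ => sorry  -- other variants elided
```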
As you can see, constructing the NFA for `e₁ | e₂` involves several intermediate steps, including defining `nfa₁` and `nfa₂`. Proving the correctness of the compiled NFA relies on analyzing and verifying key properties of these intermediate NFAs, such as how their transitions behave and how they relate to the compiled NFA.
The challenge is that intermediate data like `nfa₁` and `nfa₂` are declared as local variables within `pushRegex`. This means they are not accessible as top-level definitions outside the function, making it impossible to reuse them in different proofs. As a result, I often had to prove the same properties about these intermediate data repeatedly, which was both tedious and time-consuming.
ProofData pattern
Actually, it is possible to elevate these intermediate data structures to top-level definitions using type classes. This approach allows us to encapsulate base definitions and properties in a reusable framework, eliminating redundancy in proofs. I first encountered this idea in this Zulip message from Floris, and I adopted the term "ProofData" to describe this pattern.
In this case, we can create a `ProofData` class to encapsulate the inputs to the `pushRegex` function. By bundling these inputs, such as the initial NFA, target state, and regex expression, into a single type class, we can reuse and reference them more conveniently in proofs. For instance, with a `ProofData` instance in scope, we can define `nfa'` as the result of the compilation:
```lean
class ProofData where
  nfa : NFA
  next : Nat
  e : Expr

def ProofData.nfa' [ProofData] : NFA := pushRegex nfa next e
```
While the base `ProofData` class is a good starting point, it lacks variant-specific information required for different regexes. To address this, I created specialized subclasses for each regex variant. These subclasses extend `ProofData` by including additional data and a property, `expr_eq`, that links the additional data back to the original expression. For example, the `Alternate` class handles the alternation case:
```lean
class Alternate extends ProofData where
  e₁ : Expr
  e₂ : Expr
  expr_eq : e = .alternate e₁ e₂
```
Using an `Alternate` instance, we can define intermediate data, such as `nfa₁` and `nfa₂`, as top-level definitions. This approach not only organizes these definitions clearly but also allows us to prove and encapsulate their properties. These properties can then be reused throughout the correctness proofs, as shown below:
```lean
namespace Alternate

variable [Alternate]

-- Definitions of the intermediate data.
-- Constants in these definitions come from the ProofData and Alternate instances.
def nfa₁ : NFA := nfa.pushRegex next e₁
def nfa₂ : NFA := nfa₁.pushRegex next e₂

-- Properties of the intermediate data.
theorem size_lt₂ : nfa₂.nodes.size < nfa'.nodes.size := ...
theorem size_lt₁ : nfa₁.nodes.size < nfa'.nodes.size := ...
theorem get_start : nfa'[nfa'.start] = .split nfa₁.start nfa₂.start := ...

end Alternate
```
When using these properties in correctness proofs, we can introduce an instance of the type class as a local variable. Then, Lean automatically picks up the instance when resolving names under the `Alternate` namespace.
```lean
-- eq : nfa.pushRegex next e = result
induction e with
| ...
| alternate e₁ e₂ ih₁ ih₂ =>
  -- Utility function to introduce an instance of Alternate.
  let pd := ProofData.Alternate.intro eq
  -- At this point, variant-specific definitions and properties are available.
  -- For example:
  -- - `pd.nfa₁` represents the NFA compiled for `e₁`
  -- - `pd.get_start` provides the property about the split node
```
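The `intro` helper above comes from the lean-regex codebase. As a rough illustration of the idea, a hypothetical version (not the actual implementation, which also tracks how `result` relates to the instance) could be defined along these lines:

```lean
-- Hypothetical sketch of the `intro` helper: given the compilation
-- equation for an alternation, bundle the inputs and the sub-expressions
-- into an `Alternate` instance.
def Alternate.intro {nfa : NFA} {next : Nat} {e₁ e₂ : Expr} {result : NFA}
    (_eq : pushRegex nfa next (.alternate e₁ e₂) = result) : Alternate :=
  { nfa := nfa, next := next, e := .alternate e₁ e₂,
    e₁ := e₁, e₂ := e₂, expr_eq := rfl }
```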
Results
The transition to the `ProofData` technique is available in this pull request. While the total line count increased due to the additional boilerplate, the reuse of properties about the intermediate data significantly shortened individual proofs. For complex proofs, I observed 30-40% reductions in line count.
Moreover, `ProofData` introduced a principled structure for organizing intermediate data and their associated properties. I hope this change makes it easier to scale the library as more features are added. Overall, the refactoring has proven to be a worthwhile improvement, and I plan to continue using this technique in future projects.
Shortcomings
While `ProofData` is a useful pattern for structuring proofs, I ran into some mild annoyances when applying it.
The first challenge lies in the mismatch between induction proofs and the per-variant definitions of `ProofData`. During an induction proof, the hypothesis provides a general regex `e`, whereas `ProofData` is defined for specific regex variants. As a result, I often needed to re-state the theorem for each variant to move the result into the appropriate `ProofData` namespace (example).
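As a rough illustration (a sketch with hypothetical names and elided proofs, not the code behind the link above), the re-statement pattern amounts to proving a general lemma by induction and then exposing it again through the instance:

```lean
-- General lemma, proved by induction on the expression (proof elided).
theorem pushRegex_size_lt {nfa : NFA} {next : Nat} {e : Expr} :
    nfa.nodes.size < (pushRegex nfa next e).nodes.size := sorry

-- Re-statement in the `Alternate` namespace, so that the result can be
-- accessed through an instance as `pd.size_lt` (proof elided).
theorem Alternate.size_lt [Alternate] :
    nfa.nodes.size < nfa'.nodes.size := sorry
```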
Another issue arises when using `ProofData` in the hypotheses of induction statements. For instance, consider the following theorem, which states that `pushRegex` does not modify the contents of the original NFA (as it appends new states to the end):
```lean
theorem ProofData.pushRegex_get_lt (pd : ProofData) (i : Nat) (h : i < pd.nfa.nodes.size) :
    pd.nfa'[i] = pd.nfa[i]
```
While this style of statement offers a convenient way to access the result (as `pd.pushRegex_get_lt`), it introduced two issues in the proof:
- During induction, `induction pd.e with` loses the equations between `pd.e` and the variables introduced for each branch. Although `generalize expr_eq : pd.e = e'` can address this, it's an annoying workaround.
- More importantly, `pd : ProofData` doesn't carry variant-specific data, so we need to introduce one of the subclasses (e.g., `Alternate`) in each branch again. Now you have two `nfa`s that come from different proof data! Even though they are equal, everything quickly became quite confusing.
Due to these issues, I reverted to introducing `nfa`, `next`, and `e` as free variables in theorem statements. For example:
```lean
theorem pushRegex_get_lt {nfa next e result} (eq : pushRegex nfa next e = result)
    (i : Nat) (h : i < nfa.nodes.size) :
    result.val[i] = nfa[i] := by
  induction e with
  | ...
  | alternate e₁ e₂ ih₁ ih₂ =>
    let pd := Alternate.intro eq
    -- proof continues
```
This approach is more straightforward but introduces redundancy. Each branch now includes two definitions of `nfa` (`nfa` and `pd.nfa`). Even though they are definitionally equal, tactics like `simp`, `rw`, and `omega` often struggle to see it, requiring frequent use of `rfl` to resolve equations.
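One coping pattern I can sketch (assuming `intro` is defined so that the instance's fields reduce to the free variables definitionally, as in the hypothetical `Alternate.intro` above) is to name the bridging equations once at the top of each branch:

```lean
| alternate e₁ e₂ ih₁ ih₂ =>
  let pd := Alternate.intro eq
  -- The fields of `pd` reduce to the free variables by `rfl`; naming the
  -- equations lets `rw` and `simp` move between the two spellings.
  have nfa_eq : pd.nfa = nfa := rfl
  have next_eq : pd.next = next := rfl
  rw [← nfa_eq] at h
  -- proof continues
```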
One possible improvement is to create a custom induction principle that operates directly on `pd`, replacing it with the appropriate per-variant `ProofData` in each branch. While I have not yet explored the feasibility of this approach, it could simplify the workflow and reduce the redundancy. Alternatively, tweaking the reducibility settings might at least resolve the equality issues. Input from Lean experts would be greatly appreciated!
Discussion