
ProofData for organizing intermediate definitions and properties

Published on 2024/12/05

A Japanese version of this article is also available.

This article introduces a pattern for organizing definitions and proofs using type classes so that results about intermediate data can be reused efficiently. While the technique itself may not be novel, I hope that documenting and sharing it helps make software verification more practical and accessible for real-world applications.

Motivation

I have been working on lean-regex, an NFA-based regular expression engine implemented in Lean 4 with correctness proofs in mind. So far, I have been able to verify that the regex search function returns .some only when there is a match in the haystack (theorem). However, I haven't verified that the function correctly reports the matched positions, because such reporting depends on capture groups, which I haven't formalized yet.

Therefore, my next step is to formalize the semantics of capture groups and prove that the implementation is correct with respect to that semantics. However, I was reluctant to do so because it's a lot more work, and I was afraid that the current organization of my proofs wouldn't scale well to more complicated properties.

Before diving into the problem I faced, let me give a brief overview of the implementation of lean-regex. The library compiles a regular expression (regex) into a non-deterministic finite automaton (NFA), which allows us to perform regex search in linear time. This compilation is performed incrementally through a function called pushRegex. Specifically, pushRegex nfa next e extends the given NFA nfa, producing a new NFA in which the transitions from the new start state reach the state next if and only if the input string matches the regex e.

For example, let’s consider how a regex of the form e₁ | e₂ (alternation) is compiled. We start with a single-state NFA called done. This NFA has a designated accepting state, and reaching this state during the regex search signifies a successful match.

To compile e₁ | e₂, we call pushRegex done 0 (e₁ | e₂). This process unfolds recursively:

  1. First, we construct the NFA for e₁ by appending its states to done, resulting in nfa₁ := pushRegex done 0 e₁.
  2. Next, we append the states for e₂, forming nfa₂ := pushRegex nfa₁ 0 e₂.
  3. Finally, we add a split node to enable ε-transitions (transitions on empty input) to the starting states of e₁ and e₂, producing the final NFA nfa' := pushNode nfa₂ (.split nfa₁.start nfa₂.start).

As you can see, constructing the NFA for e₁ | e₂ involves several intermediate steps, including defining nfa₁ and nfa₂. Proving the correctness of the compiled NFA relies on analyzing and verifying key properties of these intermediate NFAs, such as how their transitions behave and how they relate to the compiled NFA.
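To make the shape of this construction concrete, here is a minimal, self-contained sketch of the alternation case. The definitions below (Expr, Node, NFA, pushNode, pushRegex) are simplified stand-ins written for this article, not the actual lean-regex definitions, which carry additional variants and invariants:

-- Simplified stand-ins for the actual lean-regex types
inductive Expr where
  | char (c : Char)
  | alternate (e₁ e₂ : Expr)

inductive Node where
  | done                          -- accepting state
  | char (c : Char) (next : Nat)  -- consume c, then go to state next
  | split (n₁ n₂ : Nat)           -- ε-transition to either n₁ or n₂

structure NFA where
  nodes : Array Node
  start : Nat

-- Append a node and make it the new start state
def pushNode (nfa : NFA) (node : Node) : NFA :=
  { nodes := nfa.nodes.push node, start := nfa.nodes.size }

-- pushRegex nfa next e: extend nfa so that the new start state reaches
-- state next exactly when the input matches e
def pushRegex (nfa : NFA) (next : Nat) : Expr → NFA
  | .char c => pushNode nfa (.char c next)
  | .alternate e₁ e₂ =>
    let nfa₁ := pushRegex nfa next e₁            -- step 1: states for e₁
    let nfa₂ := pushRegex nfa₁ next e₂           -- step 2: states for e₂
    pushNode nfa₂ (.split nfa₁.start nfa₂.start) -- step 3: split node

Note that in this sketch nfa₁ and nfa₂ exist only as local let bindings inside pushRegex, which is exactly what causes the problem described next.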

The challenge is that intermediate data like nfa₁ and nfa₂ are declared as local variables within pushRegex. This means they are not accessible as top-level definitions outside the function, making it impossible to reuse them in different proofs. As a result, I often had to prove the same properties for these intermediate data repeatedly, which was both tedious and time-consuming.

ProofData pattern

Actually, it is possible to elevate these intermediate data structures to top-level definitions using type classes. This approach allows us to encapsulate base definitions and properties in a reusable framework, eliminating redundancy in proofs. I first encountered this idea in this Zulip message from Floris, and I adopted the term "ProofData" to describe this pattern.

In this case, we can create a ProofData class to encapsulate the inputs to the pushRegex function. By bundling these inputs, such as the initial NFA, target state, and regex expression, into a single type class, we can reuse and reference them more conveniently in proofs. For instance, with a ProofData instance in scope, we can define nfa' as the result of the compilation:

-- Bundles the inputs to pushRegex so that they can be referenced from proofs
class ProofData where
  nfa : NFA
  next : Nat
  e : Expr

-- The compiled NFA, available whenever a ProofData instance is in scope
def ProofData.nfa' [ProofData] : NFA := pushRegex nfa next e

While the base ProofData class is a good starting point, it lacks the variant-specific information required for different regexes. To address this, I created specialized subclasses for each regex variant. These subclasses extend ProofData with additional data and a property, expr_eq, that links the additional data back to the original expression. For example, the Alternate class handles the alternation case:

class Alternate extends ProofData where
  e₁ : Expr
  e₂ : Expr
  expr_eq : e = .alternate e₁ e₂

Using an Alternate instance, we can define intermediate data, such as nfa₁ and nfa₂, as top-level definitions. This approach not only organizes these definitions clearly but also allows us to prove and encapsulate their properties. These properties can then be reused throughout the correctness proofs, as shown below:

namespace Alternate

variable [Alternate]

-- Definitions of the intermediate data
-- constants in these definitions come from the ProofData and Alternate instance
def nfa₁ : NFA := nfa.pushRegex next e₁
def nfa₂ : NFA := nfa₁.pushRegex next e₂

-- properties of the intermediate data
theorem size_lt₂ : nfa₂.nodes.size < nfa'.nodes.size := ...
theorem size_lt₁ : nfa₁.nodes.size < nfa'.nodes.size := ...
theorem get_start : nfa'[nfa'.start] = .split nfa₁.start nfa₂.start := ...

end Alternate

When using these properties in correctness proofs, we can introduce an instance of the type class as a local variable. Then, Lean automatically picks up the instance when resolving the names under the Alternate namespace.

-- eq : nfa.pushRegex next e = result
induction e with
| ...
| alternate e₁ e₂ ih₁ ih₂ =>
  -- utility function to introduce an instance of Alternate
  let pd := ProofData.Alternate.intro eq

  -- At this point, variant-specific definitions and properties are available
  -- For example:
  -- - `pd.nfa₁` represents the NFA compiled for `e₁`
  -- - `pd.get_start` provides the property about the split node
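For intuition, here is a hypothetical sketch of what such an intro helper could look like in the simplified setting above. The actual helper in lean-regex differs (for example, it lives under the library's own namespaces and works with the real pushRegex signature), but the idea is just to package the data of the current branch into an instance:

-- Hypothetical sketch of the intro helper in the simplified setting.
-- The equation is accepted as an argument so the call site can pass eq.
def Alternate.intro {nfa : NFA} {next : Nat} {e₁ e₂ : Expr} {result : NFA}
    (_eq : pushRegex nfa next (.alternate e₁ e₂) = result) : Alternate where
  nfa := nfa
  next := next
  e := .alternate e₁ e₂
  e₁ := e₁
  e₂ := e₂
  expr_eq := rfl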

Result

The transition to the ProofData technique is available in this pull request. While the total line count increased due to the additional boilerplate, the reuse of properties about the intermediate data significantly shortened individual proofs. For complex proofs, I observed 30-40% reductions in line count.

Moreover, ProofData introduced a principled structure for organizing intermediate data and their associated properties. I hope this change makes it easier to scale the library as more features are added. Overall, the refactoring has proven to be a worthwhile improvement, and I plan to continue using this technique in future projects.

Shortcomings

While ProofData is a useful pattern for structuring proofs, I found some mild annoyances when applying it.

The first challenge lies in the mismatch between induction proofs and the per-variant definitions of ProofData. During an induction proof, the statement is phrased in terms of a general regex e, whereas each ProofData subclass is defined for a specific regex variant. As a result, I often needed to re-state the theorem for each variant in order to move the result into the appropriate ProofData namespace (example).
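As a hypothetical illustration of that boilerplate (pushRegex_size_lt and size_lt below are invented for this example), a fact proved once by induction on a general e gets re-stated inside the variant namespace so that it can be used through the ProofData definitions:

-- A general lemma, proved once by induction on e (hypothetical example)
theorem pushRegex_size_lt (nfa : NFA) (next : Nat) (e : Expr) :
  nfa.nodes.size < (nfa.pushRegex next e).nodes.size := ...

namespace Alternate

variable [Alternate]

-- The same fact, re-stated so that it is available through the ProofData
-- definitions (nfa, nfa', ...) whenever an Alternate instance is in scope
theorem size_lt : nfa.nodes.size < nfa'.nodes.size :=
  pushRegex_size_lt nfa next e

end Alternate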

Another issue arises when using ProofData in the hypothesis of induction statements. For instance, consider the following theorem, which states that pushRegex does not modify the contents of the original NFA (as it appends new states to the end):

theorem ProofData.pushRegex_get_lt (pd : ProofData) (i : Nat) (h : i < pd.nfa.nodes.size) :
  pd.nfa'[i] = pd.nfa[i]

While this style of statement offers a convenient way to access the result (as pd.pushRegex_get_lt), it introduced two issues in the proof:

  1. During induction, induction pd.e with loses the equations between pd.e and the variables introduced for each branch. Although generalize expr_eq : pd.e = e' can address this, it's annoying to work around.
  2. More importantly, pd : ProofData doesn't carry variant-specific data, so we still need to introduce one of the subclasses (e.g., Alternate) in each branch. Now there are two nfas coming from different proof data! Even though they are equal, things got quite confusing quickly.

Due to these issues, I reverted to introducing nfa, next, and e as free variables in theorem statements. For example:

theorem pushRegex_get_lt {nfa next e result} (eq : pushRegex nfa next e = result) (i : Nat) (h : i < nfa.nodes.size) :
  result.val[i] = nfa[i] := by
  induction e with
  | ...
  | alternate e₁ e₂ ih₁ ih₂ =>
    let pd := Alternate.intro eq
    -- proof continues

This approach is more straightforward but introduces redundancy. Each branch now contains two versions of nfa (the free variable nfa and pd.nfa). Even though they are definitionally equal, tactics like simp, rw, and omega often struggle to see this, requiring frequent use of rfl to resolve the equations.
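For instance, a branch often needs the definitional equalities spelled out before the tactics can use them. The following fragment is hypothetical (nfa_eq and next_eq are names invented for this illustration):

-- Hypothetical fragment: pd.nfa and nfa are definitionally equal, but the
-- equation has to be stated (by rfl) before simp can rewrite with it
have nfa_eq : pd.nfa = nfa := rfl
have next_eq : pd.next = next := rfl
simp only [nfa_eq, next_eq] at *
-- proof continues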

One possible improvement is a custom induction principle that operates directly on pd and replaces it with the appropriate per-variant ProofData in each case. I have not yet verified the feasibility of this approach, but it could make induction proofs in this style considerably smoother and remove the redundancy described above. Alternatively, tweaking the reducibility of the definitions might at least resolve the equality issues. Input from Lean experts would be greatly appreciated!
