
Detecting sentence boundaries in legal documents


TL;DR

  • BoostDraft needs to detect sentence boundaries to improve the accuracy of its analyzer.
  • In languages like English, the punctuation mark "." is not limited to sentence boundaries, especially in legal documents, so simply splitting at "." often produces bad results.
  • We introduce a new algorithm: developers can define rules independently; rules work like voters and emit scores, which are then summed up to determine sentence boundaries. This also allows a smooth transition from rule-based methods to statistical models.

Introduction

Detecting sentence boundaries means splitting a paragraph into sentences. The need for detecting sentence boundaries comes from at least two completely different applications:

  1. BoostDraft supports comparing two documents. Since the diff is often much smaller than the whole document, we want to hide the unchanged parts. We do not want to cut them off at arbitrary locations, so we cut at sentence boundaries.
  2. BoostDraft supports proofreading the document. Some issues are local: we can detect an occurrence from the sentence containing it. This implies that as long as we can cut the document into sentences, we can afford algorithms of higher complexity within each sentence (e.g. quadratic, which is common in many NLP algorithms). Also, knowing the boundaries of sentences can improve the results of proofreading. For example, a word at the beginning of a sentence is always capitalized, which implies that (1) if it is not capitalized then something is wrong, and (2) we cannot tell whether it is a proper noun based on its capitalization at that location.

The two applications impose different requirements. For the first one, a wrong result is quite noticeable, and we want to understand why it happens and fix bad results as soon as possible. This requires our model to be (1) explainable and (2) at least short-circuitable (meaning we can install high-priority hand-crafted rules). For the second one, since sentence boundary detection is used in a preprocessing stage, it should run very fast, ideally in linear time.

The discussion below is not restricted to the current implementation. It contains explorations, wrong attempts and future plans.

Challenge and idealized solution

A simple and direct idea is to split at ".". Let's check an example:

see United States v. X-Citement Video, Inc., 513 U.S. 64, 76-78, 115 S.
Ct. 464, 130 L. Ed.|P| 2d 372 (1994)

After dividing at ".", we get meaningless fragments

see United States v.
X-Citement Video, Inc.
, 513 U.
S.
64, 76-78, 115 S.
Ct.
464, 130 L.
Ed.
|P| 2d 372 (1994)

The pattern is quite common in legal documents.
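
For concreteness, here is a minimal Python sketch of the naive approach (the helper naive_split is just for illustration, not BoostDraft's actual code); running it on the citation reproduces fragments like the ones above.

def naive_split(paragraph: str) -> list[str]:
    """Naively treat every "." as a sentence ending."""
    fragments, current = [], []
    for ch in paragraph:
        current.append(ch)
        if ch == ".":
            fragments.append("".join(current).strip())
            current = []
    if current:
        fragments.append("".join(current).strip())
    return fragments

citation = ("see United States v. X-Citement Video, Inc., 513 U.S. 64, "
            "76-78, 115 S. Ct. 464, 130 L. Ed. 2d 372 (1994)")
print(naive_split(citation))
# ['see United States v.', 'X-Citement Video, Inc.', ', 513 U.', 'S.', ...]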

Of course we can introduce pre-defined rules to check abbreviations. The issue is that rules can compete with each other: they have different opinions on the nature of a single punctuation mark. Once we start assigning priorities to different rules, the code soon becomes hard to maintain and it is painful to add new rules. Another issue is that it is hard to debug: the result is simply a segmented list of sentences, and when we find an incorrect segmentation, we have to run the code step by step to figure out what went wrong.

It's always good to set up an ideal target at the beginning and then find out how to obtain a satisfying approximation. In general, more information is always preferred and can improve the quality of downstream analysis. One natural wish here is that the periods should come with their meaning. Instead of a simple ".", we hope to see something like

see United States v.(I am "vs") X-Citement Video, Inc. (I am "Incorporated"), 513 U.(I am "United")S.(I am "States") 64...

If we have such an annotation, it is trivial to determine sentence boundaries: just look at the symbols standing for sentence endings and we are done. In other words, the task of determining sentence boundaries is a subtask of determining the meaning of a word in a sentence. Here "word" is interpreted in a general sense: it also includes punctuation, and "." is not the only mark we need to deal with, but in the rest of this article we will focus on "." because it is the most challenging one and everything else can be handled in a similar way.

Of course, determining meaning is the holy grail of language processing, and it cannot be solved by a simple algorithm. To keep the current task practical (despite reducing it to a more general problem), we need to identify a very rough approximation of "meaning".

Linearize everything

In math, loosely speaking, the problems we can compute efficiently (i.e. in polynomial time) tend to be the linear ones, and the situation is similar here: we want to represent "meaning" by something additive.

A word or punctuation mark has a list of possible meanings. For example, "." can stand for a full stop (sentence ending, which is what we are looking for), a full dot (abbreviation), a decimal point ("12.34kg"), a delimiter ("Section 1.2"), etc.

As a concrete example, take a look at the end of the previous paragraph: ", etc.". We can make the following estimation: since we see "etc", it is likely that the "." marks an abbreviation, but we cannot rule out a real sentence ending; after all, it is perfectly normal for "etc." to appear at the end of a sentence. The other possibilities are very unlikely, so we might want something like

[full stop: 0.499, full dot(etc.): 0.499, decimal point: 0.001, delimiter: 0.001]

and if we agree on a fixed order of meanings, we can drop the labels for simplicity

[0.499, 0.499, 0.001, 0.001]

Now it looks very similar to a vector in the mathematical sense. However, not every sequence of numbers is a vector, and we need to justify the analogy by showing that adding two of them together makes sense. After all, at first glance it might seem like a combinatorial problem where we select one meaning of "." from a collection of candidates. Let's call them M_i,\ i=1,2,\dots,k, where M stands for Meaning and k is the number of possible meanings of ".".

I would suggest viewing it as a voting process: voters give a score to each candidate M_i, and the total score of candidate M_i is simply the sum of the scores from each voter. In this sense the desired output is linearized.

Returning to the previous example, we can imagine that one voter blindly votes \mathrm{fullstop} = 0.9 for every "." (because statistically most of them really are sentence endings), another voter notices the suspicious "etc" and votes 0.499 for full dot and -0.401 for full stop, and yet another voter simply assigns 0.001 to every tag that looks impossible, to avoid a zero score. If we sum them together, we recover the previous proposal, but now from three independent voters.
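
A minimal Python sketch of this voting step, with a toy set of voters (the voter functions and meaning labels are illustrative, not the production rule set):

from collections import Counter

# Each voter inspects a token and its neighbours and returns partial scores.
def prior_voter(token, prev, nxt):
    # Blindly favors "sentence ending" for every "."
    return {"full stop": 0.9} if token == "." else {}

def etc_voter(token, prev, nxt):
    # Notices the suspicious "etc" right before the "." and shifts weight toward abbreviation.
    if token == "." and prev == "etc":
        return {"full dot": 0.499, "full stop": -0.401}
    return {}

def smoothing_voter(token, prev, nxt):
    # Gives the remaining candidates a tiny score so nothing ends up at exactly zero.
    return {"decimal point": 0.001, "delimiter": 0.001} if token == "." else {}

VOTERS = [prior_voter, etc_voter, smoothing_voter]

def score_token(token, prev, nxt):
    total = Counter()
    for voter in VOTERS:
        total.update(voter(token, prev, nxt))  # update() adds the scores together
    return dict(total)

print(score_token(".", "etc", None))
# ~ {'full stop': 0.499, 'full dot': 0.499, 'decimal point': 0.001, 'delimiter': 0.001}

Adding a new rule means adding one more function to VOTERS; no existing voter has to change.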

There is a practical motivation: our original implementation consisted of many nested if-else statements and gradually got out of control. Maintaining that codebase became harder and harder: we did not know the impact of adding a new rule, it was even challenging to find the correct position in the nested code, not to mention debugging. In the voter metaphor, however, voters (corresponding to rules) make their decisions independently, and we do not need to worry about where to add a new rule: addition commutes, so the order of rules does not matter.

That's what I mean by "linear": it is about making things additive and taking advantage of the commutativity of + (which avoids nested code, with different rules now arranged along a line). This may sound very different from "linear algebra", where we fight with vectors and matrices, but they are essentially the same idea.

Technically, these numbers could be interpreted as probabilities, but that is not what we actually use. The actual numbers are log-probabilities, which are additive for independent events; this also explains why we add 0.001 to every candidate: the logarithm is undefined at 0. Consequently, we do not need to keep the scores summing to 1, but for demonstration purposes we will not replace the numbers above with their logarithms.

We have to make one further approximation. At this point, each word or punctuation mark has its own list of meanings. It is not practical to analyze and store all of them, and many are irrelevant to the current task. For example, we might not care whether a "." is a decimal point ("12.34kg") or a delimiter ("Section 1.2"): in both cases it cannot end a sentence.

Therefore, it is better to project them into a small subspace, i.e. do a very coarse classification: for the task we care about, each word is assigned one of four labels: B(eginning of a sentence), I(nner), L(ast word of a sentence), O(ut of a sentence). Let's return again to the example above

[full stop: 0.499, full dot(etc.): 0.499, decimal point: 0.001, delimiter: 0.001, ...]

The projection for full stop maps one-to-one to L(ast word of a sentence), but full dot (etc.) is ambiguous, so let's handle it in a simple way: half to L and half to I.

So the projection (notice that to preserve additivity it is again a linear transformation) is

full stop: 0.499 -> L: 0.499
full dot(etc.): 0.499 -> L: 0.2495 + I: 0.2495
decimal point: 0.001 -> I: 0.001
delimiter: 0.001 -> I: 0.001

As a result

[B: 0, I: 0.2515, O: 0, L: 0.7485]
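
A minimal sketch of the projection step, with the (illustrative) projection table hard-coded:

# Linear projection from "meaning" scores to B/I/L/O scores.
# Each meaning distributes its weight over the four labels.
PROJECTION = {
    "full stop":     {"L": 1.0},
    "full dot":      {"L": 0.5, "I": 0.5},
    "decimal point": {"I": 1.0},
    "delimiter":     {"I": 1.0},
}

def project(meaning_scores):
    bilo = {"B": 0.0, "I": 0.0, "O": 0.0, "L": 0.0}
    for meaning, score in meaning_scores.items():
        for label, weight in PROJECTION[meaning].items():
            bilo[label] += weight * score
    return bilo

print(project({"full stop": 0.499, "full dot": 0.499,
               "decimal point": 0.001, "delimiter": 0.001}))
# ~ {'B': 0.0, 'I': 0.2515, 'O': 0.0, 'L': 0.7485}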

This is a common trick in machine learning: to classify something, we predict a vector of scores (s_i), and the probability of each target label t is given by the softmax \frac{e^{s_t}}{\sum_i e^{s_i}}. Notice that the negative of its logarithm equals \log(1+\sum_{i\neq t} e^{s_i-s_t}); in the n-select-k case this is generalized to \log(1+\sum_i e^{s_i})+\log(1+\sum_j e^{-s_j}), where the first sum runs over the non-selected labels and the second over the selected ones.
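
For readers who want to make the formulas concrete, here is a small numeric sketch with made-up scores (note that the n-select-k loss compares each score against a fixed threshold of 0 rather than against the target score, so it generalizes rather than exactly reproduces the softmax case):

import math

def softmax_nll(scores, target):
    # -log softmax(target) = log(1 + sum over i != target of exp(s_i - s_target))
    return math.log(1.0 + sum(math.exp(s - scores[target])
                              for i, s in enumerate(scores) if i != target))

def n_select_k_loss(scores, selected):
    # log(1 + sum over non-selected of exp(s_i)) + log(1 + sum over selected of exp(-s_j))
    not_sel = sum(math.exp(s) for i, s in enumerate(scores) if i not in selected)
    sel = sum(math.exp(-s) for i, s in enumerate(scores) if i in selected)
    return math.log(1.0 + not_sel) + math.log(1.0 + sel)

scores = [2.0, -1.0, 0.5]   # made-up scores for three candidate labels
print(softmax_nll(scores, 0))
print(n_select_k_loss(scores, {0}))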

The detailed formulae are not important (there is some heuristic argument behind them, but experiments show that other functions also work), so feel free to skip them if they are confusing. The point is that we arrive at an interpolation between machine-learning algorithms and rule-based algorithms: it is possible to let some of the voters be neural networks and the remaining ones be hand-crafted rules. The same pattern can also be applied to other problems. For example, recall that in the motivation part, we explained that one source of the current task is determining undefined terms in a document, which can be reduced to a very similar problem (just change "B/I/L/O of a sentence" to "B/I/L/O of an undefined term"). There are also models that directly employ the n-select-k strategy to select a few possible pairs of (B, L) from all possible pairs (that might sound expensive, but remember the current task already breaks the text into sentences, after which an algorithm with quadratic complexity becomes acceptable). In this sense, by linearizing the problem, we gain a lot of flexibility.

Let's return to the current task and make some final comments about the output side. The output is not uniformly distributed: most words are labelled I, while punctuation like "." ";" "?" "!" is very likely to be labelled L. Also, making the final decision is more complicated than selecting the highest score for each token, because the result might not be consistent (B and L must be matched), so we need to impose a penalty score on such combinations. Selecting the sequence with maximal score is a standard dynamic programming problem and can be solved in linear time, as sketched below; the details are otherwise skipped because there is nothing new here and we have already spent enough time on the output side. To keep the line of thought going, let's turn to the input side.
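
A minimal sketch of that dynamic program (Viterbi-style), assuming per-token B/I/L/O scores have already been computed; the transition-penalty table is illustrative, not the production values.

LABELS = ["B", "I", "L", "O"]

# Penalty added when the second label follows the first; anything not listed
# is effectively forbidden (e.g. a new B may only follow L or O).
PENALTY = {
    ("L", "B"): 0.0, ("O", "B"): 0.0, ("B", "I"): 0.0, ("I", "I"): 0.0,
    ("B", "L"): 0.0, ("I", "L"): 0.0, ("L", "O"): 0.0, ("O", "O"): 0.0,
}
FORBIDDEN = 1e6

def decode(token_scores):
    # token_scores: list of {label: score} dicts, one per token.
    # Returns the label sequence maximizing total score minus transition penalties.
    best = {lab: token_scores[0][lab] for lab in LABELS}
    back = []
    for scores in token_scores[1:]:
        new_best, pointers = {}, {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: best[p] - PENALTY.get((p, lab), FORBIDDEN))
            new_best[lab] = best[prev] - PENALTY.get((prev, lab), FORBIDDEN) + scores[lab]
            pointers[lab] = prev
        back.append(pointers)
        best = new_best
    label = max(LABELS, key=best.get)
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return list(reversed(path))

# Example: four tokens such as "Hello world . Next"
print(decode([{"B": 1.0, "I": 0.1, "L": 0.0, "O": 0.0},
              {"B": 0.0, "I": 0.8, "L": 0.2, "O": 0.0},
              {"B": 0.0, "I": 0.3, "L": 0.9, "O": 0.0},
              {"B": 0.7, "I": 0.2, "L": 0.0, "O": 0.1}]))
# ['B', 'I', 'L', 'B']

The running time is linear in the number of tokens (times a constant factor for the 4x4 label transitions).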

Input: meaning and preprocessing

The input is a paragraph of text. Recall that we want to assign a label to each word, so before setting up voters and computing scores, we need to break the input into smaller pieces.

Words are not the minimal unit of language; ideally, to understand the meaning of a new word, we want to break it into morphemes. For example, we can freely add "dis-"/"un-" to many words to coin a word with the opposite meaning, even if it does not appear in dictionaries. By analyzing etymology we can also find many smaller building blocks of Latin/Greek/... origin. Those linguistics-motivated word breakers might not be very practical, but nowadays there are other ways of breaking words into smaller units (called tokens), like byte-pair encoding, which work well in many situations.

However, we still use a word-level tokenizer. One reason is that we also want to support word-level searching: sometimes we have a few defined terms and want to know how many times they are used in the document. Character-level searching is not satisfying because a match might stop in the middle of a word. By introducing a word-level tokenizer, we automatically get the desired result (in a bit more detail, which can be safely skipped: the search is based on a suffix array, and the implementation already requires us to convert characters into integers; previously we used UTF-32 code points as those integers to deal with some dark corners of Unicode, but as a by-product we can now simply use token ids as the integers).
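
A minimal sketch of such a word-level tokenizer that also produces integer token ids (the regex and id assignment are illustrative, not the production implementation):

import re

# Words, numbers, and individual punctuation marks become separate tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

class WordTokenizer:
    def __init__(self):
        self.vocab = {}   # token text -> integer id

    def tokenize(self, text):
        tokens = TOKEN_RE.findall(text)
        # Assign a stable integer id to each distinct token; these ids can then
        # feed the suffix-array-based search mentioned above.
        ids = [self.vocab.setdefault(tok, len(self.vocab)) for tok in tokens]
        return tokens, ids

tokens, ids = WordTokenizer().tokenize("see United States v. X-Citement Video, Inc.")
print(tokens)
# ['see', 'United', 'States', 'v', '.', 'X', '-', 'Citement', 'Video', ',', 'Inc', '.']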

Now the input is a sequence of words. The standard way of associating meaning with words is to embed them as vectors, in other words, sequences of numbers. There are heuristic explanations of the meaning of those numbers, but we will not go over them here. One reason is that such vectors are not very explainable. A more direct reason is that, for the current task of determining sentence boundaries, most semantic information is useless. If we see "A v. B" we know that "." stands for "vs" because of the preceding "v", but it would not make a big difference if it were not "v" but some other abbreviation. Therefore, we use a more compact representation: "meaning" is encoded as a 0-1 array of binary features, such as whether the word starts with a capital letter, whether it is a number, whether it falls into a small dictionary of known abbreviations, etc. For example, "v" becomes v = [0, 0, 1] because it is NOT capitalized and NOT a number, but IS in the list of known abbreviations. This is just a showcase; in practice the list (i.e. the dimension of the vector) is longer. Note that this does not rule out introducing a pre-trained vector embedding later: if one day we find that morphological information is not enough, there will be no difficulty adding semantic information to this model.
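
A minimal sketch of this feature extraction (the three features and the abbreviation list are illustrative; the real vector is longer):

KNOWN_ABBREVIATIONS = {"v", "inc", "etc", "no", "sec"}   # illustrative, not the real dictionary

def features(token):
    # Encode a token as a 0-1 feature vector.
    return [
        int(token[:1].isupper()),                    # starts with a capital letter
        int(token.isdigit()),                        # is a number
        int(token.lower() in KNOWN_ABBREVIATIONS),   # known abbreviation
    ]

print(features("v"))       # [0, 0, 1]
print(features("United"))  # [1, 0, 0]
print(features("513"))     # [0, 1, 0]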

Now we define the meaning of each word by a 0-1 vector. But this captures only properties of the word itself and ignores the context. So the next task is to transform the sequence of word vectors from the space of 0-1 vectors to another space of meaning-within-context. It will again be a rough approximation of the real "meaning" (recall that we want a very fast implementation so that runtime performance is acceptable even on low-end computers), but hopefully it brings some insight into why this looks like a first step towards the idealized algorithm.

Transform the input sequence

After explaining the desired output and the preprocessing of the input, the remaining task is to connect them, or in other words, to build a transformation from one sequence of vectors to another (in a different space).

Let's consider what is needed for this transformation. First, we are mapping from the space of 0-1 vectors to the space of linearized B/I/L/O labels. Next, the context should be included in the computation to resolve ambiguity. The size of the context is small, and in almost all cases there is no need to consider long-range coupling like self-attention. In summary, what we need to predict are scores s_{B,I,L,O}(\mathrm{context})=s(c_1,\dots), where the context is a 0-1 sequence c_i and i ranges from 1 to the length of the context times the number of features. In practice the context length is on the order of ~10^1 tokens.

It is still complicated, so we make the following assumptions. The first is that the context can be split into a preceding part and a following part of equal length (say 3 words each), so we can write the score as s(\mathrm{prev}, \mathrm{next}). Next we assume it can be written as a linear combination s = s(\mathrm{prev}) + s(\mathrm{next}). One might complain that there exist cross terms: the "." has a special meaning only if both prev and next satisfy some condition. To fix this, recall that the labels are not distributed uniformly, which means we can put a constant offset on the corresponding label. For example, if by default s_L = 100, then we can break the condition s_I(\mathrm{prev}, \mathrm{next}) = 150 into something like s_I(\mathrm{prev}) = 70 and s_I(\mathrm{next}) = 40 (the sum does not have to match exactly): if both conditions hold, the total 110 exceeds the default 100; if only one of them holds, the score stays below s_L and the "." is more likely to be considered a full stop.
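
A small numeric sketch of this offset trick, using the illustrative numbers from the paragraph above:

# Default offsets per label: a "." is assumed to be a sentence ending (L)
# unless enough contextual evidence pushes it toward I.
DEFAULT = {"B": 0, "I": 0, "L": 100, "O": 0}

def score_dot(prev_matches, next_matches):
    # Decompose the joint condition into independent prev/next contributions.
    scores = dict(DEFAULT)
    if prev_matches:
        scores["I"] += 70    # contribution from the preceding context
    if next_matches:
        scores["I"] += 40    # contribution from the following context
    return scores

for prev_m, next_m in [(True, True), (True, False), (False, True)]:
    s = score_dot(prev_m, next_m)
    print(prev_m, next_m, "->", max(s, key=s.get))
# True True   -> I   (110 > 100: the "." stays inside the sentence)
# True False  -> L   (70 < 100: treated as a full stop)
# False True  -> L   (40 < 100: treated as a full stop)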

The final assumption is that the space of preceding (or following) contexts is itself degenerate: they fall into (almost) disjoint categories, and the number of categories is much smaller than 2^d, where d is the number of the c_i above (context length times number of features). In that case, we can again decompose s_{B,I,L,O} into a linear combination, where each term just checks whether the original vector projects onto its category. These terms correspond to our original voter metaphor, where each voter gives an independent score.

As a result, under the above assumptions, we eventually linearize the original problem and decouple the rules.
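
Putting the pieces together, here is a minimal sketch of the resulting architecture (the rules, feature names, and scores are illustrative; features are shown as named booleans for readability):

# A rule is a pair: (predicate over the context features, partial B/I/L/O scores).
# Rules are independent "voters"; their scores are simply summed.
RULES = [
    # Default: a "." is most likely a sentence ending.
    (lambda prev, cur, nxt: cur["is_period"],
     {"L": 100}),
    # Preceding token is a known abbreviation: push toward "inner".
    (lambda prev, cur, nxt: cur["is_period"] and prev["is_known_abbrev"],
     {"I": 70}),
    # Following token is not capitalized: also push toward "inner".
    (lambda prev, cur, nxt: cur["is_period"] and not nxt["starts_capital"],
     {"I": 40}),
]

def bilo_scores(prev, cur, nxt):
    total = {"B": 0, "I": 0, "L": 0, "O": 0}
    for predicate, partial in RULES:
        if predicate(prev, cur, nxt):
            for label, score in partial.items():
                total[label] += score
    return total

prev = {"is_period": False, "is_known_abbrev": True,  "starts_capital": False}
cur  = {"is_period": True,  "is_known_abbrev": False, "starts_capital": False}
nxt  = {"is_period": False, "is_known_abbrev": False, "starts_capital": True}
print(bilo_scores(prev, cur, nxt))   # {'B': 0, 'I': 70, 'L': 100, 'O': 0} -> still a full stop

Each rule can be added, removed, or re-weighted without touching the others, which is exactly the decoupling described above.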

Future works

We are close to the end of this article. Throughout, I have tried to consider a general form of the problem and then make assumptions and approximations, instead of directly describing the implementation. The reason is that I hope to solve related problems too, instead of treating sentence boundary detection as an isolated task. Looking at the idealized form and the approximations may also hint at future improvements.

We have already described the potential connection with the task of determining undefined terms, so it will not be repeated here. Apart from that, it would generally be useful to use the result to improve the quality of parsing or proofreading, so instead of jumping directly to the space of B/I/L/O labels, it might be useful to build some intermediate representation first and then use it as input to different downstream tasks.

Another important direction is that, instead of fixing the model manually only when someone discovers weird behavior, it might be better to learn the patterns from data. There is not a lot of labelled data, so an unsupervised method is preferable. For example, one way of obtaining word embeddings is to predict a word given its context; whenever a nontrivial prediction is found (e.g. the model is significantly confident about the appearance of "." or some other special word, in other words it finds a fixed combination), it is likely that the word is used in a special way (abbreviation / proper noun / etc.), which is exactly what we have been looking for throughout this article. Since the language used in legal documents is rather special, we still need to collect more data and run experiments to check the patterns in the data and the resulting performance, but it might bring significant improvements and hence deserves a try.
