Detecting sentence boundaries in legal documents
TL;DR
- BoostDraft needs to detect sentence boundaries to improve the accuracy of its analyzer.
- In languages like English, the punctuation mark "." is not limited to appearing at sentence boundaries, especially in legal documents, so simply splitting at "." often produces bad results.
- We introduce a new algorithm: developers can define rules independently; each rule acts like a voter and emits scores, which are then summed up to determine sentence boundaries. This also allows a smooth transition from rule-based methods to statistical models.
Introduction
Detecting sentence boundaries means splitting a paragraph into sentences. The need for detecting sentence boundaries comes from at least two completely different applications:
- BoostDraft supports comparison of two documents. Since the diff is often much smaller than the whole document, we want to hide the unchanged parts. We do not want to cut the text at some arbitrary location, so we cut at sentence boundaries.
- BoostDraft supports proofreading the document. Some issues are local: we can detect them from the sentence containing them alone. This implies that as long as we can cut the document into sentences, we can afford algorithms of higher complexity within each sentence (e.g. quadratic, which is common in many NLP algorithms). Also, knowing the boundaries of sentences can improve the result of proofreading. For example, a word at the beginning of a sentence is always capitalized, which implies that (1) if it is not capitalized, something is wrong, and (2) we cannot tell whether a word is a proper noun based on its capitalization at that position.
The two applications have different requirements. For the first one, wrong results can be quite noticeable, and we want to understand why they happen and fix them as soon as possible. This requires our model to be (1) explainable and (2) at least short-circuitable (meaning we can add high-priority hand-crafted rules). For the second one, since the detection of sentence boundaries is used in a preprocessing stage, it should run very fast, ideally in linear time.
The discussion below is not restricted to the current implementation: it contains explorations, failed attempts and future plans.
Challenge and idealized solution
A simple and direct idea is to split at ".". Let's check an example:
see United States v. X-Citement Video, Inc., 513 U.S. 64, 76-78, 115 S.
Ct. 464, 130 L. Ed.|P| 2d 372 (1994)
After splitting at ".", we get meaningless fragments:
see United States v.
X-Citement Video, Inc.
, 513 U.
S.
64, 76-78, 115 S.
Ct.
464, 130 L.
Ed.
|P| 2d 372 (1994)
The pattern is quite common in legal documents.
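As a minimal sketch (not BoostDraft's actual implementation), the naive approach is literally a one-liner, and running it on the citation above reproduces exactly the kind of fragments shown:

```python
# Naive baseline, for illustration only: split a paragraph at every ".".
text = ("see United States v. X-Citement Video, Inc., 513 U.S. 64, 76-78, "
        "115 S. Ct. 464, 130 L. Ed. 2d 372 (1994)")

fragments = [piece.strip() for piece in text.split(".") if piece.strip()]
for piece in fragments:
    print(piece)
# Prints meaningless pieces such as "S", "Ct", "464, 130 L", "Ed", ...
```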
Of course we can introduce pre-defined rules to check abbreviations. The issue is that rules can compete with each other: they have different opinions about the nature of a single punctuation mark. Once we start assigning priorities to different rules, the code soon becomes hard to maintain, and adding new rules is painful. Another issue is that it is hard to debug: the result is simply a segmented list of sentences, and when we find an incorrect segmentation, we have to step through the code to figure out what went wrong.
It's always good to set up an ideal target at the beginning and find out how to obtain a satisfying approximation. In general, more information is always preferred and can improve the quality of downstream analysis. Here one natural wish is that the periods should come with their meaning. Instead of a simple ".", we hope to see something like
see United States v.(I am "vs") X-Citement Video, Inc. (I am "Incorporated"), 513 U.(I am "United")S.(I am "States") 64...
If we have such an annotation, it is trivial to determine sentence boundaries: just look at the symbols standing for sentence endings and we are done. In other words, the task of determining sentence boundaries is a subtask of determining the meaning of a word in a sentence. Here "word" is interpreted in a general sense: it also includes punctuation marks, and "." is not the only one we need to deal with, but in the rest of this article we will focus on "." because it is the most challenging one and everything else can be handled in a similar way.
Of course, deciding meaning is something like the holy grail of language processing, and it is impossible to solve with a simple algorithm. To keep the current task practical (even though we have reduced it to a much more general problem), we need to identify a very rough approximation of "meaning".
Linearize everything
In math, linear problems are essentially the only ones we can compute efficiently (i.e. in polynomial time), and the situation is similar here: we want to represent "meaning" by something additive.
A word or punctuation mark has a list of possible meanings. For example, "." can stand for a full stop (sentence ending, which is what we are looking for), a full dot (abbreviation), a decimal point ("12.34kg"), a delimiter ("Section 1.2"), etc.
As a concrete example, we can take a look at the end of the previous paragraph, ", etc.", and make the following estimation: since we see "etc", the "." likely stands for an abbreviation, but we cannot rule out the possibility of a real sentence ending; after all, it is perfectly normal for "etc." to appear at the end of a sentence. The other possibilities are very unlikely, so as a result we might want something like
[full stop: 0.499, full dot(etc.): 0.499, decimal point: 0.001, delimiter: 0.001]
and if we agree on a particular order of meanings, we can drop the labels for simplicity:
[0.499, 0.499, 0.001, 0.001]
Now it looks very similar to a vector in the mathematical sense. However, not every sequence of numbers deserves to be called a vector: we need to justify it by showing that adding two of them together makes sense. After all, at first glance this might seem like a combinatorial problem in which we select one meaning of "." from a collection of candidates; let's call these possible meanings the candidates.
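As a hypothetical encoding (the candidate names and their order are just a convention chosen for this sketch), the labelled and unlabelled forms could look like this in code:

```python
# Agreed-upon order of candidate meanings for "." (a convention for this sketch).
CANDIDATES = ["full_stop", "full_dot", "decimal_point", "delimiter"]

# Labelled form: explicit but verbose.
labelled = {"full_stop": 0.499, "full_dot": 0.499,
            "decimal_point": 0.001, "delimiter": 0.001}

# Unlabelled form: just the numbers, in the agreed order.
unlabelled = [labelled[name] for name in CANDIDATES]
print(unlabelled)   # [0.499, 0.499, 0.001, 0.001]
```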
I would suggest viewing it as a voting process: each voter gives a score to every candidate, and the scores are added up.
Returning to the previous example, we can imagine that one voter blindly votes the same score for every candidate, while another voter that recognizes "etc" votes strongly for the abbreviation reading; the final score vector is simply the sum of all the votes.
There is a practical motivation: our original implementation consists of a lot of nested if-else statements, and it has gradually grown out of control. Maintaining the current codebase is getting harder and harder: we don't know the impact of adding a new rule, it is even challenging to find the correct place to put it in the nested code, let alone to debug it. However, in the voter metaphor above, voters (corresponding to rules) make their decisions independently, and we do not need to worry about where to add a new rule: addition always commutes, so the order of the rules does not matter.
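Here is a minimal sketch of that pattern (the rule names and the scores are made up for illustration; the production rules are different): each voter independently inspects the token and its neighbours, returns one score per candidate, and the scores are summed.

```python
from typing import Callable, List

# Candidate meanings of "." in the agreed order (as in the sketch above).
CANDIDATES = ["full_stop", "full_dot", "decimal_point", "delimiter"]

# A voter sees the token list and the index of the "." under consideration.
Voter = Callable[[List[str], int], List[float]]

def vote_known_abbreviation(tokens: List[str], i: int) -> List[float]:
    """If the previous token is a known abbreviation, favour the abbreviation dot."""
    known = {"etc", "Inc", "v", "U", "S", "Ct", "L", "Ed"}  # toy dictionary
    if i > 0 and tokens[i - 1] in known:
        return [0.0, 2.0, 0.0, 0.0]
    return [0.0] * 4

def vote_between_digits(tokens: List[str], i: int) -> List[float]:
    """Digits on both sides suggest a decimal point or a section delimiter."""
    if 0 < i < len(tokens) - 1 and tokens[i - 1].isdigit() and tokens[i + 1].isdigit():
        return [0.0, 0.0, 1.0, 1.0]
    return [0.0] * 4

def vote_followed_by_capital(tokens: List[str], i: int) -> List[float]:
    """A capitalized word after "." weakly suggests a real sentence ending."""
    if i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
        return [1.0, 0.0, 0.0, 0.0]
    return [0.0] * 4

VOTERS: List[Voter] = [vote_known_abbreviation, vote_between_digits, vote_followed_by_capital]

def score_dot(tokens: List[str], i: int) -> List[float]:
    """Sum the votes; adding a new rule is just appending it to VOTERS."""
    total = [0.0] * len(CANDIDATES)
    for voter in VOTERS:
        for k, s in enumerate(voter(tokens, i)):
            total[k] += s
    return total

tokens = [",", "etc", ".", "The", "next", "sentence", "follows"]
print(dict(zip(CANDIDATES, score_dot(tokens, 2))))
# {'full_stop': 1.0, 'full_dot': 2.0, 'decimal_point': 0.0, 'delimiter': 0.0}
```

The point of the design is that each voter is a standalone function: none of them needs to know about the others, and the order in which they are registered does not matter.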
That's what I mean by "linear": it is about making things additive and taking advantage of the fact that sums are cheap to compute and do not depend on the order of their terms.
Technically, these numbers can be interpreted as probabilities, but raw probabilities are not what we actually use. The actual numbers are log-probabilities, which are additive for independent events; this is also why adding the scores is the right operation.
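As a tiny sanity check of that interpretation (a sketch, not the exact scoring used in production), adding log-probabilities of independent pieces of evidence is the same as multiplying their probabilities:

```python
import math

p1, p2 = 0.8, 0.6   # hypothetical probabilities from two independent voters
assert math.isclose(math.exp(math.log(p1) + math.log(p2)), p1 * p2)
print(math.log(p1) + math.log(p2))   # adding logs == multiplying probabilities
```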
We have to make one further approximation. At this point, each word/punctuation mark has its own list of meanings. It is not practical to analyze and store all of them, and many are irrelevant to our current task. For example, one might not care whether "." is a decimal point ("12.34kg") or a delimiter ("Section 1.2"): in either case it cannot end a sentence.
Therefore, it is better to project them into a small subspace, i.e. do a very coarse classification: for the task we care about, each word can be assigned one of four labels: B (the beginning of a sentence), I (inside a sentence), L (the last word of a sentence) and O (outside any sentence). Recall the score vector we assigned to the "." in ", etc.":
[full stop: 0.499, full dot(etc.): 0.499, decimal point: 0.001, delimiter: 0.001, ...]
The projection sends full stop entirely (one-to-one) to L, while full dot is split evenly between L and I, since an abbreviation dot may or may not also end the sentence.
So the projection (notice that, to preserve additivity, it is again a linear transformation) is:
full stop: 0.499 -> L: 0.499
full dot(etc.): 0.499 -> L: 0.2495 + I: 0.2495
decimal point: 0.001 -> I: 0.001
delimiter: 0.001 -> I: 0.001
As a result
[B: 0, I: 0.2515, O: 0, L: 0.7485]
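Written out as code (a sketch; the numbers are the illustrative ones above), the projection is just a small matrix applied to the meaning scores:

```python
# A sketch of the projection as a linear map (matrix-vector product).
# Candidate order: [full_stop, full_dot, decimal_point, delimiter]
# Label order:     [B, I, O, L]
PROJECTION = [
    [0.0, 0.0, 0.0, 1.0],   # full stop      -> L
    [0.0, 0.5, 0.0, 0.5],   # full dot       -> split between I and L
    [0.0, 1.0, 0.0, 0.0],   # decimal point  -> I
    [0.0, 1.0, 0.0, 0.0],   # delimiter      -> I
]

def project(meaning_scores):
    """Project candidate-meaning scores onto B/I/O/L scores."""
    return [sum(m * row[j] for m, row in zip(meaning_scores, PROJECTION))
            for j in range(4)]

print(project([0.499, 0.499, 0.001, 0.001]))   # approximately [0.0, 0.2515, 0.0, 0.7485]
```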
This is a common trick in machine learning: to classify something, we predict a vector of scores, one per class, and pick the class with the highest score.
The detailed formulae are not important (there is some heuristic argument behind them, but experiments show that other functions work as well), so feel free to skip them if they become confusing. The point is that we arrive at an interpolation between machine learning algorithms and rule-based algorithms: it is possible to let some of the voters be neural networks and the remaining ones be hand-crafted rules. The same pattern can also potentially be applied to other problems. For example, recall from the motivation that one way the current task arises is in trying to determine undefined terms in a document, and that task can be reduced to a very similar problem (just change "B/I/L/O of a sentence" to "B/I/L/O of an undefined term"). There are also models that directly employ the same B/I/L/O style of sequence labelling, which keeps the door open to reusing them later.
Let's return to the current task and make some final comments about the output side. The output is not uniformly distributed: most words are labelled I, B and L each appear only once per sentence, and O is rare, so the label distribution is heavily imbalanced.
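Once every token carries a B/I/L/O score vector, recovering sentence boundaries is straightforward: pick the highest-scoring label and cut after every L. A simplified sketch (it ignores ties and the class imbalance just mentioned):

```python
LABELS = ["B", "I", "O", "L"]

def to_sentences(tokens, label_scores):
    """Cut the token sequence after every token whose top-scoring label is L."""
    sentences, current = [], []
    for token, scores in zip(tokens, label_scores):
        current.append(token)
        best = LABELS[max(range(len(LABELS)), key=lambda k: scores[k])]
        if best == "L":
            sentences.append(current)
            current = []
    if current:                      # trailing tokens with no final L
        sentences.append(current)
    return sentences

tokens = ["This", "is", "short", ".", "Another", "one", "."]
scores = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1],
          [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
print(to_sentences(tokens, scores))
# [['This', 'is', 'short', '.'], ['Another', 'one', '.']]
```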
Input: meaning and preprocessing
The input is a paragraph of text. Recall that we want to assign a label to each word, so before setting up voters and computing scores, we need to break the input into smaller pieces.
Words are not the minimal unit of language; ideally, to understand the meaning of a new word, we would break it into morphemes. For example, we can freely add "dis-"/"un-" to many words to coin a word with the opposite meaning, even if the result does not appear in dictionaries. By analyzing etymology we can also find many smaller building blocks of Latin/Greek/... origin. Such linguistically motivated word breakers might not be very practical, but nowadays there are other ways of breaking words into smaller units (called tokens), such as byte-pair encoding, which work well in many situations.
However, we still use a word-level tokenizer. One reason is that we also want to support word-level searching: sometimes we have a few defined terms and want to know how many times they are used in the document. Character-level searching is not satisfying because a match might start or end in the middle of a word. By introducing a word-level tokenizer, we automatically get the desired result. (To be a bit more detailed, and this can be safely skipped: the search is based on a suffix array, and that algorithm already requires us to convert characters into integers; previously we used UTF-32 code points as the integers to deal with some dark corners of Unicode, but as a by-product we can now simply use token ids as the integers.)
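A toy word-level tokenizer (purely illustrative; the production tokenizer that feeds the suffix array is more careful) might look like this:

```python
import re

def tokenize(text):
    """Toy word-level tokenizer: words, numbers and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def to_ids(tokens, vocab=None):
    """Map each distinct token to a small integer id (usable by a suffix array)."""
    vocab = {} if vocab is None else vocab
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    return ids, vocab

tokens = tokenize("see United States v. X-Citement Video, Inc., 513 U.S. 64")
ids, vocab = to_ids(tokens)
print(tokens)
print(ids)
```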
Now the input is a sequence of words. The standard way of associating meaning with words is to embed them as vectors, in other words, sequences of numbers. There are heuristic explanations of the meaning of these numbers, but we will not go over them here. One reason is that such vectors are not very explainable. A more direct reason is that, for the current task of determining sentence boundaries, most semantic information is useless. If we see "A v. B" we know that "." stands for "vs" because of the preceding "v", but it would not make a big difference if it were not "v" but some other abbreviation. Therefore, we will use a more compact representation: "meaning" is now encoded as a 0-1 array of binary features, such as whether the word starts with a capital letter, whether it is a number, whether it belongs to a small dictionary of known abbreviations, etc. For example, "v" becomes something like [starts with a capital letter: 0, is a number: 0, known abbreviation: 1, ...].
From now on, we define the meaning of each word by such a binary feature vector.
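A sketch of such an encoding, with a hypothetical feature set (the production feature set is richer):

```python
KNOWN_ABBREVIATIONS = {"v", "Inc", "etc", "U", "S", "Ct", "Ed", "No"}   # toy dictionary

def features(token):
    """Encode a token as a 0/1 array of simple surface features."""
    return [
        int(token[:1].isupper()),                   # starts with a capital letter
        int(token.isdigit()),                       # is a number
        int(token in KNOWN_ABBREVIATIONS),          # known abbreviation
        int(len(token) == 1 and token.isalpha()),   # single letter
    ]

print(features("v"))       # [0, 0, 1, 1]
print(features("513"))     # [0, 1, 0, 0]
print(features("States"))  # [1, 0, 0, 0]
```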
Transform the input sequence
After explaining the desired output and the preprocessing of the input, the remaining task is to connect them, or in other words, to build a transformation from one sequence of vectors to another (in a different space).
Let's consider what is needed for this transformation. First, we are mapping from the space of binary feature vectors into the space of B/I/L/O scores. Second, the label of a word cannot be determined from the word alone: it also depends on the surrounding context, so in principle the transformation would have to look at the whole sequence.
It is still complicated, so we make the following assumptions. The first one is that the context can be split into a preceding part and a following part of equal, fixed length (like a window of a few words on each side).
The final assumption is that the space of preceding (or following) contexts is itself degenerate: contexts fall into (at least almost) disjoint categories, and the number of categories is much smaller than the number of possible contexts, so each category can be recognized by a simple rule.
As a result, under the above assumptions, we eventually linearize the original problem and decouple the rules.
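Concretely, under these assumptions a rule only ever inspects a fixed-length window of feature vectors around the token; a hedged sketch (the window size and feature indices are assumptions for illustration):

```python
WINDOW = 2          # context tokens on each side (an arbitrary choice for this sketch)
PAD = [0, 0, 0, 0]  # padding feature vector used at the edges of the paragraph

def context(features_seq, i):
    """Fixed-length preceding and following context around position i."""
    before = features_seq[max(0, i - WINDOW):i]
    after = features_seq[i + 1:i + 1 + WINDOW]
    before = [PAD] * (WINDOW - len(before)) + before
    after = after + [PAD] * (WINDOW - len(after))
    return before, after

def rule_abbreviation_before_dot(before, after):
    """Fires when the immediately preceding token is a known abbreviation
    (feature index 2 in the sketch above); favours the abbreviation dot."""
    if before[-1][2] == 1:
        return [0.0, 2.0, 0.0, 0.0]   # candidate order: full_stop, full_dot, ...
    return [0.0, 0.0, 0.0, 0.0]

# ", etc. The" encoded with the sketch features [capital, number, abbreviation, single letter]
features_seq = [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
before, after = context(features_seq, 2)               # position of "."
print(rule_abbreviation_before_dot(before, after))     # [0.0, 2.0, 0.0, 0.0]
```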
Future work
We are close to the end of this article. I have kept trying to consider a general form of the problem and then make assumptions and approximations, instead of directly describing the implementation. The reason is that I hope to solve related problems too, rather than treating sentence boundary detection as an isolated task. Also, looking at the idealized form and the approximations we made may hint at future improvements.
We have already described its potential connection with the task of determining undefined terms, so it will not be repeated here. Apart from that, it would generally be useful to use the result to improve the quality of parsing or proofreading, so instead of jumping directly to the space of B/I/L/O labels, it may be better to expose the richer meaning vectors so that downstream analysis can reuse them.
Another important direction is that, instead of fixing the model manually only when someone discovers weird behavior, it might be better to learn the patterns from data. There is not a lot of labelled data, so an unsupervised method is preferable. For example, one way of getting word embeddings is to predict a word given its context, and as long as a nontrivial prediction is found (e.g. the model has significant confidence about the appearance of ".", or some other special word; in other words, it finds a fixed combination), it is likely that the word is used in a special way (abbreviation, proper noun, etc.), which is exactly the kind of information we have been looking for throughout this article. Since the language used in legal documents is rather special, we still need to collect more data and run experiments to check the patterns in the data and the resulting performance, but this direction might bring significant improvements and hence deserves a try.
Discussion