
The Challenges of Making a Multi-language Parser


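Summary

This article describes the challenges that BoostDraft's Document Parser team ran into while building a unified parser that supports multiple languages.

BoostDraft provides a Word add-in for legal professionals. The Document Parser that underpins its features detects and extracts structural elements of contracts, such as defined terms and article numbers. Until now, we implemented a separate parser for each of Japanese, English, and Korean. To make maintenance more efficient, we are working on migrating to a unified codebase.

This article focuses on the challenges that arose when handling Japanese and English in a single codebase, and covers the following specific topics:

  • The difficulty of determining word boundaries in Japanese, and the handling of word-form normalization in English
  • Dealing with the Japanese convention of omitting paragraph numbers (the first paragraph, 第一項, is not shown)
  • The difficulty of identifying defined terms that are written in table form

It also describes what made the work enjoyable and intellectually stimulating along the way.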

Introduction

BoostDraft offers a Microsoft Word add-in that functions like an “IDE for legal professionals.”

I’m part of the team responsible for building and maintaining the document parsers: components that identify and extract key structural elements from legal documents, such as defined terms, sections, cross-references, and more. These parsers are the foundation for other BoostDraft functions such as error checking, auto formatting, and displaying highlights.

BoostDraft currently supports documents written in Japanese, English, and Korean. Up until recently, we developed separate parser implementations for each language. This allowed us to move fast at the beginning, but as our customer base grew and feature complexity increased, maintaining language-specific parsers quickly became inefficient and error-prone. Every bug fix or feature enhancement had to be duplicated across multiple code paths.

This article shares some of the challenges we faced as we transitioned toward building a unified multilingual parser, particularly the nuanced problems that arise when trying to support both Japanese and English within the same codebase.

Frequently Used Terms

  • Definition Parser: Identifies “defined terms” in a document—special words or phrases (like "Effective Date" or "反社会的勢力") that are given precise meanings elsewhere in the text.
  • Section Parser: Analyzes the document structure to detect article/section numbers, headings, and hierarchical levels (e.g., "Section 1.2", "第1条第2項").

Challenges We Faced

Counting Word Occurrences: not as easy as it sounds!

In English, words are separated by spaces. That’s convenient for computers. For example, the sentence “I ate a pineapple” clearly contains the word “pineapple”, not “apple”.

In Japanese, however, there are no spaces between words. So a string like 投資関連契約 (Investment-related agreement) contains 投資 (investment) as a substring, but that doesn't mean it's a meaningful match. Whether 投資 should be counted depends on context.

For example:
✅ まずは投資の基本を理解しましょう → contains a real use of 投資
❌ 投資関連契約を締結した → 投資 is part of a compound term and should not be counted separately

English also brings its own challenges. Singular and plural forms such as “investor” and “investors” have to be normalized to the same base form to avoid overcounting or undercounting. Fortunately, we already had some lemmatization logic in the system for English, which we could reuse.
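The two counting problems above can be illustrated with a minimal Python sketch. It is not BoostDraft's implementation: the kanji-adjacency rule for Japanese and the naive plural stripping for English are simplifying assumptions made for illustration, and the names `count_japanese`, `count_english`, and `normalize` are hypothetical. A production system would more likely rely on a morphological analyzer and a proper lemmatizer.

```python
import re

# CJK Unified Ideographs (kanji). Assumption: a kanji directly next to the
# term suggests the term is part of a longer compound word.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def count_japanese(term: str, text: str) -> int:
    """Count occurrences of `term` that are not glued to a neighbouring kanji."""
    count = 0
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        before = text[start - 1] if start > 0 else ""
        after = text[end] if end < len(text) else ""
        if not (KANJI.match(before) or KANJI.match(after)):
            count += 1
        start = text.find(term, end)
    return count


def normalize(word: str) -> str:
    """Very naive singularization: lowercase and strip one trailing 's'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def count_english(term: str, text: str) -> int:
    """Count whole-word matches of `term`, treating plural forms as equal."""
    target = normalize(term)
    words = re.findall(r"[A-Za-z]+", text)
    return sum(1 for w in words if normalize(w) == target)


print(count_japanese("投資", "まずは投資の基本を理解しましょう"))  # 1
print(count_japanese("投資", "投資関連契約を締結した"))            # 0
print(count_english("apple", "I ate a pineapple"))                 # 0
print(count_english("investor", "The Investors met one investor."))  # 2
```

The kanji rule is deliberately crude: it would wrongly reject a legitimate use of 投資 that happens to sit next to another kanji word, which is exactly the kind of context dependence described above.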

Handling section numbers: each language has its own “writing style”

Section parsing is another area where Japanese and English behave very differently.
In English contracts, you usually see sections like:

  • Section 1. Introduction
    • 1.1 Background
    • 1.2 Purpose

In Japanese legal documents, the first sub-section under a section (the equivalent of Section X.1) is almost always implicit. A section is usually labeled 第X条, and the sub-sections within it are marked with 第N項. However, the first sub-section (第一項) is usually not shown; only the second sub-section (第二項) and onward are labeled.

This means when we parse a structure like:

  • 第3条
    • 第二項...
    • 第三項...

we have to infer that there was a hidden 第一項 (Paragraph 1) before those. It’s subtle, but extremely important for accurate structure parsing and reference resolution. Getting this wrong means misnumbered links or broken outlines, which are unacceptable in a legal editing tool.
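As a rough illustration of this inference step, the sketch below takes the paragraphs that were explicitly labeled in one article and prepends an inferred Paragraph 1 when the first label is not 1. The `Paragraph` type and the `add_implicit_first` function are hypothetical names chosen for this example, not part of the actual parser, and the sketch assumes the labeled paragraphs have already been detected.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    number: int             # paragraph (項) number within the article
    text: str
    implicit: bool = False  # True when the number was inferred, not written


def add_implicit_first(unlabeled_text: str, labeled: List[Paragraph]) -> List[Paragraph]:
    """Prepend an inferred paragraph 1 when the document does not label it."""
    if labeled and labeled[0].number == 1:
        return list(labeled)  # paragraph 1 is explicit; nothing to infer
    first = Paragraph(number=1, text=unlabeled_text, implicit=True)
    return [first, *labeled]


# Placeholder texts stand in for the real contents of 第3条.
article_3 = add_implicit_first(
    "(unlabeled text at the start of 第3条)",
    [Paragraph(2, "(text of 第二項)"), Paragraph(3, "(text of 第三項)")],
)
for p in article_3:
    print(p.number, "inferred" if p.implicit else "labeled")
```

Keeping an `implicit` flag on the inferred paragraph lets later stages, such as reference resolution, link to 第一項 while still knowing that no label exists in the document itself.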

Identifying Definition Tables: easy for humans, hard for computers

Sometimes defined terms are listed in a table format:

Term | Meaning
Effective Date | The date the agreement begins
Some Term | Means ... defined in ...

However, not every table is a definition table. Some are forms, some are lists, some are just layout elements.

To detect likely definition tables, we rely on heuristics. One useful signal is the length difference between the two columns. If one column (terms) has very short content and the other (descriptions) is much longer, it's probably a definition table.

But this heuristic falls apart in Japanese due to short word lengths. For example:

Column 1 | Column 2
住所 | ●●県●●市
氏名 | 山田太郎

The column lengths technically satisfy the short-vs-long ratio, but semantically this is not a definition table—it’s a form.

A human can tell instantly. A program… not so much.

We are now exploring hybrid strategies—combining length ratios with signals like punctuation, keywords (~~~をいう, ~~~ means), or document layout cues—but perfect accuracy is hard to achieve.
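To make the idea concrete, here is a minimal sketch of such a hybrid check, combining the column length ratio with a keyword signal. It is only an illustration under assumptions: the function name `looks_like_definition_table`, the cue list, and both thresholds are hypothetical values picked for this example, not the heuristics we actually ship, and the English sample rows are invented for the demonstration.

```python
from typing import List, Tuple

# Assumed cue phrases that often appear in definition sentences.
DEFINITION_CUES = ("means", "refers to", "をいう")


def looks_like_definition_table(
    rows: List[Tuple[str, str]],
    min_ratio: float = 2.0,      # right column must be this many times longer
    min_cue_share: float = 0.5,  # share of rows that must contain a cue phrase
) -> bool:
    """Guess whether a two-column table lists defined terms."""
    if not rows:
        return False
    avg_left = sum(len(left) for left, _ in rows) / len(rows)
    avg_right = sum(len(right) for _, right in rows) / len(rows)
    if avg_left == 0 or avg_right / avg_left < min_ratio:
        return False  # columns are not "short term vs. long description"
    cue_rows = sum(
        1 for _, right in rows
        if any(cue in right.lower() for cue in DEFINITION_CUES)
    )
    return cue_rows / len(rows) >= min_cue_share


definitions = [
    ("Effective Date", "means the date on which the agreement begins"),
    ("Party", "means a signatory to the agreement"),
]
form = [("住所", "●●県●●市"), ("氏名", "山田太郎")]

print(looks_like_definition_table(definitions))            # True
print(looks_like_definition_table(form))                   # False
print(looks_like_definition_table(form, min_cue_share=0))  # True: ratio alone misfires
```

The last call shows the failure mode described above: with the keyword signal switched off, the address form passes the length test on its own. Adding the cue check fixes this example but would in turn miss definition tables whose descriptions omit phrases like "means" or をいう.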

Interesting Points

Despite the headaches, working on this multilingual parser has been genuinely rewarding. A few things made this project particularly interesting:

  • Designing language-agnostic logic as much as possible, while knowing when to break that rule
  • Legal language quirks: for example, how defined terms are declared in-line in English ("Effective Date" means...) versus at the end in Japanese documents
  • The constant balancing act between heuristics and false positives—every fix can introduce a new edge case
  • It's also intellectually satisfying to build something that feels smart when it works—like correctly identifying a term that's only defined deep in an appendix and used once on page 3.

Conclusion

Multilingual support in document parsing isn’t just a matter of translating text. It's about understanding and bridging structural, grammatical, and even cultural differences between languages.

Building a unified parser that works across English, Japanese, and beyond has forced us to think carefully about edge cases, false assumptions, and what "language-agnostic" really means in practice.

Hopefully, this post gives a glimpse into what we’ve been working on—and why it’s harder (and more fun) than it might seem.

BoostDraft TECH BLOG
