Testing 2 MeCab Dictionaries with Finance Terms


Hello, I'm Dang, an AI and machine learning engineer at Knowledgelabo, Inc. We provide a service called "Manageboard", which helps our clients aggregate, analyze, and manage scattered internal business data. We plan to enhance Manageboard's AI capabilities going forward, and in these articles I will share the challenges we encounter during research and development.

Background

MeCab is a powerful tool for Japanese text segmentation, often used for breaking text down into manageable units such as words or phrases. The segmentation results depend heavily on the dictionary being used. One popular choice is the IPA dictionary (ipadic), which is the commonly recommended default and is available from MeCab's site. Another option is mecab-ipadic-NEologd, a customized dictionary extended with a large number of newer and specialized words, which makes it especially useful for handling modern or domain-specific terminology.

In this article, we will test both of these dictionaries using accounting-related terms to compare their segmentation results.

Setup and Code

To perform the test, we created a simple Python script that utilizes the MeCab library. The script runs the segmentation process using two different dictionaries: the IPA dictionary (ipadic) and the custom dictionary mecab-ipadic-NEologd.

Here’s the source code for our experiment:

import MeCab

# Load MeCab with the IPA dictionary
ipa_tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/ipadic')

# Load MeCab with the mecab-ipadic-NEologd dictionary
neo_tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

# Test text containing accounting terms
text = '工場労務費'

# Output segmentation results for both dictionaries
print([line.split('\t')[0] for line in ipa_tagger.parse(text).replace('\nEOS\n', '').split('\n')])
print([line.split('\t')[0] for line in neo_tagger.parse(text).replace('\nEOS\n', '').split('\n')])

In this code, we create a MeCab.Tagger for each dictionary, using the -d option to point at the dictionary directory, and parse a sample accounting term (工場労務費, meaning "factory labor costs") with both. We then strip the trailing EOS marker, keep only the surface form before each tab, and print the segmented words for comparison.
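The one-liners above pack the output cleanup into a single expression. For readability, the same cleanup can be factored into a small helper like the sketch below. The sample string is hard-coded (with illustrative feature fields) so the snippet runs even without MeCab installed; with MeCab available, you would pass it the result of tagger.parse(text) instead.

```python
def surfaces(mecab_output):
    """Extract surface forms from MeCab's default output.

    Each token line looks like '<surface>\\t<comma-separated features>',
    and the stream is terminated by a line containing only 'EOS'.
    """
    return [
        line.split('\t')[0]
        for line in mecab_output.splitlines()
        if line and line != 'EOS'
    ]

# Hard-coded sample resembling ipadic output for 工場労務費
# (feature fields are illustrative)
sample = (
    '工場\t名詞,一般,*,*,*,*,工場,コウジョウ,コージョー\n'
    '労務\t名詞,一般,*,*,*,*,労務,ロウム,ローム\n'
    '費\t名詞,接尾,一般,*,*,*,費,ヒ,ヒ\n'
    'EOS\n'
)
print(surfaces(sample))  # → ['工場', '労務', '費']
```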

Segmentation Results

Here are the results of running the script on a few test accounting terms:

| Text | IPA Dictionary (ipadic) | mecab-ipadic-NEologd |
| --- | --- | --- |
| 工場労務費 | 工場/労務/費 | 工場/労務費 |
| (製)賞与引当金 | (/製/)/賞与/引当/金 | (/製/)/賞与/引当金 |
| 【原価】旅費交通費(通勤費) | 【/原価/】/旅費/交通/費/(/通勤/費/) | 【/原価/】/旅費交通費/(/通勤費/) |
| 製)外注人件費 | 製/)/外注/人件/費 | 製/)/外注/人件費 |

Analysis

From the results above, we can observe a few key differences between the two dictionaries:

  1. Segment Length: The mecab-ipadic-NEologd dictionary tends to produce longer segments than the IPA dictionary. For example, in the term 工場労務費 (factory labor costs), the IPA dictionary segments it as 工場/労務/費, while mecab-ipadic-NEologd keeps it as 工場/労務費. This suggests that mecab-ipadic-NEologd recognizes 労務費 as a single entity, whereas the IPA dictionary splits it into smaller components.
  2. Handling of New or Specialized Terms: In the test case 【原価】旅費交通費(通勤費), mecab-ipadic-NEologd produces a more sensible segmentation by keeping the compound 旅費交通費 (travel and transportation expenses) as one segment, while the IPA dictionary breaks it into smaller chunks: 旅費 (travel expenses), 交通 (transportation), and 費 (expenses). This is an example of how mecab-ipadic-NEologd may perform better with specialized or compound terms that are not in the standard IPA dictionary.
  3. Handling Special Characters: Both dictionaries split off surrounding symbols such as parentheses and brackets as separate tokens (e.g. (/製/)), so downstream processing may need to strip or normalize these characters before or after segmentation.
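One way to make this comparison less anecdotal is to score each segmentation against a list of domain terms we expect to survive as single tokens. Below is a minimal sketch using the segmentations from the results table; the DOMAIN_TERMS set is an illustrative list of our own choosing, not an official vocabulary.

```python
# Illustrative set of accounting compounds we want kept whole
DOMAIN_TERMS = {'労務費', '引当金', '旅費交通費', '通勤費', '人件費'}

def domain_term_hits(tokens):
    """Count tokens that exactly match a known domain term."""
    return sum(1 for token in tokens if token in DOMAIN_TERMS)

# Segmentations of 工場労務費 taken from the results table above
ipa_tokens = ['工場', '労務', '費']
neologd_tokens = ['工場', '労務費']

print(domain_term_hits(ipa_tokens))      # → 0
print(domain_term_hits(neologd_tokens))  # → 1
```

A higher hit count suggests the dictionary preserves more of the compounds that matter for the domain, which is a simple proxy for downstream usefulness.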

Conclusion

The choice of dictionary in MeCab largely depends on the nature of the text you are working with. If you're working with general or older terms, or if you need to break terms into very short segments, the IPA dictionary should suffice. However, if your text includes more modern or specialized terminology, such as accounting terms, mecab-ipadic-NEologd might be the better option due to its inclusion of newer words and more appropriate segmentation for complex terms.

For accounting professionals or anyone working with specialized fields like finance, testing these dictionaries with domain-specific terms could greatly improve the accuracy and efficiency of text processing tasks.
