Testing 2 MeCab Dictionaries with Finance Terms
Hello, I'm Dang, an AI and machine learning engineer at Knowledgelabo, Inc. We provide a service called "Manageboard", which supports our clients in aggregating, analyzing, and managing scattered internal business data. Manageboard is set to enhance its AI capabilities in the future. In my articles, I will share the challenges we encountered during our research and development.
Background
MeCab is a powerful tool for Japanese text segmentation, often used for breaking down text into manageable units like words or phrases. The segmentation results depend heavily on the dictionary being used. One popular dictionary is the IPA Dictionary, which is commonly recommended and available on MeCab’s site. On the other hand, there is the mecab-ipadic-NEologd, a customized dictionary that includes a large number of newer and specialized words, making it especially useful for handling modern or domain-specific terminology.
In this article, we will test both of these dictionaries using accounting-related terms to compare their segmentation results.
Setup and Code
To perform the test, we created a simple Python script that utilizes the MeCab library. The script runs the segmentation process using two different dictionaries: the IPA dictionary (ipadic) and the custom dictionary mecab-ipadic-NEologd.
Here’s the source code for our experiment:
```python
import MeCab

# Load MeCab with the IPA dictionary
ipa_tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/ipadic')

# Load MeCab with the mecab-ipadic-NEologd dictionary
neo_tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

# Test text containing an accounting term
text = '工場労務費'

# Print the surface forms produced by each dictionary
print([line.split('\t')[0] for line in ipa_tagger.parse(text).replace('\nEOS\n', '').split('\n')])
print([line.split('\t')[0] for line in neo_tagger.parse(text).replace('\nEOS\n', '').split('\n')])
```
In this code, we use the `MeCab.Tagger()` constructor to load each dictionary, then parse a sample accounting term (工場労務費, meaning "factory labor costs") with both taggers. Finally, we print the segmented words from each dictionary for comparison.
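As a side note, MeCab's default output format is one token per line in the form `surface\tfeatures`, terminated by an `EOS` line, which is why the script strips `EOS` and splits on tabs. The extraction step can be sketched as a small standalone helper. The `sample` string below is a hard-coded example of MeCab's raw output (feature columns abbreviated), so this sketch runs even without MeCab installed:

```python
def surfaces(mecab_output: str) -> list:
    """Extract surface forms (the text before the tab) from MeCab's
    default one-token-per-line output, dropping the trailing EOS marker."""
    return [
        line.split('\t')[0]
        for line in mecab_output.splitlines()
        if line and line != 'EOS'
    ]

# Hard-coded sample of MeCab's raw output for 工場労務費 under the
# IPA dictionary (feature columns abbreviated for readability).
sample = (
    '工場\t名詞,一般,*,*,*,*,工場\n'
    '労務\t名詞,一般,*,*,*,*,労務\n'
    '費\t名詞,接尾,*,*,*,*,費\n'
    'EOS\n'
)

print(surfaces(sample))  # ['工場', '労務', '費']
```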
Segmentation Results
Here are the results of running the script on a few test accounting terms:
| Text | IPA Dictionary (ipadic) | mecab-ipadic-NEologd |
|---|---|---|
| 工場労務費 | 工場/労務/費 | 工場/労務費 |
| (製)賞与引当金 | (/製/)/賞与/引当/金 | (/製/)/賞与/引当金 |
| 【原価】旅費交通費(通勤費) | 【/原価/】/旅費/交通/費/(/通勤/費/) | 【/原価/】/旅費交通費/(/通勤費/) |
| 製)外注人件費 | 製/)/外注/人件/費 | 製/)/外注/人件費 |
Analysis
From the results above, we can observe a few key differences between the two dictionaries:
- Segment Length: The `mecab-ipadic-NEologd` dictionary tends to produce longer segments than the IPA dictionary. For example, for the term 工場労務費 (factory labor costs), the IPA dictionary segments it as 工場/労務/費, while `mecab-ipadic-NEologd` keeps it as 工場/労務費. This suggests that `mecab-ipadic-NEologd` recognizes 労務費 as a single entity, whereas the IPA dictionary splits it into smaller components.
- Handling of New or Specialized Terms: In the test case 【原価】旅費交通費(通勤費), `mecab-ipadic-NEologd` provides a more sensible segmentation by grouping the phrase 旅費交通費 (travel and transportation expenses) into one segment, while the IPA dictionary breaks it down into smaller chunks: 旅費 (travel expenses), 交通 (transportation), and 費 (expenses). This is an example of how `mecab-ipadic-NEologd` may perform better with specialized or compound terms that are not in the standard IPA dictionary.
- Handling of Special Characters: Both dictionaries struggle with special characters like parentheses or dashes, which are always split off as separate tokens.
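To make the segment-length difference concrete, here is a small sketch that tabulates token counts from the results table above. Note that the segmentations are hard-coded copies of our table, not re-run through MeCab, so the snippet works without either dictionary installed:

```python
# Segmentations copied from the results table above.
results = {
    '工場労務費': {
        'ipadic':  ['工場', '労務', '費'],
        'neologd': ['工場', '労務費'],
    },
    '【原価】旅費交通費(通勤費)': {
        'ipadic':  ['【', '原価', '】', '旅費', '交通', '費', '(', '通勤', '費', ')'],
        'neologd': ['【', '原価', '】', '旅費交通費', '(', '通勤費', ')'],
    },
}

# NEologd consistently yields fewer, longer tokens for these compound terms.
for text, segs in results.items():
    ipa, neo = segs['ipadic'], segs['neologd']
    print(f'{text}: ipadic={len(ipa)} tokens, NEologd={len(neo)} tokens')
```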
Conclusion
The choice of dictionary in MeCab largely depends on the nature of the text you are working with. If you're working with general or older terms, or if you need to break terms into very short segments, the IPA dictionary should suffice. However, if your text includes more modern or specialized terminology, such as accounting terms, mecab-ipadic-NEologd might be the better option due to its inclusion of newer words and more appropriate segmentation for complex terms.
For accounting professionals or anyone working with specialized fields like finance, testing these dictionaries with domain-specific terms could greatly improve the accuracy and efficiency of text processing tasks.