<h2 id="%E6%A6%82%E8%A6%81">
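<a class="header-anchor-link" href="#%E6%A6%82%E8%A6%81" aria-hidden="true"></a> Overview</h2>
This article shows how to use UniDic 2.1.2 with kuromoji on Solr 9.1.
<a href="https://clrd.ninjal.ac.jp/unidic/" target="_blank" rel="nofollow noopener noreferrer">UniDic</a> is an analysis dictionary for the morphological analyzer MeCab, developed by the National Institute for Japanese Language and Linguistics (NINJAL).
Other dictionaries for morphological analysis include the IPA dictionary (IPADic), the NAIST dictionary, and the NEologd dictionary, but updates stopped in 2007 for IPADic, in 2011 for NAIST, and in 2020 for NEologd. UniDic released version 3.1.1 in September 2022, which makes it one of the few dictionaries still being updated as of 2022.
kuromoji did not appear to officially support the latest UniDic 3.1.1 yet, so this article uses UniDic 2.1.2.
(UniDic 3.1.1 can be forced to build, but it is not officially supported and the resulting dictionary is large at about 422 MB, so it is probably not practical.)
The content is essentially the same as <a href="https://blog.johtani.info/blog/2019/12/04/about-lucene-4056/" target="_blank" rel="nofollow noopener noreferrer">this entry</a>, but because Lucene 9 switched its build system from Apache Ant to Gradle, earlier write-ups could not be applied as-is, so I wrote this article as a memo.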
<h2 id="%E7%92%B0%E5%A2%83">
<a class="header-anchor-link" href="#%E7%92%B0%E5%A2%83" aria-hidden="true"></a> Environment</h2>
<ul>
<li>Ubuntu 22.04</li>
<li>openjdk 11</li>
<li>Solr 9.1.0</li>
<li>Lucene 9.3.0</li>
</ul>
<h2 id="%E6%89%8B%E9%A0%86">
<a class="header-anchor-link" href="#%E6%89%8B%E9%A0%86" aria-hidden="true"></a> Steps</h2>
<h3 id="java11%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB">
<a class="header-anchor-link" href="#java11%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB" aria-hidden="true"></a> Installing Java 11</h3>
Install Java 11. On Ubuntu, it can be installed with <code>apt</code>.
<div class="code-block-container"><pre class="language-bash"><code class="language-bash">sudo apt install openjdk-11-jdk
</code></pre></div><h3 id="lucene%E3%81%AE%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89">
<a class="header-anchor-link" href="#lucene%E3%81%AE%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89" aria-hidden="true"></a> Downloading Lucene</h3>
Clone release 9.3.0 from the <a href="https://github.com/apache/lucene" target="_blank" rel="nofollow noopener noreferrer">Lucene GitHub repository</a>.
The repository is very large, so a shallow clone is recommended.
<div class="code-block-container"><pre class="language-bash"><code class="language-bash">git clone https://github.com/apache/lucene.git --branch releases/lucene/9.3.0 --depth 1
</code></pre></div><h3 id="%E3%82%B3%E3%83%BC%E3%83%89%E3%81%AE%E5%A4%89%E6%9B%B4">
<a class="header-anchor-link" href="#%E3%82%B3%E3%83%BC%E3%83%89%E3%81%AE%E5%A4%89%E6%9B%B4" aria-hidden="true"></a> Modifying the code</h3>
A few Lucene source files need to be changed. The changes are almost the same as those in <a href="https://blog.johtani.info/blog/2019/12/04/about-lucene-4056/" target="_blank" rel="nofollow noopener noreferrer">this entry</a>.
I have published patch files on GitHub Gist; download them and apply each one to the corresponding file.
<div class="code-block-container"><pre class="language-bash"><code class="language-bash">curl -O https://gist.githubusercontent.com/fjnkt98/b97a30b23d90a257aaba23efe54861bd/raw/c52efb8b364b40a7336616c69fda5c4076504b5d/DictionaryBuilder.java.patch
curl -O https://gist.githubusercontent.com/fjnkt98/9d298769314bc0875a7809fa08bcf250/raw/d2c9a9ecfecc0b948973c8762458ccf3487ae626/TokenInfoDictionaryBuilder.java.patch
curl -O https://gist.githubusercontent.com/fjnkt98/3e7655081b8141116bd39fd5a4b95e6a/raw/fbfad21fb165d1457affd8f411d06109da1ec95c/UnknownDictionaryBuilder.java.patch
curl -O https://gist.githubusercontent.com/fjnkt98/c6e9cbf47ef062f0ae1ca21380ecbe24/raw/05b1e593d8bf763858448f82829848855033e9a6/kuromoji.gradle.patch

cd lucene
git apply ../TokenInfoDictionaryBuilder.java.patch
git apply ../DictionaryBuilder.java.patch
git apply ../kuromoji.gradle.patch
git apply ../UnknownDictionaryBuilder.java.patch
</code></pre></div><h3 id="%E8%BE%9E%E6%9B%B8%E3%81%AE%E3%83%93%E3%83%AB%E3%83%89">
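<a class="header-anchor-link" href="#%E8%BE%9E%E6%9B%B8%E3%81%AE%E3%83%93%E3%83%AB%E3%83%89" aria-hidden="true"></a> Building the dictionary</h3>
Build the dictionary. If <code>compileUnidic</code> is not run, the IPADic dictionary is built instead, so first run a build that includes the <code>compileUnidic</code> task and then run the full build again. (I do not fully understand why Gradle behaves this way.)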
<div class="code-block-container"><pre class="language-bash"><code class="language-bash">./gradlew assemble compileUnidic
./gradlew assemble
</code></pre></div>When the build finishes, the JAR archive containing the dictionary is generated under <code>lucene/analysis/kuromoji/build/libs</code>. Unless you have changed any settings, the file is named <code>lucene-analysis-kuromoji-9.3.0-SNAPSHOT.jar</code>.
If the file size is about 17 MB, the build worked.
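The size check can also be scripted. The following is only a minimal sketch: the function name <code>check_jar_size</code> and the default range of 15 to 20 MB are assumptions chosen around the 17 MB figure, not values taken from Lucene.

```shell
# Hypothetical helper: succeeds when the file size in MB is within [min, max].
#   $1: path to the JAR
#   $2: minimum size in MB (assumed default: 15)
#   $3: maximum size in MB (assumed default: 20)
check_jar_size() {
  local jar="$1" min_mb="${2:-15}" max_mb="${3:-20}"
  [ -f "$jar" ] || { echo "not found: $jar" >&2; return 2; }
  local bytes mb
  bytes=$(wc -c < "$jar")
  mb=$(( bytes / 1024 / 1024 ))
  echo "$jar: $mb MB"
  [ "$mb" -ge "$min_mb" ] && [ "$mb" -le "$max_mb" ]
}
```

From the repository root it could be called as <code>check_jar_size lucene/analysis/kuromoji/build/libs/lucene-analysis-kuromoji-9.3.0-SNAPSHOT.jar</code>. A much smaller file would suggest that a different dictionary was built.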
<h3 id="dockerfile">
<a class="header-anchor-link" href="#dockerfile" aria-hidden="true"></a> Dockerfile</h3>
The following Dockerfile performs the dictionary build. With a working Docker environment, it produces a Docker image that contains the dictionary file.
<div class="code-block-container"><pre class="language-dockerfile"><code class="language-dockerfile">FROM openjdk:11.0.16-slim-buster

RUN apt-get update \
 &amp;&amp; apt-get install -y git curl \
 &amp;&amp; rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/apache/lucene.git --branch releases/lucene/9.3.0 --depth 1

RUN curl -O https://gist.githubusercontent.com/fjnkt98/b97a30b23d90a257aaba23efe54861bd/raw/c52efb8b364b40a7336616c69fda5c4076504b5d/DictionaryBuilder.java.patch \
 &amp;&amp; curl -O https://gist.githubusercontent.com/fjnkt98/9d298769314bc0875a7809fa08bcf250/raw/d2c9a9ecfecc0b948973c8762458ccf3487ae626/TokenInfoDictionaryBuilder.java.patch \
 &amp;&amp; curl -O https://gist.githubusercontent.com/fjnkt98/3e7655081b8141116bd39fd5a4b95e6a/raw/fbfad21fb165d1457affd8f411d06109da1ec95c/UnknownDictionaryBuilder.java.patch \
 &amp;&amp; curl -O https://gist.githubusercontent.com/fjnkt98/c6e9cbf47ef062f0ae1ca21380ecbe24/raw/05b1e593d8bf763858448f82829848855033e9a6/kuromoji.gradle.patch

RUN cd /lucene \
 &amp;&amp; git apply ../TokenInfoDictionaryBuilder.java.patch \
 &amp;&amp; git apply ../DictionaryBuilder.java.patch \
 &amp;&amp; git apply ../kuromoji.gradle.patch \
 &amp;&amp; git apply ../UnknownDictionaryBuilder.java.patch

RUN cd /lucene \
 &amp;&amp; ./gradlew clean
# Run a build that includes compileUnidic first; without this step the ipadic dictionary is built instead
RUN cd /lucene \
 &amp;&amp; ./gradlew assemble compileUnidic
RUN cd /lucene \
 &amp;&amp; ./gradlew assemble

RUN cp /lucene/lucene/analysis/kuromoji/build/libs/lucene-analysis-kuromoji-9.3.0-SNAPSHOT.jar /
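</code></pre></div>Start a container and copy the JAR to the host with <code>docker cp</code> to obtain the dictionary. The command below assumes the image was built with the tag <code>lucene9.3.0-unidic2.1.2</code>.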
<div class="code-block-container"><pre class="language-bash"><code class="language-bash">docker run --rm -it lucene9.3.0-unidic2.1.2:latest bash
</code></pre></div><h3 id="solr%E3%81%A7%E4%BD%BF%E7%94%A8%E3%81%99%E3%82%8B">
<a class="header-anchor-link" href="#solr%E3%81%A7%E4%BD%BF%E7%94%A8%E3%81%99%E3%82%8B" aria-hidden="true"></a> Using it in Solr</h3>
Place the built dictionary in Solr and check that it works.
To apply the built dictionary to Solr, first disable the kuromoji JAR that is installed by default. By default there is a file named <code>lucene-analysis-kuromoji-9.3.0.jar</code>, so either delete it or rename it to something like <code>lucene-analysis-kuromoji-9.3.0.jar.org</code>. Any name works as long as the extension is no longer <code>.jar</code>.
Then place the generated dictionary archive in the same directory, <code>/opt/solr/server/solr-webapp/webapp/WEB-INF/lib</code>.
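For a Solr installation outside Docker, the rename-and-copy step might be scripted roughly as follows. This is a sketch under assumptions: the function name <code>swap_kuromoji_jar</code> is hypothetical, and it renames the default JAR instead of deleting it so that the original can be restored.

```shell
# Hypothetical helper: disable the bundled kuromoji JAR and install a custom one.
#   $1: Solr WEB-INF/lib directory
#   $2: path to the custom-built JAR
swap_kuromoji_jar() {
  local lib_dir="$1" custom_jar="$2"
  local default_jar="$lib_dir/lucene-analysis-kuromoji-9.3.0.jar"
  [ -f "$custom_jar" ] || { echo "not found: $custom_jar" >&2; return 1; }
  # Rename rather than delete, so the extension is no longer .jar
  # but the original file can still be restored.
  if [ -f "$default_jar" ]; then
    mv "$default_jar" "$default_jar.org" || return 1
  fi
  cp "$custom_jar" "$lib_dir/" || return 1
}
```

It could be called as <code>swap_kuromoji_jar /opt/solr/server/solr-webapp/webapp/WEB-INF/lib ./lucene-analysis-kuromoji-9.3.0-SNAPSHOT.jar</code>, after which Solr would need to be restarted to load the new JAR.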
When running Solr with Docker, a Dockerfile like the following makes it easy to create a Solr image with the dictionary bundled.
<div class="code-block-container"><pre class="language-dockerfile"><code class="language-dockerfile">FROM solr:9.1.0

COPY --chown=solr:solr ./lucene-analysis-kuromoji-9.3.0-unidic-2.1.2.jar /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-analysis-kuromoji-9.3.0.jar

USER solr
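</code></pre></div>Place <code>lucene-analysis-kuromoji-9.3.0-unidic-2.1.2.jar</code> in the same directory as the Dockerfile above and build the Solr image. This file name assumes the built <code>lucene-analysis-kuromoji-9.3.0-SNAPSHOT.jar</code> has been renamed.
Use whatever image tag you like.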
<div class="code-block-container"><pre class="language-bash"><code class="language-bash">docker build -t solr-unidic .
</code></pre></div>Start the Solr image you built.
<div class="code-block-container"><pre class="language-bash"><code class="language-bash">docker run -p 8983:8983 --rm solr-unidic:latest solr-precreate example
</code></pre></div>A core is created on startup, so send it a text analysis request.
<div class="code-block-container"><pre class="language-bash"><code class="language-bash">$ curl -sS 'http://localhost:8983/solr/example/analysis/field?analysis.fieldtype=text_ja' --get --data-urlencode 'analysis.fieldvalue=自動車と自転車の違いはなんでしょう？' | jq '.analysis.field_types.text_ja.index[13][].text'
"自動"
"車"
"自転"
"車"
"違い"
"なん"
</code></pre></div>If "自動車" (automobile) is split into "自動" and "車", the dictionary is working.
<h2 id="%E5%8F%82%E8%80%83%E6%96%87%E7%8C%AE">
<a class="header-anchor-link" href="#%E5%8F%82%E8%80%83%E6%96%87%E7%8C%AE" aria-hidden="true"></a> References</h2>
<ul>
<li><a href="https://blog.johtani.info/blog/2019/12/04/about-lucene-4056/" target="_blank" rel="nofollow noopener noreferrer">About the UniDic build patch for Apache Lucene's Kuromoji</a></li>
</ul>
