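<p data-line="0" class="code-line">Hello, and nice to meet you if this is your first time here. I'm <a href="https://twitter.com/kawara_y" target="_blank" rel="nofollow noopener noreferrer">Kawara</a> from <a href="https://fusic.co.jp/" target="_blank" rel="nofollow noopener noreferrer">Fusic</a> Co., Ltd. My stomach has been really upset lately, and perhaps because of that I suddenly lost about 2 kg. Worry ("Is it okay to drop 2 kg?!") and joy ("I lost 2 kg!") have been fighting in my head ever since (for now, the joy is still winning).</p>
<p data-line="2" class="code-line">I have been slacking off on writing lately, but I read a paper for the first time in a while, so I am leaving some notes. This article covers <a href="https://arxiv.org/abs/2408.09017" target="_blank" rel="nofollow noopener noreferrer">Meta Knowledge for Retrieval Augmented Large Language Models</a>, a paper published by AWS (it is from 2024, so it may be a little dated). No GitHub repository or similar has been released, so this article summarizes what can be read from the text of the paper. Some parts are loosely paraphrased, so please let me know in the comments if anything looks wrong.</p>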
<h2 id="%E6%A6%82%E8%A6%81" data-line="4" class="code-line">
<a class="header-anchor-link" href="#%E6%A6%82%E8%A6%81" aria-hidden="true"></a> Overview</h2>
<p data-line="6" class="code-line"><img src="https://res.cloudinary.com/zenn/image/fetch/s--LLfMw3NN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_1200/https://storage.googleapis.com/zenn-user-upload/deployed-images/e57b5e8cad7c191894f8278a.png%3Fsha%3D5bdb2a0f98cadaf0d1938b5227a0b024970b155f" class="md-img" loading="lazy"><br>
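<em>Overall architecture (from Figure 1 of the paper)</em></p>
<p data-line="9" class="code-line">In one sentence: the method generates metadata and synthetic question-answer pairs for each document, then <strong>builds metadata-based summaries (Meta Knowledge Summary: MK Summary) and uses them as the retrieval target to improve answer quality</strong>. Creating and using metadata makes it easier to retrieve information that is more relevant to the query, and retrieving over summaries packs more information into fewer tokens.</p>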
<p data-line="11" class="code-line"><img src="https://res.cloudinary.com/zenn/image/fetch/s--AI6YS4k8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_1200/https://storage.googleapis.com/zenn-user-upload/deployed-images/69ed277d208930055017acde.png%3Fsha%3D7de62de774e4bb2b01caf83fd6ba847d704fbc29" class="md-img" loading="lazy"><br>
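<em>Results of the proposed method (from Figure 2 of the paper)</em></p>
<p data-line="14" class="code-line">The results are shown in the figure above: the proposed method improves on prior approaches on every metric. The gains are especially large for Breadth (whether the answer comprehensively covers the aspects and areas relevant to the question and gives a complete overview) and Depth (whether the final answer provides sufficient understanding through detailed analysis and insight into the topic), which confirms that <strong>the method succeeds in increasing the amount of information considered when answering</strong>.</p>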
<h2 id="%E6%8F%90%E6%A1%88%E6%89%8B%E6%B3%95" data-line="16" class="code-line">
<a class="header-anchor-link" href="#%E6%8F%90%E6%A1%88%E6%89%8B%E6%B3%95" aria-hidden="true"></a> Proposed Method</h2>
<p data-line="18" class="code-line">Instead of the conventional "<em>retrieve-then-read</em>" pipeline, the paper adopts a new "<em>prepare-then-rewrite-then-retrieve-then-read</em>" pipeline and proposes MK Summary as the method that supports it.</p>
<p data-line="20" class="code-line">The newly added "<em>prepare-then-rewrite</em>" stage corresponds to the "<em>Offline Document Preprocessing</em>" and "<em>Metadata-based clusters of documents</em>" parts of the architecture figure above. Let's look at each in detail below.</p>
<h3 id="offline-document-preprocessing" data-line="22" class="code-line">
<a class="header-anchor-link" href="#offline-document-preprocessing" aria-hidden="true"></a> Offline Document Preprocessing</h3>
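<p data-line="24" class="code-line">In this stage, an LLM attaches metadata to each document and generates QA pairs as preprocessing. This <strong>addresses the problem that documents generally do not contain question-like sentences, which makes it hard to find relevant documents when computing similarity against a user's question</strong>. Attaching metadata also makes it possible to narrow down the retrieval scope, so I think it helps return more relevant results as well. The paper does not say this explicitly, but the AWS blog post <a href="https://aws.amazon.com/jp/blogs/news/a-practical-guide-to-improve-rag-systems-with-advanced-rag-on-aws/" target="_blank" rel="nofollow noopener noreferrer">A practical guide to improve RAG systems with Advanced RAG on AWS</a> (in Japanese) describes metadata filtering as an effective technique, so my guess is that it was included for that reason.</p>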
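<p class="code-line">The idea of matching a user question against synthetic questions, with a metadata filter applied first, can be sketched as follows. This is only a toy illustration under my own assumptions: the <code>SyntheticQA</code> record, the <code>fields</code> metadata, and the bag-of-words <code>embed</code> function (a stand-in for a real embedding model) are all hypothetical, since the paper does not publish code.</p>

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SyntheticQA:
    """One LLM-generated QA pair plus metadata of its source document (hypothetical schema)."""
    question: str
    answer: str
    fields: frozenset  # assumed metadata: research fields of the source document


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, corpus, field=None, top_k=2):
    """Filter by metadata first, then rank synthetic questions against the query."""
    candidates = [qa for qa in corpus if field is None or field in qa.fields]
    query_vec = embed(query)
    ranked = sorted(candidates, key=lambda qa: cosine(query_vec, embed(qa.question)), reverse=True)
    return ranked[:top_k]


corpus = [
    SyntheticQA("How does retrieval augmented generation reduce hallucination?",
                "By grounding answers in retrieved text.", frozenset({"nlp"})),
    SyntheticQA("How does contrastive learning train image encoders?",
                "By pulling matching pairs together.", frozenset({"vision"})),
    SyntheticQA("What metrics evaluate retrieval quality?",
                "Recall and precision at k are common.", frozenset({"nlp"})),
]

hits = search("how does retrieval augmented generation work", corpus, field="nlp")
for qa in hits:
    print(qa.question, "->", qa.answer)
```

<p class="code-line">Because both sides of the comparison are questions, the similarity score is not distorted by the difference in style between a short query and a long document chunk. The <code>field</code> filter shows where metadata could narrow the candidates before ranking.</p>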
<p data-line="26" class="code-line">Next, let's look at the prompt used in this pipeline (it is long, so it is collapsed in a toggle).</p>
<details><summary>Prompt (from Appendix A of the paper)</summary><div class="details-content"><div class="code-block-container"><pre><code class="code-line" data-line="29">You are an helpful research assistant, preprocessing { document_types } for { users_types } to use later on.
You are provided with a document and a list of questions that aims at extracting key knowledge from this document. Please stricly follow the format below to answer (no introduction or finishing sentences).

First, answer the following questions with a single Yes or No only:
1. The paper can be clearly categorized into one or multiple research field (s) ( exclusively from : { text_categories }), yes or no ?:
2. The paper is mostly an applied research paper ( versus mostly theoric ), yes or no ?:
3. The paper is referencing a Github repository, yes or no?
4. The paper contains mathematical reasoning, yes or no ?:
5. The paper mentions a specific application to an industry company, yes or no ?:
6. The paper uses evaluations metrics to benchmark their methods, yes or no ?:

Answer the following questions with a python list only, or return an empty python list :
1. If the paper can be clearly categorized into one or multiple research fields, list the fields (3 max):
2. If the paper is mostly an applied research paper, list the application fields (3 max):
3. If the paper references one or more Github repository, list their urls (2 max):
4. If the paper contains mathematical reasoning, list the name (s) of the theorem (s) being used (3 max):
5. If the paper mentioned a specific application to an industry company, list the companies (3 max):
6. If the paper use evaluations metrics to benchmark their methods, list the names of the metrics (5 max):

Your answer must look like the following (no introduction sentence):
1. Yes
2. No
etc.

1. ['a','b']
2. []
etc.

Then , please act as an expert scientists and formulate both general ( general understanding ) and precise questions ( incl. specific findings or limitations ) from the content of the document to assess the knowledge of other highly knowledgeable scientists about the topic of this document.

Scientists that will answer the questions do not know the document. Please do not explicitly refer to "the text" or the name of the document in the questions.
Each questions and answers pairs must be self - contained ( make sure to give enough context ) and independent from other pairs.

Please formulate as many questions as possible covering as much content as possible , and avoid bullet points within answers.

Stricly follow the format of the final questions and answers below, presenting all responses , lists , questions , then all answers:

Questions :
1. ...
2. ...
etc.

Answers :
1. ...
2. ...
etc.

Please find below the text , for which the title is { doc_title }:

[ Text ]
{ doc_content }
[/ Text ]
</code></pre></div></div></details>
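<p data-line="85" class="code-line">The paper says that Chain-of-Thought was used to generate guided QA pairs: the LLM first decides whether each kind of metadata exists at all, then lists that metadata if it does, and only then generates the QA pairs. This lets <strong>the LLM generate QA pairs that take the metadata (= categories) into account</strong>. Once the metadata and QA pairs have been generated, preprocessing is done.</p>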
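<p class="code-line">To make the output format of the prompt above more concrete, here is a minimal sketch of how the placeholders could be filled and how a response could be parsed. The function names <code>fill_template</code> and <code>parse_response</code> and the sample response are my own assumptions, not something taken from the paper, and real LLM output would need more defensive handling.</p>

```python
import ast
import re

ITEM = re.compile(r"^\s*(\d+)\.\s*(.*)$")  # matches lines such as "1. Yes"


def fill_template(template: str, values: dict) -> str:
    """Fill placeholders written as "{ name }" (with inner spaces, as in the prompt)."""
    return re.sub(r"\{\s*(\w+)\s*\}", lambda m: values[m.group(1)], template)


def parse_response(text: str) -> dict:
    """Split a response in the prompt's format into flags, lists, and QA pairs."""
    flags, lists, questions, answers = [], [], [], []
    section = "meta"
    for line in text.splitlines():
        stripped = line.strip()
        header = stripped.rstrip(":").strip().lower()
        if header in ("questions", "answers"):
            section = header
            continue
        match = ITEM.match(stripped)
        if not match:
            continue  # skip blank lines and anything that is not a numbered item
        body = match.group(2).strip()
        if section == "questions":
            questions.append(body)
        elif section == "answers":
            answers.append(body)
        elif body.lower() in ("yes", "no"):
            flags.append(body.lower() == "yes")
        elif body.startswith("["):
            try:
                lists.append(ast.literal_eval(body))  # "['a','b']" becomes ['a', 'b']
            except (ValueError, SyntaxError):
                lists.append([])  # treat a malformed list as empty
    return {"flags": flags, "lists": lists, "qa_pairs": list(zip(questions, answers))}


# Hypothetical response with only two metadata items, for brevity.
sample = """1. Yes
2. No

1. ['field A', 'field B']
2. []

Questions :
1. Placeholder question one?
2. Placeholder question two?

Answers :
1. Placeholder answer one.
2. Placeholder answer two.
"""

parsed = parse_response(sample)
print(fill_template("preprocessing { document_types } for { users_types }",
                    {"document_types": "papers", "users_types": "researchers"}))
print(parsed["flags"], parsed["lists"])
print(parsed["qa_pairs"])
```

<p class="code-line">The yes/no flags and the lists together form the metadata of a document, and each question is paired with the answer that has the same number.</p>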
<h3 id="metadata-based-clusters-of-documents" data-line="87" class="code-line">
<a class="header-anchor-link" href="#metadata-based-clusters-of-documents" aria-hidden="true"></a> Metadata-based clusters of documents</h3>
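<p data-line="89" class="code-line">This stage aims to improve retrieval results by using the metadata and the generated QA pairs. The paper's description here is somewhat vague, so note that what follows includes some guesswork.</p>
<p data-line="91" class="code-line">Here, an MK Summary is created by generating a summary based on the metadata and the generated questions (the answers are probably not used). The MK Summary is then used for query augmentation, and the final answer is generated. I think this is what makes it possible to retrieve both "document information that is closer to the user's question" and "information from the relevant categories". Answer generation uses the user's question, the augmented queries, the retrieved QA pairs, and a few-shot prompt.</p>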
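<p class="code-line">My reading of this stage can be sketched roughly as below. Since the paper is vague here, everything is an assumption: the <code>LLM</code> callable interface, the prompt wording, the choice of research field as the clustering key, and the <code>fake_llm</code> stub that only exists so the sketch runs without any API.</p>

```python
from collections import defaultdict
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def group_questions(docs: List[dict]) -> Dict[str, List[str]]:
    """Collect synthetic questions per metadata value (here: research field)."""
    clusters = defaultdict(list)
    for doc in docs:
        for field in doc["fields"]:
            clusters[field].extend(doc["questions"])
    return dict(clusters)


def build_mk_summaries(clusters: Dict[str, List[str]], llm: LLM) -> Dict[str, str]:
    """One summary per cluster, written by the LLM from the questions alone."""
    summaries = {}
    for field, questions in clusters.items():
        bullet_list = "\n".join(f"- {q}" for q in questions)
        prompt = f"Summarize the key concepts covered by these questions about {field}:\n{bullet_list}"
        summaries[field] = llm(prompt)
    return summaries


def augment_query(query: str, field: str, summaries: Dict[str, str], llm: LLM, n: int = 3) -> List[str]:
    """Ask the LLM for up to n rewritten queries, conditioned on the MK Summary."""
    prompt = (f"Knowledge base summary for {field}:\n{summaries[field]}\n\n"
              f"User question: {query}\n"
              f"Write {n} more specific search queries, one per line.")
    lines = [line.strip("- ").strip() for line in llm(prompt).splitlines()]
    return [query] + [line for line in lines if line][:n]


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    if prompt.startswith("Summarize"):
        return "Covers " + str(prompt.count("\n- ")) + " questions."
    return "query variant 1\nquery variant 2"


docs = [
    {"fields": ["nlp"], "questions": ["What is RAG?", "Why rewrite queries?"]},
    {"fields": ["nlp", "ir"], "questions": ["How are documents chunked?"]},
]
clusters = group_questions(docs)
summaries = build_mk_summaries(clusters, fake_llm)
queries = augment_query("How do I improve retrieval?", "nlp", summaries, fake_llm)
print(summaries)
print(queries)
```

<p class="code-line">The summaries are built offline, so the only extra work at query time in this sketch is the single LLM call inside <code>augment_query</code>. The augmented queries would then be matched against the synthetic questions.</p>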
<h2 id="%E7%B5%90%E6%9E%9C" data-line="93" class="code-line">
<a class="header-anchor-link" href="#%E7%B5%90%E6%9E%9C" aria-hidden="true"></a> Results</h2>
<p data-line="95" class="code-line"><img src="https://res.cloudinary.com/zenn/image/fetch/s--AI6YS4k8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_1200/https://storage.googleapis.com/zenn-user-upload/deployed-images/69ed277d208930055017acde.png%3Fsha%3D7de62de774e4bb2b01caf83fd6ba847d704fbc29" class="md-img" loading="lazy"><br>
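<em>Results of the proposed method (from Figure 2 of the paper)</em></p>
<p data-line="98" class="code-line">Here is the results chart again. The proposed method improves the scores across the board; the points the paper discusses in particular are summarized below (the scores are computed with LLM-as-a-judge).</p>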
<ul data-line="100" class="code-line">
<li data-line="100" class="code-line">
<strong>Limited effect on Precision</strong>: whatever the retrieval method, none of the retrieved documents is completely irrelevant, so the differences between methods are small.</li>
<li data-line="101" class="code-line">
<strong>Large gains in Breadth and Depth</strong>: MK Summary supplies more of the information needed to answer, which raises the scores.</li>
</ul>
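<p data-line="103" class="code-line">Breadth in particular goes up because documents that ordinary retrieval would have missed are now picked up. Depth probably rises for the same reason: I suspect the LLM judge decides that the answer "captures more detailed information!".</p>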
<h2 id="%E3%81%BE%E3%81%A8%E3%82%81" data-line="105" class="code-line">
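<a class="header-anchor-link" href="#%E3%81%BE%E3%81%A8%E3%82%81" aria-hidden="true"></a> Summary</h2>
<p data-line="107" class="code-line">This article introduced MK Summary. The takeaway from the paper is that condensing documents compactly (while keeping the important information) increases how much information can be passed to the LLM, which in turn improves the answers. What I find especially attractive is that the method needs almost no special processing at retrieval time, so retrieval performance can be improved substantially without slowing down responses.</p>
<p data-line="109" class="code-line">On the other hand, the quality of QA generation seems likely to have a large effect on accuracy. For example, if the method is applied to data that the LLM has no knowledge of, I suspect it can only generate generic questions, which would drift away from the distribution of real user questions. In other words, retrieval might return only meaningless results (what would be a good way to verify this...?).</p>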
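<p class="code-line">One possible way to check this (my own idea, not something from the paper) is to collect a sample of real user questions and measure how many of them have a sufficiently similar synthetic question. The sketch below uses token-level Jaccard similarity as a crude stand-in for embedding similarity; the <code>coverage</code> function, the threshold of 0.3, and the example questions are all hypothetical.</p>

```python
import re


def tokens(text: str) -> set:
    """Lowercase word set of a question."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coverage(user_questions, synthetic_questions, threshold=0.3):
    """Fraction of user questions with at least one sufficiently similar synthetic question."""
    synthetic_sets = [tokens(q) for q in synthetic_questions]
    covered = 0
    for question in user_questions:
        best = max((jaccard(tokens(question), s) for s in synthetic_sets), default=0.0)
        if best >= threshold:
            covered += 1
    return covered / len(user_questions) if user_questions else 0.0


user_questions = [
    "What license does the internal SDK use?",
    "How is the retry budget configured?",
]
synthetic_questions = [
    "How is the retry budget configured in the client?",
    "What is a software license?",
]
print(coverage(user_questions, synthetic_questions))
```

<p class="code-line">A low coverage value on domain-specific data would suggest that the generated questions are too generic compared with what users actually ask.</p>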


Reading "Meta Knowledge for Retrieval Augmented Large Language Models"
