Trying the Haystack tutorial: Utilizing Existing FAQs for Question Answering
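Extractive question answering works on plain text and is therefore more general, but there is another common approach that makes use of existing FAQ data.
- Pros:
  - Very fast at inference time.
  - Makes use of existing FAQ data.
  - Gives you a good deal of control over the answers.
- Cons:
  - Generalizability: it can only answer questions similar to ones that already exist in the FAQ.
For some use cases, a combination of extractive QA and FAQ-style QA can also be an interesting option.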
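One way to picture such a combination is a fallback: answer from the FAQ when the best match scores high enough, and otherwise hand the query to an extractive reader. The sketch below is only an illustration of that idea. `answer_with_fallback`, the `0.8` threshold, and the `fake_reader` callable are all hypothetical and are not part of the tutorial or of Haystack.

```python
from typing import Callable, List, Tuple


def answer_with_fallback(
    query: str,
    faq_hits: List[Tuple[str, float]],
    extractive_qa: Callable[[str], str],
    threshold: float = 0.8,  # assumed cutoff, tune for your data
) -> Tuple[str, str]:
    """Return (source, answer), preferring a confident FAQ match."""
    if faq_hits:
        # Pick the FAQ answer with the highest similarity score
        best_answer, best_score = max(faq_hits, key=lambda hit: hit[1])
        if best_score >= threshold:
            return "faq", best_answer
    # No confident FAQ match: fall back to the extractive reader
    return "extractive", extractive_qa(query)


def fake_reader(query: str) -> str:
    """Hypothetical stand-in for an extractive QA model."""
    return f"(extracted answer for: {query})"


hits = [
    ("It spreads from person to person.", 0.93),
    ("Coronaviruses are a large family of viruses.", 0.69),
]
print(answer_with_fallback("How is the virus spreading?", hits, fake_reader))
print(answer_with_fallback("Something off-topic", [("Unrelated FAQ answer.", 0.2)], fake_reader))
```

The threshold decides how often the slower extractive path runs, so it trades speed and answer control against coverage.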
Work through it in Colaboratory.
A GPU needs to be enabled, so select "T4 GPU" under "Notebook settings".
Install.
%%bash
pip install --upgrade pip
pip install farm-haystack[colab,inference]
Enable telemetry.
from haystack.telemetry import tutorial_running
tutorial_running(4)
Configure logging.
import logging
logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
Initialize the DocumentStore. Keep it simple and use the in-memory one.
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
Initialize the Retriever. This time we use vector similarity search, so we use EmbeddingRetriever.
from haystack.nodes import EmbeddingRetriever
retriever = EmbeddingRetriever(
document_store=document_store,
embedding_model="sentence-transformers/all-MiniLM-L6-v2",
use_gpu=True,
scale_score=False,
)
Index the FAQ data.
import pandas as pd
from haystack.utils import fetch_archive_from_http
# Download
doc_dir = "data/tutorial4"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
# Load a DataFrame containing "question", "answer", and custom metadata
df = pd.read_csv(f"{doc_dir}/small_faq_covid.csv")
# Minimal cleaning
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())
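# Create embeddings for the FAQ questions
# Unlike most other search use cases, we do not create the embeddings from the document content here.
# We want to match "incoming question" <-> "stored question", so we embed the additional text field "question".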
questions = list(df["question"].values)
df["embedding"] = retriever.embed_queries(queries=questions).tolist()
df = df.rename(columns={"question": "content"})
# Convert the DataFrame to a list of dicts and index them in the DocumentStore
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)
- Download the COVID-19 FAQ CSV and load it into a pandas DataFrame.
- Create embeddings for the "question" column of the DataFrame. EmbeddingRetriever has a method called embed_queries that can be used to create them.
- Add the DataFrame to the DocumentStore.
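The idea behind these steps, matching an incoming question against stored questions rather than against answer text, can be sketched without Haystack at all. The following is a minimal illustration under loud assumptions: `toy_embed` is a hypothetical word-count stand-in for the real sentence-embedding model, and the two FAQ rows are made up for the example rather than taken from the CSV.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a sentence-embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up FAQ rows for illustration only
faq = [
    {"question": "How does the virus spread?", "answer": "Mainly from person to person."},
    {"question": "What are the symptoms?", "answer": "Fever, cough, and shortness of breath."},
]

# Index time: embed the stored *questions*, not the answers
index = [(toy_embed(row["question"]), row) for row in faq]


def search(query: str, top_k: int = 1) -> List[Tuple[float, Dict[str, str]]]:
    """Embed the query the same way and rank stored questions by similarity."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, vec), row) for vec, row in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


score, row = search("How is the virus spreading?")[0]
print(round(score, 3), row["answer"])
```

A real embedding model would also match paraphrases that share no words with the stored question, which word counts cannot do.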
Try a search using the built-in FAQPipeline.
from haystack.pipelines import FAQPipeline
from haystack.utils import print_answers
pipe = FAQPipeline(retriever=retriever)
prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 3}})
print_answers(prediction, details="medium")
The query also has to be turned into an embedding, but FAQPipeline presumably takes care of that internally, most likely by calling a method on the retriever passed in as an argument.
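The printed result that follows is consistent with a two-step flow: the retriever returns the stored questions closest to the query, and a second step turns each hit into an answer taken from that document's metadata. Note that `context` is identical to `answer` in the output. Below is a minimal sketch of that second step, assuming this flow. `RetrievedDoc` and `docs_to_answers` are hypothetical stand-ins written for this note, not Haystack classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RetrievedDoc:
    """Hypothetical shape of one retrieval hit."""
    content: str  # the stored FAQ question that matched
    score: float  # similarity between the query and the stored question
    meta: Dict[str, Any] = field(default_factory=dict)


def docs_to_answers(docs: List[RetrievedDoc]) -> List[Dict[str, Any]]:
    """Turn retrieval hits into answer records, keeping the retrieval score."""
    answers = []
    for doc in docs:
        faq_answer = doc.meta.get("answer", "")
        answers.append(
            {
                "answer": faq_answer,
                "context": faq_answer,  # mirrors the printed output, where both match
                "score": doc.score,
                "matched_question": doc.content,
            }
        )
    return answers


hits = [
    RetrievedDoc("How does the virus spread?", 0.93, {"answer": "Mainly from person to person."}),
    RetrievedDoc("What is a coronavirus?", 0.69, {"answer": "A large family of viruses."}),
]
for item in docs_to_answers(hits):
    print(item["score"], item["answer"])
```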
And here is the answer:
'Query: How is the virus spreading?'
'Answers:'
[ { 'answer': 'This virus was first detected in Wuhan City, Hubei '
'Province, China. The first infections were linked to a live '
'animal market, but the virus is now spreading from '
'person-to-person. It’s important to note that '
'person-to-person spread can happen on a continuum. Some '
'viruses are highly contagious (like measles), while other '
'viruses are less so.\n'
'\n'
'The virus that causes COVID-19 seems to be spreading easily '
'and sustainably in the community (“community spread”) in '
'some affected geographic areas. Community spread means '
'people have been infected with the virus in an area, '
'including some who are not sure how or where they became '
'infected.\n'
'\n'
'Learn what is known about the spread of newly emerged '
'coronaviruses.',
'context': 'This virus was first detected in Wuhan City, Hubei '
'Province, China. The first infections were linked to a '
'live animal market, but the virus is now spreading from '
'person-to-person. It’s important to note that '
'person-to-person spread can happen on a continuum. Some '
'viruses are highly contagious (like measles), while other '
'viruses are less so.\n'
'\n'
'The virus that causes COVID-19 seems to be spreading '
'easily and sustainably in the community (“community '
'spread”) in some affected geographic areas. Community '
'spread means people have been infected with the virus in '
'an area, including some who are not sure how or where they '
'became infected.\n'
'\n'
'Learn what is known about the spread of newly emerged '
'coronaviruses.',
'score': 0.9358832836151123},
{ 'answer': 'The novel coronavirus SARS-CoV-2 spreads from person to '
'person. Droplet infection is the main mode of transmission. '
'Transmission can take place directly, from '
'person-to-person, or indirectly through contact between '
'hands and the mucous membranes of the mouth, the nose or '
'the conjunctiva of the eyes. There have been reports of '
'persons who were infected by individuals who had only shown '
'slight or non-specific symptoms of disease. The percentage '
'of asymptomatic cases is unclear; according to data from '
'WHO and China, however, such cases do not play a '
'significant role in the spread of SARS-CoV-2.',
'context': 'The novel coronavirus SARS-CoV-2 spreads from person to '
'person. Droplet infection is the main mode of '
'transmission. Transmission can take place directly, from '
'person-to-person, or indirectly through contact between '
'hands and the mucous membranes of the mouth, the nose or '
'the conjunctiva of the eyes. There have been reports of '
'persons who were infected by individuals who had only '
'shown slight or non-specific symptoms of disease. The '
'percentage of asymptomatic cases is unclear; according to '
'data from WHO and China, however, such cases do not play a '
'significant role in the spread of SARS-CoV-2.',
'score': 0.8732742071151733},
{ 'answer': 'Coronaviruses are a large family of viruses. Some cause '
'illness in people, and others, such as canine and feline '
'coronaviruses, only infect animals. Rarely, animal '
'coronaviruses that infect animals have emerged to infect '
'people and can spread between people. This is suspected to '
'have occurred for the virus that causes COVID-19. Middle '
'East Respiratory Syndrome (MERS) and Severe Acute '
'Respiratory Syndrome (SARS) are two other examples of '
'coronaviruses that originated from animals and then spread '
'to people. More information about the source and spread of '
'COVID-19 is available on the Situation Summary: Source and '
'Spread of the Virus.',
'context': 'Coronaviruses are a large family of viruses. Some cause '
'illness in people, and others, such as canine and feline '
'coronaviruses, only infect animals. Rarely, animal '
'coronaviruses that infect animals have emerged to infect '
'people and can spread between people. This is suspected to '
'have occurred for the virus that causes COVID-19. Middle '
'East Respiratory Syndrome (MERS) and Severe Acute '
'Respiratory Syndrome (SARS) are two other examples of '
'coronaviruses that originated from animals and then spread '
'to people. More information about the source and spread of '
'COVID-19 is available on the Situation Summary: Source and '
'Spread of the Virus.',
'score': 0.6935744285583496}]
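The built-in pipelines presumably have well-defined input and output formats, so when building this yourself you would probably need to put documents into the DocumentStore in a shape that matches those definitions.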
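Judging from the indexing code earlier in this scrap, each record carries the FAQ question under `content`, the precomputed vector under `embedding`, and the remaining columns such as `answer` alongside them. Here is a rough sketch of building records in that shape without pandas. `to_faq_records` and `fake_embed` are hypothetical, and how Haystack stores the extra columns internally is not confirmed here.

```python
from typing import Any, Callable, Dict, List


def to_faq_records(
    rows: List[Dict[str, Any]],
    embed: Callable[[str], List[float]],
) -> List[Dict[str, Any]]:
    """Rename "question" to "content" and attach an embedding of the question."""
    records = []
    for row in rows:
        # Keep every column except "question" as-is (e.g. "answer", custom metadata)
        record = {key: value for key, value in row.items() if key != "question"}
        question = row["question"].strip()
        record["content"] = question
        record["embedding"] = embed(question)
        records.append(record)
    return records


def fake_embed(text: str) -> List[float]:
    """Hypothetical embedding function; a real one would call the retriever's model."""
    return [float(len(text)), float(text.count(" "))]


rows = [{"question": " How does the virus spread? ", "answer": "Mainly from person to person.", "lang": "en"}]
print(to_faq_records(rows, fake_embed))
```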
This scrap was closed on 2023/10/03.