
Working through the Haystack tutorial: Utilizing Existing FAQs for Question Answering

kun432

https://haystack.deepset.ai/tutorials/04_faq_style_qa

Extractive question answering works on plain text and is therefore more general, but there is another common approach that makes use of existing FAQ data.

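  • Pros:
    • Very fast at inference time.
    • Makes use of existing FAQ data.
    • Gives you a lot of control over the answers.
  • Cons:
    • Generalizability: it can only answer questions that are similar to ones already in the FAQ.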

In some use cases, combining extractive QA with FAQ-style QA can also be an interesting option.
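
One way such a combination could look is to try the FAQ match first and fall back to extractive QA when the best FAQ score is low. This is only a minimal sketch and not part of the tutorial: `answer_with_fallback`, the two stub backends, and the 0.8 threshold are hypothetical names and values picked for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch (not Haystack APIs).
FaqSearch = Callable[[str], List[Tuple[str, float]]]  # query -> [(answer, score), ...]
ExtractiveQA = Callable[[str], str]                   # query -> extracted answer


def answer_with_fallback(
    query: str,
    faq_search: FaqSearch,
    extractive_qa: ExtractiveQA,
    threshold: float = 0.8,  # assumed cut-off; would need tuning on real data
) -> Tuple[str, str]:
    """Return (answer, source), preferring a confident FAQ hit."""
    hits = faq_search(query)
    if hits:
        best_answer, best_score = max(hits, key=lambda hit: hit[1])
        if best_score >= threshold:
            return best_answer, "faq"
    # No FAQ entry was similar enough, so use the more general approach.
    return extractive_qa(query), "extractive"


# Stub backends standing in for a real FAQ retriever and a real reader.
def fake_faq_search(query: str) -> List[Tuple[str, float]]:
    score = 0.93 if "spread" in query.lower() else 0.41
    return [("It spreads from person to person.", score)]


def fake_extractive_qa(query: str) -> str:
    return "(span extracted from raw documents)"


print(answer_with_fallback("How is the virus spreading?", fake_faq_search, fake_extractive_qa))
print(answer_with_fallback("What is the capital of France?", fake_faq_search, fake_extractive_qa))
```

The fast, controlled FAQ answers are used whenever they apply, and the slower but more general extractive path only handles the remainder.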


I'll work through this in Colaboratory.

A GPU has to be enabled, so select "T4 GPU" under "Notebook settings".

Installation.

%%bash

pip install --upgrade pip
pip install farm-haystack[colab,inference]

Enable telemetry.

from haystack.telemetry import tutorial_running

tutorial_running(4)

Logging setup.

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

Initialize the DocumentStore. Keeping it simple with the in-memory one.

from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

Initialize the Retriever. This time we use vector similarity search, so we use EmbeddingRetriever.

from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    scale_score=False,
)

Index the FAQ data.

import pandas as pd
from haystack.utils import fetch_archive_from_http

# Download
doc_dir = "data/tutorial4"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Load a dataframe containing "question", "answer", and custom metadata
df = pd.read_csv(f"{doc_dir}/small_faq_covid.csv")
# Minimal cleaning
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())

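# Create embeddings for the FAQ questions.
# Unlike most other search use cases, we do not embed the document content here.
# We want to match "incoming question" <-> "stored question", so the embeddings are created from the extra text field "question".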
questions = list(df["question"].values)
df["embedding"] = retriever.embed_queries(queries=questions).tolist()
df = df.rename(columns={"question": "content"})

# Convert the dataframe to a list of dicts and index them in the DocumentStore
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)
  • Download the COVID-19 FAQ CSV and load it into a Pandas dataframe.
  • Create embeddings for the "question" column of the dataframe. EmbeddingRetriever has an embed_queries method that can be used to create them.
  • Add the dataframe to the DocumentStore.
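
To make the shape of the indexed records concrete, here is a small standard-library sketch of the same steps. It makes several assumptions: `SAMPLE_CSV` is made-up data rather than the real FAQ file, `toy_embed` is a placeholder for `retriever.embed_queries`, and `build_records` is a hypothetical helper name.

```python
import csv
import io

# Made-up two-row CSV in the spirit of the FAQ file (not the real data).
SAMPLE_CSV = """question,answer,source
 What is a coronavirus? ,A large family of viruses.,example
How does it spread?,Mainly from person to person.,
"""


def toy_embed(text, dim=8):
    """Placeholder for a real embedding model: a character-hash count vector."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def build_records(csv_text):
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Minimal cleaning: missing values become "", questions are stripped.
        row = {key: (value or "") for key, value in row.items()}
        question = row.pop("question").strip()
        # The embedding comes from the question, not from the answer,
        # because incoming questions are matched against stored questions.
        row["embedding"] = toy_embed(question)
        # The question is stored under "content", mirroring the rename above.
        row["content"] = question
        records.append(row)
    return records


docs_to_index = build_records(SAMPLE_CSV)
for doc in docs_to_index:
    print(doc["content"], "->", doc["answer"], "| embedding dim:", len(doc["embedding"]))
```

Each record ends up with the question under "content", its vector under "embedding", and "answer" plus any other columns carried along as extra fields.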

Try a search with the built-in FAQPipeline.

from haystack.pipelines import FAQPipeline
from haystack.utils import print_answers

pipe = FAQPipeline(retriever=retriever)

prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 3}})

print_answers(prediction, details="medium")

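The query has to be embedded as well, but presumably FAQPipeline takes care of that internally. I assume it uses a method of the retriever passed in as an argument.

And here is the output of the pipe.run call above: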

'Query: How is the virus spreading?'
'Answers:'
[   {   'answer': 'This virus was first detected in Wuhan City, Hubei '
                  'Province, China. The first infections were linked to a live '
                  'animal market, but the virus is now spreading from '
                  'person-to-person. It’s important to note that '
                  'person-to-person spread can happen on a continuum. Some '
                  'viruses are highly contagious (like measles), while other '
                  'viruses are less so.\n'
                  '\n'
                  'The virus that causes COVID-19 seems to be spreading easily '
                  'and sustainably in the community (“community spread”) in '
                  'some affected geographic areas. Community spread means '
                  'people have been infected with the virus in an area, '
                  'including some who are not sure how or where they became '
                  'infected.\n'
                  '\n'
                  'Learn what is known about the spread of newly emerged '
                  'coronaviruses.',
        'context': 'This virus was first detected in Wuhan City, Hubei '
                   'Province, China. The first infections were linked to a '
                   'live animal market, but the virus is now spreading from '
                   'person-to-person. It’s important to note that '
                   'person-to-person spread can happen on a continuum. Some '
                   'viruses are highly contagious (like measles), while other '
                   'viruses are less so.\n'
                   '\n'
                   'The virus that causes COVID-19 seems to be spreading '
                   'easily and sustainably in the community (“community '
                   'spread”) in some affected geographic areas. Community '
                   'spread means people have been infected with the virus in '
                   'an area, including some who are not sure how or where they '
                   'became infected.\n'
                   '\n'
                   'Learn what is known about the spread of newly emerged '
                   'coronaviruses.',
        'score': 0.9358832836151123},
    {   'answer': 'The novel coronavirus SARS-CoV-2 spreads from person to '
                  'person. Droplet infection is the main mode of transmission. '
                  'Transmission can take place directly, from '
                  'person-to-person, or indirectly through contact between '
                  'hands and the mucous membranes of the mouth, the nose or '
                  'the conjunctiva of the eyes. There have been reports of '
                  'persons who were infected by individuals who had only shown '
                  'slight or non-specific symptoms of disease. The percentage '
                  'of asymptomatic cases is unclear; according to data from '
                  'WHO and China, however, such cases do not play a '
                  'significant role in the spread of SARS-CoV-2.',
        'context': 'The novel coronavirus SARS-CoV-2 spreads from person to '
                   'person. Droplet infection is the main mode of '
                   'transmission. Transmission can take place directly, from '
                   'person-to-person, or indirectly through contact between '
                   'hands and the mucous membranes of the mouth, the nose or '
                   'the conjunctiva of the eyes. There have been reports of '
                   'persons who were infected by individuals who had only '
                   'shown slight or non-specific symptoms of disease. The '
                   'percentage of asymptomatic cases is unclear; according to '
                   'data from WHO and China, however, such cases do not play a '
                   'significant role in the spread of SARS-CoV-2.',
        'score': 0.8732742071151733},
    {   'answer': 'Coronaviruses are a large family of viruses. Some cause '
                  'illness in people, and others, such as canine and feline '
                  'coronaviruses, only infect animals. Rarely, animal '
                  'coronaviruses that infect animals have emerged to infect '
                  'people and can spread between people. This is suspected to '
                  'have occurred for the virus that causes COVID-19. Middle '
                  'East Respiratory Syndrome (MERS) and Severe Acute '
                  'Respiratory Syndrome (SARS) are two other examples of '
                  'coronaviruses that originated from animals and then spread '
                  'to people. More information about the source and spread of '
                  'COVID-19 is available on the Situation Summary: Source and '
                  'Spread of the Virus.',
        'context': 'Coronaviruses are a large family of viruses. Some cause '
                   'illness in people, and others, such as canine and feline '
                   'coronaviruses, only infect animals. Rarely, animal '
                   'coronaviruses that infect animals have emerged to infect '
                   'people and can spread between people. This is suspected to '
                   'have occurred for the virus that causes COVID-19. Middle '
                   'East Respiratory Syndrome (MERS) and Severe Acute '
                   'Respiratory Syndrome (SARS) are two other examples of '
                   'coronaviruses that originated from animals and then spread '
                   'to people. More information about the source and spread of '
                   'COVID-19 is available on the Situation Summary: Source and '
                   'Spread of the Virus.',
        'score': 0.6935744285583496}]
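
On the guess that FAQPipeline handles the query embedding internally: as far as I understand Haystack 1.x, the pipeline is a Retriever followed by a Docs2Answers node that copies each document's stored answer into the result. That would also explain why 'context' is identical to 'answer' in the output above. Below is a minimal sketch of that flow in plain Python. It is not Haystack's actual code: `faq_lookup`, `cosine`, and the hand-written 3-dimensional embeddings are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def faq_lookup(query_embedding, docs, top_k=3):
    """Rank stored questions by similarity and return their stored answers."""
    scored = [(cosine(query_embedding, doc["embedding"]), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # "Docs to answers" step: the answer is not extracted from text,
    # it is simply the answer stored next to the matched question.
    return [
        {"answer": doc["answer"], "context": doc["answer"], "score": score}
        for score, doc in scored[:top_k]
    ]


# Toy index with hand-written embeddings (a real setup would use a model).
docs = [
    {"content": "How does the virus spread?", "answer": "From person to person.", "embedding": [1.0, 0.0, 0.0]},
    {"content": "What are the symptoms?", "answer": "Fever and cough.", "embedding": [0.0, 1.0, 0.0]},
    {"content": "Where did it originate?", "answer": "Wuhan, China.", "embedding": [0.6, 0.0, 0.8]},
]

# Pretend this vector is the embedded form of "How is the virus spreading?".
query_embedding = [0.9, 0.1, 0.0]

for hit in faq_lookup(query_embedding, docs, top_k=2):
    print(round(hit["score"], 3), hit["answer"])
```

The point of the sketch is that no reader model is involved: the quality of the result depends entirely on how well the incoming question matches a stored question.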

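The built-in pipelines presumably have well-defined input and output formats, so when doing this yourself you would probably need to write documents into the DocumentStore in a shape that matches that definition.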
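
A small pre-flight check along those lines could look like the sketch below. The required keys ("content", "answer", "embedding") simply mirror what the tutorial code writes. `check_faq_records` is a hypothetical helper, and whether Haystack needs anything beyond these fields is not verified here.

```python
def check_faq_records(records):
    """Return a list of problems; an empty list means the records look usable."""
    problems = []
    dims = set()
    for i, rec in enumerate(records):
        content = rec.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"record {i}: 'content' must be a non-empty question string")
        if "answer" not in rec:
            problems.append(f"record {i}: missing 'answer'")
        embedding = rec.get("embedding")
        if not isinstance(embedding, list) or not embedding:
            problems.append(f"record {i}: 'embedding' must be a non-empty list")
        else:
            dims.add(len(embedding))
    # All vectors must share one dimension to be comparable.
    if len(dims) > 1:
        problems.append(f"embeddings have inconsistent dimensions: {sorted(dims)}")
    return problems


good = [{"content": "How does it spread?", "answer": "Person to person.", "embedding": [0.1, 0.2]}]
bad = [
    {"content": "", "embedding": [0.1, 0.2]},
    {"content": "Q?", "answer": "A.", "embedding": [0.1, 0.2, 0.3]},
]

print(check_faq_records(good))
for problem in check_faq_records(bad):
    print(problem)
```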

This scrap was closed on 2023/10/03.