Trying out Crawl4AI, which scrapes websites into data that is easy to use with LLMs

I have tried several scraping and format-conversion tools for use with LLMs, RAG, and agents, including the following.
https://zenn.dev/kun432/scraps/feec7e2370450c
https://zenn.dev/kun432/scraps/ce65ff8231911d
https://zenn.dev/kun432/scraps/58fce97899cfdd
I happened to find the following YouTube video, which is how I learned about Crawl4AI, so I'm going to try it.
https://www.youtube.com/watch?v=od6AaKhKYmg

kun432

GitHub repository
https://github.com/unclecode/crawl4ai

Crawl4AI 🕷️🤖
Crawl4AI simplifies web crawling and data extraction, making them available to large language models (LLMs) and AI applications. 🆓🌐


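Features ✨
🆓 Completely free and open source
🤖 LLM-friendly output formats (JSON, cleaned HTML, Markdown)
🌍 Supports crawling multiple URLs simultaneously
🎨 Extracts and returns all media tags (images, audio, video)
🔗 Extracts all external and internal links
📚 Extracts metadata from the page
🔄 Custom hooks for authentication, headers, and page modifications before crawling
🕵️ User-agent customization
🖼️ Takes screenshots of the page
📜 Executes multiple custom JavaScript snippets before crawling
📚 Various chunking strategies: topic-based, regex, sentence, and more
🧠 Advanced extraction strategies: cosine clustering, LLM, and more
🎯 CSS selector support
📝 Accepts instructions/keywords to refine the extraction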

Documentation
https://crawl4ai.com/mkdocs/

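Crawl4AI
Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.
Try the Demo
Just try it now and crawl different pages to see how it works. You can set the links, see the structure of the output, and also view Python sample code showing how to run it. The old demo is available at /old_demo, where you can see more details.

Introduction
Crawl4AI has one clear task: to make crawling and data extraction from web pages easy and efficient, especially for large language models (LLMs) and AI applications. Whether you use it as a REST API or as a Python library, Crawl4AI offers a robust and flexible solution.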


Installation / Quick Start

https://crawl4ai.com/mkdocs/quickstart/

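First, there are three ways to install it.

1. Install the Python library
2. Build with Docker and self-host the REST API
3. Pull the prebuilt image from DockerHub and self-host the REST API

As an aside, according to the documentation, there also appears to be a REST API hosted by Crawl4AI.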

Crawl4AI provides an easy way to crawl and extract data from web pages without installing any library. You can use the REST API on our server or run the local server on your machine.

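However, the API endpoint does not seem to be written anywhere, and the demo site does not seem to work either, so I cannot confirm whether this is true (I did find an endpoint that looks like it).

In addition, Crawl4AI can apparently handle chunking and extraction as well as crawling, and install options are provided for integrating with PyTorch or Transformers for that. In that case you apparently need to choose option 1 or 2.

For now I'll keep it simple and use the Python library, on Colaboratory.

Install from the GitHub repository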

!pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"

Minimal crawling code

from crawl4ai import WebCrawler

# Create a WebCrawler instance
crawler = WebCrawler()

# Warm up the crawler (loads the required models)
crawler.warmup()

# Run the crawler against a URL (here, the Crawl4AI documentation site)
result = crawler.run(url="https://crawl4ai.com/mkdocs/")

Judging from the output, it appears to use Selenium.

[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy
[LOG] 🌤️  Warming up the WebCrawler
[LOG] 🌞 WebCrawler is ready to crawl
[LOG] 🚀 Crawling done for https://crawl4ai.com/mkdocs/, success: True, time taken: 1.1354084014892578 seconds
[LOG] 🚀 Content extracted for https://crawl4ai.com/mkdocs/, success: True, time taken: 0.09711050987243652 seconds
[LOG] 🔥 Extracting semantic blocks for https://crawl4ai.com/mkdocs/, Strategy: NoExtractionStrategy
[LOG] 🚀 Extraction done for https://crawl4ai.com/mkdocs/, time taken: 0.09826350212097168 seconds.

The result is returned as a CrawlResult object.

Let's pull it out as Markdown.

print(result.markdown)

Crawl4AI Documentation

  * Home 
  * Demo 
  * Search 

  * Home 
  * Demo
  * First Steps 
    * Introduction
    * Installation
    * Quick Start
  * Examples 
    * Intro
    * LLM Extraction
    * JS Execution & CSS Filtering
    * Hooks & Auth
    * Summarization
    * Research Assistant
  * Full Details of Using Crawler 
    * Crawl Request Parameters
    * Crawl Result Class
    * Advanced Features
    * Chunking Strategies
    * Extraction Strategies
  * API Reference 
    * Core Classes and Functions
    * Detailed API Documentation
  * Miscellaneous 
    * Change Log
    * Contact

  * Crawl4AI v0.2.77
  * Try the Demo
  * Introduction
  * Quick Start
  * Documentation Structure
  * Get Started

# Crawl4AI v0.2.77

Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-
source Python library designed to simplify web crawling and extract useful
information from web pages. This documentation will guide you through the
features, usage, and customization of Crawl4AI.

## Try the Demo

Just try it now and crawl different pages to see how it works. You can set the
links, see the structures of the output, and also view the Python sample code
on how to run it. The old demo is available at /old_demo where you can see
more details.

## Introduction

Crawl4AI has one clear task: to make crawling and data extraction from web
pages easy and efficient, especially for large language models (LLMs) and AI
applications. Whether you are using it as a REST API or a Python library,
Crawl4AI offers a robust and flexible solution.

(snip)

You can also take a screenshot by passing screenshot=True to run().

import base64
from IPython.display import Image, display

result = crawler.run(url="https://crawl4ai.com/mkdocs/", screenshot=True)
display(Image(base64.b64decode(result.screenshot), width=500, height=500))
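
Outside a notebook, the same base64 payload could be written to a file instead of displayed. The following is a minimal sketch, assuming result.screenshot is a base64-encoded string (as the b64decode call above implies). The helper name save_screenshot and the file name are hypothetical, and the image format is not confirmed here.

```python
import base64
import os
import tempfile
from pathlib import Path


def save_screenshot(screenshot_b64: str, path: str) -> int:
    """Decode a base64-encoded screenshot and write it to disk.

    Returns the number of bytes written.
    """
    data = base64.b64decode(screenshot_b64)
    return Path(path).write_bytes(data)


# Stand-in payload for illustration; with the crawler you would pass
# result.screenshot instead of this fake value.
fake_payload = base64.b64encode(b"not a real image").decode("ascii")
target = os.path.join(tempfile.mkdtemp(), "screenshot.bin")
written = save_screenshot(fake_payload, target)
print(f"wrote {written} bytes to {target}")
```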

By default, crawl results are cached, so crawling the same URL again returns the cached result quickly. To force a fresh fetch instead of using the cache, pass bypass_cache=True to run().

result = crawler.run(url="https://crawl4ai.com/mkdocs/", bypass_cache=True)
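
To make the caching behavior concrete, here is a toy sketch of a URL-keyed cache with a bypass flag. It is not Crawl4AI's actual implementation (how and where the library stores its cache is not covered here). CachedFetcher and the fake fetch function are hypothetical names used only for illustration.

```python
from typing import Callable, Dict


class CachedFetcher:
    """Toy URL-keyed cache; an illustration only, not the library's code."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, str] = {}
        self.fetch_count = 0  # how many times the real fetch was performed

    def run(self, url: str, bypass_cache: bool = False) -> str:
        # Serve from the cache unless the caller asks to bypass it.
        if not bypass_cache and url in self._cache:
            return self._cache[url]
        self.fetch_count += 1
        content = self._fetch(url)
        self._cache[url] = content  # a forced fetch also refreshes the cache
        return content


fetcher = CachedFetcher(lambda url: f"<html>content of {url}</html>")
fetcher.run("https://example.com/")                     # fetched
fetcher.run("https://example.com/")                     # served from the cache
fetcher.run("https://example.com/", bypass_cache=True)  # fetched again
print(fetcher.fetch_count)  # 2
```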


Chunking

You can also split content into chunks while crawling. First, a plain crawl.

result1 = crawler.run(
    url="https://crawl4ai.com/mkdocs/",
    bypass_cache=True
)

The chunked data is stored in extracted_content of the CrawlResult.

print(result1.extracted_content)

[
    {
        "index": 0,
        "tags": [],
        "content": "Crawl4AI Documentation"
    },
    {
        "index": 1,
        "tags": [],
        "content": "  * Home \n  * Demo \n  * Search "
    },
    {
        "index": 2,
        "tags": [],
        "content": "  * Home \n  * Demo\n  * First Steps \n    * Introduction\n    * Installation\n    * Quick Start\n  * Examples \n    * Intro\n    * LLM Extraction\n    * JS Execution & CSS Filtering\n    * Hooks & Auth\n    * Summarization\n    * Research Assistant\n  * Full Details of Using Crawler \n    * Crawl Request Parameters\n    * Crawl Result Class\n    * Advanced Features\n    * Chunking Strategies\n    * Extraction Strategies\n  * API Reference \n    * Core Classes and Functions\n    * Detailed API Documentation\n  * Miscellaneous \n    * Change Log\n    * Contact"
    },
    {
        "index": 3,
        "tags": [],
        "content": "  * Crawl4AI v0.2.77\n  * Try the Demo\n  * Introduction\n  * Quick Start\n  * Documentation Structure\n  * Get Started"
    },
    (snip)

By default, text is split on the separator string "\n\n". This can be changed: RegexChunking lets you specify the separator as a string or a regular expression.

from crawl4ai.chunking_strategy import RegexChunking

result2 = crawler.run(
    url="https://crawl4ai.com/mkdocs/",
    chunking_strategy=RegexChunking(patterns=[r'\n']),
    bypass_cache=True
)

[
    {
        "index": 0,
        "tags": [],
        "content": "Crawl4AI Documentation"
    },
    {
        "index": 1,
        "tags": [],
        "content": ""
    },
    {
        "index": 2,
        "tags": [],
        "content": "  * Home "
    },
    {
        "index": 3,
        "tags": [],
        "content": "  * Demo "
    },
    (snip)

print(len(result1.extracted_content))
print(len(result2.extracted_content))

7927
16966

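You can see that the chunking differs. Note that extracted_content appears to be a JSON string, so these values are presumably character counts rather than chunk counts. The default is equivalent to RegexChunking(patterns=["\n\n"]).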
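
To count chunks rather than characters, the JSON could be parsed first. The following is a small sketch, assuming extracted_content is a JSON string shaped like the output printed above. The sample string is a stand-in, not real crawler output.

```python
import json

# Stand-in for result.extracted_content, shaped like the output shown above.
extracted_content = json.dumps(
    [
        {"index": 0, "tags": [], "content": "Crawl4AI Documentation"},
        {"index": 1, "tags": [], "content": "  * Home \n  * Demo \n  * Search "},
    ],
    indent=4,
)

chunks = json.loads(extracted_content)
print(len(extracted_content))  # length of the JSON text in characters
print(len(chunks))             # number of chunks
```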

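The following chunking classes are also available.

NlpSentenceChunking
- Splits at sentence boundaries using an NLP model that segments text into sentences.
- It appears to use nltk.tokenize.punkt.
TopicSegmentationChunking
- Splits at topic/theme boundaries using the TextTiling algorithm.
- It appears to use nltk.tokenize.TextTilingTokenizer().
FixedLengthWordChunking
- Splits text into chunks based on a specified number of words.
- However, it seems to count words by splitting on spaces, so it is unlikely to work for Japanese.
SlidingWindowChunking
- Creates overlapping chunks using a sliding-window approach.
- Each chunk has a fixed length, and the window slides by the specified step size.
- This is close to what is usually called chunk overlap.
- However, the window size and step size are specified in words, so this is also unlikely to work for Japanese.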
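
The limitation with Japanese can be illustrated with a small standalone sketch. It assumes, as noted above, that words are counted by splitting on whitespace. The two functions below are simplified re-implementations written for this note, not the library's code, and their names are hypothetical.

```python
from typing import List


def fixed_length_word_chunks(text: str, chunk_size: int) -> List[str]:
    """Group whitespace-separated words into chunks of chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def sliding_window_chunks(text: str, window_size: int, step: int) -> List[str]:
    """Overlapping windows of window_size words, advancing by step words.

    In this toy version, trailing words that do not fill a window are dropped.
    """
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    return [
        " ".join(words[i:i + window_size])
        for i in range(0, len(words) - window_size + 1, step)
    ]


english = "Crawl4AI makes crawling and data extraction easy"
japanese = "Crawl4AIはウェブクロールとデータ抽出を簡素化します。"

print(fixed_length_word_chunks(english, 3))
print(sliding_window_chunks(english, 4, 2))
# Japanese text has no spaces between words, so everything lands in one chunk.
print(fixed_length_word_chunks(japanese, 3))
```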

Let's try NlpSentenceChunking.

from crawl4ai.chunking_strategy import NlpSentenceChunking

result = crawler.run(
    url="https://crawl4ai.com/mkdocs/",
    chunking_strategy=NlpSentenceChunking(),
    bypass_cache=True
)
print(result.extracted_content)

You can see that the chunks are split differently from the previous examples.

[
    {
        "index": 0,
        "tags": [],
        "content": "This documentation will guide you through the\nfeatures, usage, and customization of Crawl4AI."
    },
    {
        "index": 1,
        "tags": [],
        "content": "Whether you are a beginner or an\nadvanced user, Crawl4AI has something to offer to make your web crawling and\ndata extraction tasks easier and more efficient."
    },
    {
        "index": 2,
        "tags": [],
        "content": "## Get Started\n\nTo get started with Crawl4AI, follow the quick start guide above or explore\nthe detailed sections of this documentation."
    },
    {
        "index": 3,
        "tags": [],
        "content": "## Documentation Structure\n\nThis documentation is organized into several sections to help you navigate and\nfind the information you need quickly:\n\n### Home\n\nAn introduction to Crawl4AI, including a quick start guide and an overview of\nthe documentation structure."
    },

The chunking classes also appear to be usable on their own.

from crawl4ai.chunking_strategy import RegexChunking

patterns = [r'\n\n', r'\. ']
chunker = RegexChunking(patterns=patterns)

text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."

chunks = chunker.chunk(text)
print(chunks)

['This is a sample text', 'It will be split into chunks.', 'This is another paragraph.']
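
The result above is consistent with applying each pattern in turn to the pieces produced by the previous one. The following sketch uses only the standard re module and assumes that sequential behavior. It is not taken from the library source, and regex_chunk is a hypothetical name.

```python
import re
from typing import List


def regex_chunk(text: str, patterns: List[str]) -> List[str]:
    """Split text by each pattern in order, re-splitting earlier pieces."""
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    return chunks


text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
print(regex_chunk(text, [r"\n\n", r"\. "]))
```

With the same text and patterns, this sketch produces the same three chunks as the output shown above.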

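However, as I touched on above, there are not many chunking options if you want to use it with Japanese. From a quick look at the code, the only one that seems usable is TopicSegmentationChunking; for the rest, it looks like you would have to modify the code directly (the class interfaces expose only limited parts to the outside). Even for TopicSegmentationChunking, I have not used NLTK much so I am not sure, but I expect you would at least need to implement a tokenizer to use it with Japanese.

So for Japanese, I think the only realistic options are RegexChunking or writing your own class. It may well be useful for scraping English sites, though.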
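
As a rough idea of what "writing your own class" might look like, here is a standalone sketch that splits after Japanese sentence-ending punctuation and exposes a chunk(text) method like the one used above. The class name JapaneseSentenceChunking is hypothetical, and it has not been tested with crawler.run; plugging it in would presumably require deriving from the library's chunking base class, which is an assumption here.

```python
import re
from typing import List


class JapaneseSentenceChunking:
    """Hypothetical chunker: splits after Japanese sentence terminators."""

    def __init__(self, terminators: str = "。！？") -> None:
        # The lookbehind keeps each terminator attached to its sentence.
        # Splitting on a zero-width match requires Python 3.7 or later.
        self._pattern = re.compile(f"(?<=[{re.escape(terminators)}])")

    def chunk(self, text: str) -> List[str]:
        parts = self._pattern.split(text)
        # Drop empty pieces and surrounding whitespace.
        return [part.strip() for part in parts if part.strip()]


chunker = JapaneseSentenceChunking()
for sentence in chunker.chunk("今日は晴れです。明日は雨ですか？たぶん。"):
    print(sentence)
```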

For details on the chunking classes, see the "Chunking Strategies" page of the documentation.


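Extraction

The difference from chunking is a little hard to see, but it appears that after chunking, extraction clusters the chunks, tags them, and keeps only the chunks from the top clusters (?).

CosineStrategy uses an embedding model and clusters by cosine similarity.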

from crawl4ai.extraction_strategy import CosineStrategy

result = crawler.run(
    url="https://crawl4ai.com/mkdocs/",
    extraction_strategy=CosineStrategy(
        word_count_threshold=10, 
        max_dist=0.2, 
        linkage_method="ward", 
        top_k=3,
        model_name='BAAI/bge-small-en-v1.5',     # default model
    ),
    bypass_cache=True
)
print(result.extracted_content)

[
    {
        "index": 1,
        "tags": [
            "science_&_technology"
        ],
        "content": "  * Home \n  * Demo\n  * First Steps \n    * Introduction\n    * Installation\n    * Quick Start\n  * Examples \n    * Intro\n    * LLM Extraction\n    * JS Execution & CSS Filtering\n    * Hooks & Auth\n    * Summarization\n    * Research Assistant\n  * Full Details of Using Crawler \n    * Crawl Request Parameters\n    * Crawl Result Class\n    * Advanced Features\n    * Chunking Strategies\n    * Extraction Strategies\n  * API Reference \n    * Core Classes and Functions\n    * Detailed API Documentation\n  * Miscellaneous \n    * Change Log\n    * Contact   * LLM Extraction\n  * JS Execution & CSS Filtering\n  * Hooks & Auth\n  * Summarization\n  * Research Assistant"
    },
    {
        "index": 2,
        "tags": [
            "science_&_technology"
        ],
        "content": "### Full Details of Using Crawler Comprehensive details on using the crawler, including:"
    },
    {
        "index": 3,
        "tags": [
            "learning_&_educational"
        ],
        "content": "  * Crawl Request Parameters\n  * Crawl Result Class\n  * Advanced Features\n  * Chunking Strategies\n  * Extraction Strategies"
    },
    (snip)

With semantic_filter, chunks are apparently filtered by similarity to the given keyword before clustering.

from crawl4ai.extraction_strategy import CosineStrategy

result = crawler.run(
    url="https://crawl4ai.com/mkdocs/",
    extraction_strategy=CosineStrategy(
        word_count_threshold=10, 
        max_dist=0.2, 
        linkage_method="ward", 
        top_k=3,
        model_name='BAAI/bge-small-en-v1.5',
        semantic_filter="Quick Start"
    ),
    bypass_cache=True
)
print(result.extracted_content)

I can't really tell whether it had any effect, though.

[
    {
        "index": 1,
        "tags": [
            "science_&_technology"
        ],
        "content": "    \n    \n    from crawl4ai import WebCrawler\n    # Create an instance of WebCrawler\n    crawler = WebCrawler()\n    # Warm up the crawler (load necessary models)\n    crawler.warmup()\n    # Run the crawler on a URL\n    result = crawler.run(url=\\\"https://www.nbcnews.com/business\\\")\n    # Print the extracted content\n    print(result.extracted_content)\n    \n```   1. Importing the Library: We start by importing the WebCrawler class from the crawl4ai library.\n  2. Creating an Instance: An instance of WebCrawler is created.\n  3. Warming Up: The warmup() method prepares the crawler by loading necessary models and settings.\n  4. Running the Crawler: The run() method is used to crawl the specified URL and extract meaningful content.\n  5. Printing the Result: The extracted content is printed, showcasing the data extracted from the web page."
    },
    {
        "index": 2,
        "tags": [
            "science_&_technology"
        ],
        "content": "Welcome to the official documentation for Crawl4AI! \ud83d\udd77\ufe0f\ud83e\udd16 Crawl4AI is an open-\nsource Python library designed to simplify web crawling and extract useful\ninformation from web pages. This documentation will guide you through the\nfeatures, usage, and customization of Crawl4AI. Crawl4AI has one clear task: to make crawling and data extraction from web\npages easy and efficient, especially for large language models (LLMs) and AI\napplications. Whether you are using it as a REST API or a Python library,\nCrawl4AI offers a robust and flexible solution."
    },
    {
        "index": 3,
        "tags": [
            "science_&_technology"
        ],
        "content": "Crawl4AI Documentation   * Crawl4AI v0.2.77\n  * Try the Demo\n  * Introduction\n  * Quick Start\n  * Documentation Structure\n  * Get Started # Crawl4AI v0.2.77 Here\\'s a quick example to show you how easy it is to use Crawl4AI:"
    },
    (snip)
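
To illustrate the idea of filtering chunks by similarity to a keyword, here is a toy sketch. It uses bag-of-words count vectors in place of a real embedding model such as the one named above, so the scores are only illustrative. The function names and the threshold value are hypothetical and are not the library's API.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_keyword(chunks: List[str], query: str, threshold: float) -> List[str]:
    """Keep only chunks whose similarity to the query meets the threshold."""
    query_vec = bag_of_words(query)
    return [
        chunk for chunk in chunks
        if cosine_similarity(bag_of_words(chunk), query_vec) >= threshold
    ]


chunks = [
    "Quick Start guide for the crawler",
    "Change Log and Contact",
    "Follow the quick start above",
]
print(filter_by_keyword(chunks, "Quick Start", threshold=0.3))
```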

There is also LLMExtractionStrategy, which performs extraction with an LLM (a completion model) instead of an embedding model.

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from google.colab import userdata

result = crawler.run(
    url="https://crawl4ai.com/mkdocs/",
    extraction_strategy=LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',
        api_token=userdata.get('OPENAI_API_KEY'),
    ),
    bypass_cache=True                 
)

print(result.extracted_content)

Oh, that changed quite a bit.

[
    {
        "index": 0,
        "tags": [
            "title"
        ],
        "content": [
            "Crawl4AI Documentation"
        ],
        "error": false
    },
    {
        "index": 1,
        "tags": [
            "navigation"
        ],
        "content": [
            "* Home",
            "* Demo",
            "* Search",
            "* Home",
            "* Demo",
            "* First Steps",
            "* Introduction",
            "* Installation",
            "* Quick Start",
            "* Examples",
            "* Intro",
            "* LLM Extraction",
            "* JS Execution & CSS Filtering",
            "* Hooks & Auth",
            "* Summarization",
            "* Research Assistant",
            "* Full Details of Using Crawler",
            "* Crawl Request Parameters",
            "* Crawl Result Class",
            "* Advanced Features",
            "* Chunking Strategies",
            "* Extraction Strategies",
            "* API Reference",
            "* Core Classes and Functions",
            "* Detailed API Documentation",
            "* Miscellaneous",
            "* Change Log",
            "* Contact"
        ],
        "error": false
    },
    {
        "index": 2,
        "tags": [
            "version"
        ],
        "content": [
            "Crawl4AI v0.2.77"
        ],
        "error": false
    },
    {
        "index": 3,
        "tags": [
            "demo"
        ],
        "content": [
            "Try the Demo"
        ],
        "error": false
    },
    (snip)

You can also pass an instruction prompt.

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from google.colab import userdata

result = crawler.run(
    url="https://crawl4ai.com/mkdocs/",
    extraction_strategy=LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',
        api_token=userdata.get('OPENAI_API_KEY'),
        instruction="Extract only Python code samples."
    ),
    bypass_cache=True                 
)

print(result.extracted_content)

[
    {
        "index": 0,
        "tags": [
            "python_code_sample"
        ],
        "content": [
            "``` from web pages easy and efficient, especially for large language models (LLMs) and AI applications. Whether you are using it as a REST API or a Python library, Crawl4AI offers a robust and flexible solution. ## Quick Start Here\\\\'s a quick example to show you how easy it is to use Crawl4AI: ```"
        ],
        "error": false
    },
    {
        "index": 0,
        "tags": [
            "python_code"
        ],
        "content": [
            "from crawl4ai import WebCrawler",
            "# Create an instance of WebCrawler",
            "crawler = WebCrawler()",
            "# Warm up the crawler (load necessary models)",
            "crawler.warmup()",
            "# Run the crawler on a URL",
            "result = crawler.run(url=\\\\\"https://www.nbcnews.com/business\\\\\")",
            "# Print the extracted content",
            "print(result.extracted_content)"
        ],
        "error": false
    }
]

Using an LLM looks quite effective. Naturally, though, keep in mind that bringing an LLM into the loop affects processing time.
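
The LLM output can then be post-processed like any other JSON. Below is a minimal sketch, assuming extracted_content is a JSON string of blocks with "tags", "content", and "error" keys as in the output above. The sample data is a stand-in, and texts_by_tag is a hypothetical helper.

```python
import json
from typing import Dict, List


def texts_by_tag(extracted_json: str) -> Dict[str, List[str]]:
    """Group block contents by tag, skipping blocks flagged as errors."""
    grouped: Dict[str, List[str]] = {}
    for block in json.loads(extracted_json):
        if block.get("error"):
            continue
        content = block["content"]
        # content may be a single string or a list of strings.
        lines = content if isinstance(content, list) else [content]
        for tag in block.get("tags", []):
            grouped.setdefault(tag, []).extend(lines)
    return grouped


# Stand-in for result.extracted_content.
extracted_content = json.dumps([
    {"index": 0, "tags": ["title"], "content": ["Crawl4AI Documentation"], "error": False},
    {"index": 1, "tags": ["python_code"],
     "content": ["from crawl4ai import WebCrawler", "crawler = WebCrawler()"], "error": False},
    {"index": 2, "tags": ["navigation"], "content": ["(failed block)"], "error": True},
])

grouped = texts_by_tag(extracted_content)
print("\n".join(grouped["python_code"]))
```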

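LLM access appears to go through LiteLLM, so in theory any model that LiteLLM supports should be usable.

There also appear to be the following extraction classes, although they are not listed in the documentation.

TopicExtractionStrategy
- Extracts using the same logic as TopicSegmentationChunking
ContentSummarizationStrategy
- Extracts using a Transformers-based summarization model

For details on the extraction classes, see the "Extraction Strategies" page of the documentation. With an LLM, it looks like Structured Output can also be used.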


Other / more advanced features
Filtering with CSS selectors, driving the browser with JavaScript, authenticating, and so on.
https://crawl4ai.com/mkdocs/full_details/advanced_features/
https://crawl4ai.com/mkdocs/examples/js_execution_css_filtering/
https://crawl4ai.com/mkdocs/examples/hooks_auth/


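Summary
If all you need is to convert pages into Markdown that an LLM can interpret easily, I don't think there is much difference from Jina Reader, Firecrawl, and the like, and the Markdown that Crawl4AI outputs is perfectly usable. Rather than one tool being the best, I suspect it comes down to compatibility with how the target site is built, so it is worth trying each of them.
That said, it is quite powerful when combined with an LLM. Since it is an LLM, some variation between runs is unavoidable, but used well it can also normalize the output format nicely, which should save the effort of cleaning out junk afterwards.
As with Scrape Graph AI, combining it with an LLM turns a simple scraping tool into an Extract/Transform pipeline, and it looks usable as an agent tool as is.
It supports both simple and elaborate usage, so although I have mostly used Firecrawl until now, I plan to use Crawl4AI for a while.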


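I went into a bit too much detail, so this ended up being somewhat hard to follow, and I should have used a Japanese site for the crawling samples. I always try to keep "does it work with Japanese?" in mind, but I neglected that this time, which I regret.
There are already several Japanese-language articles, so reading those is probably quicker than reading mine.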
https://note.com/shi3zblog/n/n087a41f7a60e
https://zenn.dev/syoyo/articles/adc91d17b0c76b


The prompt used with LLMExtractionStrategy looks like this.

What follows is a translation rather than the verbatim prompt text. The value passed as instruction is embedded in {REQUEST}.

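Here is the URL of the webpage:
<url>{URL}</url>

And here is the cleaned HTML content of that webpage:
<html>
{HTML}
</html>

Your task is to break down this HTML content into semantically relevant blocks, following the user's request, and for each block, generate a JSON object with the following keys:

- index: an integer representing the index of the block in the content
- content: a list of strings containing the text content of the block

This is the user's request. Please read it carefully.
<request>
{REQUEST}
</request>

To generate the JSON objects:

1. Carefully read through the HTML content and identify logical breaks or shifts in the content that would warrant splitting it.

2. For each block:
  a. Assign it an index based on its order in the content.
  b. Analyze the content and generate one semantic tag that describes what the block is about.
  c. Extract the text content. It must be exactly the same as the "given data"; clean up the text where necessary and store it as a list of strings in the "content" field.

3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.

4. Double-check that each JSON object includes all required keys (index, tag, content) and that the values are in the expected format (integer, list of strings, etc.).

5. Make sure the generated JSON is complete and parsable, with no errors or omissions.

6. Escape any special characters in the HTML content, and also escape single and double quotes, to avoid JSON parsing issues.

7. Never alter the extracted content; just copy and paste it as it is.

Please provide your output within <blocks> tags, like this:

<blocks>
[{
  "index": 0,
  "tags": ["introduction"],
  "content": ["This is the first paragraph of the article, which provides an introduction and overview of the main topic."]
},
{
  "index": 1,
  "tags": ["background"],
  "content": ["This is the second paragraph, which delves into the history and background of the topic.",
              "It provides context and sets the stage for the rest of the article."]
}]
</blocks>

**When extracting blocks, follow the user's instruction.**

The output must be complete, parsable JSON with no omissions or errors, wrapped in <blocks> tags. The JSON objects should break the content down into semantically relevant blocks while maintaining the original order.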
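
Given this output format, the caller has to pull the JSON back out of the <blocks> wrapper. Here is a minimal sketch of that step, assuming the model follows the format described in the prompt. The function parse_blocks and the sample response are hypothetical, and this is not necessarily how the library itself parses responses.

```python
import json
import re
from typing import Any, List


def parse_blocks(response_text: str) -> List[Any]:
    """Extract and parse the JSON list wrapped in <blocks>...</blocks>."""
    match = re.search(r"<blocks>(.*?)</blocks>", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no <blocks>...</blocks> section found")
    return json.loads(match.group(1))


# Hypothetical model response for illustration.
response_text = """Here are the blocks.
<blocks>
[{"index": 0, "tags": ["introduction"], "content": ["First paragraph."]},
 {"index": 1, "tags": ["background"], "content": ["Second paragraph.", "More context."]}]
</blocks>"""

blocks = parse_blocks(response_text)
for block in blocks:
    print(block["index"], block["tags"], block["content"])
```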


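I did not try it this time, but here is the prompt used when a schema is specified with LLMExtractionStrategy.

Here is the content from the URL:
<url>{URL}</url>

<url_content>
{HTML}
</url_content>

The user has made the following request for what information to extract from the above content:

<user_request>
{REQUEST}
</user_request>

<schema_block>
{SCHEMA}
</schema_block>

Please carefully read the URL content and the user's request. If the user provided a desired JSON schema in the <schema_block> above, extract the requested information from the URL content according to that schema. If no schema was provided, infer an appropriate JSON schema based on the user's request that will best capture the key information they are looking for.

Extraction instructions:
Return the extracted information as a list of JSON objects, each corresponding to a block of content from the URL, in the same order as it appears on the page. Wrap the entire JSON list in <blocks>...</blocks> XML tags.

Quality check:
Before outputting your final answer, double-check that the JSON you are returning is complete, contains all the information requested by the user, and is valid JSON that json.loads() can parse without errors or omissions. The output JSON objects must fully match the schema, whether provided or inferred.

Quality score:
After reflecting, rate the quality and completeness of the JSON data you are about to return on a scale of 1 to 5. Write the score inside <score> tags.

Avoid common mistakes:
- Do not add comments using "//" or "#" in the JSON output. They cause parsing errors.
- Make sure the JSON is properly formatted, with curly braces, square brackets, and commas in the right places.
- Do not forget the closing </blocks> tag at the end of the JSON output.
- Do not generate Python code that shows how to do the task. Your task is to extract the information and return it in JSON format.

Result
Output the final list of JSON objects, wrapped in XML tags. Make sure to close the tags properly.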
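
To show how the placeholders and the <score> tag fit together, here is a rough sketch. The abbreviated template, the example schema, and both helper functions are hypothetical and written for this note. A single-pass regex substitution is used instead of str.format because HTML and JSON values usually contain braces of their own.

```python
import json
import re
from typing import Dict, Optional

# Abbreviated, hypothetical template using the same placeholders as the prompt.
TEMPLATE = (
    "<url>{URL}</url>\n"
    "<url_content>\n{HTML}\n</url_content>\n"
    "<user_request>\n{REQUEST}\n</user_request>\n"
    "<schema_block>\n{SCHEMA}\n</schema_block>\n"
)


def fill_template(template: str, values: Dict[str, str]) -> str:
    """Replace {UPPERCASE} placeholders in one pass; unknown ones are kept."""
    def substitute(match: "re.Match[str]") -> str:
        return values.get(match.group(1), match.group(0))
    return re.sub(r"\{([A-Z]+)\}", substitute, template)


def parse_score(response_text: str) -> Optional[int]:
    """Read the 1-5 self-rating from a <score> tag, if present."""
    match = re.search(r"<score>\s*([1-5])\s*</score>", response_text)
    return int(match.group(1)) if match else None


# Hypothetical schema for illustration.
schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}
prompt = fill_template(TEMPLATE, {
    "URL": "https://crawl4ai.com/mkdocs/",
    "HTML": "<h1>Hello {world}</h1>",
    "REQUEST": "Extract the page title.",
    "SCHEMA": json.dumps(schema),
})
print(prompt)
print(parse_score('<blocks>[{"title": "Hello"}]</blocks>\n<score>4</score>'))
```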

This scrap was closed on 2024/09/21.
