Options for Turning Unstructured Data into Structured Data in Snowflake

Takumi

Published 2025/09/17

Snowflake

LLM

tech

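 Introduction
Hello. I'm Takumi, a data engineer and LLM engineer at Nowcast.
Snowflake has recently moved a series of features for extracting data from unstructured files into Public Preview or GA:
Document AI
AI_EXTRACT
AI_PARSE_DOCUMENT (OCR)
AI_PARSE_DOCUMENT (LAYOUT)
All of them extract information from unstructured data such as PDFs, but each suits a different situation.

This article covers the use cases each one fits best and what each one costs.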
Note: two similar functions also exist:
SNOWFLAKE.CORTEX.EXTRACT_ANSWER
SNOWFLAKE.CORTEX.PARSE_DOCUMENT
These have been superseded by AI_EXTRACT and AI_PARSE_DOCUMENT.

 What this article covers
Document AI
AI_EXTRACT
AI_PARSE_DOCUMENT (OCR)
AI_PARSE_DOCUMENT (LAYOUT)
For each of these: an overview, use cases, and cost by usage pattern.

 Feature overview
Let's start with the limits of each feature. The supported file types are almost the same across all three.


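| Item | AI_EXTRACT | Document AI | AI_PARSE_DOCUMENT (OCR / LAYOUT) |
| --- | --- | --- | --- |
| Main usage | Specify what to extract in a prompt at call time (zero-shot style) | Prepare a model in Snowsight, then extract predefined fields with `<model>!PREDICT` | OCR: full-text transcription. LAYOUT: extraction with structure preserved as Markdown |
| Supported file formats | PDF, PNG, PPTX, EML, DOC/DOCX, JPEG/JPG, HTM/HTML, TXT, TIF/TIFF | PDF, PNG, DOCX, EML, JPEG/JPG, HTM/HTML, TXT, TIF/TIFF | PDF, PPTX, DOCX, JPEG/JPG, PNG, TIFF/TIF, HTML, TXT |
| Maximum file size | Under 100 MB | 50 MB or less | Under 100 MB |
| Maximum pages | Not stated | 125 pages or fewer | 300 pages or fewer |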

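Note: the maximum size and page count are clearly capped.

This is not limited to these features, but I feel Snowflake is telling us to "preprocess your data properly" and that it "won't tolerate brute-force solutions that just shove everything in."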
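The size and page limits listed above lend themselves to a simple pre-check before a file is sent to any of the functions. The following is a minimal sketch only: the names `LIMITS`, `fits`, and `usable_features` are hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and the page count is assumed to be known to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional

MB = 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


@dataclass(frozen=True)
class Limit:
    max_bytes: int
    size_inclusive: bool       # True for "N MB or less", False for "under N MB"
    max_pages: Optional[int]   # None when no page limit is stated


# Values copied from the limits listed in this article.
LIMITS = {
    "AI_EXTRACT": Limit(100 * MB, False, None),
    "Document AI": Limit(50 * MB, True, 125),
    "AI_PARSE_DOCUMENT": Limit(100 * MB, False, 300),
}


def fits(feature: str, size_bytes: int, pages: int) -> bool:
    """Return True if a file of this size and page count is within the feature's limits."""
    limit = LIMITS[feature]
    if limit.size_inclusive:
        size_ok = size_bytes <= limit.max_bytes
    else:
        size_ok = size_bytes < limit.max_bytes
    pages_ok = limit.max_pages is None or pages <= limit.max_pages
    return size_ok and pages_ok


def usable_features(size_bytes: int, pages: int) -> List[str]:
    """List every feature whose limits the file satisfies."""
    return [name for name in LIMITS if fits(name, size_bytes, pages)]


print(usable_features(60 * MB, 200))
```

A 60 MB, 200-page file passes the AI_EXTRACT and AI_PARSE_DOCUMENT checks but not the Document AI one, so files like that would need to be split before being sent to Document AI.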

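 Document AI
Document AI is a feature that extracts data from documents using Arctic-TILT, an LLM built by Snowflake.

It is an all-rounder: it handles not only digitally created documents such as PDFs but also handwritten text, checkmarks, and the like.
What clearly sets Document AI apart from the other features is that its model can be fine-tuned.

Zero-shot extraction is of course possible, but fine-tuning lets it cope reasonably well with documents in uncommon formats.

A fine-tuned model stays within your account and is never shared with other Snowflake accounts.
Note: as of September 2025, Japanese is not supported, so running a Japanese PDF through it currently returns No Answer (extraction failure).
For concrete usage, see the Document AI tutorial.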

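 AI_EXTRACT
AI_EXTRACT is an LLM-based function provided by Snowflake. You give it a prompt (instructions) on the spot and it pulls information out of unstructured data.

It resembles Document AI, but by specifying responseFormat you can extract information flexibly from a wide variety of documents with AI_EXTRACT alone.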

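 A side note
Where Document AI and AI_EXTRACT have an edge over the other features is that their use is not limited to information extraction.

As the documentation describes, they can also be used to classify documents by type.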
SELECT AI_EXTRACT(
  file => TO_FILE('@db.schema.files', relative_path),
  responseFormat => [
    'What is this document?',
    'How would you classify this document?'
  ]
) FROM DIRECTORY (@db.schema.files);
Used this way, a pipeline can generate metadata about the files themselves rather than only extract information from them.

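 AI_PARSE_DOCUMENT
AI_PARSE_DOCUMENT reads a document, extracts its text content, and returns the result as JSON.

The function supports two modes, OCR (optical character recognition) and LAYOUT.
Roughly, they differ as follows:
OCR mode: extracts text via OCR
LAYOUT mode: extracts text and also captures layout information as Markdown
Let's see how it works in practice using a sample PDF from Adobe.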

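 OCR
Usage is very simple. The query below calls SNOWFLAKE.CORTEX.PARSE_DOCUMENT, the older function name mentioned in the introduction.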
SELECT TO_VARCHAR(SNOWFLAKE.CORTEX.PARSE_DOCUMENT (
'@DOC_AI_DB.DOC_AI_SCHEMA.AI_EXTRACT_STG', 
'sample_pdf.pdf',
{'mode': 'OCR', 'page_split': TRUE} ) ) AS MULTIPAGE;
Running this SQL returns the following.
OCR mode result:
{"metadata":{"pageCount":4},"pages":[{"content":"PDF BOOKMARK SAMPLE\nSample Date: May 2001\nPrepared by: Accelio Present Applied Technology\nCreated and Tested Using: • Accelio Present Central 5.4\n• Accelio Present Output Designer 5.4\nFeatures Demonstrated: • Primary bookmarks in a PDF file.\n• Secondary bookmarks in a PDF file.\nOverview\nThis sample consists of a simple form containing four distinct fields. The data file contains eight\nseparate records.\nBy default, the data file will produce a PDF file containing eight separate pages. The selective\nuse of the bookmark file will produce the same PDF with a separate pane containing\nbookmarks. This screenshot of the sample output shows a PDF file with bookmarks.\nAcrobat Reader - [ap_bookmark.pdf]\n) File Edit Document View Window Help\nBookmarks \\Thumbnails E.. Imvoices by Date\n2000-01-1 1 2000-01-2 1 2000-01-3\n2000-01-4 2000-01-5 Date 2000-01-1 2000-01-6\n2000-01-7 Description Description for item # 1 2000-01-8\nType TYPE1\nAmount 11.00\nThe left pane displays the available bookmarks for this PDF. 
You may need to enable the\ndisplay of bookmarks in Adobe Acrobat Reader by clicking Window > Show Bookmarks.\nSelecting a date from the left pane displays the corresponding page within the document.\nNote that the index has been sorted according to the specification in the bookmark file, and that\npages within the file are created according to the original order in the data file.\nPDF Bookmark Sample Page 1 of 4","index":0},{"content":"Sample Data File Sample Bookmark File\n^reformat trunc [invoices] ^symbolset WINLATIN1 Invoices by Date=0 ^field trans_date trans_date=1,A 2000-01-1 [type] ^field description Invoices by Item Type=0 Description for item #1 trans_type=1,A ^field trans_type [amount] TYPE1 Invoices by Transaction Amount=0 ^field trans_amount trans_amount=1,D 11.00 ^page 1 ^field trans_date 2000-01-2 ^field description Description for item #2 ^field trans_type TYPE2 ^field trans_amount 11.00 ^page 1 ^field trans_date 2000-01-3 ^field description Description for item #3 ^field trans_type TYPE3\nThe example bookmark file includes three distinct sections:\n• Invoices sorted, ascending, by date.\n• Invoices sorted, ascending, by item type.\n• Invoices sorted, descending, by transaction amount.\nPDF Bookmark Sample Page 2 of 4","index":1},{"content":"Sample Files\nThis sample package contains:\nFilename Description\nap_bookmark.IFD The template design.\nap_bookmark.mdf The template targeted for PDF output.\nap_bookmark.dat A sample data file in DAT format.\nap_bookmark.bmk A sample bookmark file.\nap_bookmark.pdf Sample PDF output.\nap_bookmark_doc.pdf A document describing the sample.\nDeploying the Sample\nTo deploy this sample in your environment:\n1. Open the template design ap_bookmark.IFD in Output Designer and recompile the\ntemplate for the appropriate presentment target.\n2. 
Modify the -z option in the ^job command in the data file ap_bookmark.dat to:\n• Identify the target output device.\n• Identify the bookmark file using the -abmk command.\n• Identify the section for which to generate bookmarks, if desired, using the -abms\ncommand.\nFor example,\nTo bookmark by … Use the command line parameter …\nInvoices -abmkap_bookmark.bmk -abmsinvoices\nType -abmkap_bookmark.bmk -abmstype\nAmount -abmkap_bookmark.bmk -abmsamount\nPDF Bookmark Sample Page 3 of 4","index":2},{"content":"3. Place the accompanying files in directories consistent with your implementation:\n• Place ap_bookmark.IFD in the Designs subdirectory for Output Designer.\n• Place ap_bookmark.mdf in the forms subdirectory accessible to Central.\n• Place ap_bookmark.bmk in an addressable directory.\nRunning the Sample\n• To run this sample, place ap_bookmark.dat in the collector directory scanned by Central.\nPDF Bookmark Sample Page 4 of 4","index":3}]}
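That is what comes back. Execution took 1.4 seconds for this 4-page PDF (pageCount is 4 in the output), so a pipeline that handles a large number of PDFs should still finish in a realistic amount of time.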
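Once the JSON is back, each page can be turned into a row for downstream processing. The sketch below is an illustration only: `raw` is a hand-shortened stand-in shaped like the page_split output above (a `metadata` object plus a `pages` array of `content` and `index`), and `pages_to_rows` is a hypothetical helper name.

```python
import json

# Hand-shortened stand-in for the page_split output; the real content is much longer.
raw = r'''{"metadata":{"pageCount":2},"pages":[
{"content":"PDF BOOKMARK SAMPLE\nSample Date: May 2001","index":0},
{"content":"Sample Files\nThis sample package contains:","index":1}]}'''


def pages_to_rows(raw_json: str) -> list:
    """Flatten the parsed document into one dict per page."""
    doc = json.loads(raw_json)
    rows = []
    for page in doc["pages"]:
        text = page["content"]
        rows.append({
            "page": page["index"] + 1,   # index is 0-based in the output
            "text": text,
            "chars": len(text),
        })
    return rows


for row in pages_to_rows(raw):
    first_line = row["text"].splitlines()[0]
    print(row["page"], row["chars"], first_line)
```

Keeping one row per page makes it straightforward to load the text into a table keyed by file and page number.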

 LAYOUT
This mode is just as easy to use. Run:
SELECT TO_VARCHAR(SNOWFLAKE.CORTEX.PARSE_DOCUMENT (
'@DOC_AI_DB.DOC_AI_SCHEMA.AI_EXTRACT_STG', 
'sample_pdf.pdf',
{'mode': 'LAYOUT', 'page_split': TRUE} ) ) AS MULTIPAGE;
That is all it takes.

Here is the result.
LAYOUT mode result:
{"metadata":{"pageCount":4},"pages":[{"content":"# ACCELIO.\n\n## PDF Bookmark Sample\n\n|  Sample Date: | May 2001  |\n| --- | --- |\n|  Prepared by: | Accelio Present Applied Technology  |\n|  Created and Tested Using: | - Accelio Present Central 5.4  |\n|   | - Accelio Present Output Designer 5.4  |\n|  Features Demonstrated: | - Primary bookmarks in a PDF file.  |\n|   | - Secondary bookmarks in a PDF file.  |\n\n## Overview\n\nThis sample consists of a simple form containing four distinct fields. The data file contains eight separate records.\n\nBy default, the data file will produce a PDF file containing eight separate pages. The selective use of the bookmark file will produce the same PDF with a separate pane containing bookmarks. This screenshot of the sample output shows a PDF file with bookmarks.\n\n![img-0.jpeg](img-0.jpeg)\n\nThe left pane displays the available bookmarks for this PDF. You may need to enable the display of bookmarks in Adobe® Acrobat® Reader by clicking **Window > Show Bookmarks**. 
Selecting a date from the left pane displays the corresponding page within the document.\n\nNote that the index has been sorted according to the specification in the bookmark file, and that pages within the file are created according to the original order in the data file.","index":0},{"content":"# $\\square$ \n\n## ACCELIO.\n\n## Sample Data File\n\n^reformat trunc\n^symbolset WINLATIN1\n^field trans_date\n2000-01-1\n^field description\nDescription for item \\#1\n^field trans_type\nTYPE1\n^field trans_amount\n11.00\n^page 1\n^field trans_date\n2000-01-2\n^field description\nDescription for item \\#2\n^field trans_type\nTYPE2\n^field trans_amount\n11.00\n^page 1\n^field trans_date\n2000-01-3\n^field description\nDescription for item \\#3\n^field trans_type\nTYPE3\n\n## Sample Bookmark File\n\n[invoices]\nInvoices by Date $=0$\ntrans_date=1,A\n[type]\nInvoices by Item Type=0\ntrans_type=1,A\n[amount]\nInvoices by Transaction Amount=0\ntrans_amount=1,D\n\nThe example bookmark file includes three distinct sections:\n\n- Invoices sorted, ascending, by date.\n- Invoices sorted, ascending, by item type.\n- Invoices sorted, descending, by transaction amount.","index":1},{"content":"# $\\square$ \n\n## ACCELIO.\n\n## Sample Files\n\nThis sample package contains:\n\n| Filename | Description |\n| :-- | :-- |\n| ap_bookmark.IFD | The template design. |\n| ap_bookmark.mdf | The template targeted for PDF output. |\n| ap_bookmark.dat | A sample data file in DAT format. |\n| ap_bookmark.bmk | A sample bookmark file. |\n| ap_bookmark.pdf | Sample PDF output. |\n| ap_bookmark_doc.pdf | A document describing the sample. |\n\n## Deploying the Sample\n\nTo deploy this sample in your environment:\n\n1. Open the template design ap_bookmark.IFD in Output Designer and recompile the template for the appropriate presentment target.\n2. 
Modify the -z option in the ^job command in the data file ap_bookmark.dat to:\n\n- Identify the target output device.\n- Identify the bookmark file using the -abmk command.\n- Identify the section for which to generate bookmarks, if desired, using the -abms command.\n\nFor example,\n\n| To bookmark by $\\ldots$ | Use the command line parameter ... |\n| :-- | :-- |\n| Invoices | -abmkap_bookmark.bmk -abmsinvoices |\n| Type | -abmkap_bookmark.bmk -abmstype |\n| Amount | -abmkap_bookmark.bmk -abmsamount |","index":2},{"content":"# ACCELIO. \n\n3. Place the accompanying files in directories consistent with your implementation:\n\n- Place ap_bookmark.IFD in the Designs subdirectory for Output Designer.\n- Place ap_bookmark.mdf in the forms subdirectory accessible to Central.\n- Place ap_bookmark.bmk in an addressable directory.\n\n\n## Running the Sample\n\n- To run this sample, place ap_bookmark.dat in the collector directory scanned by Central.","index":3}]}
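The output is proper Markdown. Execution took 9 seconds, noticeably longer than the 1.4 seconds of OCR mode.
Neither mode can read the screenshot in the figure, but as plain transcription both appear to reach good accuracy.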

 Thinking through use cases
That covers the overview. Based on each feature's characteristics, let's sort out which feature to use in which scenario.

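 Document AI
Document AI is a good fit when you repeatedly process documents that follow a specific, fixed format.

It takes more effort than the other features, since you create a model in advance, define the questions, and then evaluate accuracy, but in return you should get stable extraction accuracy and something you can keep running over time.

Because you can also fine-tune the model on documents you already have, it is likely to shine most in business use where the extracted results must be reliable.

On the other hand, for a collection of documents in many different layouts you would need to build a model per layout, which is not realistic.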

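 AI_EXTRACT
AI_EXTRACT fits when you want to extract data flexibly from varied unstructured data, or when you want to try a question as soon as it occurs to you.

Unlike Document AI, it needs no up-front training: you can ask questions on the spot, and that convenience is its main advantage. It is useful when you want to upload a file and quickly pull data out of it.

In addition, asking about the classification of the document itself, with questions such as 'What is this document?' and 'How would you classify this document?', looks useful for attaching metadata when building a Cortex Search Service.
Note: responseFormat lets you customize the output, but I saw behavior such as instructions being ignored as they grew in number, and examples intended as few-shot being output verbatim. My sense is that it is not a feature you can steer with prompting the way you would ChatGPT.

In terms of capability, my impression is that the model sits somewhere between GPT-3 and GPT-4. An odd middle ground.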

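 AI_PARSE_DOCUMENT
AI_PARSE_DOCUMENT is the one to reach for when you want to turn an entire document into text. Heavily customized Word and Excel documents in particular often lose values when read directly, so the function is a powerful tool for getting such documents into a form that is easy to process. It also fits when you need to process a range of unstructured data as preparation for feeding it to an LLM.
Whether to use OCR or LAYOUT then comes down to one question:
Would downstream processing (comprehension by an LLM) become difficult if the document were reduced to a flat sequence of text?
For example, when Markdown helps comprehension, as with row and column correspondence in tables or the hierarchical structure of prose, LAYOUT mode lets you store the document in a form that is easier for an AI to read.
Note: as discussed in this community thread, delimiting content with XML tags can apparently improve model performance, so I imagine LAYOUT mode might gain an XML output in the future. That said, XML is not easy for humans to read, so it may never be built.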
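To illustrate why preserved table structure helps post-processing, here is a minimal sketch that recovers rows from a Markdown table like the one in the LAYOUT output. The helper name `parse_markdown_table` is hypothetical, and the parser only handles simple pipe tables with a single header row.

```python
from typing import Dict, List

# A fragment of the LAYOUT mode output shown earlier (page 3 of the sample PDF).
layout_content = (
    "## Sample Files\n\n"
    "| Filename | Description |\n"
    "| :-- | :-- |\n"
    "| ap_bookmark.IFD | The template design. |\n"
    "| ap_bookmark.mdf | The template targeted for PDF output. |\n"
)


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Return one dict per table row, keyed by the header cells."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set(":-") for cell in cells):
            continue  # separator row such as | :-- | :-- |
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, cells)) for cells in body]


for record in parse_markdown_table(layout_content):
    print(record)
```

With OCR mode output the same rows arrive as "ap_bookmark.IFD The template design." on one line, so the column boundary would have to be guessed.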

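 Use case summary
To summarize:

"Fixed format and strict extraction": Document AI

"Flexible extraction and quick questions": AI_EXTRACT

"Full-text transcription": AI_PARSE_DOCUMENT (OCR)

"Full-text transcription with structure preserved": AI_PARSE_DOCUMENT (LAYOUT)
That is how I would divide them up.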
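The same split can be written down as a small decision helper. This is only a sketch of the summary above: the function name `choose_feature`, its three flags, and the order in which they are checked are all assumptions made for illustration.

```python
def choose_feature(full_text: bool, fixed_format: bool, keep_structure: bool = False) -> str:
    """Map the use case summary onto a feature name.

    Assumption: a request for full-text transcription takes priority over
    the fixed-format check, since it is a different kind of output.
    """
    if full_text:
        if keep_structure:
            return "AI_PARSE_DOCUMENT (LAYOUT)"
        return "AI_PARSE_DOCUMENT (OCR)"
    if fixed_format:
        return "Document AI"
    return "AI_EXTRACT"


print(choose_feature(full_text=False, fixed_format=True))
print(choose_feature(full_text=True, fixed_format=False, keep_structure=True))
```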

 Cost estimate
The cost calculation uses the following sample PDF.

https://ontheline.trincoll.edu/images/bookdown/sample-local-pdf.pdf

 Assumptions
AWS Tokyo region
Enterprise edition contract (1 credit = $4.3)

 Cost table
See the Credit Consumption Table.

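 Variable definitions
pages = 3                           # pages in the sample PDF
doc_tokens_avg = 2040               # measured average token count of the document
K = 10                              # number of fields to extract
q_tok = 20                          # tokens per question
a_tok = 20                          # tokens per answer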

 Summary of results

| Feature | Estimation formula | Unit price assumed | Calculation | Credits consumed | Cost ($) |
| --- | --- | --- | --- | --- | --- |
| AI_EXTRACT | `total_tokens = 2040 + (10×20) + (10×20) = 2440` → `billable_Mtokens = 2440 / 1,000,000` → `credit = billable_Mtokens × 2.55` | 2.55 credits / 1M tokens (AI_EXTRACT) | `2440 / 1,000,000 × 2.55` | ≈ 0.0062 | ≈ $0.0267 |
| AI_PARSE_DOCUMENT (OCR) | Per-page billing: `credit = (pages / 1000) × 0.5` | 0.5 credits / 1000 pages | `(3 / 1000) × 0.5` | 0.0015 | ≈ $0.0065 |
| AI_PARSE_DOCUMENT (LAYOUT) | Per-page billing: `credit = (pages / 1000) × 3.33` | 3.33 credits / 1000 pages | `(3 / 1000) × 3.33` | ≈ 0.0100 | ≈ $0.0430 |
| Document AI | Official guideline for medium density × 10 extracted fields: `credit = (pages / 1000) × (6 to 9)` | 6 to 9 credits / 1000 pages | `(3 / 1000) × (6 to 9)` | 0.018 to 0.027 | $0.077 to $0.116 |
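The arithmetic behind these figures can be reproduced in a few lines. The sketch below reuses the variables and unit prices stated in this article; the function names `ai_extract_credits` and `per_page_credits` are hypothetical, and because it does not round intermediate values, the last digit of a dollar figure may differ slightly from a hand calculation.

```python
CREDIT_PRICE_USD = 4.3   # 1 credit = $4.3 (Enterprise, as assumed in this article)

pages = 3                # pages in the sample PDF
doc_tokens_avg = 2040    # measured average token count of the document
K = 10                   # number of fields to extract
q_tok = 20               # tokens per question
a_tok = 20               # tokens per answer


def ai_extract_credits(doc_tokens, k, q, a, credits_per_mtok=2.55):
    """Token-based billing: document tokens plus question and answer tokens."""
    total_tokens = doc_tokens + k * q + k * a
    return total_tokens / 1_000_000 * credits_per_mtok


def per_page_credits(n_pages, credits_per_1000_pages):
    """Page-based billing used by AI_PARSE_DOCUMENT and Document AI."""
    return n_pages / 1000 * credits_per_1000_pages


credits = {
    "AI_EXTRACT": ai_extract_credits(doc_tokens_avg, K, q_tok, a_tok),
    "AI_PARSE_DOCUMENT (OCR)": per_page_credits(pages, 0.5),
    "AI_PARSE_DOCUMENT (LAYOUT)": per_page_credits(pages, 3.33),
    "Document AI (low)": per_page_credits(pages, 6),
    "Document AI (high)": per_page_credits(pages, 9),
}

for name, value in credits.items():
    print(f"{name}: {value:.4f} credits, ${value * CREDIT_PRICE_USD:.4f}")
```

Changing `pages`, `K`, or the credit price at the top is enough to re-run the estimate for a different document or contract.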


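 Scaling up the cost
Because the calculation assumes files of the same layout throughout, cost grows essentially linearly.
What I read from the chart:
AI_PARSE_DOCUMENT (OCR) has the lowest per-page charge and is by far the cheapest.
AI_EXTRACT grows only gently with page count, even after adding the fixed question tokens (here 10×(20+20)=400 tokens).
Document AI is based on the medium density × 10 fields range (6 to 9 credits per 1000 pages), so its slope is steeper than that of OCR or AI_EXTRACT as pages increase.
Those are the main points. Looking at cost alone, you might hesitate to choose Document AI, but I see it as paying for extraction quality: the ability to handle documents specific to your company.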
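The linear growth described above can be sketched numerically. This is a rough model, not a billing calculator: it assumes document tokens scale at 2040 / 3 = 680 tokens per page, treats the 400 question and answer tokens as a fixed amount, and uses the hypothetical helper name `credits_at`.

```python
TOKENS_PER_PAGE = 2040 / 3          # assumption: tokens scale linearly with pages
QUESTION_TOKENS = 10 * (20 + 20)    # fixed 400 tokens for 10 questions and answers


def credits_at(pages: int) -> dict:
    """Estimated credits per feature for a given total page count."""
    return {
        "AI_PARSE_DOCUMENT (OCR)": pages / 1000 * 0.5,
        "AI_EXTRACT": (TOKENS_PER_PAGE * pages + QUESTION_TOKENS) / 1_000_000 * 2.55,
        "AI_PARSE_DOCUMENT (LAYOUT)": pages / 1000 * 3.33,
        "Document AI (low)": pages / 1000 * 6,
        "Document AI (high)": pages / 1000 * 9,
    }


for pages in (3, 300, 3000, 30000):
    estimate = credits_at(pages)
    print(pages, {name: round(value, 4) for name, value in estimate.items()})
```

Under these assumptions AI_EXTRACT works out to roughly 1.7 credits per 1000 pages, which places it between OCR (0.5) and LAYOUT (3.33), with Document AI (6 to 9) above all of them.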

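 What I personally think
I genuinely believe the features covered here could rescue Japan's heavily customized Excel files. Merged cells tend to break when you manipulate them from Python, yet transcribing everything into tables by hand is not an option either. These features can extract data from exactly that kind of hard-to-handle material.

I have high hopes for them as one solution for putting previously underused data to work, and I will keep working to master them.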

 A little promotion
I recently gave a talk at Snowflake World Tour Tokyo 2025.

The slides from that talk show how I put a data pipeline built on AI_EXTRACT to use.

I would be glad if you took a look.
https://speakerdeck.com/kevinrobot34/swt-2025-snowflake-intelligence-and-document-ai

Finatext Tech Blog (Publication)

The tech blog of the Finatext Group, whose mission is to "reinvent finance as a service."
