
[Azure] PDF text extraction in AI Search

Check whether the text extraction results differ between an indexer whose "imageAction" is null and one whose "imageAction" is "generateNormalizedImagePerPage".
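As a rough sketch of the two indexer variants being compared, the definitions could be built like this before being sent to the service. The indexer, data source, and index names are hypothetical placeholders, and writing the null case as an explicit JSON null is an assumption based on how the setting is described here; only the `imageAction` value differs between the two.

```python
import json
from typing import Optional


def build_indexer(name: str, image_action: Optional[str]) -> dict:
    """Build a minimal indexer definition as a plain dict (JSON body)."""
    return {
        "name": name,
        "dataSourceName": "pdf-datasource",  # hypothetical placeholder
        "targetIndexName": "pdf-index",      # hypothetical placeholder
        "parameters": {
            "configuration": {
                "dataToExtract": "contentAndMetadata",
                # None is serialized as JSON null (assumption: this mirrors
                # the "imageAction: null" case compared in this issue)
                "imageAction": image_action,
            }
        },
    }


indexer_null = build_indexer("pdf-indexer-null", None)
indexer_per_page = build_indexer(
    "pdf-indexer-per-page", "generateNormalizedImagePerPage"
)

# Print both bodies so the single differing setting is easy to spot
for definition in (indexer_null, indexer_per_page):
    print(json.dumps(definition, indent=2))
```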

1. Index overview

imageAction: null

Documents: 1440

Total storage: 5.79MB

imageAction: generateNormalizedImagePerPage

Documents: 1445

Total storage: 5.59MB

imageAction: generateNormalizedImagePerPage + OcrSkill

OcrSkill is used before TextSplitSkill.

Documents: 504

Total storage: 3.85MB

Chunks are now split per page, so the number of documents decreased.
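The OcrSkill-before-TextSplitSkill arrangement might look roughly like the sketch below. The actual skillset used for this test is not shown in this issue, so the skillset name, the `maximumPageLength` value, the language code, and the choice to run the split skill once per page image are all assumptions made for illustration. Running the split in the per-page context is one plausible way to end up with per-page chunks.

```python
import json

# Enrichment path for the page images produced by
# imageAction = generateNormalizedImagePerPage
PAGE_IMAGES = "/document/normalized_images/*"

ocr_skill = {
    "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
    "context": PAGE_IMAGES,
    "defaultLanguageCode": "en",  # assumption: English source PDF
    "inputs": [{"name": "image", "source": PAGE_IMAGES}],
    "outputs": [{"name": "text", "targetName": "text"}],
}

split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "context": PAGE_IMAGES,          # assumption: split runs once per page
    "textSplitMode": "pages",
    "maximumPageLength": 2000,       # hypothetical value
    "inputs": [{"name": "text", "source": PAGE_IMAGES + "/text"}],
    "outputs": [{"name": "textItems", "targetName": "chunks"}],
}

skillset = {
    "name": "pdf-ocr-skillset",      # hypothetical placeholder
    "skills": [ocr_skill, split_skill],
}


def ocr_feeds_split(skills: list) -> bool:
    """Check that the split skill reads the text produced by the OCR skill."""
    ocr, split = skills
    ocr_output_path = ocr["context"] + "/" + ocr["outputs"][0]["targetName"]
    return split["inputs"][0]["source"] == ocr_output_path


print(json.dumps(skillset, indent=2))
print("OCR output wired into split skill:", ocr_feeds_split(skillset["skills"]))
```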


2. Comparison of extracted text

imageAction: null

The extracted text is grammatically correct.
Extracted text

How we use price assumptions \nOur price assumptions are used for our \ninvestment appraisal processes. They are also \nused to inform decisions about internal planning \nand the value-in-use impairment testing of \nassets for financial reporting. \n\nThe role of price assumptions \nAs part of our regular strategy review,

imageAction: generateNormalizedImagePerPage + OcrSkill

Text is joined horizontally across the page, ignoring the column boundaries.
Extracted text

Our investment process How we use price assumptions Impairment testing For investment appraisal, potential future Our price assumptions are used for our Our best estimate of future prices for use operational emissions costs that may be borne investment appraisal processes.
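The OCR output above is consistent with a three-column page being read row by row instead of column by column. The sketch below reproduces that effect. The column assignment is inferred only from the two extracted texts in this issue (the middle column matches the start of the imageAction: null output), not from the PDF layout itself, so treat it as an assumption.

```python
from itertools import zip_longest

# Assumed three-column layout, one list of visual lines per column
columns = [
    [
        "Our investment process",
        "For investment appraisal, potential future",
        "operational emissions costs that may be borne",
    ],
    [
        "How we use price assumptions",
        "Our price assumptions are used for our",
        "investment appraisal processes.",
    ],
    [
        "Impairment testing",
        "Our best estimate of future prices for use",
    ],
]


def read_row_wise(cols):
    """Join text left to right across columns, one visual row at a time."""
    rows = zip_longest(*cols, fillvalue="")
    return " ".join(cell for row in rows for cell in row if cell)


def read_column_wise(cols):
    """Join text top to bottom within each column, then move to the next."""
    return " ".join(line for col in cols for line in col)


print("Row-wise (matches the OCR output):")
print(read_row_wise(columns))
print()
print("Column-wise (keeps each passage intact):")
print(read_column_wise(columns))
```

Reading row-wise yields exactly the merged string seen with OcrSkill, while reading column-wise keeps each passage intact.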
Original text