Converting PDFs to Markdown Files with the Mistral OCR API (with Embedded Images 🚀)

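Published 2025/03/09
 1. Introduction
Just recently, Mistral AI, a French AI startup, announced "Mistral OCR", an API built on its new OCR technology. It is said to recognize text, images, equations, and tables in PDFs with high accuracy, and it has been getting a lot of attention.

When I actually tried it, conversion to Markdown went smoothly and it was more practical than I expected!
This article shows how to convert a PDF file to a Markdown file using the Mistral OCR API.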

 2. What is Mistral OCR?

 Official pages
 Release news: https://mistral.ai/en/news/mistral-ocr

 Documentation: https://docs.mistral.ai/capabilities/document/

 Overview of Mistral OCR
The main features of Mistral OCR are as follows.

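Advanced document understanding
Recognizes not only text but also a wide range of elements such as images, tables, and equations, and extracts the data with its structure preserved.


Multilingual support
Supports many languages and scripts, delivering high-accuracy OCR for documents from around the world.


Text output in Markdown
Returns text in Markdown with structure and hierarchy preserved, which makes it easy to parse and render.

Self-hosting also appears to be available, but from a quick look at the official page it seems to require a separate inquiry.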

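 Pricing
According to the official information, the price is $1 per 1,000 pages (and lower still when combined with batch inference).
I have not used other products, so I cannot compare, but it is cheaper than I expected...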
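
To get a feel for that rate, here is a rough sketch that estimates the cost for a given page count. The helper name `estimate_ocr_cost` and the constant are hypothetical, the rate is simply the $1 per 1,000 pages figure quoted above, and the batch discount is not modeled because its exact value is not stated here.

```python
# Rough cost estimate based on the quoted rate of $1 per 1,000 pages.
# estimate_ocr_cost is a hypothetical helper, not part of any Mistral SDK.
USD_PER_1000_PAGES = 1.0  # figure quoted in this article; check the official pricing


def estimate_ocr_cost(pages: int) -> float:
    """Return the estimated cost in USD for running OCR on `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / 1000 * USD_PER_1000_PAGES


if __name__ == "__main__":
    # Example: a 15-page paper
    print(f"${estimate_ocr_cost(15):.3f}")
```

At this rate, even a few hundred pages of papers would stay well under one dollar.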

 Mistral OCR API details
 Official sample code: https://docs.mistral.ai/capabilities/document/
import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_image_base64=True
)

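 Response

The text and images extracted from each page are returned together in Markdown.

Besides the text elements, images can also be embedded as Base64 (with include_image_base64=True).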
{
    "pages": [
        {
            "index": ,        // page index
            "markdown": ,     // OCR text (Markdown)
            "images": [],     // images on this page (including Base64-encoded data)
            "dimensions": {}  // page dimension information?
        },
        ...
    ],
    "model": "",               // name of the OCR model used
    "usage_info": {            // metadata about the OCR run
        "pages_processed": ,   // number of pages processed
        "doc_size_bytes":      // document size in bytes
    }
}
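
As a small illustration of how this structure can be consumed, the sketch below counts pages and images in a response that has already been converted to a plain dict. The function name `summarize_response` and the contents of `sample_response` are made up for this example; only the key names follow the layout shown above.

```python
# Minimal sketch: summarize an OCR response held as a plain dict.
# summarize_response and sample_response are hypothetical; only the key
# names ("pages", "images", ...) follow the response layout in this article.
def summarize_response(response: dict) -> dict:
    pages = response.get("pages", [])
    return {
        "page_count": len(pages),
        "image_count": sum(len(p.get("images", [])) for p in pages),
    }


sample_response = {
    "pages": [
        {"index": 0, "markdown": "# Title", "images": [], "dimensions": {}},
        {
            "index": 1,
            "markdown": "Body ![img-0.jpeg](img-0.jpeg)",
            "images": [{"id": "img-0.jpeg", "image_base64": "..."}],  # placeholder data
            "dimensions": {},
        },
    ],
    "model": "mistral-ocr-latest",
    "usage_info": {"pages_processed": 2, "doc_size_bytes": 12345},
}

print(summarize_response(sample_response))
```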


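 3. Implementation

 Source code (GitHub repository)

The source code used in this article is published in the following GitHub repository.

https://github.com/rynskrmt/mistral-ocr-pdf2markdown
Detailed steps are also in the README, so please refer to it!

 Walkthrough

The main steps are explained below.

 Uploading the PDF to Mistral AI

To run OCR, the PDF is first uploaded to Mistral AI.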
from mistralai import Mistral
from pathlib import Path

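def upload_pdf(client: Mistral, pdf_path: Path):
    """Upload a local PDF to Mistral AI for OCR and return the uploaded file object."""
    if not pdf_path.exists():
        raise FileNotFoundError(f'The specified PDF file was not found: {pdf_path}')

    # Read the file as bytes and upload it with the "ocr" purpose
    with pdf_path.open('rb') as file:
        uploaded_pdf = client.files.upload(
            file={"file_name": pdf_path.name, "content": file.read()},
            purpose="ocr"
        )
    return uploaded_pdf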

 Running OCR

OCR is run on the uploaded PDF, and the result is obtained in JSON format.
import datetime
import json

def process_ocr(client: Mistral, pdf_path: Path, output_dir: Path) -> Path:
    # Upload the PDF, then get a signed URL that the OCR endpoint can read
    uploaded_pdf = upload_pdf(client, pdf_path)
    signed_url = client.files.get_signed_url(file_id=uploaded_pdf.id)

    ocr_response = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": signed_url.url},
        include_image_base64=True  # return images as Base64 along with the text
    )

    # Save the response as JSON so the conversion step can work from the file
    response_output_path = output_dir / "ocr_response.json"
    response_output_path.write_text(
        json.dumps(ocr_response.model_dump(), ensure_ascii=False, indent=4),
        encoding='utf-8'
    )

    return response_output_path

 Extracting image data

Only the image data (Base64) contained in the OCR response is extracted.
def get_images_from_page(page: dict) -> dict:
    """Map each image ID on a page to a data URI built from its Base64 data."""
    mime_types = {'.png': 'image/png', '.jpg': 'image/jpeg', '.jpeg': 'image/jpeg', '.gif': 'image/gif'}
    images_dict = {}

    for image in page.get('images', []):
        img_id = image.get('id')
        base64_image = image.get('image_base64')
        if base64_image and img_id:
            # Strip an existing "data:...;base64," prefix, if there is one
            base64_data = base64_image.split(",", 1)[1] if "," in base64_image else base64_image
            # Choose the MIME type from the image ID's extension (default: PNG)
            ext = Path(img_id).suffix.lower()
            mime_type = mime_types.get(ext, 'image/png')
            data_uri = f"data:{mime_type};base64,{base64_data}"
            images_dict[img_id] = data_uri
    return images_dict
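
If embedding Base64 makes the Markdown file too large, one alternative is to write the images to disk instead. The sketch below is not part of the article's repository; `save_images_from_page` is a hypothetical helper that assumes the same `id` and `image_base64` keys as the function above.

```python
import base64
from pathlib import Path


def save_images_from_page(page: dict, out_dir: Path) -> list[Path]:
    """Decode each Base64 image on a page and write it under out_dir.

    Hypothetical helper; assumes each image dict has 'id' and 'image_base64'.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for image in page.get("images", []):
        img_id = image.get("id")
        b64 = image.get("image_base64")
        if not (img_id and b64):
            continue
        # Drop a "data:...;base64," prefix if present
        payload = b64.split(",", 1)[1] if "," in b64 else b64
        # Use only the file name part of the ID to stay inside out_dir
        target = out_dir / Path(img_id).name
        target.write_bytes(base64.b64decode(payload))
        saved.append(target)
    return saved
```

With files on disk, the image links in the Markdown would need to point at those file paths rather than at data URIs.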

 Converting JSON to Markdown

The OCR JSON data is converted to Markdown.
def json_to_markdown(json_file: Path, output_md: Path) -> None:
    data = json.loads(json_file.read_text(encoding='utf-8'))
    markdown_lines = []
    # Accept a full response dict, a single page dict, or a bare list of pages
    if isinstance(data, dict):
        pages = data.get('pages', [data])
    else:
        pages = data

    for page in pages:
        md = page.get('markdown', page.get('text', '')).strip()
        images_dict = get_images_from_page(page)

        # Replace each ![id](id) placeholder with an embedded data URI
        for img_id, data_uri in images_dict.items():
            placeholder = f"![{img_id}]({img_id})"
            md = md.replace(placeholder, f"![{img_id}]({data_uri})")

        # If the text references none of the images, append them at the end
        if images_dict and not any(f"![{img_id}](" in md for img_id in images_dict):
            for img_id, data_uri in images_dict.items():
                md += f"\n![{img_id}]({data_uri})\n"

        markdown_lines.append(md)

    # Join pages with a blank line between them
    output_md.write_text("\n\n".join(markdown_lines), encoding='utf-8')
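
To make the placeholder replacement concrete, here is a stripped-down sketch of just that step. The function name `embed_images` and the sample values are made up for illustration; the `![id](id)` placeholder form is the one the code above looks for.

```python
def embed_images(md: str, images: dict[str, str]) -> str:
    """Replace ![id](id) placeholders with ![id](data URI).

    Hypothetical helper that isolates the replacement step shown above.
    """
    for img_id, data_uri in images.items():
        md = md.replace(f"![{img_id}]({img_id})", f"![{img_id}]({data_uri})")
    return md


# Made-up sample data: one page with one image placeholder
page_md = "Figure 1:\n\n![img-0.jpeg](img-0.jpeg)"
images = {"img-0.jpeg": "data:image/jpeg;base64,AAAA"}

print(embed_images(page_md, images))
```

Because the image link target becomes a data URI, the resulting Markdown file is self-contained and can be moved around without a separate image folder.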

 Run script
import argparse
import sys
import logging

logging.basicConfig(level=logging.INFO, format='%(message)s')

def main():
    parser = argparse.ArgumentParser(description='OCR tool using Mistral OCR API')
    parser.add_argument('-p', '--pdf', type=Path, required=True, help='Path to the PDF file to process')
    parser.add_argument('-o', '--output', type=Path, required=True, help='Base directory for output files')
    args = parser.parse_args()

    pdf_path: Path = args.pdf
    output_base: Path = args.output

    try:
        api_key = load_api_key()
    except ValueError as e:
        logging.error(e)
        sys.exit(1)

    client = Mistral(api_key=api_key)
    output_dir = output_base / f'output_{datetime.datetime.now().strftime("%Y%m%d_%H%M%S")}'
    output_dir.mkdir(parents=True, exist_ok=True)

    try:
        response_output_path = process_ocr(client, pdf_path, output_dir)
    except Exception as e:
        logging.error(f"An error occurred during OCR processing: {e}")
        sys.exit(1)

    output_md_path = output_dir / 'output.md'
    json_to_markdown(response_output_path, output_md_path)

if __name__ == '__main__':
    main()
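
The script above calls `load_api_key()`, which is not shown in this article. The repository has its own version; the following is only a minimal sketch of what such a helper might look like, assuming the key is stored in the `MISTRAL_API_KEY` environment variable as in the official sample. It raises `ValueError` so that it fits the `except ValueError` handler in `main()`.

```python
import os


def load_api_key(env_var: str = "MISTRAL_API_KEY") -> str:
    """Read the API key from an environment variable.

    Hypothetical sketch; the actual helper in the repository may differ.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise ValueError(f"Environment variable {env_var} is not set")
    return api_key
```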

 4. Trying it out 🚀

 How to run

python ocr.py --pdf sample.pdf --output output_directory

 Result

This time I converted the PDF of "Attention Is All You Need".

https://arxiv.org/abs/1706.03762
Images and equations are embedded in the Markdown file without any problems.


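 5. Closing

I usually manage papers and documents in Obsidian, so being able to organize OCR output as-is is a big help! Being able to handle equations and images in Markdown as well is a particularly nice point.
Markdown is also easy to use as input for LLMs (large language models), so the range of uses for Mistral OCR looks set to keep growing.
Give it a try!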