
OCR a PDF with Live Text on macOS (Swift/CLI)

Published on 2024/04/27

Swift

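When you want to do OCR on macOS, you could install a separate tool or call a cloud API, but the built-in text recognition (Live Text) is of reasonably high quality, so I figured I might as well use that and wrote some simple code in Swift. The code depends on neither AppKit nor UIKit, so I expect everything except the CLI part to work on iOS and other platforms as well.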

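The finished code is available here.

Below, I explain each step as a separate function, but in the finished code they are all combined into a single function.

The snippets below also assume an error type named MyError, defined as follows; it has a different name in the finished code.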

struct MyError: Error {
    let description: String

    init(_ description: String) {
        self.description = description
    }
}

Text recognition (VisionKit/Vision)

VisionKit

Text recognition is the process of extracting text from an image.

If all you need is the recognized string, VisionKit's ImageAnalyzer makes this easy.

import CoreGraphics
import VisionKit

func analyze(image: CGImage) async throws -> String {
    let analyzer = ImageAnalyzer()
    let configuration = ImageAnalyzer.Configuration(.text)
    let analysis = try await analyzer.analyze(image, orientation: .up, configuration: configuration)
    return analysis.transcript
}

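However, VisionKit is designed primarily to be used together with UIKit or AppKit, so it does not directly provide APIs for anything beyond this, such as getting a list of which strings appear at which positions (although it may be possible with some clever implementation).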

Vision

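The Vision framework provides a lower-level API.

It is the one to use when you need finer control over the processing. However, you also have to handle yourself the more complex processing that VisionKit does for you, so getting the same result may take more code than expected. (In my case, whether because of preprocessing or because of parameters, I could not get Vision to recognize vertical text that ImageAnalyzer had recognized.)

An overview of text recognition is available on the following page.

The basic flow in the Vision framework is to pass an array of requests (VNRequest) to a VNImageRequestHandler bound to an image, and to obtain an array of results (VNObservation) for each request.

For text recognition, the request is VNRecognizeTextRequest and the results are VNRecognizedTextObservation values.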

import Vision

func observe(image: CGImage) throws -> [VNRecognizedTextObservation] {
    // Create a request handler bound to the image
    let handler = VNImageRequestHandler(cgImage: image)

    // Create the request
    let request = VNRecognizeTextRequest()

    // Perform the request
    try handler.perform([request])

    // `results` is a property of the request, so the message names the request
    guard let results = request.results else {
        throw MyError("VNRecognizeTextRequest.results is nil.")
    }

    return results
}

A VNRecognizedTextObservation can hold multiple candidates for the recognized text, so use topCandidates as follows to take the best one.

import Vision

func recognize(image: CGImage) throws -> [VNRecognizedText] {
    return try observe(image: image).compactMap({ result in
        result.topCandidates(1).first
    })
}

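Besides the recognized string (string), a VNRecognizedText gives you the bounding box for any range of the string (boundingBox(for:)), the recognition confidence (confidence), and so on.

Here, I define a structure Item that holds the recognized string and the bounding box of the whole string in the coordinate system of the original image, and write the code that converts a VNRecognizedText into it.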

struct Item: Codable {
    let text: String
    let rect: Rect?
}

struct Rect: Codable {
    let x: Double
    let y: Double
    let width: Double
    let height: Double
}
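
For reference, here is a minimal sketch of how these structures serialize with JSONEncoder. The sample values are made up, and Equatable is added only so that the round trip can be compared. The point of interest is that a nil rect is omitted from the output rather than written as null, which is how Swift's synthesized Codable conformance treats optional properties.

```swift
import Foundation

// Same shape as the article's `Item` and `Rect`; `Equatable` is added
// here only so that the round trip below can be compared.
struct Rect: Codable, Equatable {
    let x: Double
    let y: Double
    let width: Double
    let height: Double
}

struct Item: Codable, Equatable {
    let text: String
    let rect: Rect?
}

// Hypothetical sample values, not real recognition output.
let items = [
    Item(text: "Hello", rect: Rect(x: 10, y: 5, width: 100, height: 20)),
    Item(text: "World", rect: nil),
]

let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]

// Encode to JSON and print it for inspection.
let encoded = try encoder.encode(items)
print(String(decoding: encoded, as: UTF8.self))

// Decoding gives back the same values, including the missing `rect`.
let decoded = try JSONDecoder().decode([Item].self, from: encoded)
```

Anything that consumes the JSON therefore has to treat rect as an optional key, not as a key that is always present.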

Note that boundingBox(for:) returns coordinates normalized to the range 0 to 1, so VNImageRectForNormalizedRect(_:_:_:) is used to convert them to the coordinate system of the original image.

import Vision

func extract(image: CGImage) async throws -> [Item] {
    // The closure never returns nil, so `map` is enough here
    return try recognize(image: image).map { result in
        // Get the recognized string with `string`
        let text = result.string

        // Use the whole text as the range for the bounding box
        let range = text.startIndex..<text.endIndex

        // Get the coordinates with `boundingBox(for:)` (returns VNRectangleObservation?)
        let rect = try result.boundingBox(for: range).map { rect in
            // Convert to the coordinate system of the original image
            let rect = VNImageRectForNormalizedRect(rect.boundingBox, image.width, image.height)

            // The origin is at the bottom-left here, so flip to a top-left origin
            let y = Double(image.height) - rect.maxY

            return Rect(x: rect.minX, y: y, width: rect.width, height: rect.height)
        }

        return Item(text: text, rect: rect)
    }
}
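
To make the coordinate conversion easier to check by hand, here is the same arithmetic as a plain-Swift sketch that does not need Vision. The Box type and the toTopLeftPixels function are hypothetical names introduced only for this illustration, and the scaling step assumes that VNImageRectForNormalizedRect simply multiplies the normalized values by the image size.

```swift
// A plain-Swift stand-in for the conversion, so that the arithmetic can
// be checked without Vision. `Box` is a hypothetical type.
struct Box: Equatable {
    var x: Double
    var y: Double
    var width: Double
    var height: Double
}

/// Converts a normalized box (0...1, origin at the bottom-left) into
/// pixel coordinates with the origin at the top-left.
func toTopLeftPixels(_ normalized: Box, imageWidth: Int, imageHeight: Int) -> Box {
    let w = Double(imageWidth)
    let h = Double(imageHeight)

    // Scale to pixels (still with a bottom-left origin).
    let x = normalized.x * w
    let y = normalized.y * h
    let width = normalized.width * w
    let height = normalized.height * h

    // Flip vertically: the new `y` is the distance from the top edge of
    // the image to the top edge of the box, i.e. imageHeight - maxY.
    return Box(x: x, y: h - (y + height), width: width, height: height)
}

let box = toTopLeftPixels(
    Box(x: 0.25, y: 0.5, width: 0.5, height: 0.25),
    imageWidth: 1000,
    imageHeight: 800
)
print(box) // Box(x: 250.0, y: 200.0, width: 500.0, height: 200.0)
```

The box occupies the band from 400 to 600 pixels measured from the bottom, which is the band from 200 to 400 pixels measured from the top, so the flipped y is 200.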

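That completes the text recognition part.

Rendering a PDF (CoreGraphics)

Next, using CoreGraphics, I extract each page from a PDF and convert it to an image (you can do almost the same thing with PDFKit).

Opening a PDF

First, open the PDF at the given path as a CGPDFDocument.

In Swift (Foundation), file paths are also handled uniformly as URLs, so the function takes a URL. To create a URL from a path string, you can use URL(filePath:directoryHint:relativeTo:).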

import CoreGraphics
import Foundation

func openPDF(url: URL) throws -> CGPDFDocument {
    guard let document = CGPDFDocument(url as CFURL) else {
        throw MyError("Failed to open PDF: \(url)")
    }
    
    return document
}

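However, when opening directly from a URL, it is hard to tell whether the URL is invalid (for example, the file does not exist) or the PDF is invalid (the data is corrupted), so here I first read the file as Data and then load that as a PDF.

To pass Data to CGPDFDocument, you need to go through a CGDataProvider (with PDFKit's PDFDocument, you can pass Data directly).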

import CoreGraphics
import Foundation

func openPDF(url: URL) throws -> CGPDFDocument {
    let data = try Data(contentsOf: url)

    guard let provider = CGDataProvider(data: data as CFData) else {
        throw MyError("Failed to initialize CGDataProvider.")
    }

    guard let document = CGPDFDocument(provider) else {
        throw MyError("Failed to open PDF: \(url)")
    }
    
    return document
}
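
Reading the file as Data first also makes it possible to do a cheap sanity check before handing the bytes to CGPDFDocument. The following is a minimal sketch, not part of the finished code: NotPDFError, hasPDFHeader and loadPDFData are hypothetical names, and the 1024-byte search window is an assumption based on the fact that some PDF readers tolerate a few bytes before the "%PDF-" marker.

```swift
import Foundation

/// Hypothetical error for data that was read successfully but does not
/// look like a PDF.
struct NotPDFError: Error {}

/// Returns `true` if the "%PDF-" marker appears within the first 1024 bytes.
/// This only checks the header; it does not validate the rest of the file.
func hasPDFHeader(_ data: Data) -> Bool {
    let head = data.prefix(1024)
    return head.range(of: Data("%PDF-".utf8)) != nil
}

/// Reads a file and rejects it early if it has no PDF header.
func loadPDFData(url: URL) throws -> Data {
    // I/O problems (missing file, no permission, ...) propagate as they are.
    let data = try Data(contentsOf: url)

    // The file was readable, so a failure here is a content problem.
    guard hasPDFHeader(data) else {
        throw NotPDFError()
    }

    return data
}
```

With this split, an error thrown by Data(contentsOf:) points to the path, while NotPDFError (or a later failure of CGPDFDocument) points to the contents.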

Processing each page of a PDF

To process each page of a PDF, get the number of pages with CGPDFDocument's numberOfPages and get each page with page(at:).

Note that CoreGraphics' CGPDFDocument uses 1-based page numbers to access pages.

import CoreGraphics

func forEachPage(document: CGPDFDocument, body: (CGPDFPage) throws -> Void) throws {
    // Page numbers start at 1 (CoreGraphics).
    // A half-open range is used because `1...0` would trap for a document with no pages.
    for i in 1..<(document.numberOfPages + 1) {
        guard let page = document.page(at: i) else {
            throw MyError("Failed to get page: \(i)")
        }

        try body(page)
    }
}

With PDFKit's PDFDocument, you get the number of pages with pageCount and access pages with a 0-based index (PDFDocument.page(at:)).

import PDFKit

func forEachPage(document: PDFDocument, body: (PDFPage) throws -> Void) throws {
    // Indices start at 0 (PDFKit)
    for i in 0..<document.pageCount {
        guard let page = document.page(at: i) else {
            throw MyError("Failed to get page: \(i)")
        }
        
        try body(page)
    }
}
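
Because the two APIs differ by one, it helps to keep the page arithmetic in one place. The sketch below is plain Swift with hypothetical function names. It follows the same clamping idea that the finished code uses for its start and end options: both are 1-based and inclusive, and out-of-range values are clamped instead of being rejected.

```swift
/// Clamps an optional, 1-based, inclusive page range to a document with
/// `count` pages and returns it as a half-open range of 1-based page numbers.
func pageNumbers(start: Int?, end: Int?, count: Int) -> Range<Int> {
    // Lower bound: at least 1, at most one past the last page.
    let lower = start.map { max(1, min(count + 1, $0)) } ?? 1

    // Upper bound (exclusive): one past the requested last page,
    // never below `lower` and never beyond one past the last page.
    let upper = end.map { max(lower, min(count, $0) + 1) } ?? (count + 1)

    return lower..<upper
}

/// Converts 1-based page numbers (CoreGraphics) to 0-based indices (PDFKit).
func zeroBasedIndices(_ pages: Range<Int>) -> Range<Int> {
    (pages.lowerBound - 1)..<(pages.upperBound - 1)
}

print(pageNumbers(start: 2, end: 3, count: 5))    // 2..<4
print(pageNumbers(start: 0, end: 99, count: 5))   // 1..<6
print(zeroBasedIndices(2..<4))                    // 1..<3
```

An empty document, or a start that lies after the end, produces an empty range, so a for loop over the result simply does nothing.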

Rendering a PDF page

To render a PDF page as an image with CoreGraphics, you first create a CGContext to render into, draw the PDF page into it, and then create a CGImage with makeImage().

First, call CGContext(data:width:height:bitsPerComponent:bytesPerRow:space:bitmapInfo:) to create the CGContext.

As for the color space, CGColorSpaceCreateDeviceRGB() would be fine for temporary processing, but since the image may also be saved, I use the sRGB color space here.

import CoreGraphics

func makeContext(width: Int, height: Int) throws -> CGContext {
    // Use the sRGB color space
    guard let space = CGColorSpace(name: CGColorSpace.sRGB) else {
        throw MyError("Failed to initialize CGColorSpace.")
    }

    // Create the context
    guard let context = CGContext(
        data: nil, // nil lets Core Graphics allocate the buffer
        width: width,
        height: height,
        bitsPerComponent: 8, // 8 bits (256 levels) per component
        bytesPerRow: 0, // 0 lets Core Graphics compute it
        space: space,
        bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue // RGBA: alpha is stored last, RGB is premultiplied by alpha
    ) else {
        throw MyError("Failed to initialize CGContext.")
    }

    return context
}

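Putting this together, the rendering code looks like this.

The size of the drawing area that a PDF page expects can be obtained with getBoxRect(_:). However, rendering at that size as is may produce a blurry result that affects recognition, so the size is multiplied by ratio to render at a higher resolution.

Drawing the PDF page is just a matter of calling CGContext.drawPDFPage(_:) (with PDFKit, you can use PDFPage.draw(with:to:)).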

import CoreGraphics
import Foundation

func render(page: CGPDFPage, ratio: Double) throws -> CGImage {
    // Get the page size
    let box = page.getBoxRect(.mediaBox)

    // Multiply by `ratio` to render at a higher resolution
    let width = Int(ceil(box.width * ratio))
    let height = Int(ceil(box.height * ratio))

    // Create the context
    let context = try makeContext(width: width, height: height)

    // Fill with a white background
    context.setFillColor(.white)
    context.fill([CGRect(x: 0, y: 0, width: context.width, height: context.height)])

    // Draw the PDF page scaled by `ratio`
    context.scaleBy(x: ratio, y: ratio)
    context.drawPDFPage(page)

    // Create the image
    guard let image = context.makeImage() else {
        throw MyError("Failed to make an image.")
    }

    return image
}
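
To get a feel for what a given ratio means, the following plain-Swift sketch works through the numbers. It relies on the fact that PDF page sizes are expressed in points, where 1 point is 1/72 inch by default. The function names are hypothetical, and the memory figure is only a lower bound for an 8-bit RGBA bitmap, since the actual row size chosen by Core Graphics may include padding.

```swift
import Foundation

/// PDF user space uses points: by default, 1 point = 1/72 inch.
let pointsPerInch = 72.0

/// Pixel size of the bitmap for a page of the given size in points.
func pixelSize(widthInPoints: Double, heightInPoints: Double, ratio: Double) -> (width: Int, height: Int) {
    (width: Int(ceil(widthInPoints * ratio)), height: Int(ceil(heightInPoints * ratio)))
}

/// Effective resolution of the rendered image in pixels per inch.
func effectiveDPI(ratio: Double) -> Double {
    pointsPerInch * ratio
}

/// Lower bound for the bitmap size in bytes (4 bytes per pixel, no row padding).
func minimumBitmapBytes(width: Int, height: Int) -> Int {
    width * height * 4
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
let size = pixelSize(widthInPoints: 612, heightInPoints: 792, ratio: 2.0)
let bytes = minimumBitmapBytes(width: size.width, height: size.height)
print(size.width, size.height, effectiveDPI(ratio: 2.0), bytes)
```

For a US Letter page, a ratio of 2.0 gives a 1224 x 1584 pixel image at 144 dpi and needs at least about 7.8 MB per page. Memory grows with the square of the ratio, so larger values trade memory and time for sharper input.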

Generating a PNG (ImageIO)

To check the rendering result, I write code that saves it as a PNG.

With UIKit or AppKit, you would use UIImage.pngData() or NSBitmapImageRep.representation(using:properties:), but when working with Core Graphics you use ImageIO.

The basic flow for converting a CGImage to Data with ImageIO is to create a CGImageDestination, add the CGImage to it, and call CGImageDestinationFinalize(_:).

CGImageDestinationFinalize(_:) returns true if the operation succeeds and false if it fails.

To save a PNG to a path (URL), call CGImageDestinationCreateWithURL(_:_:_:_:) as follows.

import ImageIO
import UniformTypeIdentifiers

func savePNG(url: URL, image: CGImage) throws {
    guard let destination = CGImageDestinationCreateWithURL(url as CFURL, UTType.png.identifier as CFString, 1, nil) else {
        throw MyError("Failed to initialize CGImageDestination.")
    }

    CGImageDestinationAddImage(destination, image, nil)

    if !CGImageDestinationFinalize(destination) {
        throw MyError("Failed to finalize CGImageDestination.")
    }
}

Here, however, I convert to Data first, in order to keep conversion problems separate from saving problems.

In that case, create a CFMutableData with CFDataCreateMutable(_:_:) and use it to call CGImageDestinationCreateWithData(_:_:_:_:).

import ImageIO
import UniformTypeIdentifiers

func generatePNG(image: CGImage) throws -> Data {
    guard let data = CFDataCreateMutable(nil, 0) else {
        throw MyError("Failed to initialize CFData.")
    }

    guard let destination = CGImageDestinationCreateWithData(data, UTType.png.identifier as CFString, 1, nil) else {
        throw MyError("Failed to initialize CGImageDestination.")
    }

    CGImageDestinationAddImage(destination, image, nil)

    if !CGImageDestinationFinalize(destination) {
        throw MyError("Failed to finalize CGImageDestination.")
    }

    return data as Data
}
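
One benefit of producing Data first is that the result can be inspected before anything is written to disk. As a small illustration, the sketch below checks that some data starts with the 8-byte PNG signature. looksLikePNG is a hypothetical helper, not part of the finished code, and it only looks at the header.

```swift
import Foundation

/// The 8-byte signature that every PNG file starts with.
let pngSignature: [UInt8] = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]

/// Returns `true` if `data` begins with the PNG signature.
/// This only checks the header; it does not validate the rest of the file.
func looksLikePNG(_ data: Data) -> Bool {
    data.starts(with: pngSignature)
}

// Hypothetical sample data: a PNG signature followed by a few arbitrary bytes.
let sample = Data(pngSignature + [0x00, 0x00, 0x00, 0x0D])
print(looksLikePNG(sample))                 // true
print(looksLikePNG(Data("%PDF-1.7".utf8)))  // false
```

In a debug build, one could, for example, assert looksLikePNG on the value returned by generatePNG(image:) before writing it out.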

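That completes the basic processing.

Creating the CLI (ArgumentParser)

To make these functions easy to use as a command, I create a CLI with ArgumentParser (apple/swift-argument-parser).

ArgumentParser is a library made by Apple, but it is not part of the standard library or the system frameworks, so you need to add it as a dependency with Swift Package Manager.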

Package.swift

// swift-tools-version: 5.10

import PackageDescription

let package = Package(
    name: "<command-line-tool>",
    platforms: [
        .macOS(.v13)
    ],
    dependencies: [
        .package(url: "https://github.com/apple/swift-argument-parser", from: "1.3.0")
    ],
    targets: [
        .executableTarget(
            name: "<command-line-tool>",
            dependencies: [
                .product(name: "ArgumentParser", package: "swift-argument-parser")
            ]
        )
    ]
)

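After that, just defining a structure like the following is enough for the command-line arguments to be parsed and the run() method to be executed.

Because await is used inside run(), AsyncParsableCommand is used here instead of ParsableCommand.

An overview of how to define arguments is in the documentation under Declaring Arguments, Options, and Flags.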

Sources/command.swift

import ArgumentParser

@main
struct Command: AsyncParsableCommand {
    @Flag(name: .shortAndLong, help: "Overwrite existing files.")
    var force: Bool = false

    @Flag(name: .shortAndLong, help: "Output PNG files (for debugging).")
    var png: Bool = false

    @Flag(name: .shortAndLong, help: "Output JSON files.")
    var json: Bool = false

    @Option(name: .shortAndLong, help: "Output text files. Default to true if there are no other textual outputs (i.e., JSON).")
    var text: Bool? = nil

    @Option(name: .shortAndLong, help: "Scale factor to render PDF pages as images. Larger values may improve text recognition accuracy.")
    var ratio: Double = 2.0

    @Option(name: .shortAndLong, help: "Locales to recognize.")
    var locales: [String] = []

    @Option(name: .shortAndLong, help: "Start page number (1-based, inclusive).")
    var start: Int? = nil

    @Option(name: .shortAndLong, help: "End page number (1-based, inclusive).")
    var end: Int? = nil

    @Option(name: .shortAndLong, help: "Output directory.", completion: .directory)
    var out: String = "out"

    @Argument(help: "Input PDF file.", completion: .file(extensions: ["pdf"]))
    var input: String

    mutating func run() async throws {
        // ...
    }
}

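Finished code

After adding other small pieces such as reading and writing files, creating directories, and applying the arguments, the finished code looks like this.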

Package.swift

// swift-tools-version: 5.10

import PackageDescription

let package = Package(
    name: "PDFLiveText",
    platforms: [
        .macOS(.v13)
    ],
    dependencies: [
        .package(url: "https://github.com/apple/swift-argument-parser", from: "1.3.0")
    ],
    targets: [
        .executableTarget(
            name: "PDFLiveText",
            dependencies: [
                .product(name: "ArgumentParser", package: "swift-argument-parser")
            ]
        )
    ]
)

Sources/command.swift

import ArgumentParser
import UniformTypeIdentifiers
import Vision
import VisionKit

struct CommandError: Error {
    let description: String

    init(_ description: String) {
        self.description = description
    }
}

struct Page: Codable {
    let size: Size
    let items: [Item]
}

struct Size: Codable {
    let width: Int
    let height: Int
}

struct Item: Codable {
    let text: String
    let rect: Rect?
}

struct Rect: Codable {
    let x: Double
    let y: Double
    let width: Double
    let height: Double
}

@main
struct Command: AsyncParsableCommand {
    @Flag(name: .shortAndLong, help: "Overwrite existing files.")
    var force: Bool = false

    @Flag(name: .shortAndLong, help: "Output PNG files (for debugging).")
    var png: Bool = false

    @Flag(name: .shortAndLong, help: "Output JSON files.")
    var json: Bool = false

    @Option(name: .shortAndLong, help: "Output text files. Default to true if there are no other textual outputs (i.e., JSON).")
    var text: Bool? = nil

    @Option(name: .shortAndLong, help: "Scale factor to render PDF pages as images. Larger values may improve text recognition accuracy.")
    var ratio: Double = 2.0

    @Option(name: .shortAndLong, help: "Locales to recognize.")
    var locales: [String] = []

    @Option(name: .shortAndLong, help: "Start page number (1-based, inclusive).")
    var start: Int? = nil

    @Option(name: .shortAndLong, help: "End page number (1-based, inclusive).")
    var end: Int? = nil

    @Option(name: .shortAndLong, help: "Output directory.", completion: .directory)
    var out: String = "out"

    @Argument(help: "Input PDF file.", completion: .file(extensions: ["pdf"]))
    var input: String

    mutating func run() async throws {
        let text = text ?? !json

        let out = URL(filePath: out)
        let input = URL(filePath: input)

        let data = try Data(contentsOf: input)

        guard let provider = CGDataProvider(data: data as CFData) else {
            throw CommandError("Failed to initialize CGDataProvider.")
        }

        guard let document = CGPDFDocument(provider) else {
            throw CommandError("Failed to initialize CGPDFDocument.")
        }

        let analyzer = ImageAnalyzer()
        var configuration = ImageAnalyzer.Configuration(.text)

        if !locales.isEmpty {
            configuration.locales = locales
        }

        do {
            try FileManager.default.createDirectory(at: out, withIntermediateDirectories: true)
        } catch CocoaError.fileWriteFileExists {
            // ignore
        } catch {
            throw error
        }

        let options = force ? [] : Data.WritingOptions.withoutOverwriting

        let n = document.numberOfPages

        let start = start.map { max(1, min(n + 1, $0)) } ?? 1
        let end = end.map { max(start, min(n + 1, $0 + 1)) } ?? n + 1

        for i in start..<end {
            guard let page = document.page(at: i) else {
                throw CommandError("Failed to get page: \(i)")
            }

            let box = page.getBoxRect(.mediaBox)

            let width = Int(ceil(box.width * ratio))
            let height = Int(ceil(box.height * ratio))
            
            guard let space = CGColorSpace(name: CGColorSpace.sRGB) else {
                throw CommandError("Failed to initialize CGColorSpace.")
            }

            guard let context = CGContext(
                data: nil,
                width: width,
                height: height,
                bitsPerComponent: 8,
                bytesPerRow: 0,
                space: space,
                bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
            ) else {
                throw CommandError("Failed to initialize CGContext.")
            }

            context.setFillColor(.white)
            context.fill([CGRect(x: 0, y: 0, width: width, height: height)])

            context.scaleBy(x: ratio, y: ratio)
            context.drawPDFPage(page)

            guard let image = context.makeImage() else {
                throw CommandError("Failed to make an image.")
            }

            if png {
                let file = out.appending(component: "\(i).png")

                guard let data = CFDataCreateMutable(nil, 0) else {
                    throw CommandError("Failed to initialize CFData.")
                }

                guard let destination = CGImageDestinationCreateWithData(data, UTType.png.identifier as CFString, 1, nil) else {
                    throw CommandError("Failed to initialize CGImageDestination.")
                }

                CGImageDestinationAddImage(destination, image, nil)

                if !CGImageDestinationFinalize(destination) {
                    throw CommandError("Failed to finalize CGImageDestination.")
                }

                try (data as Data).write(to: file, options: options)
            }
            
            if json {
                let file = out.appending(component: "\(i).json")

                let handler = VNImageRequestHandler(cgImage: image)
                let request = VNRecognizeTextRequest()

                if locales.isEmpty {
                    request.automaticallyDetectsLanguage = true
                } else {
                    request.recognitionLanguages = locales
                }

                try handler.perform([request])

                guard let results = request.results else {
                    throw CommandError("VNRecognizeTextRequest.results is nil.")
                }

                let items = try results.compactMap({ $0.topCandidates(1).first }).compactMap({ result in
                    let text = result.string

                    let rect = try result.boundingBox(for: text.startIndex..<text.endIndex).map {
                        VNImageRectForNormalizedRect($0.boundingBox, image.width, image.height)
                    }.map {
                        Rect(x: $0.minX, y: Double(image.height) - $0.maxY, width: $0.width, height: $0.height)
                    }

                    return Item(text: text, rect: rect)
                })

                let page = Page(size: Size(width: image.width, height: image.height), items: items)

                let data = try JSONEncoder().encode(page)
                try data.write(to: file, options: options)
            }

            if text {
                let file = out.appending(component: "\(i).txt")

                let analysis = try await analyzer.analyze(image, orientation: .up, configuration: configuration)

                guard let data = analysis.transcript.data(using: .utf8) else {
                    throw CommandError("Failed to encode a string to UTF8.")
                }

                try data.write(to: file, options: options)
            }

            print("DONE: \(i)/\(n)")
        }
    }
}

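Being able to do this much in about 200 lines of code really shows the strength of Swift and Apple's frameworks.

Conclusion

Implementing text recognition often takes a fair amount of effort and overhead, such as setting up open-source libraries and models, or arranging network access and credentials to call a server API. With Apple's standard frameworks, it was easy to implement.

The Core Graphics APIs are quirky compared with modern frameworks, but being able to handle PDFs seamlessly is just what you would expect from Quartz.

There are things I gave up on implementing this time, such as parallelizing the processing (it worked with ImageAnalyzer, but VNRecognizeTextRequest hung) and recognizing vertical text with the Vision framework (VisionKit can do it, so it is probably a matter of preprocessing or parameters). I will leave these as future work.

I hope this is useful to someone.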