
Dify × PDF: A Simple Solution for Enabling AI to Read Scanned PDFs


What I Created

I've created a Dify plugin called PDF to Images Converter Plugin!
It has been officially registered on the marketplace, so in this article I'll introduce the plugin and explain the issues with the conventional approach that couldn't be fully covered on the marketplace page.

Introduction

This plugin is intended for scenarios in Dify where you load a PDF and pass its content to an AI model for some kind of processing.
In such cases, PDFs can be broadly classified into the following two types:
(This distinction matters whenever you handle PDFs in a system or program, not just in Dify, for example when using Python libraries.)

  1. Text-embedded PDF
    PDFs with text information embedded inside. In practice, these are the ones where you can select and copy text with your mouse.
    (You can select text as shown in the "Write for yourself" section below.)

    The "What is Zenn" page saved as "What is Zenn?.pdf"

  2. Scanned PDF (Image-based PDF, Non-text-embedded PDF, etc.)
    PDFs where there is no text information inside, and the entire page is stored as an image. In this case, since the text cannot be read directly, OCR or image recognition is required.
    (Since I couldn't find a formal name even after asking AI, I'll call it a "Scanned PDF.")
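To tell the two types apart programmatically (outside Dify), one rough heuristic is to look for font resources in the raw PDF bytes: a text-embedded PDF declares at least one `/Font` resource, while a purely scanned PDF typically contains only image objects. This is a sketch of that idea, not a guarantee; PDFs that pack their dictionaries into compressed object streams can defeat it:

```python
def classify_pdf(data: bytes) -> str:
    """Classify raw PDF bytes as 'text-embedded' or 'scanned'.

    Heuristic only: text-embedded PDFs declare /Font resources,
    while scanned PDFs usually contain only /Image XObjects
    (often JPEG-compressed via /DCTDecode). PDFs that store their
    dictionaries inside compressed object streams may be
    misclassified as 'unknown'.
    """
    if b"/Font" in data:
        return "text-embedded"
    if b"/Image" in data or b"/DCTDecode" in data:
        return "scanned"
    return "unknown"


# Minimal synthetic byte strings for illustration (not real, openable PDFs):
text_like = b"%PDF-1.4 ... /Type /Font /Subtype /TrueType ..."
scan_like = b"%PDF-1.4 ... /Subtype /Image /Filter /DCTDecode ..."
```

For a real file you would call `classify_pdf(open("input.pdf", "rb").read())`; a dedicated PDF library gives a more reliable answer, but this quick check needs nothing beyond the standard library.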

Issues with the Conventional Approach

Before discussing the solution, let me share the issues encountered with each pattern.
When processing PDFs in Dify, the standard approach is to use the "File Upload"[1] feature and the "Document Extractor"[2] node.

For Text-embedded PDFs, standard features are sufficient

Below is a simple Dify workflow for reading "What is Zenn?.pdf".
As shown in the "text" section at the bottom center, you can see that the PDF content is being read successfully.

If you'd like to try even this minimal configuration yourself, import the following file.

test.yml
app:
  description: ''
  icon: 🤖
  icon_background: '#FFEAD5'
  mode: advanced-chat
  name: test
  use_icon_as_answer_icon: false
dependencies:
- current_identifier: null
  type: marketplace
  value:
    marketplace_plugin_unique_identifier: langgenius/openai:0.2.3@5a7f82fa86e28332ad51941d0b491c1e8a38ead539656442f7bf4c6129cd15fa
kind: app
version: 0.3.1
workflow:
  conversation_variables: []
  environment_variables: []
  features:
    file_upload:
      allowed_file_extensions:
      - .JPG
      - .JPEG
      - .PNG
      - .GIF
      - .WEBP
      - .SVG
      allowed_file_types:
      - image
      - document
      allowed_file_upload_methods:
      - local_file
      enabled: true
      fileUploadConfig:
        audio_file_size_limit: 50
        batch_count_limit: 5
        file_size_limit: 15
        image_file_size_limit: 10
        video_file_size_limit: 100
        workflow_file_upload_limit: 10
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
      number_limits: 3
    opening_statement: ''
    retriever_resource:
      enabled: true
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        isInIteration: false
        isInLoop: false
        sourceType: start
        targetType: document-extractor
      id: 1754876217921-source-1756044569689-target
      source: '1754876217921'
      sourceHandle: source
      target: '1756044569689'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        isInLoop: false
        sourceType: document-extractor
        targetType: llm
      id: 1756044569689-source-1756044579087-target
      source: '1756044569689'
      sourceHandle: source
      target: '1756044579087'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        isInLoop: false
        sourceType: llm
        targetType: answer
      id: 1756044579087-source-1756044632703-target
      source: '1756044579087'
      sourceHandle: source
      target: '1756044632703'
      targetHandle: target
      type: custom
      zIndex: 0
    nodes:
    - data:
        desc: ''
        selected: false
        title: Start
        type: start
        variables: []
      height: 54
      id: '1754876217921'
      position:
        x: 429.3783617376292
        y: 23.778184703671002
      positionAbsolute:
        x: 429.3783617376292
        y: 23.778184703671002
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        is_array_file: true
        selected: false
        title: Text Extraction
        type: document-extractor
        variable_selector:
        - sys
        - files
      height: 94
      id: '1756044569689'
      position:
        x: 476.8571428571429
        y: 105.35714285714283
      positionAbsolute:
        x: 476.8571428571429
        y: 105.35714285714283
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        context:
          enabled: false
          variable_selector: []
        desc: ''
        model:
          completion_params:
            temperature: 0.7
          mode: chat
          name: gpt-4o-mini
          provider: langgenius/openai/openai
        prompt_template:
        - id: 4a44c71b-bde2-4500-998d-7a159fb9c46d
          role: system
          text: ''
        - id: 949bf395-1a93-4c3c-b7c8-58040033ede2
          role: user
          text: '{{#sys.query#}}\n\n\n            {{#1756044569689.text#}}'
        selected: false
        title: LLM
        type: llm
        variables: []
        vision:
          enabled: false
      height: 90
      id: '1756044579087'
      position:
        x: 554.4313111598581
        y: 225.6386357299544
      positionAbsolute:
        x: 554.4313111598581
        y: 225.6386357299544
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        answer: '{{#1756044579087.text#}}'
        desc: ''
        selected: false
        title: Answer
        type: answer
        variables: []
      height: 105
      id: '1756044632703'
      position:
        x: 609.7149753261024
        y: 338.058419526812
      positionAbsolute:
        x: 609.7149753261024
        y: 338.058419526812
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: -451.4914715123374
      y: 204.4258469818957
      zoom: 1.0051611574364074

Scanned PDFs cannot be read with standard features

On the other hand, with "What is Zenn?_scan.pdf", a scanned version of the same page, the "text" field is blank, showing that no information could be retrieved. Given how the Document Extractor works, this is unavoidable.


Content of the scanned PDF. Text cannot be copied here.


Text cannot be extracted from the scanned PDF using the Document Extractor node.

Solution

Here is how it looks using the Dify plugin I created, PDF to Images Converter Plugin.
The content of the scanned PDF "What is Zenn?_scan.pdf" is passed to the AI (LLM node) as an image, and it correctly provides an answer about the content.

The converted image "What is Zenn?_scan_page_1.png" is being used as input for the AI.

In the "LLM" node for AI processing, the image file created by the plugin is specified in the "Vision" section. By doing this, rather than the developer extracting text from the image, the AI's own OCR capabilities are utilized to read the content.

The Vision input is configured at the bottom in the format "(x) pdf conversion tool / (x) files Array[Files]".
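Under the hood, a plugin like this has to rasterize each PDF page into an image file before it can be fed to the LLM's vision input. Here is a minimal sketch of that step using the pdf2image library; this is an assumption for illustration, not the plugin's actual implementation, and pdf2image additionally requires the Poppler utilities to be installed:

```python
from pathlib import Path


def page_image_name(pdf_path: str, page: int) -> str:
    """Build a per-page PNG name, e.g. 'What is Zenn?_scan_page_1.png'."""
    return f"{Path(pdf_path).stem}_page_{page}.png"


def pdf_to_page_images(pdf_path: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Rasterize each page of a PDF into a PNG file.

    Returns the list of written file paths. Requires pdf2image
    (pip install pdf2image) and Poppler; the import is deferred so
    the helper above stays usable without the optional dependency.
    """
    from pdf2image import convert_from_path  # deferred: optional dependency

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, image in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = out / page_image_name(pdf_path, i)
        image.save(target, "PNG")
        paths.append(str(target))
    return paths
```

Calling `pdf_to_page_images("What is Zenn?_scan.pdf", "out")` would produce files named like "What is Zenn?_scan_page_1.png", matching the naming seen in the screenshot above.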

How to Use

You can easily install and use it from the marketplace by following these steps:

  1. Log in to the Dify Cloud version
  2. Install pdf-to-images from the Plugin Marketplace
  3. Create a workflow by referring to the image and yml file below

    Image from GitHub

By importing the following file after creating the flow, you can easily replicate the node layout shown in the GitHub image, so please give it a try!
This goes a step beyond what was shown in the Solution section, providing a workflow where the AI can read and process the content regardless of whether it's an image, a text-embedded PDF, or a scanned PDF!
https://github.com/aToy0m0/dify-customplugin_pdf-to-images/blob/main/docs/pdf-to-images_common_en.yml
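The branching logic of such a combined workflow can be sketched in plain Python: send images straight to the vision-enabled LLM, route font-less (scanned) PDFs through image conversion first, and let everything else go to text extraction. The node names here are hypothetical labels for illustration, and the `/Font` check is only a rough heuristic:

```python
def route_pdf(data: bytes, filename: str) -> str:
    """Decide which processing path an uploaded file should take.

    Rough heuristic: image files and scanned (font-less) PDFs go to
    the pdf-to-images + vision path; PDFs with embedded fonts go to
    text extraction. Node names are hypothetical, for illustration.
    """
    name = filename.lower()
    if name.endswith((".png", ".jpg", ".jpeg", ".gif", ".webp")):
        return "llm-vision"          # already an image: feed vision directly
    if name.endswith(".pdf") and b"/Font" not in data:
        return "pdf-to-images"       # scanned PDF: rasterize pages first
    return "document-extractor"      # text-embedded PDF (or other document)
```

In the actual Dify workflow this branching is done with condition nodes rather than code, but the decision tree is the same.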

Summary

  • PDFs can be divided into "Text-embedded PDFs" and "Scanned PDFs" *My own terminology
  • Text-embedded PDFs can be easily processed with the "Document Extractor node"
  • For scanned PDFs, use the pdf-to-images plugin plus the LLM's vision feature as an OCR alternative

With this, the AI can read the content no matter what kind of PDF you upload! 🥳 🙌

Footnotes
  1. https://docs.dify.ai/guides/workflow/file-upload#file-type ↩︎

  2. https://docs.dify.ai/guides/workflow/node/doc-extractor#input-variables ↩︎
