Exploring PDF Search with ColPali and Image Captioning/Bounding Box Placement using Florence-2-large


Introduction

In this article, I explore a way to quickly search for information in PDF documents that contain many images (photos, drawings, graphs, and so on).

Normally, when searching a PDF, you would extract the text with OCR and then search over it.
However, text extraction can be difficult, for example when text appears inside photos or drawings. And on pages with many images, or where an image is the main content, text alone cannot capture the full picture.

As a solution, I will try searching by treating PDF pages as image information using a model called ColPali.

In addition, as page information for the search results, I will generate image captions (for images within the page), extract text, and place bounding boxes (frames surrounding target objects).

Specifically, I will do the following:

  1. Treat PDF pages as images and convert them into embeddings (using ColPali).
  2. Extract pages with high similarity from a search query using the embeddings (using ColPali).
  3. Generate and output page captions (using Florence-2-large).
  4. Extract and output text from the page.
  5. Place bounding boxes on the page (using Florence-2-large).
  6. Save the page with bounding boxes as an image.

Also, since I am handling English PDFs this time, I added a translation function (can be toggled ON/OFF) to handle it in Japanese.
It internally translates the search query from Japanese to English, and the captions and text from English to Japanese, for output (so you search in Japanese and get captions and text back in Japanese).
I used Borea-Phi-3.5-mini-Instruct-Jp for the Japanese translation.

Below are examples of input and output.

Input example:

  • PDF: 10 pages
  • Search query: "Bird"

Output example:

  • Search result page: Page 5
  • Image caption: Description of images on the page
  • Extracted text: Text extracted from the page
  • Saved page: The page with bounding boxes placed on bird-related images
  • Regarding captions: because the entire page is loaded as a single image, captioning works even when the page contains multiple images. The model also recognizes words and sentences on the page to some extent and incorporates them into the description.

  • Since the text extraction process within the page is not the main topic, it is performed using a simple method without pre-processing or post-processing. A more advanced implementation would be necessary when building something like a RAG system.
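The only post-processing applied to the OCR output in this article is collapsing runs of blank lines with a regex. As a standalone sketch (with a toy input string):

```python
import re

# Collapse two or more consecutive blank(ish) lines into a single blank line,
# then strip leading/trailing whitespace -- the same light cleanup applied to
# the pytesseract output in the script below.
def clean_ocr_text(text):
    return re.sub(r'\n\s*\n', '\n\n', text).strip()

raw = "Line one\n\n\n\nLine two\n   \nLine three\n"
cleaned = clean_ocr_text(raw)
# -> "Line one\n\nLine two\n\nLine three"
```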

This concludes the overview of what will be implemented.
There are many ways to process PDFs, but this time I experimented with the concept of "using local LLMs" and "using small models."

About the Models Used

ColPali

ColPali is a model based on PaliGemma-3B that produces ColBERT-style multi-vector embeddings for text and images.
https://huggingface.co/vidore/colpali
ColPali treats documents as images. There is no need to extract text via OCR beforehand.
The advantage of this method is (roughly speaking) that since it treats them as images, even if text extraction is difficult, it can capture words from visual information for searching.
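In the ColBERT style, each page and each query is represented by many vectors (one per image patch or query token), and a page is scored by summing, over the query vectors, the best dot product against any page vector ("MaxSim"). A minimal sketch of that scoring with toy 2-D vectors (plain Python; real ColPali embeddings are 128-dimensional):

```python
# ColBERT-style late-interaction ("MaxSim") scoring sketch.
# Each page and query is a list of vectors, one per image patch / query token.
# Toy 2-D vectors for illustration only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, page_vecs):
    # For each query vector, take its best-matching page vector, then sum.
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.5, 0.5]]    # matches the query well
page_b = [[-1.0, 0.0], [0.0, -1.0]]  # poor match

scores = [maxsim_score(query, p) for p in (page_a, page_b)]
# page_a: max(1.0, 0.5) + max(0.0, 0.5) = 1.5; page_b: 0.0 + 0.0 = 0.0
```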

Reference articles:
https://zenn.dev/knowledgesense/articles/08cfc3de7464cb
https://huggingface.co/blog/manu/colpali
https://arxiv.org/abs/2407.01449

Florence-2-large

Florence-2-large (0.77B) is a lightweight vision-language model released by Microsoft. It can generate captions and detect objects from images.
https://huggingface.co/microsoft/Florence-2-large
The input is an image plus a task prompt such as "<CAPTION>" or "<OD>" (object detection), optionally followed by input text (the input text is used by the "<CAPTION_TO_PHRASE_GROUNDING>" task, which locates the objects described by a caption phrase).
The output is text describing the image for caption tasks, or object coordinates for object detection tasks.
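The prompt convention and the shape of the post-processed responses can be illustrated with hardcoded values (the dictionaries below are made-up illustrations, not real model output; the structure mirrors what the processor's post-processing returns for these tasks):

```python
# Florence-2 prompt convention: the task token is prepended to any input text.
task = "<CAPTION_TO_PHRASE_GROUNDING>"
prompt = task + "Recommended forests"

# Illustrative (made-up) post-processed response shapes for the two task types:
caption_response = {"<MORE_DETAILED_CAPTION>": "A map of forests in Japan."}
grounding_response = {
    task: {
        "bboxes": [[120.0, 80.0, 340.0, 260.0]],  # one [x1, y1, x2, y2] per object
        "labels": ["Recommended forests"],         # one label per box
    }
}

# Downstream code pairs each box with its label:
pairs = list(zip(grounding_response[task]["bboxes"],
                 grounding_response[task]["labels"]))
```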

Reference article:
https://note.com/npaka/n/n5863c3bd2990

Borea-Phi-3.5-mini-Instruct-Jp

Borea-Phi-3.5-mini-Instruct-Jp is a model tuned by Axcxept based on phi-3.5-mini-Instruct (3.8B).
https://huggingface.co/AXCXEPT/Borea-Phi-3.5-mini-Instruct-Jp
It records higher scores than the base model in Japanese MT Bench (Japanese evaluation), ElyzaTasks100 (Japanese performance), and MTBench (English evaluation), and it is designed to handle general-purpose tasks.

The following three versions have been released, and this time I used "Borea-Phi-3.5-mini-Instruct-Jp."

  • Borea-Phi-3.5-mini-Instruct-Jp: A model characterized by improved versatility and Japanese performance.
  • Borea-Phi-3.5-mini-Instruct-Common: A model with generally improved capabilities.
  • Phi-3.5-mini-instruct-Borea-Coding: A model with enhanced coding capabilities and Japanese language skills.

Reference article:
https://prtimes.jp/main/html/rd/p/000000008.000129878.html

PDF to be Used

The document used for loading this time is the following English PDF issued by "Government Public Relations Online."

Source: Government Public Relations Online (https://www.gov-online.go.jp/hlj/en/may_2024/)
https://www.gov-online.go.jp/hlj/en/may_2024/

In the PDF, representative Japanese forests and "forest bathing" (Shinrin-yoku) are mainly introduced (topics like Rakugo and Wagashi are also covered).
Since it contains a good balance of photos, diagrams, and text on images, it seems suitable for testing search with ColPali and caption generation with Florence-2-large (I think it is intended for an international audience, but it was also interesting to read as it explains the charm of Japanese forests and Rakugo clearly through photos and diagrams).

Work Environment

OS: WSL2 Ubuntu 22.04
GPU: GeForce RTX 2080 SUPER (8GB)
CPU: Core i9-9900KF

Preparing Libraries

First, install poppler-utils (required by pdf2image for PDF-to-image conversion) along with tesseract-ocr and libtesseract-dev (required by pytesseract for OCR), so that text and images can be extracted from PDF files.
In my environment (WSL2, Ubuntu 22.04), I installed them using the following commands:

$ sudo apt install poppler-utils
$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev

Next, install the following libraries to be used in the code:

$ pip install pytesseract
$ pip install pdf2image
$ pip install colpali-engine==0.2.0
$ pip install torch 
$ pip install typer 
$ pip install tqdm 
$ pip install transformers 
$ pip install Pillow
$ pip install flash_attn 
$ pip install timm

Coding (Overall)

The entire code is as follows.

pdf_search.py
import os
import re
import time
import glob
import torch
from tqdm import tqdm
import pytesseract
from PIL import ( 
    Image, 
    ImageDraw, 
    ImageFont
)
from transformers import (
    AutoProcessor, 
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig
)
from torch.utils.data import DataLoader
from colpali_engine.trainer.retrieval_evaluator import CustomEvaluator
from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
from colpali_engine.utils.image_from_page_utils import load_from_pdf


# Set huggingface token
os.environ["HF_TOKEN"] = ""

# Device
device = "cuda"

# Working folder
work_path = "./"

# PDF (Place directly under the working folder)
pdf = "HIGHLIGHTING_Japan_May2024.pdf"

# Search query (Use English if "translate=False")
query = "Recommended forests"

# Whether to perform translation of English text
translate = False

# Batch size for embedding creation
batch_size = 2

# Output the top 2 search results
top_k = 2

# Color for image bounding boxes
bboxes_color = (50, 205, 50) # Lime green

# Color for image labels
label_color = "white"


def main():
    pdf_path = os.path.join(work_path, pdf)
    pdf_name = os.path.splitext(pdf)[0]
    
    image_dir_path = os.path.join(work_path, f"{pdf_name}_pages")
    
    # Create image save directory
    os.makedirs(image_dir_path, exist_ok=True)

    # Delete images in the folder
    image_path_list = glob.glob(os.path.join(image_dir_path, "*.png"))
    for file in image_path_list:
        os.remove(file)
        
    emb_dir_path = os.path.join(work_path, "embeddings")
    emb_name = f"{pdf_name}_embedding.pt"
    emb_path = os.path.join(emb_dir_path, emb_name)

    # Create embedding save directory
    os.makedirs(emb_dir_path, exist_ok=True)
    
    # Load colpali model and processor
    colpali_model, colpali_processor = load_colpali()

    # Convert PDF to images
    images = load_from_pdf(pdf_path)

    # Load embeddings if they exist, otherwise create and save them
    if os.path.isfile(emb_path):
        embedding = torch.load(emb_path, weights_only=True)
    else:
        embedding = create_embdding(colpali_model, colpali_processor, images)
        torch.save(embedding, emb_path)
        print(f"\nSaved embedding: {emb_name}\n")

    # Load Borea-Phi-3.5-mini-Instruct-Jp model and tokenizer
    borea_phi_model, borea_phi_tokenizer = load_borea_phi()
    
    # If "translate=True", translate the search query to English
    if translate:
        translate_query = translation(
            borea_phi_model, 
            borea_phi_tokenizer, 
            text=query, 
            translate_switch="ja_to_en"
        ).strip()
    else:
        translate_query = query

    # Get the start time for page search
    start_time = time.time()
    
    # Get search scores for each PDF page
    scores = page_scores(
        colpali_model, 
        colpali_processor,
        embedding, 
        translate_query
    )
    
    # Get processing time for page search
    processing_time = time.time() - start_time
    hours, remainder = divmod(processing_time, 3600)
    minutes, seconds = divmod(remainder, 60)
    
    # Output processing time
    page_seach_time = f"{hours:02.0f}h:{minutes:02.0f}m:{seconds:04.1f}s"
    
    # Remove ColPali model and processor from memory
    del colpali_model
    del colpali_processor
    torch.cuda.empty_cache()
    
    # Get top-scoring pages
    top_scores = scores.argsort()[-top_k:][::-1]
    
    # Load Florence-2-large model and processor
    florence_model, florence_processor = load_florence()
    
    page_score_list = []
    page_caption_list = []
    page_text_list = []
    
    # Perform caption generation, text extraction, and saving images with bounding boxes
    for index in top_scores:
        page_image = images[index]
        page_score = f"Page: {index+1} Score: {scores[index]}"
        
        page_score_list.append(page_score)
        
        # Caption generation
        page_caption = run_florence(
            florence_model, 
            florence_processor, 
            page_image, 
            "<MORE_DETAILED_CAPTION>"
        )
        
        page_caption_list.append(page_caption)
        
        # Extract text from the PDF page
        page_text = pytesseract.image_to_string(page_image)
        
        # Replace 2 or more blank lines with a single one
        page_text = re.sub(r'\n\s*\n', '\n\n', page_text).strip()
        
        page_text_list.append(page_text)
        
        # Generate bounding box coordinates
        bboxes_labels = run_florence(
            florence_model, 
            florence_processor, 
            page_image, 
            "<CAPTION_TO_PHRASE_GROUNDING>", 
            translate_query
        )
        
        # Place bounding boxes on the image
        image = draw_bboxes(page_image, bboxes_labels)
        
        # Save image
        img_name = f"page_{index+1}_score_{scores[index]}.png"
        image_path = os.path.join(image_dir_path, img_name)
        image.save(image_path)

    # Remove Florence-2-large model and processor from memory
    del florence_model
    del florence_processor
    torch.cuda.empty_cache()

    # Translate captions and extracted text, then output results
    for page_and_score, page_caption, page_text in zip(page_score_list, page_caption_list, page_text_list):
        # Output top score pages and scores
        print(f"\n\n{page_and_score}")
        print("===================================================================")
        
        # If "translate=True", translate the page caption to Japanese
        if translate:
            translate_page_caption = translation(
                borea_phi_model, 
                borea_phi_tokenizer, 
                text=page_caption["<MORE_DETAILED_CAPTION>"], 
                translate_switch="en_to_ja"
            ).strip()
        else:
            translate_page_caption = page_caption["<MORE_DETAILED_CAPTION>"]
            
        print("Image Caption-------------------------------------------------")
        print(translate_page_caption)
        print("-------------------------------------------------------------------")

        # If "translate=True", translate the page text to Japanese
        if translate:
            translate_page_text = translation(
                borea_phi_model, 
                borea_phi_tokenizer, 
                text=page_text, 
                translate_switch="en_to_ja"
            ).strip()
        else:
            translate_page_text = page_text
            
        print("\nExtracted Text-------------------------------------------------")
        print(translate_page_text)
        print("-------------------------------------------------------------------")
        print("===================================================================")    
    
    print(f"\nProcessing time for page search with ColPali: {page_seach_time}")    
        
# Function to load ColPali model and processor
def load_colpali():
    
    # Load model
    model = ColPali.from_pretrained(
        "google/paligemma-3b-mix-448", 
        torch_dtype=torch.bfloat16, 
        device_map=device
    ).eval()
    model.load_adapter("vidore/colpali")
    
    # Load processor 
    processor = AutoProcessor.from_pretrained("vidore/colpali")
    return model, processor


# Function to create embeddings
def create_embdding(colpali_model, colpali_processor, images):
    
    # Create data loader
    dataloader = DataLoader(
        images,
        batch_size=batch_size,
        shuffle=False,
        collate_fn=lambda x: process_images(colpali_processor, x)
    )
    
    # Create embeddings
    embedding = []
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(device) for k, v in batch_doc.items()}
            embeddings_doc = colpali_model(**batch_doc)
        embedding.extend(list(torch.unbind(embeddings_doc.to(device))))
    return embedding


# Function to get scores for all pages
def page_scores(colpali_model, colpali_processor, embedding, query):
    mock_image = Image.new("RGB", (448, 448), (255, 255, 255))
    
    # Generate query embedding
    with torch.no_grad():
        processed_query = process_queries(
            colpali_processor, 
            [query], 
            mock_image
        )
        processed_query = {k: v.to(device) for k, v in processed_query.items()}
        query_embedding = colpali_model(**processed_query)
    
    # Calculate similarity between query embedding and PDF embedding
    evaluator = CustomEvaluator(is_multi_vector=True)
    scores = evaluator.evaluate(query_embedding, embedding)[0]
    return scores


# Function to load Florence-2-large model and processor
def load_florence():
    repo_id = "microsoft/Florence-2-large"
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, 
        torch_dtype=torch.float16, 
        device_map=device,
        trust_remote_code=True
    )
    
    # Load processor 
    processor = AutoProcessor.from_pretrained(
        repo_id,
        trust_remote_code=True
    )
    return model, processor


# Function to process images based on task
def run_florence(florence_model, florence_processor, image, task_prompt, query=None):
    if query is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + query
    
    # Convert image and text to tensors
    inputs = florence_processor(
        text=prompt, 
        images=image, 
        return_tensors="pt"
    ).to(device, torch.float16)
    
    # Generate output from input tensors
    generated_ids = florence_model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=2048,
        num_beams=3,
        do_sample=False
    )
    
    # Decode output tensors to text
    generated_text = florence_processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Get results from generated text
    response = florence_processor.post_process_generation(
        generated_text, 
        task=task_prompt, 
        image_size=(image.width, image.height)
    )
    return response


# Function to place bounding boxes
def draw_bboxes(image, bboxes_labels):
    draw = ImageDraw.Draw(image)
    
    # Get bounding box positions
    bboxes = bboxes_labels['<CAPTION_TO_PHRASE_GROUNDING>']['bboxes']
    
    # Get label text
    labels = bboxes_labels['<CAPTION_TO_PHRASE_GROUNDING>']['labels']
    
    # Draw bounding boxes and labels on the image
    for bbox, label in zip(bboxes, labels):
        
        # Draw bounding box
        draw.rectangle(bbox, outline=bboxes_color, width=10)
        
        # Set label font size
        font = ImageFont.load_default().font_variant(size=50)
        
        # Set label position
        label_position  = (bbox[0]+10, bbox[1]-50)
        
        # Get label size
        label_bbox = font.getbbox(label)
        label_width = label_bbox[2] - label_bbox[0]
        label_height = label_bbox[3] - label_bbox[1]
        
        # Set label background rectangle
        background_bbox = [
            label_position[0]-10, 
            label_position[1]+label_bbox[1]-10,
            label_position[0]+label_width+10, 
            label_position[1]+label_height+label_bbox[1]
        ]
        
        # Draw label background
        draw.rectangle(background_bbox, fill=bboxes_color)  
  
        # Draw label
        draw.text(
            label_position, 
            text=label, 
            font=font, 
            fill=label_color
        )
    return image
    
    
# Function to load Borea-Phi-3.5-mini-Instruct-Jp model and tokenizer
def load_borea_phi():
    repo_id = "HODACHI/Borea-Phi-3.5-mini-Instruct-Jp"
    
    # Set quantization Config
    quantization_config = BitsAndBytesConfig(
        load_in_8bit=True
    )

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=repo_id,
        torch_dtype=torch.float16, 
        quantization_config=quantization_config,
        device_map=device
    )

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=repo_id
    )
    return model, tokenizer


# Function to perform translation
def translation(model, tokenizer, text, translate_switch=""):
    
    # Determine whether to translate to English or Japanese based on translate_switch
    if translate_switch == "en_to_ja":
        content = """あなたは英語の文章を**的確な**日本語に翻訳する優秀な翻訳家です。
        **翻訳した文章のみ**を出力してください。
        翻訳する文章は以下です。
        
        {text}
        """
    elif translate_switch == "ja_to_en":
        content = """You are an excellent translator who translates Japanese texts into **accurate** English.
        **Only output the translated text**. 
        The text to be translated is as follows.
        
        {text}
        """
    else:
        return text
    
    messages = [
        {"role": "user", "content": content.format(text=text)}
    ]
    
    # Apply the tokenizer's chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize the prompt and convert to tensor
    inputs = tokenizer([prompt], return_tensors="pt").to(device)
    
    # Generate output from input tensor
    generated_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=4096
    )
    
    # Extract the generated answer
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
    ]

    # Convert token IDs to string
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
    
    
if __name__ == "__main__":
    
    # Get start time
    start_time = time.time()
    
    main()
    
    # Get processing time
    processing_time = time.time() - start_time
    hours, remainder = divmod(processing_time, 3600)
    minutes, seconds = divmod(remainder, 60)
    
    # Output processing time
    print(f"Total processing time (including model loading time): {hours:02.0f}h:{minutes:02.0f}m:{seconds:04.1f}s")

Main parameters (such as the path to the working folder and the search query) are set as global variables. If you place the PDF in the working folder and change the parameters accordingly, it should work (hopefully...). If you want to search in Japanese and output results in Japanese, set translate = True.

Embeddings are created from the PDF only for the first time (an embeddings folder is created and they are placed there; subsequent executions will use the existing embeddings for searching).

Let's walk through the key functions (everything other than the main function) in the code above.

Coding (Checking Function Parts)

Function to Load ColPali Model and Processor (load_colpali)

pdf_search.py
# Function to load ColPali model and processor
def load_colpali():
    
    # Load model
    model = ColPali.from_pretrained(
        "google/paligemma-3b-mix-448", 
        torch_dtype=torch.bfloat16, 
        device_map=device
    ).eval()
    model.load_adapter("vidore/colpali")
    
    # Load processor 
    processor = AutoProcessor.from_pretrained("vidore/colpali")
    return model, processor

This function loads the ColPali model and processor.
By loading "paligemma-3b-mix-448" as the base model and setting up the adapter, ColPali is prepared.

Function to Create Embeddings (create_embdding)

pdf_search.py
# Function to create embeddings
def create_embdding(colpali_model, colpali_processor, images):
    
    # Create data loader
    dataloader = DataLoader(
        images,
        batch_size=batch_size,
        shuffle=False,
        collate_fn=lambda x: process_images(colpali_processor, x)
    )
    
    # Create embeddings
    embedding = []
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(device) for k, v in batch_doc.items()}
            embeddings_doc = colpali_model(**batch_doc)
        embedding.extend(list(torch.unbind(embeddings_doc.to(device))))
    return embedding

This function is used when creating new PDF embeddings during the initial run.
It takes the ColPali model (colpali_model), processor (colpali_processor), and a list of PDF pages loaded as images (images) as arguments.
When creating embeddings, the page images are processed in batches of batch_size (set in the global variables) and converted into a list of embeddings, one multi-vector embedding per page.
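The DataLoader's role here is simply to feed the pages to the processor in fixed-size batches, in order. The batching itself can be sketched without torch:

```python
# Split a list of pages into fixed-size batches, preserving order -- the
# behavior DataLoader(images, batch_size=2, shuffle=False) provides in
# this script (the processor's collate_fn then tensorizes each batch).
def batches(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

pages = ["page1", "page2", "page3", "page4", "page5"]
batched = list(batches(pages, 2))
# -> [['page1', 'page2'], ['page3', 'page4'], ['page5']]
```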

Function to Get Scores for All Pages (page_scores)

pdf_search.py
# Function to get scores for all pages
def page_scores(colpali_model, colpali_processor, embedding, query):
    mock_image = Image.new("RGB", (448, 448), (255, 255, 255))
    
    # Generate query embedding
    with torch.no_grad():
        processed_query = process_queries(
            colpali_processor, 
            [query], 
            mock_image
        )
        processed_query = {k: v.to(device) for k, v in processed_query.items()}
        query_embedding = colpali_model(**processed_query)
    
    # Calculate similarity between query embedding and PDF embedding
    evaluator = CustomEvaluator(is_multi_vector=True)
    scores = evaluator.evaluate(query_embedding, embedding)[0]
    return scores

This function takes the ColPali model (colpali_model), processor (colpali_processor), PDF embeddings (embedding), and search query (query) as arguments.
After converting the search query into an embedding, it returns the similarity with the PDF embeddings as a list of scores. The list is in the order of the PDF pages.
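The main function then reduces this score list to page indices with numpy-style argsort (`scores.argsort()[-top_k:][::-1]`). The same top-k selection in plain Python, with toy scores:

```python
# Pick the indices of the top_k highest scores, best first --
# equivalent to numpy's scores.argsort()[-top_k:][::-1].
def top_k_indices(scores, top_k):
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[-top_k:][::-1]

scores = [3.2, 9.5, 1.1, 7.8]  # one similarity score per PDF page
top = top_k_indices(scores, 2)
# -> [1, 3]  (pages 2 and 4, since page numbers are index + 1)
```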

Function to Load Florence-2-large Model and Processor (load_florence)

pdf_search.py
# Function to load Florence-2-large model and processor
def load_florence():
    repo_id = "microsoft/Florence-2-large"
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, 
        torch_dtype=torch.float16, 
        device_map=device,
        trust_remote_code=True
    )
    
    # Load processor 
    processor = AutoProcessor.from_pretrained(
        repo_id,
        trust_remote_code=True
    )
    return model, processor

This function loads the Florence-2-large model and processor.

Function to Process Images Based on Tasks (run_florence)

pdf_search.py
# Function to process images based on tasks
def run_florence(florence_model, florence_processor, image, task_prompt, query=None):
    if query is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + query
    
    # Convert image and text to tensors
    inputs = florence_processor(
        text=prompt, 
        images=image, 
        return_tensors="pt"
    ).to(device, torch.float16)
    
    # Generate output from input tensors
    generated_ids = florence_model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=2048,
        num_beams=3,
        do_sample=False
    )
    
    # Decode output tensors to text
    generated_text = florence_processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Get results from generated text
    response = florence_processor.post_process_generation(
        generated_text, 
        task=task_prompt, 
        image_size=(image.width, image.height)
    )
    return response

This function takes the Florence-2-large model (florence_model), processor (florence_processor), image (image), task (task_prompt), and search query (query) as arguments (the query is used only for the <CAPTION_TO_PHRASE_GROUNDING> task).
The two tasks (task_prompt) and their responses (response) used this time are:

  • task_prompt: "<MORE_DETAILED_CAPTION>" (Caption generation)
    • response: Image caption (description)
  • task_prompt: "<CAPTION_TO_PHRASE_GROUNDING>" (Retrieving object coordinates and labels)
    • response: Coordinates and labels for the objects specified by the search query (query)

Function to Place Bounding Boxes (draw_bboxes)

pdf_search.py
# Function to place bounding boxes
def draw_bboxes(image, bboxes_labels):
    draw = ImageDraw.Draw(image)
    
    # Get bounding box positions
    bboxes = bboxes_labels['<CAPTION_TO_PHRASE_GROUNDING>']['bboxes']
    
    # Get label text
    labels = bboxes_labels['<CAPTION_TO_PHRASE_GROUNDING>']['labels']
    
    # Draw bounding boxes and labels on the image
    for bbox, label in zip(bboxes, labels):
        
        # Draw bounding box
        draw.rectangle(bbox, outline=bboxes_color, width=10)
        
        # Set label font size
        font = ImageFont.load_default().font_variant(size=50)
        
        # Set label position
        label_position  = (bbox[0]+10, bbox[1]-50)
        
        # Get label size
        label_bbox = font.getbbox(label)
        label_width = label_bbox[2] - label_bbox[0]
        label_height = label_bbox[3] - label_bbox[1]
        
        # Set label background rectangle
        background_bbox = [
            label_position[0]-10, 
            label_position[1]+label_bbox[1]-10,
            label_position[0]+label_width+10, 
            label_position[1]+label_height+label_bbox[1]
        ]
        
        # Draw label background
        draw.rectangle(background_bbox, fill=bboxes_color)  
  
        # Draw label
        draw.text(
            label_position, 
            text=label, 
            font=font, 
            fill=label_color
        )
    return image

This function takes an image (image) and object coordinates and labels (bboxes_labels) as arguments.
bboxes_labels contains the object coordinates and labels obtained using the run_florence function (specifically for the <CAPTION_TO_PHRASE_GROUNDING> task).
This function returns the image with labeled bounding boxes placed on it.

Function to Load Borea-Phi-3.5-mini-Instruct-Jp Model and Tokenizer (load_borea_phi)

pdf_search.py
# Function to load Borea-Phi-3.5-mini-Instruct-Jp model and tokenizer
def load_borea_phi():
    repo_id = "HODACHI/Borea-Phi-3.5-mini-Instruct-Jp"
    
    # Set quantization Config
    quantization_config = BitsAndBytesConfig(
        load_in_8bit=True
    )

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=repo_id,
        torch_dtype=torch.float16, 
        quantization_config=quantization_config,
        device_map=device
    )

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=repo_id
    )
    return model, tokenizer

This function loads the Borea-Phi-3.5-mini-Instruct-Jp model and tokenizer.

Function to Perform Translation (translation)

pdf_search.py
# Function to perform translation
def translation(model, tokenizer, text, translate_switch=""):
    
    # Determine whether to translate to English or Japanese based on translate_switch
    if translate_switch == "en_to_ja":
        content = """あなたは英語の文章を**的確な**日本語に翻訳する優秀な翻訳家です。
        **翻訳した文章のみ**を出力してください。
        翻訳する文章は以下です。
        
        {text}
        """
    elif translate_switch == "ja_to_en":
        content = """You are an excellent translator who translates Japanese texts into **accurate** English.
        **Only output the translated text**. 
        The text to be translated is as follows.
        
        {text}
        """
    else:
        return text
    
    messages = [
        {"role": "user", "content": content.format(text=text)}
    ]
    
    # Apply the tokenizer's chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize the prompt and convert to tensor
    inputs = tokenizer([prompt], return_tensors="pt").to(device)
    
    # Generate output from input tensor
    generated_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=4096
    )
    
    # Extract the generated answer
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
    ]

    # Convert token IDs to string
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

This function is used when translate = True is specified in the global variables.
It takes the Borea-Phi-3.5-mini-Instruct-Jp model (model) and tokenizer (tokenizer), the text to be translated (text), and the translation mode setting (English-to-Japanese or Japanese-to-English) via translate_switch as arguments.
The translate_switch can be set to "en_to_ja" (English → Japanese) or "ja_to_en" (Japanese → English) to toggle the translation direction.

Execution Results 1

No Translation

First, let's try running it without translation (translate = False).
The search query (query) was set to "Recommended forests", and the number of pages to search (top_k) was set to the top 2 pages with the highest scores.

The output for pages and scores, image captions, and extracted text is as follows.

Output (English)
Page: 5 Score: 9.5625
===================================================================
Image Caption-------------------------------------------------
The image is a map of Japan, showing the locations of the most beautiful forests in the country. The map is in an orange color and is divided into different sections, each representing a different type of forest. 

The top left section of the map has the letter "J" in the center, which is likely the name of the forest. Below that, there is a list of the names of the forests, including "Kikuchi Gorge", "Bijinbayashi", "Honshu", "Akashima Natural Forest", and "Kumano Kodo". The map also includes a legend that explains the different colors used in the map.

On the right side of the image, there are several smaller circles that represent the different types of forests in Japan. These circles are labeled with their names and are arranged in a grid-like pattern. The background of the infographic is white, and the text is black.
-------------------------------------------------------------------

Extracted Text-------------------------------------------------
JAPAN’S HEALING FORESTS <partT1>

apan is one of the most forested coun-
tries in the world, with approximately 70
percent of its land area covered by forest.
BUCO ee bey meu Caee DCO MODAL ou tos (OKIE
LUE MCMC OTOL PM APPR eMC ur CIT LOMO
time. In Japan, the practice of relaxing in such
a forest, away from the busy daily life, is called

“shinrin-yoku” (forest bathing)*, and overseas it AKAN-MASHU
is also known as “shinrin-yoku” as the Japanese Velie
term implies. Actually, forest bathing has scien- Eastern Hokkaido

tifically proven the relaxing effects. This issue
of Highlighting Japan introduces readers to forest
bathing based on scientific knowledge and some
of Japan’s most famous forests, including the
Akasawa Natural Recreation Forest in Nagano
Prefecture, the birthplace of forest bathing, the Hokkaido
Shirakami Sanchi Mountain Range, and the
ancient pilgrimage routes of Kumano Kodo, as
well as various initiatives that utilize forests.

* The Japanese term for “shinrin-yoku” is for-

est bathing. Shinrin means ‘forest’ and Yoku =
means ‘bath.’ Shinrin-Yoku literally means Ss La l RAKAM |
forest bathing, or ‘taking in the forest atmo- SAN Cc H |

PNT n as gy
PT Mag

sphere’ for therapeutic results. 12) iB | N BAYAS H |
Me Tar on Rel 5 2
Nile Maat ie)

$

KIKUCHI GORGE
Parse A AKASAWA NATURAL
- Kumamoto Prefecture Pad RECREATION
y FOREST
NEL nL

Nagano Prefecture

y

KUMANO KODO

Nara Prefecture
We clccrotg
MEIN aCe

SHIKOKU KARST TENGU
HIGHLAND NATURAL
RECREATIONAL FOREST

Tsuno Town, Kochi Prefecture

Vol.192 HIGHLIGHTING JAPAN 5
-------------------------------------------------------------------
===================================================================


Page: 11 Score: 9.5
===================================================================
Image Caption-------------------------------------------------
The image is a page from a book titled "Japan's Healing Forests". It is divided into two sections. 

The top section is a photograph of a forest with tall trees and a clear blue sky. The trees are tall and green, and the sky is a bright blue. The forest appears to be in a wooded area, as there are no other trees visible in the image.

In the bottom section, there is a photo of a black bird perched on a tree trunk. The bird is facing towards the right side of the page, and its beak is open as if it is about to take flight. The tree trunk is covered in white flowers, and there are a few small white flowers scattered around it. The background is blurred, but it seems to be a forested area with more trees and shrubs. The text on the page is written in black font and is in a smaller font size than the rest of the text.
-------------------------------------------------------------------

Extracted Text-------------------------------------------------
You can enjoy forest bathing in the Japanese Spruce (Sakhalin spruce)*
forest that spreads out from right behind the Kawayu Visitor Center.

the other lakes and marshes in the area. The other is
the ‘Mashu area,’ where you get a feel of the moisture
and circulation of the water, with Lake Kussharo,
which fills the western half of Japan’s largest caldera—
Kussharo Caldera—and Lake Mashu, one of the clear-
est lakes in the world.”

Suehiro told us how to enjoy
the forests in the Mashu and Akan
areas.

“The Tsutsujigahara Nature
Trail from the Kawayu Visitor
Center in the Kawayu Onsen hot
spring area of Mashu to Atusanu-
puri (also known as Mt. Io) is a flat
trail that can be traversed by foot
in about one hour. It is a precious
place with a unique ecosystem of

Photo: Akan-Mashu National Park

JAPAN’S HEALING FORESTS <PartT1>

Iso-azalea flowers extend through the forest along the Tsutsujigahara
Nature Trail

production.

However, Maeda Masana, the initial head of the
park, believed that “this mountain should be changed
from one to be logged into one to be viewed,” and
based on his principle of not opposing the power of
nature but maximizing it, long-
term efforts have been made to
keep the forests in their original
state and pass them down to the
future generations. You can enter
the Hikari no Mori (Mystical For-
est) and Kohoku no Mori (North
of the Lake Forest) with a qualified
Ippoen Forest Guide. The many
interesting spots include a giant
katsura (Cercidiphyllum japonicum)
tree said to be 800 years old and

Photo: PIXTA

The black woodpecker, which also inhabits the area,
is designated as a natural treasure.

alpine plants that can withstand
the volcanic gases and acidic soil
discharged from Mt. Io. You can stroll through the
forests and observe the plants living in this special
environment-—the trail starts with a Coniferous Forest
Zone, and also has a Broadleaf Forest Zone and an Iso-
azalea’ Zone, all within a short distance of about 2.5
kilometers. The sight of 100 hectares of pure white iso-
azaleas in the forest in early summer (around June) is
especially stunning.”

In the Akan area, we recommend a guided tour
that explores the forests by the Maeda Ippoen Founda-
tion*, which has been working for over 90 years to pre-
serve sustainable forests. According to Hibino Akihiro
of the Lake Akan National Park Ranger Station, “In the
early 1900s, a great number of trees in the Akan area
forests administered by the Maeda Ippoen Foundation
were cut down to clear space for ranching and timber

Tezukanuma Swamp with its hot
springs, so we recommend that
you have the experience of forest bathing in a prime-
val forest preserved by the people of Akan.”

The Akan-Mashu National Park is currently part of
the Project to Fully Enjoy National Parks.

Suehiro said, “In order to make the park more
accessible to overseas visitors, we are working on a
variety of projects, including the renovation of over-
night accommodations and observatory facilities. We
hope that many people will visit the unique, majestic
forests of Akan-Mashu.”

A depression formed by volcanic activity.

2. Asmall evergreen shrub native to Hokkaido that is 30-70 cm tall. It is sometimes seen in gravelly
alpine areas, and also grows in some volcanic ash areas and marshlands.

3. Found in Hokkaido and on Mount Hayachine in Honshu. It is designated as a Tree of Hokkaido
along with the Ezo spruce.

4. A foundation set up to carry on the wishes of Maeda Masana, who developed the expansive

mountain forests along the lakeside of Lake Akan in Hokkaido in early 1900, and to contribute to

the preservation of its natural environment and appropriate use.

Vol.192 HIGHLIGHTING JAPAN

Photo: Akan-Mashu National Park

ii
-------------------------------------------------------------------
===================================================================

Processing time for page search with ColPali: 00h:00m:00.7s
Total processing time (including model loading time): 00h:01m:52.9s

Checking the image caption portion of the output (for page 5), the names of forest areas can be confirmed in the following section:

The top left section of the map has the letter "J" in the center, which is likely the name of the forest. Below that, there is a list of the names of the forests, including "Kikuchi Gorge", "Bijinbayashi", "Honshu", "Akashima Natural Forest", and "Kumano Kodo". The map also includes a legend that explains the different colors used in the map.

While the names of the forest areas can also be confirmed from the extracted text, the text is full of noise and missing characters, likely because the page layout made extraction difficult.
For pages where the image is the primary subject, the caption gives a much quicker grasp of the overview.

However, since captions are mostly descriptions of the images themselves, it is necessary to check the content from the extracted text for pages where text is the primary element (on page 11, which has the next highest score, the page overview cannot be grasped from the caption alone). When using this as a context for RAG, it seems best to include both the image caption and the extracted text simultaneously.
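If you do pass both into a RAG pipeline, a simple per-page concatenation is enough. A minimal sketch (the layout is my own choice, not from the original code):

```python
def build_rag_context(page_no: int, score: float, caption: str, text: str) -> str:
    # Combine the image caption and the extracted text into one context
    # chunk per page, so image-heavy and text-heavy pages are both covered.
    return (
        f"[Page {page_no} | score {score}]\n"
        f"Image caption:\n{caption}\n\n"
        f"Extracted text:\n{text}"
    )
```

The chunks for the top-k pages can then be joined and handed to the generator as context.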

At the end of the output, the processing time is returned as follows:

Processing time for page search with ColPali: 00h:00m:00.7s
Total processing time (including model loading time): 00h:01m:52.9s

The above is the execution time after the embeddings have been created (from the second run onwards). The first run takes additional time (about 1 minute) to create the embeddings.
The page search (32 pages) using ColPali completed in 0.7 seconds, which is very fast.
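This speed comes from ColPali's late-interaction ("MaxSim") scoring: each query-token embedding is compared against every page-patch embedding, the maximum per query token is kept, and the maxima are summed. A dependency-free sketch with toy vectors (real embeddings come from the ColPali model):

```python
def maxsim_score(query_emb, page_emb):
    # ColPali-style late interaction: for each query-token vector, take its
    # maximum dot product over all page-patch vectors, then sum the maxima.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, p) for p in page_emb) for q in query_emb)

# Rank pages by score, highest first (toy 2-D vectors for illustration).
pages = {5: [[0.9, 0.1], [0.2, 0.8]], 11: [[0.5, 0.5]]}
query = [[1.0, 0.0], [0.0, 1.0]]
ranking = sorted(pages, key=lambda pg: maxsim_score(query, pages[pg]), reverse=True)
```

Because scoring is just dot products over precomputed page embeddings, only the query needs to be embedded at search time.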
Regarding the total processing time, considering that caption generation and bounding box generation are each performed twice, along with model loading and other processes, it doesn't seem that slow.
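The timing strings above come down to a simple elapsed-seconds formatter; the article's actual helper is not shown, so this is a sketch:

```python
def format_elapsed(seconds: float) -> str:
    # Render an elapsed-seconds value in the "00h:00m:00.7s" style used above.
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}h:{int(m):02d}m:{s:04.1f}s"
```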

The following are the search result pages (images) created during code execution (saved in the "<PDF_name>_pages" folder within the working directory).

  • Page 5, Score 9.5625
  • Page 11, Score 9.5
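A small helper can produce the output path per page; a sketch of the naming scheme described above (the exact file name pattern is an assumption):

```python
from pathlib import Path

def page_image_path(pdf_path: str, page_no: int) -> Path:
    # Build "<PDF_name>_pages/page_<n>.png" under the working directory,
    # creating the folder if needed. The file name pattern is an assumption.
    out_dir = Path(f"{Path(pdf_path).stem}_pages")
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"page_{page_no}.png"
```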

Checking the image for page 5, which had the highest score, we can see a map of Japan along with a map representing representative forests. This seems to have worked well as a search result for "Recommended forests" specified in the search query.
A bounding box with the label "forests" has been placed on the image.
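For reference, Florence-2's grounding outputs are produced from location tokens quantized on a per-axis grid of 1,000 bins; processor.post_process_generation normally converts them back to pixel coordinates for you, but the mapping itself is simple. A sketch (treat the bin count as an assumption to verify against your model version):

```python
def loc_to_pixels(box, image_width, image_height):
    # Map an (x1, y1, x2, y2) box on Florence-2's quantized 1,000-bin grid
    # back to pixel coordinates. Normally the processor's
    # post_process_generation does this conversion for you.
    x1, y1, x2, y2 = box
    sx, sy = image_width / 1000.0, image_height / 1000.0
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
```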

With Translation

Next, I will try running it with translation (translate = True). I set the search query (query) to "おすすめの森" (Recommended Forests) and the number of search result pages to output (top_k) to the top 2 pages with the highest scores.

The output for pages and scores, image captions, and extracted text is as follows (it is collapsed because it is long).

Output (Japanese)
Page: 5 Score: 9.5625
===================================================================
Image Caption-------------------------------------------------
地図は日本の美しい森林の位置を示しており、オレンジの色で分割されています。各部分は異なるタイプの森林を表しています。 

左上の地図の頂点には「J」の文字が中央に配置されており、これが森林の名前の可能性があります。その下には、「キクチガード」、「ビジンバヤシ」、「ホンスホ」、「アカシマ自然林」、「クマノコド」という森林の名前のリストがあります。地図には、使用されている色の説明を含む階層もあります。

地図の右側には、日本の森林の異なるタイプを表す小さな円が配置されています。これらの円は名前がラベル付けされており、グリッド状の配置になっています。インフォグラフィックの背景は白色で、テキストは黒色です。
-------------------------------------------------------------------

Extracted Text-------------------------------------------------
日本の治癒の森は世界で最も森林豊かな国の一つで、約70%の陸地が森に覆われています。

日本では、このような森でリラックスすることを「森林浴(しんrin-よく)」と呼び、海外でも「森林浴」として知られています。実際に、森林浴は科学的にも癒し効果が証明されています。この号「高輝く日本」では、科学的な知識に基づいて、日本の有名な森林、例えば、長野県のアカサワ自然温泉公園、北海道の白神山脈、古のカムイオードの旅のルート、そして森を活用したさまざまな取り組みなどを紹介しています。

* 「森林浴」という日本語の用語は、「森」を意味する「しんrin」と「入る」を意味する「よく」を組み合わせたもので、「森を浴びる」という意味を持ちます。これは、森の空間を満喫して癒しを得ることを指しています。

(注:文章には不正確な文字が含まれているため、正しい日本語の翻訳には修正が加えられています。)
-------------------------------------------------------------------
===================================================================


Page: 11 Score: 9.5
===================================================================
Image Caption-------------------------------------------------
画像は「日本の癒しの森」と題された本の一コマです。上部は森林の写真で、高い木々と明るい青空が映し出されています。木は緑で高く、空は明るい青です。画像は木が主体で、他の木は見えません。

下部には、右側に向かって立つ黒い鳥の写真があります。鳥の嘴が開いて飛び立つ準備を示しています。樹冠は白い花で覆われ、少数の小さな白い花も周囲に点在しています。背景はぼやけていますが、森林と低木が見えるようです。ページの文字は黒で小さなフォントで書かれています。
-------------------------------------------------------------------

Extracted Text-------------------------------------------------
森林浴を楽しむことができる日本の樫栗杉(さかくりすぎ)*森が右側の堅平観光センターから広がる
他の湖と湿地帯にもあります。他のものは『マシュー地域』と呼ばれ、水の湿気と流れを感じることができる、カルデラ湖(カストラロデラカルデラ)とマシュー湖の一つである

スエヒロには、マシューとアカン地域の森林を楽しむ方法を教えてもらいました。

マシューから始まる『ツツジガハラ自然トレイル』は、カンゾウ温泉の熱泥地域のKawayu観光センターから約1時間の歩ける平坦な道で、珍しい生態系を持つ貴重な場所です。

写真:アカンマシュー国立公園

日本の癒しの森林 <部1>

イソアザレアの花が森に広がる

しかし、前身のマエダ・マサナは、「この山は伐採されるべきではなく、見るべきものに変えるべきだ」と考え、自然の力を尊重しつつ最大限に活用する原則に基づき、長期的な努力で、森林を元の状態に保ち、将来の世代に伝えることに努めてきました。

入場可能な『ハイキリノモリ(神秘の森)』と『コホクノモリ(湖の森)』を兼ね差し出す資格のあるイポエン森林ガイドによるガイド付きの森林散策をお勧めします。多彩な興味深いポイントには、800年以上の古木として知られる大きなカツヤ(カツヤ)の木も含まれます。

黒漆鳥、マエダ・イポエン基金の保護された自然宝石としても知られています。

火山ガスと酸性土壌を生き延ばすアルプス植物も見られる、約2.5キロメートルの短距離の森林を歩くことができる。トレイルは、淡水林区、広葉林区、イソアザレア区を含む短距離内に広がり、早春(約6月)には100ヘクタールの純白のイソアザレアの森が幻想的な美しさを放つ

アカン地域では、マエダ・イポエン基金*によって管理されている森林を探索するガイド付きツアーをお勧めします。

1900年代初頭に、アカン地域の森林に大量に木を伐採して牧場と木材のために開けたことから、熱泥湖の森林浴体験を提案しています。

アカンマシュー国立公園は、全国の国立公園を楽しむプロジェクトに参加しています。

スエヒロは、海外の訪問者に公園をよりアクセスしやすくするための多様なプロジェクトに取り組んでおり、独特で壮大なアカンマシューの森林を楽しむことを願っています。

噴火による溝。

2. 小さな冷涼な広葉樹、北海道原産のもので、30-70cmの高さ。石灰岩のアルプス地域や火山灰地帯に見られ、一部の火山灰地帯にも生えています。

3. 北海道と本州の山のマエダ-スプリングスと共に、北海道の木として指定されています。

4. マエダ・イポエン基金*を設立し、マエダ・マサナの夢で広がる山の森林を保護し、適切な利用を支援するための基金です。

Vol.192 HIGHLIGHTING JAPAN

写真:アカンマシュー国立公園

ii
-------------------------------------------------------------------
===================================================================

Processing time for page search with ColPali: 00h:00m:00.3s
Total processing time (including model loading time): 00h:08m:16.5s

Although some proper nouns in Katakana are a bit difficult to understand, it is helpful to be able to understand the content at a glance in Japanese.

The processing time is as follows:

Processing time for page search with ColPali: 00h:00m:00.3s
Total processing time (including model loading time): 00h:08m:16.5s

The page search (32 pages) using ColPali completed in 0.3 seconds. The total processing time increased several-fold (00h:01m:52.9s → 00h:08m:16.5s) because the translation process was added. Processing time depends heavily on the environment, but if the time is an issue, you could try 4-bit quantization or change the translation model to a smaller one (like EZO-Common-T2-2B-gemma-2-it or Qwen2-1.5B-Instruct).
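As one example of the quantization route, loading the translation model in 4-bit with bitsandbytes could look roughly like this (an untested sketch; it assumes the bitsandbytes package and a CUDA GPU are available):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "AXCXEPT/Borea-Phi-3.5-mini-Instruct-Jp"

# 4-bit NF4 quantization settings (bitsandbytes).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Swapping in a smaller model is just a matter of changing model_id; the rest of the translation function stays the same.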

(The saved page images are omitted as they are the same results as before (no translation).)

Next, I will try changing the search query so that different pages appear in the search results.

Execution Results 2

No Translation

I set the search query (query) to "What is "Wagashi"?" and the number of search result pages to output (top_k) to the top 1 page with the highest score.

The output for the page and score, image caption, and extracted text is as follows (it is collapsed because it is long).

Output (English)
Page: 28 Score: 13.1875
===================================================================
Image Caption-------------------------------------------------
The image is an advertisement for a Japanese dish called "Art Inspired by the Seasons: Wagashi". The dish is displayed on a white plate with a pair of chopsticks resting on top of it. The dish appears to be a type of Japanese dumpling, with a white base and a dark brown filling on top. The filling is made up of small pieces of meat and vegetables, and is drizzled with a dark sauce. The plate is garnished with a sprig of green leaves. The background is a light pink color, and there is text on the right side of the image that reads "Series: Discovering Japan through the eyes of Japanese food".
-------------------------------------------------------------------

Extracted Text-------------------------------------------------
28

n spring, there are many light-

colored sweets that evoke the

image of budding plants. Sakura

mochi, a typical example of
spring wagashi, comes in two styles:
Kansai style and Kanto style.? Kan-
sai-style sakura mochi are light pink
mochi glutinous rice cakes filled with
azuki bean paste (with some beans
left whole) and wrapped in pickled
sakura cherry leaves. In the Kanto
region, the azuki bean paste is rolled
in a crepe-like dough, which is then
wrapped in a pickled sakura cherry

HIGHLIGHTING JAPAN MAY 2024

» -.Minazuki, a type of wagashi that embodies wishes for
good health and happiness

leaf. Sakura mochi is so popular, it can
even be found in convenience stores
and supermarkets outside of the
spring season.

uring the hot and humid
Japanese summer, cool-
looking smooth sweets
made with agar or kuzu
are especially popular. Minazuki is a
wagashi made with a jelly-like base
made to resemble a small piece of
ice and topped with azuki beans. In
Japan, the color red is traditionally



believed to keep evil spirits away, so
this wagashi, with the hope that the
red beans on top will do just that,
symbolizes wishes for good health
and happiness.

s the season of chest-
nuts and sweet potatoes,
autumn is known for
wagashi such as kuri-kinton
(candied chestnuts and sweet pota-
toes) and imo-yokan (sweet potato
jelly). Besides, there is also a genre
of Japanese confectionery called



Photo: PITA
-------------------------------------------------------------------
===================================================================

Processing time for page search with ColPali: 00h:00m:00.6s
Total processing time (including model loading time): 00h:01m:37.1s

In the caption description, there are some slightly suspicious parts, such as Wagashi being explained as a dish made of meat and vegetables, but the search successfully retrieved the Wagashi page without any problems.

The processing time is as follows:

Processing time for page search with ColPali: 00h:00m:00.6s
Total processing time (including model loading time): 00h:01m:37.1s

The following is the search result page (image) created during code execution.

  • Page 28, Score 13.1875

The bounding box label reads "wagashi", and its coordinates are captured accurately.

With Translation

I set the search query (query) to "「和菓子」とは何ですか?" (What is "Wagashi"?) and the number of search result pages to output (top_k) to the top 1 page with the highest score.

The output for the page and score, image caption, and extracted text is as follows (it is collapsed because it is long).

Output (Japanese)
Page: 28 Score: 12.9375
===================================================================
Image Caption-------------------------------------------------
画像は日本料理「四季にインスピレーションを受けたアート:わがし - 日本食を見る目」の広告です。料理は白い皿に置かれ、上には箸が置かれています。料理は日本の皮箱蒸しのようなもので、白い基盤と上に黒褐色の具材で埋められたダークトップがあります。具材は小さな肉と野菜で構成され、暗いドレッシングで漬けられています。皿は緑の葉で飾られており、背景は薄いピンク色です。画像右側には「日本を食べる目を発見」という文字が記されています。
-------------------------------------------------------------------

Extracted Text-------------------------------------------------
春には、桜のような色の甘いスナックが多く、桜の花びらを思わせるものが多い。関西風の桜餅は、桜の葉に包まれた小麦粉のあずき餡とともに、一部のあずきが残っているもので、辛子桜の葉で包まれている。東京地方では、あずき餡がクレープのような皮で巻かれ、それを辛子桜の葉で包んでいる。

2024年5月の高評価

- ミナズキ、健康と幸福を象徴する願望のある団子

桜餅は夏の熱気と湿度の厳しい日本でも人気があり、コンビニやスーパーでも見かけることができる。

夏の暑い日本の季節には、アガやクズを使った見た目の涼しげなスナックが人気を集める。ミナズキは、アガやクズで作られたジェル状のベースを持ち、あずきをのせ、伝統的には悪霊を遠ざけるために赤いあずきが象徴する健康と幸福の願いを表している。

秋には、栗きんとん(焼き栗と甘ねぎ)やいもよかん(甘ねぎジェル)などの団子が有名で、さらには日本の甘味料である団子のジャンルがある。

(注: 翻訳は文章の内容に基づいており、日本語の流暢さと文法に焦点を当てています。)
-------------------------------------------------------------------
===================================================================

Processing time for page search with ColPali: 00h:00m:00.2s
Total processing time (including model loading time): 00h:04m:03.4s

The processing time is as follows:

Processing time for page search with ColPali: 00h:00m:00.2s
Total processing time (including model loading time): 00h:04m:03.4s

(The saved page image is omitted as it is the same result as before (no translation).)

Execution Results 3

No Translation

I set the search query (query) to "Map of the "Kumano Kodo"" and the number of search result pages to output (top_k) to the top 1 page with the highest score.

The output for the page and score, image caption, and extracted text is as follows (it is collapsed because it is long).

Output (English)
Page: 18 Score: 15.5
===================================================================
Image Caption-------------------------------------------------
The image is a page from a magazine article titled "Exploring 'Kumano Kodo', a Forest-Enveloped World Cultural Heritage Site". The page is divided into two sections. The top section is titled "Features" and has a map of the world on the right side. Below the map, there is text that explains the features of the article.

The bottom section has a photo of a forest with tall trees and a person walking through it. The person is wearing a traditional Japanese outfit and is walking on a path that winds through the trees. The trees are tall and lush, and the sunlight is shining through the branches, creating a warm glow on the scene. The background is white, and there is a small illustration of a mountain range in the top right corner.
-------------------------------------------------------------------

Extracted Text-------------------------------------------------
18

FEATURES

Exploring

‘Kumano Kodo, a
Forest-Enveloped

World Cultural
Heritage Site

The Kii Peninsula, Japan’s largest, lies
slightly west from the central area of Honshu,
extending into the Pacific Ocean. Within the
Kii Peninsula, trails known as the ‘Kumano
Kodo’ have been preserved since ancient
times. A section of these routes has been des-
ignated as a UNESCO World Cultural Her-
itage Site,’ making it one of the rare trails
worldwide to receive such recognition. This
year, 2024, marks the 20th anniversary of
the listing of the ‘Sacred Sites and Pilgrimage
Routes in the Kii Mountain Range,’ which
includes the Kumano Kodo routes. Here, we
take a look at Kumano Kodo, situated amid
the forests of the Kii Mountains.

(Text: Morohashi Kumiko)

he majority of the Kii Peninsula is occupied by
mountainous terrain, known as the Kii Moun-
tain Range, which spans the three prefectures

Apilgrimage route leading to three sacred sites known as Kumano Sanzan.
The model in the photograph is wearing attire typical of upper-class ladies
during the medieval period, often worn while traveling.

HIGHLIGHTING JAPAN MAY 2024

‘Mount Koya

Kumano
Hongu Taisha

Kumano.

Hayatama
Taisha

== Nakahechi

=== Kohechi === Ohechi

of Mie, Nara, and Wakayama. The mountain range
with elevations ranging from 1,000 to 2,000 meters
runs from East to West and from North to South and is
characterized by a warm and extremely rainy climate,
nurturing rich forests over the years.

To the southeast lies a sacred site known as
‘Kumano Sanzan,’ primarily comprising three shrines
and two temples. This site has been a center of faith
for centuries and is recognized as a vital component of
the World Cultural Heritage Site. Near one of the three
grand shrines, Kumano Hongu Taisha (located in
Tanabe City, Wakayama Prefecture), sits the Kumano
Hongu Heritage Center, dedicated to the dissemina-
tion of tourist information and local information. We
spoke with Sugawa Aki who works at the center.

“The region centered around Kumano Sanzan,
enveloped by the natural beauty of the Kii Mountain
Range and known as ‘Kumano,’ has been revered as
a sacred place dedicated to the gods since ancient
times. Furthermore, from the 10th to the lth cen-
tury onwards, former emperors and monk emper-
ors as well as nobles, frequently visited this sacred
area on journeys known as Kumano pilgrimages. I
believe that’s why the trails were so well maintained,”
explains Sugawa.

“Kumano Kodo is a collective term referring to

Amountain forest along the Kumano Kodo, which includes numerous
subsidiary shrines called “Oji,” meaning “prince,” dedicated to the divine
offspring of Kumano Sanzan.

Photo: Kumano Hongu Heritage Center
-------------------------------------------------------------------
===================================================================

Processing time for page search with ColPali: 00h:00m:00.9s
Total processing time (including model loading time): 00h:01m:38.7s

The processing time is as follows:

Processing time for page search with ColPali: 00h:00m:00.9s
Total processing time (including model loading time): 00h:01m:38.7s
The following is the search result page (image) created during code execution.

  • Page 18, Score 15.5

It correctly encloses the map part in the upper right.

With Translation

I set the search query (query) to "熊野古道のマップ" (Map of Kumano Kodo) and the number of search result pages to output (top_k) to the top 1 page with the highest score.

The output for the page and score, image caption, and extracted text is as follows (it is collapsed because it is long).

Output (Japanese)
ページ: 18 スコア: 13.0
===================================================================
画像のキャプション-------------------------------------------------
画像は「熊野古道」と題された世界文化遺産の記事の一面です。ページは2つの部分に分かれています。上部は「特徴」と題され、右側に世界地図が配置されています。地図の下には、記事の特徴を説明するテキストがあります。

下部には、高く茂った木々の中を歩く人物の写真があります。その人物は伝統的な日本の衣装を着用し、木々の間を緩やかに進む道を歩いています。木々は豊かで高く、枝からの光が木々の間を通して柔らかな光沢を放ち、シーンに温かい雰囲気を与えています。背景は白色で、上部右上には小さな山脈のイラストが描かれています。
-------------------------------------------------------------------

抽出したテキスト-------------------------------------------------
大阪桃子が主催する、日本の山岳地帯、紀伊半島の森林覆いに包まれた世界文化遺産地区「キマノ」についての見識を探る。

紀伊半島は日本の最大の半島で、中部の中心からわずかに西に位置し、太平洋に臨む。この紀伊半島には、古代から保存されている「キマノ・コード」と呼ばれるトレイルが存在する。2024年は、これらのルートが世界遺産に認定されて20年目を迎える。これは世界でも稀なトレイルの一つである。

「キマノ」は、紀伊山地の森林に囲まれた場所に位置し、この地区の中心として歴史的に信仰されてきた。

紀伊山地は東西に伸び、南北にも延び、1,000~2,000メートルの高度差を持つ。温暖で激しい雨量により、この地域には豊かな森林が育まれてきた。

東南に位置する「キマノ・サンサン」という聖地は、主に三つの神社と二つの寺院から構成されている。この聖地は、長い歴史の中で信仰の中心として崇敬されてきた。この地区は、世界遺産の一部として認められている。

「キマノ・サンサン」の一つ、「キマノ・ホウグ・タイシ」(田辺市、和歌山県)には、「キマノ・コード」の案内と地域情報を提供する「キマノ・ホウグ・タイシ・ハーベスト・センター」が立地している。ここでは、桃子さんという方が勤務しており、話を聞いた。

「紀伊半島、特に『キマノ』は、古代から神聖なる場所として崇敬されてきた。また、10世紀から10世紀の間に、元上皇や仏皇帝、貴族がこの聖地を訪れることが多く、これが『キマノ・コード』のようなトレイルが良好に保存されている理由だと思う。」と桃子さんは説明してくれた。

「『キマノ・コード』とは、この地区に広がる森林の中に点在する「オジ」と呼ばれる尊崇される子孫神社の集合を指す。」

(注: 「オジ」は「王子」の意味で、これらの神社は「キマノ・サンサン」の子孫として崇敬されている。)
-------------------------------------------------------------------
===================================================================

ColPaliによるページ検索の処理時間: 00h:00m:00.4s
全体の処理時間(モデルの読み込み時間を含む): 00h:05m:16.6s

The processing time is as follows:

Processing time for page search with ColPali: 00h:00m:00.4s
Total processing time (including model loading time): 00h:05m:16.6s

(The saved page image is omitted as it is the same result as before (no translation).)

Conclusion

This time, I tried processing PDFs (images + text) using three models I was interested in. All of them were lightweight and easy to test.

  • ColPali
    I was curious about this because it can perform good searches even from images, and the accuracy and speed were beyond expectations. Since text extraction is unnecessary (as embeddings are created directly from images), it is very convenient to use and seems applicable to various use cases beyond just PDFs.

  • Florence-2-large
    In this case, I used this model as a vision-language model to grasp the overall meaning even on pages with many images or where the image is the primary subject. I appreciate that it can recognize and explain content to some degree even when text is scattered along with photos and drawings.

  • Borea-Phi-3.5-mini-Instruct-Jp
    Although it's not a model specifically designed for translation, I used it because its overall performance (especially in Japanese) was high and it piqued my interest.
    While proper nouns were difficult to translate and there were some issues due to translating text extracted from images, it translated the text into a form that felt close to natural language.
    Although I didn't do it this time, if you are using the output text as a RAG context, you could also consider processing it directly with this model.

Thank you, and I look forward to the next opportunity.

*Note: While I was writing this article, "Qwen2-VL (2B, 7B, 72B)" was released by Qwen. It supports multiple languages (including Japanese).
It might be a good idea to try adopting this model as a vision-language model ("Qwen2-VL-2B-Instruct" is small and likely easy to test).

Reference article:
https://note.com/npaka/n/n6e2a00a0c0e7

Reference Articles

https://zenn.dev/knowledgesense/articles/08cfc3de7464cb
https://github.com/illuin-tech/colpali
https://arxiv.org/abs/2407.01449
https://huggingface.co/blog/manu/colpali
https://note.com/npaka/n/n5863c3bd2990
https://huggingface.co/microsoft/Florence-2-large
https://huggingface.co/AXCXEPT/Borea-Phi-3.5-mini-Instruct-Jp
https://prtimes.jp/main/html/rd/p/000000008.000129878.html
https://note.com/npaka/n/n6e2a00a0c0e7
