Investigating PDF Embedded Fonts with Custom Encoding

When a PDF uses fonts with custom encoding, copying and pasting text results in characters that differ from their visual appearance. This article explains the core technologies for extracting and investigating fonts from such PDFs, along with a minimal implementation using Python.
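
As a toy illustration of the symptom (the mapping below is made up and not taken from any real PDF): the text layer stores character codes, the font decides which shape each code draws, and copy-paste only sees the codes.

```python
# Toy model of a custom-encoded font (all values are made up for illustration).
# The font maps each character code to a glyph shape of its own choosing.
glyph_drawn_for_code = {
    0x21: "H",
    0x22: "e",
    0x23: "l",
    0x24: "o",
}

# Codes stored in the PDF text layer for one word
stored_codes = [0x21, 0x22, 0x23, 0x23, 0x24]

# What the reader sees: the glyph shapes selected by the font
displayed = "".join(glyph_drawn_for_code[c] for c in stored_codes)

# What copy-paste yields when the codes are interpreted as-is
copied = "".join(chr(c) for c in stored_codes)

print(displayed)  # Hello
print(copied)     # !"##$
```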

Preparation

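Install the dependencies. The example below uses uv. Pillow is listed explicitly because the code imports PIL directly, and the Git repository is given with the git+ prefix.

uv add pymupdf freetype-py pillow "git+https://github.com/sbamboo/python-sixel.git"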

Please refer to the following article for information on Sixel.

https://qiita.com/7shi/items/69d1e7c15c7c6a5bb34f

Font Extraction

Extract the font files stored inside the PDF.

import fitz  # PyMuPDF

pdf = fitz.open("document.pdf")
for page in pdf:
    for font in page.get_fonts(full=True):
        xref = font[0]
        # Get font name, extension, and data body
        name, ext, _, data = pdf.extract_font(xref)
        if data:
            filename = f"{name}.{ext}"
            with open(filename, "wb") as f:
                f.write(data)
            print(f"Extracted font: {filename}")
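
The same font is often referenced from several pages, so the loop above may write the same file more than once. The following is a minimal sketch of visiting each xref only once and building a file name that is safe to write. `unique_xrefs` and `safe_font_filename` are hypothetical helper names, the naming scheme is an assumption, and the input is plain tuples whose first item is the xref (as read via `font[0]` above).

```python
import re

def unique_xrefs(font_entries):
    """Yield each xref once, in first-seen order.

    font_entries: iterable of tuples whose first item is the xref,
    like the entries read via font[0] in the loop above.
    """
    seen = set()
    for entry in font_entries:
        xref = entry[0]
        if xref not in seen:
            seen.add(xref)
            yield xref

def safe_font_filename(name, ext, xref):
    """Build a file name from a font name (hypothetical naming scheme).

    Characters other than letters, digits, '.', '_', '+', '-' are replaced
    with '_', and the xref is appended so that two fonts with the same
    name do not overwrite each other.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._+-]", "_", name) or "font"
    return f"{cleaned}-{xref}.{ext}"

# Example with made-up entries (two pages referencing the same font)
entries = [(12, "cff"), (15, "ttf"), (12, "cff")]
print(list(unique_xrefs(entries)))                      # [12, 15]
print(safe_font_filename("ABCDEF+My Font", "cff", 12))  # ABCDEF+My_Font-12.cff
```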

Identifying the Correspondence between Character Codes and Glyph IDs

Check which character codes point to which glyph IDs (GIDs) inside the font.

import freetype

face = freetype.Face("font.cff")
for i, (code, gid) in enumerate(face.get_chars(), start=1):
    print(f"{i}: U+{code:04X}, gid={gid}")
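
With custom encoding, the codes printed above may fall in the Private Use Area or on unrelated characters, so it can help to annotate each code and to see which codes share a glyph. The sketch below works on plain (code, gid) pairs in the same shape as the loop above. `describe_code` and `codes_by_gid` are hypothetical helper names, and the sample pairs are made up.

```python
import unicodedata
from collections import defaultdict

def describe_code(code):
    """Return a short label for a character code, treating it as a code point."""
    if code > 0x10FFFF:
        return "out of Unicode range"
    ch = chr(code)
    if unicodedata.category(ch) == "Co":
        return "private use"
    return unicodedata.name(ch, "unnamed")

def codes_by_gid(pairs):
    """Group character codes by the glyph ID they point to."""
    groups = defaultdict(list)
    for code, gid in pairs:
        groups[gid].append(code)
    return dict(groups)

# Made-up (code, gid) pairs in the same shape as the loop above
pairs = [(0x0041, 3), (0xE000, 4), (0xE001, 3)]
for code, gid in pairs:
    print(f"U+{code:04X}, gid={gid}: {describe_code(code)}")
print(codes_by_gid(pairs))
```

A GID that is reached from several codes, or a code labeled as private use, is a hint that the copied text will not match what is displayed.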

Rendering Glyphs from Font Data

Convert the bitmap generated by FreeType into a Pillow Image object.

from PIL import Image

def render_glyph(face, gid):
    face.set_char_size(64 * 64)  # 26.6 fixed point (1/64 units): 64pt, i.e. 64px at the default 72 dpi
    face.load_glyph(gid)
    bmp = face.glyph.bitmap

    # Convert FreeType buffer to Pillow image object
    return Image.frombytes("L", (bmp.width, bmp.rows), bytes(bmp.buffer))
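
The `64 * 64` in the code above combines two different 64s: the desired size, and the scale of the 26.6 fixed-point format that FreeType uses for sizes and metrics (units of 1/64). A small sketch of the conversion, with hypothetical helper names:

```python
def to_26_6(value):
    """Convert points or pixels to FreeType's 26.6 fixed-point units (1/64)."""
    return int(round(value * 64))

def from_26_6(value):
    """Convert 26.6 fixed-point units back to a float."""
    return value / 64

print(to_26_6(64))      # 4096, the same value as 64 * 64
print(to_26_6(12.5))    # 800
print(from_26_6(4096))  # 64.0
```

With such helpers the call would read `face.set_char_size(to_26_6(64))`, which makes it clearer that only one of the two factors is the font size.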

Previewing on the Terminal with Sixel

When investigating a large number of glyphs, saving image files one by one is cumbersome, so we display them directly in the terminal using Sixel. We use the render_glyph function implemented earlier.

import io
import sys
import sixel
import freetype

def show_sixel(image):
    with io.BytesIO() as buf:
        image.save(buf, format="PNG")
        sixel.converter.SixelConverter(buf).write(sys.stdout)

face = freetype.Face("font.cff")
for i, (code, gid) in enumerate(face.get_chars(), start=1):
    print(f"{i}: U+{code:04X}, gid={gid}", end=" ")
    image = render_glyph(face, gid)
    show_sixel(image)
    print()
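
Sixel escape sequences are only meaningful on a terminal, so when the output is piped or redirected it may be preferable to skip them. The following is a minimal sketch of such a guard. `should_emit_sixel` is a hypothetical helper name, and the check only tells whether the stream is a terminal, not whether that terminal supports Sixel.

```python
import io
import sys

def should_emit_sixel(stream):
    """Return True only when the stream is attached to a terminal.

    This does not detect Sixel support itself; it only avoids writing
    escape sequences into files and pipes.
    """
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())

# A StringIO stands in for redirected output here
print(should_emit_sixel(io.StringIO()))  # False
print(should_emit_sixel(sys.stdout))
```

In the loop above, the call could then be written as `if should_emit_sixel(sys.stdout): show_sixel(image)`.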

Vertical Position Control Using Font Metrics

As it stands, each image is only as tall as its glyph's bounding box, so the glyphs do not appear at the vertical positions they would have on a line of text. To align them on a common baseline, collect the metrics of all glyphs in advance and place each glyph on a canvas of fixed height.

import freetype
from PIL import Image

# Collect metrics information for all glyphs
face = freetype.Face("font.cff")
font_size = 64  # Pixel size
face.set_char_size(font_size * 64)  # 26.6 fixed point: 64 units per point
ascender = 0
descender = 0
for code, gid in face.get_chars():
    face.load_glyph(gid)
    metrics = face.glyph.metrics
    # Metrics are also 26.6 fixed point, so divide by 64 (not by font_size)
    bearing_y = metrics.horiBearingY / 64
    height = metrics.height / 64
    ascender = max(ascender, bearing_y)
    descender = max(descender, height - bearing_y)

# Update render_glyph to support baseline alignment
def render_glyph(face, gid):
    face.load_glyph(gid)
    bmp = face.glyph.bitmap
    glyph_img = Image.frombytes("L", (bmp.width, bmp.rows), bytes(bmp.buffer))

    # Create a canvas and place the glyph in the correct position
    canvas = Image.new("L", (bmp.width, int(ascender + descender)))
    metrics = face.glyph.metrics
    bearing_y = metrics.horiBearingY / 64
    canvas.paste(glyph_img, (0, int(ascender - bearing_y)))

    return canvas

Conceptual diagram of font metrics (example of Aq):

┌──────────────┐ ─── ascender line
│   █          │  ↑
│  █ █         │  │
│ █   █        │  │  ascender
│ █   █   ████ │  │
│ █████  █   █ │  │
│ █   █  █   █ │  ↓
├─█───█───████─┤ ─── baseline
│            █ │  ↑
│            █ │  │  descender
│            █ │  ↓
└──────────────┘ ─── descender line
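
The diagram can be worked through with plain numbers. The sketch below uses made-up pixel metrics read off the diagram (it does not call FreeType) and computes the canvas height and the vertical paste offset of each glyph in the same way as the code above.

```python
# Made-up pixel metrics for two glyphs: (bearing_y, height)
# bearing_y = distance from the baseline up to the top of the glyph
glyphs = {"A": (7, 7), "q": (4, 7)}

# Tallest part above the baseline and deepest part below it
ascender = max(bearing_y for bearing_y, _height in glyphs.values())
descender = max(height - bearing_y for bearing_y, height in glyphs.values())
canvas_height = ascender + descender

# Vertical paste offset: rows between the canvas top and the glyph top
offsets = {
    name: ascender - bearing_y
    for name, (bearing_y, _height) in glyphs.items()
}

print(ascender, descender, canvas_height)  # 7 3 10
print(offsets)                             # {'A': 0, 'q': 3}
```

"A" starts at the top of the canvas, while "q" starts three rows lower so that both share the same baseline.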

Generating HTML with Embedded Images

Sixel is enough for checking glyph shapes in the terminal, but it is not well suited to keeping logs.

  • In Windows Terminal, images are not included when copying and pasting.
  • If the output is redirected to a file, the raw Sixel escape sequences are written to it.

Since it is convenient to handle everything in a single file, we will generate an HTML file with images embedded using Base64.

import base64
import io
from PIL import ImageChops

def to_data_url(image):
    with io.BytesIO() as buf:
        image.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        return f"data:image/png;base64,{b64}"

# Initialize HTML
html = '<!DOCTYPE html>\n<html>\n<head>\n<meta charset="UTF-8">\n</head>\n<body>\n'

face = freetype.Face("font.cff")
face.set_char_size(font_size * 64)  # Same size as in the metrics step
for i, (code, gid) in enumerate(face.get_chars(), start=1):
    glyph_info = f"{i}: U+{code:04X}, gid={gid}"

    # Output to terminal
    print(glyph_info, end=" ")
    image = render_glyph(face, gid)
    show_sixel(image)
    print()

    # Invert black and white
    inverted = ImageChops.invert(image)

    # Add to HTML
    data_url = to_data_url(inverted)
    html += f'<p>{glyph_info} <img src="{data_url}"></p>\n'

html += '</body>\n</html>\n'

# Save the generated HTML
with open("report.html", "w", encoding="utf-8") as f:
    f.write(html)
print("Saved HTML: report.html")

By including both print and html +=, you can simultaneously check the output in the terminal and generate an HTML report. Because images are Base64 encoded and embedded directly into the HTML, there is no dependency on external files.
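
The loop above inserts the label into the HTML as-is, which is fine for the labels used here. If labels may contain characters such as `<` or `&`, a variant along the following lines could be used. This is a minimal sketch with hypothetical helper names (`image_row`, `build_report`), and placeholder bytes stand in for real PNG data.

```python
import base64
import html

def image_row(label, png_bytes):
    """Return one <p> element with the escaped label and an embedded PNG."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f'<p>{html.escape(label)} <img src="data:image/png;base64,{b64}"></p>'

def build_report(rows):
    """Wrap rows in a minimal HTML document, joining once at the end."""
    head = '<!DOCTYPE html>\n<html>\n<head>\n<meta charset="UTF-8">\n</head>\n<body>\n'
    return head + "\n".join(rows) + "\n</body>\n</html>\n"

# Placeholder bytes stand in for real PNG data in this example
rows = [image_row("1: U+0041, gid=3 <A>", b"\x89PNG placeholder")]
report = build_report(rows)
print(report)
```

Collecting the rows in a list and joining them once also avoids repeated string concatenation when the font contains many glyphs.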

If editing is required, you can open the generated HTML in a browser and copy-paste it into Word or other software.

Summary

The investigation flow combining these technologies is as follows:

  1. Extract fonts embedded in the PDF.
  2. Display glyphs in the terminal.
  3. Output the log as HTML.

I have previously used RichTextBox in C# to handle logs containing images.

https://qiita.com/7shi/items/cf9f7a8f0d53e6b6c841
