Translated by AI
Investigating PDF Embedded Fonts with Custom Encoding
When a PDF uses fonts with a custom encoding, text copied from it comes out as characters that differ from what is displayed. This article explains the core techniques for extracting and inspecting the fonts in such PDFs, along with a minimal implementation in Python.
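To make the symptom concrete, here is a toy model of a custom-encoded font. All codes and glyph shapes are made up for illustration and do not come from any real PDF; the sketch only assumes that the viewer draws glyphs by code while copy-and-paste falls back to the raw codes.

```python
# Toy model of a custom-encoded font. Every value here is invented.
# The PDF stores character codes; the font maps each code to a glyph shape.
code_to_glyph_shape = {
    0x21: "H",  # code 0x21 (normally "!") draws the shape of "H"
    0x22: "i",  # code 0x22 (normally '"') draws the shape of "i"
}

stored_codes = bytes([0x21, 0x22])

# What a viewer draws: each code is looked up in the font.
displayed = "".join(code_to_glyph_shape[c] for c in stored_codes)

# What copy-and-paste yields when no usable code-to-Unicode mapping exists:
# the raw codes read as ordinary characters.
copied = stored_codes.decode("latin-1")

print(displayed)  # Hi
print(copied)     # !"
```

The rest of the article is about recovering the left-hand side of this mapping (code to glyph shape) from a real font file.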
Preparation
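Install the dependencies. Here is an example using uv. Pillow is listed explicitly because the code below imports PIL, and the Git repository is given with the git+ prefix.
uv add pymupdf freetype-py pillow git+https://github.com/sbamboo/python-sixel.git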
Please refer to the following article for information on Sixel.
Font Extraction
Extract the font files stored inside the PDF.
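import fitz  # PyMuPDF

pdf = fitz.open("document.pdf")
seen = set()  # a font shared by several pages is extracted only once
for page in pdf:
    for font in page.get_fonts(full=True):
        xref = font[0]
        if xref in seen:
            continue
        seen.add(xref)
        # Get font name, extension, and data body
        name, ext, _, data = pdf.extract_font(xref)
        if data and ext != "n/a":  # "n/a" marks fonts that cannot be extracted
            filename = f"{name}.{ext}"
            with open(filename, "wb") as f:
                f.write(data)
            print(f"Extracted font: {filename}")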
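Before loading an extracted file, it can help to check what kind of font data it actually contains. The following is a heuristic sketch based on well-known leading bytes; `sniff_font_format` is a hypothetical helper, not part of PyMuPDF, and it does not validate the rest of the file.

```python
def sniff_font_format(data: bytes) -> str:
    """Guess the font container from its first bytes (heuristic only)."""
    head = data[:4]
    if head == b"OTTO":
        return "OpenType (CFF outlines)"
    if head in (b"\x00\x01\x00\x00", b"true"):
        return "TrueType"
    if head == b"ttcf":
        return "font collection"
    if data[:2] == b"%!":
        return "PostScript Type 1 (text header)"
    # A bare CFF table starts with major=1, minor=0, hdrSize, offSize (1..4).
    if len(data) >= 4 and data[0] == 1 and data[1] == 0 and 1 <= data[3] <= 4:
        return "bare CFF"
    return "unknown"


print(sniff_font_format(b"OTTO\x00\x0b\x00\x80"))  # OpenType (CFF outlines)
print(sniff_font_format(b"\x01\x00\x04\x02"))      # bare CFF
```

In the loop above, this could be called on `data` right after `pdf.extract_font(xref)` to log the detected format next to the file name.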
Identifying the Correspondence between Character Codes and Glyph IDs
Check which character codes point to which glyph IDs (GIDs) inside the font.
import freetype

face = freetype.Face("font.cff")
for i, (code, gid) in enumerate(face.get_chars(), start=1):
    print(f"{i}: U+{code:04X}, gid={gid}")
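With a custom encoding, several codes may point to the same glyph. A small sketch for spotting that, assuming the pairs have the (code, gid) shape yielded by `face.get_chars()`; the sample data and the helper name `codes_by_gid` are hypothetical.

```python
from collections import defaultdict


def codes_by_gid(pairs):
    """Group character codes by the glyph ID they point to."""
    groups = defaultdict(list)
    for code, gid in pairs:
        groups[gid].append(code)
    return dict(groups)


# Hypothetical (code, gid) pairs standing in for list(face.get_chars()).
sample_pairs = [(0x21, 5), (0x22, 9), (0x41, 5)]
groups = codes_by_gid(sample_pairs)

# Glyphs that are reachable from more than one code.
shared = {gid: codes for gid, codes in groups.items() if len(codes) > 1}
print(shared)  # {5: [33, 65]}
```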
Rendering Glyphs from Font Data
Convert the bitmap generated by FreeType into a Pillow Image object.
from PIL import Image

def render_glyph(face, gid):
    face.set_char_size(64 * 64)  # 64 pt in 26.6 fixed point; 64 px at the default 72 dpi
    face.load_glyph(gid)  # freetype-py renders the bitmap by default (FT_LOAD_RENDER)
    bmp = face.glyph.bitmap
    # Convert FreeType buffer to Pillow image object
    return Image.frombytes("L", (bmp.width, bmp.rows), bytes(bmp.buffer))
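One caveat with passing the buffer straight to Pillow: a FreeType bitmap stores `pitch` bytes per row, which is not guaranteed to equal `width`. The sketch below strips any row padding; it assumes an 8-bit grayscale bitmap with a positive pitch, and `tight_rows` is a hypothetical helper name.

```python
def tight_rows(buffer, width, rows, pitch):
    """Return the pixel bytes with per-row padding removed.

    Assumes 8-bit grayscale data and a positive pitch (rows stored top-down).
    """
    data = bytes(buffer)
    if pitch == width:
        return data
    return b"".join(data[r * pitch : r * pitch + width] for r in range(rows))


# A 2x2 bitmap stored with 4 bytes per row (2 pixels + 2 padding bytes).
padded = [1, 2, 0, 0, 3, 4, 0, 0]
print(list(tight_rows(padded, width=2, rows=2, pitch=4)))  # [1, 2, 3, 4]
```

In `render_glyph`, the last argument of `Image.frombytes` could then become `tight_rows(bmp.buffer, bmp.width, bmp.rows, bmp.pitch)`.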
Previewing on the Terminal with Sixel
When investigating a large number of glyphs, saving image files one by one is cumbersome, so we display them directly in the terminal using Sixel. We use the render_glyph function implemented earlier.
import io
import sys
import sixel
import freetype

def show_sixel(image):
    with io.BytesIO() as buf:
        image.save(buf, format="PNG")
        sixel.converter.SixelConverter(buf).write(sys.stdout)

face = freetype.Face("font.cff")
for i, (code, gid) in enumerate(face.get_chars(), start=1):
    print(f"{i}: U+{code:04X}, gid={gid}", end=" ")
    image = render_glyph(face, gid)
    show_sixel(image)
    print()
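For readers curious about what is actually written to the terminal, here is a deliberately simplified Sixel encoder. It is not how the sixel library works internally; it handles a single color, skips run-length compression and raster attributes, and `encode_sixel_mono` is a hypothetical name. Each output character encodes a column of six vertically stacked pixels.

```python
ESC = "\x1b"


def encode_sixel_mono(rows):
    """Encode rows of 0/1 pixels as a one-color Sixel string (simplified)."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    parts = [ESC + "Pq"]              # DCS introducer; "q" starts Sixel data
    parts.append("#0;2;100;100;100")  # define color 0 as RGB 100%, 100%, 100%
    for top in range(0, height, 6):   # one band covers six pixel rows
        band = ["#0"]                 # select color 0 for this band
        for x in range(width):
            bits = 0
            for dy in range(6):
                y = top + dy
                if y < height and rows[y][x]:
                    bits |= 1 << dy   # least significant bit is the top pixel
            band.append(chr(63 + bits))
        parts.append("".join(band) + "-")  # "-" moves to the next band
    parts.append(ESC + "\\")          # string terminator
    return "".join(parts)


# A 2x7 sample: the left column is fully lit, the right column is empty.
sample = [[1, 0]] * 7
print(repr(encode_sixel_mono(sample)))
```

Seeing the raw string also explains why redirected output is full of escape sequences rather than images.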
Vertical Position Control Using Font Metrics
As it stands, each glyph image is only as tall as its own bitmap, so the glyphs do not line up the way they do in running text. To align them on the baseline, first collect the metrics of all glyphs, then place each glyph on a canvas of common height.
import freetype
from PIL import Image

# Collect metrics information for all glyphs
face = freetype.Face("font.cff")
font_size = 64
face.set_char_size(font_size * 64)  # sizes are given in 26.6 fixed point (1/64 pt)
ascender = 0
descender = 0
for code, gid in face.get_chars():
    face.load_glyph(gid)
    metrics = face.glyph.metrics
    bearing_y = metrics.horiBearingY / 64  # 26.6 fixed point -> pixels
    height = metrics.height / 64
    ascender = max(ascender, bearing_y)
    descender = max(descender, height - bearing_y)

# Update render_glyph to support baseline alignment
def render_glyph(face, gid):
    face.load_glyph(gid)
    bmp = face.glyph.bitmap
    glyph_img = Image.frombytes("L", (bmp.width, bmp.rows), bytes(bmp.buffer))
    # Create a canvas and place the glyph in the correct position
    canvas = Image.new("L", (bmp.width, int(ascender + descender)))
    metrics = face.glyph.metrics
    bearing_y = metrics.horiBearingY / 64
    canvas.paste(glyph_img, (0, int(ascender - bearing_y)))
    return canvas
Conceptual diagram of font metrics (example of Aq):
┌──────────────┐ ─── ascender line
│   █          │  ↑
│  █ █         │  │
│  █ █         │  │ ascender
│ █   █   ████ │  │
│ █████   █  █ │  │
│ █   █   █  █ │  ↓
├─█───█───████─┤ ─── baseline
│            █ │  ↑
│            █ │  │ descender
│            █ │  ↓
└──────────────┘ ─── descender line
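The arithmetic behind the alignment can be checked without a font file. The sketch below uses made-up metrics for "A" and "q" in FreeType's 26.6 fixed-point units (1/64 of a pixel at this size); `from_26_6` and `layout` are hypothetical helper names.

```python
import math


def from_26_6(value):
    """Convert a 26.6 fixed-point value to pixels."""
    return value / 64


def layout(glyph_metrics):
    """Return (canvas_height, top offsets) for baseline alignment.

    glyph_metrics maps a name to (horiBearingY, height) in 26.6 units.
    """
    ascender = max(from_26_6(by) for by, _ in glyph_metrics.values())
    descender = max(from_26_6(h - by) for by, h in glyph_metrics.values())
    canvas_height = math.ceil(ascender + descender)
    offsets = {
        name: int(ascender - from_26_6(by))
        for name, (by, _) in glyph_metrics.items()
    }
    return canvas_height, offsets


# Invented metrics: "A" sits on the baseline, "q" hangs 13 px below it.
sample = {"A": (45 * 64, 45 * 64), "q": (30 * 64, 43 * 64)}
canvas_height, offsets = layout(sample)
print(canvas_height, offsets)  # 58 {'A': 0, 'q': 15}
```

Here "q" is pasted 15 px from the top, so its 43 px bitmap ends exactly at the bottom of the 58 px canvas, 13 px below the baseline.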
Generating HTML with Embedded Images
While Sixel lets you check glyph shapes in the terminal, it causes problems when you want to keep a log.
- In Windows Terminal, images are not included when copying and pasting.
- If the output is redirected to a file, Sixel escape sequences are included as they are.
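One way to keep escape sequences out of redirected output is to emit Sixel only when the stream is an interactive terminal. This is a minimal sketch; `should_emit_sixel` is a hypothetical helper, and a positive result does not prove that the terminal actually supports Sixel.

```python
import io
import sys


def should_emit_sixel(stream):
    """Return True only when the stream reports being a terminal."""
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


# An in-memory stream behaves like a redirected file: no Sixel output.
print(should_emit_sixel(io.StringIO()))  # False

# Typical use: guard the image output, keep the text output.
if should_emit_sixel(sys.stdout):
    pass  # call the Sixel display function here
```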
Since it is convenient to handle everything in a single file, we will generate an HTML file with images embedded using Base64.
import base64
import io
import freetype
from PIL import ImageChops

def to_data_url(image):
    with io.BytesIO() as buf:
        image.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/png;base64,{b64}"

# Initialize HTML
html = '<!DOCTYPE html>\n<html>\n<head>\n<meta charset="UTF-8">\n</head>\n<body>\n'
face = freetype.Face("font.cff")
face.set_char_size(64 * 64)  # a new Face has no size yet; set it before rendering
for i, (code, gid) in enumerate(face.get_chars(), start=1):
    glyph_info = f"{i}: U+{code:04X}, gid={gid}"
    # Output to terminal
    print(glyph_info, end=" ")
    image = render_glyph(face, gid)
    show_sixel(image)
    print()
    # Invert black and white
    inverted = ImageChops.invert(image)
    # Add to HTML
    data_url = to_data_url(inverted)
    html += f'<p>{glyph_info} <img src="{data_url}"></p>\n'
html += '</body>\n</html>\n'

# Save the generated HTML
with open("report.html", "w", encoding="utf-8") as f:
    f.write(html)
print("Saved HTML: report.html")
By including both print and html +=, you can simultaneously check the output in the terminal and generate an HTML report. Because images are Base64 encoded and embedded directly into the HTML, there is no dependency on external files.
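The embedding itself is plain Base64 and can be verified with a round trip. The sketch below uses only the 8-byte PNG signature as a stand-in for a full image; `bytes_to_data_url` and `data_url_to_bytes` are hypothetical helper names.

```python
import base64


def bytes_to_data_url(payload, mime="image/png"):
    """Wrap raw bytes in a Base64 data URL."""
    b64 = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{b64}"


def data_url_to_bytes(url):
    """Recover the raw bytes from a Base64 data URL."""
    header, _, b64 = url.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("not a Base64 data URL")
    return base64.b64decode(b64)


payload = b"\x89PNG\r\n\x1a\n"  # PNG signature only, standing in for an image
url = bytes_to_data_url(payload)
print(url)  # data:image/png;base64,iVBORw0KGgo=
print(data_url_to_bytes(url) == payload)  # True
```

The same decoding step can be used later to pull individual glyph images back out of a saved report.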
If editing is required, you can open the generated HTML in a browser and copy-paste it into Word or other software.
Summary
The investigation flow combining these technologies is as follows:
- Extract fonts embedded in the PDF.
- Display glyphs in the terminal.
- Output the log as HTML.
Related Articles
I have previously used RichTextBox in C# to handle logs containing images.