Don't Trust LLM Output: 6 Safeguards Learned from Building a PDF-to-Anki CLI with Claude API
Introduction
In my previous article, I wrote about my first 10 days of serious development in the Everything Claude Code (ECC) environment. It was the story of a beginner who didn't even know Git, learning to develop software by going around the PDCA cycle again and again.
This time, I'll talk about a tool called pdf2anki that I built during the latter half of those 10 days.
The Problem Wasn't the App, It Was the Content
To study for the G-test (G検定, a Japanese deep learning certification exam), I built a quiz app based on the forgetting curve: a web version in Python/Streamlit and an iOS version in Swift/SwiftUI. I implemented both with TDD in the ECC environment, recorded decisions as ADRs, and had a working product in two weeks.
After completing it, I realized: I had reinvented Anki.
There is no point in rebuilding something like Anki, which has over 20 years of history, in two weeks. However, during the development process, I saw the real bottleneck. It wasn't the app; it was the content.
In the iOS version, I implemented an algorithm that extracts text from PDFs and uses regex and LLMs to create 410 question-and-answer pairs. However, accuracy was unstable because of notation variations and inconsistent chapter structures. Matching would break over something as small as "Chapter1" versus "Chapter 1" (the presence or absence of a space).
If there were a tool that could "automatically generate high-quality Anki cards from any text," I could leave the app itself to Anki.
That's how I started building pdf2anki.
$ pdf2anki convert textbook.pdf -o cards.tsv
It is a CLI tool that takes PDF, text, or Markdown as input and outputs flashcards (TSV/JSON) using the Claude API.
In this article, I will write about the six pitfalls I encountered while creating pdf2anki and the defensive measures I took, so that those who use the Claude API in the future can avoid the same mistakes.
The "persistently asking questions" attitude I mentioned in the previous article remains the same. I kept asking Claude Code, "Why do it this way?" or "Are there other methods?" I will write as honestly as possible about the design decisions that emerged from those questions and answers.
1. Design Assuming LLM JSON Output Will Break
What happened
When asking the Claude API to generate cards, the returned JSON sometimes cannot be parsed directly.
Even if you write "Return only JSON" in the prompt, the model might still wrap the output in a Markdown code fence tagged `json`. Sometimes explanatory text is mixed into the middle of an array. There might be extra commas, or closing brackets might be missing.
LLM output is probabilistic. Even if you instruct it to "return it like this," there is no 100% guarantee.
Failed response
At first, I used json.loads() for batch parsing and retried the whole thing if it failed. Even if the format was broken for only 1 card out of 10, all 10 cards would be regenerated. This caused the API costs to swell unnecessarily.
Defensive measure: Validate on a per-card basis and allow partial success
for i, item in enumerate(data):
    try:
        card = AnkiCard.model_validate(item)
        cards.append(card)
    except (ValidationError, TypeError) as e:
        logger.warning("Skipping invalid card at index %d: %s", i, e)
Even if 1 out of 10 cards is broken, the remaining 9 are saved.
Design for partial success rather than a binary choice between success and failure. This principle applies to LLM-integrated tools in general.
The Markdown code-fence wrapper (the `json` fence) is stripped with a regex before the text is passed to the parser. It's grimy preprocessing, but an accumulation of small defenses like this is what keeps an LLM integration stable.
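The fence stripping and per-card validation described above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: the `Card` dataclass stands in for the Pydantic `AnkiCard` model, and the regex and the function names `strip_code_fence` and `parse_cards` are hypothetical.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical pattern: an optional three-backtick fence (with or without a
# "json" tag) wrapped around the whole payload.
_FENCE_RE = re.compile(r"^\s*`{3}(?:json)?\s*\n(.*?)\n?\s*`{3}\s*$", re.DOTALL)


@dataclass(frozen=True)
class Card:
    """Stand-in for the article's AnkiCard model (assumed fields)."""
    front: str
    back: str


def strip_code_fence(raw: str) -> str:
    """Return the text inside a Markdown code fence, or the stripped input."""
    match = _FENCE_RE.match(raw)
    return match.group(1) if match else raw.strip()


def parse_cards(raw: str) -> tuple[list[Card], list[str]]:
    """Parse a JSON array of cards, keeping valid items and collecting errors."""
    data = json.loads(strip_code_fence(raw))
    cards: list[Card] = []
    errors: list[str] = []
    for i, item in enumerate(data):
        try:
            cards.append(Card(front=str(item["front"]), back=str(item["back"])))
        except (KeyError, TypeError) as e:
            # One broken item must not discard the others.
            errors.append(f"index {i}: {e!r}")
    return cards, errors
```

Only a response that is not valid JSON at all (for example, a missing closing bracket) still fails as a whole here; individual malformed items are skipped and reported.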
2. The Quality of LLM-Generated Cards Varies
What happened
Even if the JSON is parsed correctly, the quality of the content is not guaranteed.
Here are examples of problematic cards that actually appeared:
- The front side only said "Please explain." Explain what?
- The back side was over 500 characters. It's a report, not a card.
- Three concepts were packed into one card. The principle of Anki cards is one concept per card.
- Output in QA format even though it was a list-type question. It should have been a fill-in-the-blank.
"It's fine because I left it to the LLM" was a dangerous assumption.
Failed response
I first tried a method where the LLM re-evaluated all cards. Since I hit the API for both generation and evaluation, the cost simply doubled. 200 API calls for every 100 cards created. This was not realistic.
Defensive measure: Ensuring quality with a 3-layer pipeline
Layer 1 (Code-based judgment, no LLM required): Heuristic scoring
Each card is scored from 0 to 1 on six axes, and it passes if the weighted total is 0.90 or higher. No LLM is called here.
| Axis | Weight | Check Details |
|---|---|---|
| Front quality | 25% | 10–200 characters, presence of question marks |
| Back quality | 25% | 5–200 characters, conciseness |
| Card type suitability | 15% | Whether it's QA when it should be a list |
| Bloom level | 10% | Whether the cognitive level matches the content |
| Tag quality | 10% | Whether hierarchical tags are attached |
| Atomicity | 15% | Whether it follows the "one concept per card" rule |
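As a rough sketch, the weighted total from this table could be computed as follows. The weights and the 0.90 threshold come from the article; the axis keys and function names are hypothetical.

```python
# Weights from the table above; the axis keys are hypothetical names.
WEIGHTS = {
    "front": 0.25,
    "back": 0.25,
    "card_type": 0.15,
    "bloom": 0.10,
    "tags": 0.10,
    "atomicity": 0.15,
}
PASS_THRESHOLD = 0.90


def weighted_score(axis_scores: dict[str, float]) -> float:
    """Combine per-axis scores (each in [0, 1]) into one weighted total."""
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)


def passes_layer1(axis_scores: dict[str, float]) -> bool:
    """A card skips the LLM critique when its total reaches the threshold."""
    return weighted_score(axis_scores) >= PASS_THRESHOLD
```

With these weights, a card that is perfect on every axis except atomicity (scored 0.3) totals 0.895 and just misses the threshold, so a single weak axis is enough to send a card to critique.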
The judgment of atomicity was interesting. I split the back side by periods, and if there were many sentences, I judged it as "containing multiple concepts."
# If the back side has 3 or more sentences -> suspected multiple concepts
sentences = [s for s in _SENTENCE_SPLIT_RE.split(back) if s.strip()]
if len(sentences) >= 3:
    score = max(0.3, 1.0 - len(sentences) * 0.15)
# "Also," "Furthermore" -> additive conjunctions -> further deduction
if _MULTI_CONCEPT_RE.search(back):
    score = max(0.2, score - 0.15)
If words like "Also" or "Furthermore" appear, there's a high possibility that too much information is packed into one card. Cards that a human would decide should be split can be detected by code.
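The snippet above leaves the two regular expressions and the starting score undefined. A self-contained version might look like this, assuming the score starts at 1.0; both patterns are guesses for illustration, not the ones used in pdf2anki.

```python
import re

# Hypothetical patterns: split on sentence-ending punctuation (English and
# Japanese), and flag additive connectives that hint at a second concept.
_SENTENCE_SPLIT_RE = re.compile(r"[.!?。！？]+\s*")
_MULTI_CONCEPT_RE = re.compile(
    r"\b(?:also|furthermore|in addition)\b|また|さらに", re.IGNORECASE
)


def atomicity_score(back: str) -> float:
    """Score how well the back side sticks to one concept (1.0 is best)."""
    score = 1.0  # assumed starting value
    sentences = [s for s in _SENTENCE_SPLIT_RE.split(back) if s.strip()]
    if len(sentences) >= 3:
        score = max(0.3, 1.0 - len(sentences) * 0.15)
    if _MULTI_CONCEPT_RE.search(back):
        score = max(0.2, score - 0.15)
    return score
```

A four-sentence answer that also contains "Also" ends up at about 0.25. Splitting on periods is crude (decimal numbers such as "0.5" count as sentence breaks), which is acceptable for a heuristic whose only job is to decide which cards deserve a closer look.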
Layer 2 (LLM usage): Let Claude critique only cards with a score below 0.90
The critique returns one of three verdicts: improve (rewrite the card), split (break it into several cards), or remove (delete it). The loop runs for at most 2 rounds.
The key point is not to use the LLM for all cards. Cards with a score of 0.90 or higher pass through Layer 1. In actual measurements, 60–70% of generated cards passed through Layer 1. LLM critique is only needed for 30–40%.
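The routing between Layer 1 and Layer 2 can be sketched as a small loop. Everything here is an assumption for illustration: cards are plain dicts, `score` is the heuristic scorer, and `critique` is a callback wrapping the LLM call that returns one card for "improve", several for "split", and none for "remove". What the real tool does with cards that still fail after two rounds is not stated, so this sketch simply returns them separately.

```python
from typing import Callable

PASS_THRESHOLD = 0.90
MAX_ROUNDS = 2

Card = dict[str, str]  # hypothetical shape: {"front": ..., "back": ...}


def refine(
    cards: list[Card],
    score: Callable[[Card], float],
    critique: Callable[[Card], list[Card]],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[list[Card], list[Card]]:
    """Return (accepted, unresolved); only low-scoring cards reach the LLM."""
    accepted: list[Card] = []
    pending: list[Card] = []
    for card in cards:
        (accepted if score(card) >= PASS_THRESHOLD else pending).append(card)

    for _ in range(max_rounds):
        if not pending:
            break
        next_pending: list[Card] = []
        for card in pending:
            # The critique callback is the only place an API call happens.
            for revised in critique(card):
                target = accepted if score(revised) >= PASS_THRESHOLD else next_pending
                target.append(revised)
        pending = next_pending
    return accepted, pending
```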
Layer 3 (Code-based judgment, no LLM required): Duplicate detection
Detect duplicates using Jaccard similarity of character bi-grams. If the similarity of the front side exceeds 0.7, a duplicate flag is set.
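A minimal sketch of this check, using the 0.7 threshold from the article. Lowercasing and removing whitespace before building the bi-grams is an assumption of this sketch, as are the function names.

```python
DUPLICATE_THRESHOLD = 0.7


def char_bigrams(text: str) -> set[str]:
    """Set of overlapping two-character substrings, ignoring case and spaces."""
    normalized = "".join(text.split()).lower()
    return {normalized[i:i + 2] for i in range(len(normalized) - 1)}


def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over character bi-grams; 0.0 if both sets are empty."""
    set_a, set_b = char_bigrams(a), char_bigrams(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def is_duplicate(front_a: str, front_b: str) -> bool:
    return jaccard_similarity(front_a, front_b) > DUPLICATE_THRESHOLD
```

Because it works on characters rather than words, the same function applies to Japanese text without a tokenizer.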
This is the same idea as the ECC TDD I wrote about in the previous article. Just as you write tests first, you define quality standards first. Without standards, you end up with just "feeling like it got better."
3. API Costs are Scary When "Invisible"
What happened
The Claude API is pay-as-you-go. For Sonnet, it's $3/M tokens for input and $15/M tokens for output.
How much does it cost to process a 100-page PDF? You won't know until you run it.
During initial testing, I accidentally processed a long PDF and spent nearly $2. For a personal learning tool, this expense is painful.
Three Defensive Measures
① Cost Estimation Before Execution
Use the preview command to get an estimate without hitting the API.
$ pdf2anki preview textbook.pdf
Estimated cost: $0.42 (Sonnet) / $0.11 (Haiku)
Sections: 12 | Chunks: 8 | Tokens: ~45,000
I designed the flow so that users can decide "this is okay" before running convert.
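An estimate like this needs no API call, only the text length and a price table. The sketch below uses the Sonnet prices quoted above and the `CHARS_PER_TOKEN = 4` constant that appears later in this article; the `output_ratio` parameter (output tokens as a fraction of input tokens) and the function name are assumptions, and prompt overhead is ignored.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # English-oriented constant used by the estimator


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


SONNET = Pricing(input_per_mtok=3.00, output_per_mtok=15.00)


def estimate_cost(text_chars: int, pricing: Pricing, output_ratio: float = 0.3) -> float:
    """Rough pre-run estimate in USD; output_ratio is an assumed guess."""
    input_tokens = text_chars / CHARS_PER_TOKEN
    output_tokens = input_tokens * output_ratio
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000
```

For 180,000 characters (about 45,000 input tokens), this sketch gives roughly $0.34 on Sonnet. The exact figure depends heavily on the assumed output ratio, which is why the budget limit in the next measure still matters.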
② Preventing Runaway with budget_limit
Set a budget limit (default $1.00) in CostTracker, and check the cumulative cost with every API call. It stops right there if the limit is exceeded.
@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()
It's made immutable with frozen=True. The "principle of immutability" mentioned in the previous article proved useful here. It returns a new CostTracker instance for every API call. The values are never overwritten midway. You can always accurately track "how much you've spent so far."
③ Automatic Model Selection Based on Text Volume
If the text is short (less than 10,000 characters or fewer than 30 cards), the tool uses Haiku, which costs about a quarter as much, and it routes to Sonnet when the text is longer.
_SONNET_TEXT_THRESHOLD = 10_000 # Number of characters
_SONNET_CARD_THRESHOLD = 30 # Number of cards
It's a simple threshold branch, but it significantly reduces processing costs for short texts.
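The branch itself can be as small as the sketch below. It follows the wording above literally (Haiku when either value is under its threshold); the real tool may combine the two conditions differently, and the function name and the returned labels are placeholders rather than actual model IDs.

```python
_SONNET_TEXT_THRESHOLD = 10_000  # Number of characters
_SONNET_CARD_THRESHOLD = 30      # Number of cards


def select_model(text: str, expected_cards: int) -> str:
    """Pick the cheaper model for small jobs (labels are placeholders)."""
    if len(text) < _SONNET_TEXT_THRESHOLD or expected_cards < _SONNET_CARD_THRESHOLD:
        return "haiku"
    return "sonnet"
```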
4. Incorrect Splitting of Long PDFs Breaks Card Context
What happened
Claude's input token limit is about 200K, but sending a large amount of text at once degrades the quality of the cards. Splitting is necessary.
Failed response
Initially, I was cutting the text evenly by character count. Naturally, it would cut in the middle of a chapter.
Cards generated from chunks containing a mix of "the latter half of Chapter 3" and "the first half of Chapter 4" had strange contexts. Term definitions from one chapter were confused with concepts from another chapter.
Defensive measure: Preserve context with section splitting + breadcrumbs
I changed to a method that logically splits by Markdown headings (#, ##, ###).
I attach a breadcrumb to each section. Like the "Home > Products > iPhone" navigation on a website, it shows where the section sits in the overall hierarchy.
breadcrumb: "Main Body > Chapter 1 Meaning of Treatise Title > 1.1 Etymology"
Claude can see this breadcrumb and understand, "Now we are talking about the etymology in Chapter 1." Just having this context at the beginning of the chunk made the context of the generated cards significantly more accurate.
The heading hierarchy is tracked using a dictionary called heading_stack. If an H2 appears, H3 information below it is cleared.
heading_stack: dict[int, str] = {}
# A heading at `level` clears entries at that level and deeper (e.g. a new H2 drops the old H2 and its H3)
keys_to_remove = [k for k in heading_stack if k >= level]
If a section exceeds 30,000 characters, it is sub-divided by paragraphs, and (cont.) is added to the heading to indicate it's a continuous section.
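Putting the heading stack and the breadcrumb together, a simplified splitter could look like this. It is a sketch under assumptions: only `#` to `###` headings are recognized, headings inside code blocks are not special-cased, the 30,000-character sub-division is omitted, and `Section` and `split_sections` are hypothetical names.

```python
import re
from dataclasses import dataclass

_HEADING_RE = re.compile(r"^(#{1,3})\s+(.+?)\s*$")


@dataclass(frozen=True)
class Section:
    breadcrumb: str
    body: str


def split_sections(markdown: str) -> list[Section]:
    """Split Markdown at H1-H3 headings, tagging each body with its breadcrumb."""
    heading_stack: dict[int, str] = {}
    sections: list[Section] = []
    body_lines: list[str] = []

    def flush() -> None:
        # Emit the text collected under the current heading path, if any.
        body = "\n".join(body_lines).strip()
        if body:
            crumb = " > ".join(heading_stack[k] for k in sorted(heading_stack))
            sections.append(Section(crumb, body))
        body_lines.clear()

    for line in markdown.splitlines():
        match = _HEADING_RE.match(line)
        if match is None:
            body_lines.append(line)
            continue
        flush()
        level = len(match.group(1))
        # A new heading invalidates entries at its own level and deeper.
        for key in [k for k in heading_stack if k >= level]:
            del heading_stack[key]
        heading_stack[level] = match.group(2)
    flush()
    return sections
```

Each `Section.breadcrumb` can then be placed at the top of the chunk sent to the model.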
Pitfalls Specific to Japanese Text
There are two.
First, handling text without Markdown headings. For Japanese books, I added a fallback to detect headings using patterns like "Chapter 1", "(1)", or "One,".
Second, token estimation is off. The constant CHARS_PER_TOKEN = 4 is based on English. Japanese runs at about 2 to 3 characters per token, so the estimate comes out at roughly half of the actual cost. This is a known issue that I have not fixed yet. For Japanese text, the coefficient should be corrected to about 2.5.
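The size of the error is easy to check with a little arithmetic. In this sketch, `estimate_tokens` is a hypothetical helper, and the 2.5 coefficient is the article's suggested correction rather than a measured value.

```python
import math


def estimate_tokens(text: str, chars_per_token: float) -> int:
    """Character-count token estimate; the coefficient depends on the language."""
    return math.ceil(len(text) / chars_per_token)


text = "あ" * 10_000                          # 10,000 Japanese characters
english_based = estimate_tokens(text, 4.0)    # 2,500 tokens
corrected = estimate_tokens(text, 2.5)        # 4,000 tokens
ratio = english_based / corrected             # 0.625
```

With a coefficient of 2.5, the English-based estimate is about 63% of the corrected one. At 2 characters per token it drops to exactly half.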
5. Image Handling is a Struggle Against Cost
What happened
PDFs contain not only text but also diagrams and charts.
The Claude API has a Vision feature that allows you to input not just text but also images. If you send a photo or a diagram, Claude can understand "what is in this image" and provide an answer. Using this, you can generate cards from diagrams and tables in textbooks as well.
I thought, "Wouldn't it be perfect if I sent all pages to Vision as images?"
The cost calculation woke me up.
The token cost of an image is roughly (width × height) ÷ 750, with both dimensions in pixels. Rendering all 100 pages as images comes to about 150,000 tokens for the images alone. Combined with the text, that is 300,000 tokens, or about $1. Text only would be around $0.15. That's a 7x cost difference.
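The formula is simple enough to sanity-check in a few lines. This sketch ignores any resizing the API applies to oversized images, and the function name is hypothetical.

```python
def image_tokens(width_px: int, height_px: int) -> int:
    """Approximate image token count: (width x height) / 750."""
    return round(width_px * height_px / 750)


per_page = image_tokens(1500, 750)   # 1,500 tokens for this example size
all_pages = 100 * per_page           # 150,000 tokens for 100 page images
```

At roughly 1,500 tokens per page image, 100 pages account for the 150,000 image tokens in the calculation above.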
Defensive Measure: Controlling Image Processing with Three Thresholds
① Image Coverage Threshold: 20%
Using pymupdf, I calculate the image area within a page and send only those pages where images occupy 20% or more of the page area to Vision. For text-heavy pages, text extraction is sufficient.
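The coverage check itself is plain geometry once the rectangles are known. The sketch below works on `(x0, y0, x1, y1)` tuples so that it runs without pymupdf. In practice the page rectangle could come from `page.rect` and the image rectangles from the `bbox` entries of `page.get_image_info()`, but that wiring is an assumption and not necessarily how pdf2anki does it. Overlapping images are counted twice here, with the result capped at 1.0.

```python
Rect = tuple[float, float, float, float]  # (x0, y0, x1, y1)

IMAGE_COVERAGE_THRESHOLD = 0.20


def rect_area(rect: Rect) -> float:
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def clip(rect: Rect, bounds: Rect) -> Rect:
    """Intersect `rect` with `bounds` (the result may have zero area)."""
    return (
        max(rect[0], bounds[0]),
        max(rect[1], bounds[1]),
        min(rect[2], bounds[2]),
        min(rect[3], bounds[3]),
    )


def image_coverage(page_rect: Rect, image_rects: list[Rect]) -> float:
    """Fraction of the page area occupied by images, capped at 1.0."""
    page_area = rect_area(page_rect)
    if page_area == 0:
        return 0.0
    covered = sum(rect_area(clip(r, page_rect)) for r in image_rects)
    return min(1.0, covered / page_area)


def needs_vision(page_rect: Rect, image_rects: list[Rect]) -> bool:
    return image_coverage(page_rect, image_rects) >= IMAGE_COVERAGE_THRESHOLD
```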
② DPI: Fixed at 150
Vision's constraints are 1568px on the long side and a maximum of 1.15 megapixels.
- 300 DPI: Clear, but tokens are doubled. Over 3,000 tokens per image.
- 72 DPI: Cheap, but text in diagrams becomes illegible.
- 150 DPI: Minimum resolution for text legibility. About 1,500 tokens per image.
③ Maximum 5 Images per Page
5 images × 1,500 tokens = 7,500 tokens. When combined with budget_limit, it prevents runaway costs.
With these three compromises, I kept the processing cost with Vision to 1.5 to 2 times that of text only. It didn't reach 7 times.
6. If You Can't Measure the Effect of Prompt Changes, Improvement Becomes a Gamble
What happened
When I rewrote the prompt, I couldn't judge its impact on card quality. Relying on a "feeling that it got better" is unreliable. LLM output is non-deterministic, and even with the same prompt, you get different results every time.
Defensive Measure: Automating Quality Testing with Keyword Matching Against Expected Cards
In LLM development, automatic evaluation of output quality is called Eval.
In pdf2anki, I adopted a method where the expected values—"this text should generate these cards"—are defined in YAML and cross-referenced with the actual output.
- id: "dl-001"
  text: |
    Overfitting refers to a phenomenon where a model fits too closely
    to the training data, leading to a decrease in generalization
    performance on unseen data. Countermeasures include dropout,
    early stopping, and data augmentation.
  expected_cards:
    - front_keywords: ["overfitting", "what"]
      back_keywords: ["training data", "generalization performance", "decrease"]
      card_type: qa
    - front_keywords: ["overfitting", "countermeasure"]
      back_keywords: ["dropout", "early stopping", "data augmentation"]
      card_type: qa
I calculate the degree of match based on whether the generated cards contain the keywords, and output Recall (how many of the expected cards were covered), Precision (how many of the generated cards were expected), and F1 (the harmonic mean of the two).
$ pdf2anki eval --dataset evals/dataset.yaml
Recall: 0.78 | Precision: 0.85 | F1: 0.81
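One possible way to compute these numbers is sketched below. The matching rule is an assumption: a generated card matches an expected card when every front keyword appears in its front and every back keyword appears in its back, case-insensitively. The class and function names are hypothetical, and the real `pdf2anki eval` may score matches differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    front: str
    back: str


@dataclass(frozen=True)
class Expected:
    front_keywords: tuple[str, ...]
    back_keywords: tuple[str, ...]


def matches(card: Card, expected: Expected) -> bool:
    """True if all expected keywords appear on the corresponding side."""
    front, back = card.front.lower(), card.back.lower()
    return all(k.lower() in front for k in expected.front_keywords) and all(
        k.lower() in back for k in expected.back_keywords
    )


def evaluate(generated: list[Card], expected: list[Expected]) -> tuple[float, float, float]:
    """Return (recall, precision, f1) for one dataset entry."""
    hit_expected = sum(any(matches(c, e) for c in generated) for e in expected)
    hit_generated = sum(any(matches(c, e) for e in expected) for c in generated)
    recall = hit_expected / len(expected) if expected else 0.0
    precision = hit_generated / len(generated) if generated else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1
```

Since F1 is the harmonic mean, the sample output above is internally consistent: 2 × 0.78 × 0.85 ÷ (0.78 + 0.85) ≈ 0.81.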
Keyword matching has its limits. For example, it cannot equate "Neural Network" with its Japanese synonym "神経回路網" (shinkei kairo-mo).
However, the important thing is detecting whether a prompt change has caused a regression (i.e., becoming worse than before). It doesn't need to be perfect semantic understanding. Every time I change the prompt, I run pdf2anki eval to ensure the F1 score hasn't dropped. This alone stops improvement from being a gamble.
It's the same concept I learned from TDD in ECC. If you have tests, you can refactor with peace of mind. If you have Eval, you can rewrite prompts with peace of mind.
Conclusion
The process of creating pdf2anki was a series of compromises.
| What I wanted to do | Reality | Compromise |
|---|---|---|
| Have LLM return perfect JSON | Output breaks | Allow partial success by skipping individual items |
| Quality evaluate all cards with LLM | Cost doubles | Filter 60–70% with heuristics |
| Use without worrying about cost | Pay-as-you-go is scary | Estimation + Limit + Model Selection |
| Split text evenly | Context breaks | Section splitting + breadcrumbs |
| Send all pages as images | 7x cost | Coverage threshold + DPI + Count limit |
| Judge prompt improvements by intuition | Can't detect regressions | Keyword match Eval |
None of these are "ideal designs" but rather products of "coming to terms with reality."
There was one judgment criterion that worked throughout the entire development: Always ask, "If I don't do this, what could I achieve with the same amount of time?"
At one point, I thought about supporting OpenAI because I had $10 in credits left. The implementation was estimated at 1,000 lines and 10 hours. That's like working for $1 an hour to recover $10. In those same 10 hours, I was able to build image card generation, a TUI, and the Eval framework.
The ECC PDCA cycle I wrote about in my previous article is running here too.
- Plan: Persistently ask Claude Code "Why do it this way?" to understand the design background.
- Do: Write tests first with TDD and implement.
- Check: Measure prompt quality with Eval and verify card quality with heuristics.
- Act: Record decisions like "Why I scrapped OpenAI support" or "Why I set the image threshold to 20%" in ADRs.
The Claude API is a powerful tool. However, the moment you think "hitting the API will solve it," costs swell, quality becomes unstable, and debugging becomes difficult. What brings out the power of the API is the defensive design built around it.
I hope this article is helpful to those walking the same path.
pdf2anki is available on GitHub.