Translated by AI
The NFC/NFD Issue I Ran Into
The Problem
Recently, I received a bug report about the search feature of a system I work on...
I entered the character string "サンプル" (Sample) into the text box and performed a search, but the expected data, "資料サンプル" (Document Sample), wasn't found!
This feature performs a partial match search using a LIKE clause against the database, so searching for "サンプル" (Sample) should definitely bring it up. Why isn't it working...!
Let's Investigate
First, I'll investigate in my development environment.
I replicated the same situation in my development environment. I created data named "資料サンプル" (Document Sample), searched for "サンプル" (Sample)... and sure enough...! It didn't appear...!!
Why? Is it a database setting?
But wait, if I search for just "サン" (San), it appears!! "プル" (Puru) doesn't😭
This led me to a brilliant hypothesis: perhaps the small circle (handakuten) is the problem!!👏
Investigating with Claude
So, I explained this problem to Claude.
Me: "Searching for 'サン' (San) works, but 'プル' (Puru) doesn't..." (I actually asked more thoroughly than this.)
Claude: "That might be because the normalization forms, NFD and NFC, are different... blah blah blah"
"If the data is NFD normalized and the search value is an NFC normalized string, the search results won't match!... blah blah blah"
Me: NFC? NFD? I don't quite understand what Claude is saying... Let's Google it! I found a Zenn article! Oh, so there's an issue like this!
Although the article above explains it well, I'll briefly write about the NFC/NFD problem from my perspective!
NFC/NFD Problem
What is NFC?
NFC (Normalization Form Canonical Composition) is a form in which a character is composed into a single precomposed code point wherever one exists. This means that "プ" (pu) is stored as the single code point "プ" (pu), which is the form we usually expect.
What is NFD?
NFD (Normalization Form Canonical Decomposition) is a form in which a character is decomposed into a base character plus combining marks. "プ" (pu) is decomposed and stored as two code points, "フ" (fu) + "゜" (combining handakuten) 😢
Comparison
| Method | Representation of "プ" (pu) | Code Point |
|---|---|---|
| NFC | プ (1 character) | U+30D7 |
| NFD | フ + ゜ (2 characters) | U+30D5 + U+309A |
I see: a character with a dakuten (voicing mark) or handakuten (semi-voicing mark) can be represented as two code points even though it looks like one character.
This was the true nature of this bug.
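To see the difference in the table for yourself, here is a minimal sketch you can run in Node.js or a browser console. The helper name `codePoints` is my own and purely illustrative; the strings are written with escape sequences so the two forms cannot get mixed up by an editor.

```javascript
// "プ" as one precomposed code point (the NFC form).
const composed = "\u30D7";
// The same character as "フ" (U+30D5) + combining handakuten (U+309A), the NFD form.
const decomposed = "\u30D5\u309A";

// Illustrative helper (hypothetical name): list the code points in a string.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(codePoints(composed));   // [ 'U+30D7' ]
console.log(codePoints(decomposed)); // [ 'U+30D5', 'U+309A' ]

// They look identical on screen, but they are different strings.
console.log(composed === decomposed); // false

// normalize() converts between the two forms.
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true
```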
When does this occur?
Mac File Uploads
This seems to occur when Mac users upload files.
macOS has a specification to convert filenames to NFD format when saving them. So, even if you save a file named "サンプル.pdf" (Sample.pdf), it is stored internally on the Mac as "サンプル.pdf" (NFD).
And it seems that Chrome and Firefox do not normalize the filename exposed as file.name for a file selected through <input type="file"> on a Mac, so they pass it along in NFD as-is. *Safari, however, seems to convert filenames to NFC upon upload.
The <input type="text"> used for the search bar, on the other hand, receives typed input in NFC, so a mismatch between NFC and NFD occurs here.
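My case also fits this pattern: the filename uploaded from a Mac was saved to the database in NFD, while the search term typed into the text box was NFC, so the LIKE search could not match the part containing "プ" (pu).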
Windows users are generally NFC, so this problem doesn't occur for them. It seems to happen with the Mac × Chrome combination, which is troublesome...😭
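The mismatch can be reproduced without a database. In this rough sketch, `includes()` stands in for the partial-match LIKE search; that is my simplification, and how a real database compares the strings depends on its collation settings. The variable names are made up for illustration.

```javascript
// "資料サンプル" as it might arrive from a Mac upload: "プ" is decomposed (NFD).
const storedName = "\u8CC7\u6599\u30B5\u30F3\u30D5\u309A\u30EB";

// Search terms as typed into a text box (NFC).
const termSan = "\u30B5\u30F3";                // "サン"
const termPuru = "\u30D7\u30EB";               // "プル"
const termSample = "\u30B5\u30F3\u30D7\u30EB"; // "サンプル"

// "サン" has no dakuten/handakuten, so it is identical in both forms and matches.
console.log(storedName.includes(termSan));    // true
// "プ" is one code point in the term but two in the stored name, so no match.
console.log(storedName.includes(termPuru));   // false
console.log(storedName.includes(termSample)); // false

// Once the stored name is converted to NFC, every term matches.
const fixedName = storedName.normalize("NFC");
console.log(fixedName.includes(termPuru));    // true
console.log(fixedName.includes(termSample));  // true
```

This lines up with what I saw: "サン" (San) was found, but anything containing "プ" (pu) was not.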
Solution
JavaScript has a String.prototype.normalize() method that makes conversion easy 🙆 (I only develop in JavaScript, so I'll only write JS, sorry).
We decided to unify to NFC when saving to the database.
// Convert NFD formatted string to NFC
const fileName = inputValue.normalize("NFC");
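For reference, here is a slightly fuller sketch of the same idea. The function names (`toNfc`, `buildRecord`, `matches`) are hypothetical and not from our actual code base. Normalizing the search term as well is an extra precaution I am assuming here, not something the fix above strictly requires.

```javascript
// Hypothetical helper: convert any incoming text to NFC.
function toNfc(value) {
  return String(value).normalize("NFC");
}

// Hypothetical save path: in the browser this would receive file.name.
function buildRecord(uploadedFileName) {
  return { name: toNfc(uploadedFileName) };
}

// Hypothetical search path: normalize the term too, so both sides always agree.
// includes() stands in for the LIKE search here.
function matches(record, searchTerm) {
  return record.name.includes(toNfc(searchTerm));
}

// Usage: an NFD filename ("資料サンプル.pdf" with "プ" decomposed) is stored as NFC.
const record = buildRecord("\u8CC7\u6599\u30B5\u30F3\u30D5\u309A\u30EB.pdf");
console.log(record.name.length);                          // 10
console.log(matches(record, "\u30B5\u30F3\u30D7\u30EB")); // true (NFC term)
console.log(matches(record, "\u30D5\u309A\u30EB"));       // true (NFD term)
```

Data that was already saved in NFD before the fix would still need a one-time conversion, since this only covers new writes.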
Summary
- Even if characters look the same, internally computers have two representation methods: NFC and NFD.
- macOS has a specification to store filenames in NFD format, and Chrome/Firefox pass NFD directly from <input type="file">.
- In JS, you can easily convert with String.prototype.normalize("NFC"), so it's good to unify to NFC when saving to the database.