The NFC/NFD Issue I Ran Into


The Problem

Recently, I received a bug report about the search feature of a system I work on...
The string "サンプル" (Sample) was entered into the text box and searched, but the expected record, "資料サンプル" (Document Sample), wasn't found!
This feature performs a partial-match search using a LIKE clause against the database, so searching for "サンプル" (Sample) should definitely bring it up. Why isn't it working...!

Let's Investigate

First, I tried to reproduce it in my development environment.
I created a record named "資料サンプル" (Document Sample), searched for "サンプル" (Sample)... and sure enough...! It didn't show up...!!
Why? Is it a database setting?
But wait, if I search for just "サン" (San), it shows up!! "プル" (Puru) doesn't 😭
This led me to a brilliant hypothesis: maybe the small circle (handakuten) is the culprit!! 👏

Investigating with Claude

So, I explained this problem to Claude.

Me: "Searching for 'サン' (San) works, but 'プル' (Puru) doesn't..." (I actually asked more thoroughly than this.)

Claude: "That might be because the normalization forms, NFD and NFC, are different... blah blah blah"
"If the data is NFD normalized and the search value is an NFC normalized string, the search results won't match!... blah blah blah"

Me: NFC? NFD? I don't quite understand what Claude is saying... Let's Google it! I found a Zenn article! Oh, so there's an issue like this!

Although the article above explains it well, I'll briefly write about the NFC/NFD problem from my perspective!

NFC/NFD Problem

What is NFC?

NFC (Normalization Form C, canonical composition) is the form in which a character is combined into a single precomposed code point wherever one exists. This means that "プ" (pu) is stored as the single code point "プ" (pu), which is the form we usually have in mind.

What is NFD?

NFD (Normalization Form D, canonical decomposition) is the form in which a character is split into a base character plus combining marks. "プ" (pu) is decomposed and stored as two code points, "フ" (fu) + "゜" (combining handakuten) 😢

Comparison

Method | Representation of "プ" (pu) | Code Point
NFC | プ (1 character) | U+30D7
NFD | フ + ゜ (2 characters) | U+30D5 + U+309A

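I see: a character with a dakuten (voicing mark) or handakuten (semi-voicing mark) can be represented as two code points, even though it looks like a single character.
This was the true cause of the bug: the stored "プ" and the searched "プ" looked the same but were different code point sequences, so the partial match failed.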
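The table above can be checked directly in JavaScript. The following is a minimal sketch; the helper name toCodePoints is my own, and the strings are written as escapes so the editor's encoding cannot silently change them.

```javascript
// "プ" in both forms, written as explicit escapes.
const nfc = "\u30D7";       // one precomposed code point
const nfd = "\u30D5\u309A"; // "フ" followed by the combining handakuten

// Hypothetical helper: list the code points of a string as "U+XXXX".
const toCodePoints = (s) =>
  Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );

console.log(toCodePoints(nfc)); // ["U+30D7"]
console.log(toCodePoints(nfd)); // ["U+30D5", "U+309A"]

// The two strings render the same but are not equal as strings.
console.log(nfc === nfd); // false

// Normalizing either one to the other form makes them equal.
console.log(nfc.normalize("NFD") === nfd); // true
console.log(nfd.normalize("NFC") === nfc); // true
```

A plain === comparison (and, in my case, the LIKE search) sees two different sequences unless one side is normalized first.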

When does this occur?

Mac File Uploads

This seems to occur when Mac users upload files.

macOS stores filenames in an NFD-style decomposed form when saving them. So even if you save a file as "サンプル.pdf" (Sample.pdf), the name is held internally on the Mac in its NFD form.

Chrome and Firefox also do not seem to normalize the filename they expose as file.name from <input type="file">, so a name coming from a Mac is passed through as NFD. *Safari, however, seems to convert filenames to NFC upon upload.

Text typed into the <input type="text"> used for the search bar, on the other hand, normally arrives in NFC, so a mismatch between NFC and NFD occurs here.

My case fits this pattern as well: the filename was uploaded from a Mac and saved as NFD, while the search keyword arrived as NFC.

Windows generally uses NFC, so this problem doesn't occur for Windows users. It seems to happen with the Mac × Chrome combination, which is troublesome...😭
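The mismatch can be reproduced without a database. This is only a sketch: String.prototype.includes stands in for the LIKE partial match, and the variable names are made up for illustration. How a real database compares these strings depends on its collation, so treat this as a model of what I observed, not a general rule.

```javascript
// Simulate a filename that arrived from a Mac: force it into NFD.
const storedName = "資料サンプル".normalize("NFD");

// Simulate keywords typed into the search box: force them into NFC.
const keywordSan = "サン".normalize("NFC");
const keywordPuru = "プル".normalize("NFC");
const keywordSample = "サンプル".normalize("NFC");

// Stand-in for LIKE '%keyword%' (an assumption, not the real query).
const partialMatch = (haystack, needle) => haystack.includes(needle);

// "サン" contains no dakuten/handakuten, so both forms are identical.
console.log(partialMatch(storedName, keywordSan)); // true

// "プ" is one code point in the keyword but two in the stored name.
console.log(partialMatch(storedName, keywordPuru)); // false
console.log(partialMatch(storedName, keywordSample)); // false

// Once the stored side is NFC as well, the match succeeds.
console.log(partialMatch(storedName.normalize("NFC"), keywordSample)); // true
```

This matches what I saw: "サン" hits, and anything containing "プ" does not.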

Solution

JavaScript has a String.prototype.normalize() method that makes conversion easy 🙆 (I only develop in JavaScript, so I'll only write JS, sorry).

We decided to normalize everything to NFC before saving it to the database.

// Convert NFD formatted string to NFC
const fileName = inputValue.normalize("NFC");
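A slightly fuller sketch of the same idea follows. Everything here is hypothetical: toNfc, saveFileName, searchFileNames, and the in-memory array standing in for the database table are names I made up for illustration. Normalizing the search keyword as well is an extra precaution I am assuming here, not something the fix above strictly requires.

```javascript
// Hypothetical helper: normalize strings to NFC, pass other values through.
function toNfc(value) {
  return typeof value === "string" ? value.normalize("NFC") : value;
}

// Hypothetical stand-in for the database table.
const savedFileNames = [];

// Normalize on the way in, so only NFC is ever stored.
function saveFileName(file) {
  const name = toNfc(file.name);
  savedFileNames.push(name);
  return name;
}

// Normalize the keyword too, in case it ever arrives as NFD.
function searchFileNames(keyword) {
  const needle = toNfc(keyword);
  return savedFileNames.filter((name) => name.includes(needle));
}

// Simulate an upload from a Mac: the name arrives in NFD.
saveFileName({ name: "資料サンプル.pdf".normalize("NFD") });

console.log(searchFileNames("サンプル".normalize("NFC")).length); // 1
console.log(searchFileNames("サンプル".normalize("NFD")).length); // 1
```

In the browser, the object passed to saveFileName would be the File taken from the <input type="file"> element. Data saved before the fix is still NFD, so it would presumably need a one-time conversion as well.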

Summary

  • Even if characters look the same, the same text can be encoded in more than one way, such as NFC and NFD.
  • macOS stores filenames in NFD form, and Chrome/Firefox pass the NFD name through unchanged from <input type="file">.
  • In JS, you can easily convert with String.prototype.normalize("NFC"), so it's a good idea to normalize to NFC before saving to the database.
