Translated by AI
[PHP] mb_convert_kana cannot be used for Unicode normalization of voiced and semi-voiced sound marks
I investigated whether mb_convert_kana could be used to convert voiced and semi-voiced sound marks from NFD to NFC, and concluded that it is better avoided. The reason is that U+3099 and U+309A, the combining characters defined for the voiced and semi-voiced sound marks, are not subject to full-width/half-width conversion. The function likely follows older conversion rules that were established before Unicode specified these combining characters.
If Unicode normalization of voiced and semi-voiced sound marks were added to mb_convert_kana, the difficult part would be choosing the option string. Should an option such as mb_convert_kana($str, 'NFC') be added? Ruby supports UTF-8-MAC in its encode method, but that approach requires maintaining a composition exclusion table.
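To make the idea concrete, here is a minimal sketch of what a kana-only composition step could look like in userland. The names kana_composition_table() and compose_kana_marks() are hypothetical and not part of mbstring, and the table is deliberately incomplete: it covers the regular か/さ/た/は rows plus う and ウ, and omits rarer pairs such as ワ, ヰ, ヱ, ヲ and the iteration marks. It assumes PHP 7.4 or later for mb_str_split().

```php
<?php
// Hypothetical helper, NOT part of mbstring: maps "base kana + combining mark"
// (U+3099 / U+309A) to the precomposed kana. Kana-only and incomplete by design.
function kana_composition_table(): array
{
    static $table = null;
    if ($table !== null) {
        return $table;
    }
    $table = [];
    // Bases whose voiced form is the next code point (e.g. U+304B -> U+304C).
    $voiced = 'かきくけこさしすせそたちつてとはひふへほ'
            . 'カキクケコサシスセソタチツテトハヒフヘホ';
    foreach (mb_str_split($voiced, 1, 'UTF-8') as $base) {
        $table[$base . "\u{3099}"] = mb_chr(mb_ord($base, 'UTF-8') + 1, 'UTF-8');
    }
    // は-row bases: the semi-voiced form is two code points after the base.
    foreach (mb_str_split('はひふへほハヒフヘホ', 1, 'UTF-8') as $base) {
        $table[$base . "\u{309A}"] = mb_chr(mb_ord($base, 'UTF-8') + 2, 'UTF-8');
    }
    // Irregular pairs that do not follow the +1 pattern.
    $table["\u{3046}\u{3099}"] = "\u{3094}"; // う + voiced mark -> ゔ
    $table["\u{30A6}\u{3099}"] = "\u{30F4}"; // ウ + voiced mark -> ヴ
    return $table;
}

function compose_kana_marks(string $str): string
{
    // strtr() with an array replaces the longest matching key first.
    return strtr($str, kana_composition_table());
}

var_dump(
    // ハ + U+3099 becomes バ (U+30D0)
    compose_kana_marks("\u{30CF}\u{3099}") === "\u{30D0}",
    // ハ + U+309A becomes パ (U+30D1)
    compose_kana_marks("\u{30CF}\u{309A}") === "\u{30D1}"
);
```

A table like this only works because kana composition is small and regular. General NFC needs the full Unicode composition data, including the composition exclusions, which is the maintenance burden mentioned above.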
<?php
// Voiced sound mark (Defined in JIS)
$str = "\u{309B}";
// Voiced sound mark (Combining character)
$str2 = "\u{3099}";
// Semi-voiced sound mark (Defined in JIS)
$str3 = "\u{309C}";
// Semi-voiced sound mark (Combining character)
$str4 = "\u{309A}";
// Full-width to half-width conversion ('k' option)
var_dump(
// Converted to the half-width voiced sound mark
"\u{FF9E}" === mb_convert_kana($str, 'k', 'utf-8'),
// Combining character: no change
$str2 === mb_convert_kana($str2, 'k', 'utf-8'),
// Converted to the half-width semi-voiced sound mark
"\u{FF9F}" === mb_convert_kana($str3, 'k', 'utf-8'),
// Combining character: no change
$str4 === mb_convert_kana($str4, 'k', 'utf-8')
);
// Half-width kana (U+FF8A) + half-width voiced sound mark
$str5 = "\u{FF8A}\u{FF9E}";
// Full-width voiced sound mark (spacing, then combining)
$str6 = "ハ\u{309B}";
$str7 = "ハ\u{3099}";
// Half-width kana (U+FF8A) + half-width semi-voiced sound mark
$str8 = "\u{FF8A}\u{FF9F}";
// Full-width semi-voiced sound mark (spacing, then combining)
$str9 = "ハ\u{309C}";
$str10 = "ハ\u{309A}";
// Half-width to full-width conversion ('KV' options)
var_dump(
// Half-width voiced sound mark is converted
"バ" === mb_convert_kana($str5, 'KV', 'utf-8'),
// Full-width voiced sound marks remain unchanged
"ハ\u{309B}" === mb_convert_kana($str6, 'KV', 'utf-8'),
"ハ\u{3099}" === mb_convert_kana($str7, 'KV', 'utf-8'),
// Half-width semi-voiced sound mark is converted
"パ" === mb_convert_kana($str8, 'KV', 'utf-8'),
// Full-width semi-voiced sound marks remain unchanged
"ハ\u{309C}" === mb_convert_kana($str9, 'KV', 'utf-8'),
"ハ\u{309A}" === mb_convert_kana($str10, 'KV', 'utf-8')
);
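For comparison, the following sketch shows the same NFD input passed through the Normalizer class. It assumes the intl extension is installed, which is not something mbstring itself provides, and it uses code point escapes so that the source file's own normalization form does not affect the result.

```php
<?php
// Assumption: the intl extension is loaded (it provides the Normalizer class).
// NFD sequences: ハ (U+30CF) followed by a combining mark.
$voicedNfd = "\u{30CF}\u{3099}";
$semiVoicedNfd = "\u{30CF}\u{309A}";

var_dump(
    // NFC composes the pair into バ (U+30D0)
    Normalizer::normalize($voicedNfd, Normalizer::FORM_C) === "\u{30D0}",
    // NFC composes the pair into パ (U+30D1)
    Normalizer::normalize($semiVoicedNfd, Normalizer::FORM_C) === "\u{30D1}",
    // mb_convert_kana leaves the combining sequence untouched
    mb_convert_kana($voicedNfd, 'KV', 'utf-8') === $voicedNfd
);
```

If intl is not available, Normalizer is undefined and the script fails, so a real application would want to check extension_loaded('intl') first.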