iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🌾

Converting Half-width Katakana to Full-width with a Loop-free Named Function in Google Sheets

に公開

Introduction

I have created a Google Sheets Named Function that converts half-width katakana and certain symbols ("。" "「" "」" "、" "・" "ー") into full-width characters.

In the main process, I have eliminated loop processing by manipulating arrays. Although the MAP function appears in the definition formula, it is used to allow a single-cell input to support array expansion; there is no looping performed for the processing of individual cells. (Does this count as clickbait in the title...? > <)

This project was largely created with the help of AI.

In fact, I didn't initially intend to build this, so I didn't start by "thinking it through myself." The entire process from start to finish was completed through an exchange with an AI.

It started when I happened to see a formula for half-width to full-width katakana conversion created by someone else, which required a separate conversion table outside the formula. I idly asked Gemini late at night, "Wouldn't it be possible to keep the correspondence table within the definition formula by using character code sequences?"... and before I knew it, it was 3:00 AM.

Regardless, here is the definition formula.

[Definition Formula]

Argument: Input

=LET(
  FullWidthKanaAll, ARRAYFORMULA(UNICHAR(SEQUENCE(86, 1, 12449))),
  HalfWidthKanaAll, ARRAYFORMULA(ASC(FullWidthKanaAll)),
  
  Condition, ARRAYFORMULA(LEN(HalfWidthKanaAll)=1),
  FullWidthKana, FILTER(FullWidthKanaAll, Condition),
  HalfWidthKana, FILTER(HalfWidthKanaAll, Condition),

  FullWidthSymbols, {"。";"「";"」";"、";"・";"ー"},
  HalfWidthSymbols, {"。";"「";"」";"、";"・";"ー"},
  FullWidthList, VSTACK(FullWidthSymbols, FullWidthKana),
  HalfWidthList, VSTACK(HalfWidthSymbols, HalfWidthKana),

  HalfWidthCodes, ARRAYFORMULA(UNICODE(HalfWidthList)),
  ClearSoundCodes, ARRAYFORMULA(UNICODE({"カ";"キ";"ク";"ケ";"コ";"サ";"シ";"ス";"セ";"ソ";"タ";"チ";"ツ";"テ";"ト";"ハ";"ヒ";"フ";"ヘ";"ホ";"ウ"})),
  HalfClearSoundCodes, ARRAYFORMULA(UNICODE({"ハ";"ヒ";"フ";"ヘ";"ホ"})),

  VoicedSounds, {"ガ";"ギ";"グ";"ゲ";"ゴ";"ザ";"ジ";"ズ";"ゼ";"ゾ";"ダ";"ヂ";"ヅ";"デ";"ド";"バ";"ビ";"ブ";"ベ";"ボ";"ヴ"},
  SemiVoicedSounds, {"パ";"ピ";"プ";"ペ";"ポ"},

  MAP(Input, LAMBDA(String, 
    IF(String="", "",
      ARRAYFORMULA(LET(
        Length, LEN(String),
        Sequence, SEQUENCE(Length),
        Current, MID(String, Sequence, 1),
        Next, MID(String & " ", Sequence + 1, 1),
        Prev, MID(" " & String, Sequence, 1),

        CurrentCode, UNICODE(Current),
        PrevCode, IFERROR(UNICODE(Prev), 0),

        IsVoiced, (Next = "゙") * ISNUMBER(XMATCH(CurrentCode, ClearSoundCodes)),
        IsSemiVoiced, (Next = "゚") * ISNUMBER(XMATCH(CurrentCode, HalfClearSoundCodes)),
        UnnecessarySymbol, ((Current = "゙") * ISNUMBER(XMATCH(PrevCode, ClearSoundCodes))) + 
                 ((Current = "゚") * ISNUMBER(XMATCH(PrevCode, HalfClearSoundCodes))),


        ConversionArray, IF(UnnecessarySymbol, "",
          IF(IsVoiced, XLOOKUP(CurrentCode, ClearSoundCodes, VoicedSounds),
            IF(IsSemiVoiced, XLOOKUP(CurrentCode, HalfClearSoundCodes, SemiVoicedSounds),
              IFNA(XLOOKUP(CurrentCode, HalfWidthCodes, FullWidthList), Current)
            )
          )
        ),

        JOIN("", ConversionArray)
      ))
    )
  ))
)

The following specifications make converting half-width katakana to full-width particularly tricky:

  • Voiced and semi-voiced marks are treated as single characters (voiced/semi-voiced sounds consist of two characters).
  • Google Sheets evaluates "イ" and "ィ" as TRUE (equal).

Handling these issues is where the creator's individuality shines. With that in mind, I will explain the collaboration between myself and Gemini below.

[Explanation]

Creating Full-Width and Half-Width Arrays

  FullWidthKanaAll, ARRAYFORMULA(UNICHAR(SEQUENCE(86, 1, 12449))),
  HalfWidthKanaAll, ARRAYFORMULA(ASC(FullWidthKanaAll)),

It was indeed possible to create the complete set of full-width kana using UNICHAR and SEQUENCE. As expected from an AI, it provided the correct numerical values. The complete set of half-width kana can be created one-to-one by applying ASC to this.

Removing Voiced/Semi-Voiced Sounds and Adding Symbols

  Condition, ARRAYFORMULA(LEN(HalfWidthKanaAll)=1),
  FullWidthKana, FILTER(FullWidthKanaAll, Condition),
  HalfWidthKana, FILTER(HalfWidthKanaAll, Condition),

  FullWidthSymbols, {"。";"「";"」";"、";"・";"ー"},
  HalfWidthSymbols, {"。";"「";"」";"、";"・";"ー"},
  FullWidthList, VSTACK(FullWidthSymbols, FullWidthKana),
  HalfWidthList, VSTACK(HalfWidthSymbols, HalfWidthKana),

Since voiced and semi-voiced sounds are handled separately, I exclude them from the full set. Because a voiced or semi-voiced mark adds a character, they can be filtered out using LEN(...) = 1.

Next, I create arrays for symbols not included in the character code sequence and match them up, then use VSTACK to complete the list.

Encoding and Lists for Voiced/Semi-Voiced Substitution

  HalfWidthCodes, ARRAYFORMULA(UNICODE(HalfWidthList)),
  ClearSoundCodes, ARRAYFORMULA(UNICODE({"カ";"キ";"ク";"ケ";"コ";"サ";"シ";"ス";"セ";"ソ";"タ";"チ";"ツ";"テ";"ト";"ハ";"ヒ";"フ";"ヘ";"ホ";"ウ"})),
  HalfClearSoundCodes, ARRAYFORMULA(UNICODE({"ハ";"ヒ";"フ";"ヘ";"ホ"})),

  VoicedSounds, {"ガ";"ギ";"グ";"ゲ";"ゴ";"ザ";"ジ";"ズ";"ゼ";"ゾ";"ダ";"ヂ";"ヅ";"デ";"ド";"バ";"ビ";"ブ";"ベ";"ボ";"ヴ"},
  SemiVoicedSounds, {"パ";"ピ";"プ";"ペ";"ポ"},

Now, here is the countermeasure for the issue where Google Sheets evaluates different characters as TRUE (equality). I encode the half-width list into numbers using UNICODE! This allows me to perform searches using codes (numbers), enabling exact matching.

Then, I prepare for the voiced/semi-voiced conversion. Since these cannot be obtained in a simple sequence, there is no other way than to write them out. A key point here is that the half-width side uses "ClearSoundCodes" (the base consonants) rather than the voiced/semi-voiced marks themselves; this differentiates my approach from methods using SUBSTITUTE or REGEXREPLACE.

Preparing for Cell-by-Cell Processing

  MAP(Input, LAMBDA(String, 
    IF(String="", "",
      ARRAYFORMULA(LET(
        Length, LEN(String),
        Sequence, SEQUENCE(Length),
        Current, MID(String, Sequence, 1),
        Next, MID(String & " ", Sequence + 1, 1),
        Prev, MID(" " & String, Sequence, 1),

        CurrentCode, UNICODE(Current),
        PrevCode, IFERROR(UNICODE(Prev), 0),

Everything up to this point was preparation for the prerequisite lists. From here, we begin processing the target cells. While MAP is used to support multiple cells, as I mentioned at the start, no loop processing occurs within the MAP function.

First, I create a sequence of numbers equal to the length of the string and use it with the MID function to create an array where each character is separated. Furthermore, I create an array starting from the second character (the 'Next' character at the same coordinate as 'Current') and an array where the first character is a space (the 'Prev' character at the same coordinate as 'Current').

Adding a space at the end of the 'Next' array is to match the array size.

For exact matching, I convert 'Current' and 'Prev' into their character codes. For 'Next', I only need to check for half-width voiced/semi-voiced marks, so encoding is unnecessary.

Setting Conditional Expressions

        IsVoiced, (Next = "゙") * ISNUMBER(XMATCH(CurrentCode, ClearSoundCodes)),
        IsSemiVoiced, (Next = "゚") * ISNUMBER(XMATCH(CurrentCode, HalfClearSoundCodes)),
        UnnecessarySymbol, ((Current = "゙") * ISNUMBER(XMATCH(PrevCode, ClearSoundCodes))) + 
                 ((Current = "゚") * ISNUMBER(XMATCH(PrevCode, HalfClearSoundCodes))),

I build the conditions for "if X, then Y."

  • IsVoiced: 'Next' is a voiced mark, and the current character code is included in the "ClearSoundCodes" list.
  • IsSemiVoiced: 'Next' is a semi-voiced mark, and the current character code is included in the "HalfClearSoundCodes" list.
  • UnnecessarySymbol: 'Current' is a voiced/semi-voiced mark, and 'Prev' is a kana that can take such a mark. The reason I don't just delete these marks unconditionally is to handle cases where these marks are intentionally added to characters that normally don't take them.

Substitution and Joining

        ConversionArray, IF(UnnecessarySymbol, "",
          IF(IsVoiced, XLOOKUP(CurrentCode, ClearSoundCodes, VoicedSounds),
            IF(IsSemiVoiced, XLOOKUP(CurrentCode, HalfClearSoundCodes, SemiVoicedSounds),
              IFNA(XLOOKUP(CurrentCode, HalfWidthCodes, FullWidthList), Current)
            )
          )
        ),

        JOIN("", ConversionArray)
      ))
    )
  ))
)

Finally, I define the substitution logic:

  • If "UnnecessarySymbol," delete it.
  • If "IsVoiced," use XLOOKUP to find the current code in the "ClearSoundCodes" list and return the corresponding "VoicedSound."
  • If "IsSemiVoiced," use XLOOKUP to find the current code in the "HalfClearSoundCodes" list and return the corresponding "SemiVoicedSound."
  • Otherwise, use exact character code matching to search the half-width kana list; if found, replace it with the full-width version. If not found (not a half-width kana), leave the character as is!

This completes the substitution. Finally, I use JOIN to concatenate the results...!! By repeating this for each cell, it supports array expansion.

Thank you for your hard work > <!!

Conclusion

While Gemini wrote most of the content this time, my initial attempts involved processing voiced/semi-voiced sounds using REDUCE, among other differences. I'm not sure which approach is truly better, but since I have an image of loop processing being heavy, I asked the AI, "Umm, for instance, if I read the positions of voiced/semi-voiced marks using a zero-indexed SEQUENCE, could I identify the positions of characters to be voiced... can we do something from there?" and it came up with the "Current," "Next," and "Prev" three-array process shown above.

I could only respond with "Oh, I see!" but that realization was only possible because I engaged with it, and I feel I learned something about how to "use AI."

Thank you for reading this far.

I would be happy to receive opinions and feedback from experts.
(However, if you feel like saying "that's not good enough," I would appreciate it if you could wrap it in some sugar coating.)

Discussion