
[Go 1.22.3] False positives from utf8.RuneStart

Published on 2024/05/24

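utf8.RuneStart is a function that reports whether a byte is the leading byte of a UTF-8 encoded character. However, when I checked every value in the range 0x00..0xFF, it turned out that bytes which can never appear in valid UTF-8 (0xC0, 0xC1, 0xF5..0xFF) are wrongly reported as leading bytes. The version tested is Go 1.22.3.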

package main

import (
  "fmt"
  "unicode/utf8"
)

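func main() {
  // Check every byte value 0x00..0xFF (ranging over an integer requires Go 1.22+).
  for i := range 0xFF + 1 {
    // Skip ASCII and print each non-ASCII byte that RuneStart accepts.
    if i >= 0x80 && utf8.RuneStart(byte(i)) {
      fmt.Printf("%X ", i)
    }
  }

  fmt.Println()
}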

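The false positives come from a bit operation that is too loose: it only checks that the top two bits of the byte are not 10. The comment on the implementation says "possibly invalid rune", which suggests that the function is not designed to reject invalid bytes.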

// https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/unicode/utf8/utf8.go;l=471

// RuneStart reports whether the byte could be the first byte of an encoded,
// possibly invalid rune. Second and subsequent bytes always have the top two
// bits set to 10.
func RuneStart(b byte) bool { return b&0xC0 != 0x80 }
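
For comparison, a stricter check can be sketched as follows. This is only an illustration: strictLeadByte and disagreements are hypothetical helpers, not part of unicode/utf8, and the sketch assumes that a valid first byte is either ASCII (0x00..0x7F) or in the range 0xC2..0xF4. It lists every byte that utf8.RuneStart accepts but the stricter check rejects.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// strictLeadByte is a hypothetical helper (not part of unicode/utf8).
// It reports whether b can begin a well-formed UTF-8 sequence:
// ASCII (0x00..0x7F) or a multi-byte lead byte (0xC2..0xF4).
func strictLeadByte(b byte) bool {
	return b < 0x80 || (b >= 0xC2 && b <= 0xF4)
}

// disagreements returns every byte that utf8.RuneStart accepts
// but strictLeadByte rejects, in ascending order.
func disagreements() []byte {
	var out []byte
	for i := 0; i <= 0xFF; i++ {
		b := byte(i)
		if utf8.RuneStart(b) && !strictLeadByte(b) {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	// Prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
	for _, b := range disagreements() {
		fmt.Printf("%X ", b)
	}
	fmt.Println()
}
```

A single-byte check like this still cannot validate a whole sequence (for example, 0xE0 must be followed by a byte in 0xA0..0xBF), so utf8.Valid or utf8.DecodeRune is probably the better tool when full validation is needed.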
