[Go 1.22.3] Misidentification in utf8.RuneStart

utf8.RuneStart reports whether a byte is the leading byte of a UTF-8 encoded character. However, checking every value in the range 0x00..0xFF with Go 1.22.3 shows that it also reports bytes that can never appear in valid UTF-8 (0xC0, 0xC1, and 0xF5..0xFF) as leading bytes.

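package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// Check every byte value 0x00..0xFF (range over an integer requires Go 1.22+).
	for i := range 0xFF + 1 {
		// Print only the non-ASCII bytes that RuneStart reports as a leading byte.
		if i >= 0x80 && utf8.RuneStart(byte(i)) {
			fmt.Printf("%X ", i)
		}
	}

	fmt.Println()
}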

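The cause is that the bitwise check is too loose: it tests only whether the top two bits are something other than 10, so every byte that is not a continuation byte (0x80..0xBF) is reported as a leading byte. The doc comment in the implementation says "possibly invalid rune," which suggests that the function does not account for invalid byte sequences.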

// https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/unicode/utf8/utf8.go;l=471

// RuneStart reports whether the byte could be the first byte of an encoded,
// possibly invalid rune. Second and subsequent bytes always have the top two
// bits set to 10.
func RuneStart(b byte) bool { return b&0xC0 != 0x80 }
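
For comparison, here is a minimal sketch of a stricter check. Both validLeadByte and looseOnly are hypothetical helpers written for this illustration and are not part of unicode/utf8. The sketch assumes that the only bytes able to begin a well-formed UTF-8 sequence are ASCII (0x00..0x7F) and the multi-byte lead bytes 0xC2..0xF4, and it lists the bytes that utf8.RuneStart accepts but this stricter check rejects.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validLeadByte is a hypothetical helper (not part of unicode/utf8).
// It reports whether b can begin a well-formed UTF-8 sequence:
// either ASCII (0x00..0x7F) or a multi-byte lead byte (0xC2..0xF4).
func validLeadByte(b byte) bool {
	return b < 0x80 || (b >= 0xC2 && b <= 0xF4)
}

// looseOnly (also hypothetical) collects every byte that utf8.RuneStart
// accepts but validLeadByte rejects.
func looseOnly() []byte {
	var out []byte
	for i := 0; i <= 0xFF; i++ {
		b := byte(i)
		if utf8.RuneStart(b) && !validLeadByte(b) {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	// Prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
	fmt.Printf("% X\n", looseOnly())
}
```

Note that passing this check does not make the whole sequence valid: the bytes that follow still have to be checked, for example with utf8.DecodeRune or utf8.Valid.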
