
Lazy-Style Character Encoding Detection for Go on Windows


As I mentioned in "Windows, Unicode, and Me," Notepad in Windows 10 no longer attaches a BOM (Byte Order Mark) when saving in UTF-8.

As a result, there is no longer any definitive information to distinguish between UTF-8 and non-UTF-8 text like Shift-JIS, so we have to detect it ourselves.

In this article, I would like to propose a detection procedure that is relatively simple and does not depend on a Japanese environment.

Problems with Conventional Methods

What tools like nkf have traditionally done is check whether a byte sequence falls within the valid ranges of Shift-JIS.

However, with this method the valid-range rules for Shift-JIS are complicated, and the code ends up hard-wired to a Japanese environment. That's no good.

Proposed Method

We just need to do the same thing with the valid range of UTF-8, and the standard library already provides utf8.Valid for that check. The byte-sequence rules of UTF-8 are surprisingly strict, so it is quite rare for Shift-JIS text to keep passing this check line after line.

General logic:

  • Read the text file line by line.
  • If utf8.Valid is true, assume that line is UTF-8.
  • If false, assume the line is a string in the current code page; if conversion to UTF-16 succeeds, confirm it as non-UTF-8 (and treat all subsequent lines as non-UTF-8 as well).
  • If it fails, treat it as a detection failure (maybe it's binary?)

Sample

Let's try to create something like a bufio.Scanner with a conversion function.

The detection method is as described above, but first we need a function that converts non-UTF-8 text to UTF-8.

Windows has an API called MultiByteToWideChar, which converts text from an arbitrary code page to UTF-16, and the quasi-standard Go package golang.org/x/sys/windows provides a binding for it.

The procedure for using this is:

  1. In the first call, pass nil (NULL) as the UTF-16 destination buffer to obtain the buffer size required to hold the converted result.
  2. In the second call, actually convert it to UTF-16.
  3. Convert the text converted to UTF-16 into UTF-8.

That is the plan. Let's define this as a single general-purpose function.

package mbcs

import (
    "bufio"
    "io"
    "unicode/utf8"

    "golang.org/x/sys/windows"
)

const _ACP = 0 // Represents the current code page

func ansiToUtf8(mbcs []byte) (string, error) {
    if len(mbcs) == 0 {
        return "", nil // nothing to convert on an empty line
    }
    // first call: query the required UTF-16 buffer size
    size, err := windows.MultiByteToWideChar(
        _ACP, 0, &mbcs[0], int32(len(mbcs)), nil, 0)
    if size <= 0 {
        // the wrapper already captures GetLastError in err
        return "", err
    }

    // second call: actually convert ANSI to UTF-16
    utf16 := make([]uint16, size)
    rc, err := windows.MultiByteToWideChar(
        _ACP, 0, &mbcs[0], int32(len(mbcs)), &utf16[0], size)
    if rc <= 0 {
        return "", err
    }
    // convert UTF-16 to UTF-8 (a Go string)
    return windows.UTF16ToString(utf16[:rc]), nil
}

This function converts strings from the current code page to UTF-8. If text that is already UTF-8 is passed through it, it will result in an incorrect conversion or an error. Therefore, let's perform proper detection.

type Filter struct {
    sc   *bufio.Scanner
    text string
    ansi bool
    err  error
}

This is the type definition for a Scanner with character code detection. It basically wraps bufio.Scanner, but we replace only the Scan() and Text() methods.

func NewFilter(r io.Reader) *Filter {
    return &Filter{
        sc: bufio.NewScanner(r),
    }
}

Constructor. We'll take advantage of zero values to keep things simple.

  • sc member: the bufio.Scanner used internally.
  • text member: stores the conversion result.
  • ansi member: set to true once the input is confirmed as non-UTF-8. Initially false, since we don't know yet.
  • err member: stores the last error encountered.

All detection and conversion are performed within Scan.

func (f *Filter) Scan() bool {
    if !f.sc.Scan() {
        f.err = f.sc.Err()
        return false
    }
    line := f.sc.Bytes()
    if !f.ansi && utf8.Valid(line) {
        f.text = f.sc.Text()
    } else {
        f.text, f.err = ansiToUtf8(line)
        if f.err != nil {
            return false
        }
        f.ansi = true
    }
    return true
}
func (f *Filter) Text() string {
    return f.text
}

func (f *Filter) Err() error {
    return f.err
}

So, can we really achieve proper automatic detection with this? Let's check with some sample code.

For now, the package is named "mbcs". Initialize it with go mod init mbcs and write the following code as example.go in the same folder.

Normally, main and mbcs cannot coexist in the same directory, but since this main is only intended for go run, let's add a // +build ignore build constraint (spelled //go:build ignore in Go 1.17 and later) as the first line, followed by a blank line, to exclude it from go build.

// +build ignore

package main

import (
    "fmt"
    "os"

    "mbcs"
)

func main() {
    filter := mbcs.NewFilter(os.Stdin)
    for filter.Scan() {
        fmt.Println(filter.Text())
    }
    if err := filter.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err.Error())
        os.Exit(1)
    }
}

Well, does it work?

Test #1:

$ nkf32 --guess sample1.txt
Shift_JIS (CRLF)
$ hexdump sample1.txt
53 68 69 66 74 4A 49 53 0D 0A 82 C5 8F 91 82 A2
82 BD 81 42 83 54 83 93 83 76 83 8B 83 65 83 4C
83 58 83 67 82 C5 82 B7 81 42 0D 0A 94 BB 92 E8
82 C5 82 AB 82 E9 82 A9 82 C8 0D 0A
$ go run example.go < sample1.txt
ShiftJIS
で書いた。サンプルテキストです。
判定できるかな

Test #2:

$ nkf32 --guess sample2.txt
UTF-8 (CRLF)
$ hexdump sample2.txt
55 54 46 38 0D 0A E3 81 A7 E6 9B B8 E3 81 84 E3
81 9F E3 80 82 E3 82 B5 E3 83 B3 E3 83 97 E3 83
AB E3 83 86 E3 82 AD E3 82 B9 E3 83 88 E3 81 A7
E3 81 99 E3 80 82 0D 0A E5 88 A4 E5 AE 9A E3 81
A7 E3 81 8D E3 82 8B E3 81 8B E3 81 AA 0D 0A
$ go run example.go < sample2.txt
UTF8
で書いた。サンプルテキストです。
判定できるかな

Looks like it works, doesn't it? (By the way, the built-in commands TYPE/MORE in the command-line shell NYAGOS use roughly the same logic.)

Additionally, the code above is available from the following repository (under the MIT license):

https://github.com/zetamatta/go-mbcs-to-utf8-filter
