Lazy-Style Character Encoding Detection for Go on Windows
As I mentioned in "Windows, Unicode, and Me," Notepad in Windows 10 no longer attaches a BOM (Byte Order Mark) when saving in UTF-8.
As a result, there is no longer any definitive information to distinguish between UTF-8 and non-UTF-8 text like Shift-JIS, so we have to detect it ourselves.
In this article, I would like to propose a detection procedure that is relatively simple and does not depend on a Japanese environment.
Problems with Conventional Methods
What has been commonly done in the past with tools like nkf was to determine whether a byte sequence satisfies the valid range of Shift-JIS.
However, with this method, checking the valid Shift-JIS byte ranges is complicated, and the code ends up hard-wired to a Japanese environment. That's no good.
Proposed Method
We just need to do the same thing with the valid byte ranges of UTF-8, and there is a standard function for exactly that: `utf8.Valid`. The UTF-8 encoding rules are surprisingly strict, so it is quite rare for a Shift-JIS byte sequence to keep passing this check.
General logic:
- Read the text file line by line.
- If `utf8.Valid` returns true, assume that line is UTF-8.
- If it returns false, assume the line is a string in the current code page; if conversion to UTF-16 succeeds, confirm the input as non-UTF-8 (and treat all subsequent lines as non-UTF-8 as well).
- If the conversion fails, treat it as a detection failure (maybe it's binary?).
Sample
Let's try building something like a `bufio.Scanner` with a built-in conversion function. The detection method is as described above, but first we need a separate function that converts non-UTF-8 text to UTF-8.
Windows has an API called MultiByteToWideChar, which converts text from an arbitrary code page to UTF-16, and the semi-standard Go library golang.org/x/sys/windows provides a binding for it.
The procedure for using this is:
- In the first call, pass `NULL` as the destination for the UTF-16 text to get the buffer size required to store the converted result.
- In the second call, actually convert the text to UTF-16.
- Convert the UTF-16 result to UTF-8.
That is the plan. Let's define this as a single general-purpose function.
```go
package mbcs

import (
	"bufio"
	"io"
	"unicode/utf8"

	"golang.org/x/sys/windows"
)

const _ACP = 0 // CP_ACP: represents the current code page

func ansiToUtf8(mbcs []byte) (string, error) {
	if len(mbcs) <= 0 {
		return "", nil // nothing to convert (e.g. an empty line)
	}
	// First call: query the buffer size (in UTF-16 code units)
	// required to hold the converted result.
	size, err := windows.MultiByteToWideChar(
		_ACP, 0, &mbcs[0], int32(len(mbcs)), nil, 0)
	if err != nil {
		return "", err
	}
	// Second call: actually convert from the current code page to UTF-16.
	utf16 := make([]uint16, size)
	if _, err := windows.MultiByteToWideChar(
		_ACP, 0, &mbcs[0], int32(len(mbcs)), &utf16[0], size); err != nil {
		return "", err
	}
	// Finally, convert UTF-16 to a Go (UTF-8) string.
	return windows.UTF16ToString(utf16), nil
}
```
This function converts strings from the current code page to UTF-8. If text that is already UTF-8 is passed through it, it will result in an incorrect conversion or an error. Therefore, let's perform proper detection.
```go
type Filter struct {
	sc   *bufio.Scanner
	text string
	ansi bool
	err  error
}
```
This is the type definition for a Scanner with character code detection. It basically wraps bufio.Scanner, but we replace only the Scan() and Text() methods.
```go
func NewFilter(r io.Reader) *Filter {
	return &Filter{
		sc: bufio.NewScanner(r),
	}
}
```
Constructor. We'll take advantage of zero values to keep things simple.
- `sc` member: the `bufio.Scanner` used internally.
- `text` member: stores the conversion result.
- `ansi` member: set to true once the input is confirmed as non-UTF-8. Initially false because it is still unknown.
- `err` member: the last error encountered.
All detection and conversion are performed within Scan.
```go
// Scan reads the next line, detecting and converting its encoding.
func (f *Filter) Scan() bool {
	if !f.sc.Scan() {
		f.err = f.sc.Err()
		return false
	}
	line := f.sc.Bytes()
	if !f.ansi && utf8.Valid(line) {
		// The line is valid UTF-8: pass it through unchanged.
		f.text = f.sc.Text()
	} else {
		// Otherwise assume the current code page and convert.
		f.text, f.err = ansiToUtf8(line)
		if f.err != nil {
			return false
		}
		f.ansi = true // all subsequent lines are treated as non-UTF-8
	}
	return true
}

// Text returns the current line, converted to UTF-8 if necessary.
func (f *Filter) Text() string {
	return f.text
}

// Err returns the first error encountered during scanning.
func (f *Filter) Err() error {
	return f.err
}
```
So, can we really achieve proper automatic detection with this? Let's check with some sample code.
For now, the package is named "mbcs". Initialize it with `go mod init mbcs` and write the following code as `example.go` in the same folder.
Normally, package main and package mbcs cannot coexist in the same directory, but since this main is only intended for `go run`, let's add `//+build ignore` as the first line to exclude it from `go build`.
```go
//+build ignore

package main

import (
	"fmt"
	"os"

	"mbcs"
)

func main() {
	filter := mbcs.NewFilter(os.Stdin)
	for filter.Scan() {
		fmt.Println(filter.Text())
	}
	if err := filter.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err.Error())
		os.Exit(1)
	}
}
```
Well, does it work?
Test #1:
```
$ nkf32 --guess sample1.txt
Shift_JIS (CRLF)
$ hexdump sample1.txt
53 68 69 66 74 4A 49 53 0D 0A 82 C5 8F 91 82 A2
82 BD 81 42 83 54 83 93 83 76 83 8B 83 65 83 4C
83 58 83 67 82 C5 82 B7 81 42 0D 0A 94 BB 92 E8
82 C5 82 AB 82 E9 82 A9 82 C8 0D 0A
$ go run example.go < sample1.txt
ShiftJIS
で書いた。サンプルテキストです。
判定できるかな
```
Test #2:
```
$ nkf32 --guess sample2.txt
UTF-8 (CRLF)
$ hexdump sample2.txt
55 54 46 38 0D 0A E3 81 A7 E6 9B B8 E3 81 84 E3
81 9F E3 80 82 E3 82 B5 E3 83 B3 E3 83 97 E3 83
AB E3 83 86 E3 82 AD E3 82 B9 E3 83 88 E3 81 A7
E3 81 99 E3 80 82 0D 0A E5 88 A4 E5 AE 9A E3 81
A7 E3 81 8D E3 82 8B E3 81 8B E3 81 AA 0D 0A
$ go run example.go < sample2.txt
UTF8
で書いた。サンプルテキストです。
判定できるかな
```
Looks like it works, doesn't it? (By the way, the built-in commands TYPE/MORE in the command-line shell NYAGOS use roughly the same logic.)
Additionally, the code above is available from the following repository (under the MIT license):