概要

日本語をはじめとするマルチバイトを含むデータでは、文字コードに悩まされることが多いです。

今回はタイトル通り、PowerShellのFunctionを2つ自作してみました。
これらにより文字コードを判定と BOM付きか判定の2つが実現できます。

この記事のターゲット

PowerShell ユーザーの方
ファイルの文字コードを判定したい方
ファイルの文字コードでBOM付きか判定したい方

自作したFunction

作成したFunctionでは、Write-Hostで判定した結果を表示しています。
PowerShellスクリプトに組み込む場合、System.Booleanで返すようなFunctionの方が便利だと思うので、
必要に応じてカスタマイズしてください。

ファイルの文字コードを判定できるFunction

自作Function「Get-PsEncoding」

# 文字コードの判定
Function Get-PsEncoding {
	Param (
		[Parameter(Mandatory=$true)][System.String]$targetfile
	)
	$stream_reader = [System.IO.StreamReader] $targetfile
	$profile_encoding = $stream_reader.CurrentEncoding
	$stream_reader.Close()

	Write-Host "EncodingName: [$($profile_encoding.EncodingName)]"
}

ファイルの文字コードでBOM付きか判定できるFunction

自作Function「Check-BOMStatus」

Function Check-BOMStatus {
    Param (
        [Parameter(Mandatory=$true)][System.String]$targetfile
    )
    if (-Not (Test-Path $targetfile)) {
        Write-Host 'The target file does not exist.' -ForegroundColor Red
        return
    }
    
    # BOMのバイトシーケンス
    $UTF7_BOM1 = [System.Byte[]](0x2B,0x2F,0x76,0x38)
    $UTF7_BOM2 = [System.Byte[]](0x2B,0x2F,0x76,0x39)
    $UTF7_BOM3 = [System.Byte[]](0x2B,0x2F,0x76,0x2B)
    $UTF7_BOM4 = [System.Byte[]](0x2B,0x2F,0x76,0x2F)
    $UTF8_BOM = [System.Byte[]](0xEF,0xBB,0xBF)
    $UTF16BE_BOM = [System.Byte[]](0xFE,0xFF)
    $UTF16LE_BOM = [System.Byte[]](0xFF,0xFE)
    $UTF32BE_BOM = [System.Byte[]](0x00,0x00,0xFE,0xFF)
    $UTF32LE_BOM = [System.Byte[]](0xFF,0xFE,0x00,0x00)
    
    # 先頭行をバイトで読み込み先頭から3バイト分のデータを取得
    [System.Byte[]]$first_4bytes = (Get-Content -Path $targetfile -Encoding Byte -TotalCount 4)
    [System.Byte[]]$first_3bytes = $first_4bytes[0..2]
    [System.Byte[]]$first_2bytes = $first_4bytes[0..1]
    
    # 先頭バイトでBOM付きか判定
    # UTF-7
    if (($null -eq (Compare-Object $first_4bytes $UTF7_BOM1 -SyncWindow 0)) -Or
        ($null -eq (Compare-Object $first_4bytes $UTF7_BOM2 -SyncWindow 0)) -Or
        ($null -eq (Compare-Object $first_4bytes $UTF7_BOM3 -SyncWindow 0)) -Or
        ($null -eq (Compare-Object $first_4bytes $UTF7_BOM4 -SyncWindow 0))) {
        Write-Host "[$($targetfile)] is UTF-7 BOM."
    }
    # UTF-8
    elseif ($null -eq (Compare-Object $first_3bytes $UTF8_BOM -SyncWindow 0)) {
        Write-Host "[$($targetfile)] is UTF-8 BOM."
    }
    # UTF-16 BE
    elseif ($null -eq (Compare-Object $first_2bytes $UTF16BE_BOM -SyncWindow 0)) {
        Write-Host "[$($targetfile)] is UTF-16 BE BOM."
    }
    # UTF-16 LE
    elseif ($null -eq (Compare-Object $first_2bytes $UTF16LE_BOM -SyncWindow 0)) {
        Write-Host "[$($targetfile)] is UTF-16 LE BOM."
    }
    # UTF-32 BE
    elseif ($null -eq (Compare-Object $first_4bytes $UTF32BE_BOM -SyncWindow 0)) {
        Write-Host "[$($targetfile)] is UTF-32 BE BOM."
    }
    # UTF-32 LE
    elseif ($null -eq (Compare-Object $first_4bytes $UTF32LE_BOM -SyncWindow 0)) {
        Write-Host "[$($targetfile)] is UTF-32 LE BOM."
    }
    else {
        Write-Host "[$($targetfile)] is not BOM." -ForegroundColor Red
    }
}

参考情報

まとめ

System.IO.StreamReaderを使って文字コードを判定
先頭のバイト数を読み取りBOM付きか判定
UTF-7の場合：先頭4バイトにより判断
UTF-8の場合：先頭3バイトにより判断
UTF-16の場合：先頭2バイトにより判断
UTF-32の場合：先頭4バイトにより判断

関連記事

Discussion