📑
クエリのノーマライズ処理を実装してみた

2025/01/02に公開
MySQL
DBRE
normalizesql
tech
明けましておめでとうございます、今年もどうぞよろしくお願いしますｍｍ
データベースを利用するアプリケーションを開発していると、SQLクエリのログを解析したり、同じクエリを正規化して統計情報を集める必要が生じることがあります。特に、似たクエリ（例えば、条件値やリテラル値だけが異なるもの）を1つのグループにまとめることは、パフォーマンス分析やデータ利用パターンの把握に役立ちます。
冬休みの課題的に SQLクエリを正規化するための関数を Go で実装したので、その内容を紹介します。

 なぜクエリの正規化が必要か？データベースログの分析において、以下のような課題がありました：
条件値（リテラル値）や IN 句の内容が異なるクエリを同一視したい
サブクエリやコメントなど、解析対象として不要な部分を無視したい
SQL構文のばらつき（大文字・小文字、ホワイトスペースの違いなど）を統一したい
これらを解決するために、SQLクエリを正規化する関数を実装しました。

 実装内容以下に、主な実装内容をステップごとに説明します。

 1. クエリのトリミング
まず、クエリの先頭と末尾のホワイトスペースを削除します。また、クエリがシングルクオートで囲まれている場合、そのクオートも削除します。

 2. ホワイトスペースの統一
複数のスペースやタブを1つのスペースに置き換え、クエリの見た目を統一します。

 3. コメントの削除
SQLクエリには、ブロックコメント (/* ... */) が含まれることがあります。これらを削除して、クエリの本質的な部分だけを残します。

シングルラインコメント (--) に関しては AuditLog からのインプットを処理する際に一行で出力されているため、以降が全部削除されてしまうことを考慮すると許容するしかない、としたため、今回の処理には含めていません。

 4. リテラル値の正規化
クエリ内に含まれるリテラル値（文字列リテラル '...' や数値リテラル）をすべて ? に置き換えます。これにより、値の違いを無視した正規化が可能になります。

 5. IN句の正規化
IN句の中身を確認し、リストの場合は IN (?) に置き換えます。中身がサブクエリの場合は、そのサブクエリを再帰的に正規化します。

 6. VALUES句の正規化
INSERTクエリにおける VALUES句を VALUES (?) に置き換えます。

 7. LIMIT句の正規化
LIMIT句に含まれる数値を ? に置き換え、オフセット付きの形式（例: LIMIT 10, 20）にも対応します。

 8. サブクエリの正規化
サブクエリを再帰的に正規化し、全体の一貫性を確保します。

 9. クエリ全体を大文字に変換
最終的に、クエリをすべて大文字に変換してフォーマットのばらつきを排除します。

 10. SHA-256ハッシュの生成
正規化したクエリを SHA-256でハッシュ化し、一意の識別子として扱います。このハッシュを利用することで、クエリを簡単に比較できます。

 実際のコード以下が今回実装した関数のコードです：
normalizeSQL.gopackage cmd

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
	"strings"
)

// Compile regexes once for efficiency
var (
	collapseWhitespaceRegex = regexp.MustCompile(`\s+`)
	blockCommentRegex       = regexp.MustCompile(`/\*.*?\*/`)
	singleLineCommentRegex  = regexp.MustCompile(`(?m)(--|#).*?$`)
	stringLiteralRegex      = regexp.MustCompile(`(?s)('[^']*'|"[^"]*")`)
	numericLiteralRegex     = regexp.MustCompile(`\b[0-9]+\b`)
	inClauseRegex           = regexp.MustCompile(`\bIN\s*\((.*?)\)`)
	valuesClauseRegex       = regexp.MustCompile(`\bVALUES?\s*\(.*?\)(?:\s*,\s*\(.*?\))*`)
	limitRegex              = regexp.MustCompile(`\bLIMIT\s+\d+(?:,\s*\d+)?`)
	subqueryRegex           = regexp.MustCompile(`\(\s*SELECT.*?\)`)
	unionRegex              = regexp.MustCompile(`(?i)SELECT .*?`)
)

func NormalizeSQL(query string) string {
	normarizedQuery := ExecNormalizeSQL(query)
	return computeHash(normarizedQuery)
}

func ExecNormalizeSQL(query string) string {
	// Step 1: Remove leading and trailing whitespaces
	if strings.HasPrefix(query, "'") && strings.HasSuffix(query, "'") {
		query = query[1 : len(query)-1]
	}

	// Step 2: Collapse multiple spaces or tabs into a single space
	query = collapseWhitespaceRegex.ReplaceAllString(query, " ")

	// Step 3: Remove block comments /* ... */
	query = blockCommentRegex.ReplaceAllString(query, " ")

	// Step 4: Remove single-line comments (-- or #)
	query = singleLineCommentRegex.ReplaceAllString(query, " ")

	// Step 5: Normalize subqueries
	query = normalizeSubqueries(query, subqueryRegex)

	// Step 6: Replace string literals ('...' or "...") with ?
	query = stringLiteralRegex.ReplaceAllString(query, "?")

	// Step 7: Replace numeric literals with ?
	query = numericLiteralRegex.ReplaceAllString(query, "?")

	// Step 8: Normalize IN(...) lists
	query = normalizeInClauses(query)

	// Step 9: Normalize VALUES(...) lists
	if strings.HasPrefix(strings.ToUpper(query), "INSERT INTO") {
		query = valuesClauseRegex.ReplaceAllString(query, "VALUES (?)")
	}

	// Step 10: Normalize LIMIT clauses
	query = limitRegex.ReplaceAllString(query, "LIMIT ?")

	// Step 11: Normalize UNION repetitions manually
	query = normalizeUnions(query)

	// Step 12: Convert to uppercase
	query = strings.ToUpper(query)

	// Step 13: Trim leading and trailing spaces
	query = strings.TrimSpace(query)

	// Step 14: Collapse multiple spaces or tabs into a single space
	query = collapseWhitespaceRegex.ReplaceAllString(query, " ")

	return query
}

func normalizeUnions(query string) string {
	matches := unionRegex.FindAllString(query, -1)
	if len(matches) < 2 {
		return query // No repetitions to normalize
	}

	normalized := matches[0] // Keep the first SELECT clause
	for i := 1; i < len(matches); i++ {
		if matches[i] != normalized {
			return query // Found a different SELECT clause, no normalization
		}
	}
	return strings.Replace(query, matches[0], normalized, 1)
}

func normalizeSubqueries(query string, subqueryRegex *regexp.Regexp) string {
	// Count the number of `SELECT ` occurrences
	selectCount := strings.Count(strings.ToUpper(query), "SELECT ")
	processed := 0

	for processed < selectCount {
		match := subqueryRegex.FindStringSubmatchIndex(query)
		if match == nil {
			break
		}

		// Extract the inner part of the subquery (without parentheses)
		subquery := query[match[0]+1 : match[1]-1]

		// Normalize the inner subquery
		normalizedSubquery := ExecNormalizeSQL(subquery)

		// Replace the original subquery with the normalized version
		query = query[:match[0]] + "(" + normalizedSubquery + ")" + query[match[1]:]

		processed++
	}

	return query
}

func normalizeInClauses(query string) string {
	return inClauseRegex.ReplaceAllStringFunc(query, func(match string) string {
		// Extract the content inside the parentheses
		start := strings.Index(match, "(")
		end := strings.LastIndex(match, ")")
		if start == -1 || end == -1 || start >= end {
			return match // Invalid IN clause; return as is
		}

		content := strings.TrimSpace(match[start+1 : end]) // Extract the part inside parentheses

		// Check if the content is a subquery
		if strings.HasPrefix(strings.TrimSpace(strings.ToUpper(content)), "SELECT ") {
			// Normalize the subquery
			normalizedSubquery := ExecNormalizeSQL(content)
			return "IN (" + normalizedSubquery + ")"
		}

		// Otherwise, treat it as a list of literals and replace with ?
		return "IN (?)"
	})
}

func computeHash(query string) string {
	hasher := sha256.New()
	hasher.Write([]byte(query))
	return hex.EncodeToString(hasher.Sum(nil))
}

 テストコード簡単なテストコードも載せてみます
normalizeSQL_test.gopackage cmd

import (
	"testing"
)

func TestExecNormalizeSQL(t *testing.T) {
	tests := []struct {
		name     string
		input    string
		expected string
	}{
		{
			name:     "Multiple spaces",
			input:    "SELECT    *    FROM    users",
			expected: "SELECT * FROM USERS",
		},
		{
			name:     "Tabs",
			input:    "SELECT\t*\tFROM\tusers",
			expected: "SELECT * FROM USERS",
		},
		{
			name:     "Mixed spaces and tabs",
			input:    "SELECT \t *  \t FROM \t users",
			expected: "SELECT * FROM USERS",
		},
		{
			name:     "Empty input",
			input:    "",
			expected: "",
		},
		{
			name:     "rollback",
			input:    "rollback",
			expected: "ROLLBACK",
		},
		{
			name:     "commit",
			input:    "commit",
			expected: "COMMIT",
		},
		{
			name:     "set names",
			input:    "SET NAMES utf8mb4",
			expected: "SET NAMES UTF8MB4",
		},
		{
			name:     "set autocommit",
			input:    "set autocommit = 0",
			expected: "SET AUTOCOMMIT = ?",
		},
		{
			name:     "max and interval",
			input:    "SELECT  MAX(update_At) + interval 5 SECOND FROM  contract",
			expected: "SELECT MAX(UPDATE_AT) + INTERVAL ? SECOND FROM CONTRACT",
		},
		{
			name:     "Single line comment",
			input:    "SELECT * FROM users /* get all users */ WHERE id = 1;",
			expected: "SELECT * FROM USERS WHERE ID = ?;",
		},
		{
			name:     "Multiple comments",
			input:    "/* comment */ SELECT * /* another comment */ FROM users;",
			expected: "SELECT * FROM USERS;",
		},
		{
			name:     "No comments",
			input:    "SELECT * FROM users WHERE id = 1;",
			expected: "SELECT * FROM USERS WHERE ID = ?;",
		},
		{
			name:     "Simple INSERT",
			input:    "INSERT INTO users (id, name) VALUES ('1', 'John Doe');",
			expected: "INSERT INTO USERS (ID, NAME) VALUES (?);",
		},
		{
			name:     "INSERT with numbers",
			input:    "INSERT INTO users (id, name) VALUES (1, 'John Doe');",
			expected: "INSERT INTO USERS (ID, NAME) VALUES (?);",
		},
		{
			name:     "INSERT with JSON",
			input:    "INSERT INTO users (id, data) VALUES (1, '{\"key\":\"value\"}');",
			expected: "INSERT INTO USERS (ID, DATA) VALUES (?);",
		},
		{
			name:     "Equals",
			input:    "SELECT * FROM users WHERE id = 123",
			expected: "SELECT * FROM USERS WHERE ID = ?",
		},
		{
			name:     "Not Equals",
			input:    "SELECT * FROM users WHERE name != 'John'",
			expected: "SELECT * FROM USERS WHERE NAME != ?",
		},
		{
			name:     "Less Than",
			input:    "SELECT * FROM users WHERE age < 30",
			expected: "SELECT * FROM USERS WHERE AGE < ?",
		},
		{
			name:     "Greater Than or Equals",
			input:    "SELECT * FROM users WHERE salary >= 5000",
			expected: "SELECT * FROM USERS WHERE SALARY >= ?",
		},
		{
			name:     "Quoted String",
			input:    "SELECT * FROM users WHERE nickname = 'Johnny'",
			expected: "SELECT * FROM USERS WHERE NICKNAME = ?",
		},
		{
			name:     "Double Quoted String",
			input:    "SELECT * FROM users WHERE title = \"Manager\"",
			expected: "SELECT * FROM USERS WHERE TITLE = ?",
		},
		{
			name:     "Equals with string containing equals",
			input:    "SELECT * FROM users WHERE description = 'name = value'",
			expected: "SELECT * FROM USERS WHERE DESCRIPTION = ?",
		},
		{
			name:     "Not Equals with string containing not equals",
			input:    "SELECT * FROM users WHERE note != '!= symbol used'",
			expected: "SELECT * FROM USERS WHERE NOTE != ?",
		},
		{
			name:     "Mixed operators in string",
			input:    "SELECT * FROM users WHERE config = 'x <= 10 and y >= 20'",
			expected: "SELECT * FROM USERS WHERE CONFIG = ?",
		},
		{
			name: "CASE statement",
			input: "SELECT id, " +
				"CASE WHEN age < 18 THEN 'Minor' " +
				"WHEN age BETWEEN 18 AND 65 THEN 'Adult' " +
				"ELSE 'Senior' END AS age_group " +
				"FROM users WHERE salary > 5000",
			expected: "SELECT ID, " +
				"CASE WHEN AGE < ? THEN ? " +
				"WHEN AGE BETWEEN ? AND ? THEN ? " +
				"ELSE ? END AS AGE_GROUP " +
				"FROM USERS WHERE SALARY > ?",
		},
		{
			name: "UNION example",
			input: "SELECT id, name FROM users WHERE age < 30 " +
				"UNION ALL " +
				"SELECT id, name FROM employees WHERE salary >= 5000",
			expected: "SELECT ID, NAME FROM USERS WHERE AGE < ? " +
				"UNION ALL " +
				"SELECT ID, NAME FROM EMPLOYEES WHERE SALARY >= ?",
		},
		{
			name: "Nested CASE and UNION",
			input: "SELECT id, " +
				"CASE WHEN age < 30 THEN 'Young' ELSE 'Old' END AS age_group " +
				"FROM users WHERE status = 'active' " +
				"UNION " +
				"SELECT id, 'Employee' AS role " +
				"FROM employees WHERE department = 'IT'",
			expected: "SELECT ID, " +
				"CASE WHEN AGE < ? THEN ? ELSE ? END AS AGE_GROUP " +
				"FROM USERS WHERE STATUS = ? " +
				"UNION " +
				"SELECT ID, ? AS ROLE " +
				"FROM EMPLOYEES WHERE DEPARTMENT = ?",
		},
		{
			name:     "Nested WHERE with OR",
			input:    "SELECT * FROM users WHERE (age < 30 OR name = 'John') AND salary >= 1000",
			expected: "SELECT * FROM USERS WHERE (AGE < ? OR NAME = ?) AND SALARY >= ?",
		},
		{
			name:     "Subquery",
			input:    "SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE amount > 100)",
			expected: "SELECT * FROM USERS WHERE ID IN (SELECT USER_ID FROM ORDERS WHERE AMOUNT > ?)",
		},
		{
			name:     "UPDATE statement",
			input:    "UPDATE users SET name = 'John', age = 30 WHERE id = 1",
			expected: "UPDATE USERS SET NAME = ?, AGE = ? WHERE ID = ?",
		},
		{
			name:     "Nested Subquery",
			input:    "SELECT * FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.amount > 100)",
			expected: "SELECT * FROM USERS WHERE EXISTS (SELECT ? FROM ORDERS WHERE ORDERS.USER_ID = USERS.ID AND ORDERS.AMOUNT > ?)",
		},
		{
			name:     "JOIN Clause",
			input:    "SELECT users.id, orders.id FROM users JOIN orders ON users.id = orders.user_id WHERE orders.amount > 100",
			expected: "SELECT USERS.ID, ORDERS.ID FROM USERS JOIN ORDERS ON USERS.ID = ORDERS.USER_ID WHERE ORDERS.AMOUNT > ?",
		},
		{
			name:     "JOIN Clause with Subquery",
			input:    "SELECT users.id, orders.id FROM users JOIN orders ON users.id = orders.user_id WHERE orders.AMOUNT > 100",
			expected: "SELECT USERS.ID, ORDERS.ID FROM USERS JOIN ORDERS ON USERS.ID = ORDERS.USER_ID WHERE ORDERS.AMOUNT > ?",
		},
		{
			name:     "Non-matching JOIN Clause",
			input:    "SELECT users.id, orders.id FROM users JOIN orders ON users.name = orders.name WHERE orders.amount > 100",
			expected: "SELECT USERS.ID, ORDERS.ID FROM USERS JOIN ORDERS ON USERS.NAME = ORDERS.NAME WHERE ORDERS.AMOUNT > ?",
		},
		{
			name: "Deeply Nested Subqueries",
			input: "SELECT id FROM users WHERE EXISTS (" +
				"SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.amount > (" +
				"SELECT MAX(amount) FROM transactions WHERE transactions.order_id = orders.id AND transactions.status = 'approved'" +
				"))",
			expected: "SELECT ID FROM USERS WHERE EXISTS (" +
				"SELECT ? FROM ORDERS WHERE ORDERS.USER_ID = USERS.ID AND ORDERS.AMOUNT > (" +
				"SELECT MAX(AMOUNT) FROM TRANSACTIONS WHERE TRANSACTIONS.ORDER_ID = ORDERS.ID AND TRANSACTIONS.STATUS = ?" +
				"))",
		},
	}

	for _, test := range tests {
		t.Run(test.name, func(t *testing.T) {
			result := ExecNormalizeSQL(test.input)
			if result != test.expected {
				t.Errorf("unexpected result for %s: got %q, want %q", test.name, result, test.expected)
			}
		})
	}
}

func TestNormalizeSQL(t *testing.T) {
	tests := []struct {
		name     string
		input    string
		expected string
	}{
		{
			name:     "Multiple spaces",
			input:    "SELECT    *    FROM    users",
			expected: "5322201d23213d0fffb03c11be918a0a8b811fbf76313dcd241a0505d7ea1309",
		},
		{
			name:     "Tabs",
			input:    "SELECT\t*\tFROM\tusers",
			expected: "5322201d23213d0fffb03c11be918a0a8b811fbf76313dcd241a0505d7ea1309",
		},
		{
			name:     "Mixed spaces and tabs",
			input:    "SELECT \t *  \t FROM \t users",
			expected: "5322201d23213d0fffb03c11be918a0a8b811fbf76313dcd241a0505d7ea1309",
		},
		{
			name:     "Empty input",
			input:    "",
			expected: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
		},
		{
			name:     "rollback",
			input:    "rollback",
			expected: "587fa628a9fe4642ac0e22825a97e81d388af674f7251a9fb699f96901dcb710",
		},
		{
			name:     "commit",
			input:    "commit",
			expected: "79663c1d3b43ceaf9ee728e501c64b57ae2e4d6cb36ff5c65eddcda1de27342f",
		},
		{
			name:     "set names",
			input:    "SET NAMES utf8mb4",
			expected: "52bec1dad9fdc25ef7ce9b3c98e28aeed8889150d944b378eff3edb6d8c540ca",
		},
		{
			name:     "set autocommit",
			input:    "set autocommit = 0",
			expected: "71c14252ba2b66238a774c159df713d225f462836827b7aceeb7ca168025622d",
		},
		{
			name:     "max and interval",
			input:    "SELECT  MAX(update_At) + interval 5 SECOND FROM  contract",
			expected: "7ac949445a6e0bce3bd482c29a705d278b7fa6d4de0fa7e6b0448cc6b5038b7a",
		},
		{
			name:     "Single line comment",
			input:    "SELECT * FROM users /* get all users */ WHERE id = 1;",
			expected: "2148f9d8bd4edfc70fdccf0422164be320bd6665d8207fecf4f0fe87364e90fb",
		},
		{
			name:     "Multiple comments",
			input:    "/* comment */ SELECT * /* another comment */ FROM users;",
			expected: "74869f1296eac72dad6cbb881ba738f08a4d588c2d8daa96160fbd68e37f10b0",
		},
		{
			name:     "No comments",
			input:    "SELECT * FROM users WHERE id = 1;",
			expected: "2148f9d8bd4edfc70fdccf0422164be320bd6665d8207fecf4f0fe87364e90fb",
		},
		{
			name:     "Simple INSERT",
			input:    "INSERT INTO users (id, name) VALUES ('1', 'John Doe');",
			expected: "a0dd57292e81a88c183d22e45fc1b9c9036584c76b4624b58af9fa1bb75f5903",
		},
		{
			name:     "INSERT with numbers",
			input:    "INSERT INTO users (id, name) VALUES (1, 'John Doe');",
			expected: "a0dd57292e81a88c183d22e45fc1b9c9036584c76b4624b58af9fa1bb75f5903",
		},
		{
			name:     "INSERT with JSON",
			input:    "INSERT INTO users (id, data) VALUES (1, '{\"key\":\"value\"}');",
			expected: "ef4fbdbdec2ef0e10a72d1c12ffb5cb75b84259cd8069e2f9ac8fe7cf6a5072c",
		},
		{
			name:     "Equals",
			input:    "SELECT * FROM users WHERE id = 123",
			expected: "1a28ec2d7c90a1992f1aa3caa3245eb7ba41ddd76431b7495740f8e14bc07eb7",
		},
		{
			name:     "Not Equals",
			input:    "SELECT * FROM users WHERE name != 'John'",
			expected: "299a33fe6cd02b18a4075ae6b5b1bb63970f88ddcf9c732c76999c2a6e6d51f1",
		},
		{
			name:     "Less Than",
			input:    "SELECT * FROM users WHERE age < 30",
			expected: "22e1b0ca37b1d9a7906eb47d6d3f8b4a290782fc335bca5e869e1434885faffa",
		},
		{
			name:     "Greater Than or Equals",
			input:    "SELECT * FROM users WHERE salary >= 5000",
			expected: "4440b74ef2af3b8f71bcbdecf9f0c0b3401999a259b67a9302c34f76f5efb932",
		},
		{
			name:     "Quoted String",
			input:    "SELECT * FROM users WHERE nickname = 'Johnny'",
			expected: "a3936e69e3dd6546621498bdf5051f8433be3c9cec0dac6e999b2e9a052e6c46",
		},
		{
			name:     "Double Quoted String",
			input:    "SELECT * FROM users WHERE title = \"Manager\"",
			expected: "6eebdd4eab483faa488d0ff10980e41c2333f65fc8d0b8dd0ddcbcd39fa6b3f2",
		},
		{
			name:     "Equals with string containing equals",
			input:    "SELECT * FROM users WHERE description = 'name = value'",
			expected: "d30e3dafd2d3a1f1c29488c30685bc5627bac007b6a5ddbf07f1d80c13e7717f",
		},
		{
			name:     "Not Equals with string containing not equals",
			input:    "SELECT * FROM users WHERE note != '!= symbol used'",
			expected: "1bcb07d41f4abddeb8e226f3958fb0329a6f715c1e263bec7be37e638b8b6c0e",
		},
		{
			name:     "Mixed operators in string",
			input:    "SELECT * FROM users WHERE config = 'x <= 10 and y >= 20'",
			expected: "f87c0538e6ff33e2453736b7223af915adebf30ccdbf91080fd1154a7570b8db",
		},
		{
			name: "CASE statement",
			input: "SELECT id, " +
				"CASE WHEN age < 18 THEN 'Minor' " +
				"WHEN age BETWEEN 18 AND 65 THEN 'Adult' " +
				"ELSE 'Senior' END AS age_group " +
				"FROM users WHERE salary > 5000",
			expected: "7f46bb0f210d168cb0a154403d10bd52a6faca07b89ad7238e42b352101faf32",
		},
		{
			name: "UNION example",
			input: "SELECT id, name FROM users WHERE age < 30 " +
				"UNION ALL " +
				"SELECT id, name FROM employees WHERE salary >= 5000",
			expected: "a89e0be349dce5269642895959635b7a42fc543802a1bd3aa2ccac0d580f02f2",
		},
		{
			name: "Nested CASE and UNION",
			input: "SELECT id, " +
				"CASE WHEN age < 30 THEN 'Young' ELSE 'Old' END AS age_group " +
				"FROM users WHERE status = 'active' " +
				"UNION " +
				"SELECT id, 'Employee' AS role " +
				"FROM employees WHERE department = 'IT'",
			expected: "cc8c80953bedce98eade878ee6757219789037e20a3346ea3f7a32487b644f00",
		},
		{
			name:     "Nested WHERE with OR",
			input:    "SELECT * FROM users WHERE (age < 30 OR name = 'John') AND salary >= 1000",
			expected: "504d7cdbf010c05e0053d2c5a76a635f9165458322e45b19f9b168230e0311ab",
		},
		{
			name:     "Subquery",
			input:    "SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE amount > 100)",
			expected: "d47fa3d1d57d671389af68ba2e1b0380dc7e5e1c6d22ccd370fb736962302b2e",
		},
		{
			name:     "UPDATE statement",
			input:    "UPDATE users SET name = 'John', age = 30 WHERE id = 1",
			expected: "2c990554b9251214198fa5e0e94ee3011c3847cb1252569898c42bea8710b1cc",
		},
		{
			name:     "Nested Subquery",
			input:    "SELECT * FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.amount > 100)",
			expected: "18d995d15ea85a6f399966820f3ef6bb6ac8ec1a30973ef0cf3817fade481e4f",
		},
		{
			name:     "JOIN Clause",
			input:    "SELECT users.id, orders.id FROM users JOIN orders ON users.id = orders.user_id WHERE orders.amount > 100",
			expected: "834d6ac0c815f6776c1e86889e8cea979acaf526569a9877815554f903dda0fc",
		},
		{
			name:     "Non-matching JOIN Clause",
			input:    "SELECT users.id, orders.id FROM users JOIN orders ON users.name = orders.name WHERE orders.amount > 100",
			expected: "93ff4fd9152c8e3fecbb009701b96aa09697c4338fecf5ba5f0bf7eaa701fac3",
		},
		{
			name: "Deeply Nested Subqueries",
			input: "SELECT id FROM users WHERE EXISTS (" +
				"SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.amount > (" +
				"SELECT MAX(amount) FROM transactions WHERE transactions.order_id = orders.id AND transactions.status = 'approved'" +
				"))",
			expected: "6be1d1abc0fee95e4ce90c0a8ef2a2701d02acfb2510aca9b80258bd201b810e",
		},
	}

	for _, test := range tests {
		t.Run(test.name, func(t *testing.T) {
			result := NormalizeSQL(test.input)
			if result != test.expected {
				t.Errorf("unexpected hash result for %s: got %s, want %s", test.input, result, test.expected)
			}
		})
	}
}

 結果この正規化関数を使うことで、次のようなクエリが同じハッシュに正規化されます：
元のクエリ例 1:
SELECT * FROM users WHERE id = 123
元のクエリ例 2:
SELECT * FROM users WHERE id = 456
正規化後:
SELECT * FROM USERS WHERE ID = ?
ハッシュ値:
2148f9d8bd4edfc70fdccf0422164be320bd6665d8207fecf4f0fe87364e90fb
様々なクエリのタイプを検証したかったのでちょうど手元にあった Aurora の Audit Log を喰わせてみました。
このファイルだと 総SQL数 5942527 行、 SQL セッションで 289726 行、SQL Hash 数 (ほぼユニーククエリ)で 7353 行、これくらいを 10分程度で終わらせるならまぁまぁ許容範囲かなぁと思います。
そして 総SQL数 5942527 行に対して SQL Hash 数が 7353 だとすると 0.12% 程度しか見なくてもいいかもしれない、となるとちょっと希望が見えてくる気がしませんか？
LOG18:02:39 Query Processing Progress:   5% (  862483 / 17249624 lines processed,   553488 skipped)
18:02:52 Query Processing Progress:  10% ( 1724965 / 17249624 lines processed,  1116278 skipped)
18:03:07 Query Processing Progress:  15% ( 2587444 / 17249624 lines processed,  1619920 skipped)
18:03:29 Query Processing Progress:  20% ( 3449928 / 17249624 lines processed,  2162617 skipped)
18:03:57 Query Processing Progress:  25% ( 4312407 / 17249624 lines processed,  2704205 skipped)
18:04:25 Query Processing Progress:  30% ( 5174900 / 17249624 lines processed,  3270711 skipped)
18:04:55 Query Processing Progress:  35% ( 6037370 / 17249624 lines processed,  3831227 skipped)
18:05:24 Query Processing Progress:  40% ( 6899850 / 17249624 lines processed,  4396881 skipped)
18:05:55 Query Processing Progress:  45% ( 7762336 / 17249624 lines processed,  4964287 skipped)
18:06:25 Query Processing Progress:  50% ( 8624812 / 17249624 lines processed,  5529362 skipped)
18:06:55 Query Processing Progress:  55% ( 9487294 / 17249624 lines processed,  6110344 skipped)
18:07:25 Query Processing Progress:  60% (10349782 / 17249624 lines processed,  6677649 skipped)
18:07:56 Query Processing Progress:  65% (11212257 / 17249624 lines processed,  7252093 skipped)
18:08:28 Query Processing Progress:  70% (12074746 / 17249624 lines processed,  7826188 skipped)
18:09:02 Query Processing Progress:  75% (12937218 / 17249624 lines processed,  8389854 skipped)
18:09:34 Query Processing Progress:  80% (13799713 / 17249624 lines processed,  8962829 skipped)
18:10:10 Query Processing Progress:  85% (14662182 / 17249624 lines processed,  9546664 skipped)
18:10:48 Query Processing Progress:  90% (15524663 / 17249624 lines processed, 10141122 skipped)
18:11:24 Query Processing Progress:  95% (16387152 / 17249624 lines processed, 10730542 skipped)
18:11:58 Query Processing Progress: 100% (17249624 / 17249624 lines processed, 11307097 skipped)
Summary:
{
  "#of_sql_texts": 5942527,
  "#of_unique_sessions": 289726,
  "#of_unique_sql_hashes": 7353
}
(現実的にはまだ Hash 化が甘いと思いますが。。。もうちょっとなんとかしたいところです)

 まとめ今回の実装により、SQLクエリのばらつきを統一的に処理する基盤ができました。
この関数を活用することで、データベースログの分析がさらに効率的になります。今後も機能拡張や改善を進めていきたいと思います。
もし同じような課題に直面している方がいれば、ぜひ参考にしてみてください！
Discussion

ログインするとコメントできます