iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🤪

macOS Sorting is Broken and It's Frustrating

に公開

[Update] It seems to be fixed

【Good News】In macOS 15.4, the sort order of the sort command has been fixed properly! 〜 Why Jupiter was cutting in between Mars and Mercury #grep - Qiita


This is an article for Day 2 of the SmartHR Advent Calendar 2024 Series 2.

https://qiita.com/advent-calendar/2024/smarthr

It seems a tweet I made a while ago got a fair amount of attention, so I'll write it down properly (crude language isn't great, after all).

https://x.com/tmtms/status/1859033067115213118

When you sort the following text file with the sort command on macOS, it ends up in a complete mess.

千葉県千葉市
千葉県市川市
千葉県柏市
東京都北区
東京都千代田区
東京都大田区
東京都江戸川区
東京都江東区
東京都港区
$ sort a.txt
千葉県柏市
東京都北区
東京都港区
千葉県千葉市
千葉県市川市
東京都大田区
東京都江東区
東京都千代田区
東京都江戸川区

It looks like they're being ordered by the number of characters, ignoring the character codes. It's useless. macOS is so stupid.

The cause seems to be that the /usr/share/locale/ja_JP.UTF-8/LC_COLLATE file is linked to la_LN.US-ASCII/LC_COLLATE.

$ cd /usr/share/locale/
$ ls -l ja_JP.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 10 15 20:22 ja_JP.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE

https://blog.zhimingwang.org/macos-lc_collate-hunt

https://ja.stackoverflow.com/questions/100019/

LC_COLLATE determines collation rules, and since it points to a file for ASCII characters, it seems to treat all non-ASCII characters as having the same value.

In other words, it's like sorting data like this. No wonder it ends up ordered by character count. It seems that if characters are considered identical, they are then ordered by their character codes.










Note that by specifying LC_COLLATE=C, it is sorted as a simple byte sequence, resulting in sorting by character code order.

$ LC_COLLATE=C sort a.txt
千葉県千葉市
千葉県市川市
千葉県柏市
東京都北区
東京都千代田区
東京都大田区
東京都江戸川区
東京都江東区
東京都港区

I originally noticed this not from the sort command, but because the sorting in PostgreSQL's ORDER BY was behaving strangely.

My work development PC is a MacBook, and I run PostgreSQL on Linux using Docker. It's quite slow because of that, so when I tried running it directly on macOS without Docker, the sort order went crazy.

PostgreSQL defaults to using libc for its locale, so it depends on the OS environment.

$ initdb -U postgres /tmp/hoge
postgres=# \x
Expanded display is on.
postgres=# \l postgres
List of databases
-[ RECORD 1 ]--------+----------------
Name                 | postgres
Owner                | masahiro.tomita
Encoding             | UTF8
Locale Provider      | libc
Collation            | ja_JP.UTF-8
Ctype                | ja_JP.UTF-8
ICU Locale           | 
ICU Rules            | 
Access privileges    | 

In this state, performing an ORDER BY reveals that it has become just as "stupid" as the OS.

postgres=# create table test (c varchar);
CREATE TABLE
postgres=# insert into test (c) values ('あああああ'),('いいいい'),('ううう'),('ええ'),('お');
INSERT 0 5
postgres=# select * from test;
     c      
------------
 あああああ
 いいいい
 ううう
 ええ

(5 rows)

postgres=# select * from test order by c;
     c      
------------

 ええ
 ううう
 いいいい
 あああああ
(5 rows)

The strange behavior caused by RDB Collation reminded me of the MySQL "Sushi-Beer" problem.

https://twitter.com/tmtms/status/546925668424896512

Just like I did with sort, specifying --lc-collate=C during initdb worked fine.

$ initdb -U postgres --lc-collate=C /tmp/hoge
postgres=# \l postgres
List of databases
-[ RECORD 1 ]--------+------------
Name                 | postgres
Owner                | postgres
Encoding             | UTF8
Locale Provider      | libc
Collation            | C
Ctype                | ja_JP.UTF-8
ICU Locale           | 
ICU Rules            | 
Access privileges    | 


postgres=# select * from test order by c;
     c      
------------
 あああああ
 いいいい
 ううう
 ええ

(5 rows)

However, I wondered about having COLLATE as C when the encoding is UTF-8... so since PostgreSQL can use the ICU locale in addition to the libc locale, I decided to specify that.

$ initdb -U postgres --locale-provider=icu --icu-locale=ja_JP.UTF-8 /tmp/hoge
postgres=# \l postgres
List of databases
-[ RECORD 1 ]--------+------------
Name                 | postgres
Owner                | postgres
Encoding             | UTF8
Locale Provider      | icu
Collation            | ja_JP.UTF-8
Ctype                | ja_JP.UTF-8
ICU Locale           | ja-JP
ICU Rules            | 
Access privileges    | 

postgres=# select * from test order by c;
     c      
------------
 あああああ
 いいいい
 ううう
 ええ

(5 rows)

I've always felt that the macOS UI was hard to use, but I didn't realize it was this flawed as an OS too.
Well, I could just use a Docker container to get a proper Linux environment instead of macOS, but it's just so slow...

Honestly, I want to use desktop Linux for development.

Discussion