iTranslated by AI
Indexing PostgreSQL text Columns
If you create a B-tree index on a text type column in PostgreSQL, an error will occur if a large string is inserted into that column.
(This article is based on verification results using PostgreSQL 12.5)
For example, prepare a table and index as follows:
CREATE TABLE table1 (
value text
);
CREATE INDEX ON table1(value);
When you insert large text, you can confirm that an error occurs.
INSERT INTO table1 VALUES(repeat('a', 300000));
testdb=# INSERT INTO table1 VALUES(repeat('a', 300000));
ERROR: index row size 3456 exceeds btree version 4 maximum 2704 for index "table1_value_idx"
DETAIL: Index row references tuple (0,1) in relation "table1".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
Note that the error message varies depending on the size of the text.
INSERT INTO table1 VALUES(repeat('a', 1000000));
testdb=# INSERT INTO table1 VALUES(repeat('a', 1000000));
ERROR: index row requires 11464 bytes, maximum size is 8191
How to Create an Index on Large Text
I think there are very few cases where you would want to set an index on a text type. Perhaps it's a situation where you want to store everything, including values that might be anomalies, and then match them against something else.
For such cases, I will try several methods to work around this issue and create an index on large text.
To check whether the index is effective, let's first prepare some test data.
CREATE TABLE table1 (
id integer,
value text
);
-- 10,000,000 rows of data from 0 to 999 characters
INSERT INTO table1
SELECT
seq,
repeat(seq::text, seq % 1000)
FROM
generate_series(1, 10000000) AS seq;
-- Data that causes an error with a B-tree index
INSERT INTO table1 VALUES(-1, repeat('a', 300000));
INSERT INTO table1 VALUES(-2, repeat('a', 1000000));
Let's look at the execution plan without an index for now.
SELECT id FROM table1 WHERE value = '1';
testdb=# EXPLAIN ANALYZE SELECT id FROM table1 WHERE value = '1';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..515777.44 rows=1 width=4) (actual time=3106.957..3110.776 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on table1 (cost=0.00..514777.34 rows=1 width=4) (actual time=2072.206..3104.169 rows=0 loops=3)
Filter: (value = '1'::text)
Rows Removed by Filter: 3333334
Planning Time: 0.058 ms
Execution Time: 3110.792 ms
(8 rows)
Time: 3111.104 ms (00:03.111)
With a Seq Scan, the cost was 1000.00..515777.44, and the execution time was about 3 seconds.
(Since I just want to know rough numbers, I won't repeatedly clear the cache and take an average of measurements this time.)
md5 functional index
As the error message suggested with Consider a function index of an MD5 hash of the value, or use full text indexing., by using a functional index with the md5 function, the value is restricted to the 32 characters of the MD5 hash. This allows the index to be created even for large strings without causing an error.
CREATE INDEX ON table1(md5(value));
testdb=# CREATE INDEX ON table1(md5(value));
CREATE INDEX
Time: 94736.893 ms (01:34.737)
To use this index, you must explicitly specify the md5 function in the WHERE clause.
Also, due to the possibility of collisions, you need to specify the condition without the md5 function as well. It's a bit of a hassle.
SELECT id FROM table1 WHERE value = '1' AND md5(value) = md5('1');
testdb=# EXPLAIN ANALYZE SELECT id FROM table1 WHERE value = '1' AND md5(value) = md5('1');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on table1 (cost=1819.56..146875.59 rows=1 width=4) (actual time=1.648..1.650 rows=1 loops=1)
Recheck Cond: (md5(value) = 'c4ca4238a0b923820dcc509a6f75849b'::text)
Filter: (value = '1'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on table1_md5_idx (cost=0.00..1819.56 rows=50000 width=0) (actual time=0.844..0.845 rows=1 loops=1)
Index Cond: (md5(value) = 'c4ca4238a0b923820dcc509a6f75849b'::text)
Planning Time: 0.096 ms
Execution Time: 1.670 ms
(8 rows)
Time: 12.226 ms
The functional index is used, resulting in a Bitmap Index Scan with a cost of 1819.56..146875.59 and an execution time of about 0.012 seconds, which is significantly faster.
Large text that caused errors with B-tree can also be searched without issues.
testdb=# SELECT id FROM table1 WHERE value = repeat('a', 1000000) AND md5(value) = md5(repeat('a', 1000000));
id
----
-2
(1 row)
Time: 54.118 ms
The index size was 590,536,704 bytes.
testdb=# SELECT *, pg_relation_size(indexname::regclass) FROM pg_indexes WHERE tablename = 'table1';
schemaname | tablename | indexname | tablespace | indexdef | pg_relation_size
------------+-----------+----------------+------------+-----------------------------------------------------------------------+------------------
public | table1 | table1_md5_idx | | CREATE INDEX table1_md5_idx ON public.table1 USING btree (md5(value)) | 590536704
(1 row)
Time: 13.024 ms
substring functional index
Restricting the number of characters with the substring function can achieve an effect similar to the md5 function.
Let's try creating it with 32 characters, same as md5.
CREATE INDEX ON table1(substring(value from 1 for 32));
testdb=# CREATE INDEX ON table1(substring(value from 1 for 32));
CREATE INDEX
Time: 78851.593 ms (01:18.852)
Just like when using the md5 function, add the same condition as the functional index to the WHERE clause.
SELECT id FROM table1 WHERE value = '1' AND substring(value from 1 for 32) = substring('1' from 1 for 32);
testdb=# EXPLAIN ANALYZE SELECT id FROM table1 WHERE value = '1' AND substring(value from 1 for 32) = substring('1' from 1 for 32);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on table1 (cost=1815.56..146871.59 rows=1 width=4) (actual time=6.278..6.280 rows=1 loops=1)
Recheck Cond: ("substring"(value, 1, 32) = '1'::text)
Filter: (value = '1'::text)
Heap Blocks: exact=1
-3 Bitmap Index Scan on table1_substring_idx (cost=0.00..1815.56 rows=50000 width=0) (actual time=6.267..6.267 rows=1 loops=1)
Index Cond: ("substring"(value, 1, 32) = '1'::text)
Planning Time: 0.092 ms
Execution Time: 6.302 ms
(8 rows)
Time: 6.855 ms
The functional index is used, resulting in a Bitmap Index Scan with a cost of 1815.56..146871.59 and an execution time of about 0.006 seconds, which is significantly faster.
You can see that the same effect as md5 was obtained.
Large text that caused errors with B-tree can also be searched without issues.
testdb=# SELECT id FROM table1 WHERE value = repeat('a', 1000000) AND substring(value from 1 for 32) = substring(repeat('a', 1000000) from 1 for 32);
id
----
-2
(1 row)
Time: 54.850 ms
The index size was 590,536,704 bytes. Perhaps because I matched the character count with md5, the size is almost the same as when created with md5.
testdb=# SELECT *, pg_relation_size(indexname::regclass) FROM pg_indexes WHERE tablename = 'table1';
schemaname | tablename | indexname | tablespace | indexdef | pg_relation_size
------------+-----------+----------------------+------------+--------------------------------------------------------------------------------------------+------------------
public | table1 | table1_substring_idx | | CREATE INDEX table1_substring_idx ON public.table1 USING btree ("substring"(value, 1, 32)) | 588972032
(1 row)
Time: 2.016 ms
Naturally, reducing the number of characters extracted by substring can also reduce the index size.
Let's try it with 10 characters.
CREATE INDEX ON table1(substring(value from 1 for 10));
testdb=# CREATE INDEX ON table1(substring(value from 1 for 10));
CREATE INDEX
Time: 66769.009 ms (01:06.769)
The execution plan doesn't seem to change much even with 10 characters.
testdb=# EXPLAIN ANALYZE SELECT id FROM table1 WHERE value = '1' AND substring(value from 1 for 10) = substring('1' from 1 for 10);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on table1 (cost=1147.44..146203.47 rows=1 width=4) (actual time=0.039..0.040 rows=1 loops=1)
Recheck Cond: ("substring"(value, 1, 10) = '1'::text)
Filter: (value = '1'::text)
Heap Blocks: exact=1
-3 Bitmap Index Scan on table1_substring_idx (cost=0.00..1147.43 rows=50000 width=0) (actual time=0.033..0.033 rows=1 loops=1)
Index Cond: ("substring"(value, 1, 10) = '1'::text)
Planning Time: 0.091 ms
Execution Time: 0.060 ms
(8 rows)
Time: 0.456 ms
The index size decreased from 590,536,704 bytes to 315,318,272 bytes.
testdb=# SELECT *, pg_relation_size(indexname::regclass) FROM pg_indexes WHERE tablename = 'table1';
schemaname | tablename | indexname | tablespace | indexdef | pg_relation_size
------------+-----------+----------------------+------------+--------------------------------------------------------------------------------------------+------------------
public | table1 | table1_substring_idx | | CREATE INDEX table1_substring_idx ON public.table1 USING btree ("substring"(value, 1, 10)) | 315318272
(1 row)
Time: 1.354 ms
Summary
I found that by using md5 or substring to create an index with restricted character counts, you can avoid errors when creating indexes for large text and improve performance for searches using those indexes.
Comparing md5 and substring, md5 takes more time to execute. This is because md5 performs a hash calculation by looking at the entire string, whereas substring only looks at a portion of the characters.
Consequently, the time required to create the index is also longer for md5. Additionally, with substring, using fewer characters further reduces the time.
The following are the measured times (in milliseconds) for repeated execution of CREATE INDEX.
| md5(value) | substring(value from 1 for 32) | substring(value from 1 for 10) | |
|---|---|---|---|
| 1 | 82,491.53 | 72,999.52 | 65,689.24 |
| 2 | 81,820.62 | 71,914.60 | 65,977.31 |
| 3 | 82,934.11 | 71,851.98 | 65,046.34 |
| Average | 82,415.42 | 72,255.37 | 65,570.96 |
In the case of substring, since it only looks at the leading characters, there is a possibility that the parts extracted by substring will have similar values—in other words, the cardinality might be low. If that happens, the effectiveness of the index will also decrease.
Conversely, md5 will not result in similar values and will maintain high cardinality, so this type of problem does not occur.
Based on the above, I believe it is best to create an index with a character count that maintains sufficient cardinality using substring. If it cannot be maintained with substring, then using md5 would be the way to go.
Also, although I didn't try it this time, there is a method of using something like full-text search (such as pg_bigm). However, that is generally intended for cases like partial matches; for exact matches like this one, I think a functional index is sufficient.
Reference: Support in MySQL
In MySQL, you can specify a prefix length for string column types (such as CHAR, VARCHAR, TEXT). (Prefix length is required for the TEXT type)
CREATE INDEX ON table1(value(32));
A very convenient thing about this is that you don't need to be aware of the prefix length when searching.
Unlike in PostgreSQL, there is no need to specify a substring function in the WHERE clause, so you won't run into a situation where you prepared a functional index but it wasn't used because it wasn't specified in the WHERE clause.
It would be convenient if something similar could be done in PostgreSQL.
Discussion