iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
💽

An In-depth Study of SQLite Internals

に公開

I recently read through the official SQLite documentation and wanted to summarize some of the interesting content I found.

While I have never used SQLite professionally, I have used it quite a bit for personal projects. For simple web services built as hobbies, SQLite is often more than sufficient as a database.
However, there are parts of SQLite that I have been using somewhat intuitively, and I have always felt the need to understand them more thoroughly.

SQLite Tables

There are three types of tables in SQLite:

  • Virtual
  • WITHOUT ROWID
  • ROWID

Virtual Tables

These are tables that allow you to operate on data outside the control of SQLite as if it were managed by SQLite. Examples include csv, which treats CSV files as tables, and fsdir, which parses specified paths or directories to treat them as tables.

Virtual tables have the following limitations:

  • You cannot create triggers.
  • You cannot create indexes using CREATE INDEX.

ROWID Tables

These are standard tables.

When creating an SQLite table, even if you specify a PRIMARY KEY, it might not be used as the primary key internally. Consider the following table:

CREATE TABLE word_counts (
  word TEXT PRIMARY KEY,
  cnt INTEGER
);

Even though the TEXT column word is specified as the PRIMARY KEY, it does not become the true primary key. Instead, the table maintains a separate INTEGER primary key called ROWID. In this case, the primary key on the word column of the word_counts table is equivalent to a simple unique index.

In the original SQL standard, PRIMARY KEY columns must be NOT NULL. However, in SQLite, PRIMARY KEY columns were allowed to be NULL in the initial release. To maintain backward compatibility, SQLite still allows PRIMARY KEY columns to be NULL.

So, how can you make the true primary key match the specified PRIMARY KEY? If the primary key is of type INTEGER, it will match the ROWID.

To be precise, the primary key must be a single column specified as an INTEGER type. An INT type will not work.

If you specify a key other than INTEGER as the primary key, you need to perform double index scans. For example, in the word_counts table, you would first scan the index of word, and then scan the table index using the corresponding ROWID.

WITHOUT ROWID Tables

As explained in the ROWID tables section, if the PRIMARY KEY is not an INTEGER, there is an additional computational cost to scan the true ROWID primary key, as well as an additional storage cost to maintain the ROWID separately.

This is why WITHOUT ROWID tables were introduced.
Since WITHOUT ROWID tables do not have an INTEGER ROWID, they have the advantage of saving both computational and storage costs.

However, WITHOUT ROWID tables come with limitations (along with other minor restrictions):

  1. A PRIMARY KEY is mandatory.
  2. AUTOINCREMENT does not function.
  3. All data in PRIMARY KEY columns must be NOT NULL.

Point 1 is unlikely to be an issue in practice. It is not difficult for most tables to have a PRIMARY KEY, and I have rarely seen cases where one is not defined. Point 2 is also not a practical issue since you generally wouldn't use AUTOINCREMENT on a non-INTEGER column. Point 3 is generally not a problem, as PRIMARY KEY columns are supposed to be NOT NULL by the SQL standard.

Session Extension

The Session extension records changes to an SQLite database table and saves them into a changeset or patchset file.
It functions similarly to the UNIX patch command, providing a mechanism to record and exchange differences (diffs) made to an SQLite database.

By using this Session extension, if two people update the same SQLite database and exchange their changeset files at the end of the day, they can reflect each other's changes into their respective SQLite databases. While it might be difficult at first glance to see how this is useful, consider a scenario where there is a master SQLite database on a server, and each client copies it to perform updates offline. When they come back online, they can upload their changeset files to the server to merge the changes from each client.

Changeset vs. Patchset

There are several differences between changeset and patchset:

Changeset Patchset
Content Records details of changes (old/new values) Records primary keys and changes
Size Large Small
DELETE Records old values for all columns Records only the primary key
UPDATE Records values before and after change Records only the changed values
Use Case Data synchronization and UNDO operations Data synchronization

Transactions

SQLite is a transactional database.
This means that changes made to the database within a transaction appear as if they occurred instantaneously, or not at all.

Here is how SQLite achieves this.

Rollback Journal

First, let's look at the Rollback Journal mode.
In this mode, SQLite creates a rollback journal file when modifying the database file. This file contains the original data of the database pages that are scheduled for modification.
This enables recovery from the rollback journal file even if a power failure occurs during database writing, which would otherwise result in an incomplete update.

The specific procedure is as follows:

Various locks are involved here; I will explain each of these locks later. While some might seem unnecessary at first, each has a specific and important meaning.


1. Acquire SHARED lock
Since the database must be read before writing, a SHARED lock is acquired first.

2. Acquire RESERVED lock
When writing data, a RESERVED lock is acquired first. No changes have started at this stage.

3. Create rollback journal file
Before applying changes, create the rollback journal file and write the original contents of the database pages slated for modification. In other words, the journal contains the information necessary to revert the database to its original state.

4. Write to database in user memory space
Apply changes to the database in the user memory space. Since each database connection has an independent user memory space, these changes are not yet visible to other connections.

5. Flush rollback journal file
Flush the changes to non-volatile storage (HDD, SSD, etc.) by calling the fsync system call. This protects the database from unexpected power loss.

6. Acquire EXCLUSIVE lock
Before writing changes to the database file itself, acquire an EXCLUSIVE lock.
In practice, this is done by first acquiring a PENDING lock and then upgrading it to an EXCLUSIVE lock.

7. Write to database file
Write the changes to the disk.

8. Flush database file
Flush the changes to non-volatile storage.

9. Delete rollback journal
Once the changes are successfully written to non-volatile storage, the journal is no longer needed, so it is deleted.

10. Release EXCLUSIVE lock
Release the lock.


If power is lost before step 6, there is no issue because the database file itself has not been modified. However, if power is lost during steps 7 or 8, the database file might contain inconsistencies. In such cases, the rollback journal file is used to recover the database.

WAL

Next, there is WAL (Write-Ahead Logging) mode, which is a newer mode compared to Rollback Journal.

In Rollback Journal mode, the database file is constantly updated with the latest data, and the journal file temporarily holds pre-update data for rollbacks. In WAL mode, the relationship between the database file and the WAL file is reversed; the WAL file holds the latest state.

WAL mode has the following trade-offs:

Advantages

  • Significantly faster in most scenarios.
  • Readers do not block writers, and writers do not block readers, increasing concurrency.
  • Disk I/O operations tend to be more sequential.
  • Fewer fsync calls compared to Rollback Journal mode (making it more resilient on systems with fsync issues and generally more performant).

Disadvantages

  • Transactions involving multiple databases are atomic per database, but there is no atomicity spanning all databases.
  • Slightly slower than Rollback Journal mode for applications that are read-heavy but have rare writes.

In WAL mode, two files are automatically created in addition to the main database file: the wal file and the shm file. The former is the WAL file used to append changes as described, and the latter is the shm file, also known as wal-index, which serves to avoid full scans of the WAL file. Simply put, the shm file is an index that manages where data is written within the WAL file, and it is shared among processes using memory-mapped files.

WAL File

In WAL mode, changes are written to the WAL file. Therefore, the WAL file would grow indefinitely unless it is reset and the changes are moved to the database file. To prevent this, an operation called a "checkpoint" is performed when data accumulates in the WAL file. A checkpoint reflects the data accumulated in the WAL file into the original database file and resets the WAL file. Checkpoints are executed automatically by SQLite, but this can be disabled to allow the application to handle them manually.

1 WAL file growth degrades read performance because both the database file and the WAL file must be read during data retrieval. Conversely, write performance does not degrade because writes are simply appended to the WAL.

The WAL file is a simple append-only log file managed in units called "frames."

  • A frame consists of a header and one page of database data.
  • The frame header contains two fields: page number and database size.

A frame containing a non-zero database size field is a "commit frame," which signifies the end of a transaction.

The WAL file contains concrete examples of frames like the following:

[Frame1: page=5, size=0]   # Mid-transaction A
[Frame2: page=8, size=0]   # Mid-transaction A
[Frame3: page=2, size=100] # Transaction A commit. Identified as a commit frame because size is non-zero
[Frame4: page=5, size=0]   # Mid-transaction B
[Frame5: page=9, size=0]   # Mid-transaction B

In the example above, Frame 4 and 5 represent an incomplete transaction without a commit frame. Such incomplete transactions are ignored during reads. Therefore, data up to Frame 3 is considered valid. A commit is completed by updating the mxFrame in the wal-index after fsync-ing the commit frame.

wal-index

With only the WAL file, reading data would require a full linear scan of the WAL file to find the latest committed frame corresponding to a page number, which would degrade read performance. This is where the wal-index comes in.

The wal-index is an index that reverse-maps page numbers to the latest commit frame numbers.

Furthermore, the wal-index maintains mxFrame, which represents the number of valid committed frames within the WAL file. Since WAL frames are numbered starting from 1, mxFrame effectively serves as the index of the last valid commit frame in the WAL file. When reading data, the value of mxFrame is recorded as the reader's snapshot boundary.

WAL Locks

In WAL mode, there are two types of locks: database-level locks and wal-index-level locks. Regardless of whether it is a read or write operation, all connections always hold a SHARED lock in WAL mode. While database-level locks will be described later, I will touch upon wal-index locks here.

State Description
WAL_WRITE Exclusive lock. Required to append to the end of the WAL file.
WAL_READ Read lock.
WAL_CKPT_LOCK Exclusive lock. Required when executing a checkpoint.

There are also WAL_RECOVER_LOCK and WAL_CHECKOUT_LOCK, which are omitted here.

Writing to the Database

  1. Determine WAL append position: Acquire the WAL_WRITE lock and reference the wal-index to determine the append position.
    • The wal-index is checked here to verify if the snapshot's mxFrame (from when the database read started) matches the current latest mxFrame in the wal-index. If they differ, the transaction fails due to a potential lost update.
  2. Append to WAL file: Perform an append to the end (no lock required as it is an append operation). The transaction is marked committed by appending a commit frame with a non-zero database size as the final frame of the transaction.
  3. Release locks: Release the WAL_WRITE lock.

In Rollback Journal mode, since the database file itself is modified, locks had to be escalated (SHARED -> RESERVED -> PENDING -> EXCLUSIVE) to exclude other connections. In WAL mode, because the writer does not modify the database file, there is no need for locks that conflict with readers in the database file.

Reading from the Database

When reading from the database, the system first checks the WAL. If the data is not in the WAL, it reads from the database file.

  1. At the start of a read: Acquire a WAL_READ lock and record the current mxFrame as the reader's snapshot boundary.
  2. Page retrieval: When a specific page number P is required, search for the largest frame number less than or equal to the reader's mxFrame in the wal-index. If found, read from the corresponding frame in the WAL file; otherwise, read from the main database file.
  3. Read completion: Release the WAL_READ lock.

Checkpoint Operation

  1. Start checkpoint process: Acquire the WAL_CKPT_LOCK.
  2. Determine backfill range: Determine the "safely backfillable range" up to the minimum readMark value among all connections.
  3. Copy data from WAL to database: Copy page data from the WAL file to the database file, fsync the database file, and update nBackfill.
  4. Reset WAL: If nBackfill == mxFrame (meaning the WAL is fully backfilled) and no connections hold WAL_READ locks above slot 0, allow the WAL to be reused from the beginning.
  5. Release locks: Release the WAL_CKPT_LOCK.

Two new keywords appear here: nBackfill and readMark.

  • nBackfill is a field that records the number of frames in the WAL file that have been written back to the database file. When nBackfill == mxFrame, it indicates that all frames in the WAL have been written back, allowing the WAL file to be reset and reused from the start.
  • readMark is a slot that records which mxFrame snapshot an active read connection is currently viewing.
WAL File: [Frame1][Frame2][Frame3][Frame4][Frame5][Frame6][Frame7]
           ^              ^                                ^
        Reader A         Reader B                         Current
        readMark=0       readMark=3                       mxFrame=7
        (Read DB only)   (Referencing up to Frame 3)

In the example above, Reader B is reading up to Frame 3 (Reader B's readMark = 3). This means it is reading under the assumption that the data for Frames 1-3 exists in the WAL file, not the database file. If a checkpoint is executed here and, for instance, data up to Frame 4 is written to the database file, Reader B might unexpectedly see data that should not be visible within its snapshot boundary.

To prevent this, during a checkpoint, the system determines the minimum readMark of all connections as the safely backfillable range (maxSafeFrame) and moves frames from nBackfill+1 to maxSafeFrame into the database file.

Fault Tolerance During Checkpoint Operations

One might wonder: "During a checkpoint operation, data is copied from the WAL file to the database file. Is it really okay if power is lost while writing this data? In Rollback Journal mode, the database file update was handled so carefully, but in WAL mode, is it just glossed over?"

To be specific, let's explain why this is not an issue by assuming a power failure occurs during the database file update process. The process steps are as follows:

  1. fsync the WAL file (Note: As explained before, even if power is lost here, transactions without a commit frame are ignored, so there is no issue).
  2. Copy from WAL file to database file.
  3. fsync the main database.
  4. Update nBackfill in the wal-index.
  5. If the WAL can be reset, rewrite the WAL file header to allow reuse from the beginning.

Case 1: Power failure during steps 2 or 3

If power is lost during the write to the main database and the file enters an inconsistent state, the WAL file still exists. Executing the checkpoint process again will restore the database file to a normal state.

Case 2: Power failure during step 4

Similar to Case 1, the WAL file can be used to re-execute the checkpoint process to return the database to a normal state.

Since the main role of the wal-index is performance improvement and coordination between database connections, even if it is lost, it can simply be rebuilt from the WAL file.

Case 3: Power failure during step 5

We can further break this down into cases where the WAL file checksum fails or succeeds. In the former, the entire WAL file is simply ignored. Since the entire WAL file has already been copied to the database file, ignoring it is not a problem. In the latter, the checkpoint process is re-executed, which is inherently harmless.

Types of Database Locks

State Description
UNLOCKED No locks held. Cannot read or write.
SHARED Can read but cannot write. Multiple processes can hold SHARED locks simultaneously.
RESERVED Planning to write in the future, but currently only reading. Only one RESERVED lock is allowed at a time, but it can coexist with multiple SHARED locks. Unlike PENDING, new SHARED locks can be acquired while in the RESERVED state.
PENDING Trying to write to the database as soon as possible and is waiting for existing SHARED locks to be released to acquire an EXCLUSIVE lock. While a PENDING lock is held, no new SHARED locks can be acquired.
EXCLUSIVE Required to write to the database. Only one is allowed at a time, and it cannot coexist with any other lock.

One question that arises is the necessity of RESERVED and PENDING locks. Wouldn't just SHARED and EXCLUSIVE locks suffice?

This exists to avoid a phenomenon called "writer starvation." Since an EXCLUSIVE lock cannot coexist with any other lock, if reads are constantly occurring, an EXCLUSIVE lock might never be acquired, effectively preventing any writes to the database. PENDING locks exist to avoid this problem. Specifically, the following steps are taken:

  1. First, acquire a RESERVED lock. At this stage, it can coexist with SHARED locks, and new SHARED locks can still be acquired.
  2. Next, escalate the RESERVED lock to a PENDING lock. While a PENDING lock can coexist with existing SHARED locks, new SHARED locks cannot be acquired.
  3. Finally, once all SHARED locks are released, escalate the PENDING lock to an EXCLUSIVE lock and perform the write to the database.

This allows writes to occur even in environments where reads are constantly being performed.

Discussion