iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🐕

Building a Database from Scratch in Rust Part 8: Graduating from Full Scans — Implementing a Minimal Primary Key Index

に公開

Last time, we introduced a file-backed TableStorage and progressed to the point where we can:

  • insert rows
  • get rows by RowId
  • read all rows with scan()
  • read data even after reopening

This was a major step forward. However, once you get to this point, you quickly feel a limitation.

It is painful to have to scan every time to find a row

For example, if you want to get the "user with id = 42," you currently have no choice but to check all pages and all slots in order.

So, this time, we will introduce a minimal primary key index.

Our goal this time is simple:

  • Extract the primary key from a row
  • Map it to a RowId
  • Enable retrieving rows via get_by_key(key)

In short, the theme for this time is:

Implement a DB-like read path of key -> RowId -> Row with minimal configuration


The Structure Until Last Time

As of last time, the structure looked roughly like this:

Row
  |
  v
row codec
  |
  v
Page
  |
  v
Pager
  |
  v
File

TableStorage
  ├─ insert(row)
  ├─ get(RowId)
  └─ scan()

In other words, you could save a row and read it if you knew the RowId, or find it by traversing everything with scan(). However, the layer to "search by key" did not exist yet.


The Big Picture for This Time

We are now adding a new PrimaryIndex:

                    +----------------------+
                    |   PrimaryIndex       |
                    |   key -> RowId       |
                    +----------------------+
                              |
                              v
+-----------+     +----------------------+     +----------------------+
| Row       | --> | IndexedTableStorage  | --> | TableStorage         |
+-----------+     | - insert(row)        |     | - insert/get/scan    |
                  | - get_by_key(key)    |     +----------------------+
                  | - rebuild_index()    |                 |
                  +----------------------+                 v
                                                    Page / Pager / File

From now on, the structure will be:

  • TableStorage for persistence
  • PrimaryIndex for search routing
  • A thin facade to connect them

What is important here is that the index does not know about files or pages. The index focuses strictly on key -> RowId.


Why an In-Memory Index at First?

It is natural to think, "Since I'm at it, I want to make the index persistent too." But we won't go there this time.

The reason is simple: what we need right now is:

The initial contract for key lookup

In other words, what we want first is a minimal read path that can:

  • Extract a primary key
  • Detect duplicates
  • Resolve a RowId from a key
  • Read a row from a RowId

If we tried to go straight to index persistence, we would have to tackle all of these at once:

  • Index page layout
  • Index file format
  • Index rebuild
  • Consistency during a crash
  • Synchronization with the table file

That is too heavy to do right now. So, keeping it in-memory at first is the natural choice.


How to Represent PrimaryKey

This is a subtle but important design point.

One could use db-core::Value directly as a key, but it is cleaner to avoid that initially. The reason is that the types we want to allow as primary keys are not necessarily identical to the types that can be held as general values.

For example, in this minimal configuration, these restrictions are natural:

  • Int64 is allowed
  • UInt64 is allowed
  • Bool is allowed
  • String is allowed
  • Bytes is allowed
  • Float64 is not allowed
  • Null is not allowed

In short, it is easier to understand if we have an explicit type for the primary key.

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum PrimaryKey {
    Int64(i64),
    UInt64(u64),
    Bool(bool),
    String(String),
    Bytes(Vec<u8>),
}

By using this form, we gain the following advantages:

  • Since it has Ord, it can be used in a BTreeMap
  • Float64 can be excluded from the start
  • Null can be excluded from the start

Why Exclude Float64?

There are technically situations where you might want to use Float64 as a key, but it is safer to exclude it for the first primary key implementation. There are several reasons for this:

  • Handling NaN is difficult
  • It does not mesh well with total ordering
  • The meaning of comparisons can sometimes deviate from intuition

In the stage of building the first "easy-to-understand primary key index," it is cleaner design not to force it here.


Why BTreeMap?

HashMap is also a candidate for the first index implementation, but BTreeMap is quite natural for this purpose. The reasons are:

  • Easy to handle based on Ord
  • Easily expandable to range queries in the future
  • Stable and easy to explain
  • Fast enough for a first index

In other words, this time, the decision is:

6 Choose a minimal structure that is easier to expand upon, rather than aiming for top speed


extract_primary_key Is the Entry Point This Time

Before putting a row into the index, this is what is needed first:

Row + TableSchema
      |
      v
extract_primary_key(...)
      |
      v
PrimaryKey

This is the "entry point for converting a logical row into an index key" for this iteration.

The sample code looks something like this:

pub fn extract_primary_key(schema: &TableSchema, row: &Row) -> Result<PrimaryKey, DbError> {
    row.validate(schema)?;

    let pk_index = schema.primary_key_index;
    let value = &row.values[pk_index];

    match value {
        Value::Int64(v) => Ok(PrimaryKey::Int64(*v)),
        Value::UInt64(v) => Ok(PrimaryKey::UInt64(*v)),
        Value::Bool(v) => Ok(PrimaryKey::Bool(*v)),
        Value::String(v) => Ok(PrimaryKey::String(v.clone())),
        Value::Bytes(v) => Ok(PrimaryKey::Bytes(v.clone())),
        Value::Null => Err(DbError::InvalidRow("primary key cannot be null".into())),
        Value::Float64(_) => Err(DbError::InvalidRow("float64 is not supported as primary key".into())),
    }
}

Although it looks simple, this clearly defines:

  • Which types are allowed
  • Which types are excluded
  • How to follow the schema

PrimaryIndex Itself Can Be Surprisingly Thin

The PrimaryIndex for this iteration can be extremely thin.

pub struct PrimaryIndex {
    map: BTreeMap<PrimaryKey, RowId>,
}

A minimal API is sufficient:

  • insert(key, row_id)
  • get(&key)
  • len()
  • clear()

What is more important is "what it does not hold." The PrimaryIndex here does not contain:

  • File I/O
  • Page information
  • Entire schema
  • Row codec
  • Scan
  • Durability

In other words, it is a conscious decision that the index is nothing more than a reference table.


key -> RowId -> Row in Diagrams

What is most important this time is this single path:

PrimaryKey
   |
   v
PrimaryIndex
   |
   v
RowId(page_id, slot_id)
   |
   v
TableStorage::get(RowId)
   |
   v
Row

Once this form is established, it becomes much more like a database.

For example, the flow to search for id = 42 looks like this:

42
 |
 v
PrimaryKey::UInt64(42)
 |
 v
PrimaryIndex -> RowId { page_id: 3, slot_id: 1 }
 |
 v
TableStorage::get(...)
 |
 v
Row { id: 42, name: "alice", ... }

At this point, "where the row is saved" and "how to find it" are separated. This is quite significant.


What Happens During an insert?

The flow during an insert becomes:

Row
 |
 v
extract_primary_key
 |
 v
duplicate check on PrimaryIndex
 |
 v
TableStorage::insert(row)
 |
 v
RowId
 |
 v
PrimaryIndex::insert(key, row_id)

Visualized as a diagram, it looks like this:

+--------+
| Row    |
+--------+
    |
    v
+-------------------+
| extract key       |
+-------------------+
    |
    v
+-------------------+
| duplicate check   |
+-------------------+
    |
    v
+-------------------+
| storage insert    |
+-------------------+
    |
    v
+-------------------+
| RowId             |
+-------------------+
    |
    v
+-------------------+
| index insert      |
+-------------------+

This order is quite important:

  • Create the key first
  • Verify duplicates
  • Write to storage
  • Finally, update the index

By doing this, at least "obvious duplicates" can be blocked before they reach the storage.


Sample Code: insert

Conceptually, it looks like this:

pub fn insert(&mut self, row: &Row) -> Result<RowId, DbError> {
    let key = extract_primary_key(&self.schema, row)?;

    if self.index.get(&key).is_some() {
        return Err(DbError::KeyAlreadyExists("duplicate primary key".into()));
    }

    let row_id = self.storage.insert(row)?;
    self.index.insert(key, row_id)?;

    Ok(row_id)
}

The good thing about this code is that its responsibilities are very clear:

  • Key extraction is the job of the key extractor
  • Duplicate checking is the job of the index
  • Persistence is the job of the storage
  • Linking is the job of the index

Each layer does only its own job.


Sample Code: get_by_key

get_by_key is quite straightforward:

pub fn get_by_key(&mut self, key: &PrimaryKey) -0 Result0Option0Row0, DbError0 {
    let row_id = match self.index.get(key) {
        Some(row_id) =0 row_id,
        None =0 return Ok(None),
    };

    self.storage.get(row_id)
}

With this in place, we can graduate from the state of "having to scan everything to find a row."

Of course, since this is still an in-memory index, a rebuild is required after restarting. However, it still holds significant meaning as a read path.

How to Handle Reopening

Here, the important point is that the index has not yet been persisted.

In other words, only the table storage side remains in the file.

persisted:
  - table file
  - page contents
  - rows

not persisted yet:
  - primary index

So how do we handle this after reopening?

Initially, this is sufficient:

scan() to rebuild the index

Represented in a diagram:

reopen table
   |
   v
scan all rows
   |
   v
extract primary key from each row
   |
   v
rebuild BTreeMap<PrimaryKey, RowId>

This might be slow. However, what we need right now is the "initial correct contract." Fast index rebuilding comes later.


Sample Code: rebuild_index

pub fn rebuild_index(&mut self) -> Result<(), DbError> {
    self.index.clear();

    for (row_id, row) in self.storage.scan()? {
        let key = extract_primary_key(&self.schema, &row)?;
        self.index.insert(key, row_id)?;
    }

    Ok(())
}

If any duplicate keys are mixed in, they can be detected here. This is also important.

In other words, a rebuild is not just a reconstruction; it also serves as a primary key integrity check for the storage side.


An End-to-End Sample Looks Quite Impressive

let schema = user_schema();
let mut table = IndexedTableStorage::open(&path, schema.clone(), 8192)?;

// insert
table.insert(&Row {
    values: vec![
        Value::UInt64(1),
        Value::String("alice".into()),
        Value::Bool(true),
        Value::Null,
    ],
})?;

table.insert(&Row {
    values: vec![
        Value::UInt64(2),
        Value::String("bob".into()),
        Value::Bool(false),
        Value::String("memo".into()),
    ],
})?;

// key lookup
let row = table.get_by_key(&PrimaryKey::UInt64(2))?.unwrap();
assert_eq!(row.values[1], Value::String("bob".into()));

// reopen + rebuild
drop(table);

let mut reopened = IndexedTableStorage::open(&path, schema, 8192)?;
reopened.rebuild_index()?;

let row_again = reopened.get_by_key(&PrimaryKey::UInt64(1))?.unwrap();
assert_eq!(row_again.values[1], Value::String("alice".into()));

The good part about this sample is that you can see everything:

  • insert
  • key lookup
  • reopen
  • rebuild
  • re-lookup

It really gives you the feeling of "building the read path of a small DB yourself."


What We Are Not Doing Yet

Emphasizing this actually makes the article tighter.

Things we are not doing this time include:

  • Index persistence
  • Secondary indexes
  • Full-fledged range query API
  • Planner integration
  • Design for full synchronization between delete and index
  • Compaction and index updates
  • WAL / recovery and index consistency
  • Concurrent index updates

Why not now?

Because what we need right now is:

The minimal primary key read path

We need the initial contract of being able to:

  • Create keys
  • Find by keys
  • Rebuild after reopening

Responsibility Separation This Time

The responsibility separation that becomes quite clear in this installment is as follows:

db-core::Row / TableSchema
    └─ Logical row and schema

PrimaryKey extractor
    └─ Create a key from a row

PrimaryIndex
    └─ key -> RowId

TableStorage
    └─ RowId <-> Row
       (insert / get / scan)

Indexed facade
    └─ Connects storage and index

What is particularly important is:

  • The index does not know about pages
  • The storage does not know about keys
  • The facade remains a thin bridge

With this separation, the structure is less likely to collapse when adding index persistence or secondary indexes later.


Impressions

I think this installment introduces quite a bit of "DB-like searching."

Up until the last time, we were at the stage of:

  • Being able to save
  • Being able to read
  • Being able to scan

This time, we took a step forward, and a more database-like read path appeared:

Find by key and read the row

And the good thing is that we didn't make it too complicated from the start:

  • in-memory index
  • BTreeMap
  • rebuild on reopen
  • PrimaryKey -> RowId -> Row

It's quite simple, but for that very reason, the core of the design is easy to see.

Personally, I really like this feeling of "graduating from a scan-only world." It's small, but there is a clear sense of moving forward.


Summary

This time, we introduced a minimal primary key index to perform the following tasks:

  • Extract a primary key from a row
  • Maintain key -> RowId
  • Establish get_by_key(key)
  • Rebuild via scan() after reopening

In short, this is the installment where we add the first "searching mechanism" on top of our storage.

The natural next steps look something like this:

  • Index persistence
  • Delete/update and index synchronization
  • Simple CRUD from the API layer
  • Write ordering with an eye toward WAL

Little by little, our table storage has evolved from a "place to save" to a "place where you can find and read."

Next, it would be natural to prepare an easy-to-use CRUD entry point from the API layer to make this primary key read path even more practical.

For example:

  • create/open table
  • insert
  • get by key
  • scan
  • simple delete

Now that the storage and index are in place, it might be the time to move toward a surface API regarding "how to use it from the outside."

This was a long one, but please stay tuned for the next installment!


Summary Diagram for the End of the Article

Before T-008
------------
scan() -> find RowId -> read Row

After T-008
-----------
PrimaryKey -> RowId -> Row

Discussion