iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
📥

Building a Database from Scratch in Rust Part 23: Adding Logical Import to Restore Databases from Export JSON

に公開

In the previous post, we introduced read-only logical export and added:

  • export-table
  • export-db

This allowed us to safely extract schema and rows. At this point, we have a DB that can be quite easily "taken out":

  • It has a catalog
  • It can hold multi-table
  • It can be inspected
  • It can be viewed via CLI
  • It can be exported

Then, the next thing you naturally want is this:

I want to restore the JSON I took out back into the DB

In other words, this time is for logical import / restore.

The theme for this installment is:

Use the logical format defined for export as the restore path as well


It is natural for import to follow export

Last time's export was about safely pushing schema + rows out as JSON.

For example, table export was conceptually shaped like this:

{
  "format": "hpd-logical-export",
  "version": 1,
  "table": {
    "name": "users",
    "schema": { ... }
  },
  "rows": [
    { "values": [ ... ] },
    { "values": [ ... ] }
  ]
}

Given this, it is quite natural to want to use that JSON as input.

Therefore, instead of creating a new import-specific format, we adopt the policy:

Use the export JSON defined in T-022 as the sole import source

This consistency is very important.


This time, it's a "restore to empty space," not an overwrite

I want to emphasize first that this import is quite conservative.

To list what we won't do yet:

  • Merge into existing tables
  • Overwrite restore
  • Upsert
  • Partial overwrite
  • Conflict resolution

So, the premise for this time is:

Restore logical export into an empty target

By placing this constraint, the responsibilities become very clean.


The overall picture

There are two main imports this time:

  • import-table
  • import-db

Represented in a diagram:

export JSON
  ├─ table export    -> import_table()
  └─ database export -> import_database()

The roles of each are as follows:

import_table
  -> Restore schema + rows of 1 table to target DB

import_database
  -> Restore schema + rows of multiple tables to target DB

In other words, two entry points that act as counterparts to export are created.


Why use the export format as the import source?

This is very important from a design perspective.

If we were to create a separate import-specific format here, we would immediately run into these issues:

  • Export format
  • Import format
  • Conversion rules
  • Which is the master?
  • Version discrepancies

This is quite burdensome.

On the other hand, if we use the export JSON as the import source directly, it is quite simple:

  • Export can write it
  • Import can read it
  • Roundtrip is established
  • Easy to explain in documentation

In other words, this installment is also about:

Promoting the export format to a restore contract


What table import does

The flow for a minimal import_table is quite straightforward.

  1. Parse the input JSON
  2. Verify the format/version
  3. Restore the table schema
  4. Verify that no table with the same name exists in the target DB
  5. Perform create_table(schema)
  6. Perform open_table(name)
  7. Insert rows in sequence

Diagrammatically:

table export JSON
   |
   v
parse / validate
   |
   v
schema decode
   |
   v
create_table(schema)
   |
   v
open_table(name)
   |
   v
insert rows

The key here is to use the existing mutation path as is. We do not create a low-level write path specific to import.


What database import does

Restoring from a database export is fundamentally the same, but requires some caution since there are multiple tables involved.

For a minimal implementation, the following flow is natural:

  1. Parse / validate JSON
  2. Confirm that no table names collide with the target DB
  3. Perform create_table for all schemas
  4. Insert rows for each table in sequence

Diagrammatically:

database export JSON
   |
   v
parse / validate
   |
   v
check all table names are free
   |
   v
create all schemas
   |
   v
insert rows for each table

It is important here to "check for all name collisions first." Otherwise, restoring partially and then failing due to a collision later only increases the mess.


Value decoding is the inverse of export

In T-022, we made values type-tagged JSON. For example:

{"type":"uint64","value":1}
{"type":"string","value":"alice"}
{"type":"bool","value":true}
{"type":"null"}
{"type":"bytes","value_base64":"AQID"}

This import will decode this as is.

In other words, the task is quite clear:

  • uint64 -> Value::UInt64
  • string -> Value::String
  • bool -> Value::Bool
  • null -> Value::Null
  • bytes -> base64 decode -> Value::Bytes

Having this symmetry makes roundtrips much easier to implement.


Decoding Bytes from base64

Bytes was base64 during export. The import side will decode this.

Diagrammatically:

{"type":"bytes","value_base64":"AQID"}
   |
   v
Bytes([0x01, 0x02, 0x03])

If the base64 is malformed, we trigger a failure. It is important not to try to fix it ambiguously.


Restoring schema logically as well

Schemas are also restored directly from the export JSON. At a minimum, we need:

  • table name
  • columns
  • data type
  • nullable
  • primary key

Again, the important point here is to restore the logical schema, not the storage layout.

In other words, import takes the position of:

  • restoring the meaning of the table

instead of

  • restoring the raw bytes of the pages

Rejecting import on dirty DBs

As with export, we proceed with a safety-first approach. We reject dirty DBs in this import as well.

The reason is clear:

  • Import is a mutation
  • We don't want to automatically mix recovery into it
  • We don't want to pipe a restore into a DB in an uncertain state

Thus, the contract for this time is:

clean DB
  -> import allowed

dirty DB
  -> import rejected

This policy is quite easy to understand.


Therefore, we recommend using a "clean / empty" destination for import

Ideally, this import assumes a target that is:

  • Already clean-closed
  • Has no name collisions
  • As empty as possible

This is a bit strict, but I believe it is correct for an initial restore.

If you want to restore to a dirty DB as well, that is a task for the next stage. Right now, it is better not to mix mutation and recovery.


Rejecting collisions with existing table names

This is also important. For example, if you try to import an export that includes a users table, and a users table already exists in the target DB, we will cause a failure on the safe side this time.

The reason is simple:

  • Overwrite is not yet defined
  • Merge is not yet defined
  • There are no rules for which one to keep

So the contract is:

same table name exists
  -> import failure

This makes it quite clear.


Rollback is not guaranteed this time

This is something I should write honestly. This import is built on top of the existing mutation path. Therefore, if a row insert fails midway, the following state can occur:

  • Table was created
  • Rows were partially inserted
  • Failure on the remainder

In other words, partial restore is possible. We will not fully rollback this time.

The reason is that if we try to handle that much now, the implementation suddenly becomes quite heavy:

  • Catalog rollback
  • Table drop rollback
  • Inserted rows rollback
  • Multi-table rollback

That is work for the next stage. So, we will explicitly state this in the docs:

Partial restore is possible

We recommend importing to an empty destination

This trade-off is important.


Building import on the existing mutation path

The most important thing in the design of this import is not to write low-level code specific to restore.

In other words, we use existing paths:

  • create_table
  • open_table
  • insert

Diagrammatically:

import layer
   |
   v
existing mutation APIs
   |
   v
catalog / storage / wal / index

With this, we can delegate everything to existing paths:

  • WAL contract
  • Index updates
  • Table storage updates

This is a solid approach.


Enabling execution via CLI

This time, we will also allow execution via CLI, not just via API. For a start, the following two commands are sufficient:

db-cli import-table --db ./restore-db --in users.json
db-cli import-db --db ./restore-db --in mydb-export.json

Here too, the CLI acts strictly as a thin wrapper:

  • Open the file
  • Pass the input stream
  • Call the import API

We do not push format interpretation into the CLI.


Internal image of import_table

Conceptually, the code looks like this:

pub fn import_table(&mut self, input: impl Read) -> Result<(), DbError> {
    self.ensure_database_is_clean()?;

    let export = parse_table_export(input)?;
    validate_table_export(&export)?;

    if self.describe_table(&export.table.name)?.is_some() {
        return Err(DbError::Internal("table already exists".into()));
    }

    self.create_table(export.table.schema.clone())?;
    let mut table = self.open_table(&export.table.name)?;

    for row in export.rows {
        table.insert(decode_export_row(row)?)?;
    }

    Ok(())
}

The key here is to use the existing table handle as is.


Internal image of import_database

Database import is a bit more staged:

pub fn import_database(&mut self, input: impl Read) -> Result<(), DbError> {
    self.ensure_database_is_clean()?;

    let export = parse_database_export(input)?;
    validate_database_export(&export)?;

    for table in &export.tables {
        if self.describe_table(&table.name)?.is_some() {
            return Err(DbError::Internal("table already exists".into()));
        }
    }

    for table in &export.tables {
        self.create_table(table.schema.clone())?;
    }

    for table in export.tables {
        let mut handle = self.open_table(&table.name)?;
        for row in table.rows {
            handle.insert(decode_export_row(row)?)?;
        }
    }

    Ok(())
}

This allows us to finish collision checks before entering the restore process.


The export -> import roundtrip is quite important

The value most easily demonstrated this time is the roundtrip:

original DB
   |
   v
export JSON
   |
   v
import into clean target
   |
   v
same logical state

This is quite powerful because:

  • The format is consistent
  • The export becomes a future restore source
  • The logical layer contract solidifies

Naturally importing states after rename / drop

Since we have been building up the multi-table lifecycle, it is also important that the state after export can be restored as is.

For example, after a rename:

  • users -> customers

If you export-db after changing the name, it should be restored as customers in import-db as well.

The same goes for drops:

  • If you drop orders and export
  • orders is not included in the import

In other words, import/export must be consistent with the catalog, which is the source of truth.


End-to-end samples look quite impressive

What is easiest to show in this article is the export -> import roundtrip.

let mut source = MultiTableDb::open("./source-db")?;

source.create_table(users_schema())?;
source.create_table(orders_schema())?;

{
    let mut users = source.open_table("users")?;
    users.insert(user_row(1, "alice"))?;
    users.insert(user_row(2, "bob"))?;
}

{
    let mut orders = source.open_table("orders")?;
    orders.insert(order_row(100, 1, "book"))?;
}

We export here:

db-cli export-db --db ./source-db --out export.json

Next, we import to an empty target:

db-cli import-db --db ./target-db --in export.json

And we verify:

db-cli tables --db ./target-db
orders
users
db-cli describe users --db ./target-db
table: users
prefix: users.table
columns:
  - id: UInt64 (pk, not null)
  - name: String (not null)

This gives quite a feeling of restoration.


Division of roles between export/import is becoming clear

By this point, the roles of export and import are neatly separated.

export

  • read-only
  • logical extraction
  • reject if dirty

import

  • mutation
  • logical restoration
  • reject if target is dirty
  • rollback is still weak

Diagrammatically:

export
  -> safe read-only output

import
  -> safe-ish restore into clean target

This contrast is important.


Things we won't do yet

Writing this clearly makes for a better article. Things we are not doing this time include:

  • Overwrite restore
  • Merge import
  • Upsert restore
  • Guaranteed rollback
  • Binary snapshot restore
  • Incremental restore
  • Online transaction-consistent restore
  • Remote replication

Why not now?

Because what is needed this time is:

The first restore path to act as a counterpart to export

So, what we want now is this minimal contract:

  • Read export JSON
  • Restore to a clean target
  • Reuse existing mutation paths
  • Fail on dirty / collision states by falling back to the safe side

Responsibility separation this time

The responsibility separation visible in this import is quite clean:

CLI
  └─ input file designation
     thin wrapper

import API
  └─ format parse / validate
     create/open/insert orchestration

catalog
  └─ source of truth for table existence / schema

existing mutation path
  └─ create_table / open_table / insert

recovery / cleanup / checkpoint
  └─ separate from import

Crucially:

  • Import does not re-implement storage/WAL/index
  • Export format is the sole source
  • CLI does not re-implement logical format

With this separation, we maintain a foothold for considering overwrites or rollbacks next.


Thoughts

This task feels like the installment where we "leave and return to the closed world."

With the export in T-022, the DB became able to go outside. But if it couldn't come back, it would still be a one-way street. By adding import this time, we have made significant progress:

  • Can be taken out
  • Can be returned
  • Can perform roundtrips

Of course, it is not yet a powerful restore:

  • Dirty targets are rejected
  • Rollback is weak
  • No overwrite or merge

But even so, it is quite a solid approach for an initial restore. Especially, using the export format as the restore source directly is very clean.


Summary

This time, we introduced logical import / restore and added:

  • import-table
  • import-db

to enable restoration of tables / databases from export JSON.

In short, this was the installment where we raised the logical export made in T-022 into a contract that can be brought back.

With this, our self-made DB moves closer to being a small system that can not only move its state outside but also safely restore that state elsewhere.

Next, the natural theme will be to add preflight / dry-run verification for imports on top of this foundation. Since we now have the entry point for restore, moving toward "confirming what will happen before actually writing" should take us much closer to practical utility.

See you next time!

Discussion