iTranslated by AI
Building a Database from Scratch in Rust Part 23: Adding Logical Import to Restore Databases from Export JSON
In the previous post, we introduced read-only logical export and added:
export-tableexport-db
This allowed us to safely extract schema and rows. At this point, we have a DB that can be quite easily "taken out":
- It has a catalog
- It can hold multi-table
- It can be inspected
- It can be viewed via CLI
- It can be exported
Then, the next thing you naturally want is this:
I want to restore the JSON I took out back into the DB
In other words, this time is for logical import / restore.
The theme for this installment is:
Use the logical format defined for export as the restore path as well
It is natural for import to follow export
Last time's export was about safely pushing schema + rows out as JSON.
For example, table export was conceptually shaped like this:
{
"format": "hpd-logical-export",
"version": 1,
"table": {
"name": "users",
"schema": { ... }
},
"rows": [
{ "values": [ ... ] },
{ "values": [ ... ] }
]
}
Given this, it is quite natural to want to use that JSON as input.
Therefore, instead of creating a new import-specific format, we adopt the policy:
Use the export JSON defined in T-022 as the sole import source
This consistency is very important.
This time, it's a "restore to empty space," not an overwrite
I want to emphasize first that this import is quite conservative.
To list what we won't do yet:
- Merge into existing tables
- Overwrite restore
- Upsert
- Partial overwrite
- Conflict resolution
So, the premise for this time is:
Restore logical export into an empty target
By placing this constraint, the responsibilities become very clean.
The overall picture
There are two main imports this time:
import-tableimport-db
Represented in a diagram:
export JSON
├─ table export -> import_table()
└─ database export -> import_database()
The roles of each are as follows:
import_table
-> Restore schema + rows of 1 table to target DB
import_database
-> Restore schema + rows of multiple tables to target DB
In other words, two entry points that act as counterparts to export are created.
Why use the export format as the import source?
This is very important from a design perspective.
If we were to create a separate import-specific format here, we would immediately run into these issues:
- Export format
- Import format
- Conversion rules
- Which is the master?
- Version discrepancies
This is quite burdensome.
On the other hand, if we use the export JSON as the import source directly, it is quite simple:
- Export can write it
- Import can read it
- Roundtrip is established
- Easy to explain in documentation
In other words, this installment is also about:
Promoting the export format to a restore contract
What table import does
The flow for a minimal import_table is quite straightforward.
- Parse the input JSON
- Verify the format/version
- Restore the table schema
- Verify that no table with the same name exists in the target DB
- Perform
create_table(schema) - Perform
open_table(name) - Insert rows in sequence
Diagrammatically:
table export JSON
|
v
parse / validate
|
v
schema decode
|
v
create_table(schema)
|
v
open_table(name)
|
v
insert rows
The key here is to use the existing mutation path as is. We do not create a low-level write path specific to import.
What database import does
Restoring from a database export is fundamentally the same, but requires some caution since there are multiple tables involved.
For a minimal implementation, the following flow is natural:
- Parse / validate JSON
- Confirm that no table names collide with the target DB
- Perform
create_tablefor all schemas - Insert rows for each table in sequence
Diagrammatically:
database export JSON
|
v
parse / validate
|
v
check all table names are free
|
v
create all schemas
|
v
insert rows for each table
It is important here to "check for all name collisions first." Otherwise, restoring partially and then failing due to a collision later only increases the mess.
Value decoding is the inverse of export
In T-022, we made values type-tagged JSON. For example:
{"type":"uint64","value":1}
{"type":"string","value":"alice"}
{"type":"bool","value":true}
{"type":"null"}
{"type":"bytes","value_base64":"AQID"}
This import will decode this as is.
In other words, the task is quite clear:
-
uint64->Value::UInt64 -
string->Value::String -
bool->Value::Bool -
null->Value::Null -
bytes-> base64 decode ->Value::Bytes
Having this symmetry makes roundtrips much easier to implement.
Decoding Bytes from base64
Bytes was base64 during export. The import side will decode this.
Diagrammatically:
{"type":"bytes","value_base64":"AQID"}
|
v
Bytes([0x01, 0x02, 0x03])
If the base64 is malformed, we trigger a failure. It is important not to try to fix it ambiguously.
Restoring schema logically as well
Schemas are also restored directly from the export JSON. At a minimum, we need:
- table name
- columns
- data type
- nullable
- primary key
Again, the important point here is to restore the logical schema, not the storage layout.
In other words, import takes the position of:
- restoring the meaning of the table
instead of
- restoring the raw bytes of the pages
Rejecting import on dirty DBs
As with export, we proceed with a safety-first approach. We reject dirty DBs in this import as well.
The reason is clear:
- Import is a mutation
- We don't want to automatically mix recovery into it
- We don't want to pipe a restore into a DB in an uncertain state
Thus, the contract for this time is:
clean DB
-> import allowed
dirty DB
-> import rejected
This policy is quite easy to understand.
Therefore, we recommend using a "clean / empty" destination for import
Ideally, this import assumes a target that is:
- Already clean-closed
- Has no name collisions
- As empty as possible
This is a bit strict, but I believe it is correct for an initial restore.
If you want to restore to a dirty DB as well, that is a task for the next stage. Right now, it is better not to mix mutation and recovery.
Rejecting collisions with existing table names
This is also important. For example, if you try to import an export that includes a users table, and a users table already exists in the target DB, we will cause a failure on the safe side this time.
The reason is simple:
- Overwrite is not yet defined
- Merge is not yet defined
- There are no rules for which one to keep
So the contract is:
same table name exists
-> import failure
This makes it quite clear.
Rollback is not guaranteed this time
This is something I should write honestly. This import is built on top of the existing mutation path. Therefore, if a row insert fails midway, the following state can occur:
- Table was created
- Rows were partially inserted
- Failure on the remainder
In other words, partial restore is possible. We will not fully rollback this time.
The reason is that if we try to handle that much now, the implementation suddenly becomes quite heavy:
- Catalog rollback
- Table drop rollback
- Inserted rows rollback
- Multi-table rollback
That is work for the next stage. So, we will explicitly state this in the docs:
Partial restore is possible
We recommend importing to an empty destination
This trade-off is important.
Building import on the existing mutation path
The most important thing in the design of this import is not to write low-level code specific to restore.
In other words, we use existing paths:
create_tableopen_tableinsert
Diagrammatically:
import layer
|
v
existing mutation APIs
|
v
catalog / storage / wal / index
With this, we can delegate everything to existing paths:
- WAL contract
- Index updates
- Table storage updates
This is a solid approach.
Enabling execution via CLI
This time, we will also allow execution via CLI, not just via API. For a start, the following two commands are sufficient:
db-cli import-table --db ./restore-db --in users.json
db-cli import-db --db ./restore-db --in mydb-export.json
Here too, the CLI acts strictly as a thin wrapper:
- Open the file
- Pass the input stream
- Call the import API
We do not push format interpretation into the CLI.
Internal image of import_table
Conceptually, the code looks like this:
pub fn import_table(&mut self, input: impl Read) -> Result<(), DbError> {
self.ensure_database_is_clean()?;
let export = parse_table_export(input)?;
validate_table_export(&export)?;
if self.describe_table(&export.table.name)?.is_some() {
return Err(DbError::Internal("table already exists".into()));
}
self.create_table(export.table.schema.clone())?;
let mut table = self.open_table(&export.table.name)?;
for row in export.rows {
table.insert(decode_export_row(row)?)?;
}
Ok(())
}
The key here is to use the existing table handle as is.
Internal image of import_database
Database import is a bit more staged:
pub fn import_database(&mut self, input: impl Read) -> Result<(), DbError> {
self.ensure_database_is_clean()?;
let export = parse_database_export(input)?;
validate_database_export(&export)?;
for table in &export.tables {
if self.describe_table(&table.name)?.is_some() {
return Err(DbError::Internal("table already exists".into()));
}
}
for table in &export.tables {
self.create_table(table.schema.clone())?;
}
for table in export.tables {
let mut handle = self.open_table(&table.name)?;
for row in table.rows {
handle.insert(decode_export_row(row)?)?;
}
}
Ok(())
}
This allows us to finish collision checks before entering the restore process.
The export -> import roundtrip is quite important
The value most easily demonstrated this time is the roundtrip:
original DB
|
v
export JSON
|
v
import into clean target
|
v
same logical state
This is quite powerful because:
- The format is consistent
- The export becomes a future restore source
- The logical layer contract solidifies
Naturally importing states after rename / drop
Since we have been building up the multi-table lifecycle, it is also important that the state after export can be restored as is.
For example, after a rename:
-
users->customers
If you export-db after changing the name, it should be restored as customers in import-db as well.
The same goes for drops:
- If you drop
ordersand export -
ordersis not included in the import
In other words, import/export must be consistent with the catalog, which is the source of truth.
End-to-end samples look quite impressive
What is easiest to show in this article is the export -> import roundtrip.
let mut source = MultiTableDb::open("./source-db")?;
source.create_table(users_schema())?;
source.create_table(orders_schema())?;
{
let mut users = source.open_table("users")?;
users.insert(user_row(1, "alice"))?;
users.insert(user_row(2, "bob"))?;
}
{
let mut orders = source.open_table("orders")?;
orders.insert(order_row(100, 1, "book"))?;
}
We export here:
db-cli export-db --db ./source-db --out export.json
Next, we import to an empty target:
db-cli import-db --db ./target-db --in export.json
And we verify:
db-cli tables --db ./target-db
orders
users
db-cli describe users --db ./target-db
table: users
prefix: users.table
columns:
- id: UInt64 (pk, not null)
- name: String (not null)
This gives quite a feeling of restoration.
Division of roles between export/import is becoming clear
By this point, the roles of export and import are neatly separated.
export
- read-only
- logical extraction
- reject if dirty
import
- mutation
- logical restoration
- reject if target is dirty
- rollback is still weak
Diagrammatically:
export
-> safe read-only output
import
-> safe-ish restore into clean target
This contrast is important.
Things we won't do yet
Writing this clearly makes for a better article. Things we are not doing this time include:
- Overwrite restore
- Merge import
- Upsert restore
- Guaranteed rollback
- Binary snapshot restore
- Incremental restore
- Online transaction-consistent restore
- Remote replication
Why not now?
Because what is needed this time is:
The first restore path to act as a counterpart to export
So, what we want now is this minimal contract:
- Read export JSON
- Restore to a clean target
- Reuse existing mutation paths
- Fail on dirty / collision states by falling back to the safe side
Responsibility separation this time
The responsibility separation visible in this import is quite clean:
CLI
└─ input file designation
thin wrapper
import API
└─ format parse / validate
create/open/insert orchestration
catalog
└─ source of truth for table existence / schema
existing mutation path
└─ create_table / open_table / insert
recovery / cleanup / checkpoint
└─ separate from import
Crucially:
- Import does not re-implement storage/WAL/index
- Export format is the sole source
- CLI does not re-implement logical format
With this separation, we maintain a foothold for considering overwrites or rollbacks next.
Thoughts
This task feels like the installment where we "leave and return to the closed world."
With the export in T-022, the DB became able to go outside. But if it couldn't come back, it would still be a one-way street. By adding import this time, we have made significant progress:
- Can be taken out
- Can be returned
- Can perform roundtrips
Of course, it is not yet a powerful restore:
- Dirty targets are rejected
- Rollback is weak
- No overwrite or merge
But even so, it is quite a solid approach for an initial restore. Especially, using the export format as the restore source directly is very clean.
Summary
This time, we introduced logical import / restore and added:
import-tableimport-db
to enable restoration of tables / databases from export JSON.
In short, this was the installment where we raised the logical export made in T-022 into a contract that can be brought back.
With this, our self-made DB moves closer to being a small system that can not only move its state outside but also safely restore that state elsewhere.
Next, the natural theme will be to add preflight / dry-run verification for imports on top of this foundation. Since we now have the entry point for restore, moving toward "confirming what will happen before actually writing" should take us much closer to practical utility.
See you next time!
Discussion