iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🧨

Boosting the Maintainability of Long SQL (SELECT Queries)

に公開

Summary

  • SQL has low maintainability
  • Experimentally created a DSL in C#
  • Give it a try

Target Readers

  • Those who write long SQL (select queries)
  • Those who want to improve readability
  • Those who want to improve maintainability
  • Those who want to improve reusability
  • Those who want to improve testability
  • Those who are open to experimental attempts

Challenges of SQL

I will touch on this briefly before getting to the main topic. You can skip this part if you wish.

Weak refactoring capabilities

Table alias names, column alias names, and CTE names. Naming is important, but there are no support features for renaming them. Changing these names later is a major undertaking.

Cannot write in a fluent manner

When writing long SQL, the following steps are often taken:

  • Write a select query
  • Verify the behavior
  • Move the written query into a WITH clause or a subquery
  • Write the next select query
  • (Repeat)

You cannot write the next piece of code without going through this "move" task. Furthermore, the act of "moving" leads to the "difficulty of debugging" problem mentioned later.

Hard to debug
with 
cte_a as (
    select * from table_data
),
cte_b as (
    select * from cte_a
),
cte_c as (
    select * from cte_b
)
select
    *
from
    cte_c

Only the last SELECT query described can be debugged (viewing the data).
To debug the contents of a query defined in a WITH clause, you need to comment out and rewrite the very last query.

Because of this, debugging involves:

  • Changing the code to a state where it can be debugged (*)
  • Identifying and fixing the location of the bug
  • Reverting the code that was rewritten in (*)

SQL does not allow for non-destructive approaches such as breakpoints or log output, which are available in other languages.

Weak commonization features

While queries can be commonized using CTEs, a CTE is only valid within that specific query. To use it in other queries, you need to create a VIEW or a table-valued function.

However, since these are defined in the database, there are concerns regarding visibility, maintainability, portability, and ease of deployment. While they can be beneficial, if you create VIEWs with the same frequency as commonizing code in other languages, you will end up with an overflow of VIEWs, making them difficult to use casually.

Unit testing is difficult

To verify behavior, test data must be registered in the database. To achieve this, the following requirements must be met:

  • A development database is required
  • Environments such as tables must be set up
  • Test data must be independent for each developer

Items 1 and 2 are shared development environments, but item 3 requires an individual environment for each developer.

Proposed Solution

Refactoring issues
→IDE issue

Inability to write fluently
→SQL syntax issue

Hard to debug
→SQL syntax issue

Difficult to commonize
→SQL syntax issue

Difficult to test
→Using CTEs can eliminate physical table dependencies

It seems most of these challenges cannot be solved through standard approaches.

Therefore, I attempted to solve them using a DSL (Domain Specific Language).

How to use the DSL

Let's look at how to use the DSL and its effects while actually building a console application.

Development Environment

  • IDE: Visual Studio 2022
  • Language: C#
  • Version: .NET 8

Deliverables (Source Code)

I'm making the deliverables of the demo public.

Please refer here if the demo doesn't work well or if you just want to see the code.

Project

Demo Code

Installing the DSL

Install Carbunql.TypeSafe from the Package Manager.

PM > NuGet\Install-Package Carbunql.TypeSafe

Define the Model

First, define a model to map the table. Please define it according to the following rules:

  • Table name = Record name
  • Column name = Property name
  • Implement the IDataRow interface

If the table definition is as follows:

The model definition will look like this:

using Carbunql;
using Carbunql.TypeSafe;

public record store : IDataRow
{
    public long store_id { get; set; }
    public string store_name { get; set; }
    public IDataSet DataSet { get; set; } = null!;
}

IDataRow is an interface that indicates TypeSafe builds are possible. When you try to apply the interface, you will be required to implement the DataSet property, so:

public IDataSet DataSet { get; set; } = null!;

define it exactly as shown.

Defining a Select Query as a Function

Next, create a select query. Since select queries are intended to be reused, define them as functions.

I will first present the code to retrieve all columns and all rows.

using Carbunql;
using Carbunql.TypeSafe;

internal class Program
{
    private static void Main(string[] args)
    {
    }

    /// <summary>
    /// Select query to get all stores
    /// </summary>
    /// <returns></returns>
    private static FluentSelectQuery<store> SelectAllStore()
    {
        // Assign an alias name to the dataset (table)
        var s = Sql.DefineDataSet<store>();
        // Set the dataset in the From clause
        return Sql.From(() => s);
    }
}

In this library, SQL is built in the following order:

  • Definition of the dataset and its name
  • Construction of the SQL statement using the dataset (SQL statements are written starting from the FROM clause)

First, define the dataset.

// Assign an alias name to the dataset (table)
var s = Sql.DefineDataSet<store>();

The above means that the name "s" has been given to the "store" table.

Next, use this dataset to construct the SQL statement. To specify the dataset in the From clause, write it as follows:

Sql.From(() => s)

This creates a function synonymous with the following SQL:

SELECT * FROM store AS s

Testing the Function

You can get the command string by using the ToText method.

private static void Main(string[] args)
{
    // Get the select query
    var query = SelectAllStore();

    // Output the select query as a command string
    // select * from store
    Console.WriteLine(query.ToText());
}

The execution result is as follows.

SELECT
    *
FROM
    store AS s

Reusing a Select Query Function as a Subquery

Let's try referencing a previously defined function as a subquery and adding search conditions.

/// <summary>
/// Query to select store_id = 1
/// </summary>
/// <returns></returns>
private static FluentSelectQuery<store> SelectStoreId1()
{
    // Assign an alias name to the dataset (query)
    var all_s = Sql.DefineDataSet(() => SelectAllStore());
    // Set the dataset in the From clause
    // Also add a Where condition
    return Sql.From(() => all_s)
        .Where(() => all_s.store_id == 1);
}

Defining the dataset is the same as before. Note that this time we are referencing a function instead of a table.

// Assign an alias name to the dataset (query)
var all_s = Sql.DefineDataSet(() => SelectAllStore());

The above means that the name "all_s" has been given to the "SelectAllStore query".

The execution result is as follows.

SELECT
    *
FROM
    (
        SELECT
            *
        FROM
            store AS s
    ) AS all_s
WHERE
    all_s.store_id = 1

Reusing a Select Query Function as a CTE

Previously, we referenced a select query function as a subquery, but it is also possible to reference it as a CTE.

/// <summary>
/// Query to select store_id = 1 (CTE)
/// </summary>
/// <returns></returns>
private static FluentSelectQuery<store> SelectStoreId1UseCTE()
{
    // Giving a name to the select query makes it recognized as a CTE
    var all_store = SelectAllStore();

    // Assign an alias name to the dataset (CTE)
    var all_s = Sql.DefineDataSet(() => all_store);
    // Set the dataset in the From clause
    // Also add a Where condition
    return Sql.From(() => all_s)
        .Where(() => all_s.store_id == 1);
}

It is almost the same as using it as a subquery, but the key point is giving a name to the select query.

// Giving a name to the select query makes it recognized as a CTE
var all_store = SelectAllStore();

By doing this, you can treat the select query function as a CTE. Everything else is exactly the same as a subquery. Simply by setting it in the FROM clause, the WITH clause is automatically generated.

Sql.From(() => all_s)

The execution result is as follows.

WITH
    all_store AS (
        SELECT
            *
        FROM
            store AS s
    )
SELECT
    *
FROM
    all_store AS all_s
WHERE
    all_s.store_id = 1

As seen above, we were able to reuse a select query that can be executed independently to generate a new select query. With this, I believe:

  • Reusability
  • Maintainability
  • Readability

have been greatly improved.

Defining Test Data as a Function

Let's also improve testability by utilizing the specification [1] where CTEs are prioritized over physical tables.

Test data queries can be created simply by passing a collection to the constructor.

private static FluentSelectQuery<store> SelectTestStoreData()
{
    // Generate a VALUES query from a collection and turn it into a select query
    return new FluentSelectQuery<store>([
           new store(){store_id = 1, store_name = "abc"},
           new store(){store_id = 2, store_name = "def"},
    ]);
}

The execution result is as follows.

SELECT
    v.store_id,
    v.store_name
FROM
    (
        VALUES
            (1, CAST('abc' AS text)),
            (2, CAST('def' AS text))
    ) AS v (
        store_id, store_name
    )

Injecting Test Data

Let's inject the test data query we just created into the WITH clause.

Defining the dataset is the same as before, but since it won't be referenced in the FROM clause, we set it directly in the WITH clause.

query = SelectStoreId1UseCTE();
var store = SelectTestStoreData();
// Inject into the WITH clause
query.With(() => store);

The execution result is as follows.

WITH
    store AS (
        SELECT
            v.store_id,
            v.store_name
        FROM
            (
                VALUES
                    (1, CAST('abc' AS text)),
                    (2, CAST('def' AS text))
            ) AS v (
                store_id, store_name
            )
    ),
    all_store AS (
        SELECT
            *
        FROM
            store AS s
    )
SELECT
    *
FROM
    all_store AS all_s
WHERE
    all_s.store_id = 1

This completes a test query that does not reference physical tables at all.

Name Refactoring

You can use Visual Studio features. You can later modify function names and alias names as you like.

Comment Feature

As the SQL grows longer, you might want to map the SQL CTEs to C# functions. In this case, use the Comment method.

private static FluentSelectQuery<store> SelectTestStoreData()
{
    // Generate a VALUES query from a collection and turn it into a select query
    return new FluentSelectQuery<store>([
           new store(){store_id = 1, store_name = "abc"},
           new store(){store_id = 2, store_name = "def"},
    ]).Comment($"Function:{nameof(SelectTestStoreData)}");
}
WITH
    store AS (
        /* Function:SelectTestStoreData */
        SELECT
            v.store_id,
            v.store_name
        FROM
            (
                VALUES
                    (1, CAST('abc' AS text)),
                    (2, CAST('def' AS text))
            ) AS v (
                store_id, store_name
            )
    ),
    all_store AS (
        SELECT
            *
        FROM
            store AS s
    )
SELECT
    *
FROM
    all_store AS all_s
WHERE
    all_s.store_id = 1

A More Practical Demo

Consumption Tax Calculation

Let's write code to perform tax calculation processing with the following specifications:

  • Tax is calculated by rounding down the total amount.
  • Tax is also calculated by rounding down on a line-item basis.
  • If the total amount and the sum of line-item tax amounts do not match, the discrepancy is allocated.
  • The priority for allocation is in descending order of the fraction amount that was rounded down.

Here is the resulting function:

private static FluentSelectQuery<SaleReport> SelectSaleReportQuery()
{
    var sale_with_tax = SelectSaleWithTaxQuery();
    var swt = Sql.DefineDataSet(() => sale_with_tax);

    var tax_summary = SelectTaxSummary();
    var ts = Sql.DefineDataSet(() => tax_summary);

    var getSubQuery = () => Sql.From(() => swt)
        .InnerJoin(() => ts, () => swt.tax_rate == ts.tax_rate)
        .Select(() => swt)
        .Select(() => new SaleReport
        {
            tax_adjustment = (swt.adjust_priority <= ts.tax_difference) ? 1 : 0
        }).Compile<SaleReport>(force: true);

    var q = Sql.DefineDataSet(() => getSubQuery());

    var query = Sql.From(() => q)
        .Comment("If the tax amount per line does not match the tax amount per tax rate, it will be adjusted.\r\nThe priority for adjustments will be in descending order of the rounded down fraction.")
        .Select(() => q)
        .Select(() => new SaleReport
        {
            tax = q.line_tax_truncated + q.tax_adjustment
        });

    return query.Compile<SaleReport>(); ;
}
Generated SQL (Folded because it's long)
/* If the tax amount per line does not match the tax amount per tax rate, it will be adjusted.
The priority for adjustments will be in descending order of the rounded down fraction. */
WITH
    sale AS (
        /* test data */
        SELECT
            v.sale_id,
            v.product_name,
            v.unit_price,
            v.quantity,
            v.tax_rate
        FROM
            (
                VALUES
                    (1, CAST('apple' AS text), 105, 5, CAST(0.07 AS double precision)),
                    (2, CAST('orange' AS text), 203, 3, CAST(0.07 AS double precision)),
                    (3, CAST('banana' AS text), 233, 9, CAST(0.07 AS double precision)),
                    (4, CAST('tea' AS text), 309, 7, CAST(0.08 AS double precision)),
                    (5, CAST('coffee' AS text), 555, 9, CAST(0.08 AS double precision)),
                    (6, CAST('cola' AS text), 456, 2, CAST(0.08 AS double precision))
            ) AS v (
                sale_id, product_name, unit_price, quantity, tax_rate
            )
    ),
    sale_with_tax AS (
        /* Calculate tax on a line-by-line basis */
        SELECT
            q.sale_id,
            q.product_name,
            q.unit_price,
            q.quantity,
            q.tax_rate,
            q.line_tax_truncated,
            q.line_tax_raw,
            ROW_NUMBER() OVER(
                PARTITION BY
                    q.tax_rate
                ORDER BY
                    q.line_tax_raw - CAST(q.line_tax_truncated AS double precision) DESC,
                    q.sale_id
            ) AS adjust_priority
        FROM
            (
                SELECT
                    s.sale_id,
                    s.product_name,
                    s.unit_price,
                    s.quantity,
                    s.tax_rate,
                    CAST(TRUNC(CAST(s.unit_price * s.quantity AS double precision) * s.tax_rate) AS integer) AS line_tax_truncated,
                    CAST(s.unit_price * s.quantity AS double precision) * s.tax_rate AS line_tax_raw,
                    CAST(null AS integer) AS adjust_priority
                FROM
                    Sale AS s
            ) AS q
    ),
    tax_summary AS (
        /* Calculate tax by tax rate */
        SELECT
            q.tax_rate,
            q.total_tax,
            q.line_tax_trancated_summary,
            q.total_tax - q.line_tax_trancated_summary AS tax_difference
        FROM
            (
                SELECT
                    swt.tax_rate,
                    CAST(TRUNC(SUM(swt.line_tax_raw)) AS integer) AS total_tax,
                    SUM(swt.line_tax_truncated) AS line_tax_trancated_summary,
                    CAST(null AS integer) AS tax_difference
                FROM
                    sale_with_tax AS swt
                GROUP BY
                    swt.tax_rate
            ) AS q
    )
SELECT
    q.sale_id,
    q.product_name,
    q.unit_price,
    q.quantity,
    q.tax_rate,
    q.line_tax_truncated,
    q.line_tax_raw,
    q.adjust_priority,
    q.tax_adjustment,
    q.line_tax_truncated + q.tax_adjustment AS tax
FROM
    (
        SELECT
            swt.sale_id,
            swt.product_name,
            swt.unit_price,
            swt.quantity,
            swt.tax_rate,
            swt.line_tax_truncated,
            swt.line_tax_raw,
            swt.adjust_priority,
            CASE
                WHEN swt.adjust_priority <= ts.tax_difference THEN 1
                ELSE 0
            END AS tax_adjustment,
            CAST(null AS integer) AS tax
        FROM
            sale_with_tax AS swt
            INNER JOIN tax_summary AS ts ON swt.tax_rate = ts.tax_rate
    ) AS q

The full source code can be found here:
https://github.com/mk3008/Carbunql/blob/main/demo/TypeSafeBuild/Program.cs

Create Table Conversion

Please use the ToCreateTableQuery method.

query.ToCreateTableQuery("tmp", isTemporary: true).ToText();

Insert Query Conversion

Please use the ToInsertQuery method.

query.ToInsertQuery("tmp", isTemporary: true).ToText();

Summary

While you need to learn the DSL, I believe it can significantly improve the maintainability of long SQL queries.

However, there are also some drawbacks to the DSL:

  • Code is required to translate C# to SQL (exceptions occur for patterns that cannot be translated).
  • There are SQL dialect issues, making it difficult to translate universally (currently only Postgres is supported).
  • The build process is not particularly fast (not very suitable for OLTP).

Since long SQL queries are less likely to occur in OLTP, improving the build process speed is a lower priority, but I intend to continue improving the translation and dialect issues.

https://github.com/mk3008/Carbunql

脚注
  1. Postgres documentation
    https://www.postgresql.org/docs/current/sql-select.html
    A WITH query is referenced by writing its name, just as though the query's name were a table name. (In fact, the WITH query hides any real table of the same name for the purposes of the primary query. If necessary, you can refer to a real table of the same name by schema-qualifying the table's name.) An alias can be provided in the same way as for a table. ↩︎

Discussion