LLMでいい感じにSQLのロジックテストのデータを用意したい

兒玉知己

dbt v1.8がリリースされ、作成したモデルのロジックテストができるようになりました。

せっかくロジックテストができるのであれば、CIでテストを実行して安心感を得たいです。

が、問題はテストデータを用意しなければならない点でしょう。
運用中のデータの一部を用いてテストができるのなら良いですが、持ち出してCIに使うことが難しい場合も多いはずです。

そこで、LLMにSQLを渡して、それらしいデータが簡単に作れないかな？というのがここのテーマです。

兒玉知己

やりたいことのイメージ

まずはこのプロンプトを見てください。SQLは、dbtのチュートリアルで登場するjaffle_shopのcustomers.sqlです。

dbtによるデータマート向けに書かれたSQLをテストしようとしています。
SQLの単体テストデータをCSV形式で生成してください。
結果はCSVデータだけを出力してください。

## テスト対象SQL

with customers as (
    select * from {{ ref('stg_customers') }}
),
orders as (
    select * from {{ ref('stg_orders') }}
),
payments as (
    select * from {{ ref('stg_payments') }}
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders
    group by customer_id
),
customer_payments as (
    select
        orders.customer_id,
        sum(amount) as total_amount
    from payments
    left join orders on
         payments.order_id = orders.order_id
    group by orders.customer_id
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value
    from customers
    left join customer_orders
        on customers.customer_id = customer_orders.customer_id
    left join customer_payments
        on  customers.customer_id = customer_payments.customer_id
)

select * from final

## 以下の条件を満たすテストデータを作成してください

- SQLで利用されている列はすべてCSVに含める
- データのバリエーションは以下の通り
    - 日付型カラムでは、「月初日」「暦2日」「月末」
    - 時間型カラムでは、「0:00」「0:01」「23:59」
    - 集計操作の場合、集計対象の行が「0行」「1行」かつ、その値は連番を使う
    - 各列の値がそれぞれNullの場合
- SQLに含まれる参照テーブルと、加工の期待値をCSV化する

兒玉知己

雑にChatGPTに投げてみます。結果は以下の通り。

#### stg_customers.csv
customer_id,first_name,last_name
1,John,Doe
2,Jane,Smith
3,Bob,Johnson
4,NULL,NULL
5,Alice,Williams
6,Charlie,Brown
#### stg_orders.csv
order_id,customer_id,order_date
1,1,2024-05-01
2,1,2024-05-02
3,2,2024-05-01
4,2,2024-05-02
5,3,2024-04-30
6,3,2024-05-31
7,4,NULL
8,5,2024-05-15
9,6,2024-05-01
10,6,2024-05-31
#### stg_payments.csv
payment_id,order_id,amount
1,1,100.0
2,2,200.0
3,3,150.0
4,4,300.0
5,5,250.0
6,6,350.0
7,7,NULL
8,8,400.0
9,9,500.0
10,10,600.0
11,NULL,NULL
#### final.csv (期待される結果)
customer_id,first_name,last_name,first_order,most_recent_order,number_of_orders,customer_lifetime_value
1,John,Doe,2024-05-01,2024-05-02,2,300.0
2,Jane,Smith,2024-05-01,2024-05-02,2,450.0
3,Bob,Johnson,2024-04-30,2024-05-31,2,600.0
4,NULL,NULL,NULL,NULL,0,NULL
5,Alice,Williams,2024-05-15,2024-05-15,1,400.0
6,Charlie,Brown,2024-05-01,2024-05-31,2,1100.0

それっぽいデータが返ってきてそう。

兒玉知己

実際にこれをdbt のunit testにかけてみました。

unit_tests:
  - name: logic_test_customers
    model: customers
    given:
      - input: ref('stg_customers')
        format: csv
        rows: |
          customer_id,first_name,last_name
          1,John,Doe
          2,Jane,Smith
          3,Bob,Johnson
          4,NULL,NULL
          5,Alice,Williams
          6,Charlie,Brown

      - input: ref('stg_orders')
        format: csv
        rows: |
          order_id,customer_id,order_date
          1,1,2024-05-01
          2,1,2024-05-02
          3,2,2024-05-01
          4,2,2024-05-02
          5,3,2024-04-30
          6,3,2024-05-31
          7,4,NULL
          8,5,2024-05-15
          9,6,2024-05-01
          10,6,2024-05-31


      - input: ref('stg_payments')
        format: csv
        rows: |
          payment_id,order_id,amount
          1,1,100.0
          2,2,200.0
          3,3,150.0
          4,4,300.0
          5,5,250.0
          6,6,350.0
          7,7,NULL
          8,8,400.0
          9,9,500.0
          10,10,600.0
          11,NULL,NULL

    expect:
      format: csv
      rows: |
        customer_id,first_name,last_name,first_order,most_recent_order,number_of_orders,customer_lifetime_value
        1,John,Doe,2024-05-01,2024-05-02,2,300.0
        2,Jane,Smith,2024-05-01,2024-05-02,2,450.0
        3,Bob,Johnson,2024-04-30,2024-05-31,2,600.0
        4,NULL,NULL,NULL,NULL,0,NULL
        5,Alice,Williams,2024-05-15,2024-05-15,1,400.0
        6,Charlie,Brown,2024-05-01,2024-05-31,2,1100.0

$ dbt test
# 長いので出力中略
...
14:38:20  21 of 21 FAIL 1 customers::logic_test_customers ................................ [FAIL 1 in 2.37s]
14:38:20  
14:38:20  Finished running 20 data tests, 1 unit test in 0 hours 0 minutes and 8.72 seconds (8.72s).
14:38:20  
14:38:20  Completed with 1 error and 0 warnings:
14:38:20  
14:38:20  Failure in unit_test logic_test_customers (models/unit_test.yml)
14:38:20    

actual differs from expected:

@@ ,CUSTOMER_ID,FIRST_NAME,...,MOST_RECENT_ORDER,NUMBER_OF_ORDERS,CUSTOMER_LIFETIME_VALUE
...,...        ,...       ,...,...              ,...             ,...
   ,3          ,Bob       ,...,2024-05-31       ,2               ,600.000000
→  ,4          ,NULL      ,...,NULL             ,0→1             ,NULL
   ,5          ,Alice     ,...,2024-05-15       ,1               ,400.000000
...,...        ,...       ,...,...              ,...             ,...

4行目の NUMBER_OF_ORDERS のみ間違っているようですが、他は入力に対して正しい出力を作れている様子です。

このような形で、SQLをもとにテストデータを簡易に（少なくともヒトが自力でゼロから用意しなくてもいい程度に）LLMで作っていきたいです。

兒玉知己

もうちょっと高度化していきたいです。
自分がテストデータを作るときの段取り通りに、LLMに逐次的に作業をしてもらえないか考えます。

SQLのテストに使えるデータとは何か？考えてみる

SQLのロジックを網羅的にカバーするには、観点がおおきく２つあるはずです。

1. 一つのテーブルで完結する程度の条件

境界値テストっぽい観点

数値型 ... 整数 or 小数点つき、正,負,ゼロ
日付 ... 月初 or 月末、年始 or 年末、今年 or 去年
null or not null

みたいな話です。これはシンプルそうです。

2. 他テーブルとの結合性を意識しなければならない条件

↑のcustomers.sqlで言えば

注文テーブルにある顧客IDをもつ顧客が、実際に顧客テーブルにある
支払いテーブルにある注文IDをもつ注文が、実際に注文テーブルにある

といった話です。

さらに、顧客ごとの支払額合計を集計していますが、この「総支払い額」列に対して上記観点１の境界値テストらしいデータを用意しようとすると、

支払い額の合計が0の顧客
支払い額の合計が1の顧客

をそれぞれ用意しなければなりません。

兒玉知己

上の話をわかりやすく言語化してくださっている方がいらっしゃいました。

参考になる、、、

兒玉知己

プロンプトだとこんな感じでしょうか？

SQLのロジックを網羅的にカバーするために、以下の3つの観点でテストデータを作成してください。

# 1. 単一テーブルの条件カバレッジ
一つのテーブル内の列に対するさまざまな条件を考慮してデータを作成します。
以下の条件を満たすデータを用意してください。

- 数値型：
  - 整数：正の数、負の数、ゼロ、最大値、最小値
  - 小数：正の数、負の数、ゼロ、小数点以下の桁数の違い
- 日付型：
  - 月初、月末、年始、年末、今年、去年
  - 将来の日付、過去の日付、無効な日付
- 時間型：
  - 0:00、0:01、23:59、午前、午後
- 文字列型：
  - 空文字、単語、長い文字列（最大長）、特殊文字、マルチバイト文字（例：漢字、絵文字）
- ブール型：
  - TRUE、FALSE
- NULL値：
  - NULLと非NULL
# 2. 関連テーブルの行数カバレッジ
複数のテーブル間における行数の組み合わせを考慮してデータを作成します。
以下のパターンを含めてください。

- 関連テーブルに該当する行が存在しないデータ
- 関連テーブルに該当する行が1件だけ存在するデータ
- 関連テーブルに該当する行が複数存在するデータ

# 3. 関連テーブルの条件カバレッジ
複数のテーブル間における条件の組み合わせを考慮してデータを作成します。
以下のパターンを含めてください。

- 関連テーブルの条件を満たすデータ
- 関連テーブルの条件を満たさないデータ

これらのカバレッジを組み合わせて、多様なテストデータを作成してください。
ただし、すべての組み合わせを網羅するのではなく、効率的に多くのケースをカバーすることを目指してください。

# 注意点
- 条件に優先順位をつけて、重要な条件から考慮すること
- 関連テーブルの行数カバレッジでは、特に重要な条件に焦点を当てること
- 関連テーブルの条件カバレッジは、複数の条件を一つにまとめることも検討すること

レアケースのバグよりも、一般的なケースで確実に動作することを重視し、
シンプルかつ効果的なテストデータを作成してください。

# 対象のSQL
{ここにSQLはる}

# 出力形式
- 参照している各テーブル毎と出力の期待値をCSV形式で出力

兒玉知己

ChatGPTの結果は以下。
空文字が「空文字」になってたり、特殊文字が「特殊文字」だったり、マルチバイトの条件が「Multi Byte」だったりするのが気になります。
結果的には空文字以外の文字種は含まれたのですが、ちょっと使えないなという印象ですね。

以下にテストデータのCSVを示します。

stg_customers.csv

customer_id,first_name,last_name
1,John,Doe
2,Jane,Smith
3,Max,Mustermann
4,Anna,Ivanova
5,空文字,特殊文字
6,長い文字列が続く顧客,ラストネームも長い文字列が続く
7,マルチバイト,文字例
8,Special,Characters!@#
9,Multi,Byte🌟
10,Null,NonNull
11,Positive,Case
12,Negative,Case
13,Zero,Case

stg_orders.csv

order_id,customer_id,order_date
1,1,2023-01-01
2,1,2023-02-15
3,2,2023-01-01
4,3,2024-05-17
5,4,2022-12-31
6,5,2023-06-30
7,6,2023-01-01
8,7,2023-01-01
9,8,2023-01-01
10,9,2023-01-01
11,10,2023-01-01
12,11,2023-01-01
13,12,2023-01-01
14,13,2023-01-01
15,13,2023-01-01
16,13,2023-01-01
17,3,2023-06-30
18,4,2024-05-17
19,5,2022-12-31
20,5,2023-06-30
21,6,2023-01-01
22,6,2023-02-01
23,7,2023-01-01
24,7,2023-02-01
25,8,2023-01-01
26,8,2023-02-01
27,9,2023-01-01
28,9,2023-02-01
29,10,2023-01-01
30,10,2023-02-01
31,11,2023-01-01
32,11,2023-02-01
33,12,2023-01-01
34,12,2023-02-01
35,1,2023-01-01
36,1,2023-02-01
37,2,2023-01-01
38,2,2023-02-01

stg_payments.csv

payment_id,order_id,amount
1,1,100.00
2,2,200.50
3,3,150.00
4,4,250.75
5,5,300.00
6,6,50.00
7,7,400.00
8,8,500.00
9,9,600.00
10,10,700.00
11,11,800.00
12,12,900.00
13,13,1000.00
14,14,1100.00
15,15,1200.00
16,16,1300.00
17,17,1400.00
18,18,1500.00
19,19,1600.00
20,20,1700.00
21,21,1800.00
22,22,1900.00
23,23,2000.00
24,24,2100.00
25,25,2200.00
26,26,2300.00
27,27,2400.00
28,28,2500.00
29,29,2600.00
30,30,2700.00
31,31,2800.00
32,32,2900.00
33,33,3000.00
34,34,3100.00
35,35,3200.00
36,36,3300.00
37,37,3400.00
38,38,3500.00

final.csv

customer_id,first_name,last_name,first_order,most_recent_order,number_of_orders,customer_lifetime_value
1,John,Doe,2023-01-01,2023-02-15,4,650.50
2,Jane,Smith,2023-01-01,2023-02-15,4,500.50
3,Max,Mustermann,2023-06-30,2024-05-17,2,1400.00
4,Anna,Ivanova,2022-12-31,2024-05-17,2,1750.75
5,空文字,特殊文字,2022-12-31,2023-06-30,2,4600.00
6,長い文字列が続く顧客,ラストネームも長い文字列が続く,2023-01-01,2023-02-01,3,3450.00
7,マルチバイト,文字例,2023-01-01,2023-02-01,3,3000.00
8,Special,Characters!@#,2023-01-01,2023-02-01,3,3400.00
9,Multi,Byte🌟,2023-01-01,2023-02-01,3,5000.00
10,Null,NonNull,2023-01-01,2023-02-01,3,5300.00
11,Positive,Case,2023-01-01,2023-02-01,3,5800.00
12,Negative,Case,2023-01-01,2023-02-01,3,3100.00
13,Zero,Case,2023-01-01,2023-01-01,3,5500.00

兒玉知己

こちらテストに使うと以下の通り。
集計の間違いが目立つが、計算は複雑なので、これはしょうがないかなという感じ。

actual differs from expected:

@@,CUSTOMER_ID,FIRST_NAME,...,FIRST_ORDER,MOST_RECENT_ORDER,NUMBER_OF_ORDERS,CUSTOMER_LIFETIME_VALUE
→ ,1          ,John      ,...,2023-01-01 ,2023-02-15       ,4               ,650.500000→6800.500000
→ ,2          ,Jane      ,...,2023-01-01 ,2023-02-15→2023-02-01,4→3         ,500.500000→7050.000000
→ ,3          ,Max       ,...,2023-06-30 ,2024-05-17       ,2               ,1400.000000→1650.750000
→ ,4          ,Anna      ,...,2022-12-31 ,2024-05-17       ,2               ,1750.750000→1800.000000
→ ,5          ,空文字       ,...,2022-12-31 ,2023-06-30       ,2→3             ,4600.000000→3350.000000
→ ,6          ,長い文字列が続く顧客,...,2023-01-01 ,2023-02-01       ,3               ,3450.000000→4100.000000
→ ,7          ,マルチバイト    ,...,2023-01-01 ,2023-02-01       ,3               ,3000.000000→4600.000000
→ ,8          ,Special   ,...,2023-01-01 ,2023-02-01       ,3               ,3400.000000→5100.000000
→ ,9          ,Multi     ,...,2023-01-01 ,2023-02-01       ,3               ,5000.000000→5600.000000
→ ,10         ,Null      ,...,2023-01-01 ,2023-02-01       ,3               ,5300.000000→6100.000000
→ ,11         ,Positive  ,...,2023-01-01 ,2023-02-01       ,3               ,5800.000000→6600.000000
→ ,12         ,Negative  ,...,2023-01-01 ,2023-02-01       ,3               ,3100.000000→7100.000000
→ ,13         ,Zero      ,...,2023-01-01 ,2023-01-01       ,3               ,5500.000000→3600.000000

兒玉知己

WIP

もっと空文字とかNullとか特殊文字とかの具体例を添えないと難しそう
Not Null制約とかも意識させたい
メタデータをRAGっぽく与えればよいのでは

兒玉知己

Difyでワークフローにしてみた。

screenshot

SQLを受け取る
事前にアップロード＆埋め込みしてVector DBに保存してあるdbtのymlファイルから、テーブルやカラム定義を抜いてくる
LLMに投げる
結果出力

いったんケチケチして、安いAPIを使っている。

Embedding ... OpenAI text-embedding-3-large
LLM ... Gemini 1.5 Pro

投げるプロンプトはこんな感じにした。
あたえるymlが英語なので、プロンプトも英語にしてみています。

To comprehensively cover the SQL logic, please create test data based on the following three perspectives:

### 1. Single Table Condition Coverage
Consider various conditions for columns within a single table and prepare data that meets the following criteria:

- **Numeric Types**:
  - Integers: Positive, negative, zero, maximum value, minimum value
  - Decimals: Positive, negative, zero, varying decimal places
- **Date Types**:
  - Beginning of the month, end of the month, beginning of the year, end of the year, this year, last year
  - Future dates, past dates, invalid dates
- **Time Types**:
  - 0:00, 0:01, 23:59, AM, PM
- **String Types**:
  - Empty string, single word, long string (maximum length), special characters, multibyte characters (e.g., kanji, emojis)
- **Boolean Types**:
  - TRUE, FALSE
- **NULL Values**:
  - NULL and non-NULL

### 2. Row Count Coverage for Related Tables
Consider combinations of row counts across multiple tables and include the following patterns:

- Data with no corresponding rows in related tables
- Data with exactly one corresponding row in related tables
- Data with multiple corresponding rows in related tables

### 3. Condition Coverage for Related Tables
Consider combinations of conditions across multiple tables and include the following patterns:

- Data that meets the conditions of the related tables
- Data that does not meet the conditions of the related tables

Combine these coverages to create diverse test data. Aim to efficiently cover many cases rather than exhaustively covering all possible combinations.

### Notes
- Prioritize conditions and consider the most important ones first
- Focus on particularly important conditions for row count coverage in related tables
- Consider combining multiple conditions into one for condition coverage in related tables

Emphasize ensuring functionality in general cases over rare case bugs, and create simple yet effective test data.

### Target SQL
{Input SQL statements}

### Reference Table Information
{contexts}

### Output Format
- Output the expected values in CSV format for each referenced table
- No explanations needed. Please output only the CSV.

兒玉知己

実行した結果がこちら。

モデルの性能が全然違う同士の比較なので、なんとも言い難いのかな、、、
あとは、期待値を出してくれなくなりました。

customer_id,first_name,last_name,email,phone,address,city,state,zip_code,country
1,John,Doe,johndoe@example.com,123-456-7890,123 Main St,Anytown,CA,12345,USA
2,Jane,Doe,janedoe@example.com,987-654-3210,456 Oak Ave,Springfield,IL,67890,USA
3,David,Smith,davidsmith@example.com,555-123-4567,789 Pine Ln,Sometown,NY,54321,USA
4,Emily,Jones,emilyjones@example.com,111-222-3333,1010 Birch Rd,Othertown,FL,34567,USA
5,Michael,Brown,michaelbrown@example.com,444-555-6666,1212 Cedar St,Newtown,TX,78901,USA
6,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL

order_id,customer_id,order_date,status
1,1,2023-01-01,shipped
2,2,2023-01-15,pending
3,1,2023-02-28,shipped
4,3,2023-03-15,cancelled
5,4,2023-04-01,shipped
6,2,2023-04-30,shipped
7,5,2023-05-15,pending
8,1,2023-06-30,shipped
9,3,2023-07-15,shipped
10,2,2023-08-31,shipped
11,4,2023-09-15,cancelled
12,5,2023-10-31,shipped
13,1,2023-11-15,pending
14,3,2023-12-31,shipped
15,6,2023-01-01,shipped

payment_id,order_id,payment_date,amount
1,1,2023-01-01,100.00
2,2,2023-01-15,50.50
3,3,2023-02-28,200.00
4,4,2023-03-15,75.25
5,5,2023-04-01,150.00
6,6,2023-04-30,60.75
7,7,2023-05-15,250.00
8,8,2023-06-30,120.50
9,9,2023-07-15,80.25
10,10,2023-08-31,90.00
11,11,2023-09-15,110.75
12,12,2023-10-31,130.00
13,13,2023-11-15,40.50
14,14,2023-12-31,170.25
15,NULL,2023-01-01,20.00

兒玉知己

と思ったけど、英訳がおかしいな。。

- Output the expected values in CSV format for each referenced table

入力と期待値をそれぞれ出してほしいので、以下の通り修正。

Output CSV files for each table referenced in the SQL and the expected result of the SQL execution.

兒玉知己

修正したプロンプトで再実行したところ、期待値も出してくれるようになった。

気になる点

空文字がEmptyになったり、NullをNullというキーワードにしている
stg_customers.csv で、存在するはずのないemail,phone,created_at,is_active列が勝手に増えている

## stg_customers.csv
customer_id,first_name,last_name,email,phone,created_at,is_active
1,John,Doe,john.doe@example.com,123-456-7890,2023-01-01,TRUE
2,Jane,Doe,jane.doe@example.com,987-654-3210,2023-06-30,FALSE
3,Peter,Pan,peter.pan@example.com,555-123-4567,2023-12-31,TRUE
4,Mary,Poppins,mary.poppins@example.com,111-222-3333,2022-12-31,TRUE
5,Empty,,empty@example.com,,2024-01-01,FALSE
6,Long,Stringgggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg,long.string@example.com,999-888-7777,2023-07-15,TRUE
7,Special,Chars,special.chars@example.com,!@#$%^&*()_+=-`~<>?,./:";'{}[],2023-04-01,FALSE
8,日本語,漢字,nihongo.kanji@example.com,123-456-7890,2023-09-30,TRUE
9,Emoji,😁,emoji@example.com,123-456-7890,2023-10-15,TRUE
10,NULL,NULL,null@example.com,NULL,NULL,NULL

## stg_orders.csv
order_id,customer_id,order_date,status
1,1,2023-01-15,shipped
2,1,2023-03-20,completed
3,2,2023-07-01,cancelled
4,3,2023-10-01,pending
5,4,2022-12-25,shipped
6,4,2022-12-31,completed
7,6,2023-07-16,shipped
8,8,2023-10-01,completed
9,9,2023-10-16,pending

## stg_payments.csv
payment_id,order_id,amount,payment_date
1,1,100.00,2023-01-15
2,2,50.50,2023-03-20
3,4,200.00,2023-10-01
4,5,75.25,2022-12-25
5,6,120.75,2022-12-31
6,7,500.00,2023-07-16
7,8,300.00,2023-10-01
8,9,150.00,2023-10-16

## final.csv
customer_id,first_name,last_name,first_order,most_recent_order,number_of_orders,customer_lifetime_value
1,John,Doe,2023-01-15,2023-03-20,2,150.50
2,Jane,Doe,2023-07-01,2023-07-01,1,
3,Peter,Pan,2023-10-01,2023-10-01,1,200.00
4,Mary,Poppins,2022-12-25,2022-12-31,2,196.00
5,Empty,,,,,
6,Long,Stringgggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg,2023-07-16,2023-07-16,1,500.00
7,Special,Chars,,,,,
8,日本語,漢字,2023-10-01,2023-10-01,1,300.00
9,Emoji,😁,2023-10-16,2023-10-16,1,150.00
10,NULL,NULL,,,,,

兒玉知己

おそらく、雑にSQLとYMLを投げるよりも、

SQL中に含まれるカラム
その意味と、代表的な値

くらいは事前にPythonコード中で加工して説明文を生成した方が良い気がする。

兒玉知己

sqlglotを使って、与えられるSQLのパースをする。

https://github.com/tobymao/sqlglot

兒玉知己

WIP

glotを使って、SQLに含まれるテーブルとカラムを取り出す。
それぞれの説明文をymlからつくる
プロンプトに仕込む。

追記
DifyでPython依存関係を追加するトコロでハマりました。
調査が面倒なので、LangChainであとで試します。

DifyでPythonの依存関係を追加する方法

ドキュメントには解説がないのだが、

ここに書いてあるとおり、Sandboxサービスのrequirements.txtに追記したら、追加されるらしい。
sqlglotを書いたら、確かに追加はされたのだが、

from sqlglot import parse_one, exp, parse

は通るのに、

from sqlglot.optimizer.scope import build_scope

が通らない。

このスクラップは2024/05/29にクローズされました