[Hands-on RAG] (3/6) Creating Q/A Pairs & Building the QA Pipeline

make_qa.py - Q/A Pair Generation CLI Entry Point Documentation

Version 2.1 | Last Updated: 2025-02-07


Table of Contents

  Overview
  1. Architecture Diagram
  2. Class and Function List
  3. Module Structure Diagram
  4. Class and Function IPO Details
  5. Settings and Constants
  6. Usage Examples
  7. Change Log
  Appendix: Dependency Diagram
  Appendix: Execution Flowchart
  Appendix: CLI Argument Specifications

Overview

make_qa.py is the CLI entry point for automatically generating Q/A pairs from chunked CSV files or predefined datasets. It calls QAPipeline to generate Q/A pairs using Celery parallel processing or synchronous processing.

The full source code is available on GitHub. Please clone it to your local environment for inspection.

Main Responsibilities

  • CLI argument parsing and validation (input sources, models, parallelism, etc.)
  • Verification of input file existence and format (CSV only)
  • Verification of environment variable (GOOGLE_API_KEY) existence
  • Initialization and execution control of QAPipeline
  • Output of execution summary logs

Key Features List

| Feature | Description |
|---|---|
| main() | CLI entry point. Handles argument parsing, validation, pipeline execution, and result output |

Prerequisites

  • Input CSV files must already be chunked (processed via csv_text_to_chunks_text_csv.py)
  • The GOOGLE_API_KEY environment variable must be set

1. Architecture Diagram

1.1 Overall System Configuration

1.2 Data Flow

  1. User executes make_qa.py with CLI arguments
  2. main() parses arguments and validates the input file/dataset and API key
  3. Initializes QAPipeline and executes pipeline.run()
  4. The pipeline reads the CSV and calls the Gemini API to generate Q/A pairs
  5. Saves results to the qa_output/pipeline directory
  6. Logs a summary (number of generated Q/A pairs, coverage rate, etc.)

2. Class and Function List

2.1 Class List

There are no class definitions in this module.

2.2 Function List (Name, Overview, Configuration)

| Name | Overview | Configuration |
|---|---|---|
| main() | CLI entry point. Handles argument parsing, validation, pipeline execution, and result output | Refer to "2.4 main() Internal Configuration" below |

2.3 Constant List

| Name | Overview | Configuration |
|---|---|---|
| PROJECT_ROOT | Absolute path to the project root | Calculated via os.path.dirname(os.path.dirname(os.path.abspath(__file__))) |
| logger | Module logger | Obtained via logging.getLogger(__name__) |

2.4 main() Internal Configuration

main() is a single function, but it is composed of the following 6 processing blocks:

| # | Processing Block | Overview | Line Range | Key Processing Tasks |
|---|---|---|---|---|
| 1 | Argument Parsing | Define and parse CLI arguments | 60-183 | Defines 5 categories of arguments via argparse and parses them using parse_args() |
| 2 | API Key Check | Environment variable check | 188-190 | Executes sys.exit(1) if GOOGLE_API_KEY is not set |
| 3 | Input File Validation | Check file existence and format | 195-205 | Confirms file existence and checks for the .csv extension |
| 4 | Setting Log Display | Output execution settings to log | 210-233 | Logs input source, model, generation mode, and parallel settings |
| 5 | Pipeline Execution | Run Q/A generation | 239-258 | Initializes QAPipeline, then executes pipeline.run() |
| 6 | Result Display | Output summary log | 263-272 | Outputs generated file path, Q/A count, and coverage rate |

📝 Note: If an error occurs in block 5 or 6, main() displays a traceback and terminates with sys.exit(1).
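
The error handling described in the note can be sketched roughly as follows. This is a minimal illustration, not the actual code of make_qa.py: the helper name run_or_exit and the idea of passing the pipeline step as a callable are assumptions introduced for this example.

```python
import logging
import sys
import traceback
from typing import Callable

logger = logging.getLogger(__name__)


def run_or_exit(step: Callable[[], object]) -> object:
    """Run one processing step; on any error, print a traceback and exit with code 1.

    Hypothetical helper: the real script may inline this logic inside main().
    """
    try:
        return step()
    except Exception as exc:
        logger.error("Error during execution: %s", exc)
        traceback.print_exc()  # writes the traceback to stderr
        sys.exit(1)            # raises SystemExit with exit code 1
```

Wrapping blocks 5 and 6 in one try/except keeps the exit code consistent no matter which of the two steps fails.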

2.5 Argument Definition Categories (Block 1 Details)

The argparse argument definitions within main() are classified into the following 5 categories:

| # | Category | Count | Arguments Included |
|---|---|---|---|
| 1 | Input Source (Mutually Exclusive/Required) | 2 | --dataset, --input-file |
| 2 | Common Parameters | 3 | --model, --output, --max-docs |
| 3 | Coverage Analysis | 2 | --analyze-coverage, --coverage-threshold |
| 4 | Q/A Generation | 3 | --batch-chunks, --use-smart-generation, --no-smart-generation |
| 5 | Celery Parallel Processing | 3 | --use-celery, -c/--concurrency, --celery-workers |
| | Total | 13 | |
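
As a rough illustration of how these categories could be declared with argparse, the sketch below defines all 13 arguments with the defaults listed in this document. It is not copied from make_qa.py. The function name build_parser, the relative default for --output, the use of choices to enforce the 1-5 range, and the shared dest for the two smart-generation flags are all assumptions.

```python
import argparse
import os

# Assumption: the real default is {PROJECT_ROOT}/qa_output/pipeline; a relative
# path is used here so that the sketch stays self-contained.
DEFAULT_OUTPUT = os.path.join("qa_output", "pipeline")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Q/A pair generation (illustrative sketch)")

    # 1. Input source: exactly one of the two must be given
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--dataset", type=str, help="Name of a predefined dataset")
    source.add_argument("--input-file", type=str, help="Path to a chunked CSV file")

    # 2. Common parameters
    parser.add_argument("--model", type=str, default="gemini-3.0-flash")
    parser.add_argument("--output", type=str, default=DEFAULT_OUTPUT)
    parser.add_argument("--max-docs", type=int, default=None)

    # 3. Coverage analysis
    parser.add_argument("--analyze-coverage", action="store_true")
    parser.add_argument("--coverage-threshold", type=float, default=None)

    # 4. Q/A generation (both smart-generation flags write to the same attribute)
    parser.add_argument("--batch-chunks", type=int, default=3, choices=range(1, 6))
    parser.add_argument("--use-smart-generation", dest="use_smart_generation",
                        action="store_true", default=True)
    parser.add_argument("--no-smart-generation", dest="use_smart_generation",
                        action="store_false")

    # 5. Celery parallel processing
    parser.add_argument("--use-celery", action="store_true")
    parser.add_argument("-c", "--concurrency", type=int, default=8)
    parser.add_argument("--celery-workers", type=int, default=1)
    return parser
```

Note that when argparse itself rejects the command line (for example, when both --dataset and --input-file are given), it exits with code 2, which is separate from the sys.exit(1) paths described for main().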

3. Module Structure Diagram

3.1 Internal Module Structure

3.2 External Dependencies

| Library | Version | Purpose |
|---|---|---|
| argparse | Standard | CLI argument parsing |
| logging | Standard | Log output |
| os | Standard | Environment variables and path operations |
| sys | Standard | Path additions and exit code control |

3.3 Internal Dependency Modules

| Module | Purpose |
|---|---|
| qa_generation.pipeline.QAPipeline | Q/A generation pipeline class |
| config.DATASET_CONFIGS | Configuration dictionary for predefined datasets |

4. Class and Function IPO Details

4.1 Entry Point Function

main

Overview: An entry point function that parses CLI arguments and initializes/executes the QAPipeline. Includes input validation, logging, and error handling.

def main() -> None

| Parameter | Type | Default | Description |
|---|---|---|---|
| - | - | - | No parameters (CLI arguments are obtained via sys.argv) |

Input: CLI arguments (via sys.argv): --dataset or --input-file, --model, --output, --max-docs, --use-celery, -c, --batch-chunks, --use-smart-generation / --no-smart-generation, --analyze-coverage, --coverage-threshold, --celery-workers

Process:

  1. Parse arguments with argparse (input source is mutually exclusive and required)
  2. Verify the existence of the GOOGLE_API_KEY environment variable
  3. Validate file existence and extension (only .csv)
  4. Log settings (input, model, generation mode, parallel settings)
  5. Initialize QAPipeline (dataset_name, input_file, model, output_dir, max_docs)
  6. Run pipeline.run() (passing settings for Celery/sync, smart generation, etc.)
  7. Log results summary (file path, Q/A count, coverage rate)
  8. Display traceback and terminate with sys.exit(1) on error

Output: None (logs to standard output; file output is handled by QAPipeline)

Exit Codes:

| Code | Condition |
|---|---|
| 0 | Successful termination |
| 1 | GOOGLE_API_KEY not set / Input file missing / Non-CSV input / Runtime error |

# Usage example (executed from the command line)
# Generate Q/A from chunked CSV (synchronous processing)
# $ python qa_qdrant/make_qa.py \
#     --input-file output_chunked/data_chunks.csv \
#     --analyze-coverage

# Generate Q/A using Celery parallel processing
# $ python qa_qdrant/make_qa.py \
#     --input-file output_chunked/data_chunks.csv \
#     --use-celery \
#     -c 8 \
#     --use-smart-generation \
#     --analyze-coverage
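
The startup checks that lead to exit code 1 (API key, file existence, .csv extension) could look something like the sketch below. The function name find_startup_error, its signature, and the message texts are hypothetical. The order of the checks follows the process steps listed for main().

```python
import os
from typing import Mapping, Optional


def find_startup_error(env: Mapping[str, str], input_file: Optional[str]) -> Optional[str]:
    """Return a message for the first failed check, or None if all checks pass.

    Hypothetical helper: in this sketch the caller would log the message and
    call sys.exit(1) whenever the result is not None.
    """
    if not env.get("GOOGLE_API_KEY"):
        return "GOOGLE_API_KEY is not set"
    if input_file is not None:  # nothing to check on disk when --dataset is used
        if not os.path.isfile(input_file):
            return f"Input file not found: {input_file}"
        if not input_file.lower().endswith(".csv"):
            return f"Only CSV input is supported: {input_file}"
    return None
```

Passing the environment as a mapping (os.environ in real use) makes the checks easy to exercise without modifying the process environment.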

5. Settings and Constants

5.1 PROJECT_ROOT

Absolute path to the project root directory. Used as the basis for default values of output directories.

PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

| Constant Name | Value | Description |
|---|---|---|
| PROJECT_ROOT | Parent of the directory that contains make_qa.py | Basis for the default output path ({PROJECT_ROOT}/qa_output/pipeline) |
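
To make the path calculation concrete, the following sketch applies the same expression to an arbitrary script path. The helper names project_root_for and default_output_dir and the example path /proj/qa_qdrant/make_qa.py are assumptions for illustration only.

```python
import os


def project_root_for(script_path: str) -> str:
    """Apply the PROJECT_ROOT expression to an arbitrary script path."""
    # First dirname: directory containing the script; second dirname: its parent
    return os.path.dirname(os.path.dirname(os.path.abspath(script_path)))


def default_output_dir(project_root: str) -> str:
    """Build the default output directory described in this section."""
    return os.path.join(project_root, "qa_output", "pipeline")


# Example with a hypothetical layout: <root>/proj/qa_qdrant/make_qa.py
script = os.path.join(os.sep, "proj", "qa_qdrant", "make_qa.py")
root = project_root_for(script)
print(root)
print(default_output_dir(root))
```

With this layout, the project root resolves to the proj directory, and the default output directory sits under it as qa_output/pipeline.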

5.2 List of Default Values

| Item | Default Value | Description |
|---|---|---|
| --model | gemini-3.0-flash | Gemini model to use |
| --output | {PROJECT_ROOT}/qa_output/pipeline | Output directory |
| --batch-chunks | 3 | Number of chunks to process in one API call (1-5) |
| --concurrency | 8 | Number of parallel tasks |
| --use-smart-generation | True | Smart Q/A generation (dynamic Q/A count determined by LLM) |

6. Usage Examples

6.1 Basic Workflow (Synchronous Processing)

# Generate Q/A from chunked CSV (synchronous processing)
# $ python qa_qdrant/make_qa.py \
#     --input-file output_chunked/data_chunks.csv \
#     --analyze-coverage

6.2 Celery Parallel Processing

# 1. Start Celery workers (in a separate terminal)
# $ ./start_celery.sh -c 8

# 2. Generate Q/A using Celery parallel processing
# $ python qa_qdrant/make_qa.py \
#     --input-file output_chunked/data_chunks.csv \
#     --use-celery \
#     -c 8 \
#     --use-smart-generation \
#     --analyze-coverage

6.3 Using a Predefined Dataset

# Process the wikipedia_ja dataset
# $ python qa_qdrant/make_qa.py \
#     --dataset wikipedia_ja \
#     --use-celery \
#     -c 4

6.4 Limiting Number of Chunks (For Testing)

# Process only the first 100 chunks
# $ python qa_qdrant/make_qa.py \
#     --input-file output_chunked/large_data.csv \
#     --max-docs 100 \
#     --analyze-coverage

6.5 Traditional Q/A Generation Method

# Disable smart generation (fixed Q/A count based on token count)
# $ python qa_qdrant/make_qa.py \
#     --input-file output_chunked/data_chunks.csv \
#     --no-smart-generation \
#     --analyze-coverage

7. Change Log

| Version | Changes |
|---|---|
| 1.0 | Initial release |
| 2.0 | Completely recreated to comply with the a_class_method_md_format.md specification |
| 2.1 | Added "2. Class and Function List" section (table of names, overviews, structures, main() internal structure, and argument definition categories). Re-numbered sections |
| 3.0 | Support for pipeline.py v3.0. Unified --input-chunks to --input-file. Removed chunk-related arguments. Added -c, --concurrency argument. Added --use-smart-generation / --no-smart-generation arguments |

Appendix: Dependency Diagram


Appendix: Execution Flowchart


Appendix: CLI Argument Specifications

Input Source (Mutually Exclusive, Required)

| Argument | Type | Description |
|---|---|---|
| --dataset | str | Name of predefined dataset (key in DATASET_CONFIGS) |
| --input-file | str | Path to chunked CSV file |

📝 Note: --dataset and --input-file are mutually exclusive. One of them must be specified.

Common Parameters

| Argument | Type | Default | Description |
|---|---|---|---|
| --model | str | gemini-3.0-flash | Gemini model to use |
| --output | str | {PROJECT_ROOT}/qa_output/pipeline | Output directory |
| --max-docs | int | None | Maximum number of chunks to process |

Coverage Analysis Parameters

| Argument | Type | Default | Description |
|---|---|---|---|
| --analyze-coverage | flag | False | Run coverage analysis |
| --coverage-threshold | float | None | Similarity threshold for coverage evaluation |

Q/A Generation Parameters

| Argument | Type | Default | Description |
|---|---|---|---|
| --batch-chunks | int | 3 | Number of chunks to process in one API call (1-5) |
| --use-smart-generation | flag | True | Use smart Q/A generation (dynamic count determined by LLM) |
| --no-smart-generation | flag | - | Use traditional Q/A generation (based on token count) |

Celery Parallel Processing Parameters

| Argument | Type | Default | Description |
|---|---|---|---|
| --use-celery | flag | False | Use asynchronous parallel processing via Celery |
| -c, --concurrency | int | 8 | Number of parallel tasks. Recommended to match the -c value passed to start_celery.sh |
| --celery-workers | int | 1 | Number of Celery worker processes to check for |

⚠️ Deprecated: --celery-workers is deprecated. Please use --concurrency instead.
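
This document does not describe how make_qa.py reacts when the deprecated option is passed. One common pattern, shown here purely as an assumption, is a custom argparse action that stores the value as usual and emits a DeprecationWarning. The names DeprecatedStore and build_celery_parser are hypothetical.

```python
import argparse
import warnings


class DeprecatedStore(argparse.Action):
    """Store the value as usual, but warn that the option is deprecated."""

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated; use --concurrency instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        setattr(namespace, self.dest, values)


def build_celery_parser() -> argparse.ArgumentParser:
    # Only the Celery-related options are defined in this sketch.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-c", "--concurrency", type=int, default=8)
    parser.add_argument("--celery-workers", type=int, default=1, action=DeprecatedStore)
    return parser
```

The warning fires only when --celery-workers actually appears on the command line, so runs that rely on the default value stay silent.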

Removed Arguments (v3.0)

| Removed Argument | Replacement |
|---|---|
| --input-chunks | Unified into --input-file |
| --merge-chunks | Removed (handled in prior chunking stage) |
| --min-tokens | Removed |
| --max-tokens | Removed |
| --overlap-tokens | Removed |
| --use-similarity | Removed |
| --similarity-threshold | Removed |
