make_qa.py - Q/A Pair Generation CLI Entry Point Documentation
Version 2.1 | Last Updated: 2025-02-07
Table of Contents
- Overview
- Architecture Diagram
- Class and Function List
- Module Structure Diagram
- Class and Function IPO Details
- Settings and Constants
- Usage Examples
- Change Log
- Appendix: Dependency Diagram
- Appendix: Execution Flowchart
- Appendix: CLI Argument Specifications
Overview
make_qa.py is the CLI entry point for automatically generating Q/A pairs from chunked CSV files or predefined datasets. It calls QAPipeline to generate Q/A pairs using Celery parallel processing or synchronous processing.
The full source code is available on GitHub. Please clone it to your local environment for inspection.
Main Responsibilities
- CLI argument parsing and validation (input sources, models, parallelism, etc.)
- Verification of input file existence and format (CSV only)
- Verification of environment variable (GOOGLE_API_KEY) existence
- Initialization and execution control of QAPipeline
- Output of execution summary logs
Key Features List
| Feature | Description |
| --- | --- |
| main() | CLI entry point. Handles argument parsing, validation, pipeline execution, and result output |
Prerequisites
- Input CSV files must already be chunked (processed via csv_text_to_chunks_text_csv.py)
- The GOOGLE_API_KEY environment variable must be set
1. Architecture Diagram
1.1 Overall System Configuration
1.2 Data Flow
- User executes make_qa.py with CLI arguments
- main() parses arguments and validates the input file/dataset and API key
- Initializes QAPipeline and executes pipeline.run()
- The pipeline reads the CSV and calls the Gemini API to generate Q/A pairs
- Saves results to the qa_output/pipeline directory
- Logs a summary (number of generated Q/A pairs, coverage rate, etc.)
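The order of operations above can be sketched roughly as follows. StubPipeline is a hypothetical stand-in for QAPipeline: its constructor parameters mirror the ones this document lists, but the run() keyword and the result keys (output_dir, qa_count, coverage_rate) are assumptions made for illustration only.

```python
import logging

logger = logging.getLogger(__name__)


class StubPipeline:
    """Hypothetical stand-in for QAPipeline (the real class lives in qa_generation.pipeline)."""

    def __init__(self, dataset_name=None, input_file=None, model=None,
                 output_dir=None, max_docs=None):
        self.input_file = input_file
        self.output_dir = output_dir

    def run(self, use_celery=False):
        # The real pipeline reads the CSV and calls the Gemini API here.
        # The keys below are assumed, not taken from the real return value.
        return {"output_dir": self.output_dir, "qa_count": 0, "coverage_rate": None}


def run_flow(input_file, model, output_dir):
    """Initialize the pipeline, run it, and log a short summary."""
    pipeline = StubPipeline(input_file=input_file, model=model, output_dir=output_dir)
    result = pipeline.run(use_celery=False)
    logger.info("Output: %s | Q/A pairs: %s | coverage: %s",
                result["output_dir"], result["qa_count"], result["coverage_rate"])
    return result
```

Validation of the API key and input file is assumed to have happened before run_flow() is called.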
2. Class and Function List
2.1 Class List
There are no class definitions in this module.
2.2 Function List (Name, Overview, Configuration)
| Name | Overview | Configuration |
| --- | --- | --- |
| main() | CLI entry point. Handles argument parsing, validation, pipeline execution, and result output | Refer to "main() Internal Configuration" below |
2.3 Constant List
| Name | Overview | Configuration |
| --- | --- | --- |
| PROJECT_ROOT | Absolute path to the project root | Calculated via os.path.dirname(os.path.dirname(os.path.abspath(__file__))) |
| logger | Module logger | Obtained via logging.getLogger(__name__) |
2.4 main() Internal Configuration
main() is a single function, but it is composed of the following 6 processing blocks:
| # | Processing Block | Overview | Line Range | Key Processing Tasks |
| --- | --- | --- | --- | --- |
| 1 | Argument Parsing | Define and parse CLI arguments | 60-183 | Defines 5 categories of arguments via argparse and parses them using parse_args() |
| 2 | API Key Check | Environment variable check | 188-190 | Executes sys.exit(1) if GOOGLE_API_KEY is not set |
| 3 | Input File Validation | Check file existence and format | 195-205 | Confirms file existence and checks for .csv extension |
| 4 | Settings Log Display | Output execution settings to log | 210-233 | Logs input source, model, generation mode, and parallel settings |
| 5 | Pipeline Execution | Run Q/A generation | 239-258 | Initialize QAPipeline → execute pipeline.run() |
| 6 | Result Display | Output summary log | 263-272 | Outputs generated file path, Q/A count, and coverage rate |
📝 Note: If an error occurs in block 5 or 6, main() prints a traceback and terminates with sys.exit(1).
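A minimal sketch of that error handling, assuming blocks 5 and 6 are wrapped in a single try/except. The names run_guarded, execute_pipeline, and show_results are hypothetical and are not taken from make_qa.py.

```python
import sys
import traceback


def run_guarded(execute_pipeline, show_results):
    """Run blocks 5 and 6; on any exception print a traceback and exit with 1."""
    try:
        result = execute_pipeline()
        show_results(result)
    except Exception:
        traceback.print_exc()
        sys.exit(1)
```

Catching Exception rather than BaseException lets Ctrl+C (KeyboardInterrupt) propagate normally; whether the real module does the same is not stated here.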
2.5 Argument Definition Categories (Block 1 Details)
The argparse argument definitions within main() are classified into the following 5 categories:
| # | Category | Count | Arguments Included |
| --- | --- | --- | --- |
| 1 | Input Source (Mutually Exclusive/Required) | 2 | --dataset, --input-file |
| 2 | Common Parameters | 3 | --model, --output, --max-docs |
| 3 | Coverage Analysis | 2 | --analyze-coverage, --coverage-threshold |
| 4 | Q/A Generation | 3 | --batch-chunks, --use-smart-generation, --no-smart-generation |
| 5 | Celery Parallel Processing | 3 | --use-celery, -c/--concurrency, --celery-workers |
| | Total | 13 | |
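The categories above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the module's actual code: build_parser and its project_root parameter are hypothetical, help strings are omitted, and having both smart-generation flags write to one shared destination is a guess at how the pair is wired.

```python
import argparse
import os


def build_parser(project_root="."):
    """Build a parser with the 13 documented arguments."""
    parser = argparse.ArgumentParser(description="Q/A pair generation")

    # 1. Input source: exactly one of the two must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--dataset", type=str)
    source.add_argument("--input-file", type=str)

    # 2. Common parameters.
    parser.add_argument("--model", type=str, default="gemini-3.0-flash")
    parser.add_argument("--output", type=str,
                        default=os.path.join(project_root, "qa_output", "pipeline"))
    parser.add_argument("--max-docs", type=int, default=None)

    # 3. Coverage analysis.
    parser.add_argument("--analyze-coverage", action="store_true")
    parser.add_argument("--coverage-threshold", type=float, default=None)

    # 4. Q/A generation: both flags write to the same destination (an assumption).
    parser.add_argument("--batch-chunks", type=int, default=3)
    parser.add_argument("--use-smart-generation", dest="use_smart_generation",
                        action="store_true", default=True)
    parser.add_argument("--no-smart-generation", dest="use_smart_generation",
                        action="store_false")

    # 5. Celery parallel processing.
    parser.add_argument("--use-celery", action="store_true")
    parser.add_argument("-c", "--concurrency", type=int, default=8)
    parser.add_argument("--celery-workers", type=int, default=1)
    return parser
```

With a required mutually exclusive group, argparse itself rejects a command line that passes both --dataset and --input-file, or neither, so no extra check is needed for that rule.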
3. Module Structure Diagram
3.1 Internal Module Structure
3.2 External Dependencies
| Library | Version | Purpose |
| --- | --- | --- |
| argparse | Standard | CLI argument parsing |
| logging | Standard | Log output |
| os | Standard | Environment variables and path operations |
| sys | Standard | Path additions and exit code control |
3.3 Internal Dependency Modules
| Module | Purpose |
| --- | --- |
| qa_generation.pipeline.QAPipeline | Q/A generation pipeline class |
| config.DATASET_CONFIGS | Configuration dictionary for predefined datasets |
4. Class and Function IPO Details
4.1 Entry Point Function
main
Overview: An entry point function that parses CLI arguments and initializes/executes the QAPipeline. Includes input validation, logging, and error handling.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| - | - | - | No parameters (obtained via sys.argv from CLI arguments) |
| Item | Content |
| --- | --- |
| Input | CLI arguments (via sys.argv): --dataset or --input-file, --model, --output, --max-docs, --use-celery, -c, --batch-chunks, --use-smart-generation / --no-smart-generation, --analyze-coverage, --coverage-threshold, --celery-workers |
| Process | 1. Parse arguments with argparse (input source is mutually exclusive and required) 2. Verify the existence of the GOOGLE_API_KEY environment variable 3. Validate file existence and extension (only .csv) 4. Log settings (input, model, generation mode, parallel settings) 5. Initialize QAPipeline (dataset_name, input_file, model, output_dir, max_docs) 6. Run pipeline.run() (passing settings for Celery/sync, smart generation, etc.) 7. Log results summary (file path, Q/A count, coverage rate) 8. Display traceback and terminate with sys.exit(1) on error |
| Output | None (logs to standard output; file output is handled by QAPipeline) |
Exit Codes:
| Code | Condition |
| --- | --- |
| 0 | Successful termination |
| 1 | GOOGLE_API_KEY not set / Input file missing / Non-CSV input / Runtime error |
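The pre-flight conditions that lead to exit code 1 can be sketched as a small function. preflight_exit_code is a hypothetical name, and two details are assumptions: the extension check is case-insensitive, and a regular-file check (os.path.isfile) stands in for "file exists".

```python
import os


def preflight_exit_code(input_file, env):
    """Return 1 for the documented failure conditions, otherwise 0."""
    if not env.get("GOOGLE_API_KEY"):
        return 1  # API key not set
    if input_file is not None:  # None means --dataset was used instead
        if not os.path.isfile(input_file):
            return 1  # input file missing
        if os.path.splitext(input_file)[1].lower() != ".csv":
            return 1  # non-CSV input
    return 0
```

A caller would pass os.environ as env and hand any non-zero result to sys.exit().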
# Usage example (executed from the command line)
# Generate Q/A from chunked CSV (synchronous processing)
# $ python qa_qdrant/make_qa.py \
# --input-file output_chunked/data_chunks.csv \
# --analyze-coverage
# Generate Q/A using Celery parallel processing
# $ python qa_qdrant/make_qa.py \
# --input-file output_chunked/data_chunks.csv \
# --use-celery \
# -c 8 \
# --use-smart-generation \
# --analyze-coverage
5. Settings and Constants
5.1 PROJECT_ROOT
Absolute path to the project root directory. Used as the basis for default values of output directories.
PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
| Constant Name | Value | Description |
| --- | --- | --- |
| PROJECT_ROOT | Parent of the directory that contains make_qa.py | Basis for the default output path ({PROJECT_ROOT}/qa_output/pipeline) |
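To make the path arithmetic concrete, the sketch below applies the same two dirname calls to an arbitrary script path. The helper names project_root_for and default_output_dir are hypothetical; the qa_qdrant/ location is taken from the usage examples in this document.

```python
import os


def project_root_for(script_path):
    """Return the parent of the directory that contains script_path."""
    return os.path.dirname(os.path.dirname(os.path.abspath(script_path)))


def default_output_dir(script_path):
    """Default output directory derived from the project root."""
    return os.path.join(project_root_for(script_path), "qa_output", "pipeline")
```

For a script at repo/qa_qdrant/make_qa.py, the first dirname yields repo/qa_qdrant and the second yields repo, so the default output directory becomes repo/qa_output/pipeline (as an absolute path).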
5.2 List of Default Values
| Item | Default Value | Description |
| --- | --- | --- |
| --model | gemini-3.0-flash | Gemini model to use |
| --output | {PROJECT_ROOT}/qa_output/pipeline | Output directory |
| --batch-chunks | 3 | Number of chunks to process in one API call (1-5) |
| --concurrency | 8 | Number of parallel tasks |
| --use-smart-generation | True | Smart Q/A generation (dynamic Q/A count determined by LLM) |
6. Usage Examples
6.1 Basic Workflow (Synchronous Processing)
# Generate Q/A from chunked CSV (synchronous processing)
# $ python qa_qdrant/make_qa.py \
# --input-file output_chunked/data_chunks.csv \
# --analyze-coverage
6.2 Celery Parallel Processing
# 1. Start Celery workers (in a separate terminal)
# $ ./start_celery.sh -c 8
# 2. Generate Q/A using Celery parallel processing
# $ python qa_qdrant/make_qa.py \
# --input-file output_chunked/data_chunks.csv \
# --use-celery \
# -c 8 \
# --use-smart-generation \
# --analyze-coverage
6.3 Using a Predefined Dataset
# Process the wikipedia_ja dataset
# $ python qa_qdrant/make_qa.py \
# --dataset wikipedia_ja \
# --use-celery \
# -c 4
6.4 Limiting Number of Chunks (For Testing)
# Process only the first 100 chunks
# $ python qa_qdrant/make_qa.py \
# --input-file output_chunked/large_data.csv \
# --max-docs 100 \
# --analyze-coverage
6.5 Traditional Q/A Generation Method
# Disable smart generation (fixed Q/A count based on token count)
# $ python qa_qdrant/make_qa.py \
# --input-file output_chunked/data_chunks.csv \
# --no-smart-generation \
# --analyze-coverage
7. Change Log
| Version | Changes |
| --- | --- |
| 1.0 | Initial release |
| 2.0 | Completely recreated to comply with the a_class_method_md_format.md specification |
| 2.1 | Added "2. Class and Function List" section (table of names, overviews, structures, main() internal structure, and argument definition categories). Re-numbered sections |
| 3.0 | Support for pipeline.py v3.0. Unified --input-chunks to --input-file. Removed chunk-related arguments. Added -c, --concurrency argument. Added --use-smart-generation / --no-smart-generation arguments |
Appendix: Dependency Diagram
Appendix: Execution Flowchart
Appendix: CLI Argument Specifications
| Argument | Type | Description |
| --- | --- | --- |
| --dataset | str | Name of predefined dataset (key in DATASET_CONFIGS) |
| --input-file | str | Path to chunked CSV file |
📝 Note: --dataset and --input-file are mutually exclusive. One of them must be specified.
Common Parameters
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --model | str | gemini-3.0-flash | Gemini model to use |
| --output | str | {PROJECT_ROOT}/qa_output/pipeline | Output directory |
| --max-docs | int | None | Maximum number of chunks to process |
Coverage Analysis Parameters
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --analyze-coverage | flag | False | Run coverage analysis |
| --coverage-threshold | float | None | Similarity threshold for coverage evaluation |
Q/A Generation Parameters
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --batch-chunks | int | 3 | Number of chunks to process in one API call (1-5) |
| --use-smart-generation | flag | True | Use smart Q/A generation (dynamic count determined by LLM) |
| --no-smart-generation | flag | - | Use traditional Q/A generation (based on token count) |
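The table gives a 1-5 range for --batch-chunks but does not say whether or how make_qa.py enforces it. One way to enforce such a range at parse time is a custom argparse type, sketched below; batch_chunks_type is a hypothetical helper, not part of the module.

```python
import argparse


def batch_chunks_type(value):
    """argparse type that accepts integers from 1 to 5 only."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if not 1 <= number <= 5:
        raise argparse.ArgumentTypeError("--batch-chunks must be between 1 and 5")
    return number


parser = argparse.ArgumentParser()
parser.add_argument("--batch-chunks", type=batch_chunks_type, default=3)
```

Raising ArgumentTypeError makes argparse print a usage error and exit before any pipeline work starts.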
Celery Parallel Processing Parameters
| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --use-celery | flag | False | Use asynchronous parallel processing via Celery |
| -c, --concurrency | int | 8 | Number of parallel tasks. It is recommended to match the -c value passed to start_celery.sh |
| --celery-workers | int | 1 | Number of Celery worker processes to check for |
⚠️ Deprecated: --celery-workers is deprecated. Please use --concurrency instead.
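Whether make_qa.py warns at runtime when the deprecated flag is used is not stated here. If such a warning were wanted, a custom argparse action is one option; the sketch below is an assumption-based illustration, and DeprecatedWorkers is a hypothetical class name.

```python
import argparse
import warnings


class DeprecatedWorkers(argparse.Action):
    """Store the value and warn that --celery-workers is deprecated."""

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(f"{option_string} is deprecated; use --concurrency instead",
                      DeprecationWarning, stacklevel=2)
        setattr(namespace, self.dest, values)


parser = argparse.ArgumentParser()
parser.add_argument("-c", "--concurrency", type=int, default=8)
parser.add_argument("--celery-workers", type=int, default=1, action=DeprecatedWorkers)
```

The action only runs when the flag actually appears on the command line, so the documented default of 1 stays in place without triggering the warning.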
Removed Arguments (v3.0)
| Removed Argument | Replacement |
| --- | --- |
| --input-chunks | Unified into --input-file |
| --merge-chunks | Removed (handled in prior chunking stage) |
| --min-tokens | Removed |
| --max-tokens | Removed |
| --overlap-tokens | Removed |
| --use-similarity | Removed |
| --similarity-threshold | Removed |
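When migrating older scripts, it can help to scan a command line for the removed flags before running it. The helper below is a hypothetical convenience and not part of make_qa.py; its mapping simply restates the table above.

```python
REMOVED_ARGUMENTS = {
    "--input-chunks": "use --input-file instead",
    "--merge-chunks": "removed (handled in the prior chunking stage)",
    "--min-tokens": "removed",
    "--max-tokens": "removed",
    "--overlap-tokens": "removed",
    "--use-similarity": "removed",
    "--similarity-threshold": "removed",
}


def find_removed_arguments(argv):
    """Return (flag, note) pairs for every removed flag found in argv."""
    hits = []
    for token in argv:
        flag = token.split("=", 1)[0]  # also handles the --flag=value form
        if flag in REMOVED_ARGUMENTS:
            hits.append((flag, REMOVED_ARGUMENTS[flag]))
    return hits
```

Calling find_removed_arguments(sys.argv[1:]) on an old invocation lists the flags that need to be dropped or renamed.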