[RAG Implementation: Q/A Generation] (4/6) Building a Smart Q/A Generator

Complete Guide to smart_qa_generator.py (v2.5)

Overview

qa_generation/smart_qa_generator.py is an intelligent Q/A generation system that considers content context. Unlike traditional systems that generate a fixed number of Q/A pairs, this system performs chunk analysis using an LLM to dynamically determine the optimal number of Q/A pairs based on the information density, importance, and complexity of each chunk.

Full source code is available on GitHub. Please clone it to your local environment to explore.

Table of Contents

  1. Advantages of SmartQAGenerator
  2. Architecture
  3. Class and Function Overview
  4. IPO Details (Input/Process/Output)
  5. How to Use
  6. Decision Criteria and Q/A Count Logic
  7. Error Handling
  8. Settings & Parameters

Advantages of SmartQAGenerator

Comparison with Traditional Methods

| Perspective | Traditional Method | SmartQAGenerator |
|---|---|---|
| Q/A Count Determination | Fixed (e.g., 3 per chunk) | Dynamic (0–5 per chunk) |
| Content Consideration | None | Analyzes info density, importance, complexity |
| Meta-info Processing | Generates useless Q/A | 0 (Skipped) |
| High-Density Info | Potential info loss | Covers comprehensively with 4–5 pairs |
| Quality | Uniform (Low to Medium) | Optimized for content (High) |

Key Advantages

1. Content-Adaptive Q/A Count Determination

Traditional: Every chunk → Fixed 3 Q/A pairs
Smart: Chunk analysis → Optimal Q/A count (0–5)
  • Meta-info chunks (e.g., "See appendix for details") → 0 (Eliminates waste)
  • Simple facts (e.g., "The product is red") → 1
  • Standard descriptions (Multiple related info) → 2–3
  • High-density technical info (API specs, encryption details) → 4–5

2. Explicit Highlighting of Key Topics

By passing key_topics extracted during the analysis phase to the generation phase, important information is prioritized for Q/A creation.

# Analysis result example
{
    'qa_count': 4,
    'key_topics': ['Encryption method', 'Key length', 'Block size', 'Usage mode'],
    'importance_score': 0.9,
    'complexity': 'high'
}
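
As a rough illustration of how an analysis result like the one above could be turned into prompt hints, here is a minimal sketch. The helper name `build_hints`, the hint wording, and the 0.8 importance threshold are assumptions made for this example; the actual prompt text in smart_qa_generator.py may differ.

```python
from typing import Dict, List


def build_hints(analysis: Dict) -> str:
    """Turn an analysis dict into hint lines for the generation prompt (hypothetical helper)."""
    lines: List[str] = []
    topics = analysis.get("key_topics") or []
    if topics:
        lines.append("Prioritize these topics: " + ", ".join(topics))
    # The 0.8 threshold is an assumption for this sketch, not a value from the source.
    if analysis.get("importance_score", 0.5) >= 0.8:
        lines.append("This chunk is important; cover it thoroughly.")
    lines.append(f"Generate exactly {analysis.get('qa_count', 0)} Q/A pairs.")
    return "\n".join(lines)


example = {
    "qa_count": 4,
    "key_topics": ["Encryption method", "Key length", "Block size", "Usage mode"],
    "importance_score": 0.9,
    "complexity": "high",
}
print(build_hints(example))
```

Keeping the hints as plain text lines makes it easy to append them to whatever generation prompt is in use.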

3. Quality Improvement via Two-Stage Processing

  • Analysis Phase: Low temperature (0.1) for stable decisions
  • Generation Phase: Medium temperature (0.3) for natural text generation
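
The two-stage idea can be sketched without touching a real API by injecting the LLM call as a function. The names `run_two_stages` and `fake_llm` and the prompt strings are hypothetical; only the two temperature values come from the article.

```python
from typing import Callable, List, Tuple

ANALYSIS_TEMPERATURE = 0.1    # low: stable, repeatable decisions
GENERATION_TEMPERATURE = 0.3  # medium: more natural wording

# (prompt, temperature) -> response text; stands in for the real LLM call.
LLMCall = Callable[[str, float], str]


def run_two_stages(chunk_text: str, llm: LLMCall) -> Tuple[str, str]:
    """Call the LLM twice, once per phase, with phase-specific temperatures."""
    analysis = llm(f"Analyze this chunk:\n{chunk_text}", ANALYSIS_TEMPERATURE)
    qa_text = llm(
        f"Generate Q/A using this analysis:\n{analysis}\n\n{chunk_text}",
        GENERATION_TEMPERATURE,
    )
    return analysis, qa_text


# Stub that records the temperature of each call instead of contacting an API.
seen: List[float] = []


def fake_llm(prompt: str, temperature: float) -> str:
    seen.append(temperature)
    return f"response@{temperature}"


print(run_two_stages("AES uses a 128-bit block size.", fake_llm))
print(seen)
```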

4. Fallback Mechanism

Even in the event of API failure, processing continues with a character-count-based simplified determination.

# Fallback criteria (token_count is estimated as len(chunk_text) // 4)
token_count < 50    → 0
token_count < 100   → 1
token_count < 200   → 2
token_count >= 200  → 3

5. Statistical Analysis Function

Summary statistics let you assess the quality of the processing results numerically.

{
    'total_chunks': 100,
    'total_qa_pairs': 245,
    'avg_qa_per_chunk': 2.45,
    'avg_importance_score': 0.72,
    'qa_distribution': {0: 5, 1: 15, 2: 30, 3: 35, 4: 12, 5: 3}
}

Architecture

Overall Configuration

┌─────────────────────────────────────────────────────────────┐
│                   smart_qa_generator.py                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              SmartQAGenerator Class                 │    │
│  ├─────────────────────────────────────────────────────┤    │
│  │  __init__()           # Initialization/API settings │    │
│  │  _generate_content()  # LLM call (internal)         │    │
│  │  analyze_chunk()      # Chunk analysis              │    │
│  │  generate_qa_pairs()  # Q/A pair generation         │    │
│  │  process_chunk()      # Combined run (main)         │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │           Utility Functions                         │    │
│  ├─────────────────────────────────────────────────────┤    │
│  │  analyze_qa_statistics()  # Statistical analysis    │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│                    Google Gemini API                        │
│  ├─ google.genai (New API, recommended)                     │
│  └─ google.generativeai (Old API, fallback)                 │
└─────────────────────────────────────────────────────────────┘

Processing Flow


Class and Function Overview

Class List

| Class Name | Function Overview |
|---|---|
| SmartQAGenerator | Main class for intelligent Q/A generation that considers content. Provides both chunk analysis and Q/A generation capabilities. |

Method List (SmartQAGenerator)

| Method Name | Visibility | Function Overview |
|---|---|---|
| __init__ | public | Instance initialization. Configures the Gemini API client. |
| _generate_content | private | Sends prompts to LLM and retrieves responses. Supports both new and old APIs. |
| analyze_chunk | public | Analyzes chunk information density, importance, and complexity to determine the optimal Q/A count. |
| generate_qa_pairs | public | Generates Q/A pairs based on analysis results. |
| process_chunk | public | Main method that executes analysis and generation together. |

Utility Function List

| Function Name | Function Overview |
|---|---|
| analyze_qa_statistics | Performs statistical analysis on processing results from multiple chunks, calculating Q/A distribution, average importance, etc. |

IPO Details (Input/Process/Output)

SmartQAGenerator.__init__()

IPO

| Category | Content |
|---|---|
| Input | model: str (model name, default: "gemini-2.0-flash") |
| | api_key: Optional[str] (API key, retrieved from environment variables if None) |
| Process | 1. API version determination (New API/Old API) |
| | 2. Client instance generation |
| | 3. Save model name |
| Output | SmartQAGenerator instance |
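
One plausible way to implement the API version determination and key lookup is sketched below. The helper names `detect_gemini_api` and `resolve_api_key` are hypothetical, and the module may well use a plain try/except import instead; the sketch only checks which package is importable and does not create a client.

```python
import importlib.util
import os
from typing import Optional


def detect_gemini_api() -> Optional[str]:
    """Return "new" if google.genai is importable, "old" if only
    google.generativeai is, and None if neither is installed."""
    for label, module in (("new", "google.genai"), ("old", "google.generativeai")):
        try:
            if importlib.util.find_spec(module) is not None:
                return label
        except (ImportError, ValueError):
            # find_spec raises when the parent package "google" is missing or malformed.
            continue
    return None


def resolve_api_key(api_key: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit key; otherwise read GOOGLE_API_KEY from the environment."""
    return api_key if api_key is not None else os.environ.get("GOOGLE_API_KEY")


print("Detected API:", detect_gemini_api())
```

Checking the new package first mirrors the article's stated preference: google.genai is recommended, with google.generativeai kept only as a fallback.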

Process Flow


SmartQAGenerator.analyze_chunk()

IPO

| Category | Content |
|---|---|
| Input | chunk_text: str (text to analyze) |
| Process | 1. Build analysis prompt |
| | 2. Call LLM (temperature=0.1) |
| | 3. Parse JSON response |
| | 4. Validation (qa_count: 0-5, importance_score: 0.0-1.0) |
| | 5. Fallback on error |
| Output | Dict: {qa_count, key_topics, importance_score, complexity, reasoning} |

Process Flow

Output Structure

{
    'qa_count': int,           # Target Q/A count (0-5)
    'key_topics': List[str],   # Key topics
    'importance_score': float, # Importance (0.0-1.0)
    'complexity': str,         # Complexity (low/medium/high)
    'reasoning': str           # Reasoning for decision
}
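
The parse and validation steps might look roughly like the following. This is a sketch under assumptions: the function name `validate_analysis` is hypothetical, and clamping out-of-range values (rather than rejecting them) is one possible reading of "Validation"; the source does not say which approach the module takes.

```python
import json
from typing import Any, Dict


def validate_analysis(raw_response: str) -> Dict[str, Any]:
    """Parse the LLM's JSON reply and clamp values into their documented ranges."""
    data = json.loads(raw_response)  # raises json.JSONDecodeError on malformed JSON
    qa_count = max(0, min(5, int(data.get("qa_count", 0))))
    importance = max(0.0, min(1.0, float(data.get("importance_score", 0.5))))
    complexity = data.get("complexity", "medium")
    if complexity not in ("low", "medium", "high"):
        complexity = "medium"  # assumed default for unexpected values
    return {
        "qa_count": qa_count,
        "key_topics": list(data.get("key_topics", [])),
        "importance_score": importance,
        "complexity": complexity,
        "reasoning": str(data.get("reasoning", "")),
    }


print(validate_analysis('{"qa_count": 9, "importance_score": 1.7, "complexity": "extreme"}'))
```

A parse failure raises an exception here, which is where the caller would switch to the fallback described under Error Handling.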

SmartQAGenerator.generate_qa_pairs()

IPO

| Category | Content |
|---|---|
| Input | chunk_text: str |
| | analysis: Optional[Dict] (If None, performs auto-analysis) |
| Process | 1. Execute analyze_chunk if analysis is missing |
| | 2. Skip if qa_count=0 |
| | 3. Build topic and importance hints |
| | 4. Build Q/A generation prompt |
| | 5. Call LLM (temperature=0.3) |
| | 6. Parse JSON response |
| | 7. Complete missing topics |
| Output | List[Dict]: [{question, answer, topic}, ...] |

Process Flow

Output Structure

[
    {
        'question': str,  # Question text
        'answer': str,    # Answer text
        'topic': str      # Topic (1-3 words)
    },
    ...
]
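
Steps 6 and 7 can be sketched as a small parser. Reading "Complete missing topics" as "fill in a missing topic field" is an interpretation, and the name `parse_qa_pairs`, the `"general"` default, and the skipping of incomplete entries are all assumptions of this example.

```python
import json
from typing import Dict, List


def parse_qa_pairs(raw_response: str, default_topic: str = "general") -> List[Dict[str, str]]:
    """Parse a JSON array of Q/A objects and fill in any missing 'topic' field."""
    items = json.loads(raw_response)
    pairs: List[Dict[str, str]] = []
    for item in items:
        # Skip entries lacking a question or an answer (an assumption of this sketch).
        if not item.get("question") or not item.get("answer"):
            continue
        pairs.append({
            "question": item["question"],
            "answer": item["answer"],
            "topic": item.get("topic") or default_topic,
        })
    return pairs


raw = '[{"question": "What block size does AES use?", "answer": "128 bits."}]'
print(parse_qa_pairs(raw))
```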

SmartQAGenerator.process_chunk()

IPO

| Category | Content |
|---|---|
| Input | chunk_text: str |
| Process | 1. Execute analyze_chunk |
| | 2. Execute generate_qa_pairs (passing analysis) |
| | 3. Integrate results |
| | 4. Return failure result on error |
| Output | Dict: {analysis, qa_pairs, success} |

Process Flow

Output Structure

{
    'analysis': Dict,        # Result from analyze_chunk()
    'qa_pairs': List[Dict],  # Result from generate_qa_pairs()
    'success': bool          # Processing success flag
}
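
The control flow of process_chunk() can be approximated as below, with the analysis and generation steps injected as callables so the sketch runs without an API key. The function `process_chunk_sketch` and the stub functions are hypothetical; the failure result follows the shape the article lists under Return Values on Error.

```python
from typing import Any, Callable, Dict, List


def process_chunk_sketch(
    chunk_text: str,
    analyze: Callable[[str], Dict[str, Any]],
    generate: Callable[[str, Dict[str, Any]], List[Dict[str, str]]],
) -> Dict[str, Any]:
    """Run analysis, then generation; return a failure result if either raises."""
    try:
        analysis = analyze(chunk_text)
        qa_pairs = generate(chunk_text, analysis)
        return {"analysis": analysis, "qa_pairs": qa_pairs, "success": True}
    except Exception:
        return {"analysis": {}, "qa_pairs": [], "success": False}


def stub_analyze(text: str) -> Dict[str, Any]:
    return {"qa_count": 1, "key_topics": ["color"], "importance_score": 0.4, "complexity": "low"}


def stub_generate(text: str, analysis: Dict[str, Any]) -> List[Dict[str, str]]:
    return [{"question": "What color is the product?", "answer": "Red.", "topic": "color"}]


def failing_analyze(text: str) -> Dict[str, Any]:
    raise RuntimeError("simulated API failure")


print(process_chunk_sketch("The product is red.", stub_analyze, stub_generate)["success"])
print(process_chunk_sketch("The product is red.", failing_analyze, stub_generate))
```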

analyze_qa_statistics()

IPO

| Category | Content |
|---|---|
| Input | results: List[Dict] (Result list from process_chunk()) |
| Process | 1. Count total chunks |
| | 2. Count total Q/A |
| | 3. Calculate Q/A distribution |
| | 4. Calculate average Q/A count |
| | 5. Calculate average importance |
| Output | Dict: {total_chunks, total_qa_pairs, avg_qa_per_chunk, avg_importance_score, qa_distribution} |

Process Flow

Output Structure

{
    'total_chunks': int,           # Total chunk count
    'total_qa_pairs': int,         # Total Q/A count
    'avg_qa_per_chunk': float,     # Avg Q/A per chunk
    'avg_importance_score': float, # Avg importance score
    'qa_distribution': Dict[int, int]  # Q/A count distribution {0: 5, 1: 15, ...}
}
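
The aggregation steps above could be implemented along these lines. This is a sketch, not the module's code: the name `analyze_qa_statistics_sketch` is hypothetical, and skipping chunks that have no importance score (for example, failed ones) when averaging is an assumption.

```python
from collections import Counter
from typing import Any, Dict, List


def analyze_qa_statistics_sketch(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate process_chunk()-style results into summary statistics."""
    total_chunks = len(results)
    counts = [len(r.get("qa_pairs", [])) for r in results]
    # Chunks without an importance score (e.g. failed ones) are skipped; an assumption.
    scores = [
        r["analysis"]["importance_score"]
        for r in results
        if "importance_score" in r.get("analysis", {})
    ]
    return {
        "total_chunks": total_chunks,
        "total_qa_pairs": sum(counts),
        "avg_qa_per_chunk": sum(counts) / total_chunks if total_chunks else 0.0,
        "avg_importance_score": sum(scores) / len(scores) if scores else 0.0,
        "qa_distribution": dict(sorted(Counter(counts).items())),
    }


sample = [
    {"analysis": {"importance_score": 0.9}, "qa_pairs": [{}, {}, {}, {}], "success": True},
    {"analysis": {"importance_score": 0.5}, "qa_pairs": [{}, {}], "success": True},
    {"analysis": {}, "qa_pairs": [], "success": False},
]
print(analyze_qa_statistics_sketch(sample))
```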

How to Use

Basic Example

from qa_generation.smart_qa_generator import SmartQAGenerator

# Initialize
generator = SmartQAGenerator(model="gemini-2.0-flash")

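# Single chunk processing (chunk_text can be any text chunk from your document)
chunk_text = "AES-256 uses a 256-bit key and a 128-bit block size."
result = generator.process_chunk(chunk_text)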

if result['success']:
    print(f"Analysis result: {result['analysis']}")
    print(f"Generated Q/A count: {len(result['qa_pairs'])}")
    for qa in result['qa_pairs']:
        print(f"Q: {qa['question']}")
        print(f"A: {qa['answer']}")

Batch Processing Multiple Chunks

from qa_generation.smart_qa_generator import SmartQAGenerator, analyze_qa_statistics

generator = SmartQAGenerator()

# Multiple chunk processing
# (chunks is assumed to be a list of dicts with a 'text' key, e.g. [{'text': '...'}])
results = []
for chunk in chunks:
    result = generator.process_chunk(chunk['text'])
    results.append(result)

# Statistical analysis
stats = analyze_qa_statistics(results)
print(f"Total Q/A count: {stats['total_qa_pairs']}")
print(f"Avg Q/A count/chunk: {stats['avg_qa_per_chunk']:.2f}")

Separating Analysis and Generation

# Step 1: Analysis only
analysis = generator.analyze_chunk(chunk_text)
print(f"Recommended Q/A count: {analysis['qa_count']}")
print(f"Key topics: {analysis['key_topics']}")

# Step 2: Generate using analysis result
if analysis['qa_count'] > 0:
    qa_pairs = generator.generate_qa_pairs(chunk_text, analysis)

Decision Criteria and Q/A Count Logic

Q/A Count Criteria

| Q/A Count | Criteria | Example |
|---|---|---|
| 0 | Supplementary info only, meta-info, meaningless repetition | "See appendix for details", "Page number: 42" |
| 1 | Simple factual description (a single piece of information) | "This product is red." |
| 2 | Two related facts | "The product is red, and the size is M." |
| 3 | Multiple related pieces of information, standard explanatory paragraph | General product description, summary |
| 4-5 | High-density technical info, multiple independent points, warnings/cautions | API specs, encryption details, safety notes |

Perspectives Used in the Analysis Prompt

  1. Information Density: Number of independent information/facts in a chunk
  2. Importance: Importance of information (critical/high/medium/low)
  3. Complexity: Detail level required for explanation (high/medium/low)
  4. Independence: Can each piece of information be understood without other context?
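
To make the four perspectives concrete, here is a hypothetical prompt template built from them. The wording, the name `ANALYSIS_PROMPT_TEMPLATE`, and the helper `build_analysis_prompt` are illustrative assumptions; the real prompt in smart_qa_generator.py is not reproduced in this article.

```python
ANALYSIS_PROMPT_TEMPLATE = """\
Analyze the text chunk below and decide how many Q/A pairs (0-5) it deserves.

Consider:
1. Information density: how many independent facts does the chunk contain?
2. Importance: critical / high / medium / low
3. Complexity: high / medium / low
4. Independence: can each fact be understood without other context?

Reply with JSON only, using the keys:
qa_count, key_topics, importance_score, complexity, reasoning

Chunk:
{chunk_text}
"""


def build_analysis_prompt(chunk_text: str) -> str:
    """Fill the (hypothetical) template with the chunk to analyze."""
    return ANALYSIS_PROMPT_TEMPLATE.format(chunk_text=chunk_text.strip())


print(build_analysis_prompt("AES-256 uses a 256-bit key and a 128-bit block size."))
```

Asking for the same keys that analyze_chunk() returns keeps the JSON parsing step simple.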

Error Handling

Fallback Mechanism

If API calls fail, a simplified character-count-based determination is used.

# Fallback logic
token_count = len(chunk_text) // 4

if token_count < 50:
    fallback_count = 0
elif token_count < 100:
    fallback_count = 1
elif token_count < 200:
    fallback_count = 2
else:
    fallback_count = 3

Return Values on Error

# Error in analyze_chunk
{
    'qa_count': <fallback_count>,
    'key_topics': [],
    'importance_score': 0.5,
    'complexity': 'medium',
    'reasoning': 'Determined based on char count due to analysis error: <error content>'
}

# Error in process_chunk
{
    'analysis': {},
    'qa_pairs': [],
    'success': False
}
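
Putting the thresholds and the analyze_chunk error result together, the fallback can be sketched as a single function. The name `fallback_analysis` is hypothetical; the thresholds and default field values are the ones listed above.

```python
from typing import Any, Dict


def fallback_analysis(chunk_text: str, error: Exception) -> Dict[str, Any]:
    """Build a length-based analysis result for use when the LLM call fails."""
    token_count = len(chunk_text) // 4  # rough estimate: about 4 characters per token
    if token_count < 50:
        qa_count = 0
    elif token_count < 100:
        qa_count = 1
    elif token_count < 200:
        qa_count = 2
    else:
        qa_count = 3
    return {
        "qa_count": qa_count,
        "key_topics": [],
        "importance_score": 0.5,
        "complexity": "medium",
        "reasoning": f"Determined based on char count due to analysis error: {error}",
    }


print(fallback_analysis("x" * 450, RuntimeError("timeout")))
```

Note that the fallback never returns more than 3, so the 4-5 range is only reachable when the LLM analysis succeeds.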

Settings & Parameters

Initialization Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "gemini-2.0-flash" | Gemini model to use |
| api_key | Optional[str] | None | Google API Key (If None, retrieves from GOOGLE_API_KEY env var) |

Internal Settings

| Item | Value | Purpose |
|---|---|---|
| Analysis temperature | 0.1 | Low temp for stable decisions |
| Generation temperature | 0.3 | Medium temp for natural text generation |
| Q/A max count | 5 | Maximum Q/A count |
| Q/A min count | 0 | Minimum Q/A count (skip) |
| Max importance_score | 1.0 | Maximum importance |
| Min importance_score | 0.0 | Minimum importance |

API Support

| API | Package | Status |
|---|---|---|
| New API | google.genai | Recommended |
| Old API | google.generativeai | Fallback (deprecated) |

| Module | Relationship |
|---|---|
| qa_generation/pipeline.py | Executes Q/A generation using SmartQAGenerator |
| qa_generation/evaluation.py | Analyzes coverage of generated Q/A |
| qa_generation/models.py | Defines Q/A pair data models |
| celery_tasks.py | Calls SmartQAGenerator during parallel processing |

Creation Date: 2025-01-27
Target File: qa_generation/smart_qa_generator.py
Version: v2.5
