[Practical RAG Chunking] (2/5) Hierarchical Splitting (Step 1)

Step 1: Hierarchical Split


This project implements Agent and RAG systems, two of the most in-demand areas of current LLM development. Without relying on libraries such as LangChain, it implements "Autonomous Agent Control (Plan-and-Execute)," "Confidence Evaluation," RAG data creation, vector DB registration, and search.

Production chunking is executed via the csv_text_to_chunks_text_csv.py command. This document explains Step 1 (Hierarchical Split), the first of the three RAG chunking stages implemented by this command. The chunking articles in this series cover:

  • Step 1 (Hierarchical Split)
  • Step 2 (Semantic Chunking)
  • Step 3 (Contextual Continuity Check)
  • Asynchronous/Parallel Processing

Full source code: available on GitHub.
  • URL: https://github.com/nakashima2toshio/gemini_grace_agent

Environment setup (simplified):
  • Clone the repository above into your environment.
  • Install the libraries with: pip install -r requirements.txt
  • From Step 2 onwards, Redis and Qdrant are required. Launch them with docker compose, following memo.txt in the docker-compose/ directory under the project root.
  • Obtain a Gemini API key and register it in your .env file and environment variables (step1.py reads it from GOOGLE_API_KEY).

Development environment (used for verification). If your environment differs, please check the equivalent steps with Claude Code, Gemini, or ChatGPT.
  • MacBook Air M2, 24GB Memory (I'd like an M6 next!)
  • PyCharm Pro, Gemini API, Python, Streamlit, docker compose (Redis, Qdrant)

📋 Table of Contents

  1. Overall Picture
  2. Explanation of Step 1 Method
  3. Explanation of step1.py
  4. Concrete Example
  5. Key Design Decisions

1. Overall Picture

Introduction

Chunking is a critical preprocessing step that determines the search accuracy of RAG. In the production code csv_text_to_chunks_text_csv.py, chunking is performed in three stages: Step 1, Step 2, and Step 3.

[Crucial]

Since the chunks are split by an LLM (Gemini API), the most important factor is the prompt defined in prompts.py: it determines both what the splitter does and how well it performs. Step 1 uses the following prompt:

# Prompt 1: Hierarchical Splitting (Revised)
PARAGRAPH_SEPARATION_PROMPT

Processing is structured in three stages:

  • Step 1 (Hierarchical Split): Splits input text into paragraph units based on empty lines (\n\n). Headings and body text are kept as a single paragraph unit.
  • Step 2 (Semantic Chunking): Further splits within paragraphs at "topic transition points." This uses semantic coherence rather than physical line breaks.
  • Step 3 (Contextual Continuity Check): Judges the dependency between adjacent chunks. Chunks are merged when one contains referential words ("this," "it") or uses a term that is defined only in its neighbor; chunks that can be understood on their own stay separate.

Objective: Chunks that are too small lose context, while chunks that are too large add search noise. The three-stage process generates semantically complete chunks of appropriate granularity to improve RAG search accuracy.

[Crucial] How to verify the functionality

Each step can be verified with its own verification program:

  • Production chunking is run from the CLI with csv_text_to_chunks_text_csv.py.
  • Final verification is performed in the Streamlit GUI of agent_rag.py.
  • To verify each step in isolation, I extracted the Step 1, Step 2, and Step 3 processing from csv_text_to_chunks_text_csv.py into step1.py, step2.py, and step3.py.
  • For running Steps 1 to 3 end to end, I prepared check_async.py, which uses asynchronous/parallel processing.
  • Use these programs to check the processing and method of each step.

1.1 Positioning of Step 1 in the 3-Stage Process

1.2 Data Flow

Input:
  "RAG (Retrieval-Augmented Generation) is...\n\nSemantic chunking is...\n\nKyoto autumn leaves are..."

Step 1 (Hierarchical Split):
  1. Split text into blocks
  2. Send each block to the LLM
  3. Extract paragraph structure
  4. Concatenate results

Output (5 paragraphs):
  • Paragraph 1: "RAG is... (definition+benefits)"
  • Paragraph 2: "Semantic chunking is... (definition+usage)"
  • Paragraph 3: "Kyoto autumn leaves are... Okinawa's sea is... (Kyoto+Okinawa)"
  • Paragraph 4: "Vector databases are... (definition+application)"
  • Paragraph 5: "Chapter 1 Intro to ML... Chapter 2 Basics of Deep Learning... (Ch1+Ch2)"

1.3 Flowchart of Processing

  • The block size may need to be adjusted based on the nature (field, characteristics) of the original text.

2. Explanation of Step 1 Method

2.1 Objective

Step 1 splits the text along its physical structure, that is, at paragraph boundaries marked by empty lines.

  • Splitting into paragraphs can be done with a regex, but here it doubles as an exercise in using an LLM API (Gemini in this case).
  • We use an LLM here to leave room for future extension and improvement.
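
For comparison, the regex baseline mentioned above might look like the following. This is a minimal sketch and not part of the project's code: the name split_on_empty_lines is hypothetical, and treating whitespace-only lines as empty lines is an assumption.

```python
import re


def split_on_empty_lines(text: str) -> list[str]:
    """Split text into paragraphs at empty lines only (hypothetical helper)."""
    # Normalize line endings so "\r\n\r\n" also counts as an empty line.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline followed by one or more blank (whitespace-only) lines is one boundary.
    parts = re.split(r"\n(?:[ \t]*\n)+", normalized.strip())
    # Drop empty fragments (for example, when the input itself is empty).
    return [p for p in parts if p.strip()]


sample = (
    "Paragraph A. Paragraph A continued.\n"
    "\n"
    "Paragraph B. Paragraph B continued.\n"
    "Chapter 1 Title\n"
    "Body 1. Body 2.\n"
    "\n"
    "Paragraph C.\n"
)
for i, para in enumerate(split_on_empty_lines(sample), start=1):
    print(f"Paragraph {i}: {para!r}")
```

A single line break never triggers a split here, so a heading and its body stay together as long as no empty line separates them, which matches the Step 1 rules.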

[Important Premise]:

  • The input text is in CSV format by default.
  • The design assumes that whoever created the CSV broke the lines deliberately, so each line in the CSV carries some meaning.
  • Text without any line breaks can also be input (via an option), but splitting it into paragraphs with the current LLM (model) requires significant computation (and cost), so we recommend CSV format for the input text.

⚠️ Important: Step 1 uses only empty lines (\n\n) as the split criterion.
It does not split at chapter transitions (e.g., Chapter 1 to Chapter 2).
[Note] If your input text has an explicit chapter or section structure, you may need to adjust the corresponding prompt in prompts.py to fit the characteristics of that text.

Simple character-count based splitting leads to the following issues:

| Issue | Concrete Example |
| --- | --- |
| Heading and body separated | "Chapter 1 Introduction to Machine Learning" and its body end up in separate chunks |
| Cut in the middle of a sentence | Cut off at "The greatest advantage of this method is that it can reflect the latest information" |
| Semantic coherence destroyed | The RAG definition and its benefits end up in separate chunks |

In Step 1, we leverage an LLM (Gemini API) to solve these issues.

2.2 Splitting Rules (Most Important)

【Step 1 Splitting Rules】

| Category | Rule |
| --- | --- |
| Split | Only where empty lines (\n\n) exist |
| Do not split | Headings (e.g., "Chapter X") and the immediately following body text → if there is no empty line, they are the same paragraph |
| Do not split | Even if the chapter changes (Chapter 1 -> Chapter 2) → if there is no empty line, do not split |
| Do not split | Do not split based solely on line breaks (\n) |

2.3 Algorithm

  1. Split the input text into blocks of block_size characters (default: 2000).
     Example: 5000 characters → 3 blocks (2000, 2000, 1000 characters)
  2. Send each block to the LLM (Gemini API), requesting:
     ・JSON format (response_mime_type="application/json")
     ・Pydantic schema (response_schema=StructuralResult)
  3. The LLM structures the text based on the following rules:
     ・Rule 1: Split paragraphs only by empty lines (\n\n)
     ・Rule 2: Keep headings and body text together in one paragraph
     ・Rule 3: Split sentences by periods (。) or line breaks
  4. Combine the results of all blocks to generate a paragraph list.

2.4 Prompt for the LLM

The PARAGRAPH_SEPARATION_PROMPT defined in chunking/prompts.py:

PARAGRAPH_SEPARATION_PROMPT = """
You are a text structuring engine. Parse the input text according to the following [Splitting Rules] and convert it into a hierarchical structure (Paragraph > Sentence).

[Splitting Rules]
Structure the input text according to the following rules.
The goal is to divide the text into "large meaningful blocks (Paragraphs)" and decompose them into "Sentences."

[Rule 1: Paragraph Splitting Criteria (Follow strictly)]
**When to split (Do not split otherwise):**
- Split where empty lines (\\n\\n) exist.

**When NOT to split (Important):**
- Headings (e.g., "Chapter X") and the immediately following body text must be included in the same Paragraph **if there are no empty lines**.
- Even if the chapter changes (e.g., Chapter 1 to Chapter 2), do not split **if there are no empty lines**.
- Do not split solely by line breaks (\\n).

[Rule 2: Handling Headings]
- If there are headings such as "Chapter X":
  - Do not create a Paragraph for the heading alone.
  - Combine the heading and the immediately following body text into **one single Paragraph**.
  - However, if there is an empty line (\\n\\n) before the heading, split there.

[Rule 3: Sentence Splitting]
- Decompose the contents of a Paragraph into a sentences list by splitting at periods (。) or line breaks.
- Treat the heading part as one sentence as well.

[Concrete Example]

Input text:
Paragraph A. Paragraph A continued.

Paragraph B. Paragraph B continued.
Chapter 1 Title
Body 1. Body 2.
Chapter 2 Title
Body 3. Body 4.

Paragraph C.


Expected splitting:
- Paragraph 1: "Paragraph A. Paragraph A continued."
- Paragraph 2: "Paragraph B. Paragraph B continued. Chapter 1 Title Body 1. Body 2. Chapter 2 Title Body 3. Body 4."
  (Since there are no empty lines, Chapter 1 and Chapter 2 are included in the same Paragraph)
- Paragraph 3: "Paragraph C."

[Output Requirements]
- Output in a structure containing a sentences list within a paragraphs list, following the JSON schema.
- Do not omit or summarize the content of the original text; **maintain the exact string**.
- **Use only empty lines (\\n\\n) as the splitting criteria, and do not split at chapter transitions.**
"""

2.5 Response Schema (Pydantic Model)

Defined in chunking/models.py:

from typing import List

from pydantic import BaseModel, Field


class SentenceUnit(BaseModel):
    """A single sentence or minimum unit of meaning"""
    text: str = Field(description="A single sentence or minimum unit of meaning")


class ParagraphUnit(BaseModel):
    """Paragraph unit"""
    id: int = Field(description="Paragraph ID")
    sentences: List[SentenceUnit] = Field(description="List of sentences contained in this paragraph")

    @property
    def full_text(self) -> str:
        """Return the full text of the paragraph combined

        Note: Sentences are joined by line breaks (\n).
        This preserves the original text structure and improves
        processing accuracy in Step 2 and Step 3.

        Especially important for CSV input:
        - Line breaks within CSV cells are preserved
        - Readability is enhanced
        - Semantic splitting becomes accurate
        """
        return "\n".join([s.text for s in self.sentences])  # Join with line breaks


class StructuralResult(BaseModel):
    """Result of text structuring"""
    paragraphs: List[ParagraphUnit]

⚠️ Important: The full_text property joins sentences with line breaks (\n).
This preserves the original text structure, improving processing accuracy in Step 2 and Step 3.
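
To make the schema concrete, the sketch below shows a JSON payload shaped like StructuralResult and how the paragraph text would be derived from it. The payload is hand-written for illustration and is not real Gemini output. To stay dependency-free the sketch uses only the standard json module instead of Pydantic, and paragraph_full_text is a hypothetical stand-in for the full_text property.

```python
import json

# Hand-written example shaped like StructuralResult (illustrative, not API output).
sample_response = """
{
  "paragraphs": [
    {"id": 1, "sentences": [{"text": "Chapter 1 Title"},
                            {"text": "Body 1."},
                            {"text": "Body 2."}]},
    {"id": 2, "sentences": [{"text": "Paragraph C."}]}
  ]
}
"""


def paragraph_full_text(paragraph: dict) -> str:
    """Stand-in for ParagraphUnit.full_text: join sentence texts with newlines."""
    return "\n".join(s["text"] for s in paragraph["sentences"])


result = json.loads(sample_response)
paragraphs = [paragraph_full_text(p) for p in result["paragraphs"]]
print(paragraphs)
```

In the actual code the same payload would go through StructuralResult.model_validate_json, which additionally validates the field types.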

2.6 Why is Step 1 Necessary?

| Issue | Step 1 Solution |
| --- | --- |
| Separation of headings | Keep heading and body as one paragraph |
| Splitting by character count | LLM understands meaning and splits |
| Lack of structure | Maintain physical structure by splitting based on empty lines |
| Misrecognition of line breaks | Distinguish between empty lines (\n\n) and line breaks (\n) |

3. Explanation of step1.py

3.1 Purpose of the File

step1.py is a test program for verifying the operation of Step 1 (Hierarchical Split) on its own.

  • Implemented with synchronous processing (easy to debug and understand)
  • Designed for unit testing (verifies Step 1 in isolation)
  • Contains test data to verify integration with Step 2 and Step 3

3.2 Program Structure

# Structure of step1.py

# 1. Imports
import os
from google import genai
from google.genai import types
from chunking.models import StructuralResult
from chunking.prompts import PARAGRAPH_SEPARATION_PROMPT

# 2. Core function
def step1_hierarchical_split(text: str, api_key: str, block_size: int = 2000) -> list[str]:
    """Implements the core functionality of Step 1"""
    ...

# 3. Main processing
def main():
    """Execute tests"""
    ...

3.3 Program Flow

Processing flow of step1.py

| Order | Process | Details |
| --- | --- | --- |
| 1 | Retrieve API key | api_key = os.getenv("GOOGLE_API_KEY") |
| 2 | Prepare test text | test_text = """RAG (Retrieval-Augmented Generation)...""" ※ Expected to split into 5 paragraphs by empty lines (\n\n) |
| 3 | Call function | paragraphs = step1_hierarchical_split(test_text, api_key) |
| 4 | Display/verify results | Confirm paragraph count (expected: 5 paragraphs); display content of each paragraph; check the verification points |

3.4 Detailed Explanation of the Core Function

def step1_hierarchical_split(text: str, api_key: str, block_size: int = 2000) -> list[str]:
    """
    Splits text into paragraph units (Core functionality of Step 1)

    Args:
        text: Input text
        api_key: Gemini API key
        block_size: Block size (number of characters)

    Returns:
        List of paragraphs
    """
    # 1. Initialize Gemini API client
    client = genai.Client(api_key=api_key)

    # 2. Split text into blocks
    #    Example: 5000 characters of text -> 3 blocks (2000, 2000, 1000 characters)
    blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    print(f"Input: {len(text)} chars -> {len(blocks)} blocks")

    paragraphs = []

    # 3. Process each block
    for i, block in enumerate(blocks):
        print(f"Processing block {i + 1}/{len(blocks)}...")

        # 4. Create prompt (Prompt + Input text)
        prompt = f"{PARAGRAPH_SEPARATION_PROMPT}\n\n【Input Text】\n{block}"

        # 5. Call Gemini API (Synchronous)
        #    - gemini-2.5-flash: Latest stable version with high rate limits and performance
        #    - response_mime_type: Specify JSON format
        #    - response_schema: Specify Pydantic model
        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=prompt,
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=StructuralResult
            )
        )

        # 6. Parse response
        result = StructuralResult.model_validate_json(response.text)

        # 7. Extract paragraphs (using full_text property to join with newline)
        for para in result.paragraphs:
            paragraphs.append(para.full_text)

        print(f"  -> Extracted {len(result.paragraphs)} paragraphs")

    return paragraphs
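
One caveat of the fixed-size slicing in step 2 of this function: a block boundary can fall in the middle of a paragraph, and the two halves then come back as separate paragraphs. The sketch below is one possible alternative and not the author's implementation. It packs whole empty-line-delimited paragraphs into blocks of at most block_size characters and falls back to fixed-size slicing only when a single paragraph is longer than block_size. The name split_into_blocks is hypothetical.

```python
import re


def split_into_blocks(text: str, block_size: int = 2000) -> list[str]:
    """Pack whole paragraphs into blocks of at most block_size characters."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    # Paragraphs are delimited by one or more empty lines (two or more newlines).
    paragraphs = [p for p in re.split(r"\n{2,}", text) if p.strip()]

    blocks: list[str] = []
    current = ""
    for para in paragraphs:
        if len(para) > block_size:
            # Oversized paragraph: flush the pending block, then slice by size.
            if current:
                blocks.append(current)
                current = ""
            blocks.extend(
                para[i:i + block_size] for i in range(0, len(para), block_size)
            )
            continue
        # Try to append the paragraph to the pending block, keeping the empty line.
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= block_size:
            current = candidate
        else:
            blocks.append(current)
            current = para
    if current:
        blocks.append(current)
    return blocks


text = "a" * 1500 + "\n\n" + "b" * 1500 + "\n\n" + "c" * 300
print([len(b) for b in split_into_blocks(text, block_size=2000)])
```

Because each block keeps the "\n\n" separators between the paragraphs it contains, the Step 1 prompt can still see every split point inside the block.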

3.5 Details on API Calls

  • "gemini-2.5-flash" is the default because of its cost-performance, but you can try newer models and compare the results. (For paragraph splitting there is little difference in quality.)

| Model (Model ID) | Input (USD per 1M tokens) | Output (USD per 1M tokens) | Tendency / cost characteristics |
| --- | --- | --- | --- |
| gemini-2.5-flash | $0.075 | $0.30 | [Cheapest] Same price range as the previous 1.5 Flash. Ideal for high-volume document processing or RAG search phases where quantity matters. |
| gemini-3-flash | $0.15 | $0.60 | [Balanced] Twice the cost of 2.5 Flash, but has recognition capabilities close to 3 Pro. Very cost-effective as a "decision-maker" for complex agents. |
| gemini-3-pro | $2.00 | $12.00 | [High Cost/High Performance] High unit price, and thinking tokens are counted as output, so actual billing can be 20-50 times higher than Flash models. Should be reserved for critical final reasoning. |

response = client.models.generate_content(
    model="gemini-2.5-flash",           # Model name
    contents=prompt,                     # Prompt
    config=types.GenerateContentConfig(
        response_mime_type="application/json",  # Specify JSON format
        response_schema=StructuralResult        # Specify Pydantic schema
    )
)

| Parameter | Value | Description |
| --- | --- | --- |
| model | "gemini-2.5-flash" | Latest stable version, high rate limits and performance |
| response_mime_type | "application/json" | Request JSON response |
| response_schema | StructuralResult | Specify schema via Pydantic model |

4. Concrete Example

  • Run step1.py to see the results and deepen your understanding.
  • The end-to-end test from Step 1 to Step 3 is in check_async.py.

4.1 Test Input Text

[Paragraph 1]
RAG (Retrieval-Augmented Generation) is a technique that combines retrieval and generation.
It retrieves relevant information from an external knowledge base and passes it as context to the LLM.
Announced by Facebook in 2020, it is now adopted by many systems.
The greatest advantage of this method is its ability to reflect the latest information.
This allows it to answer time-sensitive questions that an LLM alone cannot handle.
It is also reported to have the effect of reducing hallucinations.
(empty line → split point)

[Paragraph 2]
Semantic chunking is a technique for splitting text into semantic units.
A "chunk" refers to each block of divided text.
An "Embedding" is a text converted into a numerical vector.
Chunk size significantly impacts search accuracy.
If it is too small, context is lost and embedding quality decreases.
If it is too large, search noise increases and irrelevant information gets mixed in.
(empty line → split point)

[Paragraph 3]
Kyoto's autumn leaves are best viewed from mid-to-late November.
Kiyomizu-dera and Arashiyama are known as particularly popular spots.
To avoid crowds, early weekday mornings are recommended.
Okinawa's sea has high clarity, making it ideal for snorkeling.
Beautiful beaches are scattered around Onna Village, about an hour by car from Naha.
While caution is needed for typhoons in summer, it is warm and pleasant in other seasons too.
(empty line → split point)

[Paragraph 4]
A vector database is a system for efficiently storing and searching high-dimensional vectors.
Representative products include Pinecone, Weaviate, and Chroma.
It achieves fast similarity searches through ANN (Approximate Nearest Neighbor) algorithms.
ANN accuracy and speed are in a trade-off relationship.
You can adjust this balance by choosing indexing methods like HNSW or IVF.
Scalability and cost are also important criteria when selecting a vector database.
(empty line → split point)

[Paragraph 5]
Chapter 1 Introduction to Machine Learning
Machine learning is a collective term for algorithms that learn patterns from data.
It is broadly categorized into supervised learning, unsupervised learning, and reinforcement learning.
In this chapter, we explained these basic concepts.
Chapter 2 Fundamentals of Deep Learning
Deep learning is a machine learning method that uses multi-layered neural networks.
It has achieved revolutionary results in image recognition and natural language processing.
In this chapter, we will explain the basic architectures of CNN and RNN.
(⚠️ No empty line between Ch 1 and Ch 2 → same paragraph)

4.2 Visualization of Split Points

| Paragraph | Content | Split reason |
| --- | --- | --- |
| Paragraph 1 | Explanation of RAG: "RAG (Retrieval-Augmented Generation) is... ...it is also reported to have the effect of reducing hallucinations." | Empty line (\n\n) present → split |
| Paragraph 2 | Explanation of Semantic Chunking: "Semantic chunking is... ...and irrelevant information gets mixed in." | Empty line (\n\n) present → split |
| Paragraph 3 | Tourism Information: "Kyoto's autumn leaves are... ...it is warm and pleasant in other seasons too." | Empty line (\n\n) present → split |
| Paragraph 4 | Explanation of Vector DB: "A vector database is... ...Scalability and cost are also important criteria." | Empty line (\n\n) present → split |
| Paragraph 5 | Chapter Structure (Ch 1 + Ch 2): "Chapter 1 Introduction to Machine Learning ...In this chapter, we explained these basic concepts. Chapter 2 Fundamentals of Deep Learning ...In this chapter, we will explain the basic architectures of CNN and RNN." | ⚠️ No empty line, so do not split |

4.3 Expected Output

# Step 1 Output (5 paragraphs)
[
  # Paragraph 1: RAG explanation (Definition + Advantages)
  "RAG (Retrieval-Augmented Generation) is a technique that combines retrieval and generation.\n"
  "It retrieves relevant information from an external knowledge base and passes it as context to the LLM.\n"
  "Announced by Facebook in 2020, it is now adopted by many systems.\n"
  "The greatest advantage of this method is its ability to reflect the latest information.\n"
  "This allows it to answer time-sensitive questions that an LLM alone cannot handle.\n"
  "It is also reported to have the effect of reducing hallucinations.",

  # Paragraph 2: Semantic chunking explanation (Definition + Usage)
  "Semantic chunking is a technique for splitting text into semantic units.\n"
  "A \"chunk\" refers to each block of divided text.\n"
  "An \"Embedding\" is a text converted into a numerical vector.\n"
  "Chunk size significantly impacts search accuracy.\n"
  "If it is too small, context is lost and embedding quality decreases.\n"
  "If it is too large, search noise increases and irrelevant information gets mixed in.",

  # Paragraph 3: Tourism information (Kyoto + Okinawa)
  "Kyoto's autumn leaves are best viewed from mid-to-late November.\n"
  "Kiyomizu-dera and Arashiyama are known as particularly popular spots.\n"
  "To avoid crowds, early weekday mornings are recommended.\n"
  "Okinawa's sea has high clarity, making it ideal for snorkeling.\n"
  "Beautiful beaches are scattered around Onna Village, about an hour by car from Naha.\n"
  "While caution is needed for typhoons in summer, it is warm and pleasant in other seasons too.",

  # Paragraph 4: Vector DB explanation (Definition + Usage)
  "A vector database is a system for efficiently storing and searching high-dimensional vectors.\n"
  "Representative products include Pinecone, Weaviate, and Chroma, etc.\n"
  "It achieves fast similarity searches through ANN (Approximate Nearest Neighbor) algorithms.\n"
  "ANN accuracy and speed are in a trade-off relationship.\n"
  "You can adjust this balance by choosing indexing methods like HNSW or IVF.\n"
  "Scalability and cost are also important criteria when selecting a vector database.",

  # Paragraph 5: Chapter structure (Ch 1 + Ch 2) ← Same paragraph since there is no empty line!
  "Chapter 1 Introduction to Machine Learning\n"
  "Machine learning is a collective term for algorithms that learn patterns from data.\n"
  "It is broadly categorized into supervised learning, unsupervised learning, and reinforcement learning.\n"
  "In this chapter, we explained these basic concepts.\n"
  "Chapter 2 Fundamentals of Deep Learning\n"
  "Deep learning is a machine learning method that uses multi-layered neural networks.\n"
  "It has achieved revolutionary results in image recognition and natural language processing.\n"
  "In this chapter, we will explain the basic architectures of CNN and RNN."
]

4.4 Verification Points

| Check Item | Expected Result | Verification Method |
| --- | --- | --- |
| Number of Paragraphs | 5 paragraphs | len(paragraphs) == 5 |
| Empty Line Split | Split by (\n\n) | Check boundaries of each paragraph |
| No Chapter Split | Ch 1 and Ch 2 in same paragraph | Check contents of Paragraph 5 |
| Text Retention | No omission | Compare character count/content |
| Line Break Join | Join sentences with \n | Check output of full_text |
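
Several of these checks can be automated roughly as follows. This is a sketch rather than the project's test code: check_step1_output is a hypothetical helper, and it compares the texts after removing all whitespace, on the assumption that the LLM may alter line breaks but must not change any other character.

```python
import re


def _strip_ws(s: str) -> str:
    """Remove all whitespace so only the visible characters are compared."""
    return re.sub(r"\s+", "", s)


def check_step1_output(
    source: str, paragraphs: list[str], expected_count: int
) -> dict[str, bool]:
    """Run basic Step 1 verification points against a paragraph list."""
    return {
        # The number of paragraphs matches the expectation.
        "paragraph_count": len(paragraphs) == expected_count,
        # No paragraph still contains an empty line (every \n\n became a boundary).
        "split_at_empty_lines": all("\n\n" not in p for p in paragraphs),
        # Nothing was omitted, summarized, or reordered (whitespace ignored).
        "text_retained": _strip_ws("".join(paragraphs)) == _strip_ws(source),
    }


source = "A1.\nA2.\n\nCh 1\nBody.\nCh 2\nBody."
good = ["A1.\nA2.", "Ch 1\nBody.\nCh 2\nBody."]
print(check_step1_output(source, good, expected_count=2))
```

With the test text from section 4.1, the call would be check_step1_output(test_text, paragraphs, expected_count=5).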

4.5 Integration with Step 2 and Step 3 (Test Data Flow)

[Step 1] 1 Text → 5 Paragraphs

| Paragraph | Content |
| --- | --- |
| Para 1 | RAG explanation (Definition + Advantages) |
| Para 2 | Semantic chunking explanation (Definition + Usage) |
| Para 3 | Tourism info (Kyoto + Okinawa) |
| Para 4 | Vector DB explanation (Definition + Usage) |
| Para 5 | Chapter structure (Ch 1 + Ch 2) |

[Step 2] 5 Paragraphs → 10 Chunks (Semantically Split)

| Input | Output |
| --- | --- |
| Para 1 | Chunk 1 (RAG def) + Chunk 2 (RAG advantages) |
| Para 2 | Chunk 3 (Term def) + Chunk 4 (Term usage) |
| Para 3 | Chunk 5 (Kyoto tourism) + Chunk 6 (Okinawa tourism) |
| Para 4 | Chunk 7 (Vector DB def) + Chunk 8 (Usage) |
| Para 5 | Chunk 9 (Ch 1) + Chunk 10 (Ch 2) |

[Step 3] 10 Chunks → 7 Chunks (Merge/Split based on Continuity)

| Process | Result | Reason |
| --- | --- | --- |
| Chunk 1+2 | Merge | Forward dependency: "This method", "It" |
| Chunk 3+4 | Merge | Backward dependency: "Chunk", "Embedding" undefined |
| Chunk 5 | Independent | Kyoto tourism is understandable alone |
| Chunk 6 | Independent | Okinawa tourism is understandable alone |
| Chunk 7+8 | Merge | Backward dependency: "ANN", "Vector DB" undefined |
| Chunk 9 | Independent | Ch 1 is complete |
| Chunk 10 | Independent | Ch 2 is understandable alone |

Detailed Verification Patterns

| Pattern | Description | Example | Judgment in Step 3 |
| --- | --- | --- | --- |
| Forward Dependency | References the past with pronouns | "This method", "It" | → Merge (True) |
| Backward Dependency | Technical term used without definition | "Chunk", "Embedding", "ANN" | → Merge (True) |
| Independent Judgment | Understandable alone despite same topic | Kyoto tourism / Okinawa tourism | → Separate (False) |
| Chapter Structure | When chapter changes | Ch 1 / Ch 2 | → Separate (False) |

5. Key Design Decisions

5.1 Why use only empty lines as the split criteria?

[Problem] If Step 1 already split at chapter changes, Step 2 would receive each chapter as a separate paragraph and could no longer tell where a chapter change occurred.

[Example: Input Text]

Chapter 1 Introduction to Machine Learning
Machine learning is...
Chapter 2 Fundamentals of Deep Learning
Deep learning is...

| Method | Step 1 Output | Problem in Step 2 |
| --- | --- | --- |
| ❌ Split by Chapter | Para 1: Chapter 1...\nMachine learning is... / Para 2: Chapter 2...\nDeep learning is... | Since Para 1 only has "Chapter 1", it is difficult to judge semantic splitting. Chapter structure info is lost. |
| ✅ Split by Empty Line | Para 1: Chapter 1...\nMachine learning...Chapter 2...\nDeep learning... | Para 1 contains both "Ch 1" and "Ch 2", allowing detection of chapter turning points → Chunk 1: Ch 1, → Chunk 2: Ch 2 |

5.2 Why join sentences with line breaks (\n)?

# full_text property in models.py

@property
def full_text(self) -> str:
    """Returns the full text of the paragraph joined by line breaks.

    Note: Sentences are joined with line breaks (\n).
    This maintains original text structure, improving processing accuracy in Step 2 and Step 3.

    Especially important for CSV input:
    - Preserves line breaks inside CSV cells
    - Increases readability
    - Semantic splitting becomes accurate
    """
    return "\n".join([s.text for s in self.sentences])  # ← Joined by line break

| Joining Method | Result | Problem |
| --- | --- | --- |
| Empty string "".join() | "Sentence 1.Sentence 2.Sentence 3." | Original line break structure is lost |
| Line break "\n".join() | "Sentence 1.\nSentence 2.\nSentence 3." | ✅ Original structure is preserved |

Benefits of line break joining:

  • Easier for Step 2 to recognize sentence breaks
  • Improved accuracy of continuity judgment in Step 3
  • High readability during debugging
  • Preserves line breaks within cells when inputting CSV
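
The difference between the two joining methods can be checked in a few lines. The sentence strings below are placeholders.

```python
sentences = ["Sentence 1.", "Sentence 2.", "Sentence 3."]

joined_empty = "".join(sentences)
joined_newline = "\n".join(sentences)

print(repr(joined_empty))
print(repr(joined_newline))

# With newline joining, the sentence list can be recovered by a plain split.
# With empty-string joining, the boundaries are gone and would have to be
# re-detected by sentence segmentation.
recovered = joined_newline.split("\n")
print(recovered == sentences)
```

This round trip only holds when no sentence contains a line break of its own, which is the case when sentences are split at line breaks as in Rule 3.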

5.3 Reason for selecting block size

block_size: int = 2000  # Default value

| Size | Pros | Cons |
| --- | --- | --- |
| 500 chars | Fast processing | Paragraphs are prone to being split |
| 2000 chars | Good balance (✅ Recommended) | |
| 5000 chars | Paragraphs are well-maintained | Close to API limits |
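
When choosing a block size for a particular corpus, it can help to count how many paragraphs a fixed-size slice would cut in two. The sketch below assumes the same text[i:i + block_size] slicing used in step1_hierarchical_split. The helper count_cut_paragraphs is hypothetical and not part of the repository.

```python
import re


def count_cut_paragraphs(text: str, block_size: int) -> int:
    """Count empty-line-delimited paragraphs that straddle a block boundary."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    # Collect (start, end) character spans of paragraphs separated by empty lines.
    spans = []
    pos = 0
    for sep in re.finditer(r"\n{2,}", text):
        spans.append((pos, sep.start()))
        pos = sep.end()
    spans.append((pos, len(text)))

    cut = 0
    for start, end in spans:
        # First block boundary located strictly after the paragraph start.
        next_boundary = (start // block_size + 1) * block_size
        # The paragraph is cut if that boundary falls before its end.
        if next_boundary < end:
            cut += 1
    return cut


# Three 300-character paragraphs separated by empty lines.
sample = "\n\n".join(["x" * 300] * 3)
for size in (200, 500, 2000):
    print(size, count_cut_paragraphs(sample, size))
```

Running this on a representative sample of your own input gives a concrete basis for the trade-off in the table above.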

Summary

Role of Step 1

| Item | Content |
| --- | --- |
| Input | Text (string) |
| Output | Paragraph list (list[str]) |
| Purpose | Split while maintaining physical structure (paragraphs by empty lines) |
| Method | Structure recognition via LLM (Gemini API) |

Key Points

  1. Split only by empty lines: Only \n\n is the split criterion; do not split at chapter changes.
  2. Maintain headings and body text: Treat "Chapter X" and the subsequent body text as the same paragraph.
  3. Join sentences with line breaks: full_text joins sentences with \n to maintain structure.
  4. Perfect retention of text: No omissions or summaries.

Next Step

The output of Step 1 (paragraph list) is used as input for Step 2 (Semantic Chunking).

Step 2 detects semantic turning points within each paragraph and splits the paragraph into finer chunks at those points.
In particular, chapter changes (Ch 1 → Ch 2) are split in Step 2, not in Step 1.

In Step 3, continuity between adjacent chunks is judged, and they are merged/separated based on the following patterns:

| Pattern | Description | Result |
| --- | --- | --- |
| Forward Dependency | References the past with pronouns ("This", "It") | Merge |
| Backward Dependency | Technical term used without definition | Merge |
| Independent Judgment | Understandable alone despite same topic | Separate |
| Chapter Structure | When chapter changes | Separate |
