Translated by AI
[Practical RAG Chunking] (2/5) Hierarchical Splitting (Step 1)
Step 1: Hierarchical Split

This project tackles the most cutting-edge and high-demand challenges in current LLM development: Agent and RAG systems. Without relying on libraries like LangChain, it implements "Autonomous Agent Control (Plan-and-Execute)," "Confidence Evaluation," RAG data creation, vector DB registration, and search systems.
Production chunking is executed via the csv_text_to_chunks_text_csv.py command. This document explains Step 1 (Hierarchical Split), the first of the four parts of the RAG [chunking] pipeline implemented by this command (three splitting stages plus asynchronous/parallel execution):
- Step 1 (Hierarchical Split)
- Step 2 (Semantic Chunking)
- Step 3 (Contextual Continuity Check)
- Asynchronous/Parallel Processing
Full source code: Available on GitHub.
- URL: https://github.com/nakashima2toshio/gemini_grace_agent
[Environment Setup (Ultra-simplified version)]:
- Clone the repository above into your environment.
- Install libraries with: pip install -r requirements.txt
- Used from Step 2 onwards: [docker compose] Under the docker-compose/ directory in the project root, refer to memo.txt and launch Redis and Qdrant using docker compose.
- Obtain your Gemini API Key and register it in your .env file and environment variables.
Development Environment (Verification Environment): For those using other environments, please verify with Anthropic (Claude code), Gemini, or ChatGPT.
- MacBook Air M2, 24GB Memory (I'd like an M6 next!)
- PyCharm Pro, Gemini API, Python, Streamlit, docker compose (Redis, Qdrant)
📋 Table of Contents
- Overall Picture
- Explanation of Step 1 Method
- Explanation of step1.py
- Concrete Example
- Key Design Decisions
1. Overall Picture
Introduction
Chunking is a critical preprocessing step that determines the search accuracy of RAG. In the production code csv_text_to_chunks_text_csv.py, chunking is performed in three stages: Step 1, Step 2, and Step 3.
[Crucial]
Since chunks are split using an LLM (Gemini API), the most important factor is the prompt defined in prompts.py. Functionality and performance are determined by this prompt.
# Prompt 1: Hierarchical Splitting (Revised)
PARAGRAPH_SEPARATION_PROMPT
Processing is structured in three stages:
- Step 1 (Hierarchical Split): Splits input text into paragraph units based on empty lines (\n\n). Headings and body text are kept as a single paragraph unit.
- Step 2 (Semantic Chunking): Further splits within paragraphs at "topic transition points." This uses semantic coherence rather than physical line breaks.
- Step 3 (Contextual Continuity Check): Judges dependency between adjacent chunks. If there are referential markers ("this," "it") or undefined term references, they are joined; if they can be understood independently, they are separated.
Objective: Too small chunks lead to context loss, while too large chunks increase search noise. This three-stage process generates semantically complete chunks of appropriate granularity to improve RAG search accuracy.
[Crucial]:
Functionality verification method: Let's verify the functionality using the verification program for each step.
[Verification]:
- Production chunking uses the CLI command csv_text_to_chunks_text_csv.py.
- Final verification is performed using the Streamlit [GUI] in agent_rag.py.
- To verify functionality, I extracted the specific processing for step 1, step 2, and step 3 from the csv_text_to_chunks_text_csv.py program.
- I created "step1.py, step2.py, step3.py" for verifying the functionality of each step separately.
- Additionally, for execution through steps 1, 2, and 3, I prepared check_async.py for asynchronous/parallel processing.
- Use these programs to check the processing and methods of each step.
1.1 Positioning of Step 1 in the 3-Stage Process
1.2 Data Flow
| Stage | Content |
|---|---|
| Input | "RAG (Retrieval-Augmented Generation) is...\n\nSemantic chunking is...\n\nKyoto autumn leaves are..." |
| ↓ | |
| Step 1: Hierarchical Split | 1. Split text into blocks 2. Send each block to LLM 3. Extract paragraph structure 4. Concatenate results |
| ↓ | |
| Output (5 paragraphs) | Paragraph 1: "RAG is... (definition+benefits)"; Paragraph 2: "Semantic chunking is... (definition+usage)"; Paragraph 3: "Kyoto autumn leaves are... Okinawa's sea is... (Kyoto+Okinawa)"; Paragraph 4: "Vector databases are... (definition+application)"; Paragraph 5: "Chapter 1 Intro to ML... Chapter 2 Basics of Deep Learning... (Ch1+Ch2)" |
1.3 Flowchart of Processing
- The block size may need to be adjusted based on the nature (field, characteristics) of the original text.
2. Explanation of Step 1 Method
2.1 Objective
We perform splitting that respects the physical structure of the text (paragraph division by empty lines).
- Splitting into paragraphs can be done via regex, but this is an exercise in using the LLM (Gemini in this case) API.
- We use an LLM here to allow for future extensibility and improvement.
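Since the same split can be done with a regular expression, a regex baseline is a convenient way to cross-check the paragraph boundaries returned by the LLM. The following is a minimal sketch and not part of the repository; `split_on_empty_lines` is a hypothetical helper name introduced here for illustration.

```python
import re


def split_on_empty_lines(text: str) -> list[str]:
    """Split text into paragraphs at empty lines only (hypothetical baseline helper)."""
    # A newline followed by one or more blank (or whitespace-only) lines is a boundary.
    parts = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    # Drop fragments that are empty after trimming.
    return [p.strip() for p in parts if p.strip()]


sample = (
    "Paragraph A. Paragraph A continued.\n\n"
    "Chapter 1 Title\nBody 1. Body 2.\n"
    "Chapter 2 Title\nBody 3. Body 4.\n\n"
    "Paragraph C."
)
paragraphs = split_on_empty_lines(sample)
for number, paragraph in enumerate(paragraphs, start=1):
    print(number, repr(paragraph))
```

Single line breaks never trigger a split here, so the two chapter headings stay in the same paragraph as their body text, which is the behavior Step 1 asks of the LLM.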
[Important Premise]:
- The input text defaults to CSV format.
- It assumes that when humans create this CSV, "the lines in the CSV must have some meaning."
- Text without any line breaks can also be input (via an option), but splitting such text into paragraphs with the current LLM (model) requires significant computation (and cost), so we recommend CSV format for input text.
⚠️ Important: Step 1 uses only empty lines (\n\n) as the split criterion. It does not split at chapter transitions (e.g., Chapter 1 to Chapter 2).
[Note] If your input text has its own structure (for example, Chapters or Sections), you may need to adjust the corresponding prompt in prompts.py to match the characteristics of that text.
Simple character-count based splitting leads to the following issues:
| Issue | Concrete Example |
|---|---|
| Heading and body separation | "Chapter 1 Introduction to Machine Learning" and the body are in separate chunks |
| Cutting off in the middle of a sentence | Cut off at "The greatest advantage of this method is that it can reflect the latest information" |
| Destruction of semantic coherence | RAG definition and benefits are in separate chunks |
In Step 1, we leverage an LLM (Gemini API) to solve these issues.
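The failure modes in the table are easy to reproduce. The sketch below is an illustration only (the helper name `split_by_char_count` and the chunk size of 42 are arbitrary choices made here): a fixed-width splitter separates the heading from its body and cuts the body mid-sentence.

```python
def split_by_char_count(text: str, size: int) -> list[str]:
    """Naive fixed-width splitter: ignores sentences, headings, and paragraphs."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = (
    "Chapter 1 Introduction to Machine Learning\n"
    "Machine learning is a collective term for algorithms that learn patterns from data."
)
chunks = split_by_char_count(text, 42)
for chunk in chunks:
    # repr() makes the stray line break and the cut-off sentence visible.
    print(repr(chunk))
```

The first chunk holds only the heading, and the second ends in the middle of the sentence, so neither chunk is meaningful on its own when embedded for search.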
2.2 Splitting Rules (Most Important)
【Step 1 Splitting Rules】
| Category | Rule |
|---|---|
| ✅ Split | Only where empty lines (\n\n) exist |
| ❌ Do not split | Headings (e.g., "Chapter X") and the immediately following body text → If there are no empty lines, they are the same paragraph |
| ❌ Do not split | Even if the chapter changes (Chapter 1 -> Chapter 2) → If there are no empty lines, do not split |
| ❌ Do not split | Do not split based solely on line breaks (\n) |
2.3 Algorithm
| Step | Processing Details |
|---|---|
| 1 | Split input text by block_size (default: 2000 characters). Example: 5000 characters → 3 blocks (2000, 2000, 1000 characters) |
| 2 | Send each block to the LLM (Gemini API) ・JSON format (response_mime_type="application/json") ・Pydantic schema (response_schema=StructuralResult) |
| 3 | LLM structures the text based on the following rules: ・Rule 1: Split paragraphs only by empty lines (\n\n) ・Rule 2: Keep headings and body text together in one paragraph ・Rule 3: Split sentences by periods (。) or line breaks |
| 4 | Combine all block results to generate a paragraph list |
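The four steps in the table can be traced without network access by swapping the Gemini call for a local stand-in. This is a sketch under that assumption: `fake_structure_block` is a hypothetical placeholder that imitates the shape of the structured response with plain dicts, and `split_into_blocks` and `hierarchical_split` are illustrative names, not functions from the repository.

```python
import re
from typing import Callable


def split_into_blocks(text: str, block_size: int = 2000) -> list[str]:
    """Step 1 of the table: cut the text into fixed-size character blocks."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def fake_structure_block(block: str) -> dict:
    """Stand-in for the LLM: paragraphs at empty lines, sentences at line breaks."""
    paragraphs = []
    raw_paragraphs = [p for p in re.split(r"\n{2,}", block) if p.strip()]
    for idx, para in enumerate(raw_paragraphs, start=1):
        sentences = [{"text": line} for line in para.split("\n") if line.strip()]
        paragraphs.append({"id": idx, "sentences": sentences})
    return {"paragraphs": paragraphs}


def hierarchical_split(
    text: str,
    structure_block: Callable[[str], dict],
    block_size: int = 2000,
) -> list[str]:
    """Steps 2 to 4: structure each block, then concatenate the paragraph texts."""
    result = []
    for block in split_into_blocks(text, block_size):
        structured = structure_block(block)
        for para in structured["paragraphs"]:
            result.append("\n".join(s["text"] for s in para["sentences"]))
    return result


block_lengths = [len(b) for b in split_into_blocks("x" * 5000)]
print(block_lengths)  # [2000, 2000, 1000]

demo = hierarchical_split("A1.\nA2.\n\nB1.", fake_structure_block)
print(demo)  # ['A1.\nA2.', 'B1.']
```

Because blocks are cut by character position, a paragraph that straddles a block boundary is handed to the model in two pieces and comes back as two paragraphs, which is one reason the block size may need tuning for a given text.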
2.4 Prompt for the LLM
The PARAGRAPH_SEPARATION_PROMPT defined in chunking/prompts.py:
PARAGRAPH_SEPARATION_PROMPT = """
You are a text structuring engine. Parse the input text according to the following [Splitting Rules] and convert it into a hierarchical structure (Paragraph > Sentence).
[Splitting Rules]
Structure the input text according to the following rules.
The goal is to divide the text into "large meaningful blocks (Paragraphs)" and decompose them into "Sentences."
[Rule 1: Paragraph Splitting Criteria (Follow strictly)]
**When to split (Do not split otherwise):**
- Split where empty lines (\\n\\n) exist.
**When NOT to split (Important):**
- Headings (e.g., "Chapter X") and the immediately following body text must be included in the same Paragraph **if there are no empty lines**.
- Even if the chapter changes (e.g., Chapter 1 to Chapter 2), do not split **if there are no empty lines**.
- Do not split solely by line breaks (\\n).
[Rule 2: Handling Headings]
- If there are headings such as "Chapter X":
  - Do not create a Paragraph for the heading alone.
  - Combine the heading and the immediately following body text into **one single Paragraph**.
  - However, if there is an empty line (\\n\\n) before the heading, split there.
[Rule 3: Sentence Splitting]
- Decompose the contents of a Paragraph into a sentences list by splitting at periods (。) or line breaks.
- Treat the heading part as one sentence as well.
[Concrete Example]
Input text:
Paragraph A. Paragraph A continued.

Paragraph B. Paragraph B continued.
Chapter 1 Title
Body 1. Body 2.
Chapter 2 Title
Body 3. Body 4.

Paragraph C.

Expected splitting:
- Paragraph 1: "Paragraph A. Paragraph A continued."
- Paragraph 2: "Paragraph B. Paragraph B continued. Chapter 1 Title Body 1. Body 2. Chapter 2 Title Body 3. Body 4."
  (Since there are no empty lines, Chapter 1 and Chapter 2 are included in the same Paragraph)
- Paragraph 3: "Paragraph C."
[Output Requirements]
- Output in a structure containing a sentences list within a paragraphs list, following the JSON schema.
- Do not omit or summarize the content of the original text; **maintain the exact string**.
- **Use only empty lines (\\n\\n) as the splitting criteria, and do not split at chapter transitions.**
"""
2.5 Response Schema (Pydantic Model)
Defined in chunking/models.py:
from typing import List

from pydantic import BaseModel, Field


class SentenceUnit(BaseModel):
    """A single sentence or minimum unit of meaning"""
    text: str = Field(description="A single sentence or minimum unit of meaning")


class ParagraphUnit(BaseModel):
    """Paragraph unit"""
    id: int = Field(description="Paragraph ID")
    sentences: List[SentenceUnit] = Field(description="List of sentences contained in this paragraph")

    @property
    def full_text(self) -> str:
        """Return the full text of the paragraph combined

        Note: Sentences are joined by line breaks (\n).
        This preserves the original text structure and improves
        processing accuracy in Step 2 and Step 3.

        Especially important for CSV input:
        - Line breaks within CSV cells are preserved
        - Readability is enhanced
        - Semantic splitting becomes accurate
        """
        return "\n".join([s.text for s in self.sentences])  # Join with line breaks


class StructuralResult(BaseModel):
    """Result of text structuring"""
    paragraphs: List[ParagraphUnit]
⚠️ Important: The full_text property combines sentences using line breaks (\n).
This preserves the original text structure, improving processing accuracy in Step 2 and Step 3.
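To make the schema concrete, a response that conforms to StructuralResult would look roughly like the JSON below. The sketch uses the standard-library `json` module and a plain function in place of Pydantic so that it runs without dependencies; the payload is a hand-written illustration, not actual Gemini output, and the module-level `full_text` function is a hypothetical mirror of the `ParagraphUnit.full_text` property.

```python
import json

# Hand-written example payload in the shape of StructuralResult (not real API output).
raw = """
{
  "paragraphs": [
    {"id": 1, "sentences": [{"text": "Chapter 1 Title"}, {"text": "Body 1."}, {"text": "Body 2."}]},
    {"id": 2, "sentences": [{"text": "Paragraph C."}]}
  ]
}
"""


def full_text(paragraph: dict) -> str:
    """Mirror of ParagraphUnit.full_text: join sentence texts with line breaks."""
    return "\n".join(s["text"] for s in paragraph["sentences"])


data = json.loads(raw)
texts = [full_text(p) for p in data["paragraphs"]]
print(texts)  # ['Chapter 1 Title\nBody 1.\nBody 2.', 'Paragraph C.']
```

Note that the heading is stored as its own sentence inside the paragraph, as Rule 3 of the prompt requests, and reappears as the first line of the joined text.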
2.6 Why is Step 1 Necessary?
| Issue | Step 1 Solution |
|---|---|
| Separation of headings | Keep heading and body as one paragraph |
| Splitting by character count | LLM understands meaning and splits |
| Lack of structure | Maintain physical structure by splitting based on empty lines |
| Misrecognition of line breaks | Distinguish between empty lines (\n\n) and line breaks (\n) |
3. Explaining step1.py
3.1 Purpose of the File
step1.py is a test program designed to independently verify the operation of Step 1 (Hierarchical Split).
- Implemented with synchronous processing (easy to debug and understand)
- Designed for unit testing (verifies Step 1 in isolation)
- Contains test data to verify integration with Step 2 and Step 3
3.2 Program Structure
# Structure of step1.py

# 1. Imports
import os
from google import genai
from google.genai import types
from chunking.models import StructuralResult
from chunking.prompts import PARAGRAPH_SEPARATION_PROMPT


# 2. Core function
def step1_hierarchical_split(text: str, api_key: str, block_size: int = 2000) -> list[str]:
    """Implements the core functionality of Step 1"""
    ...


# 3. Main processing
def main():
    """Execute tests"""
    ...
3.3 Program Flow
Processing flow of step1.py
| Order | Process | Details |
|---|---|---|
| 1 | Retrieve API key | api_key = os.getenv("GOOGLE_API_KEY") |
| ↓ | | |
| 2 | Prepare test text | test_text = """RAG (Retrieval-Augmented Generation)...""" ※ Expected to split into 5 paragraphs by empty lines (\n\n) |
| ↓ | | |
| 3 | Call function | paragraphs = step1_hierarchical_split(test_text, api_key) |
| ↓ | | |
| 4 | Display/Verify results | Confirm paragraph count (expected: 5 paragraphs); display content of each paragraph; verify validation points |
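For the first row of the flow, `os.getenv` returns `None` when the variable is missing, which otherwise surfaces later as a less obvious API error. A small guard like the following can fail early; this is a sketch, and `get_api_key` is a hypothetical helper name, not a function from step1.py.

```python
import os


def get_api_key(var_name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from the environment or fail with a clear message."""
    api_key = os.getenv(var_name)
    if not api_key:
        raise RuntimeError(f"{var_name} is not set; export it before running step1.py")
    return api_key


# Demonstrate the failure path with a deliberately unset, made-up variable name.
try:
    get_api_key("STEP1_DEMO_UNSET_VAR")
except RuntimeError as exc:
    print(exc)
```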
3.4 Detailed Explanation of the Core Function
def step1_hierarchical_split(text: str, api_key: str, block_size: int = 2000) -> list[str]:
    """
    Splits text into paragraph units (Core functionality of Step 1)

    Args:
        text: Input text
        api_key: Gemini API key
        block_size: Block size (number of characters)

    Returns:
        List of paragraphs
    """
    # 1. Initialize Gemini API client
    client = genai.Client(api_key=api_key)

    # 2. Split text into blocks
    # Example: 5000 characters of text -> 3 blocks (2000, 2000, 1000 characters)
    blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    print(f"Input: {len(text)} chars -> {len(blocks)} blocks")

    paragraphs = []

    # 3. Process each block
    for i, block in enumerate(blocks):
        print(f"Processing block {i + 1}/{len(blocks)}...")

        # 4. Create prompt (Prompt + Input text)
        prompt = f"{PARAGRAPH_SEPARATION_PROMPT}\n\n【Input Text】\n{block}"

        # 5. Call Gemini API (Synchronous)
        # - gemini-2.5-flash: Latest stable version with high rate limits and performance
        # - response_mime_type: Specify JSON format
        # - response_schema: Specify Pydantic model
        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=prompt,
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=StructuralResult
            )
        )

        # 6. Parse response
        result = StructuralResult.model_validate_json(response.text)

        # 7. Extract paragraphs (using full_text property to join with newline)
        for para in result.paragraphs:
            paragraphs.append(para.full_text)

        print(f" -> Extracted {len(result.paragraphs)} paragraphs")

    return paragraphs
3.5 Details on API Calls
- Although "gemini-2.5-flash" is used as the default for its cost-performance, you can try the latest models and compare results (though there is little difference in paragraph-splitting performance). The prices below are the figures given at the time of writing, in USD per 1M tokens; check the current Gemini API pricing before relying on them.
| Model (Model ID) | Input (1M tokens) | Output (1M tokens) | Tendency/Cost Characteristics |
|---|---|---|---|
| gemini-2.5-flash | $0.075 | $0.30 | [Cheapest] Same price range as the previous 1.5 Flash. Ideal for high-volume document processing or RAG search phases where quantity matters. |
| gemini-3-flash | $0.15 | $0.60 | [Balanced] Twice the cost of 2.5 Flash, but has recognition capabilities close to 3 Pro. Very cost-effective as a "decision-maker" for complex agents. |
| gemini-3-pro | $2.00 | $12.00 | [High Cost/High Performance] High unit price, and thinking tokens (the thinking process) are counted as output, so actual billing can be 20-50 times higher than Flash models. Should be reserved for critical final reasoning. |
response = client.models.generate_content(
    model="gemini-2.5-flash",                       # Model name
    contents=prompt,                                # Prompt
    config=types.GenerateContentConfig(
        response_mime_type="application/json",      # Specify JSON format
        response_schema=StructuralResult            # Specify Pydantic schema
    )
)
| Parameter | Value | Description |
|---|---|---|
| model | "gemini-2.5-flash" | Latest stable version, high rate limits and performance |
| response_mime_type | "application/json" | Request JSON response |
| response_schema | StructuralResult | Specify schema via Pydantic model |
4. Concrete Example
- Run step1.py to see the results and deepen your understanding.
- The end-to-end test from Step 1 to Step 3 is in check_async.py.
4.1 Test Input Text
| Paragraph | Content | Note |
|---|---|---|
| 1 | RAG (Retrieval-Augmented Generation) is a technique that combines retrieval and generation. It retrieves relevant information from an external knowledge base and passes it as context to the LLM. Announced by Facebook in 2020, it is now adopted by many systems. The greatest advantage of this method is its ability to reflect the latest information. This allows it to answer time-sensitive questions that an LLM alone cannot handle. It is also reported to have the effect of reducing hallucinations. | |
| ↑ Empty Line ↓ | Split point | |
| 2 | Semantic chunking is a technique for splitting text into semantic units. A "chunk" refers to each block of divided text. An "Embedding" is a text converted into a numerical vector. Chunk size significantly impacts search accuracy. If it is too small, context is lost and embedding quality decreases. If it is too large, search noise increases and irrelevant information gets mixed in. | |
| ↑ Empty Line ↓ | Split point | |
| 3 | Kyoto's autumn leaves are best viewed from mid-to-late November. Kiyomizu-dera and Arashiyama are known as particularly popular spots. To avoid crowds, early weekday mornings are recommended. Okinawa's sea has high clarity, making it ideal for snorkeling. Beautiful beaches are scattered around Onna Village, about an hour by car from Naha. While caution is needed for typhoons in summer, it is warm and pleasant in other seasons too. | |
| ↑ Empty Line ↓ | Split point | |
| 4 | A vector database is a system for efficiently storing and searching high-dimensional vectors. Representative products include Pinecone, Weaviate, and Chroma, etc. It achieves fast similarity searches through ANN (Approximate Nearest Neighbor) algorithms. ANN accuracy and speed are in a trade-off relationship. You can adjust this balance by choosing indexing methods like HNSW or IVF. Scalability and cost are also important criteria when selecting a vector database. | |
| ↑ Empty Line ↓ | Split point | |
| 5 | Chapter 1 Introduction to Machine Learning Machine learning is a collective term for algorithms that learn patterns from data. It is broadly categorized into supervised learning, unsupervised learning, and reinforcement learning. In this chapter, we explained these basic concepts. Chapter 2 Fundamentals of Deep Learning Deep learning is a machine learning method that uses multi-layered neural networks. It has achieved revolutionary results in image recognition and natural language processing. In this chapter, we will explain the basic architectures of CNN and RNN. | ⚠️ No empty line between Ch 1 and Ch 2 -> Same paragraph |
4.2 Visualization of Split Points
| Paragraph | Content | Split Reason |
|---|---|---|
| Paragraph 1 | Explanation of RAG: RAG (Retrieval-Augmented Generation) is... ...it is also reported to have the effect of reducing hallucinations. | |
| ↓ | Empty line (\n\n) present | Split |
| Paragraph 2 | Explanation of Semantic Chunking: Semantic chunking is... ...and irrelevant information gets mixed in. | |
| ↓ | Empty line (\n\n) present | Split |
| Paragraph 3 | Tourism Information: Kyoto's autumn leaves are... ...it is warm and pleasant in other seasons too. | |
| ↓ | Empty line (\n\n) present | Split |
| Paragraph 4 | Explanation of Vector DB: A vector database is... ...Scalability and cost are also important criteria. | |
| ↓ | Empty line (\n\n) present | Split |
| Paragraph 5 | Chapter Structure (Ch 1 + Ch 2): Chapter 1 Introduction to Machine Learning... In this chapter, we explained these basic concepts. Chapter 2 Fundamentals of Deep Learning... In this chapter, we will explain the basic architectures of CNN and RNN. | ⚠️ No empty line, so do not split |
4.3 Expected Output
# Step 1 Output (5 paragraphs)
[
    # Paragraph 1: RAG explanation (Definition + Advantages)
    "RAG (Retrieval-Augmented Generation) is a technique that combines retrieval and generation.\n"
    "It retrieves relevant information from an external knowledge base and passes it as context to the LLM.\n"
    "Announced by Facebook in 2020, it is now adopted by many systems.\n"
    "The greatest advantage of this method is its ability to reflect the latest information.\n"
    "This allows it to answer time-sensitive questions that an LLM alone cannot handle.\n"
    "It is also reported to have the effect of reducing hallucinations.",

    # Paragraph 2: Semantic chunking explanation (Definition + Usage)
    "Semantic chunking is a technique for splitting text into semantic units.\n"
    "A \"chunk\" refers to each block of divided text.\n"
    "An \"Embedding\" is a text converted into a numerical vector.\n"
    "Chunk size significantly impacts search accuracy.\n"
    "If it is too small, context is lost and embedding quality decreases.\n"
    "If it is too large, search noise increases and irrelevant information gets mixed in.",

    # Paragraph 3: Tourism information (Kyoto + Okinawa)
    "Kyoto's autumn leaves are best viewed from mid-to-late November.\n"
    "Kiyomizu-dera and Arashiyama are known as particularly popular spots.\n"
    "To avoid crowds, early weekday mornings are recommended.\n"
    "Okinawa's sea has high clarity, making it ideal for snorkeling.\n"
    "Beautiful beaches are scattered around Onna Village, about an hour by car from Naha.\n"
    "While caution is needed for typhoons in summer, it is warm and pleasant in other seasons too.",

    # Paragraph 4: Vector DB explanation (Definition + Usage)
    "A vector database is a system for efficiently storing and searching high-dimensional vectors.\n"
    "Representative products include Pinecone, Weaviate, and Chroma, etc.\n"
    "It achieves fast similarity searches through ANN (Approximate Nearest Neighbor) algorithms.\n"
    "ANN accuracy and speed are in a trade-off relationship.\n"
    "You can adjust this balance by choosing indexing methods like HNSW or IVF.\n"
    "Scalability and cost are also important criteria when selecting a vector database.",

    # Paragraph 5: Chapter structure (Ch 1 + Ch 2) ← Same paragraph since there is no empty line!
    "Chapter 1 Introduction to Machine Learning\n"
    "Machine learning is a collective term for algorithms that learn patterns from data.\n"
    "It is broadly categorized into supervised learning, unsupervised learning, and reinforcement learning.\n"
    "In this chapter, we explained these basic concepts.\n"
    "Chapter 2 Fundamentals of Deep Learning\n"
    "Deep learning is a machine learning method that uses multi-layered neural networks.\n"
    "It has achieved revolutionary results in image recognition and natural language processing.\n"
    "In this chapter, we will explain the basic architectures of CNN and RNN."
]
4.4 Verification Points
| Check Item | Expected Result | Verification Method |
|---|---|---|
| Number of Paragraphs | 5 paragraphs | len(paragraphs) == 5 |
| Empty Line Split | Split by (\n\n) | Check boundaries of each paragraph |
| No Chapter Split | Ch 1 and Ch 2 in same paragraph | Check contents of Paragraph 5 |
| Text Retention | No omission | Compare character count/content |
| Line Break Join | Join sentences with \n | Check output of full_text |
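The checks in the table can be automated along these lines. This is a sketch under stated assumptions: `verify_step1` and its parameters are hypothetical names, and text retention is compared with all whitespace removed, because sentences are re-joined with \n and the exact spacing of the original may therefore differ.

```python
import re


def _strip_whitespace(s: str) -> str:
    """Remove all whitespace so re-joined line breaks do not affect the comparison."""
    return re.sub(r"\s+", "", s)


def verify_step1(
    original: str,
    paragraphs: list[str],
    expected_count: int,
    together: tuple[str, ...] = (),
) -> dict[str, bool]:
    """Run the table's checks and return a mapping of check name to pass/fail."""
    checks = {
        "paragraph_count": len(paragraphs) == expected_count,
        "no_empty_line_inside": all("\n\n" not in p for p in paragraphs),
        "text_retained": _strip_whitespace("".join(paragraphs)) == _strip_whitespace(original),
    }
    if together:
        # All markers (e.g. both chapter headings) must appear in one paragraph.
        checks["kept_together"] = any(all(m in p for m in together) for p in paragraphs)
    return checks


original = "Intro sentence.\n\nChapter 1 Title\nBody 1.\nChapter 2 Title\nBody 2."
paragraphs = ["Intro sentence.", "Chapter 1 Title\nBody 1.\nChapter 2 Title\nBody 2."]
report = verify_step1(original, paragraphs, expected_count=2, together=("Chapter 1", "Chapter 2"))
print(report)
```

For the article's test text, the call would use expected_count=5 with the Chapter 1 and Chapter 2 headings as the `together` markers.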
4.5 Integration with Step 2 and Step 3 (Test Data Flow)
[Step 1] 1 Text → 5 Paragraphs
| Paragraph | Content |
|---|---|
| Para 1 | RAG explanation (Definition + Advantages) |
| Para 2 | Semantic chunking explanation (Definition + Usage) |
| Para 3 | Tourism info (Kyoto + Okinawa) |
| Para 4 | Vector DB explanation (Definition + Usage) |
| Para 5 | Chapter structure (Ch 1 + Ch 2) |
[Step 2] 5 Paragraphs → 10 Chunks (Semantically Split)
| Input | Output |
|---|---|
| Para 1 | Chunk 1 (RAG def) + Chunk 2 (RAG advantages) |
| Para 2 | Chunk 3 (Term def) + Chunk 4 (Term usage) |
| Para 3 | Chunk 5 (Kyoto tourism) + Chunk 6 (Okinawa tourism) |
| Para 4 | Chunk 7 (Vector DB def) + Chunk 8 (Usage) |
| Para 5 | Chunk 9 (Ch 1) + Chunk 10 (Ch 2) |
[Step 3] 10 Chunks → 7 Chunks (Merge/Split based on Continuity)
| Process | Result | Reason |
|---|---|---|
| Chunk 1+2 | Merge | Forward dependency: "This method", "It" |
| Chunk 3+4 | Merge | Backward dependency: "Chunk", "Embedding" undefined |
| Chunk 5 | Independent | Kyoto tourism is understandable alone |
| Chunk 6 | Independent | Okinawa tourism is understandable alone |
| Chunk 7+8 | Merge | Backward dependency: "ANN", "Vector DB" undefined |
| Chunk 9 | Independent | Ch 1 is complete |
| Chunk 10 | Independent | Ch 2 is understandable alone |
Detailed Verification Patterns
| Pattern | Description | Example | Judgment in Step 3 |
|---|---|---|---|
| Forward Dependency | References the past with pronouns | "This method", "It" | → Merge (True) |
| Backward Dependency | Technical term used without definition | "Chunk", "Embedding", "ANN" | → Merge (True) |
| Independent Judgment | Understandable alone despite same topic | Kyoto tourism / Okinawa tourism | → Separate (False) |
| Chapter Structure | When chapter changes | Ch 1 / Ch 2 | → Separate (False) |
5. Important Design Decisions
5.1 Why use only empty lines as the split criteria?
[Problem] Splitting at chapter changes makes it difficult for Step 2 to determine if it is a chapter change or not.
[Example: Input Text]
Chapter 1 Introduction to Machine Learning
Machine learning is...
Chapter 2 Fundamentals of Deep Learning
Deep learning is...
| Method | Step 1 Output | Problem in Step 2 |
|---|---|---|
| ❌ Split by Chapter | Para 1: Chapter 1...\nMachine learning is...; Para 2: Chapter 2...\nDeep learning is... | Since Para 1 only has "Chapter 1", it is difficult to judge semantic splitting; chapter structure info is lost |
| ✅ Split by Empty Line | Para 1: Chapter 1...\nMachine learning... Chapter 2...\nDeep learning... | Para 1 contains both "Ch 1" and "Ch 2", allowing detection of chapter turn points → Chunk 1: Ch 1 → Chunk 2: Ch 2 |
5.2 Why join sentences with line breaks (\n)?
# full_text property in models.py
@property
def full_text(self) -> str:
    """Returns the full text of the paragraph joined by line breaks.

    Note: Sentences are joined with line breaks (\n).
    This maintains original text structure, improving processing accuracy in Step 2 and Step 3.

    Especially important for CSV input:
    - Preserves line breaks inside CSV cells
    - Increases readability
    - Semantic splitting becomes accurate
    """
    return "\n".join([s.text for s in self.sentences])  # ← Joined by line break
| Joining Method | Result | Problem |
|---|---|---|
| Empty string "".join() | "Sentence 1.Sentence 2.Sentence 3." | Original line break structure is lost |
| Line break "\n".join() | "Sentence 1.\nSentence 2.\nSentence 3." | ✅ Original structure is preserved |
Benefits of line break joining:
- Easier for Step 2 to recognize sentence breaks
- Improved accuracy of continuity judgment in Step 3
- High readability during debugging
- Preserves line breaks within cells when inputting CSV
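The difference between the two joining methods is easy to confirm directly. The snippet below is a small illustration using placeholder sentences: only the newline-joined form can be split back into the original sentence list.

```python
sentences = ["Sentence 1.", "Sentence 2.", "Sentence 3."]

joined_empty = "".join(sentences)
joined_newline = "\n".join(sentences)

print(repr(joined_empty))    # 'Sentence 1.Sentence 2.Sentence 3.'
print(repr(joined_newline))  # 'Sentence 1.\nSentence 2.\nSentence 3.'

# The sentence boundaries survive only in the newline-joined string.
recovered = joined_newline.split("\n")
print(recovered == sentences)  # True
```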
5.3 Reason for selecting block size
block_size: int = 2000 # Default value
| Size | Pros | Cons |
|---|---|---|
| 500 chars | Fast processing | Paragraphs are prone to being split |
| 2000 chars | Good balance | ✅ Recommended |
| 5000 chars | Paragraphs are well-maintained | Close to API limits |
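One way to soften this trade-off is to cut each block at the last empty line that fits within the size limit, so that paragraphs are less likely to be divided between blocks. This is a sketch of an alternative, not what step1.py does (it slices at fixed character offsets); `split_blocks_at_empty_lines` is a hypothetical helper name.

```python
def split_blocks_at_empty_lines(text: str, block_size: int = 2000) -> list[str]:
    """Cut blocks at the last empty line within block_size; fall back to a hard cut."""
    blocks = []
    start = 0
    while start < len(text):
        end = min(start + block_size, len(text))
        if end < len(text):
            # Search the current window for the last empty line ("\n\n").
            cut = text.rfind("\n\n", start + 1, end)
            if cut != -1:
                end = cut + 2  # keep the empty line with the preceding block
        blocks.append(text[start:end])
        start = end
    return blocks


text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 30
blocks = split_blocks_at_empty_lines(text, block_size=50)
print([len(b) for b in blocks])  # [32, 32, 30]
```

A paragraph longer than block_size still gets a hard cut, so the block size continues to matter for texts with long paragraphs.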
Summary
Role of Step 1
| Item | Content |
|---|---|
| Input | Text (string) |
| Output | Paragraph list (list[str]) |
| Purpose | Split while maintaining physical structure (paragraphs by empty lines) |
| Method | Structure recognition via LLM (Gemini API) |
Key Points
- Split only by empty lines: Only \n\n is the split criterion; do not split at chapter changes.
- Maintain headings and body text: Treat "Chapter X" and the subsequent body text as the same paragraph.
- Join sentences with line breaks: full_text joins sentences with \n to maintain structure.
- Perfect retention of text: No omissions or summaries.
Next Step
The output of Step 1 (paragraph list) is used as input for Step 2 (Semantic Chunking).
In Step 2, semantic turn points within each paragraph are detected and split into finer chunks.
Specifically, chapter changes (Ch 1 → Ch 2) are split in Step 2.
In Step 3, continuity between adjacent chunks is judged, and they are merged/separated based on the following patterns:
| Pattern | Description | Result |
|---|---|---|
| Forward Dependency | Reference the past with pronouns ("This", "It") | Merge |
| Backward Dependency | Technical term used without definition | Merge |
| Independent Judgment | Understandable alone despite same topic | Separate |
| Chapter Structure | When chapter changes | Separate |