Translated by AI
Act 5: Summarizing Conversation Logs with Local SLMs — Measurement and Convergence
About This Article
Claude Code drafted this based on AI conversation logs and experimental records, which I then edited and polished.
The conclusions in this article are observations from summarizing conversation logs in my local environment (qwen2.5:14b / Ollama) and do not claim to be a general optimization.
In the previous article, I introduced the trial-and-error process of category redesign, the explosion of duplicates caused by expanding to 16 categories, and the collapse of "Final Reduce."
At the end of Act IV, I wrote:
"Looks like it could be integrated" and "Integrating it won't lower quality" are two different things. I needed a mechanism to make decisions based on data, not intuition.
This time, I will record the design of that "data-based decision mechanism"—the Eval-Kit—and the process of integrating and converging categories based on measurements.
Introduction
What I Will Do
- Explain the design philosophy and structure of the quality evaluation framework (Eval-Kit)
- Show the duplicate structure of semantic groups revealed by "measuring"
- Present the design and verification results of the three integrated categories (YARUKOTO / YATTAKOTO / WAKATTAKOTO)
- Explain the intent and effects of the new perspective categories (SHUSSHISHA / KOKISHIN)
- Show the convergence from 21 categories to 8 and the numerical effects
- Explain operational improvements through externalizing category settings
What I Will Not Do
- Discuss releasing Eval-Kit as a general-purpose tool (it is a mechanism specific to this pipeline)
- Present statistically significant benchmark results (the number of test files is limited)
- Provide a comprehensive comparative evaluation of SLM models
Making Failure "Measurable"
Why Quality Evaluation Became Necessary
Let's look back at the problems encountered in Act IV.
- Expanding to 16 categories achieved 94% GT coverage, but the output swelled to 2,996 items (35.7x multiplier)
- Final Reduce collapsed due to template echoback (coverage 94% → 20%)
- While the duplicate structure between categories was visible, I could only judge by intuition whether quality would drop after integration
If I move on to designing integrated categories in this state, I cannot make decisions beyond "it feels better." Manually reading and comparing outputs after every improvement is inefficient and subjective.
I needed a repeatable benchmark.
Eval-Kit Design Philosophy
Eval-Kit is a mechanism for comparatively evaluating the output quality of the SLM pipeline against Claude Code's output.
What I prioritized most in the design was fair comparison conditions. When comparing SLM output with "ground truth," if conditions differ, it's impossible to identify the cause of discrepancies. I prepared four types of benchmarks (T1 to T4) so that I could isolate whether issues were due to "SLM capability deficiency, input differences, or category definition problems."
T1–T4: Four Types of Benchmarks
| ID | Input | Summarizer | Category Constraints | Purpose |
|---|---|---|---|---|
| T1 | Raw Log | Claude Code | Free | Maximum coverage (Upper baseline) |
| T2 | Compressed Log | Claude Code | Free | Pre-processing info loss check |
| T3 | Raw Log | Claude Code | 8 Categories Only | Category classification validity check |
| T4 | Compressed Log | Claude Code | 8 Categories Only | Fair comparison with SLM |
| TS | Compressed Log | SLM (14b) | 8 Categories + Full Prompt | Evaluation Target |
Here, the GT (Ground Truth) is not strictly correct data. It is a summary created by Claude Code reading the same log. Claude Code is not an "absolute truth" but a reference model assumed to have higher performance under the same conditions. I use it as a reference point for relative evaluation to "compare SLM output with the output of a larger model."
What to Compare
I differentiate between different problems using four comparative axes.
- T1 vs TS → Exhaustive check of information dropped by the SLM
- T4 vs TS → Quality difference under identical conditions (same input, same categories)
- T1 vs T2 → Quantification of information loss due to pre-processing
- T3 vs T4 → Effect of pre-processing on category classification
Most important is T4 vs TS. Since T4 and TS are created with the same input (compressed log) and the same category definitions, I can measure the pure performance gap of "Claude Code vs SLM."
Evaluation Method
For evaluation, I verify each item in the GT against the SLM output one by one and judge it in three stages.
| Judgment | Meaning |
|---|---|
| ○ | Captured by SLM |
| △ | Partially captured (part of information missing) |
| ✗ | Not captured by SLM |
This is not automated. It requires judging the semantic consistency of items, so I verify them manually. The evaluator is always the same person, and the judgment criteria (definitions of ○/△/✗) are fixed. The reproducibility of the evaluation is limited, but it is consistent as a comparison before and after improvement. It takes 30 minutes to an hour to evaluate one file, but by using it repeatedly, I have been able to objectively measure the difference between "before vs. after improvement."
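The judging itself is manual, but tallying the results is mechanical. As a minimal sketch (not the actual Eval-Kit code), assuming coverage counts ○ and △ together over all GT items, the tally could look like this. The function name `coverage` is hypothetical.

```python
from collections import Counter

# Judgment symbols as used in the table above.
CAPTURED, PARTIAL, MISSED = "○", "△", "✗"

def coverage(judgments):
    """Return (counts, coverage %), where coverage = (○ + △) / total GT items.

    Assumption: partial captures count toward coverage.
    """
    total = len(judgments)
    if total == 0:
        raise ValueError("no GT items to evaluate")
    counts = Counter(judgments)
    covered = counts[CAPTURED] + counts[PARTIAL]
    return counts, round(100 * covered / total)

# Example: 10 GT items judged by hand (5 captured, 3 partial, 2 missed).
judgments = [CAPTURED] * 5 + [PARTIAL] * 3 + [MISSED] * 2
counts, pct = coverage(judgments)
print(dict(counts), f"{pct}%")
```

Keeping the raw judgment list, rather than only the percentage, makes it possible to compare which GT items changed between two runs.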
Discovering Duplicate Structures by "Measuring"
Category Overlap Analysis of a 170KB File
In Act IV, I knew qualitatively that "there are duplicates." The following data is what I quantified with Eval-Kit.
I analyzed 210 items output after per-category Reduce for a 170KB file (10.8KB after pre-processing, 11 chunks) across 16 categories.
| Metric | Value |
|---|---|
| Total Items | 210 |
| Exact Duplicate | 12 items (6%) |
| Duplicate (First 40 chars match) | 18 items (9%) |
9% seems low at first glance. However, these are duplicates that different categories extracted from the same chunk of the same file. As files grow and the chunk count rises, duplicates across chunks pile on top of them and the item count balloons (the 2,996 items and 35.7x multiplier for the 255KB file in Act IV are the result).
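The two metrics in the table can be computed with a few lines of code. The following is a sketch under one assumption: every occurrence of an item beyond the first counts as one duplicate. The exact counting rule used for the table is not stated, so treat the function `duplicate_stats` as illustrative.

```python
from collections import Counter

def duplicate_stats(items, prefix_len=40):
    """Count items that repeat an earlier item, exactly or by leading prefix."""
    exact = Counter(items)
    prefix = Counter(item[:prefix_len] for item in items)
    # Every occurrence beyond the first counts as one duplicate (assumption).
    exact_dups = sum(n - 1 for n in exact.values())
    prefix_dups = sum(n - 1 for n in prefix.values())
    total = len(items)
    return {
        "total": total,
        "exact": exact_dups,
        "prefix": prefix_dups,
        "exact_pct": round(100 * exact_dups / total) if total else 0,
        "prefix_pct": round(100 * prefix_dups / total) if total else 0,
    }

items = [
    "Reduced chunk size from 10KB to 2KB to fit the enriched prompt",
    "Reduced chunk size from 10KB to 2KB to fit the enriched prompt",
    "Reduced chunk size from 10KB to 2KB to fit the enriched prompt template",
    "Final Reduce was abolished",
]
print(duplicate_stats(items))
```

Prefix matching catches items that differ only in their tail, but it still misses paraphrases, which is why the percentage understates the real overlap.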
Discovering Semantic Groups
Organizing the duplicate pairs revealed three semantic groups.
Categories with many duplicates:
| Pair | Duplicate Count | Reason |
|---|---|---|
| ACTION ↔ NEXT | 4 | "To-do" and "Next to-do" are synonymous |
| NEXT ↔ PROPOSAL | 4 | "Next to-do" and "Proposal" overlap |
| ACTION ↔ PROPOSAL | 2 | Same as above |
| CHANGE ↔ DETAIL | 2 | "Changes" and "Details" are the same |
Categories with high overlap rates: NEXT (41%), CHANGE (38%), DETAIL (38%), PROPOSAL (24%)
Zero overlap (Highly independent): DECISION (0%), DID (0%), TECH (0%)
Grouping these by meaning:
| Group | Included Categories | Count |
|---|---|---|
| To-do group | ACTION / NEXT / TODO / PROPOSAL | 4 |
| Done group | CHANGE / DETAIL / DID / TECH | 4 |
| Understanding group | FINDING / FOUND / INSIGHT | 3 |
| Independent | DECISION, DATA, IDEA | 1 each |
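A pairwise overlap table like the one above can be derived from (category, item) records. This is a sketch that treats two items as duplicates only when their text is identical; the record format and the function `category_overlap` are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def category_overlap(records):
    """records: iterable of (category, item_text) pairs.

    Returns (pair_counts, rates):
      pair_counts[(a, b)] = number of items extracted by both a and b
      rates[cat] = share of cat's items that also appear in another category
    """
    cats_by_item = defaultdict(set)
    items_by_cat = defaultdict(set)
    for cat, text in records:
        cats_by_item[text].add(cat)
        items_by_cat[cat].add(text)

    pair_counts = defaultdict(int)
    for cats in cats_by_item.values():
        for a, b in combinations(sorted(cats), 2):
            pair_counts[(a, b)] += 1

    rates = {
        cat: sum(1 for t in items if len(cats_by_item[t]) > 1) / len(items)
        for cat, items in items_by_cat.items()
    }
    return dict(pair_counts), rates

records = [
    ("ACTION", "Add eval script"),
    ("NEXT", "Add eval script"),
    ("NEXT", "Rerun the 255KB file"),
    ("PROPOSAL", "Rerun the 255KB file"),
    ("DECISION", "Abolish Final Reduce"),
]
pairs, rates = category_overlap(records)
print(pairs)
print(rates)
```

Categories whose rate stays at zero, like DECISION in this toy input, are the candidates to keep independent.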
Corroboration with a 255KB File
The same analysis on a 255KB file (66 chunks) clarifies the duplicate multiplier for each group.
| Group | SLM Total | Claude Code (T4) | Multiplier |
|---|---|---|---|
| Done group (5 categories) | 689 | 26 | 26.5x |
| To-do group (5 categories) | 664 | 17 | 39.1x |
| Understanding group (4 categories) | 756 | 14 | 54.0x |
Categories within each group are independently extracting the same information. The groups that should be integrated were clearly visible in the data.
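The multiplier column is simply the SLM item count divided by the Claude Code (T4) item count. A short check with the numbers from the table (the helper name `multiplier` is mine):

```python
def multiplier(slm_items, reference_items):
    """How many times more items the SLM produced than the reference (T4)."""
    return round(slm_items / reference_items, 1)

groups = {
    "Done": (689, 26),
    "To-do": (664, 17),
    "Understanding": (756, 14),
}
for name, (slm, t4) in groups.items():
    print(f"{name}: {multiplier(slm, t4)}x")
```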
Category Integration: 16 → 8
Integration Policy
The integration approach chosen was not "post-hoc dedup" (duplicate removal after output) but restarting from the Map prompt.
Reason:
- Post-hoc dedup can only handle exact matches via sort -u, leaving semantic duplicates behind.
- If the SLM is made to extract information as "one category covering the scope of four," duplicates are eliminated from the Map stage itself.
- Per-category Reduce also becomes unnecessary (Map output can be used as-is).
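To make the first point concrete, here is a small illustration of why exact-match dedup is not enough. `sorted(set(...))` behaves like sort -u in spirit (the real command's ordering depends on the locale); the sample items are invented.

```python
def exact_dedup(lines):
    """Keep one copy of each identical line, similar in spirit to `sort -u`."""
    return sorted(set(lines))

items = [
    "- Abolish Final Reduce",
    "- Abolish Final Reduce",                # exact duplicate: removed
    "- Stop using Final Reduce",             # paraphrase: survives
    "- Final Reduce will no longer be run",  # paraphrase: survives
]
result = exact_dedup(items)
print(len(items), "->", len(result))
for line in result:
    print(line)
```

Three of the four lines survive even though they state the same fact, which is the gap that integrating categories at the Map stage avoids.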
YARUKOTO (To-do group 4→1)
Integrated 4 categories—ACTION / NEXT / TODO / PROPOSAL—into one YARUKOTO.
The definition.md comprehensively describes the extraction targets for the original 4 categories.
【Include (Information corresponding to YARUKOTO)】
- Explicit TODOs ("Need to do ~", "Do ~ next")
- Planned actions ("Plan to do ~ next")
- Proposals/Considerations ("Consider ~", "There is also a method of ~")
- For comparison proposals of multiple plans, extract each plan individually
The last point, "extract each plan individually for comparison proposals," was crucial. In GT verification, while the old categories dispersed information (Plan A in PROPOSAL, Plan B in ACTION), YARUKOTO aggregates it into one category while keeping each plan as an individual item.
Verification Results:
| Metric | Total of 4 Old Categories | YARUKOTO |
|---|---|---|
| Map Item Count | 87 | 30-34 |
| Compression Rate | — | 62% reduction |
| GT Coverage | Equivalent | Equivalent + Individual capture of Plan A/Plan B |
With a 62% reduction in item count, GT coverage remains equivalent or better. The integration is a success.
Furthermore, I set NO_REDUCE=1 to skip per-category Reduce. Since the integrated category eliminates duplicates at the Map stage, we avoid the risk of Reduce SLM fluctuation (loss of information).
YATTAKOTO (Done group 4→1)
Integrated 4 categories—CHANGE / DETAIL / DID / TECH—into one YATTAKOTO.
For YATTAKOTO, I compared 3 versions of the prompt.
| Metric | v1 | v2 | v3 |
|---|---|---|---|
| Map Item Count | 28 | 34 | 23 |
| GT ○ | 6 | 9 | 7 |
| GT ○+△ (%) | 12 (57%) | 14 (67%) | 11 (52%) |
| Internal Duplicates | 4 sets | 5 sets | 4 sets |
| YARUKOTO Leakage | 2 items | 2 items | 1 item |
| [Reference]/[Edit] Contamination | 0 items | 3 items | 0 items |
I adopted v2. It had the highest GT coverage at 67%. Although v3 had zero [Reference]/[Edit] contamination and less noise, its GT coverage dropped to 52%. The priority of this pipeline is "Coverage > Noise Suppression." The worst outcome in a summary is "dropping important information," and noise can be handled by post-processing (normalization + filtering) during the merge.
Changes in the definition.md of v2 that had a major effect:
[Added to Recommendations]:
- Always keep specific numbers
- Always keep before/after numeric pairs ("37→24 chunks", "149→97 times", etc.)
- Keep repetition counts/version numbers ("Improved 10 times from v1 to v10", etc.)
[Added to Exclusions]:
- Future-tense tasks → Do not include at all
The instruction to "keep numbers" is the main reason for the improvement in GT coverage. qwen2.5:14b has a weak ability to retain numbers, and without instructions, it tends to abstract specific figures.
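Number retention can also be spot-checked mechanically. The helper below is hypothetical and is not part of Eval-Kit as described here; it only lists numeric tokens that appear in the source text but not in the summary, using a deliberately simple pattern.

```python
import re

# Integers or decimals, optionally followed by a percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")

def missing_numbers(source_text, summary_text):
    """Return numbers found in the source but absent from the summary."""
    summary_numbers = set(NUMBER.findall(summary_text))
    missing = []
    for n in NUMBER.findall(source_text):
        if n not in summary_numbers and n not in missing:
            missing.append(n)
    return missing

source = "Chunks went from 37→24, and calls from 149→97 times (35% reduction)."
summary = "The number of chunks and calls was reduced (149→97)."
print(missing_numbers(source, summary))
```

A check like this cannot tell whether a dropped number mattered, so it complements the manual ○/△/✗ judgment rather than replacing it.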
Uncaptured items common to all versions
In GT verification, there were 8 items that could not be captured by any of v1, v2, or v3.
| GT Item | Leakage Type | Necessary for Summary? |
|---|---|---|
| Explicit "Prohibitions" | Minor instruction change | No |
| Instruction to "Pick broadly" | Minor instruction change | No |
| Explicit Japanese output | Minor instruction change | No |
| v1-v10 10 repetitions | Number (count) | △ |
| Reduction rate 6.5%→38.7% | Number (rate) | △ |
| 149→97 times (35% reduction) | Before/after number | △ |
| 25→17 times (32% reduction) | Before/after number | △ |
| ssh://wk-01.local/... | Environment meta-info | No |
7 out of the 8 items were not captured by the old 4 categories either. In other words, this is not a degradation caused by integration but the capability limit of the SLM. The loss of numeric data is a known weakness of qwen2.5:14b and cannot be solved by category design.
Why this analysis was important: It eliminates the risk of falsely concluding that "quality dropped after integration." By confirming that the uncaptured items in the GT existed before the integration, I was able to gain confidence in the judgment that "integration is safe."
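The check described above is a set comparison: GT items missed after integration, minus those already missed before it, are the true regressions. A sketch with placeholder item IDs (`gt-01` and so on are hypothetical labels, not the real GT items):

```python
def integration_regressions(missed_before, missed_after):
    """Split items missed after integration into pre-existing misses and regressions."""
    before, after = set(missed_before), set(missed_after)
    return {
        "pre_existing": sorted(after & before),
        "regressions": sorted(after - before),
    }

# Placeholder IDs: 8 items missed after integration, 7 of them already missed before.
missed_after = [f"gt-{i:02d}" for i in range(1, 9)]
missed_before = [f"gt-{i:02d}" for i in range(1, 8)]

report = integration_regressions(missed_before, missed_after)
print(len(report["pre_existing"]), "pre-existing,", len(report["regressions"]), "new")
```

Only the items in `regressions` need a closer look when deciding whether the integrated prompt lost something the old categories had.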
WAKATTAKOTO (Understanding group 3→1)
Integrated 3 categories—FINDING / FOUND / INSIGHT—into one WAKATTAKOTO.
| Metric | Total of 3 Old Categories | WAKATTAKOTO |
|---|---|---|
| Map Item Count | 67 | 43-47 |
| Compression Rate | — | 30-36% reduction |
| GT Coverage | — | 80% (○5/△3/✗2) |
Similar to YARUKOTO and YATTAKOTO, I used NO_REDUCE=1 to use the Map output as-is.
New Perspective Categories
While I consolidated 11 categories into 3, I also added categories from new perspectives.
SHUSSHISHA (What the Investor Wants to Know)
A category with the perspective: "If the investor of this development read these logs, what would they want to know?"
It appropriately excludes technical details and focuses on progress, decisions, and risks. This results in an output similar to an executive summary.
| Metric | Value |
|---|---|
| Map Item Count | 20-21 |
| "Not Applicable" Rate | High (6/11 chunks) |
| Features | Appropriately filters out technical details. Compact and focused output |
The high "Not Applicable" rate is expected behavior. It correctly determines that there is no information the investor would want to know in chunks centered on technical discussions.
KOKISHIN (Curiosity / Interesting Points)
The "Huh!" test—a category that extracts surprising information, counter-intuitive facts, or unexpected discoveries.
| Metric | Value |
|---|---|
| Map Item Count | 18-19 |
| Features | Specializes in unexpected discoveries and counter-intuitive facts. Unique angle |
Example output of KOKISHIN:
- "The Chinese text included in the SLM output was not prompt injection, but a language switch by the multilingual model."
- "Even after removing perfectly matched duplicates with sort -u, no semantic duplicates were removed (the SLM's output is consistent at the character level)."
These are extractions based on "surprise," which cannot be captured by TECH or WAKATTAKOTO.
JIYUU (Key Points Chosen by the SLM)
I also tested a category that instructs the SLM to "please choose freely the points you think are important."
Result: There were extractions in all chunks and a high item count (35 items), but the overlap with other categories (TECH / IDEA) was significant, and the unique perspective was limited. The "SLM free perspective" tends to be a degraded copy of existing categories.
JIYUU was ultimately set to SKIP=1 (disabled).
Convergence: 21 Categories → 8 Categories
Final Configuration
After integration, new additions, and disabling, it converged to the following 8 categories.
| # | Category | LABEL | Type | Note |
|---|---|---|---|---|
| 1 | YARUKOTO | Tasks / Next Steps | Integrated | Absorbed ACTION/NEXT/TODO/PROPOSAL |
| 2 | YATTAKOTO | Actions Taken / Implementation | Integrated | Absorbed CHANGE/DETAIL/DID/TECH |
| 3 | WAKATTAKOTO | Understandings / Insights | Integrated | Absorbed FINDING/FOUND/INSIGHT |
| 4 | DATA | Numerical / Config Data | Independent | |
| 5 | DECISION | Decisions Made | Independent | |
| 6 | IDEA | New Projects / Ideas | Independent | |
| 7 | KOKISHIN | Curiosity / Interesting Points | New | "Huh!" test |
| 8 | SHUSSHISHA | Investor Interest | New | Executive Summary |
The 14 categories set to SKIP=1 (11 old categories + HIGHLIGHT + KEYWORD + JIYUU) were not deleted; they remain in the categories/ directory. This is because they can be bulk-restored with SKIP_DISABLE=1 when performing comparison verification with old categories using Eval-Kit.
Effect on Small Files (29KB)
The results of processing the same file (29KB) with 21 categories (5th Act) and 8 categories (7th Act):
| Metric | 21 Categories | 8 Categories | Improvement |
|---|---|---|---|
| Category Count | 21 | 8 | -62% |
| Total TS Items | 328 | 91 | -72% |
| T4 Comparison Ratio | 7.0x | 2.1x | -70% |
| Validation Failures | 8 | 0 | -100% |
| Number of Jobs | 231 | 97 | -58% |
Ratio: 7.0x → 2.1x (-70%). Validation failures are zero.
By reducing the number of categories, the load on the SLM was reduced, and validation failures (format collapse, header mismatch, etc.) were completely eliminated.
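The Improvement column is the relative change from the 21-category run to the 8-category run. A quick check of the small-file numbers (the helper name `pct_change` is mine):

```python
def pct_change(before, after):
    """Relative change in percent, rounded to the nearest integer."""
    return round(100 * (after - before) / before)

small_file = {
    "Category Count": (21, 8),
    "Total TS Items": (328, 91),
    "T4 Comparison Ratio": (7.0, 2.1),
    "Validation Failures": (8, 0),
    "Number of Jobs": (231, 97),
}
for metric, (before, after) in small_file.items():
    print(f"{metric}: {pct_change(before, after)}%")
```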
Effect on Large Files (255KB)
The results of processing the same file (255KB) with 21 categories (6th Act) and 8 categories (8th Act):
| Metric | 21 Categories | 8 Categories | Improvement |
|---|---|---|---|
| Total TS Items | 2,996 | 1,147 | -62% |
| T4 Comparison Ratio | 35.7x | 14.9x | -58% |
| Validation Failures | 14 (5 unrecoverable) | 2 (fully recovered) | -86% |
| Number of Jobs | 1,386 | 594 | -57% |
Even for large files, the item count fell by 62% and the ratio by 58%, a significant improvement. Validation failures also dropped from 14 (5 unrecoverable) to 2 (both automatically recovered).
However, the 14.9x ratio is still high. This problem will be solved in the next installment (Sixth Act: GroupReduce).
Ratio Trend by File Size
| File | Size | Ratio (21cat) | Ratio (8cat) | Improvement |
|---|---|---|---|---|
| 02-12 | 29K | 7.0x | 2.1x | -70% |
| 02-09 | 255K | 35.7x | 14.9x | -58% |
Improvement is greater for small files (-70%), already reaching a practical ratio (2.1x). Improvement is slightly smaller for large files (-58%), as not only category overlap but also semantic overlap between chunks remains.
Externalizing Category Settings
Limits of Hardcoding
At the time of the third act, the category-specific prompt configurations (role / instruction / definition / format / rules / example) were externalized in the categories/ directory. However, settings that control the behavior of categories—whether to Reduce, SKIP flag, validation header definitions—were hardcoded in case statements within the script.
It was not a problem when there were only 4 categories, but once they increased to 16, the case statements became unmaintainable. Every time a new category was added, it was necessary to edit multiple parts of the script.
Externalization via config.sh
I changed it to a system where config.sh is placed in each category's directory to declare category-specific settings.
categories/
├── _common/
│ └── config.sh # Default values common to all categories
├── YARUKOTO/
│ ├── config.sh # NO_REDUCE="1", SKIP="0"
│ ├── definition.md
│ └── rules.md
├── ACTION/
│ └── config.sh # SKIP="1" (Absorbed by integrated category)
└── ...
Default values are defined in _common/config.sh, and overridden by config.sh in each category.
# _common/config.sh (Default)
NO_REDUCE="0"
SKIP="0"
# YARUKOTO/config.sh (Integrated category)
NO_REDUCE="1" # Use Map output as-is (No Reduce needed)
# ACTION/config.sh (Old category)
SKIP="1" # Absorbed by YARUKOTO
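The override order (defaults first, then the category's own file) can be illustrated in Python. This is only an emulation for explanation: how the actual script loads these files (presumably by sourcing them in the shell) is not shown here, and the parser below handles only simple KEY="value" lines.

```python
import re

# Matches KEY="value" or KEY=value at the start of a line; trailing comments are ignored.
ASSIGNMENT = re.compile(r'^\s*([A-Z_][A-Z0-9_]*)=(?:"([^"]*)"|(\S*))')

def parse_config(text):
    """Parse simple shell-style assignments into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        m = ASSIGNMENT.match(line)
        if m:
            key, quoted, bare = m.groups()
            settings[key] = quoted if quoted is not None else bare
    return settings

def effective_config(common_text, category_text=""):
    """Defaults from _common/config.sh, overridden by the category's config.sh."""
    merged = parse_config(common_text)
    merged.update(parse_config(category_text))
    return merged

common = 'NO_REDUCE="0"\nSKIP="0"\n'
yarukoto = 'NO_REDUCE="1" # Use Map output as-is (No Reduce needed)\n'
action = 'SKIP="1" # Absorbed by YARUKOTO\n'

print(effective_config(common, yarukoto))
print(effective_config(common, action))
```

A category directory without its own config.sh simply inherits the defaults, so each category declares only what differs.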
Selective Execution via SKIP and ONLY_CATEGORY
With the transition to the 8-category configuration, I set the 14 old categories to SKIP=1. They are skipped during normal execution, but when Eval-Kit needs a comparison with the old categories, they can be bulk-restored with SKIP_DISABLE=1.
Conversely, if you only want to test a specific category, you can run only that one with ONLY_CATEGORY=YARUKOTO.
# Normal execution (8 categories)
bash main.sh --force
# Execute all categories (for eval-kit)
SKIP_DISABLE=1 bash main.sh --force
# Only 1 category (when adjusting prompts)
ONLY_CATEGORY=YATTAKOTO bash main.sh --force
With this externalization, I can now add, remove, and change category settings without editing the script.
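The selection rules can be summarized in a few lines. This sketch assumes that ONLY_CATEGORY takes precedence over SKIP and that SKIP_DISABLE ignores every SKIP flag; how main.sh actually resolves these combinations is not shown in this article, so treat the precedence as an assumption.

```python
def select_categories(configs, skip_disable=False, only_category=None):
    """Decide which categories to run.

    configs: {category_name: {"SKIP": "0" or "1", ...}}
    Assumption: only_category wins over SKIP; skip_disable ignores SKIP entirely.
    """
    if only_category is not None:
        return [only_category] if only_category in configs else []
    return [
        name
        for name, cfg in sorted(configs.items())
        if skip_disable or cfg.get("SKIP", "0") != "1"
    ]

configs = {
    "YARUKOTO": {"SKIP": "0", "NO_REDUCE": "1"},
    "ACTION": {"SKIP": "1"},
    "DATA": {},
}
print(select_categories(configs))                            # normal execution
print(select_categories(configs, skip_disable=True))         # all categories
print(select_categories(configs, only_category="YARUKOTO"))  # one category
```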
Pipeline at This Point
This is the current pipeline configuration after the improvements of the fifth act.
| Parameter | Value | Change from Third Act |
|---|---|---|
| Chunk Size | 2KB | 10KB → 2KB (Adjusted for prompt enrichment) |
| Model | qwen2.5:14b | No change |
| Pre-processing | Enabled | No change |
| Categories | 8 types | 4 → 16 → 8 |
| per-category Reduce | Disabled for integrated categories | NO_REDUCE=1 |
| Final Reduce | Forbidden | Abolished in Fourth Act |
Summary of Processing Flow
[raw logs]
↓ Pre-processing (clean_dialog.py)
[cleaned logs]
↓ Chunking (2KB)
[chunk 1] [chunk 2] ... [chunk N]
↓ Multi-Pass (8 categories × N chunks)
↓ ※ Independent SLM call for each category × each chunk
[Category-specific Map output]
↓ Merge (sort -u + normalized dedup + filter)
[Category-specific merged output]
↓ Concatenate
[Final Summary: 8-category structured Markdown]
Differences from the third act pipeline:
- Chunk size reduced from 10KB to 2KB
- Categories increased from 4 to 8 (but overlap reduced due to integration)
- per-category Reduce disabled for integrated categories
- Final Reduce completely abolished
- Normalized dedup and filtering added to Merge
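The Map and Merge stages of this flow can be sketched end to end. Everything specific below is an assumption for illustration: chunking on line boundaries by UTF-8 byte count, the normalization rules (strip bullet, collapse whitespace, lowercase), the "Not Applicable" filter string, and the `summarize` callback that stands in for one SLM call. Pre-processing is omitted.

```python
import re

CHUNK_BYTES = 2 * 1024  # 2KB, per the parameter table

def chunk_text(text, size=CHUNK_BYTES):
    """Split text on line boundaries into chunks of roughly `size` UTF-8 bytes.

    A single line longer than `size` becomes its own oversized chunk.
    """
    chunks, current, current_size = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        if current and current_size + n > size:
            chunks.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += n
    if current:
        chunks.append("".join(current))
    return chunks

def normalize(item):
    """Dedup key: drop the leading bullet, collapse whitespace, lowercase."""
    item = re.sub(r"^\s*[-*]\s*", "", item)
    return re.sub(r"\s+", " ", item).strip().lower()

def merge(items, drop=("not applicable",)):
    """Normalized dedup plus a simple filter, keeping first-seen order."""
    seen, merged = set(), []
    for item in items:
        key = normalize(item)
        if not key or key in drop or key in seen:
            continue
        seen.add(key)
        merged.append(item.strip())
    return merged

def run_pipeline(text, categories, summarize):
    """summarize(category, chunk) -> list of item strings (one SLM call each)."""
    chunks = chunk_text(text)
    summary = {}
    for category in categories:
        mapped = []
        for chunk in chunks:  # independent call per category x chunk
            mapped.extend(summarize(category, chunk))
        summary[category] = merge(mapped)
    return summary

def fake_summarize(category, chunk):
    """Stand-in for the SLM so the sketch runs without a model."""
    return [f"- {category}: saw {len(chunk.splitlines())} lines", "Not Applicable"]

result = run_pipeline("line\n" * 1000, ["YARUKOTO", "DATA"], fake_summarize)
for category, items in result.items():
    print(category, items)
```

Normalized dedup subsumes the exact-match step, so the sketch has no separate sort -u pass; paraphrased items still pass through it untouched.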
Conclusion
In this article, I introduced the process of designing Eval-Kit and integrating/converging categories based on measurements.
- Eval-Kit is a system that measures quality with four types of benchmarks, T1–T4. The most important comparison is T4 vs TS (Claude Code vs SLM under the same conditions).
- "Measuring" revealed three semantic groups: to-do, done, and understanding. Per-category overlap was highest for NEXT (41%), CHANGE (38%), and DETAIL (38%).
- Eliminated overlap with 3 integrated categories: YARUKOTO (62% reduction), YATTAKOTO (v2 adopted, 67% GT), WAKATTAKOTO (30-36% reduction, 80% GT).
- New perspective categories: Multi-faceted extraction with SHUSSHISHA (investor perspective) and KOKISHIN (curiosity perspective).
- Convergence: 21 categories → 8 categories; the ratio decreased from 7.0x to 2.1x (-70%) for small files, and from 35.7x to 14.9x (-58%) for large files. Validation failures dropped to zero for small files and to 2 (both automatically recovered) for large files.
- Externalizing category settings: Flexible operations with config.sh + SKIP/ONLY_CATEGORY.
"Increasing leading to failure" in the Fourth Act and "Measuring leading to reduction" in the Fifth Act are two sides of the same coin. I noticed the need for measurement because there was failure, and I could integrate with peace of mind because there was measurement. A quality evaluation mechanism is the foundation for pipeline design decisions.
Remaining Challenges
With integration into 8 categories, the ratio for small files improved to 2.1x, but it remains high at 14.9x for large files.
The cause is clear. Merge's sort -u only removes perfectly matched duplicates and leaves a large amount of semantic duplication (paraphrasing, items stating the same fact with different expressions).
However, the pipeline so far already has components to "remove semantic duplication." "Short text + clear instructions → SLM acts accurately"—can this principle discovered in the third act be applied to duplication removal?
Next time, I will talk about GroupReduce, which solves a new problem by reusing "parts" of the pipeline.