Act 5: Summarizing Conversation Logs with Local SLMs — Measurement and Convergence

About This Article

Claude Code drafted this based on AI conversation logs and experimental records, which I then edited and polished.
The conclusions in this article are observations from summarizing conversation logs in my local environment (qwen2.5:14b / Ollama) and do not claim to be a general optimization.

In the previous article, I described the trial and error of redesigning categories, the explosion of duplicates caused by expanding to 16 categories, and the collapse of "Final Reduce."

At the end of Act IV, I wrote:

"Looks like it could be integrated" and "Integrating it won't lower quality" are two different things. I needed a mechanism to make decisions based on data, not intuition.

This time, I will record the design of that "data-based decision mechanism," the Eval-Kit, and the process of integrating and converging categories based on measurements.


Introduction

What I Will Do

  • Explain the design philosophy and structure of the quality evaluation framework (Eval-Kit)
  • Show the duplicate structure of semantic groups revealed by "measuring"
  • Present the design and verification results of the three integrated categories (YARUKOTO / YATTAKOTO / WAKATTAKOTO)
  • Explain the intent and effects of the new perspective categories (SHUSSHISHA / KOKISHIN)
  • Show the convergence from 21 categories to 8 and the numerical effects
  • Explain operational improvements through externalizing category settings

What I Will Not Do

  • Discuss releasing Eval-Kit as a general-purpose tool (it is a mechanism specific to this pipeline)
  • Present statistically significant benchmark results (the number of test files is limited)
  • Provide a comprehensive comparative evaluation of SLM models

Making Failure "Measurable"

Why Quality Evaluation Became Necessary

Let's look back at the problems encountered in Act IV.

  • Expanding to 16 categories achieved 94% GT coverage, but the output swelled to 2,996 items (35.7x multiplier)
  • Final Reduce collapsed due to template echoback (coverage 94% → 20%)
  • While the duplicate structure between categories was visible, I had no choice but to judge "if quality would drop after integration" based on intuition

If I proceed to design integrated categories like this, I cannot make decisions beyond "it feels better." Manually reading and comparing outputs after every improvement is inefficient and relies on subjectivity.

I needed a repeatable benchmark.

Eval-Kit Design Philosophy

Eval-Kit is a mechanism for comparatively evaluating the output quality of the SLM pipeline against Claude Code's output.

What I prioritized most in the design was fair comparison conditions. When comparing SLM output with "ground truth," if conditions differ, it's impossible to identify the cause of discrepancies. I prepared four types of benchmarks (T1 to T4) so that I could isolate whether issues were due to "SLM capability deficiency, input differences, or category definition problems."

T1–T4: Four Types of Benchmarks

| ID | Input | Summarizer | Category Constraints | Purpose |
| --- | --- | --- | --- | --- |
| T1 | Raw Log | Claude Code | Free | Maximum coverage (Upper baseline) |
| T2 | Compressed Log | Claude Code | Free | Pre-processing info loss check |
| T3 | Raw Log | Claude Code | 8 Categories Only | Category classification validity check |
| T4 | Compressed Log | Claude Code | 8 Categories Only | Fair comparison with SLM |
| TS | Compressed Log | SLM (14b) | 8 Categories + Full Prompt | Evaluation Target |

Here, the GT (Ground Truth) is not strictly correct data. It is a summary created by Claude Code reading the same log. Claude Code is not an "absolute truth" but a reference model assumed to have higher performance under the same conditions. I use it as a reference point for relative evaluation to "compare SLM output with the output of a larger model."

What to Compare

I differentiate between different problems using four comparative axes.

T1 vs TS  → Exhaustive check of information dropped by SLM
T4 vs TS  → Quality difference under identical conditions (same input/same category)
T1 vs T2  → Quantification of info loss due to pre-processing
T3 vs T4  → Effect of pre-processing on category classification

Most important is T4 vs TS. Since T4 and TS are created with the same input (compressed log) and the same category definitions, I can measure the pure performance gap of "Claude Code vs SLM."
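
To make the "fair comparison" idea concrete, here is a minimal Python sketch that lists which conditions differ for each comparison axis. The `Benchmark` class, the field names, and the string labels are hypothetical names chosen for illustration; they are not part of Eval-Kit. The sketch also assumes that the "Full Prompt" of TS belongs to the summarizer setup rather than to the category definitions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Benchmark:
    log_input: str    # "raw" or "compressed" log
    summarizer: str   # "claude-code" or "slm-14b"
    categories: str   # "free" or "8-categories"


# Conditions taken from the T1-T4 / TS table (labels are illustrative).
BENCHMARKS = {
    "T1": Benchmark("raw", "claude-code", "free"),
    "T2": Benchmark("compressed", "claude-code", "free"),
    "T3": Benchmark("raw", "claude-code", "8-categories"),
    "T4": Benchmark("compressed", "claude-code", "8-categories"),
    "TS": Benchmark("compressed", "slm-14b", "8-categories"),
}


def differing_conditions(a: str, b: str) -> set[str]:
    """Return the names of the conditions that differ between two benchmarks."""
    left, right = BENCHMARKS[a], BENCHMARKS[b]
    return {
        f.name
        for f in fields(Benchmark)
        if getattr(left, f.name) != getattr(right, f.name)
    }


for pair in [("T1", "TS"), ("T4", "TS"), ("T1", "T2"), ("T3", "T4")]:
    print(pair, sorted(differing_conditions(*pair)))
```

Under these assumptions, T4 vs TS differs only in the summarizer, while T1 vs TS differs in all three conditions, which is why a gap in T1 vs TS cannot be attributed to a single cause.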

Evaluation Method

For evaluation, I check each GT item against the SLM output one by one and assign one of three judgments.

| Judgment | Meaning |
| --- | --- |
| ○ | Captured by SLM |
| △ | Partially captured (part of information missing) |
| ✗ | Not captured by SLM |

This is not automated. Judging whether two items are semantically consistent requires reading them, so I verify them manually. The evaluator is always the same person, and the judgment criteria (the definitions of ○/△/✗) are fixed. The reproducibility of the evaluation is limited, but it is consistent for comparing before and after an improvement. Evaluating one file takes 30 minutes to an hour, but by repeating it I have been able to measure the difference "before vs. after improvement" against fixed criteria instead of by impression.
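
The judgments themselves are manual, but the tally is simple arithmetic. The following Python sketch shows one way to turn a list of ○/△/✗ judgments into a coverage figure, counting ○ and △ as covered (this matches the "GT ○+△ (%)" figures reported later in this article). The `coverage` function is a hypothetical helper, not Eval-Kit's actual code.

```python
from collections import Counter


def coverage(judgments: list[str]) -> dict:
    """Tally manual judgments and compute (○ + △) / total as a percentage."""
    counts = Counter(judgments)
    total = len(judgments)
    captured = counts["○"] + counts["△"]
    return {
        "○": counts["○"],
        "△": counts["△"],
        "✗": counts["✗"],
        "coverage_pct": round(100 * captured / total) if total else 0,
    }


# Judgment counts reported for WAKATTAKOTO later in this article: ○5 / △3 / ✗2
wakattakoto = ["○"] * 5 + ["△"] * 3 + ["✗"] * 2
print(coverage(wakattakoto))
```

With the WAKATTAKOTO counts this gives 8 covered items out of 10, that is, 80%.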


Discovering Duplicate Structures by "Measuring"

Category Overlap Analysis of a 170KB File

In Act IV, I knew qualitatively that "there are duplicates." The following data is what I quantified with Eval-Kit.

I analyzed 210 items output after per-category Reduce for a 170KB file (10.8KB after pre-processing, 11 chunks) across 16 categories.

| Metric | Value |
| --- | --- |
| Total Items | 210 |
| Exact Duplicate | 12 items (6%) |
| Duplicate (First 40 chars match) | 18 items (9%) |

At first glance, 9% looks low. However, these are only the duplicates that different categories extracted from the same chunk of the same file. As a file grows and the number of chunks increases, duplicates between chunks are added on top, and the item count explodes (the 2,996 items and 35.7x multiplier for the 255KB file in Act IV are the result).
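
For reference, the two duplicate metrics in the table can be counted with a few lines of Python. This is a sketch of one plausible counting rule (an item is a duplicate if its text, or its first 40 characters, has already been emitted by any category); the exact rule used in Eval-Kit is not stated in this article, and the sample items below are made up for illustration.

```python
def count_duplicates(items, prefix_len=None):
    """Count items whose text (or its first prefix_len characters)
    was already emitted earlier by any category."""
    seen = set()
    duplicates = 0
    for _category, text in items:
        key = text.strip()
        if prefix_len is not None:
            key = key[:prefix_len]
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Hypothetical (category, item) pairs, not taken from the real logs.
items = [
    ("ACTION", "Rerun the 170KB file with the integrated categories and compare"),
    ("NEXT", "Rerun the 170KB file with the integrated categories and compare"),
    ("PROPOSAL", "Rerun the 170KB file with the integrated categories and compare results"),
    ("DECISION", "Final Reduce is disabled"),
]

exact = count_duplicates(items)
prefix = count_duplicates(items, prefix_len=40)
print(f"exact: {exact}, first-40-chars: {prefix}")
```

The prefix rule catches the PROPOSAL item that the exact rule misses, which is why the second metric in the table is higher than the first.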

Discovering Semantic Groups

Organizing the duplicate pairs revealed three semantic groups.

Categories with many duplicates:

| Pair | Duplicate Count | Reason |
| --- | --- | --- |
| ACTION ↔ NEXT | 4 | "To-do" and "Next to-do" are synonymous |
| NEXT ↔ PROPOSAL | 4 | "Next to-do" and "Proposal" overlap |
| ACTION ↔ PROPOSAL | 2 | Same as above |
| CHANGE ↔ DETAIL | 2 | "Changes" and "Details" are the same |

Categories with high overlap rates: NEXT (41%), CHANGE (38%), DETAIL (38%), PROPOSAL (24%)

Zero overlap (Highly independent): DECISION (0%), DID (0%), TECH (0%)

Grouping these by meaning:

| Group | Included Categories | Count |
| --- | --- | --- |
| To-do group | ACTION / NEXT / TODO / PROPOSAL | 4 |
| Done group | CHANGE / DETAIL / DID / TECH | 4 |
| Understanding group | FINDING / FOUND / INSIGHT | 3 |
| Independent | DECISION, DATA, IDEA | 1 each |

Corroboration with a 255KB File

The same analysis on a 255KB file (66 chunks) clarifies the duplicate multiplier for each group.

| Group | SLM Total | Claude Code (T4) | Multiplier |
| --- | --- | --- | --- |
| Done group (5 categories) | 689 | 26 | 26.5x |
| To-do group (5 categories) | 664 | 17 | 39.1x |
| Understanding group (4 categories) | 756 | 14 | 54.0x |

Categories within each group are independently extracting the same information. The groups that should be integrated were clearly visible in the data.
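
The multiplier in the table is the SLM item count divided by the Claude Code (T4) item count for the same group. A short Python check of the arithmetic, using the figures from the table (the `multiplier` helper is a hypothetical name):

```python
# (SLM total, Claude Code T4 total) per group, from the 255KB table above
groups = {
    "Done": (689, 26),
    "To-do": (664, 17),
    "Understanding": (756, 14),
}


def multiplier(slm_items: int, reference_items: int) -> float:
    """How many times more items the SLM produced than the reference."""
    return round(slm_items / reference_items, 1)


for name, (slm, t4) in groups.items():
    print(f"{name}: {multiplier(slm, t4)}x")
```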


Category Integration: 16 → 8

Integration Policy

The integration approach chosen was not "post-hoc dedup" (duplicate removal after output) but restarting from the Map prompt.

Reason:

  • Post-hoc dedup can only handle exact matches via sort -u, leaving semantic duplicates behind.
  • If the SLM is made to extract information as "one category covering the scope of four," duplicates are eliminated from the Map stage itself.
  • Per-category Reduce also becomes unnecessary (Map output can be used as-is).

YARUKOTO (To-do group 4→1)

I integrated four categories (ACTION / NEXT / TODO / PROPOSAL) into a single YARUKOTO category.

The definition.md comprehensively describes the extraction targets for the original 4 categories.

【Include (Information corresponding to YARUKOTO)】
- Explicit TODOs ("Need to do ~", "Do ~ next")
- Planned actions ("Plan to do ~ next")
- Proposals/Considerations ("Consider ~", "There is also a method of ~")
- For comparison proposals of multiple plans, extract each plan individually

The last point, "extract each plan individually for comparison proposals," was crucial. In GT verification, while the old categories dispersed information (Plan A in PROPOSAL, Plan B in ACTION), YARUKOTO aggregates it into one category while keeping each plan as an individual item.

Verification Results:

| Metric | Total of 4 Old Categories | YARUKOTO |
| --- | --- | --- |
| Map Item Count | 87 | 30-34 |
| Compression Rate | - | 62% reduction |
| GT Coverage | Equivalent | Equivalent + Individual capture of Plan A/Plan B |

With a 62% reduction in item count, GT coverage remains equivalent or better. The integration is a success.

Furthermore, I set NO_REDUCE=1 to skip per-category Reduce. Since the integrated category eliminates duplicates at the Map stage, skipping Reduce also avoids the risk of the Reduce-stage SLM fluctuating and losing information.

YATTAKOTO (Done group 4→1)

I integrated four categories (CHANGE / DETAIL / DID / TECH) into a single YATTAKOTO category.

For YATTAKOTO, I compared 3 versions of the prompt.

| Metric | v1 | v2 | v3 |
| --- | --- | --- | --- |
| Map Item Count | 28 | 34 | 23 |
| GT ○ | 6 | 9 | 7 |
| GT ○+△ (%) | 12 (57%) | 14 (67%) | 11 (52%) |
| Internal Duplicates | 4 sets | 5 sets | 4 sets |
| YARUKOTO Leakage | 2 items | 2 items | 1 item |
| [Reference]/[Edit] Contamination | 0 items | 3 items | 0 items |

I adopted v2. It had the highest GT coverage at 67%. Although v3 had zero [Reference]/[Edit] contamination and less noise, its GT coverage dropped to 52%. The priority of this pipeline is "Coverage > Noise Suppression." The worst outcome in a summary is "dropping important information," and noise can be handled by post-processing (normalization + filtering) during the merge.
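
The "Coverage > Noise Suppression" priority can be written as a sort key. The following Python sketch uses the figures from the comparison table; the `pick_version` function and the idea of using contamination only as a tie-breaker are my illustration of the rule, not code from the pipeline.

```python
# GT ○+△ counts and [Reference]/[Edit] contamination from the v1-v3 table
versions = {
    "v1": {"map_items": 28, "gt_hits": 12, "contamination": 0},
    "v2": {"map_items": 34, "gt_hits": 14, "contamination": 3},
    "v3": {"map_items": 23, "gt_hits": 11, "contamination": 0},
}


def pick_version(candidates: dict) -> str:
    """Coverage first; contamination only breaks ties (fewer is better)."""
    return max(
        candidates,
        key=lambda name: (
            candidates[name]["gt_hits"],
            -candidates[name]["contamination"],
        ),
    )


print(pick_version(versions))  # v2
```

If the priority were reversed (noise first), the same data would favor v1 or v3, so the choice of v2 follows from the stated priority rather than from the numbers alone.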

Changes in the definition.md of v2 that had a major effect:

[Added to Recommendations]:
- Always keep specific numbers
- Always keep before/after numeric pairs ("37→24 chunks", "149→97 times", etc.)
- Keep repetition counts/version numbers ("Improved 10 times from v1 to v10", etc.)

[Added to Exclusions]:
- Future-tense tasks → Do not include at all

The instruction to "keep numbers" is the main reason for the improvement in GT coverage. qwen2.5:14b has a weak ability to retain numbers, and without instructions, it tends to abstract specific figures.

Uncaptured items common to all versions

In GT verification, there were 8 items that could not be captured by any of v1, v2, or v3.

| GT Item | Leakage Type | Necessary for Summary? |
| --- | --- | --- |
| Explicit "Prohibitions" | Minor instruction change | No |
| Instruction to "Pick broadly" | Minor instruction change | No |
| Explicit Japanese output | Minor instruction change | No |
| v1-v10 10 repetitions | Number (count) | |
| Reduction rate 6.5%→38.7% | Number (rate) | |
| 149→97 times (35% reduction) | Before/after number | |
| 25→17 times (32% reduction) | Before/after number | |
| ssh://wk-01.local/... | Environment meta-info | No |

7 out of the 8 items were not captured by the old 4 categories either. In other words, this is not a degradation caused by integration but the capability limit of the SLM. The loss of numeric data is a known weakness of qwen2.5:14b and cannot be solved by category design.

Why this analysis was important: It eliminates the risk of falsely concluding that "quality dropped after integration." By confirming that the uncaptured items in the GT existed before the integration, I was able to gain confidence in the judgment that "integration is safe."
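
This check is a set comparison: items missed after integration that were also missed before are pre-existing limits, and only the remainder can be a regression. A minimal Python sketch follows. The item IDs are placeholders, since this article does not say which of the 8 items the old categories did capture, and `split_misses` is a hypothetical helper.

```python
def split_misses(missed_new: set, missed_old: set) -> tuple[set, set]:
    """Separate misses introduced by the change from misses that already existed."""
    regressions = missed_new - missed_old
    preexisting = missed_new & missed_old
    return regressions, preexisting


# Placeholder IDs: 8 GT items missed by the integrated category,
# 7 of which the old 4 categories had also missed.
missed_by_integrated = {f"gt-{i}" for i in range(1, 9)}
missed_by_old = {f"gt-{i}" for i in range(1, 8)}

regressions, preexisting = split_misses(missed_by_integrated, missed_by_old)
print(f"pre-existing: {len(preexisting)}, possible regressions: {len(regressions)}")
```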

WAKATTAKOTO (Understanding group 3→1)

I integrated three categories (FINDING / FOUND / INSIGHT) into a single WAKATTAKOTO category.

| Metric | Total of 3 Old Categories | WAKATTAKOTO |
| --- | --- | --- |
| Map Item Count | 67 | 43-47 |
| Compression Rate | - | 30-36% reduction |
| GT Coverage | - | 80% (○5/△3/✗2) |

Similar to YARUKOTO and YATTAKOTO, I used NO_REDUCE=1 to use the Map output as-is.


New Perspective Categories

While I consolidated 11 categories into 3, I also added categories from new perspectives.

SHUSSHISHA (What the Investor Wants to Know)

A category with the perspective: "If the investor of this development read these logs, what would they want to know?"

It appropriately excludes technical details and focuses on progress, decisions, and risks. This results in an output similar to an executive summary.

| Metric | Value |
| --- | --- |
| Map Item Count | 20-21 |
| "Not Applicable" Rate | High (6/11 chunks) |
| Features | Appropriately filters out technical details. Compact and focused output |

The high "Not Applicable" rate is expected behavior. It correctly determines that there is no information the investor would want to know in chunks centered on technical discussions.

KOKISHIN (Curiosity / Interesting Points)

This category applies the "Huh!" test: it extracts surprising information, counter-intuitive facts, and unexpected discoveries.

| Metric | Value |
| --- | --- |
| Map Item Count | 18-19 |
| Features | Specializes in unexpected discoveries and counter-intuitive facts. Unique angle |

Example output of KOKISHIN:

  • "The Chinese text included in the SLM output was not prompt injection, but a language switch by the multilingual model."
  • "Even after removing perfectly matched duplicates with sort -u, no semantic duplicates were removed (the SLM's output is consistent at the character level)."

These are extractions based on "surprise," which cannot be captured by TECH or WAKATTAKOTO.

JIYUU (Key Points Chosen by the SLM)

I also tested a category that instructs the SLM to "please choose freely the points you think are important."

Result: There were extractions in all chunks and a high item count (35 items), but the overlap with other categories (TECH / IDEA) was significant, and the unique perspective was limited. The "SLM free perspective" tends to be a degraded copy of existing categories.

JIYUU was ultimately set to SKIP=1 (disabled).


Convergence: 21 Categories → 8 Categories

Final Configuration

After integration, new additions, and disabling, it converged to the following 8 categories.

| # | Category | LABEL | Type | Note |
| --- | --- | --- | --- | --- |
| 1 | YARUKOTO | Tasks / Next Steps | Integrated | Absorbed ACTION/NEXT/TODO/PROPOSAL |
| 2 | YATTAKOTO | Actions Taken / Implementation | Integrated | Absorbed CHANGE/DETAIL/DID/TECH |
| 3 | WAKATTAKOTO | Understandings / Insights | Integrated | Absorbed FINDING/FOUND/INSIGHT |
| 4 | DATA | Numerical / Config Data | Independent | |
| 5 | DECISION | Decisions Made | Independent | |
| 6 | IDEA | New Projects / Ideas | Independent | |
| 7 | KOKISHIN | Curiosity / Interesting Points | New | "Huh!" test |
| 8 | SHUSSHISHA | Investor Interest | New | Executive Summary |

The 14 categories set to SKIP=1 (11 old categories + HIGHLIGHT + KEYWORD + JIYUU) were not deleted; they remain in the categories/ directory. This is because they can be bulk-restored with SKIP_DISABLE=1 when performing comparison verification with old categories using Eval-Kit.

Effect on Small Files (29KB)

The results of processing the same file (29KB) with 21 categories (5th Act) and 8 categories (7th Act):

| Metric | 21 Categories | 8 Categories | Improvement |
| --- | --- | --- | --- |
| Category Count | 21 | 8 | -62% |
| Total TS Items | 328 | 91 | -72% |
| T4 Comparison Ratio | 7.0x | 2.1x | -70% |
| Validation Failures | 8 | 0 | -100% |
| Number of Jobs | 231 | 97 | -58% |

Ratio: 7.0x → 2.1x (-70%). Validation failures are zero.

By reducing the number of categories, the load on the SLM was reduced, and validation failures (format collapse, header mismatch, etc.) were completely eliminated.

Effect on Large Files (255KB)

The results of processing the same file (255KB) with 21 categories (6th Act) and 8 categories (8th Act):

| Metric | 21 Categories | 8 Categories | Improvement |
| --- | --- | --- | --- |
| Total TS Items | 2,996 | 1,147 | -62% |
| T4 Comparison Ratio | 35.7x | 14.9x | -58% |
| Validation Failures | 14 (5 unrecoverable) | 2 (fully recovered) | -86% |
| Number of Jobs | 1,386 | 594 | -57% |

Even for the large file, the item count fell by 62% and the ratio by 58%, a significant improvement. Validation failures also dropped sharply, from 14 (5 unrecoverable) to 2 (both automatically recovered).

However, the 14.9x ratio is still high. This problem will be solved in the next installment (Sixth Act: GroupReduce).

Ratio Trend by File Size

| File | Size | Ratio (21cat) | Ratio (8cat) | Improvement |
| --- | --- | --- | --- | --- |
| 02-12 | 29K | 7.0x | 2.1x | -70% |
| 02-09 | 255K | 35.7x | 14.9x | -58% |

Improvement is greater for small files (-70%), already reaching a practical ratio (2.1x). Improvement is slightly smaller for large files (-58%), as not only category overlap but also semantic overlap between chunks remains.
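
The "Improvement" columns in this section are plain percent changes, (after - before) / before. A short Python check against the reported figures (the `pct_change` helper is a hypothetical name):

```python
def pct_change(before: float, after: float) -> int:
    """Percent change from before to after, rounded to a whole percent."""
    return round(100 * (after - before) / before)


# (21 categories, 8 categories) values from the tables in this section
ratios = {"29KB": (7.0, 2.1), "255KB": (35.7, 14.9)}
items = {"29KB": (328, 91), "255KB": (2996, 1147)}

for size in ratios:
    print(
        f"{size}: ratio {pct_change(*ratios[size])}%, "
        f"items {pct_change(*items[size])}%"
    )
```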


Externalizing Category Settings

Limits of Hardcoding

At the time of the third act, the category-specific prompt configurations (role / instruction / definition / format / rules / example) were externalized in the categories/ directory. However, settings that control the behavior of categories—whether to Reduce, SKIP flag, validation header definitions—were hardcoded in case statements within the script.

It was not a problem when there were only 4 categories, but once they increased to 16, the case statements became unmaintainable. Every time a new category was added, it was necessary to edit multiple parts of the script.

Externalization via config.sh

I changed it to a system where config.sh is placed in each category's directory to declare category-specific settings.

categories/
├── _common/
│   └── config.sh          # Default values common to all categories
├── YARUKOTO/
│   ├── config.sh          # NO_REDUCE="1", SKIP="0"
│   ├── definition.md
│   └── rules.md
├── ACTION/
│   └── config.sh          # SKIP="1" (Absorbed by integrated category)
└── ...

Default values are defined in _common/config.sh, and overridden by config.sh in each category.

# _common/config.sh (Default)
NO_REDUCE="0"
SKIP="0"

# YARUKOTO/config.sh (Integrated category)
NO_REDUCE="1"  # Use Map output as-is (No Reduce needed)

# ACTION/config.sh (Old category)
SKIP="1"        # Absorbed by YARUKOTO
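
The default-then-override behavior can be sketched in Python as follows. The real pipeline is a shell script and presumably just sources the two files in order; this sketch instead parses simple KEY="VALUE" lines so that the merge order is easy to see. `read_config` and `load_category_config` are hypothetical helper names, and the parser handles only the simple assignments shown above.

```python
import re
import tempfile
from pathlib import Path

# Matches simple assignments such as: NO_REDUCE="1"  # comment
ASSIGNMENT = re.compile(r'^\s*([A-Z_][A-Z0-9_]*)="?([^"#\s]*)"?')


def read_config(path: Path) -> dict:
    """Read KEY="VALUE" lines from a config.sh-style file (comments ignored)."""
    settings = {}
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            match = ASSIGNMENT.match(line)
            if match:
                settings[match.group(1)] = match.group(2)
    return settings


def load_category_config(root: Path, category: str) -> dict:
    """Start from _common defaults, then apply the category's overrides."""
    config = read_config(root / "_common" / "config.sh")
    config.update(read_config(root / category / "config.sh"))
    return config


# Demo with a temporary categories/ directory mirroring the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = {
        "_common": 'NO_REDUCE="0"\nSKIP="0"\n',
        "YARUKOTO": 'NO_REDUCE="1"  # Use Map output as-is\n',
        "ACTION": 'SKIP="1"  # Absorbed by YARUKOTO\n',
    }
    for name, body in samples.items():
        (root / name).mkdir()
        (root / name / "config.sh").write_text(body, encoding="utf-8")
    yarukoto = load_category_config(root, "YARUKOTO")
    action = load_category_config(root, "ACTION")

print(yarukoto)
print(action)
```

A category only needs to declare what differs from the defaults, which is why YARUKOTO's file can contain a single line.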

Selective Execution via SKIP and ONLY_CATEGORY

With the transition to the 8-category configuration, I set the 14 old categories to SKIP=1. They are skipped during normal execution, but if comparison with old categories is needed for eval-kit, they can be bulk-restored with SKIP_DISABLE=1.

Conversely, if you only want to test a specific category, you can run only that one with ONLY_CATEGORY=YARUKOTO.

# Normal execution (8 categories)
bash main.sh --force

# Execute all categories (for eval-kit)
SKIP_DISABLE=1 bash main.sh --force

# Only 1 category (when adjusting prompts)
ONLY_CATEGORY=YATTAKOTO bash main.sh --force
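
The selection logic behind these three invocations can be summarized in a small Python function. The precedence shown here (ONLY_CATEGORY first, then SKIP_DISABLE, then the category's own SKIP flag) is my assumption about how the switches combine; this article does not spell out the behavior of main.sh when several are set at once. `should_run` is a hypothetical name.

```python
def should_run(category: str, config: dict, env: dict) -> bool:
    """Decide whether a category runs, given its config and the environment."""
    only = env.get("ONLY_CATEGORY")
    if only:
        # Assumed: ONLY_CATEGORY narrows execution to exactly one category.
        return category == only
    if env.get("SKIP_DISABLE") == "1":
        # Assumed: SKIP_DISABLE=1 ignores every SKIP flag.
        return True
    return config.get("SKIP", "0") != "1"


print(should_run("ACTION", {"SKIP": "1"}, {}))                        # False
print(should_run("ACTION", {"SKIP": "1"}, {"SKIP_DISABLE": "1"}))     # True
print(should_run("YARUKOTO", {"SKIP": "0"}, {"ONLY_CATEGORY": "YATTAKOTO"}))  # False
```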

With this externalization, I can now add, remove, and change category settings without editing the script.


Pipeline at This Point

This is the current pipeline configuration after the improvements of the fifth act.

| Parameter | Value | Change from Third Act |
| --- | --- | --- |
| Chunk Size | 2KB | 10KB → 2KB (Adjusted for prompt enrichment) |
| Model | qwen2.5:14b | No change |
| Pre-processing | Enabled | No change |
| Categories | 8 types | 4 → 16 → 8 |
| per-category Reduce | Disabled for integrated categories | NO_REDUCE=1 |
| Final Reduce | Forbidden | Abolished in Fourth Act |

Summary of Processing Flow

[raw logs]
    ↓ Pre-processing (clean_dialog.py)
[cleaned logs]
    ↓ Chunking (2KB)
[chunk 1] [chunk 2] ... [chunk N]
    ↓ Multi-Pass (8 categories × N chunks)
    ↓ ※ Independent SLM call for each category × each chunk
[Category-specific Map output]
    ↓ Merge (sort -u + normalized dedup + filter)
[Category-specific merged output]
    ↓ Concatenate
[Final Summary: 8-category structured Markdown]

Differences from the third act pipeline:

  • Chunk size reduced from 10KB to 2KB
  • Categories increased from 4 to 8 (but overlap reduced due to integration)
  • per-category Reduce disabled for integrated categories
  • Final Reduce completely abolished
  • Normalized dedup and filtering added to Merge
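
As a rough picture of the Merge step (exact dedup, normalized dedup, and filtering), here is a minimal Python sketch. The normalization rules (NFKC, lowercasing, stripping a leading bullet, collapsing whitespace, dropping a trailing period) and the "not applicable" filter are assumptions chosen for illustration; the actual rules in the pipeline are not listed in this article. Unlike sort -u, this sketch keeps first-seen order instead of sorting.

```python
import re
import unicodedata


def normalize(line: str) -> str:
    """Build a comparison key: NFKC, lowercase, no bullet, single spaces."""
    text = unicodedata.normalize("NFKC", line).strip().lower()
    text = re.sub(r"^[-*•]\s*", "", text)  # drop a leading bullet marker
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.rstrip(".。")


def merge(lines, drop=("not applicable",)):
    """Keep the first occurrence of each normalized line; drop filtered lines."""
    seen = set()
    merged = []
    for line in lines:
        key = normalize(line)
        if not key or key in drop or key in seen:
            continue
        seen.add(key)
        merged.append(line.strip())
    return merged


# Hypothetical Map output lines for one category
map_output = [
    "- Chunk size reduced to 2KB",
    "- chunk size  reduced to 2KB.",
    "Not Applicable",
    "- Final Reduce abolished",
    "- The final Reduce step was removed",
]
for line in merge(map_output):
    print(line)
```

The last two lines state the same fact in different words and both survive, because normalization only removes surface differences and cannot detect paraphrases.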

Conclusion

In this article, I introduced the process of designing Eval-Kit and integrating/converging categories based on measurements.

  • Eval-Kit is a system that measures quality with four types of benchmarks, T1–T4. The most important comparison is T4 vs TS (Claude Code vs SLM under the same conditions).
  • "Measuring" revealed three semantic groups (to-do, done, and understanding). The categories with the highest overlap rates were NEXT (41%), CHANGE (38%), and DETAIL (38%).
  • Eliminated overlap with 3 integrated categories: YARUKOTO (62% reduction), YATTAKOTO (v2 adopted, 67% GT), WAKATTAKOTO (30-36% reduction, 80% GT).
  • New perspective categories: Multi-faceted extraction with SHUSSHISHA (investor perspective) and KOKISHIN (curiosity perspective).
  • Convergence: 21 categories → 8 categories; ratio decreased from 7.0x to 2.1x (-70%) for small files, and 35.7x to 14.9x (-58%) for large files. Validation failures went to zero for the small file and to 2 (both automatically recovered) for the large file.
  • Externalizing category settings: Flexible operations with config.sh + SKIP/ONLY_CATEGORY.

"Increasing leading to failure" in the Fourth Act and "Measuring leading to reduction" in the Fifth Act are two sides of the same coin. I noticed the need for measurement because there was failure, and I could integrate with peace of mind because there was measurement. A quality evaluation mechanism is the foundation for pipeline design decisions.

Remaining Challenges

With integration into 8 categories, the ratio for small files improved to 2.1x, but it remains high at 14.9x for large files.

The cause is clear. Merge's sort -u only removes perfectly matched duplicates and leaves a large amount of semantic duplication (paraphrasing, items stating the same fact with different expressions).

However, the pipeline so far already has components to "remove semantic duplication." "Short text + clear instructions → SLM acts accurately"—can this principle discovered in the third act be applied to duplication removal?

Next time, I will talk about GroupReduce, which solves a new problem by reusing "parts" of the pipeline.
