iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
📗

I Developed an OSS Library to Convert Troublesome Excel Documents into Semantically Structured JSON for RAG

に公開
5

Introduction

Hello! I'm someone working in a role similar to an internal SE in the manufacturing industry.

Recently, I've been hooked on the RAG field, but I've recently realized that Excel documents are just too much of a nuisance. On Zenn as well, I regularly see articles about how to handle these Excel documents in RAG.

https://zenn.dev/firstautomation/articles/aed95bce20e900

https://zenn.dev/satto_workspace/articles/c5faee51102ee1#おわりに:美味いragのための次のステップ

While preceding engineers have presented various effective solutions, I've also tackled this issue with a different approach, so I'd like to share it 🔥

TL;DR

  • "exstruct" is a self-made library that structures Excel files using COM + xlwings/openpyxl. It outputs not just cell values, but also shapes, charts, hyperlinks, print areas, cell background colors, merged cell ranges, SmartArt, and even tables derived from borders in JSON/YAML/TOON formats.

  • It can extract flowchart arrow directions and chart axes/series without image recognition, supplementing information often lost in RAG to improve LLM restoration accuracy.

  • In the demo, a table + line chart + flowchart is converted to JSON → and then reconstructed into Markdown/Mermaid with high precision.

  • Constraint: Advanced extraction depends on Excel COM, requiring Windows + Excel (other OS support only a simplified mode).

  • Developed with Codex/ChatGPT, with API references automatically deployed via mkdocs-material + GitHub Pages.

Intended Audience

  • Struggling to make LLMs/RAG understand Excel documents
  • OCR and VLM are too heavy, and you want to improve accuracy with a rule-based approach first
  • Windows + Excel environment is a given, assuming internal corporate use

Introducing the Library: "ExStruct"

I've personally created a Python library to convert Excel documents into structured data like JSON.

The library is named "ExStruct", short for Excel + Extract + Structure.

https://github.com/harumiWeb/exstruct

To put it simply, it's a library that thoroughly structures data using Excel COM reference libraries such as xlwings or openpyxl.

What it Can Do

  • Analyzes cell values, shapes, charts, hyperlinks, print areas, cell background colors, merged cell ranges, SmartArt, etc., within a workbook and converts them into structured data for LLMs.

  • Outputs structured data in formats such as JSON/YAML/TOON.

  • Infers table ranges based on cell borders within a sheet.

  • Supports data output per sheet or per print area.

  • Provides PDF/PNG output, which is useful for LLM inputs.

  • Runs easily on a local PC as it is based on a purely rule-based analysis.

  • Enables AI agents to autonomously analyze Excel documents via an MCP server (Added January 24, 2026).

How to Use

You can install it using the following commands:

uv add exstruct
pip install exstruct

I've provided a simple API, as the user typically just needs to load the Excel file and output the structured data.

from pathlib import Path
from exstruct as xs

# Selectable output verbosity: "light", "standard", "verbose"
wb = xs.extract("input.xlsx", mode="standard")
xs.export(wb, Path("out.json"), pretty=False)  # compact JSON

# Supports pandas-like or openpyxl-like model operations and saving
first_sheet = wb["Sheet1"]
for name, sheet in wb:
    print(name, len(sheet.rows))
wb.save("out.json", pretty=True)     # WorkbookData -> file
first_sheet.save("sheet.json")       # SheetData -> file
print(first_sheet.to_yaml())         # YAML output

Usage Demo 1

In RAG and similar applications, charts and text objects within Excel documents are very troublesome. These cannot be retrieved by existing structuring libraries like pandas, openpyxl, or excel2json, and require specialized methods to extract.

I've prepared a sheet containing table data, a line chart, and a flowchart created with shapes as shown below. Let's try structuring this using the engine I built.

Sheet with a table, chart, and shapes
Gathering the usual suspects found in Excel documents

ExStruct can also be executed via CLI. You can output the data with the following command:

exstruct sample.xlsx --pretty > sample.json

Here is the resulting output (some parts omitted for brevity):

sample.json
{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [
        {
          "r": 3,
          "c": {
            "1": "Month",
            "2": "Product A",
            "3": "Product B",
            "4": "Product C"
          }
        },
        ...
      ],
      "shapes": [
        {
          "id": 1,
          "text": "Start",
          "l": 148,
          "t": 220,
          "kind": "shape",
          "type": "AutoShape-FlowchartProcess"
        },
        {
          "id": 2,
          "text": "Read Input Data",
          "l": 132,
          "t": 282,
          "kind": "shape",
          "type": "AutoShape-FlowchartProcess"
        },
        {
          "l": 193,
          "t": 246,
          "kind": "arrow",
          "begin_arrow_style": 1,
          "end_arrow_style": 2,
          "begin_id": 1,
          "end_id": 2,
          "direction": "N"
        },
        ...
      ],
      "charts": [
        {
          "name": "Chart 1",
          "chart_type": "Line",
          "title": "Sales Data",
          "y_axis_range": [
            0.0,
            200.0
          ],
          "series": [
            {
              "name": "Product A",
              "name_range": "Sheet1!$C$3",
              "x_range": "Sheet1!$B$4:$B$9",
              "y_range": "Sheet1!$C$4:$C$9"
            },
            ...
          ],
          "l": 377,
          "t": 25
        }
      ],
      "table_candidates": [
        "B3:E9"
      ]
    }
  }
}

What do you think?

In addition to cell values, the text and coordinates of text objects, angles and types of flowchart arrows, chart ranges, and ranges detected as tables are all comprehensively converted into JSON.

Letting the LLM Infer the Content of the Output JSON

The above is a structured Excel workbook in a unique JSON format. Please convert the content of this workbook into Markdown. Using Mermaid format is also acceptable.

Here is Copilot's response:

Copilot's Response
Below is the content of the provided Excel workbook reconstructed in Markdown format. The three elements—table, chart, and flowchart—are each appropriately represented.

---

## 📊 Sales Data Table

| Month      | Product A | Product B | Product C |
|------------|-----------|-----------|-----------|
| 2025-01-01 | 120       | 80        | 60        |
| 2025-02-01 | 135       | 90        | 64        |
| 2025-03-01 | 150       | 100       | 70        |
| 2025-04-01 | 170       | 110       | 72        |
| 2025-05-01 | 160       | 120       | 75        |
| 2025-06-01 | 180       | 130       | 80        |

---

## 📈 Sales Data (Line Chart)

- Chart Title: **Sales Data**
- Chart Type: Line Chart
- Y-axis Range: 0 to 200
- Data Series:
  - Product A: 120 → 135 → 150 → 170 → 160 → 180
  - Product B: 80 → 90 → 100 → 110 → 120 → 130
  - Product C: 60 → 64 → 70 → 72 → 75 → 80

---

## 🔄 Process Flow (Mermaid Flowchart)

```mermaid
flowchart TD
    A[Start]
    B[Read Input Data]
    C{Is Format Correct?}
    D[Process One Item]
    E{More Items Remaining?}
    F[Generate Output]
    G{Send Email?}
    H[Display Error]
    I[Send Email]
    J[End]

    A --> B
    B --> C
    C -->|Yes| D
    C --> H
    D --> E
    E --> F
    F --> G
    G -->|Yes| I
    G -->|No| J
    H --> J
    I --> J
```
  • After rendering the Mermaid flowchart

As you can see, various data types are inferred with high precision and detail.

It was particularly surprising to see that the flowchart could be perfectly understood without any image recognition.

Usage Demo 2

Next, let's analyze an Excel procedure manual, which is very common in industries like manufacturing.

Work Procedure Manual
A very common type of procedure manual, featuring many merged cells.

Here is the JSON version (this time with the pretty option turned OFF):

sample.json
{"book_name": "Sample_Procedure_Manual.xlsx", "sheets": {"Work Procedure Manual": {"rows": [{"r": 2, "c": {"1": "Work Procedure Manual"}}, {"r": 3, "c": {"1": "Task Name", "2": "Assembly of Parts A and B", "5": "Author", "6": "Hanako Sato"}}, {"r": 4, "c": {"1": "Start Time", "2": "2025-10-10 13:50:00", "5": "End Time", "6": "2025-10-10 17:30:00"}}, {"r": 5, "c": {"1": "Step No.", "2": "Work Content", "5": "Precautions", "8": "Required Tools", "9": "Check"}}, {"r": 6, "c": {"1": 1, "2": "Unpack Part A and place it in the center of the workbench.", "5": "Check for scratches or dirt on the surface", "8": "Work gloves"}}, {"r": 7, "c": {"2": "Visually inspect for surface scratches, deformation, or dirt.", "8": "Air duster"}}, {"r": 8, "c": {"8": "Cleaning cloth"}}, {"r": 11, "c": {"1": 2, "2": "Check the pin position and orientation of Part B,", "5": "Ensure the pin is not reversed.", "8": "Tweezers"}}, {"r": 12, "c": {"2": "and confirm fit with Part A.", "8": "Protection mat"}}, {"r": 16, "c": {"1": 3, "2": "Carefully insert Part B's pin into Part A's hole.", "5": "If there is resistance, do not force it; reposition and try again.", "8": "Alignment jig"}}, {"r": 17, "c": {"8": "Rubber mallet"}}, {"r": 21, "c": {"1": 4, "2": "After temporary fixing, set the specified screws and tighten them temporarily.", "5": "Check for any forgotten or over-tightened screws.", "8": "Torque wrench"}}, {"r": 22, "c": {"2": "Perform final tightening with the torque wrench according to the specified value."}}, {"r": 26, "c": {"1": 5, "2": "Set the assembly in the inspection device and perform an operation check."}}, {"r": 27, "c": {"2": "Record the inspection results in the inspection sheet and clearly state pass/fail."}}, {"r": 28, "c": {"2": "Pack passed products and move them to the designated storage area."}}], "table_candidates": ["B2:C7", "F3:G6"]}, "Blank": {"rows": [{"r": 2, "c": {"1": "Work Procedure Manual"}}, {"r": 3, "c": {"1": "Task Name", "5": "Author"}}, {"r": 4, "c": {"1": "Start Time", "5": "End Time"}}, {"r": 5, "c": {"1": "Step", "2": "Work Content", "5": "Notes (Required tools, materials, precautions)", "8": "Check"}}, {"r": 6, "c": {"1": 1, "8": "False"}}, {"r": 11, "c": {"1": 2, "8": "False"}}, {"r": 16, "c": {"1": 3, "8": "False"}}, {"r": 21, "c": {"1": 4, "8": "False"}}, {"r": 26, "c": {"1": 5, "8": "False"}}]}}}

Here is what the LLM inferred from the JSON:

## Work Procedure Manual

### Basic Information
- **Task Name:** Assembly of Parts A and B  
- **Author:** Hanako Sato  
- **Start Time:** 2025-10-10 13:50:00  
- **End Time:** 2025-10-10 17:30:00

---

### Step List

| **Step No.** | **Work Content** | **Precautions** | **Required Tools** | **Check** |
|---:|---|---|---|---|
| 1 | Unpack Part A and place it in the center of the workbench.; Visually inspect for surface scratches, deformation, or dirt. | Check for scratches or dirt on the surface | Work gloves; Air duster; Cleaning cloth | False |
| 2 | Check the pin position and orientation of Part B, and confirm fit with Part A. | Ensure the pin is not reversed. | Tweezers; Protection mat | False |
| 3 | Carefully insert Part B's pin into Part A's hole. | If there is resistance, do not force it; reposition and try again. | Alignment jig; Rubber mallet | False |
| 4 | After temporary fixing, set the specified screws and tighten them temporarily.; Perform final tightening with the torque wrench according to the specified value. | Check for any forgotten or over-tightened screws. | Torque wrench | False |
| 5 | Set the assembly in the inspection device and perform an operation check.; Record the inspection results in the inspection sheet and clearly state pass/fail.; Pack passed products and move them to the designated storage area. |  |  | False |

Since this document has no shapes or similar objects, it was understood without any trouble.

As you can see, the recognition accuracy can be improved significantly even with just JSON. By combining this with PDF or image versions and providing them to the LLM simultaneously, you can make it understand Excel documents almost perfectly.

Comparison with Existing Methods

Method Cell Values Shapes/Arrows Chart Structure Table Inference Cost Environment
pandas / openpyxl Free Cross-OS
OCR / VLM High GPU/Cloud
Azure Document Intelligence High Cloud
ExStruct Free Windows+Excel

RAG Use Cases

ExStruct is not a library intended to run as a resident server process.
It is intended to be used as an offline preprocessing step in a Windows environment where Excel is present, feeding the generated structured data into a RAG or search platform.

Of Course, There Are Drawbacks

I've emphasized the good points, but of course, there are also drawbacks.

Basically, advanced extraction requires a COM reference through xlwings, so the operating environment must be Windows + Excel. It won't work on Mac or Linux. (As a fallback, I have provided a "light mode" for simple structuring using only openpyxl).

However, environments where people are struggling with Excel documents usually already have Windows + Excel, so I decided to move forward with this implementation.

If I were to implement OpenXML analysis manually, I could move away from COM dependency, but that would be quite a chore, so I'll stick with this for now. I'm sure Big Tech will eventually develop Excel document analysis technology.

Results

Even with this somewhat tedious, non-AI approach of Excel analysis and thorough structuring, the result was a dramatic improvement in the LLM's inference accuracy for Excel documents.

With just this structuring process, I have personally felt the effects in areas such as:

  • Answering questions about internal company documents via LLM
  • Reducing the man-hours required for preprocessing for RAG data accumulation

Cloud services specialized in document analysis like Azure's Document Intelligence are also emerging, and Excel document analysis technology will likely continue to evolve. However, for companies that cannot afford a large budget, this kind of approach might be quite effective.

Addition: Added MCP Server functionality

I have implemented an MCP server function as an optional installation for ExStruct (January 24, 2026).
This allows AI agents to autonomously analyze Excel documents locally.

https://x.com/HarumiWebDesign/status/2014858787300208761?s=20

https://harumiweb.github.io/exstruct/mcp/

(Bonus) Library Development Story

The library was developed using uv × Codex × ChatGPT.

Library Use case
xlwings Provides API for Excel COM reference
pydantic Data model definitions
openpyxl, pandas Fast cell scanning
scipy Used for table detection algorithms
pytest, pytest-mock Testing
ruff, mypy Linter and formatter
mkdocs-material, mkdocstrings-python Automatic generation of API reference static sites from md files and docstrings

Internal Implementation

The structure of the data model looks something like this:

To be honest, it's just persistently referencing COM, performing rule-based analysis → extracting data, and thoroughly structuring it.

Below is the code for shape object analysis. It determines the chart type based on the shape number and the shape numbers published by Microsoft, and calculates the angle and direction for arrow-type objects from their properties.

https://github.com/harumiWeb/exstruct/blob/main/src/exstruct/core/shapes.py

Key Points of Focus

Since this structured data is designed with the premise of being input into LLMs, addressing context size and token count issues is essential.
I refined the structure of the data through repeated inference tests with LLMs and made it possible for users to freely adjust various output items.

example.py
from exstruct import ExStructEngine, StructOptions, OutputOptions, FormatOptions, FilterOptions, DestinationOptions

engine = ExStructEngine(
    options=StructOptions(mode="verbose"),  # Hyperlinks are included by default in verbose mode
    output=OutputOptions(
        format=FormatOptions(pretty=True),
        filters=FilterOptions(include_shapes=False),  # Exclude shapes from output
        destinations=DestinationOptions(sheets_dir=Path("out_sheets")),  # Save per sheet
    ),
)
wb2 = engine.extract("input.xlsx")
engine.export(wb2, Path("out_filtered.json"))  # Output after applying filters

The Combination of MkDocs × AI Agents Was Excellent

It's great to have an API reference for a library, isn't it? For this project, I've deployed an API reference site using GitHub Pages.

https://harumiweb.github.io/exstruct/

I used a library called MkDocs-Material.
This is a static site generator that allows you to generate a site with a single command after pip installing it and setting up a simple YAML configuration and Markdown files.

How to Use

Installation:

pip install mkdocs-material

Additionally, by using a library called mkdocstrings-python, you can automatically generate documentation from the program's docstrings. This means that once you set up the wireframe for the API reference section, you don't really have to do much else.

Install this as well:

pip install mkdocstrings-python

By default, MkDocs treats Markdown files under the docs/ directory as pages. Please create your site pages as Markdown files there.

Next, create mkdocs.yml, set the mkdocstrings-python extension in the plugins, and define the page hierarchy as follows:

https://github.com/harumiWeb/exstruct/blob/main/mkdocs.yml

After that, simply run:

mkdocs build

and the static site will be generated automatically.

What's more, by combining this with GitHub Actions, you can easily deploy to GitHub Pages simultaneously with a repository push.
A simple workflow like the one below is all you need:

https://github.com/harumiWeb/exstruct/blob/main/.github/workflows/gh-pages.yml

By combining this with Codex:

  1. Implement with Codex (ensuring it writes thorough docstrings).
  2. Push to the repository.
  3. CI automatically completes the reference update.

In this way, you can build an API reference site alongside your implementation.
Having an API reference makes it feel like a professional library (just my humble opinion).

https://squidfunk.github.io/mkdocs-material/

Future Roadmap (Under Consideration)

  • Cell background color extraction (Gantt chart support) Implemented
  • SmartArt analysis support Implemented
  • Text extraction via local OCR for images
  • Semantic clustering of shape groups
  • JSON → Excel reconstruction (partial)
  • (The Abyss) OpenXML analysis → Server support by removing COM dependency

Conclusion

In this article, I introduced an approach to "brute-force" the analysis and structuring of troublesome Excel documents to significantly boost 👆 an LLM's understanding of Excel.

I hope this serves as a reference for those in industries like manufacturing who are struggling with the digital transformation (DX) of internal knowledge.

Thank you for reading this far!
It would be very encouraging if you could give a Star to the GitHub repository!

See you later! 👋

  • I've also written an article on how to utilize ExStruct in combination with NotebookLM.

https://qiita.com/harumiWeb/items/7e68af1d777113a981b2

  • Reference article about TOON format:

https://zenn.dev/akasan/articles/1fa9ad262ac719

  • Azure Document Intelligence

https://azure.microsoft.com/en-us/products/ai-foundry/tools/document-intelligence

Discussion

surprissurpris

素晴らしい機能をありがとうございます。ぜひ使わせていただきます。MCP サーバーがあるとさらに便利になるだろうと感じました。単純なユースケースですが、「ExStruct で Excel ファイルを読む」→「その内容を分析・要約」を agentic に LLM に実行させる、ということを考えています。ご検討のほど、どうぞよろしくお願いいたします。

ハルミハルミ

コメントありがとうございます!
確かにMCPサーバー等のAIエージェント向けの機能があるとExStructが更に強力になりそうですね!検討します🙇‍♂️

surprissurpris

ありがとうございます!ぜひ使わせていただきます。
今後、ExStruct への feature requests などは GitHub リポジトリの方で Issue として投稿してよろしいでしょうか?そちらの方が開発上管理しやすいと思われますので。

ハルミハルミ

ぜひぜひ何かあれば Issue にお願いします!応援ありがとうございますm(_ _)m