*Translated by AI*
# I Developed an OSS Library to Convert Troublesome Excel Documents into Semantically Structured JSON for RAG

## Introduction
Hello! I'm someone working in a role similar to an internal SE in the manufacturing industry.
Recently, I've been hooked on the RAG field, and I've come to realize that Excel documents are just too much of a nuisance. On Zenn as well, I regularly see articles about how to handle Excel documents in RAG.
While preceding engineers have presented various effective solutions, I've also tackled this issue with a different approach, so I'd like to share it 🔥
## TL;DR

- "exstruct" is a self-made library that structures Excel files using COM + xlwings/openpyxl. It outputs not just cell values, but also shapes, charts, hyperlinks, print areas, cell background colors, merged cell ranges, SmartArt, and even tables derived from borders, in JSON/YAML/TOON formats.
- It can extract flowchart arrow directions and chart axes/series without image recognition, supplementing information often lost in RAG to improve LLM restoration accuracy.
- In the demo, a table + line chart + flowchart is converted to JSON, then reconstructed into Markdown/Mermaid with high precision.
- Constraint: advanced extraction depends on Excel COM, requiring Windows + Excel (other OSes support only a simplified mode).
- Developed with Codex/ChatGPT, with API references automatically deployed via mkdocs-material + GitHub Pages.
## Intended Audience

- You're struggling to make LLMs/RAG understand Excel documents
- You find OCR and VLMs too heavy and want to try improving accuracy with a rule-based approach first
- A Windows + Excel environment is a given, as is typical for internal corporate use
## Introducing the Library: "ExStruct"

I've created a Python library that converts Excel documents into structured data such as JSON.
The library is named "ExStruct", short for Excel + Extract + Structure.
Simply put, it thoroughly structures workbook data using xlwings (for Excel COM access) and openpyxl.
## What It Can Do

- Analyzes cell values, shapes, charts, hyperlinks, print areas, cell background colors, merged cell ranges, SmartArt, etc., within a workbook and converts them into structured data for LLMs.
- Outputs structured data in formats such as JSON/YAML/TOON.
- Infers table ranges based on cell borders within a sheet.
- Supports data output per sheet or per print area.
- Provides PDF/PNG output, which is useful for LLM inputs.
- Runs easily on a local PC, as the analysis is purely rule-based.
- Enables AI agents to autonomously analyze Excel documents via an MCP server (added January 24, 2026).
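As a side note, the border-based table inference mentioned above can be pictured as a connected-component search over "this cell has a border" flags, followed by taking each component's bounding box. Below is a simplified, stdlib-only sketch of the idea; it is my own illustration, not ExStruct's actual implementation:

```python
# Hypothetical 0/1 grid: 1 where a cell has any border drawn (0-indexed).
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

def table_candidates(grid):
    """Group bordered cells into 4-connected regions and return 1-based bounding boxes."""
    seen = set()
    boxes = []
    for r0, row in enumerate(grid):
        for c0, v in enumerate(row):
            if v and (r0, c0) not in seen:
                stack, cells = [(r0, c0)], []
                seen.add((r0, c0))
                while stack:  # flood fill over adjacent bordered cells
                    r, c = stack.pop()
                    cells.append((r, c))
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                                and grid[nr][nc] and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            stack.append((nr, nc))
                rs = [r for r, _ in cells]
                cs = [c for _, c in cells]
                boxes.append((min(rs) + 1, min(cs) + 1, max(rs) + 1, max(cs) + 1))
    return boxes

print(table_candidates(grid))  # one bounding box per bordered region
```

Mapping the bounding box back to A1-style notation (like the `table_candidates` ranges in the demo output) is then a simple column-letter conversion.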
## How to Use

You can install it with either of the following commands:

```shell
uv add exstruct
# or
pip install exstruct
```
I've provided a simple API, as the user typically just needs to load the Excel file and output the structured data.
```python
from pathlib import Path

import exstruct as xs

# Selectable output verbosity: "light", "standard", "verbose"
wb = xs.extract("input.xlsx", mode="standard")
xs.export(wb, Path("out.json"), pretty=False)  # compact JSON

# Supports pandas-like / openpyxl-like model operations and saving
first_sheet = wb["Sheet1"]
for name, sheet in wb:
    print(name, len(sheet.rows))

wb.save("out.json", pretty=True)  # WorkbookData -> file
first_sheet.save("sheet.json")    # SheetData -> file
print(first_sheet.to_yaml())      # YAML output
```
## Usage Demo 1
In RAG and similar applications, charts and text objects within Excel documents are very troublesome. These cannot be retrieved by existing structuring libraries like pandas, openpyxl, or excel2json, and require specialized methods to extract.
I've prepared a sheet containing table data, a line chart, and a flowchart created with shapes as shown below. Let's try structuring this using the engine I built.

*Gathering the usual suspects found in Excel documents*
ExStruct can also be executed via CLI. You can output the data with the following command:
```shell
exstruct sample.xlsx --pretty > sample.json
```
Here is the resulting output (some parts omitted for brevity):
{
"book_name": "sample.xlsx",
"sheets": {
"Sheet1": {
"rows": [
{
"r": 3,
"c": {
"1": "Month",
"2": "Product A",
"3": "Product B",
"4": "Product C"
}
},
...
],
"shapes": [
{
"id": 1,
"text": "Start",
"l": 148,
"t": 220,
"kind": "shape",
"type": "AutoShape-FlowchartProcess"
},
{
"id": 2,
"text": "Read Input Data",
"l": 132,
"t": 282,
"kind": "shape",
"type": "AutoShape-FlowchartProcess"
},
{
"l": 193,
"t": 246,
"kind": "arrow",
"begin_arrow_style": 1,
"end_arrow_style": 2,
"begin_id": 1,
"end_id": 2,
"direction": "N"
},
...
],
"charts": [
{
"name": "Chart 1",
"chart_type": "Line",
"title": "Sales Data",
"y_axis_range": [
0.0,
200.0
],
"series": [
{
"name": "Product A",
"name_range": "Sheet1!$C$3",
"x_range": "Sheet1!$B$4:$B$9",
"y_range": "Sheet1!$C$4:$C$9"
},
...
],
"l": 377,
"t": 25
}
],
"table_candidates": [
"B3:E9"
]
}
}
}
What do you think?
In addition to cell values, the text and coordinates of text objects, angles and types of flowchart arrows, chart ranges, and ranges detected as tables are all comprehensively converted into JSON.
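Notably, the `begin_id`/`end_id` fields on arrow objects already encode graph structure, so the flowchart topology can even be rebuilt mechanically without an LLM. Here is a small stdlib-only sketch (not part of ExStruct) that turns a minimal excerpt of the `shapes` array above into Mermaid edges:

```python
import json

# A minimal excerpt mirroring the "shapes" array in the JSON output above
shapes = json.loads("""
[
  {"id": 1, "text": "Start", "kind": "shape"},
  {"id": 2, "text": "Read Input Data", "kind": "shape"},
  {"kind": "arrow", "begin_id": 1, "end_id": 2, "direction": "N"}
]
""")

# Node labels keyed by shape id
nodes = {s["id"]: s["text"] for s in shapes if s["kind"] == "shape"}

# Each arrow becomes one Mermaid edge between its begin/end shapes
mermaid = ["flowchart TD"]
for s in shapes:
    if s["kind"] == "arrow":
        mermaid.append(
            f"    {s['begin_id']}[{nodes[s['begin_id']]}] --> {s['end_id']}[{nodes[s['end_id']]}]"
        )

print("\n".join(mermaid))
```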
## Letting the LLM Infer the Content of the Output JSON

> The above is a structured Excel workbook in a unique JSON format. Please convert the content of this workbook into Markdown. Using Mermaid format is also acceptable.
Here is Copilot's response:
Below is the content of the provided Excel workbook reconstructed in Markdown format. The three elements—table, chart, and flowchart—are each appropriately represented.
---
## 📊 Sales Data Table
| Month | Product A | Product B | Product C |
|------------|-----------|-----------|-----------|
| 2025-01-01 | 120 | 80 | 60 |
| 2025-02-01 | 135 | 90 | 64 |
| 2025-03-01 | 150 | 100 | 70 |
| 2025-04-01 | 170 | 110 | 72 |
| 2025-05-01 | 160 | 120 | 75 |
| 2025-06-01 | 180 | 130 | 80 |
---
## 📈 Sales Data (Line Chart)
- Chart Title: **Sales Data**
- Chart Type: Line Chart
- Y-axis Range: 0 to 200
- Data Series:
- Product A: 120 → 135 → 150 → 170 → 160 → 180
- Product B: 80 → 90 → 100 → 110 → 120 → 130
- Product C: 60 → 64 → 70 → 72 → 75 → 80
---
## 🔄 Process Flow (Mermaid Flowchart)
```mermaid
flowchart TD
A[Start]
B[Read Input Data]
C{Is Format Correct?}
D[Process One Item]
E{More Items Remaining?}
F[Generate Output]
G{Send Email?}
H[Display Error]
I[Send Email]
J[End]
A --> B
B --> C
C -->|Yes| D
C --> H
D --> E
E --> F
F --> G
G -->|Yes| I
G -->|No| J
H --> J
I --> J
```
*After rendering the Mermaid flowchart*
As you can see, various data types are inferred with high precision and detail.
It was particularly surprising to see that the flowchart could be perfectly understood without any image recognition.
## Usage Demo 2
Next, let's analyze an Excel procedure manual, which is very common in industries like manufacturing.

*A very common type of procedure manual, featuring many merged cells*
Here is the JSON version (this time with the pretty option turned OFF):
{"book_name": "Sample_Procedure_Manual.xlsx", "sheets": {"Work Procedure Manual": {"rows": [{"r": 2, "c": {"1": "Work Procedure Manual"}}, {"r": 3, "c": {"1": "Task Name", "2": "Assembly of Parts A and B", "5": "Author", "6": "Hanako Sato"}}, {"r": 4, "c": {"1": "Start Time", "2": "2025-10-10 13:50:00", "5": "End Time", "6": "2025-10-10 17:30:00"}}, {"r": 5, "c": {"1": "Step No.", "2": "Work Content", "5": "Precautions", "8": "Required Tools", "9": "Check"}}, {"r": 6, "c": {"1": 1, "2": "Unpack Part A and place it in the center of the workbench.", "5": "Check for scratches or dirt on the surface", "8": "Work gloves"}}, {"r": 7, "c": {"2": "Visually inspect for surface scratches, deformation, or dirt.", "8": "Air duster"}}, {"r": 8, "c": {"8": "Cleaning cloth"}}, {"r": 11, "c": {"1": 2, "2": "Check the pin position and orientation of Part B,", "5": "Ensure the pin is not reversed.", "8": "Tweezers"}}, {"r": 12, "c": {"2": "and confirm fit with Part A.", "8": "Protection mat"}}, {"r": 16, "c": {"1": 3, "2": "Carefully insert Part B's pin into Part A's hole.", "5": "If there is resistance, do not force it; reposition and try again.", "8": "Alignment jig"}}, {"r": 17, "c": {"8": "Rubber mallet"}}, {"r": 21, "c": {"1": 4, "2": "After temporary fixing, set the specified screws and tighten them temporarily.", "5": "Check for any forgotten or over-tightened screws.", "8": "Torque wrench"}}, {"r": 22, "c": {"2": "Perform final tightening with the torque wrench according to the specified value."}}, {"r": 26, "c": {"1": 5, "2": "Set the assembly in the inspection device and perform an operation check."}}, {"r": 27, "c": {"2": "Record the inspection results in the inspection sheet and clearly state pass/fail."}}, {"r": 28, "c": {"2": "Pack passed products and move them to the designated storage area."}}], "table_candidates": ["B2:C7", "F3:G6"]}, "Blank": {"rows": [{"r": 2, "c": {"1": "Work Procedure Manual"}}, {"r": 3, "c": {"1": "Task Name", "5": "Author"}}, {"r": 4, "c": 
{"1": "Start Time", "5": "End Time"}}, {"r": 5, "c": {"1": "Step", "2": "Work Content", "5": "Notes (Required tools, materials, precautions)", "8": "Check"}}, {"r": 6, "c": {"1": 1, "8": "False"}}, {"r": 11, "c": {"1": 2, "8": "False"}}, {"r": 16, "c": {"1": 3, "8": "False"}}, {"r": 21, "c": {"1": 4, "8": "False"}}, {"r": 26, "c": {"1": 5, "8": "False"}}]}}}
Here is what the LLM inferred from the JSON:
## Work Procedure Manual
### Basic Information
- **Task Name:** Assembly of Parts A and B
- **Author:** Hanako Sato
- **Start Time:** 2025-10-10 13:50:00
- **End Time:** 2025-10-10 17:30:00
---
### Step List
| **Step No.** | **Work Content** | **Precautions** | **Required Tools** | **Check** |
|---:|---|---|---|---|
| 1 | Unpack Part A and place it in the center of the workbench.; Visually inspect for surface scratches, deformation, or dirt. | Check for scratches or dirt on the surface | Work gloves; Air duster; Cleaning cloth | False |
| 2 | Check the pin position and orientation of Part B, and confirm fit with Part A. | Ensure the pin is not reversed. | Tweezers; Protection mat | False |
| 3 | Carefully insert Part B's pin into Part A's hole. | If there is resistance, do not force it; reposition and try again. | Alignment jig; Rubber mallet | False |
| 4 | After temporary fixing, set the specified screws and tighten them temporarily.; Perform final tightening with the torque wrench according to the specified value. | Check for any forgotten or over-tightened screws. | Torque wrench | False |
| 5 | Set the assembly in the inspection device and perform an operation check.; Record the inspection results in the inspection sheet and clearly state pass/fail.; Pack passed products and move them to the designated storage area. | | | False |
Since this document has no shapes or similar objects, it was understood without any trouble.
As you can see, the recognition accuracy can be improved significantly even with just JSON. By combining this with PDF or image versions and providing them to the LLM simultaneously, you can make it understand Excel documents almost perfectly.
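Incidentally, the merged-cell grouping the LLM performed can also be reproduced mechanically: in the `rows` structure above, the step number in column `"1"` only appears on the first row of each merged block, so a forward fill recovers the grouping. A stdlib-only sketch (not part of ExStruct):

```python
# Rows excerpted from the procedure-manual JSON above; column "1" holds
# the step number only on the first row of each merged block.
rows = [
    {"r": 6,  "c": {"1": 1, "2": "Unpack Part A and place it in the center of the workbench."}},
    {"r": 7,  "c": {"2": "Visually inspect for surface scratches, deformation, or dirt."}},
    {"r": 11, "c": {"1": 2, "2": "Check the pin position and orientation of Part B,"}},
    {"r": 12, "c": {"2": "and confirm fit with Part A."}},
]

steps = {}
current = None
for row in rows:
    cells = row["c"]
    current = cells.get("1", current)  # forward-fill the merged step number
    if current is not None and "2" in cells:
        steps.setdefault(current, []).append(cells["2"])

for step, texts in steps.items():
    print(step, " ".join(texts))  # one consolidated line per step
```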
## Comparison with Existing Methods

| Method | Cell Values | Shapes/Arrows | Chart Structure | Table Inference | Cost | Environment |
|---|---|---|---|---|---|---|
| pandas / openpyxl | ◎ | ✕ | ✕ | △ | Free | Cross-OS |
| OCR / VLM | △ | △ | △ | △ | High | GPU/Cloud |
| Azure Document Intelligence | △ | △ | △ | △ | High | Cloud |
| ExStruct | ◎ | ◎ | ◎ | ◎ | Free | Windows + Excel |

(◎ = full support, △ = partial, ✕ = not supported)
## RAG Use Cases
ExStruct is not a library intended to run as a resident server process.
It is intended to be used as an offline preprocessing step in a Windows environment where Excel is present, feeding the generated structured data into a RAG or search platform.
## Of Course, There Are Drawbacks
I've emphasized the good points, but of course, there are also drawbacks.
Basically, advanced extraction requires a COM reference through xlwings, so the operating environment must be Windows + Excel. It won't work on Mac or Linux. (As a fallback, I have provided a "light mode" for simple structuring using only openpyxl).
However, environments where people are struggling with Excel documents usually already have Windows + Excel, so I decided to move forward with this implementation.
If I were to implement OpenXML analysis manually, I could move away from COM dependency, but that would be quite a chore, so I'll stick with this for now. I'm sure Big Tech will eventually develop Excel document analysis technology.
## Results
Even with this somewhat tedious, non-AI approach of Excel analysis and thorough structuring, the result was a dramatic improvement in the LLM's inference accuracy for Excel documents.
With just this structuring process, I have personally felt the effects in areas such as:
- Answering questions about internal company documents via LLM
- Reducing the man-hours required for preprocessing for RAG data accumulation
Cloud services specialized in document analysis like Azure's Document Intelligence are also emerging, and Excel document analysis technology will likely continue to evolve. However, for companies that cannot afford a large budget, this kind of approach might be quite effective.
## Addendum: MCP Server Functionality

I have implemented MCP server functionality as an optional extra for ExStruct (added January 24, 2026).
This allows AI agents to autonomously analyze Excel documents locally.
## (Bonus) Library Development Story
The library was developed using uv × Codex × ChatGPT.
| Library | Use case |
|---|---|
| xlwings | Provides API for Excel COM reference |
| pydantic | Data model definitions |
| openpyxl, pandas | Fast cell scanning |
| scipy | Used for table detection algorithms |
| pytest, pytest-mock | Testing |
| ruff, mypy | Linting, formatting, and type checking |
| mkdocs-material, mkdocstrings-python | Automatic generation of API reference static sites from md files and docstrings |
### Internal Implementation
The structure of the data model looks something like this:
To be honest, it's just persistently referencing COM, performing rule-based analysis → extracting data, and thoroughly structuring it.
Below is the code for shape object analysis. It determines each object's type by matching its shape-type number against the shape-type numbers published by Microsoft, and for arrow-type objects it calculates the angle and direction from their properties.
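The arrow-direction part can be illustrated with a small sketch (my own illustration, not ExStruct's actual code): compute the arrow's angle from its endpoints and bucket it into one of eight compass directions, remembering that Excel's y coordinate grows downward:

```python
import math

def arrow_direction(x1, y1, x2, y2):
    """Bucket an arrow's angle into one of 8 compass directions.

    Excel's y axis grows downward, so dy is negated to make screen-up = North.
    Illustrative sketch only, not ExStruct's implementation.
    """
    angle = math.degrees(math.atan2(-(y2 - y1), x2 - x1)) % 360
    dirs = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]  # every 45 degrees
    return dirs[round(angle / 45) % 8]

print(arrow_direction(0, 0, 0, 10))  # arrow pointing down the sheet -> "S"
print(arrow_direction(0, 10, 0, 0))  # arrow pointing up the sheet -> "N"
```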
### Key Points of Focus
Since this structured data is designed with the premise of being input into LLMs, addressing context size and token count issues is essential.
I refined the structure of the data through repeated inference tests with LLMs and made it possible for users to freely adjust various output items.
```python
from pathlib import Path

from exstruct import (
    ExStructEngine,
    StructOptions,
    OutputOptions,
    FormatOptions,
    FilterOptions,
    DestinationOptions,
)

engine = ExStructEngine(
    options=StructOptions(mode="verbose"),  # hyperlinks are included by default in verbose mode
    output=OutputOptions(
        format=FormatOptions(pretty=True),
        filters=FilterOptions(include_shapes=False),  # exclude shapes from the output
        destinations=DestinationOptions(sheets_dir=Path("out_sheets")),  # save per sheet
    ),
)
wb2 = engine.extract("input.xlsx")
engine.export(wb2, Path("out_filtered.json"))  # output after applying filters
```
## The Combination of MkDocs × AI Agents Was Excellent
It's great to have an API reference for a library, isn't it? For this project, I've deployed an API reference site using GitHub Pages.
I used a library called MkDocs-Material.
This is a static site generator that allows you to generate a site with a single command after pip installing it and setting up a simple YAML configuration and Markdown files.
### How to Use

Installation:

```shell
pip install mkdocs-material
```
Additionally, by using a library called mkdocstrings-python, you can automatically generate documentation from the program's docstrings. This means that once you set up the wireframe for the API reference section, you don't really have to do much else.
Install this as well:

```shell
pip install mkdocstrings-python
```
By default, MkDocs treats Markdown files under the docs/ directory as pages. Please create your site pages as Markdown files there.
Next, create mkdocs.yml, set the mkdocstrings-python extension in the plugins, and define the page hierarchy as follows:
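The original configuration isn't reproduced here, but a minimal sketch along these lines (site name and page paths are illustrative, not ExStruct's actual config) captures the idea:

```yaml
site_name: ExStruct
theme:
  name: material
plugins:
  - search
  - mkdocstrings:
      default_handler: python
nav:
  - Home: index.md
  - API Reference: reference.md
```

A page such as `docs/reference.md` then only needs a `::: exstruct` directive; mkdocstrings expands it into a full API reference from the docstrings.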
After that, simply run:

```shell
mkdocs build
```

and the static site will be generated automatically.
What's more, by combining this with GitHub Actions, you can easily deploy to GitHub Pages simultaneously with a repository push.
A simple workflow like the one below is all you need:
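One possible minimal workflow is sketched below (the action versions and Python version are my assumptions, not the article's actual file; `mkdocs gh-deploy` builds the site and pushes it to the gh-pages branch):

```yaml
name: docs

on:
  push:
    branches: [main]

permissions:
  contents: write  # allow the job to push to the gh-pages branch

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install mkdocs-material mkdocstrings-python
      - run: mkdocs gh-deploy --force
```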
By combining this with Codex:
- Implement with Codex (ensuring it writes thorough docstrings).
- Push to the repository.
- CI automatically completes the reference update.
In this way, you can build an API reference site alongside your implementation.
Having an API reference makes it feel like a professional library (just my humble opinion).
## Future Roadmap (Under Consideration)

- ~~Cell background color extraction (Gantt chart support)~~ Implemented
- ~~SmartArt analysis support~~ Implemented
- Text extraction via local OCR for images
- Semantic clustering of shape groups
- JSON → Excel reconstruction (partial)
- (The Abyss) OpenXML analysis → server support by removing the COM dependency
## Conclusion

In this article, I introduced an approach to "brute-force" the analysis and structuring of troublesome Excel documents to significantly boost an LLM's understanding of Excel.
I hope this serves as a reference for those in industries like manufacturing who are struggling with the digital transformation (DX) of internal knowledge.
Thank you for reading this far!
It would be very encouraging if you could give a Star to the GitHub repository!
See you later! 👋
## Supplementary Links
- I've also written an article on how to utilize ExStruct in combination with NotebookLM.
- Reference article about TOON format:
- Azure Document Intelligence
## Discussion
> Thank you for this wonderful feature. I will definitely use it. I felt it would become even more convenient with an MCP server. As a simple use case, I'm thinking of having an LLM agentically perform "read an Excel file with ExStruct" → "analyze and summarize its contents". I'd appreciate your consideration.

Thank you for the comment!
Indeed, AI-agent-oriented features such as an MCP server would make ExStruct even more powerful! I'll look into it 🙇‍♂️

It's an MVP, but I've implemented the MCP server functionality!
This article has been updated accordingly.

> Thank you! I'll definitely use it.
> Going forward, may I post feature requests for ExStruct as Issues on the GitHub repository? I imagine that would be easier to manage from a development standpoint.

By all means, please post anything as an Issue! Thank you for the support m(_ _)m