
Extracting text from MS Word files (.docx) with Go stream processing feels hacky and fun

Published 2022/02/04

Word

tech

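Introduction

This article is a bit of play on the topic in the title.

The implementation details differ from what appears in this article, but equivalent code is available in the repository below. Feel free to take a look.

https://github.com/tenkoh/go-docc

There is also a simple app that writes the extracted content to standard output.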

go install github.com/tenkoh/go-docc/cmd/docc@latest

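The agony of .docx

I just, I just want to get at the contents of a document. I don't want to depend on Microsoft Office for that alone.
Processing a large number of .docx files by sheer manpower is not my idea of fun. I want easy batch processing.

This seems to be a common need, and others have already written libraries for handling .docx. Among those implemented in Go, I came across the following two, for example. Both are quite feature-rich. Impressive.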

https://github.com/unidoc/unioffice

https://github.com/sajari/docconv

Back to the point

As is well known, a .docx file is really a ZIP archive that contains a set of XML files. So if all you want is the document text, parsing the word/document.xml obtained by unpacking the .docx should be enough.

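This looks like a job where Go's stream processing could really shine (in my opinion). It has a nice expert feel to it, so let's try it just for the fun of it.

Implementation

Open the .docx with archive/zip, look through the File field ([]*zip.File) of the returned reader for the entry named word/document.xml, and pass the io.ReadCloser obtained by calling Open on that *zip.File to xml.NewDecoder. The only text we need is the content of the p > r > t elements, so we join those and return them, and that's it.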

package docc // the package name is arbitrary; pick whatever suits your project

import (
	"archive/zip"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"path/filepath"
)

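// ErrDocumentsNotFound is returned when the archive has no word/document.xml.
var ErrDocumentsNotFound = errors.New("word/document.xml not found in the archive")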

type Document struct {
	XMLName xml.Name `xml:"document"`
	Body    struct {
		P []struct {
			R []struct {
				T struct {
					Text  string `xml:",chardata"`
					Space string `xml:"space,attr"`
				} `xml:"t"`
			} `xml:"r"`
		} `xml:"p"`
	} `xml:"body"`
}

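// Decode opens the .docx at docxPath and returns the text of each paragraph.
func Decode(docxPath string) ([]string, error) {
	archive, err := zip.OpenReader(docxPath)
	if err != nil {
		return nil, fmt.Errorf("could not open the archive: %w", err)
	}
	defer archive.Close()

	// Clean both names so the comparison behaves the same on every OS.
	target := filepath.Clean("word/document.xml")
	for _, f := range archive.File {
		if filepath.Clean(f.Name) != target {
			continue
		}

		fd, err := f.Open()
		if err != nil {
			return nil, fmt.Errorf("could not open %s: %w", f.Name, err)
		}
		defer fd.Close()

		// Propagate the decode error instead of discarding it.
		return decodeXML(fd)
	}
	return nil, ErrDocumentsNotFound
}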

func decodeXML(r io.Reader) ([]string, error) {
	doc := new(Document)
	if err := xml.NewDecoder(r).Decode(doc); err != nil {
		return nil, fmt.Errorf("could not decode the document: %w", err)
	}
	ps := []string{}
	for _, p := range doc.Body.P {
		t := ""
		for _, r := range p.R {
			t = t + r.T.Text
		}
		ps = append(ps, t)
	}
	return ps, nil
}

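Conclusion

As noted along the way, this is a bare-bones implementation with nothing but room for improvement, but with it the days of opening .docx files by hand are over. On that note, I'll take my leave.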
