🐈

名前付きキャプチャグループで pegjs 風にオブジェクトを組み立てるパーサジェネレータを作ってみる

2021/09/29に公開

JavaScript

parser

pegjs

tech

tldr;

ES2018 の named capture group を活用することで、簡易なパーサジェネレータを 1kb 切るサイズで実装できた。ただし正規表現の表現力に縛られているので、再帰はできない。

完成形 https://github.com/mizchi/flat-pargen

named capture group を pegjs 風に活用したい

ES2018 以降の JavaScript の正規表現では、 (?<Name>...) で正規表現の実行結果に名前をつけることができます。

> new RegExp("(?<a>x)(?<b>y)").exec("xy").groups
[Object: null prototype] { a: 'x', b: 'y' }

自分はこれをみて pegjs を思い出しました。例えば pegjs はこのようにルールを表現します。

start
  = additive

additive
  = left:multiplicative "+" right:additive { return left + right; }
  / multiplicative

multiplicative
  = left:primary "*" right:multiplicative { return left * right; }
  / primary

primary
  = integer
  / "(" additive:additive ")" { return additive; }

integer "integer"
  = digits:[0-9]+ { return parseInt(digits.join(""), 10); }

Documentation » PEG.js – Parser Generator for JavaScript

名前を付けながらパターンを宣言して、それをコードブロックでキャプチャして変形します。これでパースしながらASTを組み立てることができるわけですね。

でもちょっと工夫すれば pegjs と同じような深いオブジェクトを扱えそうな気がしたので、やってみました。

考え方

named capture group で pegjs のようなネストを扱えるかというと、そのままではできません。実際に capture できるのは一層だけです。つまり、 match.groups の実行結果は必ず Record<string, string> になります。

実際に実行してみると次のようになります。

> new RegExp("(?<a>(?<b>x))").exec("x").groups
[Object: null prototype] { a: 'x', b: 'x' }

pegjs の発想だと (?<a>(?<b>x)) は {a: {b: "x"}} と出て欲しいところですね。

じゃあ再帰的に1層目だけ named capture group の正規表現を作って、その実行結果に対して、再び named capture group でのパースを行えばいいのでは？と思ってやってみたのが、以下の実装と解説になります。

パーサージェネレータとして表現したいパターンを考える

Expr 正規表現
Seq 連続するシーケンスにすべてマッチする
Or いずれかのパターンにマッチ
Repeat 指定表現の繰り返し

今回はこの4つを扱います。 TypeScript でデータ構造を宣言してみます。

enum Kind {
  SEQ = 1,
  REPEAT,
  EXPR,
  OR,
}

type Parser<In = any, Out = any> = (input: In) => Out;
type NodeBase = {
  // capture 用の名前
  key: string | void;
  reshape: Parser;
};

export type Node = Seq | Expr | Or | Repeat;

export type Seq = NodeBase & {
  kind: Kind.SEQ;
  children: Node[];
};

export type Repeat = NodeBase & {
  kind: Kind.REPEAT;
  pattern: Node;
};

export type Or = NodeBase & {
  kind: Kind.OR;
  patterns: Array<Seq | Expr>;
};

export type Expr = NodeBase & {
  kind: Kind.EXPR;
  expr: string;
};

AST を宣言しただけです。

次に、全体を一つの正規表現にする関数を用意します。

"a", "b", "c" のシーケンスの入力があったとき、 Seq は abc に、 Or は (a|b|c) の正規表現として組み立てられます。また、 Repeat は (pattern)* にコンパイルできますね。

// 文字列結合用のヘルパ
const concat = <T>(items: T[], fn: (t: T) => string) =>
  items.reduce((acc: string, item: T) => acc + fn(item), "");

// serialize
function serializeToFlatRegex(node: Node): string {
  switch (node.kind) {
    case Kind.EXPR: {
      return node.expr;
    }
    case Kind.OR: {
      const patterns = node.patterns.map(serializeToFlatRegex);
      return "(" + patterns.join("|") + ")";
    }
    case Kind.REPEAT: {
      const pattern = serializeToFlatRegex(node.pattern);
      return `(${pattern}){0,}`;
    }
    case Kind.SEQ: {
      return concat(node.children, serializeToFlatRegex);
    }
    default: {
      throw new Error("WIP expr and parser");
    }
  }
}

Seq のときだけ (?<key>expr) の named capture group として扱いたいので、一層だけの named capture group に変換します。

function serializeToGroupRegex(seq: Seq): string {
  return seq.children.reduce((acc, child) => {
    const flat = serializeToFlatRegex(child);
    if (child.key) {
      return `${acc}(?<${child.key}>${flat})`;
    }
    return `${acc}${flat}`;
  }, "");
}

で、この定義をパーサの関数に変換する compile 関数を実装してみました。

export function compile(node: Node): Parser<string, any> {
  const reshape = node.reshape;
  switch (node.kind) {
    case Kind.EXPR: {
      const re = new RegExp(`^${serializeToFlatRegex(node)}`);
      return (input: string) => {
        const m = re.exec(input);
        if (m == null) return;
        return reshape(input);
      };
    }
    case Kind.OR: {
      const compiledPatterns = node.patterns.map((p) => {
        return {
          parse: compile(p),
          re: new RegExp(`^${serializeToFlatRegex(p)}`),
        };
      });
      return (input: string) => {
        for (const next of compiledPatterns) {
          const m = next.re.exec(input);
          if (m == null) continue;
          const result = next.parse(input);
          if (result) {
            return reshape(result);
          }
        }
        return null;
      };
    }
    case Kind.SEQ: {
      const re = new RegExp(`^${serializeToGroupRegex(node)}`);
      const composedParser = node.children.reduce(
        (parent: Parser<any, any>, next: Node) => {
          if (next.key == null) return parent;
          const childParser = compile(next);
          return (result: any) =>
            parent({
              ...result,
              [next.key!]: childParser(result[next.key!]),
            });
        },
        (result: any) => result
      );
      return (input: string = "") => {
        const m = input.match(re);
        if (m == null) return;
        const full = m[0];
        if (m.groups) {
          return reshape(composedParser(m.groups));
        }
        return reshape(composedParser(full));
      };
    }
    case Kind.REPEAT: {
      const re = new RegExp(`^${serializeToFlatRegex(node.pattern)}`);
      const parser = compile(node.pattern);
      return (input: string) => {
        const xs: string[] = [];
        while (input.length > 0) {
          const m = input.match(re);
          if (m == null) break;
          const full = m[0];
          xs.push(full);
          input = input.slice(full.length);
        }
        return node.reshape(xs.map(parser) as any);
      };
    }
    default: {
      throw new Error("WIP expr and parser");
    }
  }
}

各 Node ごとに、自分が入力を扱う関数を返す返却し、それを主に Seq で組み立てます。Seq から見たとき key 属性を持っているときだけ子要素のパーサを使って値を変形します。

Expr: 正規表現の実行
Seq: named capture group 付きの正規表現を生成して、入力を判定して groups にバラして、その key ごとの parser に委譲して、子要素のパーサを再帰的に実行する
Repeat: フラットに展開した子要素の正規表現で入力を複数回判定して、入力を全部くいきったら子要素のパーサに map しつつ渡します。
Or: 複数のパターンをとり、一つずつ flat にした正規表現のパターンで判定して、一番最初に満たしたものを子要素のパーサに渡しながら変換します。

reshape 関数は pegjs でいう { return ...} のブロックで、パース済みのものを別の値に変形したいときのヘルパです。

完成

できあがったコードはこちら。解説はしませんが、ASTビルダも用意してます。

enum Kind {
  SEQ = 1,
  REPEAT,
  EXPR,
  OR,
}

type Parser<In = any, Out = any> = (input: In) => Out;
type NodeBase = {
  reshape: Parser;
  key: string | void;
};

export type Node = Seq | Expr | Or | Repeat;

export type Seq = NodeBase & {
  kind: Kind.SEQ;
  children: Node[];
};

export type Repeat = NodeBase & {
  kind: Kind.REPEAT;
  pattern: Node;
};

export type Or = NodeBase & {
  kind: Kind.OR;
  patterns: Array<Seq | Expr>;
};

export type Expr = NodeBase & {
  kind: Kind.EXPR;
  expr: string;
};

const concat = <T>(items: T[], fn: (t: T) => string) =>
  items.reduce((acc: string, item: T) => acc + fn(item), "");

// serialize
function serializeToFlatRegex(node: Node): string {
  switch (node.kind) {
    case Kind.EXPR: {
      return node.expr;
    }
    case Kind.OR: {
      const patterns = node.patterns.map(serializeToFlatRegex);
      return "(" + patterns.join("|") + ")";
    }
    case Kind.REPEAT: {
      const pattern = serializeToFlatRegex(node.pattern);
      return `(${pattern}){0,}`;
    }
    case Kind.SEQ: {
      return concat(node.children, serializeToFlatRegex);
    }
    default: {
      throw new Error("WIP expr and parser");
    }
  }
}

function serializeToGroupRegex(seq: Seq): string {
  return seq.children.reduce((acc, child) => {
    const flat = serializeToFlatRegex(child);
    if (child.key) {
      return `${acc}(?<${child.key}>${flat})`;
    }
    return `${acc}${flat}`;
  }, "");
}

export function compile(node: Node): Parser<string, any> {
  const reshape = node.reshape;
  switch (node.kind) {
    case Kind.EXPR: {
      const re = new RegExp(`^${serializeToFlatRegex(node)}`);
      return (input: string) => {
        const m = re.exec(input);
        if (m == null) return;
        return reshape(input);
      };
    }
    case Kind.OR: {
      const compiledPatterns = node.patterns.map((p) => {
        return {
          parse: compile(p),
          re: new RegExp(`^${serializeToFlatRegex(p)}`),
        };
      });
      return (input: string) => {
        for (const next of compiledPatterns) {
          const m = next.re.exec(input);
          if (m == null) continue;
          const result = next.parse(input);
          if (result) {
            return reshape(result);
          }
        }
        return null;
      };
    }
    case Kind.SEQ: {
      const re = new RegExp(`^${serializeToGroupRegex(node)}`);
      const composedParser = node.children.reduce(
        (parent: Parser<any, any>, next: Node) => {
          if (next.key == null) return parent;
          const childParser = compile(next);
          return (result: any) =>
            parent({
              ...result,
              [next.key!]: childParser(result[next.key!]),
            });
        },
        (result: any) => result
      );
      return (input: string = "") => {
        const m = input.match(re);
        if (m == null) return;
        const full = m[0];
        if (m.groups) {
          return reshape(composedParser(m.groups));
        }
        return reshape(composedParser(full));
      };
    }
    case Kind.REPEAT: {
      const re = new RegExp(`^${serializeToFlatRegex(node.pattern)}`);
      const parser = compile(node.pattern);
      return (input: string) => {
        const xs: string[] = [];
        while (input.length > 0) {
          const m = input.match(re);
          if (m == null) break;
          const full = m[0];
          xs.push(full);
          input = input.slice(full.length);
        }
        return node.reshape(xs.map(parser) as any);
      };
    }
    default: {
      throw new Error("WIP expr and parser");
    }
  }
}

// helper
const defaultReshape: Parser<any, any> = <T>(i: T): T => i;

function createSeq(children: Expr[], reshape: Parser = defaultReshape): Seq {
  return {
    kind: Kind.SEQ,
    children,
    reshape,
    key: undefined,
  };
}

function createOr(
  patterns: Array<Seq | Expr>,
  reshape: Parser = defaultReshape
): Or {
  return {
    kind: Kind.OR,
    patterns,
    key: undefined,
    reshape,
  };
}

function createRepeat(
  pattern: Node,
  reshape: Parser<any, any> = defaultReshape
): Repeat {
  return {
    kind: Kind.REPEAT,
    pattern,
    reshape,
    key: undefined,
  };
}

function createExpr(
  expr: string,
  reshape: Parser<any, any> = defaultReshape
): Expr {
  return {
    kind: Kind.EXPR,
    expr,
    reshape,
    key: undefined,
  };
}

export const $ = {
  expr: createExpr,
  repeat: createRepeat,
  or: createOr,
  seq: createSeq,
  param<T extends Node>(key: string, node: T): T {
    return { ...node, key };
  },
  join(...expr: string[]): Expr {
    return createExpr(expr.join(""));
  },
};

使ってみる

(npm の @mizchi/flat-pargen にリリースしてあります)

{ a: 1, b: 2, c: "xxx", d: true } のオブジェクトリテラルもどきをパースするルールを書いてみます。

// ... 今まで書いたコード
// もしくは import {compile, $} from "@mizchi/flat-pargen";

// whitespace
const _ = $.expr("\\s*");

// 文字列、数値、真偽値いずれか
const anyLiteral = $.or([
  $.expr("\\d+", (input) => ({ type: "number", value: Number(input) })),
  $.expr("true|false", (input) => ({
    type: "boolean",
    value: Boolean(input),
  })),
  $.expr(`"[^"]*"`, (input) => ({
    type: "string",
    value: input.slice(1, -1),
  })),
]);

// key: literal のペアの表現
const keyPairSeq = [
  $.param("key", $.expr("\\w+")),
  _,
  $.expr("\\:"),
  _,
  $.param("val", anyLiteral as any),
];

const parse = compile(
  $.seq(
    [
      $.expr("\\{"),
      _,
      $.param("head", $.seq(keyPairSeq)),
      $.param<any>(
        "tail",
        $.repeat(
          $.seq([
            // , a: b
            _,
            $.expr(","),
            _,
            ...keyPairSeq,
            _,
          ])
        )
      ),
      _,
      $.expr("\\}"),
    ],
    (input: any) => {
      return [input.head, ...input.tail];
    }
  )
);
console.log(parse(`{ a: 1, b: 2, c: "xxx", d: true }`);
/*
  [
    { key: "a", val: { type: "number", value: 1 } },
    { key: "b", val: { type: "number", value: 2 } },
    { key: "c", val: { type: "string", value: "xxx" } },
    { key: "d", val: { type: "boolean", value: true } },
  ]
*/

head と tail を別のルールで組み立てて reshape 関数で一つの配列に組み直しています。区切り文字のあるリスト表現を扱う時の定石ですね。

おわり

今回作ったのは、「複雑な正規表現を多少簡単に扱える」といったレベルのものです。本質的に正規表現の表現力に縛られているので、再帰の実装はできません。つまり、ネストしたオブジェクトはパースできないので、JSON のパーサが書けません。

今回この試みを行った個人的なモチベーションとしては、named capture group の活用を考えるのと、小さいサイズのパーサジェネレータのサイズがほしかったのがあります。その点では満足しています。

本当は再帰を扱えるような、正規表現でない実装を考えていたのですが、そういうのはこの延長ではなく別のパーサコンビネータの役目だと思い、小サイズを実現したこのフェーズで終えることとします。

tldr;

named capture group を pegjs 風に活用したい

考え方

パーサージェネレータとして表現したいパターンを考える

完成

使ってみる

おわり

Discussion