
Parsing AWS Price List API output with Python ijson

Published 2023/01/04

AWS

Motivation

AWS provides an API for looking up service prices, called the Price List API.

Using it means downloading a huge JSON file and reading the data out of it. I gave it a try with the Python module ijson.

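Using the Bulk API

With the AWS Price List Bulk API, you can query the prices of AWS services in bulk. The API returns a JSON file or a CSV file. The Bulk API retains all past versions of the price list.

I also wanted to look at past prices, so I use the Bulk API. You access a URL and download the JSON (curl works too). Note: changing the extension to csv downloads a CSV file instead.

The URL to access contains the region us-east-1 and has the form https://pricing.us-east-1.amazonaws.com/offers/...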
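Since the files can be very large, it helps to stream the download to disk rather than read the whole response into memory. The following is a minimal sketch using only the standard library; the function name download, the chunk size, and the destination path are my own choices for illustration, not part of the original scripts.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Stream url to dest in fixed-size chunks so the body never sits in memory."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
    return dest


# Example call (not executed here); the destination file name is hypothetical:
# download(
#     "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/index.json",
#     Path("data/index.json"),
# )
```

curl does the same job from the command line; the Python version is only convenient when the download is part of a larger script.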

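Downloading the JSON

An offer file is a file that describes a service's specifications and pricing.

Offer index file

https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/index.json

It lists the address of each service's offer file. As of 2023/01/03 there were 205 services. The entry for EC2 looks like this.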

offer index file

  "offers" : {
    "AmazonEC2" : {
      "offerCode" : "AmazonEC2",
      "versionIndexUrl" : "/offers/v1.0/aws/AmazonEC2/index.json",
      "currentVersionUrl" : "/offers/v1.0/aws/AmazonEC2/current/index.json",
      "currentRegionIndexUrl" : "/offers/v1.0/aws/AmazonEC2/current/region_index.json",
    },

(savingsPlanVersionIndexUrl and currentSavingsPlanIndexUrl also exist but are omitted here.)
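The URLs in the offer index are relative paths, so they have to be joined with the host before they can be downloaded. Here is a small sketch of that step, assuming the index has already been loaded into a dict (for example with json.load). The helper name offer_url is hypothetical, and the dict is just the EC2 excerpt shown above.

```python
from urllib.parse import urljoin

BASE_URL = "https://pricing.us-east-1.amazonaws.com"

# Excerpt of the offer index shown above, as it would look after json.load.
offer_index = {
    "offers": {
        "AmazonEC2": {
            "offerCode": "AmazonEC2",
            "versionIndexUrl": "/offers/v1.0/aws/AmazonEC2/index.json",
            "currentVersionUrl": "/offers/v1.0/aws/AmazonEC2/current/index.json",
            "currentRegionIndexUrl": "/offers/v1.0/aws/AmazonEC2/current/region_index.json",
        }
    }
}


def offer_url(index: dict, offer_code: str, key: str = "currentVersionUrl") -> str:
    """Return the absolute URL stored under key for one service."""
    return urljoin(BASE_URL, index["offers"][offer_code][key])


print(offer_url(offer_index, "AmazonEC2", "versionIndexUrl"))
```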

versionIndexUrl

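The versionIndexUrl above lists the addresses of the offer files for every version, including past ones.
https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/index.json

There were 76 versions. The file size keeps growing with each new version, apparently because the Reserved Instance data keeps accumulating. The On-Demand entries seem to have only their update date overwritten, so their count does not appear to change.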

versionIndexUrl

  "versions" : {
    "20180607191619" : {
      "versionEffectiveBeginDate" : "2018-05-01T00:00:00Z",
      "versionEffectiveEndDate" : "2018-06-01T00:00:00Z",
      "offerVersionUrl" : "/offers/v1.0/aws/AmazonEC2/20180607191619/index.json"
    },
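To work through past versions, the version index can be turned into a list ordered by effective date. The sketch below assumes the index has been loaded into a dict and that the date strings always follow the format seen in the excerpt above. list_versions is a hypothetical helper, and the sample dict contains only the one entry shown in this article.

```python
from datetime import datetime, timezone
from urllib.parse import urljoin

BASE_URL = "https://pricing.us-east-1.amazonaws.com"

# The single entry shown above, as it would look after json.load.
version_index = {
    "versions": {
        "20180607191619": {
            "versionEffectiveBeginDate": "2018-05-01T00:00:00Z",
            "versionEffectiveEndDate": "2018-06-01T00:00:00Z",
            "offerVersionUrl": "/offers/v1.0/aws/AmazonEC2/20180607191619/index.json",
        }
    }
}


def list_versions(index: dict) -> list:
    """Return (version_id, begin_datetime, absolute_url) tuples, oldest first."""
    rows = []
    for version_id, info in index["versions"].items():
        # Assumed format: UTC timestamps with a trailing "Z", as in the excerpt.
        begin = datetime.strptime(
            info["versionEffectiveBeginDate"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        rows.append((version_id, begin, urljoin(BASE_URL, info["offerVersionUrl"])))
    return sorted(rows, key=lambda row: row[1])


for version_id, begin, url in list_versions(version_index):
    print(version_id, begin.date(), url)
```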

currentRegionIndexUrl
This is the per-region data. The currentRegionIndexUrl for EC2 looks like this.

currentRegionIndexUrl

  "regions" : {
    "ap-northeast-1" : {
      "regionCode" : "ap-northeast-1",
      "currentVersionUrl" : "/offers/v1.0/aws/AmazonEC2/20221222205132/ap-northeast-1/index.json"
    },
},

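Processing large JSON files

The JSON files are several hundred MB, and the large ones exceed 3 GB. On my PC, just opening such a file freezes the machine, so I used a module called ijson. I referred to the official documentation and a Qiita article.

As an example, I will look for the EC2 instance type m4.xlarge (for no particular reason).

To keep the search cost at O(n) times the number of conditions, I write out the list of matching skus for each condition separately and then take the intersection of the sku lists across conditions. I am not sure this is optimal, but it avoids for loops that would take far too long.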
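The intersection step itself can be sketched in a few lines with Python sets. The helper common_skus and the sku values below are placeholders made up for illustration; real skus come from the offer file.

```python
def common_skus(*sku_lists):
    """Return the skus that appear in every one of the given lists."""
    sets = [set(skus) for skus in sku_lists]
    return set.intersection(*sets) if sets else set()


# Placeholder skus, one list per search condition.
m4_xlarge = ["SKU_A", "SKU_B", "SKU_C"]
tokyo = ["SKU_B", "SKU_C", "SKU_D"]
linux = ["SKU_C", "SKU_E"]

print(common_skus(m4_xlarge, tokyo, linux))
```

Each list is built in one pass over the file, and the set intersection is cheap compared with parsing, so the total work grows with the number of conditions rather than with the number of condition combinations.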

The directory structure is as follows. The Python code is split into two files, but they could be combined into one.

.
├── data/
│   └── offer_20180607191619.json
├── result/
├── save_search.py
└── get_diff.py

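save_search.py takes a search condition and saves the matching skus. Roughly, it parses the JSON file with parse = ijson.parse(file) and loops over the resulting prefix, event, value tuples one at a time. Each tuple is checked against the condition, and on a match the sku is extracted from the prefix with a regex.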

save_search.py

import re
import ijson

version = '20180607191619'

json_data = f'data/offer_{version}.json'


def save(filename, out):
    with open(f'result/{version}-{filename}.txt', 'w') as f:
        for line in out:
            f.write(f"{line}\n")


def search(word_prefix, word_value):
    out = []
    with open(json_data, 'rb') as file:  # ijson expects a binary-mode file
        parse = ijson.parse(file)

        for prefix, _, value in parse:
            if "Reserved" not in prefix and word_prefix in prefix and value == word_value:
                # Get sku
                ma = re.search("products\\.(.*)\\.attributes", prefix)
                if ma is not None:
                    item = ma.groups()[0]
                    out.append(item)
    return out


out = search(".", "m4.xlarge")
save("m4.xlarge", out)

out = search(".", "Asia Pacific (Tokyo)")
save("tokyo-region", out)

out = search("capacitystatus", "Used")
save("capacitystatus", out)

out = search("operatingSystem", "Linux")
save("operatingSystem", out)

out = search("preInstalledSw", "NA")
save("preInstalledSw", out)

out = search("tenancy", "Shared")
save("Shared", out)

This leaves one file per condition in result, each containing the skus that match that condition.

├── result
    ├── 20180607191619-capacitystatus.txt
    ├── 20180607191619-m4.xlarge.txt
    ├── 20180607191619-operatingSystem.txt
    ├── 20180607191619-preInstalledSw.txt
    ├── 20180607191619-Shared.txt
    └── 20180607191619-tokyo-region.txt
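save_search.py re-parses the whole file once per condition. As a possible variation (not what the scripts above do), all conditions could be checked in a single pass. The sketch below keeps the same matching rules as search() but takes any iterable of (prefix, event, value) tuples, so it can be tried on hand-written events; the name collect_skus and the sample events are made up for illustration.

```python
import re
from collections import defaultdict

SKU_PATTERN = re.compile(r"products\.(.*)\.attributes")


def collect_skus(events, conditions):
    """Collect matching skus for every condition in one pass.

    events: iterable of (prefix, event, value) tuples, e.g. from ijson.parse.
    conditions: {name: (word_prefix, word_value)}, same meaning as in search().
    """
    found = defaultdict(set)
    for prefix, _, value in events:
        if "Reserved" in prefix:
            continue
        match = SKU_PATTERN.search(prefix)
        if match is None:
            continue
        for name, (word_prefix, word_value) in conditions.items():
            if word_prefix in prefix and value == word_value:
                found[name].add(match.group(1))
    return found


# Hand-written events with placeholder skus, shaped like ijson.parse output.
sample_events = [
    ("products.SKU_A.attributes.instanceType", "string", "m4.xlarge"),
    ("products.SKU_A.attributes.operatingSystem", "string", "Linux"),
    ("products.SKU_B.attributes.instanceType", "string", "m4.xlarge"),
    ("products.SKU_B.attributes.operatingSystem", "string", "Windows"),
]
conditions = {
    "m4.xlarge": (".", "m4.xlarge"),
    "operatingSystem": ("operatingSystem", "Linux"),
}

found = collect_skus(sample_events, conditions)
# Index by condition name so a condition with no hits yields an empty set.
print(set.intersection(*(found[name] for name in conditions)))

# With the real file (requires ijson):
# with open(json_data, "rb") as f:
#     found = collect_skus(ijson.parse(f), conditions)
```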

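get_diff.py finds the intersection of the skus in those files. If the result does not narrow down to a single sku, go back to the JSON, find a condition that distinguishes the remaining candidates, add it, and repeat.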

get_diff.py

import ijson

version = '20180607191619'


json_data = f'data/offer_{version}.json'


def get_result(file):
    with open(f"result/{file}", "r") as f:
        out = f.readlines()
    return out


def search_price(search_word):
    with open(json_data, 'rb') as file:  # ijson expects a binary-mode file
        parse = ijson.parse(file)

        for prefix, _, value in parse:
            if "Reserved" not in prefix and search_word in prefix:
                if "pricePerUnit.USD" in prefix:
                    print(value)


tokyo = get_result(f"{version}-tokyo-region.txt")
instances = get_result(f"{version}-m4.xlarge.txt")
shared = get_result(f"{version}-Shared.txt")
Linux = get_result(f"{version}-operatingSystem.txt")
preInstalledSw = get_result(f"{version}-preInstalledSw.txt")
capacitystatus = get_result(f"{version}-capacitystatus.txt")

result = set(tokyo) & set(instances) & set(shared) & set(Linux) & set(preInstalledSw) & set(capacitystatus)


if len(result) == 1:
    a = list(result)[0].strip("\n")
    search_price(a)
else:
    print(result)

It prints 0.2580000000 (USD), which matches the price shown when the same options are selected in the console. The sku was found successfully.
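If the price is needed for further processing rather than just printed, the same filter can return the values instead. This is a sketch under assumptions: the function name on_demand_prices is hypothetical, and the sample prefixes only imitate the parts that search_price() checks (the sku, "Reserved", and "pricePerUnit.USD"); the remaining path segments and the Reserved price are placeholders.

```python
from decimal import Decimal


def on_demand_prices(events, sku):
    """Return non-Reserved USD prices for sku as Decimal values."""
    prices = []
    for prefix, _, value in events:
        if "Reserved" in prefix or sku not in prefix:
            continue
        if "pricePerUnit.USD" in prefix:
            # str() keeps this working whether the parser yields str or Decimal.
            prices.append(Decimal(str(value)))
    return prices


# Hand-written events; only 0.2580000000 comes from the result above.
sample_events = [
    ("terms.OnDemand.SKU_A.TERM.priceDimensions.RATE.pricePerUnit.USD",
     "string", "0.2580000000"),
    ("terms.Reserved.SKU_A.TERM.priceDimensions.RATE.pricePerUnit.USD",
     "string", "0.1000000000"),
]

print(on_demand_prices(sample_events, "SKU_A"))
```

Decimal is used so the price keeps its exact decimal digits instead of being rounded to a binary float.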

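Summary

I wrote Python scripts that extract prices from the JSON downloaded via the AWS Price List API
The JSON is huge, so processing it is hard work
(I actually wanted to look at how EC2 prices changed over time, but gave up because there were too many files.)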
