😀

オライリー本の{URL,書影,書名,動物名}一覧をsedとgrepとawkで処理

2021/09/28に公開

はじめに

@gin_135さんの記事を試したが動かなかったのでやってみた。

HTMLダウンロード

まずオライリーアニマル一覧からページをダウンロード。

# どちらを実行しても実行時間はあまり変わらない？
curl -s 'https://www.oreilly.com/animals.csp?x-o='{0..1243..20} > source
# 0.12s user 0.08s system 1% cpu 17.777 total
curl -s 'https://www.oreilly.com/animals.csp?x-o=[0-1243:20]' > source
# 0.11s user 0.09s system 0% cpu 20.113 total

HTML->空白区切り

元記事では動物だけ抽出していたが、それだと以下のように簡単すぎる。

grep -oP '(?<="animal-name">)[^<]+' source

ので、これを以下のように、<URL>\n<書影>\n<書名>\n<動物名>\n\nに整形して列挙したい。

head-20

https://shop.oreilly.com/product/9780596155452.do
https://covers.oreilly.com/images/9780596155452/cat.gif
Mobile Design and Development
12-Wired Bird of Paradise

https://shop.oreilly.com/product/0636920024491.do
https://covers.oreilly.com/images/0636920024491/cat.gif
Windows PowerShell for Developers
3-Banded Armadillo

https://shop.oreilly.com/product/9780596007065.do
https://covers.oreilly.com/images/9780596007065/cat.gif
Jakarta Commons Cookbook
Aardvark

https://shop.oreilly.com/product/0636920029786.do
https://covers.oreilly.com/images/0636920029786/cat.gif
Clojure Cookbook
Aardwolf

本の情報は以下のように列挙されている。

...
<div class="animal-row">
    <a class="book" href="https://shop.oreilly.com/product/9780596155452.do" title="">
      <img class="book-cvr" src="https://covers.oreilly.com/images/9780596155452/cat.gif" />
      <p class="book-title">Mobile Design and Development</p>
    </a>
    <p class="animal-name">12-Wired Bird of Paradise</p>
  </div>
<div class="animal-row">
...

あとは気合。結果的にgrep, sedを1回ずつ使った。

grep -oP '(?<="book" href=")[^"]+|(?<="book-cvr" src=")[^"]+|(?<="book-title">)[^<]+|(?<="animal-name">)[^<]+' source | sed '4~4G'

ワンライナーだと見にくいので改行。

grep -oP '(?<="book" href=")[^"]+'\
'|(?<="book-cvr" src=")[^"]+'\
'|(?<="book-title">)[^<]+'\
'|(?<="animal-name">)[^<]+' source | sed '4~4G'

grepは-oでマッチ部分のみを抽出させる。また-PでPerl正規表現を有効化し、肯定先読みを用いて必要要素のテキスト部のみマッチさせている。

sedは4~4でNR%4==0となる行の末尾にGコマンドで空行を挿入している。

空行区切り->JSON

briandfoy/oreilly-animalsを見習って、以下のような扱いやすいJSON形式に変換したい。

[
  {
    "animal": "12-Wired Bird of Paradise",
    "cover_src": "https://covers.oreilly.com/images/9780596155452/cat.gif",
    "link": "https://shop.oreilly.com/product/9780596155452.do",
    "title": "Mobile Design and Development"
  },
  { ... }
]

空行区切りのデータはawkのRS="";FS="\n"で、通常通り処理できる。
ただし、JSONなのでクオート"は\"のようにエスケープする必要がある。

grep -oP '(?<="book" href=")[^"]+'\
'|(?<="book-cvr" src=")[^"]+'\
'|(?<="book-title">)[^<]+'\
'|(?<="animal-name">)[^<]+' source | sed '4~4G' |
sed 's/"/\\"/g' |
awk 'BEGIN{RS="";FS="\n";printf"["}{printf"{\"animal\":\""$4"\",\"cover_src\":\""$2"\",\"link\":\""$1"\",\"title\":\""$3"\"},"}END{print"]"}' |
sed 's/..$/]/'

出力をjq .に渡すと

head-20
[
  {
    "animal": "12-Wired Bird of Paradise",
    "cover_src": "https://covers.oreilly.com/images/9780596155452/cat.gif",
    "link": "https://shop.oreilly.com/product/9780596155452.do",
    "title": "Mobile Design and Development"
  },
  {
    "animal": "3-Banded Armadillo",
    "cover_src": "https://covers.oreilly.com/images/0636920024491/cat.gif",
    "link": "https://shop.oreilly.com/product/0636920024491.do",
    "title": "Windows PowerShell for Developers"
  },
  {
    "animal": "Aardvark",
    "cover_src": "https://covers.oreilly.com/images/9780596007065/cat.gif",
    "link": "https://shop.oreilly.com/product/9780596007065.do",
    "title": "Jakarta Commons Cookbook"
  },
  {

よさそう。

さいごに

シェルだとサクっと短い記述量で済んで、楽(可読性は…)

GitHubで編集を提案

Discussion

ログインするとコメントできます