Case Studies in Infrastructure Change Management

Authors Wendy Look and Mark Dallman offer an overview of two long-term projects at Google: one to migrate all of Google’s systems from Google File System (GFS) to its successor, Colossus, and the other to move from local disk storage to diskless compute nodes for all jobs.

Kota UENISHI

Infrastructure Change Management

When migrating infrastructure, it pays to do Infrastructure Change Management (ICM) rigorously. MapReduce was deprecated in 2013, and by 2019 the migration was 99% complete.

You can’t ignore the low-probability but highly catastrophic events that can crop up mid-flight. Exercises like the Wheel-of-Misfortune (disaster roleplaying) and DiRT (annual event to push production systems to the limit and inflict actual outages) are good ways to uncover these.

The hardest part is finding the right owner.
As of 2019, Flume was rolled out to over 99% of C++ and Java pipelines, and the Flume support rotation was staffed with 12 engineers.

The case studies covered are:

Moonshot: GFS → Colossus
Local disk → diskless

For each of these case studies, we provide an overview, the project’s impact, the tools and processes used to manage the change, as well as individual lessons learned after each completed change.

Moonshot

2010: The plan to migrate from GFS to Colossus was announced. Colossus was still a prototype. The goal was a complete migration within 2011. User benefits were offered:

occasional hiccup
reduced quota costs
better performance
lots of friendly SRE support

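GFS pain points

It was used heavily by Gmail and others, so even a few minutes of downtime was unacceptable
Chunk locations were held in the GFS Master's RAM
When a GFS Master restarted, the affected chunkservers were unavailable for 10 to 30 minutes
The GFS Master was essentially single-threaded, so performance was limited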

Colossus

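The first implementation existed by 2006
It was developed as the next-generation backend for Bigtable
By 2007 it was clear that it could become a replacement for GFS
In January 2008 the first video streaming started (it worked)
In response to "isn't this too rushed?": "even to the point of taking outages, so long as we don’t lose Customer data."
The team, which had been 4 people, became "a dedicated 14–18 SRE team members per site to support the Colossus storage layer after the migration."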

Steamroller Project

Tools used in Moonshot

Quota and Storage usage dashboard
Quota move service
Migration planning and scheduling tool
Migration tracking tool
Bulk data migration service

Process

Weekly check-in meeting
Split the work into phases

Video and Bigtable were the first customers on Colossus since GFS limitations hit them heavily and they were actively looking for a replacement storage system. Migrating these two early adopters helped the Moonshot team realize the time it would take to migrate a service at a per cell level and the tactical steps necessary for the migration (e.g., how to turn up a Colossus production and test cell, when to add and remove quota, etc.).

Occasionally picked projects to support and worked out plans for them
Set up a variety of communication channels
- Various mailing lists
- 1:1 Office Hour for use cases
- creating an exception procedure for folks who could not migrate by the targeted deadline
- FAQ
- instructions
- forms for feedback and feature requests

Capacity Planning

Peeked at D server metadata from the outside
This looks pretty heavy-handed (lol)

Steamroller project

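A dedicated team was set up to work out how to secure resources for Colossus
Each team's Borg job resources were trimmed down to an appropriate size
Sometimes the trimming was done wrong
There was also criticism that the deadlines were heavy-handed
At least 4 TPMs were estimated to be needed, but the project started with 2; there were not enough people
There were upsides too: Steamroller got Moonshot moving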
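
The right-sizing of Borg job resources noted in this section can be pictured with a small sketch. This is only an illustration under stated assumptions, not how Steamroller actually worked: the `JobUsage` record, the `rightsize` and `reclaimed` functions, and the 20% headroom default are all hypothetical names and values invented for this example. The sketch trims a reservation toward the observed peak plus some headroom and never raises it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobUsage:
    """Hypothetical record of what a job reserves versus what it uses."""
    name: str
    reserved_cpu: float  # cores currently reserved
    peak_cpu: float      # highest observed usage, in cores


def rightsize(job: JobUsage, headroom: float = 0.2) -> float:
    """Return a trimmed CPU reservation: observed peak plus headroom.

    The reservation is only ever trimmed, never increased. The headroom
    fraction is an assumed safety margin; choosing it too small is one
    way a trim can go wrong.
    """
    if headroom < 0:
        raise ValueError("headroom must be non-negative")
    target = job.peak_cpu * (1.0 + headroom)
    return min(job.reserved_cpu, target)


def reclaimed(jobs: list[JobUsage], headroom: float = 0.2) -> float:
    """Total cores freed by trimming every job in the list."""
    return sum(j.reserved_cpu - rightsize(j, headroom) for j in jobs)
```

A job that reserves 10 cores but peaks at 5 would be trimmed to about 6 cores with the default headroom, while a job already running at its reservation is left alone. Keeping the headroom as an explicit parameter makes the trade-off between reclaimed capacity and risk visible to whoever runs the trim.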

Lessons learned

The Moonshot project forced all teams to migrate by the target deadline.
The Moonshot team was comprised of 20%ers: people in various roles came together from various teams. There were not enough people, but you have to make do with whoever is there.
Each change as part of the Moonshot project caused a rippling effect of customer frustration: it is best to spread information through as many channels as possible to reassure people.

Diskless

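HDDs keep getting relatively slower, and the amount of data "trapped" inside disks grows every year
Because disk and CPU sit in the same machine, data reads contend with each other and performance suffers
Because they sit in the same machine, each gets caught up in the other's failures: separating them improved the failure rate by 25 to 30%
Separating disk and CPU made it possible to plan investment for each independently: lower TCO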

Google historically had two kinds of servers

Index: SATA x2
Diskfull: SATA x6 or more

Migration approach: two paths to Diskless:

(1) an explicit conversion for sophisticated teams that preferred direct control and
(2) an “autoconversion” option for all others.

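The automation tools worked well. Monitoring of user migrations was initially managed in a spreadsheet.
The various management tools, documentation, and dashboards only came together in 2016 (they were built after problems had already occurred), but by then everyone was burned out from answering support questions day after day.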

What did not go well

staffing, planning, communication, and risk management.

Staffing: In 2016 a VP was brought in and TPMs were added, but by that point two-thirds of the Borg jobs had already been migrated.

Planning: