
Solo MongoDB University / M121 Aggregation Framework (1)

Published 2021/04/04

MongoDB

tech

This post continues my log of working through MongoDB University courses, which I originally started in Advent-calendar format!

Current course

M121: The MongoDB Aggregation Framework
- https://university.mongodb.com/mercury/M121/2021_March_16/overview

This course takes a deep dive into Aggregation.

Chapter: 0 / Introduction and Aggregation Concepts

Introduction to the MongoDB Aggregation Framework

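The introduction to this course.

The Aggregation Framework is one of MongoDB's key concepts
- It provides features for filtering, transforming, aggregating, and analyzing data
This is one of the courses aimed at developers!
Please also post and share your knowledge in the discussion forum!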

Atlas Requirement

The course uses data hosted on MongoDB Atlas, so check that you can connect.

The connection URL is provided during the training


# Connect to the training MongoDB cluster and check
MongoDB Enterprise Cluster0-shard-0:PRIMARY> show collections
air_airlines
air_alliances
air_routes
bronze_banking
child_reference
customers
employees
exoplanets
gold_banking
icecream_data
movies
nycFacilities
parent_reference
silver_banking
solarSystem
stocks
system.profile

The Concept of Pipelines

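The concept of pipelines.
A pipeline literally chains individual processing steps together, one after another.
Each processing step is called a stage.

The following are examples of pipeline stages.

$match (filters documents)
$project (selects which fields to output and computes new fields from existing values)
$group (aggregates documents)

Even a complex pipeline is, at its core, a chain of stages.
There are a few exceptions, which are covered later in the course.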
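
To make the "chain of stages" idea concrete for myself, here is a toy model in plain JavaScript (runnable in Node.js, no MongoDB needed). The helpers match, project, groupCount and runPipeline are hypothetical names I made up, and the sample documents are invented. This is only a sketch of the idea, not how MongoDB implements it.

```javascript
// Toy in-memory model: each "stage" is a function that takes an array of
// documents and returns a new array. Hypothetical helpers, not MongoDB APIs.
const match = (predicate) => (docs) => docs.filter(predicate);
const project = (fields) => (docs) =>
  docs.map((doc) => Object.fromEntries(fields.map((f) => [f, doc[f]])));
const groupCount = (key) => (docs) => {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[key], (counts.get(doc[key]) || 0) + 1);
  }
  return [...counts].map(([_id, count]) => ({ _id, count }));
};

// Run the stages in order, feeding each stage's output into the next one.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Invented sample documents, for illustration only.
const sampleDocs = [
  { name: "Earth", type: "Terrestrial planet" },
  { name: "Mars", type: "Terrestrial planet" },
  { name: "Jupiter", type: "Gas giant" },
  { name: "Sun", type: "Star" },
];

const result = runPipeline(sampleDocs, [
  match((doc) => doc.type !== "Star"), // like $match
  project(["name", "type"]),           // like $project
  groupCount("type"),                  // like $group with a count
]);
console.log(result);
```

Each stage only sees what the previous stage returned, which is the property the course keeps coming back to.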

Quiz: The Concept of Pipelines

Q. Which statements about the Aggregation Framework are true?
A. Answers

Documents flow through the pipeline, passing one stage to the next
- Each document moves from stage to stage as the pipeline is processed
The Aggregation Framework provides us many stages to filter and transform our data.
- The Aggregation Framework offers stages for filtering and transforming data

Aggregation Structure and Syntax

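Basic rule: aggregate() takes the pipeline as an array
- The array elements are the stages (one or more)
- After the pipeline array, an optional options document can be passed as the second argument

# The syntax is as follows
# db.collectionName.aggregate([{ stage1 }, { stage2 }], { options })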

# Fetch one sample document
db.employees.findOne()
{
	"_id" : ObjectId("59d288690e3733b153a9396c"),
	"employee_ID" : "08905606-e8e0-43ef-ad9e-6bcfc7f6c5b3",
	"acl" : [
		"HR",
		"Management",
		"Finance",
		"Executive"
	],
	"employee_compensation" : {
		"acl" : [
			"Management",
			"Finance",
			"Executive"
		],
		"salary" : 150832,
		"stock_award" : 2312,
		"programs" : {
			"acl" : [
				"Finance",
				"Executive"
			],
			"401K_contrib" : 0.2,
			"health_plan" : true,
			"spp" : 0.01
		}
	},
	"employee_grade" : 1,
	"team" : "Yellow",
	"age" : 63,
	"first_name" : "Lucile",
	"last_name" : "Sharpe",
	"gender" : "female",
	"phone" : "+1 (829) 527-2881",
	"address" : "127 Maple Street, Rosewood, Arizona, 87080"
}

# Use the aggregation $match stage to fetch the document above
db.employees.aggregate([ { $match: { "first_name" : "Lucile", "last_name" : "Sharpe" } }]).pretty()
{
	"_id" : ObjectId("59d288690e3733b153a9396c"),
	"employee_ID" : "08905606-e8e0-43ef-ad9e-6bcfc7f6c5b3",
	"acl" : [
		"HR",
		"Management",
		"Finance",
		"Executive"
	],
	"employee_compensation" : {
		"acl" : [
			"Management",
			"Finance",
			"Executive"
		],
		"salary" : 150832,
		"stock_award" : 2312,
		"programs" : {
			"acl" : [
				"Finance",
				"Executive"
			],
			"401K_contrib" : 0.2,
			"health_plan" : true,
			"spp" : 0.01
		}
	},
	"employee_grade" : 1,
	"team" : "Yellow",
	"age" : 63,
	"first_name" : "Lucile",
	"last_name" : "Sharpe",
	"gender" : "female",
	"phone" : "+1 (829) 527-2881",
	"address" : "127 Maple Street, Rosewood, Arizona, 87080"
}

# Also use $project to narrow down the fields
db.employees.aggregate([ { $match: { "first_name" : "Lucile", "last_name" : "Sharpe" } }, { $project: { "first_name": 1, "last_name": 1, _id: 1, employee_ID: 1 } }]).pretty()
{
	"_id" : ObjectId("59d288690e3733b153a9396c"),
	"employee_ID" : "08905606-e8e0-43ef-ad9e-6bcfc7f6c5b3",
	"first_name" : "Lucile",
	"last_name" : "Sharpe"
}
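
The examples so far pass only the pipeline array. As a sketch of the two-argument form from the syntax rule, the snippet below also passes an options document. allowDiskUse is a real aggregate() option, but choosing it here is just my example. The typeof db guard is my own addition so the file also runs outside the mongo shell.

```javascript
// Pipeline: an array of stage documents (first argument).
const pipeline = [
  { $match: { team: "Yellow" } },
  { $project: { _id: 0, first_name: 1, last_name: 1 } },
];

// Options: a plain document passed as the second argument.
const options = { allowDiskUse: true };

if (typeof db !== "undefined") {
  // Only runs inside the mongo shell, where `db` is defined.
  printjson(db.employees.aggregate(pipeline, options).toArray());
}
```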

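About operators

Keywords that start with a $ sign ($xxxx)
Aggregation operators are the ones given at the top of each stage
- e.g. for { $match: { "first_name" : "Lucile" } } it is $match
Query operators are used inside a stage to specify which documents it should process

# Inside $project, $concat is an expression operator (not a query operator)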
db.employees.aggregate([ { $match: { "first_name" : "Lucile", "last_name" : "Sharpe" } },
   { $project:
      { "first_name": 1,
        "last_name": 1,
        _id: 1,
        employee_ID: 1,
        Name: { $concat: ["$first_name", " ", "$last_name" ] }
      }
    }
  ]).pretty()
{
	"_id" : ObjectId("59d288690e3733b153a9396c"),
	"employee_ID" : "08905606-e8e0-43ef-ad9e-6bcfc7f6c5b3",
	"first_name" : "Lucile",
	"last_name" : "Sharpe",
	"Name" : "Lucile Sharpe"
}
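
To keep the operator kinds straight, I find it helps to annotate one pipeline. The sketch below is plain JavaScript (runnable in Node.js). The field names come from the employees sample document, while the filter values (age 60, team "Yellow") are arbitrary choices of mine.

```javascript
const pipeline = [
  // $match is the stage (aggregation) operator;
  // $gte and $in inside it are query operators.
  { $match: { age: { $gte: 60 }, team: { $in: ["Yellow"] } } },
  // $project is the stage operator;
  // $concat inside it is an expression operator.
  {
    $project: {
      _id: 0,
      Name: { $concat: ["$first_name", " ", "$last_name"] },
    },
  },
];

// The stage operator is always the single top-level key of each stage document.
const stageOperators = pipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageOperators); // [ '$match', '$project' ]
```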

Quiz: Aggregation Structure and Syntax

Q. Which of the following statements is true?
A. Answers
- An aggregation pipeline is an array of stages.
- Some expressions can only be used in certain stages.

Some syntax can only be used in specific stages!

Chapter 1: Basic Aggregation

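Next is the chapter on aggregation basics, starting with $match.

$match: Filtering documents (video)

Used to filter documents
It can be used at various points in a pipeline, but there are a few caveats
Filtering as early as possible, ideally in the first stage with a query that can use an index, makes processing faster
The syntax is the same as db.collectionName.find({...})

# Documents whose type is not Star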
db.solarSystem.aggregate([
  { $match: { type: { $ne: "Star"} } }
]).pretty()

# Showing only one of the many results
{
	"_id" : ObjectId("59a06674c8df9f3cd2ee7d54"),
	"name" : "Earth",
	"type" : "Terrestrial planet",
	"orderFromSun" : 3,
	"radius" : {
		"value" : 6378.137,
		"units" : "km"
	},
	"mass" : {
		"value" : 5.9723e+24,
		"units" : "kg"
	},
	"sma" : {
		"value" : 149600000,
		"units" : "km"
	},
	"orbitalPeriod" : {
		"value" : 1,
		"units" : "years"
	},
	"eccentricity" : 0.0167,
	"meanOrbitalVelocity" : {
		"value" : 29.78,
		"units" : "km/sec"
	},
	"rotationPeriod" : {
		"value" : 1,
		"units" : "days"
	},
	"inclinationOfAxis" : {
		"value" : 23.45,
		"units" : "degrees"
	},
	"meanTemperature" : 15,
	"gravity" : {
		"value" : 9.8,
		"units" : "m/s^2"
	},
	"escapeVelocity" : {
		"value" : 11.18,
		"units" : "km/sec"
	},
	"meanDensity" : 5.52,
	"atmosphericComposition" : "N2+O2",
	"numberOfMoons" : 1,
	"hasRings" : false,
	"hasMagneticField" : true
},
....

# Get the count
db.solarSystem.aggregate([
  { $match: { type: { $ne: "Star"} } },
  { $count: "planets" }
])

# 8 documents
{ "planets" : 8 }
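
Since $match takes the same filter syntax as find(), one filter document can be reused for both. A minimal sketch follows. The typeof db guard is my own addition so the file also runs outside the mongo shell. Inside the shell I would expect both calls to print the same count.

```javascript
// One filter document, shared by find() and $match.
const filter = { type: { $ne: "Star" } };
const pipeline = [{ $match: filter }];

if (typeof db !== "undefined") {
  // Only runs inside the mongo shell, where `db` is defined.
  print(db.solarSystem.find(filter).itcount());
  print(db.solarSystem.aggregate(pipeline).itcount());
}
```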

Quiz: $match: Filtering documents

Q. Which of the following is/are true of the $match stage?
A. Answers

It uses the familiar MongoDB query language.
- $match uses the same query syntax as regular MongoDB queries such as find()
It should come very early in an aggregation pipeline.
- It is important to use it as early as possible in the pipeline, ideally in the first stage, to narrow down the data

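$match and $project / Lab - $match (exercise)

This is the preparation (a sanity check) before the exercise.

Download the practice data
Connect to the practice MongoDB Atlas cluster
Once connected, run the db command to check which database you are working with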

m121 % tree
.
└── chapter1
    ├── validateLab1.js
    └── validateLab2.js

1 directory, 2 files

# Check the contents
chapter1 % cat validateLab1.js
var validateLab1 = pipeline => {
  let aggregations = db.getSiblingDB("aggregations")
  if (!pipeline) {
    print("var pipeline isn't properly set up!")
  } else {
    try {
      var result = aggregations.movies.aggregate(pipeline).toArray().length
      let sentinel = result
      let data = 0
      while (result != 1) {
        data++
        result = result % 2 === 0 ? result / 2 : result * 3 + 1
      }
      if (sentinel === 23) {
        print("Answer is", data)
      } else {
        print("You aren't returning the correct number of documents")
      }
    } catch (e) {
      print(e.message)
    }
  }
}
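
Reading the script, the while loop looks like the Collatz iteration: halve the number if it is even, otherwise triple it and add one, counting steps until it reaches 1. The answer is printed only when the pipeline returned exactly 23 documents. A standalone sketch of just that loop, in plain JavaScript for Node.js (collatzSteps is my own hypothetical name):

```javascript
// Count how many Collatz steps it takes to get from n down to 1.
const collatzSteps = (n) => {
  let steps = 0;
  while (n !== 1) {
    n = n % 2 === 0 ? n / 2 : n * 3 + 1;
    steps++;
  }
  return steps;
};

console.log(collatzSteps(23)); // 15
```

One thing I noticed: as written, validateLab1 runs this loop before checking the document count, so a pipeline that returns 0 documents would apparently never leave the loop (0 is even, and 0 / 2 stays 0).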

Connecting.

# Check the current database
MongoDB Enterprise Cluster0-shard-0:PRIMARY> db
aggregations

After connecting to the cluster, ensure you can see the movies collection by
typing show collections, and then run the command db.movies.findOne() to display one document.
Take a moment to familiarize yourself with the schema of the movies documents.


MongoDB Enterprise Cluster0-shard-0:PRIMARY> show collections
air_airlines
air_alliances
air_routes
bronze_banking
child_reference
customers
employees
exoplanets
gold_banking
icecream_data
movies
nycFacilities
parent_reference
silver_banking
solarSystem
stocks
system.profile
MongoDB Enterprise Cluster0-shard-0:PRIMARY> db.movies.findOne()
{
	"_id" : ObjectId("573a1390f29313caabcd4cf1"),
	"title" : "Ingeborg Holm",
	"year" : 1913,
	"runtime" : 96,
	"released" : ISODate("1913-10-27T00:00:00Z"),
	"cast" : [
		"Hilda Borgström",
		"Aron Lindgren",
		"Erik Lindholm",
		"Georg Grönroos"
	],
	"poster" : "http://ia.media-imdb.com/images/M/MV5BMTI5MjYzMTY3Ml5BMl5BanBnXkFtZTcwMzY1NDE2Mw@@._V1_SX300.jpg",
	"plot" : "Ingeborg Holm's husband opens up a grocery store and life is on the sunny side for them and their three children. But her husband becomes sick and dies. Ingeborg tries to keep the store, ...",
	"fullplot" : "Ingeborg Holm's husband opens up a grocery store and life is on the sunny side for them and their three children. But her husband becomes sick and dies. Ingeborg tries to keep the store, but because of the lazy, wasteful staff she eventually has to close it. With no money left, she has to move to the poor-house and she is separated from her children. Her children are taken care of by foster-parents, but Ingeborg simply has to get out of the poor-house to see them again...",
	"lastupdated" : "2015-08-25 00:11:47.743000000",
	"type" : "movie",
	"directors" : [
		"Victor Sjöström"
	],
	"writers" : [
		"Nils Krok (play)",
		"Victor Sjöström"
	],
	"imdb" : {
		"rating" : 7,
		"votes" : 493,
		"id" : 3014
	},
	"countries" : [
		"Sweden"
	],
	"genres" : [
		"Drama"
	]
}

$match and $project / Lab - $match (exercise, continued)

The actual exercise starts here.

Help MongoDB pick a movie our next movie night!
Based on employee polling, we've decided that potential movies must
meet the following criteria.

Filter the data down to the documents that meet the following conditions!

imdb.rating is at least 7
- { "imdb.rating": { $gte: 7 } }
genres does not contain "Crime" or "Horror"
- { genres: { $nin: [ "Crime", "Horror" ] } }
rated is either "PG" or "G"
- { rated: { $in: [ "PG", "G" ] } }
languages contains "English" and "Japanese"
- { languages: { $all: [ "English", "Japanese" ] } }

As a hint, your aggregation should return 23 documents.

Reference

$text is an operator that searches fields that have a text index.

db.movies.aggregate([ { $match: { "imdb.rating": { $gt: 7 } } }, { $count: "title" } ])
{ "title" : 13483 }

db.movies.aggregate([ { $match: { genres: { $in: [ "Crime", "Horror" ] } } }, { $count: "title" } ])
{ "title" : 9383 }

db.movies.aggregate([ { $match: { rated: { $nin: [ "PG", "G" ] } } }, { $count: "title" } ])
{ "title" : 40854 }

# $in matches if any one of the values is present; use $all when every value must be present
db.movies.aggregate([ { $match: { languages: { $all: [ "English", "Japanese" ] } } }, { $count: "title" } ])
{ "title" : 533 }
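
The difference between $in, $nin and $all on an array field tripped me up, so here is a small plain-JavaScript sketch (runnable in Node.js) of how I understand them. matchesIn, matchesNin and matchesAll are hypothetical helpers of mine, not MongoDB APIs, and the language arrays are invented. The sketch only covers fields that are arrays and ignores the case where the field is missing.

```javascript
// Hypothetical helpers that mimic $in, $nin and $all for an array field.
// $in: at least one field value appears in the candidate list.
const matchesIn = (fieldValues, candidates) =>
  fieldValues.some((value) => candidates.includes(value));
// $nin: none of the field values appears in the candidate list.
const matchesNin = (fieldValues, candidates) =>
  !matchesIn(fieldValues, candidates);
// $all: every candidate appears among the field values.
const matchesAll = (fieldValues, candidates) =>
  candidates.every((value) => fieldValues.includes(value));

// Invented sample values for a movie's languages field.
const englishOnly = ["English"];
const multilingual = ["English", "Japanese", "French"];
const wanted = ["English", "Japanese"];

console.log(matchesIn(englishOnly, wanted));   // true: one value matches
console.log(matchesAll(englishOnly, wanted));  // false: "Japanese" is missing
console.log(matchesAll(multilingual, wanted)); // true: both values are present
console.log(matchesNin(multilingual, ["Crime", "Horror"])); // true: no listed value appears
```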


db.movies.aggregate([ { $match: { "$and":[{ rated: { $nin: [ "PG", "G" ] } }, { languages: { $all: [ "English", "Japanese" ] } } } ]}, { $count: "title" } ])

Now combine all of the conditions above.


var pipeline = [
  { $match: { $and:[
      { "imdb.rating": { $gte: 7 } },
      { genres: { $nin: [ "Crime", "Horror" ] } },
      { rated: { $in: [ "PG", "G" ] } },
      { languages: { $all: [ "English", "Japanese" ] } }
  ] } },
  { $count: "title" }
]

db.movies.aggregate(pipeline)

# 23 documents, just as the hint says!
{ "title" : 23 }

var pipeline = [
  { $match: { $and:[
      { "imdb.rating": { $gte: 7 } },
      { genres: { $nin: [ "Crime", "Horror" ] } },
      { rated: { $in: [ "PG", "G" ] } },
      { languages: { $all: [ "English", "Japanese" ] } }
  ] } }
]

# The count can also be checked with itcount()
db.movies.aggregate(pipeline).itcount()
23

Using the pipeline defined above, run the validation script.

load('validateLab1.js')
true

validateLab1(pipeline)
Answer is 15

The answer, 15, was correct!
The model-answer pipeline is below. When all of the conditions are combined with AND, there is no need to use $and.


# Model answer
var pipeline = [
  {
    $match: {
      "imdb.rating": { $gte: 7 },
      genres: { $nin: [ "Crime", "Horror" ] } ,
      rated: { $in: ["PG", "G" ] },
      languages: { $all: [ "English", "Japanese" ] }
    }
  }
]

# My answer
var pipeline = [
  { $match: { $and:[
      { "imdb.rating": { $gte: 7 } },
      { genres: { $nin: [ "Crime", "Horror" ] } },
      { rated: { $in: [ "PG", "G" ] } },
      { languages: { $all: [ "English", "Japanese" ] } }
  ] } }
]

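Notes from this session

There are seven chapters, and I can't really understand this material without working through it myself, so having hands-on exercises helps.

Covered up to the basics of $match
For query operators, look them up carefully in the reference....
A $match stage can contain multiple conditions
- Unless specified otherwise, they are combined with AND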
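
A small plain-JavaScript sketch (runnable in Node.js) of the implicit AND versus explicit $and forms, plus one caveat I believe is worth remembering: a filter is just a JavaScript object, so repeating the same key silently keeps only the last value. The conditions come from the lab. The repeatedKey example is my own illustration.

```javascript
// Implicit AND: conditions on different fields listed in one filter document.
const implicitAnd = {
  "imdb.rating": { $gte: 7 },
  rated: { $in: ["PG", "G"] },
};

// Explicit AND: the same conditions wrapped in $and.
const explicitAnd = {
  $and: [{ "imdb.rating": { $gte: 7 } }, { rated: { $in: ["PG", "G"] } }],
};

// Caveat: an object literal keeps only the last value for a repeated key,
// so two separate conditions on the same field need $and
// (or a single merged condition).
const repeatedKey = {
  genres: { $nin: ["Crime"] },
  genres: { $nin: ["Horror"] },
};
console.log(Object.keys(repeatedKey).length); // 1
console.log(repeatedKey.genres.$nin);         // [ 'Horror' ]
```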

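It is taking time, but I am gradually getting the hang of it. There is a lot of depth here!

* By the way, I finished the M100 course on 3/25!
That course was also quite interesting, since it draws comparisons with RDBMS and SQL along the way!

Also, I usually use VSCode, but I found out that IntelliJ IDEA (DataGrip) can also use MongoDB as a data source, so I have been trying that out.

I plan to keep plugging away at this steadily.
About a month and a half left until the deadline.....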