☕️

Solo MongoDB University / M100 MongoDB for SQL Pros (2)

Published 2021/03/20

MongoDB

tech

This post continues my notes on the MongoDB University courses, which I started in an Advent-calendar format!

Current course

M100: MongoDB for SQL Pros
- https://university.mongodb.com/courses/M100/about

This course teaches MongoDB by contrasting it with what you already know about RDBMS (SQL).

Following on from the previous notes, this post covers Chapter 2 and Chapter 3.

Chapter 2: Modeling for MongoDB

"In RDBMS, indexes are built on any number of columns in a table.

In MongoDB, because there is far less need to bring data together, indexes are mainly used for filtering and sorting, rarely for joining. This leads to lower memory usage for indexes."

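In an RDBMS, indexes are needed not only to fetch data from a single table but also for joins. If different use cases call for multiple indexes, those indexes need disk space and memory too.
MongoDB can be pictured as holding small sub-tables inside a collection, so there are fewer joins and the memory cost of indexes stays lower.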

BSON Specification
- MongoDB actually uses BSON (Binary JSON, a binary-encoded form of JSON) rather than plain JSON
- Ref. https://docs.mongodb.com/manual/core/document/

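MongoDB is usually deployed in a distributed setup, so updating multiple tables (collections) inside a transaction, as you would in an RDB, incurs latency from network communication and geographic distance.
It is best to group the data you want to retrieve at once into a single document, matching how it is actually used.
Take this into account when modeling.

RDB modeling starts from ER diagrams and normalization, whereas MongoDB modeling starts from the workload.
Begin with the data that should be fetched together.
Model it in a shape that is easy to summarize and aggregate.
A single document is the basic unit.
Then decide, as needed, whether to embed related data or link to it.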
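As a rough illustration of the embed-or-link decision above, here is a minimal sketch in plain JavaScript. The blog-post data and every field name (`title`, `comments`, `postId`) are assumptions made up for this example, not something from the course. It shows the same data in both shapes, and a helper that rebuilds the embedded shape from the linked one, which is the extra work the application (or a join) has to do when data is linked.

```javascript
// Hypothetical data; all names here are illustrative assumptions.

// Option A: embed. One read returns the post together with its comments.
const embeddedPost = {
  _id: 1,
  title: "Modeling for MongoDB",
  comments: [
    { author: "alice", text: "Nice summary" },
    { author: "bob", text: "Thanks!" },
  ],
};

// Option B: link. Comments live in their own collection and point back
// to the post through postId, so reading a post needs a second lookup.
const linkedPost = { _id: 1, title: "Modeling for MongoDB" };
const comments = [
  { _id: 10, postId: 1, author: "alice", text: "Nice summary" },
  { _id: 11, postId: 1, author: "bob", text: "Thanks!" },
];

// Rebuild the embedded shape from the linked one (the work a join would do).
function assemblePost(post, allComments) {
  const own = allComments
    .filter((c) => c.postId === post._id)
    .map(({ author, text }) => ({ author, text }));
  return { ...post, comments: own };
}

const assembled = assemblePost(linkedPost, comments);
console.log(JSON.stringify(assembled) === JSON.stringify(embeddedPost));
```

With embedding, the workload "show a post with its comments" is a single document read. With linking, it costs an extra query plus the assembly step.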

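Schema design anti-patterns. Within a document, related data is better kept as subdocuments or arrays rather than normalized, but watch out when an array holds too many elements.
(Also consider whether the data gets updated.)

If the related information is an array, splitting it out into a separate collection and linking to it is also an option.
Note, however, that once it is split out into its own collection, it is indexed "per document".

Index size is not negligible either, so if the data you want to split out is only ever referenced from the parent document, keeping it embedded is one approach.

IoT data, that is, time-series data tied to a device, stored with the bucket pattern is an example of this.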

Ref. https://www.mongodb.com/blog/post/building-with-patterns-the-bucket-pattern
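The bucket pattern referenced above can be sketched roughly as follows. This is plain JavaScript with no database involved, and the field names (`deviceId`, `ts`, `temp`, `startHour`, `count`, `sumTemp`, `measurements`) are assumptions for illustration, as is the choice of one bucket per device per hour. The idea is that many small readings collapse into a few documents, so there are far fewer documents (and index entries) to maintain.

```javascript
// Group raw sensor readings into one bucket document per device per hour.
// All field names are hypothetical; bucket size (1 hour) is an assumption.
function bucketReadings(readings) {
  const buckets = new Map();
  for (const { deviceId, ts, temp } of readings) {
    const hour = new Date(ts);
    hour.setUTCMinutes(0, 0, 0); // truncate to the start of the hour (UTC)
    const startHour = hour.toISOString();
    const key = `${deviceId}|${startHour}`;
    if (!buckets.has(key)) {
      buckets.set(key, { deviceId, startHour, count: 0, sumTemp: 0, measurements: [] });
    }
    const bucket = buckets.get(key);
    bucket.measurements.push({ ts, temp });
    bucket.count += 1; // pre-aggregated values make averages cheap to read
    bucket.sumTemp += temp;
  }
  return [...buckets.values()];
}

const readings = [
  { deviceId: "d1", ts: "2021-03-20T10:05:00.000Z", temp: 20 },
  { deviceId: "d1", ts: "2021-03-20T10:35:00.000Z", temp: 22 },
  { deviceId: "d1", ts: "2021-03-20T11:01:00.000Z", temp: 21 },
];

const buckets = bucketReadings(readings);
console.log(buckets.length); // three readings end up in two hourly buckets
```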

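Data can be stored flexibly as JSON (BSON), yet it can also be properly validated. Validation follows JSON Schema and applies to nested data as well, and you can configure whether a violation produces a warning or an error, and whether validation applies on insert and on update.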

Ref. http://json-schema.org
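To make the validation settings above concrete, here is a sketch of the options object one might pass when creating a collection. `validator` with `$jsonSchema`, `validationLevel`, and `validationAction` are MongoDB's schema-validation options; the collection name `people` and the schema itself (`name`, `age`, and a nested `address.city`) are assumptions for illustration. Nothing here talks to a server; the call that would use the options is left as a comment.

```javascript
// Hypothetical schema for a "people" collection; fields are assumptions.
const peopleCollectionOptions = {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "age"],
      properties: {
        name: { bsonType: "string" },
        age: { bsonType: "int", minimum: 0 },
        // Nested documents are validated the same way.
        address: {
          bsonType: "object",
          properties: {
            city: { bsonType: "string" },
          },
        },
      },
    },
  },
  // "strict" checks every insert and update; "moderate" skips updates
  // to existing documents that were already invalid.
  validationLevel: "moderate",
  // "error" rejects the write; "warn" lets it through and logs the violation.
  validationAction: "warn",
};

// With the Node.js driver this would be used roughly like:
//   await db.createCollection("people", peopleCollectionOptions);
console.log(JSON.stringify(peopleCollectionOptions, null, 2));
```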

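Maintaining integrity. In MongoDB, uniqueness can be guaranteed with indexes and domain integrity with JSON Schema. As for referential integrity, there are no reference constraints or cascading deletes across collections, but embedding the data in the model is another way to ensure it.

For stored-procedure-like behavior, besides handling it in application code, it seems triggers can be configured on the Atlas platform.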

Ref. https://docs.atlas.mongodb.com/triggers/

Chapter 3: Code and Queries

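Compared with SQL, MQL is mainly about CRUD operations on data; complex retrieval and transformation are done with the Aggregation pipeline.
Because table joins are largely absent, you grab data with one broad condition and then aggregate and filter it pipeline-style.
This makes it particularly easy to debug.

There is a comparison table between SQL and MQL & Aggregation.
For example, a partial match in the style of LIKE (like "%xxx%") is done with pattern matching (regular expressions).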

Ref. https://docs.mongodb.com/manual/reference/sql-comparison/index.html
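As a small sketch of the LIKE-to-pattern-matching mapping mentioned above, the helpers below build MQL filter objects that use the `$regex` operator. The helper names (`escapeRegex`, `containsFilter`, `startsWithFilter`) and the field name `name` are assumptions for this example. Escaping matters because a raw search string may contain regex metacharacters.

```javascript
// Escape regex metacharacters so the search text is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// SQL: WHERE <field> LIKE '%text%'
function containsFilter(field, text) {
  return { [field]: { $regex: escapeRegex(text) } };
}

// SQL: WHERE <field> LIKE 'text%'
function startsWithFilter(field, text) {
  return { [field]: { $regex: "^" + escapeRegex(text) } };
}

const filter = containsFilter("name", "a.b");
console.log(filter);

// The same pattern can be checked locally with a plain RegExp.
const pattern = new RegExp(filter.name.$regex);
console.log(pattern.test("xxa.bxx"), pattern.test("xxaXbxx"));
```

The resulting object could be passed as the filter to `find()`, or used inside a `$match` stage.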

In SQL, nested queries are interpreted from the inside out. A MongoDB aggregation is processed from top to bottom.
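The top-to-bottom order can be pictured with a toy model in plain JavaScript: each stage receives the output of the stage above it. This is only an illustration, not how the server is implemented. The three stand-in stages below support just enough to run one example (`$match` with a single `$gt` condition, `$project` by field name, `$limit`), and the sample documents are made up.

```javascript
// Toy stand-ins for three stages; real stages support far more.
const stages = {
  // Only handles conditions of the form { field: { $gt: value } }.
  $match: (docs, cond) =>
    docs.filter((d) =>
      Object.entries(cond).every(([field, { $gt }]) => d[field] > $gt)
    ),
  // Keeps only the listed fields.
  $project: (docs, fields) =>
    docs.map((d) =>
      Object.fromEntries(Object.keys(fields).map((k) => [k, d[k]]))
    ),
  $limit: (docs, n) => docs.slice(0, n),
};

// Run the stages in array order, feeding each result to the next stage.
function runPipeline(docs, pipeline) {
  return pipeline.reduce((current, stage) => {
    const [name, arg] = Object.entries(stage)[0];
    return stages[name](current, arg);
  }, docs);
}

// Made-up sample documents.
const theaters = [
  { theaterId: 128, state: "NV" },
  { theaterId: 8039, state: "CA" },
  { theaterId: 8060, state: "TX" },
];

const result = runPipeline(theaters, [
  { $match: { theaterId: { $gt: 5000 } } },
  { $project: { theaterId: 1 } },
  { $limit: 1 },
]);
console.log(result);
```

Because every stage has a well-defined input and output, you can cut the pipeline off after any stage and inspect the intermediate result, which is what makes debugging straightforward.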

MongoDB Aggregation Framework Queries

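For simple retrieval of documents from a single collection, use find()
It corresponds to SELECT in SQL
However, find() is weak for complex queries, aggregation, and relating data to other collections (tables in an RDB)
In an aggregation, $match plays the role of the find() filter (WHERE in SQL), and $project corresponds to the SELECT column list

* Processing the data on the application side is of course also an option.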

Testing find() and aggregate()

Let's try a simple query.
If the query is not complex, find() can handle it too.

// In SQL it would look like this
// SELECT theaterId, geo FROM theaters
//       WHERE theaterId > 5000 limit 2

// Using find
db.theaters.find({ theaterId: { $gt: 5000 } }, { theaterId: 1, 'location.geo': 1 }).limit(2)
{ _id: ObjectId("59a47287cfa9a3a73e51ec46"),
  theaterId: 8039,
  location: { geo: { type: 'Point', coordinates: [ -122.200085, 37.729292 ] } } }
{ _id: ObjectId("59a47287cfa9a3a73e51ec57"),
  theaterId: 8060,
  location: { geo: { type: 'Point', coordinates: [ -97.040653, 32.897449 ] } } }

// Using aggregate
db.theaters.aggregate([{ $match: {theaterId: {$gt: 5000} } }, { $project: { theaterId: 1, 'location.geo': 1 }}, { $limit: 2 } ])
[ { _id: ObjectId("59a47287cfa9a3a73e51ec46"),
    theaterId: 8039,
    location: { geo: { type: 'Point', coordinates: [ -122.200085, 37.729292 ] } } },
  { _id: ObjectId("59a47287cfa9a3a73e51ec57"),
    theaterId: 8060,
    location: { geo: { type: 'Point', coordinates: [ -97.040653, 32.897449 ] } } } ]


// Pick 2 documents at random from those with theaterId > 5000
db.theaters.aggregate([{ $match: {theaterId: {$gt: 5000} } }, { $project: { theaterId: 1, 'location.geo': 1 }}, { $sample: { size: 2 } } ])
[ { _id: ObjectId("59a47287cfa9a3a73e51ec5a"),
    theaterId: 8064,
    location: { geo: { type: 'Point', coordinates: [ -93.227904, 44.873221 ] } } },
  { _id: ObjectId("59a47287cfa9a3a73e51ec57"),
    theaterId: 8060,
    location: { geo: { type: 'Point', coordinates: [ -97.040653, 32.897449 ] } } } ]

// Pick 2 documents at random from those with theaterId > 5000 (running it again returns a different result)
db.theaters.aggregate([{ $match: {theaterId: {$gt: 5000} } }, { $project: { theaterId: 1, 'location.geo': 1 }}, { $sample: { size: 2 } } ])
[ { _id: ObjectId("59a47287cfa9a3a73e51ecc1"),
    theaterId: 8193,
    location: { geo: { type: 'Point', coordinates: [ -81.684564, 30.491183 ] } } },
  { _id: ObjectId("59a47287cfa9a3a73e51ec74"),
    theaterId: 8090,
    location: { geo: { type: 'Point', coordinates: [ -77.448032, 38.952866 ] } } } ]

That alone is a bit dull, so let's vary it a little more.


// Pick 2 documents at random from those whose theaterId is greater than 5000 or less than 1000
db.theaters.aggregate([{ $match: { $or: [{theaterId: {$gt: 5000}}, { theaterId:  { $lt: 1000 }}] } }, { $project: { theaterId: 1, 'location.geo': 1 }}, { $sample: { size: 2 } } ])
[ { _id: ObjectId("59a47286cfa9a3a73e51e7f1"),
    theaterId: 128,
    location: { geo: { type: 'Point', coordinates: [ -119.78868, 39.475178 ] } } },
  { _id: ObjectId("59a47287cfa9a3a73e51ed12"),
    theaterId: 872,
    location: { geo: { type: 'Point', coordinates: [ -118.12892, 33.922974 ] } } } ]

Object Mappers

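MongoDB stores data as one document per unit the application needs, so when the application uses the data as an object there are fewer layers of object-mapping logic and less cost for assembling the data.
The object mapping ends up simple.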

Object Mappers: Quiz1

In a relational database, it gathers data from multiple tables by performing joins to create the object.
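- In an RDB, building an object requires joining multiple tables to extract the data
In MongoDB, the data already represents the object so it can be sent without processing.
- In MongoDB the data is already gathered into a document in a ready-to-use shape, so it can become an object without further processing

* Not every object needs table joins to build, of course, but since normalizing tables is the norm, it is true that in Rails I compose objects by setting up associations on the main data.... ✍️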

Object Mappers: Quiz2

Q. What typically maps directly to an object?
A. A document in MongoDB
- A MongoDB document corresponds directly to an object

Code and Queries / Code Samples (Exercise)

Try extracting data using Python, Java, .NET, and JavaScript

// Query condition
db.test.find(
  { age: { $lt: 30 } }
)

Trying it with Node.js

I'll try both find() and aggregate().

// sample.js
require('dotenv').config();

const {
  MongoClient,
} = require("mongodb");

// Connection URI
const uri = process.env.MONGO_URI;
const assert = require('assert');

// Create a new MongoClient
const client = new MongoClient(uri, {
  useUnifiedTopology: true
});

async function query(filter, pipeline) {
  try {
    // Connect the client to the server
    await client.connect();

    const db = client.db('fcc');
    const coll = db.collection('people');
    console.log("Connected successfully to server");

    // Fetch with find()
    const findCursor = coll.find(filter);
    const result = []

    // Multiple documents come back through a cursor
    await findCursor.forEach(document => {
      result.push(document);
    });

    // Fetch with aggregation
    const aggregateCursor = coll.aggregate(pipeline);
    const aggregateResult = []

    // Aggregation also returns multiple documents through a cursor
    await aggregateCursor.forEach(document => {
      aggregateResult.push(document);
    });

    return [result, aggregateResult]

  } finally {
    await client.close();
  }
}

const filter = {
  'age': {
    '$lt': 30
  }
};

const pipeline = [{
  $match: {
    'age': {
      '$lt': 30
    }
  }
}];

query(filter, pipeline).then((value) => {
  const findResult = value[0];
  const aggregateResult = value[1];

  console.log('----- find ------');
  console.log(findResult);
  console.log('----- aggregate ------');
  console.log(aggregateResult);

  try {
    assert.deepStrictEqual(findResult, aggregateResult);
    console.log('findResult, aggregateResult - OK');
  } catch (err) {
    console.log(err);
  }
});

I added an assertion check, so let's run it.


% node sample.js

Connected successfully to server
----- find ------
[ { _id: 5f5e160a8bfa3b017a81f151,
    favoriteFoods: [ 'burrito', 'hot-dog' ],
    name: 'Pablo',
    age: 26,
    __v: 0 },
  { _id: 5f5e160a8bfa3b017a81f152,
    favoriteFoods: [ 'pizza', 'nachos' ],
    name: 'Bob',
    age: 23,
    __v: 0 } ]
----- aggregate ------
[ { _id: 5f5e160a8bfa3b017a81f151,
    favoriteFoods: [ 'burrito', 'hot-dog' ],
    name: 'Pablo',
    age: 26,
    __v: 0 },
  { _id: 5f5e160a8bfa3b017a81f152,
    favoriteFoods: [ 'pizza', 'nachos' ],
    name: 'Bob',
    age: 23,
    __v: 0 } ]
findResult, aggregateResult - OK

Code Sample: Quiz1

Q. Which programming languages can use the exact same query syntax for the conditions than the MongoDB shell?
A. Python, JavaScript

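MongoDB has drivers for Java, .NET, Python, JavaScript, and many other languages, but the ones that can pass a query in the same form as the MongoDB Shell are Python and JavaScript.

In Java you construct a Bson object, and in .NET a Filter object, and pass that instead.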

Code Sample: Quiz2

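Q. What is the rough flow for issuing a query to MongoDB and retrieving data?
A. As follows
- Connect to the database using MongoClient
- Get the collection to query
- Store the query in a variable
- Send the query to the collection
- Receive the results as objects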

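Notes from this round

Aggregation looks capable of all sorts of complex things!
This time I checked the aggregation behavior with MongoDB Compass, using the sample data on MongoDB Atlas.

The simple script uses Node.js.

Next up, Chapter 4 is "The Life Cycle of an Application and Additional Resources".
It gives an overview of, and related information on, building applications with high availability in mind.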