
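About Partitions in PySpark

Published on 2022/04/25

I looked into how partitions work in PySpark. The documentation for DataFrameWriter.partitionBy describes it as follows.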

Partitions the output by the given columns on the file system.
If specified, the output is laid out on the file system similar to Hive’s partitioning scheme.

In other words, when partition columns are specified, the output is laid out in a way that resembles Hive's partitioning scheme.
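As a rough illustration of that layout, here is a minimal sketch. The helper hive_partition_dir, the path out/events, and the column names year and month are hypothetical names made up for this example. write_partitioned assumes df is a pyspark.sql.DataFrame and is only defined here, not called.

```python
from pathlib import PurePosixPath


def hive_partition_dir(base, partition_values):
    """Return the directory a row lands in under a Hive-style layout.

    partition_values maps partition column -> value, in partition order.
    (Hypothetical helper, for illustration only.)
    """
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return str(PurePosixPath(base, *parts))


def write_partitioned(df, base, partition_cols):
    """Write a DataFrame partitioned by the given columns.

    Assumes df is a pyspark.sql.DataFrame; partitionBy takes the
    partition column names as positional arguments.
    """
    df.write.partitionBy(*partition_cols).mode("overwrite").parquet(base)


# One directory level per partition column, each named column=value.
print(hive_partition_dir("out/events", {"year": 2022, "month": 4}))
# out/events/year=2022/month=4
```

Calling write_partitioned(df, "out/events", ["year", "month"]) on a real DataFrame should produce directories of this shape, with the data files placed inside the innermost one.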

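Checking how Hive partitions work

See this article.

In Hive, setting partitions on a table lets you limit the range of the table that is searched or updated.

When querying, specifying a partition in the WHERE clause narrows the range that is scanned, which makes the query more efficient. A query that does not specify a partition in the WHERE clause is equivalent to a full table scan, so be careful.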
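The difference between the two kinds of query can be sketched in plain Python. This is only a toy model of the idea, not how Hive is implemented; the file paths, the dt partition column, and both function names are assumptions made for this example.

```python
def parse_partitions(path):
    """Extract partition column values from a Hive-style path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def files_to_scan(paths, where=None):
    """Return the files a query has to scan.

    where maps partition column -> required value.
    None stands for a WHERE clause without a partition: a full scan.
    """
    if not where:
        return list(paths)
    return [
        p for p in paths
        if all(parse_partitions(p).get(col) == val for col, val in where.items())
    ]


paths = [
    "logs/dt=2022-04-24/part-0.parquet",
    "logs/dt=2022-04-25/part-0.parquet",
    "logs/dt=2022-04-25/part-1.parquet",
]

print(len(files_to_scan(paths)))                        # 3 (full scan)
print(len(files_to_scan(paths, {"dt": "2022-04-25"})))  # 2 (one partition)
```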

Partitions in Glue jobs

The reference material is below.

https://docs.aws.amazon.com/ja_jp/glue/latest/dg/aws-glue-programming-etl-partitions.html

It should be enough to keep the following two points in mind.

Partition Filtering
- Limits the range of files that are read
Filter Pushdown
- For filter or where clauses on columns that are not used as partition columns,
  reads only the blocks that match
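The two mechanisms work at different levels, which a small simulation may make clearer. This is a sketch under assumptions: the Block class, the price column, and the use of per-block min/max statistics to decide whether a block can match are all illustrative choices, not taken from the Glue documentation.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A block of rows with min/max statistics for a non-partition column."""
    partition: dict  # partition column values, taken from the directory name
    min_price: int
    max_price: int


def blocks_to_read(blocks, partition_filter, min_price):
    """Return the blocks that must be read for `price >= min_price`."""
    selected = []
    for block in blocks:
        # Partition filtering: skip every file in a non-matching partition.
        if any(block.partition.get(c) != v for c, v in partition_filter.items()):
            continue
        # Filter pushdown: skip blocks whose statistics rule out any match.
        if block.max_price < min_price:
            continue
        selected.append(block)
    return selected


blocks = [
    Block({"year": 2022, "month": 3}, min_price=10, max_price=500),
    Block({"year": 2022, "month": 4}, min_price=10, max_price=90),
    Block({"year": 2022, "month": 4}, min_price=100, max_price=900),
]

hits = blocks_to_read(blocks, {"year": 2022, "month": 4}, min_price=100)
print(len(hits))  # 1
```

The first block is dropped by partition filtering and the second by filter pushdown, so only the third is read.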

Why partition

Efficiency
- Reduces the amount of data scanned
- Shortens query execution time
