Closed7

S3に残ったS3ログを列指向フォーマットに圧縮したい

wintwint

モチベ

S3のログがバケットに残ってるけど、一覧性が悪いしスペース効率も悪いので、圧縮したい

仕様

ログの形式

https://docs.aws.amazon.com/ja_jp/AmazonS3/latest/userguide/ServerLogs.html

https://docs.aws.amazon.com/ja_jp/AmazonS3/latest/userguide/LogFormat.html

ログのスキーマ

https://docs.aws.amazon.com/ja_jp/AmazonS3/latest/userguide/using-s3-access-logs-to-identify-requests.html

CREATE EXTERNAL TABLE `s3_access_logs_db.mybucket_logs`(
  `bucketowner` STRING, 
  `bucket_name` STRING, 
  `requestdatetime` STRING, 
  `remoteip` STRING, 
  `requester` STRING, 
  `requestid` STRING, 
  `operation` STRING, 
  `key` STRING, 
  `request_uri` STRING, 
  `httpstatus` STRING, 
  `errorcode` STRING, 
  `bytessent` BIGINT, 
  `objectsize` BIGINT, 
  `totaltime` STRING, 
  `turnaroundtime` STRING, 
  `referrer` STRING, 
  `useragent` STRING, 
  `versionid` STRING, 
  `hostid` STRING, 
  `sigv` STRING, 
  `ciphersuite` STRING, 
  `authtype` STRING, 
  `endpoint` STRING, 
  `tlsversion` STRING)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://awsexamplebucket1-logs/prefix/'

ソリューション候補

https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html

by creating a cluster in Amazon EMR and converting it using Hive

https://www.techscore.com/blog/2019/12/07/parquet_examine/

Apache Spark

wintwint

列指向フォーマット

結論: ORC を採用

候補

  • Parquet
  • ORC

VS

https://www.linkedin.com/pulse/orc-vs-parquet-vivek-singh/

Parquet’s compression ratio was much better as compared to ORC.

As per this benchmark Parquet performed better both in terms of compression and query performance. In both cases snappy compression was used.

https://blog.clairvoyantsoft.com/big-data-file-formats-3fb659903271

ORC is more compression efficient

http://www.thecloudavenue.com/2016/10/comparing-orc-vs-parquet-data-storage.html

Note that ZLIB in ORC and GZIP in Parquet uses the same compression codec, just the property name is different.

There is not much storage savings when using ORC and Parquet when using the same compression code like SNAPPY vs SNAPPY and ZLIB vs GZIP.

https://blog.yannickjaquier.com/hadoop/orc-versus-parquet-compression-and-response-time.html

graph

ref

https://docs.aws.amazon.com/athena/latest/ug/columnar-storage.html

https://engineer.retty.me/entry/columnar-storage-format

余談

Avro は行指向なので違う

https://medium.com/@minyodev/avro-vs-parquet-what-to-use-518ccbe8fb0c

wintwint

S3ログのパース

https://aws.amazon.com/jp/premiumsupport/knowledge-center/analyze-logs-athena/

parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z')

https://docs.aws.amazon.com/athena/latest/ug/presto-functions.html

https://prestodb.io/docs/0.217/functions/datetime.html

date_trunc

クエリ

with trunc AS 
    (SELECT date_trunc('day', parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z')) AS day
    FROM "s3_access_logs"."table_name" )
SELECT day,
         count() AS count
FROM trunc
GROUP BY  day
ORDER BY  day
このスクラップは2021/08/26にクローズされました