🤖

AWS AthenaでALBのアクセスログを分析できるようにする

2022/07/31に公開

はじめに

特定のページのアクセス数を確認しようと思い、Athenaを利用してALBのアクセスログを検索しようすると、テーブルは存在するもののパーティションの設定が行われていないという状況にでくわしたので、あらためてAthenaのテーブルの作成方法をまとめておこうと思います。

Athenaのテーブルを作成する

以降は、Athenaのコンソールより操作します。

パーティション設定しない場合

以下を実行して、テーブルを作成します。

CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
            type string,
            time string,
            elb string,
            client_ip string,
            client_port int,
            target_ip string,
            target_port int,
            request_processing_time double,
            target_processing_time double,
            response_processing_time double,
            elb_status_code int,
            target_status_code string,
            received_bytes bigint,
            sent_bytes bigint,
            request_verb string,
            request_url string,
            request_proto string,
            user_agent string,
            ssl_cipher string,
            ssl_protocol string,
            target_group_arn string,
            trace_id string,
            domain_name string,
            chosen_cert_arn string,
            matched_rule_priority string,
            request_creation_time string,
            actions_executed string,
            redirect_url string,
            lambda_error_reason string,
            target_port_list string,
            target_status_code_list string,
            classification string,
            classification_reason string
            )
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
            WITH SERDEPROPERTIES (
            'serialization.format' = '1',
            'input.regex' =
        '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
            LOCATION 's3://{BUCKET_NAME}/AWSLogs/{AWS_ACCOUNT_ID}/elasticloadbalancing/{AWS_REGION}/'

パーティション設定をする場合

パーティション管理を自身で行う場合

以下を実行して、テーブルを作成します。

CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs_partitioned (
            type string,
            time string,
            elb string,
            client_ip string,
            client_port int,
            target_ip string,
            target_port int,
            request_processing_time double,
            target_processing_time double,
            response_processing_time double,
            elb_status_code int,
            target_status_code string,
            received_bytes bigint,
            sent_bytes bigint,
            request_verb string,
            request_url string,
            request_proto string,
            user_agent string,
            ssl_cipher string,
            ssl_protocol string,
            target_group_arn string,
            trace_id string,
            domain_name string,
            chosen_cert_arn string,
            matched_rule_priority string,
            request_creation_time string,
            actions_executed string,
            redirect_url string,
            lambda_error_reason string,
            target_port_list string,
            target_status_code_list string,
            classification string,
            classification_reason string
            )
            PARTITIONED BY
            (
                year string,
                month string,
                day string
            )
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
            WITH SERDEPROPERTIES (
            'serialization.format' = '1',
            'input.regex' =
        '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
            LOCATION 's3://{BUCKET_NAME}/AWSLogs/{AWS_ACCOUNT_ID}/elasticloadbalancing/{AWS_REGION}/'

その後、任意のパーティションキー(year, month, day)を指定してパーティションを追加します。

ALTER TABLE alb_logs_partitioned ADD
  PARTITION (year = '2022', month ='07', day= '30')
  LOCATION's3://{BUCKET_NAME}/AWSLogs/{AWS_ACCOUNT_ID}/elasticloadbalancing/{AWS_REGION}/2022/07/30/'

Prtition Projectionでパーティション管理を自動化する場合

以下を実行して、テーブルを作成します。

CREATE EXTERNAL TABLE IF NOT EXISTS alb_log_partition_projection (
            type string,
            time string,
            elb string,
            client_ip string,
            client_port int,
            target_ip string,
            target_port int,
            request_processing_time double,
            target_processing_time double,
            response_processing_time double,
            elb_status_code int,
            target_status_code string,
            received_bytes bigint,
            sent_bytes bigint,
            request_verb string,
            request_url string,
            request_proto string,
            user_agent string,
            ssl_cipher string,
            ssl_protocol string,
            target_group_arn string,
            trace_id string,
            domain_name string,
            chosen_cert_arn string,
            matched_rule_priority string,
            request_creation_time string,
            actions_executed string,
            redirect_url string,
            lambda_error_reason string,
            target_port_list string,
            target_status_code_list string,
            classification string,
            classification_reason string
            )
            PARTITIONED BY
            (
             date string
            )
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
            WITH SERDEPROPERTIES (
            'serialization.format' = '1',
            'input.regex' =
        '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
            LOCATION 's3://{BUCKET_NAME}/AWSLogs/{AWS_ACCOUNT_ID}/elasticloadbalancing/{AWS_REGION}/'
            TBLPROPERTIES
            (
             "projection.enabled" = "true",
             "projection.date.type" = "date",
             "projection.date.range" = "2022/01/01,NOW",
             "projection.date.format" = "yyyy/MM/dd",
             "projection.date.interval" = "1",
             "projection.date.interval.unit" = "DAYS",
             "storage.location.template" = "s3://{BUCKET_NAME}/AWSLogs/{AWS_ACCOUNT_ID}/elasticloadbalancing/{AWS_REGION}/${date}"
            )

まとめ

Athenaのコンソールより、テーブルを作成する方法を記述しました。また、パーティション設定をしない方法とパーティション設定をする方法を記述し、パーティション設定をする方法については、パーティション管理を自身で行う方法と、Partition Projectionを利用してパーティション管理を自動化する方法についても記述しました。

参考

GitHubで編集を提案

はじめに

Athenaのテーブルを作成する

パーティション設定しない場合

パーティション設定をする場合

まとめ

参考

Discussion