
Trying the Amazon EMR Tutorial with the AWS CLI


Tutorial: Getting started with Amazon EMR - Amazon EMR
I worked through the tutorial above using the AWS CLI.

Prerequisites

  • The AWS CLI is run from CloudShell
  • The region is the Tokyo region (ap-northeast-1)

1. Create an S3 bucket

# Create the S3 bucket
$ aws s3 mb s3://my-emr-tutorial-bucket-20251031

# Create the Python script
$ cat > health_violations.py << 'EOF'
import argparse

from pyspark.sql import SparkSession

def calculate_red_violations(data_source, output_uri):
    """
    Processes sample food establishment inspection data and queries the data to find the top 10 establishments
    with the most Red violations from 2006 to 2020.

    :param data_source: The URI of your food establishment data CSV, such as 's3://amzn-s3-demo-bucket/food-establishment-data.csv'.
    :param output_uri: The URI where output is written, such as 's3://amzn-s3-demo-bucket/restaurant_violation_results'.
    """
    with SparkSession.builder.appName("Calculate Red Health Violations").getOrCreate() as spark:
        # Load the restaurant violation CSV data
        if data_source is not None:
            restaurants_df = spark.read.option("header", "true").csv(data_source)

        # Create an in-memory DataFrame to query
        restaurants_df.createOrReplaceTempView("restaurant_violations")

        # Create a DataFrame of the top 10 restaurants with the most Red violations
        top_red_violation_restaurants = spark.sql("""SELECT name, count(*) AS total_red_violations 
          FROM restaurant_violations 
          WHERE violation_type = 'RED' 
          GROUP BY name 
          ORDER BY total_red_violations DESC LIMIT 10""")

        # Write the results to the specified output URI
        top_red_violation_restaurants.write.option("header", "true").mode("overwrite").csv(output_uri)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--data_source', help="The URI for your CSV restaurant data, like an S3 bucket location.")
    parser.add_argument(
        '--output_uri', help="The URI where output is saved, like an S3 bucket location.")
    args = parser.parse_args()

    calculate_red_violations(args.data_source, args.output_uri)
EOF

# Download the sample data
$ wget https://docs.aws.amazon.com/ja_jp/emr/latest/ManagementGuide/samples/food_establishment_data.zip
$ unzip food_establishment_data.zip

# Upload to the S3 bucket
$ aws s3 cp health_violations.py s3://my-emr-tutorial-bucket-20251031/input/
$ aws s3 cp food_establishment_data.csv s3://my-emr-tutorial-bucket-20251031/input/
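Before spinning up a cluster, the aggregation in health_violations.py can be sanity-checked locally. Below is a minimal sketch in plain Python (csv plus collections.Counter) that mirrors the Spark SQL query; the rows are made up for illustration, not taken from the real dataset:

```python
import csv
import io
from collections import Counter

# Hypothetical rows mimicking the layout of food_establishment_data.csv
sample_csv = """name,violation_type
CAFE A,RED
CAFE A,RED
CAFE B,BLUE
CAFE B,RED
CAFE C,RED
CAFE A,BLUE
"""

# Same logic as the Spark SQL query: count rows with violation_type = 'RED'
# per name, then take the top 10 in descending order of count
reader = csv.DictReader(io.StringIO(sample_csv))
red_counts = Counter(row["name"] for row in reader if row["violation_type"] == "RED")
top_red = red_counts.most_common(10)
print(top_red)  # [('CAFE A', 2), ('CAFE B', 1), ('CAFE C', 1)]
```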

2. Create the EMR cluster

# Create the default IAM roles
$ aws emr create-default-roles

# Create an EC2 key pair
$ aws ec2 create-key-pair \
--key-name my-emr-key-pair \
--query 'KeyMaterial' \
--output text > my-emr-key-pair.pem

# Create the EMR cluster
$ aws emr create-cluster \
--name "My First EMR Cluster" \
--release-label emr-7.10.0 \
--applications Name=Spark \
--ec2-attributes KeyName=my-emr-key-pair \
--instance-type m5.xlarge \
--instance-count 3 \
--use-default-roles

{
    "ClusterId": "j-2AXFRBRJRYCNS",
    "ClusterArn": "arn:aws:elasticmapreduce:ap-northeast-1:012345678901:cluster/j-2AXFRBRJRYCNS"
}
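The ClusterId from this response is needed in the later commands. In a script it can be captured by parsing the JSON; a sketch using the response above (on the CLI side, `--query ClusterId --output text` achieves the same):

```python
import json

# JSON response returned by `aws emr create-cluster` (copied from above)
response = """{
    "ClusterId": "j-2AXFRBRJRYCNS",
    "ClusterArn": "arn:aws:elasticmapreduce:ap-northeast-1:012345678901:cluster/j-2AXFRBRJRYCNS"
}"""

# Extract the cluster ID for use in add-steps / terminate-clusters
cluster_id = json.loads(response)["ClusterId"]
print(cluster_id)  # j-2AXFRBRJRYCNS
```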

3. Submit the PySpark application as a step

# Submit the step
$ aws emr add-steps \
--cluster-id j-2AXFRBRJRYCNS \
--steps Type=Spark,Name="My Spark Application",ActionOnFailure=CONTINUE,Args=[s3://my-emr-tutorial-bucket-20251031/input/health_violations.py,--data_source,s3://my-emr-tutorial-bucket-20251031/input/food_establishment_data.csv,--output_uri,s3://my-emr-tutorial-bucket-20251031/output/]

{
    "StepIds": [
        "s-03269583KEEM21ZLZBLZ"
    ]
}

# Check the output (the step must complete first; aws emr wait step-complete can be used to wait for it)
$ aws s3 ls s3://my-emr-tutorial-bucket-20251031/output/ --recursive
2025-10-31 04:14:46          0 output/_SUCCESS
2025-10-31 04:14:46        219 output/part-00000-2403a6e9-1607-42a3-b3f4-0165fa791067-c000.csv

$ aws s3 cp s3://my-emr-tutorial-bucket-20251031/output/part-00000-2403a6e9-1607-42a3-b3f4-0165fa791067-c000.csv - | head -20
name,total_red_violations
SUBWAY,322
T-MOBILE PARK,315
WHOLE FOODS MARKET,299
PCC COMMUNITY MARKETS,251
TACO TIME,240
MCDONALD'S,177
THAI GINGER,153
SAFEWAY INC #1508,143
TAQUERIA EL RINCONSITO,134
HIMITSU TERIYAKI,128
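As a quick local check that this output matches what the query promises (at most 10 rows, sorted by total_red_violations in descending order), the CSV can be validated in a few lines of Python; a sketch with the rows above inlined:

```python
import csv
import io

# The output CSV shown above, inlined for a local check
output_csv = """name,total_red_violations
SUBWAY,322
T-MOBILE PARK,315
WHOLE FOODS MARKET,299
PCC COMMUNITY MARKETS,251
TACO TIME,240
MCDONALD'S,177
THAI GINGER,153
SAFEWAY INC #1508,143
TAQUERIA EL RINCONSITO,134
HIMITSU TERIYAKI,128
"""

rows = list(csv.DictReader(io.StringIO(output_csv)))
counts = [int(r["total_red_violations"]) for r in rows]

# The Spark job used ORDER BY ... DESC and LIMIT 10
assert len(rows) <= 10
assert counts == sorted(counts, reverse=True)
print("output looks consistent")
```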

4. Connect to the cluster over SSH

# Add an inbound rule to the security group
$ aws ec2 describe-security-groups \
--filters Name=group-name,Values=ElasticMapReduce-master \
--query 'SecurityGroups[0].GroupId'
"sg-074bb8607d486d019"

$ aws ec2 authorize-security-group-ingress \
--group-id sg-074bb8607d486d019 \
--protocol tcp \
--port 22 \
--cidr 0.0.0.0/0

# Connect via SSH
$ chmod 400 ~/my-emr-key-pair.pem
$ aws emr ssh \
--cluster-id j-2AXFRBRJRYCNS \
--key-pair-file ~/my-emr-key-pair.pem

ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=10 -i /home/cloudshell-user/my-emr-key-pair.pem hadoop@ec2-18-182-43-108.ap-northeast-1.compute.amazonaws.com -t
Warning: Permanently added 'ec2-18-182-43-108.ap-northeast-1.compute.amazonaws.com' (ED25519) to the list of known hosts.
   ,     #_
   ~\_  ####_        Amazon Linux 2023
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   https://aws.amazon.com/linux/amazon-linux-2023
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'
                                                                    
EEEEEEEEEEEEEEEEEEEE MMMMMMMM           MMMMMMMM RRRRRRRRRRRRRRR    
E::::::::::::::::::E M:::::::M         M:::::::M R::::::::::::::R   
EE:::::EEEEEEEEE:::E M::::::::M       M::::::::M R:::::RRRRRR:::::R 
  E::::E       EEEEE M:::::::::M     M:::::::::M RR::::R      R::::R
  E::::E             M::::::M:::M   M:::M::::::M   R:::R      R::::R
  E:::::EEEEEEEEEE   M:::::M M:::M M:::M M:::::M   R:::RRRRRR:::::R 
  E::::::::::::::E   M:::::M  M:::M:::M  M:::::M   R:::::::::::RR   
  E:::::EEEEEEEEEE   M:::::M   M:::::M   M:::::M   R:::RRRRRR::::R  
  E::::E             M:::::M    M:::M    M:::::M   R:::R      R::::R
  E::::E       EEEEE M:::::M     MMM     M:::::M   R:::R      R::::R
EE:::::EEEEEEEE::::E M:::::M             M:::::M   R:::R      R::::R
E::::::::::::::::::E M:::::M             M:::::M RR::::R      R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM             MMMMMMM RRRRRRR      RRRRRR

# Check the log files
$ ls -la /mnt/var/log/hadoop-yarn/
total 268
drwxrwxr-x.  2 yarn hadoop   4096 Oct 31 04:10 .
drwxr-xr-x. 26 root root     4096 Oct 31 04:09 ..
-rw-r--r--.  1 yarn yarn    31982 Oct 31 04:09 hadoop-yarn-proxyserver-ip-172-31-2-135.ap-northeast-1.compute.internal.log
-rw-r--r--.  1 yarn yarn      835 Oct 31 04:09 hadoop-yarn-proxyserver-ip-172-31-2-135.ap-northeast-1.compute.internal.out
-rw-r--r--.  1 yarn yarn   149540 Oct 31 04:19 hadoop-yarn-resourcemanager-ip-172-31-2-135.ap-northeast-1.compute.internal.log
-rw-r--r--.  1 yarn yarn     2362 Oct 31 04:10 hadoop-yarn-resourcemanager-ip-172-31-2-135.ap-northeast-1.compute.internal.out
-rw-r--r--.  1 yarn yarn    47026 Oct 31 04:18 hadoop-yarn-timelineserver-ip-172-31-2-135.ap-northeast-1.compute.internal.log
-rw-r--r--.  1 yarn yarn    23923 Oct 31 04:14 hadoop-yarn-timelineserver-ip-172-31-2-135.ap-northeast-1.compute.internal.out

$ tail -5 /mnt/var/log/hadoop-yarn/hadoop-yarn-resourcemanager-ip-172-31-2-135.ap-northeast-1.compute.internal.log
2025-10-31 04:20:20,393 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler (Pending Container Clear Timer): Release request cache is cleaned up
2025-10-31 04:20:21,207 INFO http.requests.resourcemanager (qtp1562912969-40): 172.31.2.135 - - [31/Oct/2025:04:20:21 +0000] "GET /ws/v1/cluster/nodes HTTP/1.1" 200 697 "-" "Apache-HttpClient/4.5.14 (Java/17.0.16)"
2025-10-31 04:20:21,210 INFO http.requests.resourcemanager (qtp1562912969-37): 172.31.2.135 - - [31/Oct/2025:04:20:21 +0000] "GET /ws/v1/cluster/apps?states=RUNNING HTTP/1.1" 200 31 "-" "Apache-HttpClient/4.5.14 (Java/17.0.16)"
2025-10-31 04:20:23,566 INFO http.requests.resourcemanager (qtp1562912969-40): 172.31.2.135 - - [31/Oct/2025:04:20:23 +0000] "GET /ws/v1/cluster/apps?states=RUNNING HTTP/1.1" 200 31 "-" "Apache-HttpClient/4.5.14 (Java/17.0.16)"
2025-10-31 04:20:23,569 INFO http.requests.resourcemanager (qtp1562912969-35): 172.31.2.135 - - [31/Oct/2025:04:20:23 +0000] "GET /ws/v1/cluster/apps?states=FINISHED,FAILED,KILLED&finishedTimeBegin=1761880823566&finishedTimeEnd=1761884423566 HTTP/1.1" 200 720 "-" "Apache-HttpClient/4.5.14 (Java/17.0.16)"

$ ls -la /mnt/var/log/hadoop/steps/
total 0
drwxr-xr-x. 4 hadoop hadoop 66 Oct 31 04:13 .
drwxr-xr-x. 3 hadoop hadoop 19 Oct 31 04:11 ..
drwxr-xr-x. 2 hadoop hadoop 66 Oct 31 04:11 s-08491022LMVZ4AK9Y3WM

$ cat /mnt/var/log/hadoop/steps/s-08491022LMVZ4AK9Y3WM/controller
1031/input/food_establishment_data.csv --output_uri s3://my-emr-tutorial-bucket-20251031/output/'
INFO Environment:
  PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/aws/puppet/bin/:/opt/puppetlabs/bin
  INVOCATION_ID=3f31d4c82fd242069494788a6698aee4
  SECURITY_PROPERTIES=/emr/instance-controller/lib/security.properties
  HISTCONTROL=ignoredups
  MANPATH=:/opt/puppetlabs/puppet/share/man
  PIDFILE=/emr/instance-controller/run/instance-controller.pid
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  AWS_DEFAULT_REGION=ap-northeast-1
  JAVA_HOME=/etc/alternatives/jre
  BASH_FUNC_which%%=() {  ( alias; eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot "$@"}
  LANG=en_US.UTF-8
  JOURNAL_STREAM=8:23997
  MAIL=/var/spool/mail/hadoop
  which_declare=declare -f
  LOGNAME=hadoop
  PWD=/
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-08491022LMVZ4AK9Y3WM/tmp
  _=/usr/lib/jvm/jre-17/bin/java
  S_COLORS=auto
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  SHELL=/bin/bash
  USER=hadoop
  GEM_HOME=/home/hadoop/.local/share/gem/ruby
  HADOOP_LOGFILE=syslog
  SYSTEMD_COLORS=false
  HOSTNAME=ip-172-31-2-135.ap-northeast-1.compute.internal
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-08491022LMVZ4AK9Y3WM
  SYSTEMD_EXEC_PID=3991
  EMR_STEP_ID=s-08491022LMVZ4AK9Y3WM
  SHLVL=0
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-08491022LMVZ4AK9Y3WM/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-08491022LMVZ4AK9Y3WM/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-08491022LMVZ4AK9Y3WM
INFO ProcessRunner started child process 10521
2025-10-31T04:11:57.724212862Z INFO HadoopJarStepRunner.Runner: startRun() called for s-08491022LMVZ4AK9Y3WM Child Pid: 10521
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 6 seconds
2025-10-31T04:12:03.964648540Z INFO Step created jobs: 
2025-10-31T04:12:03.966537209Z WARN Step failed with exitCode 1 and took 6 seconds
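Controller logs like the one above are plain text, so a step's exit code can be pulled out programmatically. A minimal sketch (a regex over the log text; the snippet inlines two lines copied from the log rather than reading the real file):

```python
import re

# Two lines copied from the controller log above
log_text = """INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
2025-10-31T04:12:03.966537209Z WARN Step failed with exitCode 1 and took 6 seconds"""

# Extract the exit code reported by the step runner
match = re.search(r"ended with exit code (\d+)", log_text)
exit_code = int(match.group(1)) if match else None
print(exit_code)  # 1
```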

$ exit

5. Clean up

$ aws emr terminate-clusters --cluster-ids j-2AXFRBRJRYCNS
$ aws s3 rm s3://my-emr-tutorial-bucket-20251031 --recursive
$ aws s3 rb s3://my-emr-tutorial-bucket-20251031
$ aws ec2 delete-key-pair --key-name my-emr-key-pair

Summary

In this post, I worked through the EMR tutorial using the AWS CLI.
I hope it is helpful to someone.
