
Detecting Forgotten Resources in Amazon SageMaker Studio

MEGAZONE株式会社

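Published on 2023/11/21

Introduction

Thank you for reading. I'm 阿河.

Amazon SageMaker Studio provides a wide range of MLOps features, but as you spin up more and more resources it is easy to forget to delete some of them and run into unexpected charges (it has happened to me too).

Amazon SageMaker Data Wrangler in particular is convenient, but it is billed by the hour, so a forgotten resource hits your wallet.

Instead of shutting idle resources down automatically, I set things up so that administrators and users can check whether any resources have been left running in an idle state.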

Overview

Pricing
Assumed environment
Creating the Lambda function
Execution results

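1. Pricing

Costs are based on the instance type, and each instance is billed separately.
Billing starts when the instance is created and stops when all applications on the instance
are shut down or the instance itself is shut down.

Shutting down a notebook running on an instance does not stop billing
unless the instance is also shut down.
Even if you launch multiple notebooks with different kernels,
they run on the same instance as long as they use the same instance type.
Even with multiple notebooks open, if only one instance is running,
you are billed only for the running time of that one instance.

Shutting down a notebook does not delete the notebook itself,
but any unsaved data is lost.

For pricing details, I find the page above easy to follow.

This article uses KernelGateway applications and SageMaker Data Wrangler resources as the test targets.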
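The billing rule quoted above (notebooks on the same instance type share one instance, and billing is per instance) can be illustrated with a small sketch. This is a simplified model for a single user, not an official cost calculation; the function name `billable_instances` and the notebook names are hypothetical.

```python
from collections import defaultdict


def billable_instances(notebooks):
    """Group running notebooks by instance type.

    `notebooks` is a list of (notebook_name, instance_type) pairs.
    Under the rule quoted above, each distinct instance type maps to
    one running instance, and only instances (not notebooks) are billed.
    """
    by_instance = defaultdict(list)
    for name, instance_type in notebooks:
        by_instance[instance_type].append(name)
    return dict(by_instance)


# Hypothetical example: three notebooks, two instance types
running = [
    ("eda.ipynb", "ml.t3.medium"),
    ("train.ipynb", "ml.t3.medium"),
    ("flow.ipynb", "ml.m5.4xlarge"),
]
instances = billable_instances(running)
print(len(instances))  # number of instances that keep accruing charges
```

In this model, closing `eda.ipynb` alone changes nothing on the bill, because `train.ipynb` keeps the same ml.t3.medium instance alive.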

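2. Assumed environment

■ Domain

The assumed environment has multiple domains in the same Region. For this test, I created two domains in the US East (N. Virginia) Region.

■ UserProfile

test-domain-1 has two user profiles.

test-domain-2 has one user profile.

■ Resource

Each user profile launches a KernelGateway application and a Data Wrangler resource and runs some processing.

user01 in test-domain-1 shut down the Data Wrangler resource but left the KernelGateway application running.

user02 in test-domain-1 shut down the KernelGateway application but left the Data Wrangler resource running.

※ About shutting down instances

To delete an instance, you need to delete its application.

When I deleted the KernelGateway application, the ml.t3.medium instance shut down at the same time.

■ Lambda

This time I prepare one Lambda function per domain.
Running the Lambda function for test-domain-1 identifies the users who have left resources idle for a long time.
In this case, two users in test-domain-1 (user01 and user02) forgot to delete resources, so the goal is for the Lambda function to return those two user names.

■ CloudWatch Logs

Check the /aws/sagemaker/studio log group.
Log streams for both KernelGateway and Data Wrangler resources are created under the same application type (KernelGateway).

[domain-id]/[user-profile-name]/[app-type]/[app-name]

The log stream name above identifies the domain, the user profile, the application type, and the application name.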
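To make the naming scheme concrete, here is a minimal sketch that splits a log stream name into its four parts. The helper `parse_log_stream_name`, the `StudioLogStream` type, and the sample stream name are hypothetical and only follow the pattern shown above.

```python
from typing import NamedTuple


class StudioLogStream(NamedTuple):
    domain_id: str
    user_profile_name: str
    app_type: str
    app_name: str


def parse_log_stream_name(name: str) -> StudioLogStream:
    """Split '[domain-id]/[user-profile-name]/[app-type]/[app-name]'."""
    # Split on the first three slashes only; anything after the third
    # slash stays in app_name.
    parts = name.split("/", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected log stream name: {name!r}")
    return StudioLogStream(*parts)


# Placeholder values, masked in the same way as the rest of this article
stream = parse_log_stream_name(
    "d-xxxxxxxxxxxx/user01/KernelGateway/sagemaker-data-scienc-ml-t3-medium-xxxx"
)
print(stream.user_profile_name, stream.app_type)
```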

3. Creating the Lambda function

※ Settings
Runtime: Python 3.11
Architecture: x86_64

Add the required CloudWatch Logs and SageMaker permissions to the IAM role attached to the Lambda function.
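As a rough guide to which permissions are needed, the function in this section calls `list_user_profiles`, `list_apps`, and `describe_log_streams`. The sketch below builds a policy document covering those three calls. The article does not show the policy actually used, so treat this as an assumption: `"Resource": "*"` is deliberately broad and should be narrowed for real use, and the role also needs the usual Lambda logging permissions (for example, the AWSLambdaBasicExecutionRole managed policy).

```python
import json

# Assumed minimal policy for the three API calls made by the function.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListUserProfiles",
                "sagemaker:ListApps",
            ],
            "Resource": "*",  # broad on purpose; narrow as needed
        },
        {
            "Effect": "Allow",
            "Action": ["logs:DescribeLogStreams"],
            "Resource": "*",  # could be limited to /aws/sagemaker/studio
        },
    ],
}

# Print JSON that can be pasted into the IAM console as an inline policy
print(json.dumps(policy, indent=2))
```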

※ Code
※ Change the three parameters in the Parameter section to match your environment.

import boto3
from datetime import datetime, timezone, timedelta
jst = timezone(timedelta(hours=9), 'JST')


# Parameter
threshold = 86400  # tolerable idle time in seconds (86400 = 24 hours)
target_region = "us-east-1"  # region where the domain lives
domain_id = "xxxxxxxxxxxx"  # SageMaker domain ID to inspect



def list_target_user_profile(specified_domain_id, client):
    
    profiles = []
    
    # Check the UserProfile Name corresponding to the specified domain name and store it in the list
    for profile in client.list_user_profiles(DomainIdEquals = specified_domain_id)['UserProfiles']:
        name = profile['UserProfileName']
        profiles.append(name)
            
    return profiles 
    

def list_target_app(userProfiles, client):
    
    apps = []
    
    # Check Apps per UserProfile and store in list
    # Only apps whose type is KernelGateway and whose status is InService are stored.
    # DomainIdEquals keeps same-named profiles in other domains out of the result.
    for i in userProfiles:
        for n in client.list_apps(DomainIdEquals = domain_id, UserProfileNameEquals = i)['Apps']:
            if n['AppType'] == 'KernelGateway' and n['Status'] == 'InService':
                user = n['UserProfileName']
                app = n['AppName']
                # One {user profile name: app name} entry per running app
                apps.append({user: app})
        
    return apps
    
    
def search_idle_instace(apps, client):
    
    target_user = []
    
    # Compare current time and the timestamp of the latest log event
    # If the difference exceeds the threshold, store the user in the list
    for i in apps:
        for d in i:
            response = client.describe_log_streams(
                logGroupName='/aws/sagemaker/studio',
                logStreamNamePrefix=f'{domain_id}/{d}/KernelGateway/{i[d]}',
                descending=True,
                limit=1
            )
            
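            # Skip apps that have no log stream or no log events yet
            streams = response['logStreams']
            if not streams or 'lastEventTimestamp' not in streams[0]:
                continue
            modified_time = streams[0]['lastEventTimestamp']  # epoch milliseconds
            time = datetime.fromtimestamp(modified_time/1000, jst)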
            print(time)
            
            dt = datetime.now(jst)
            print(dt)
            
            diff = dt - time
            print(diff)
            
            # total_seconds() already includes whole days; truncate to whole seconds
            sum_diff = int(diff.total_seconds())
            message = f"user_name:{d}, diff:{sum_diff}\n"
            print(message)

            
            if sum_diff > threshold and d not in target_user:
                target_user.append(d)
                
    return target_user
    
    
def lambda_handler(event, context):
    
    user_profiles = []
    
    sm_client = boto3.client("sagemaker", target_region)
    cw_logs_client = boto3.client('logs', target_region)

    # Stores the names of UserProfiles belonging to the domain
    user_profiles = list_target_user_profile(domain_id, sm_client)
    print(user_profiles)
    
    # Define the correspondence between UserProfiles and running applications
    apps = list_target_app(user_profiles, sm_client)
    print(apps)
    
    # Check the application log to see which applications are in idle status
    target_user = search_idle_instace(apps, cw_logs_client)
    print(target_user)
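The listing above reads only the first page returned by `list_apps`, which is fine for a small test domain. For domains with many apps, a paginator-based variant could look like the sketch below. `get_paginator("list_apps")` is a standard boto3 call, but the helper name `list_inservice_kernel_gateway_apps` is hypothetical and this is not the code used for the results in this article.

```python
def list_inservice_kernel_gateway_apps(client, domain_id, user_profile_name):
    """Return the AppName of every InService KernelGateway app of one user profile.

    `client` is expected to be a boto3 SageMaker client, e.g.
    boto3.client("sagemaker", "us-east-1").
    """
    paginator = client.get_paginator("list_apps")
    pages = paginator.paginate(
        DomainIdEquals=domain_id,
        UserProfileNameEquals=user_profile_name,
    )
    names = []
    for page in pages:  # the paginator follows NextToken for us
        for app in page["Apps"]:
            if app["AppType"] == "KernelGateway" and app["Status"] == "InService":
                names.append(app["AppName"])
    return names
```

The same pattern should apply to `list_user_profiles` when a domain has more profiles than fit in one response.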

4. Execution results

Run the Lambda function.
I ran it after more than 24 hours (86,400 seconds) had passed.

# result(list_target_user_profile)
['user02', 'user01']

# result(list_target_app)
[{'user02': 'sagemaker-data-wrang-ml-m5-4xlarge-xxxxxxxxxxxxxxxxxxxxxxxx'}, {'user01': 'sagemaker-data-scienc-ml-t3-medium-xxxxxxxxxxxxxxxxxxxxxxxxx'}]

# result(search_idle_instace)
2023-11-17 11:04:42.316000+09:00
2023-11-18 12:25:04.657602+09:00
1 day, 1:20:22.341602
user_name:user02, diff:91222

2023-11-17 10:25:06.861000+09:00
2023-11-18 12:25:04.689970+09:00
1 day, 1:59:57.828970
user_name:user01, diff:93597

# target_user list
['user02', 'user01']

The names of the users who belong to test-domain-1 and have idle resources were retrieved.
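To show where a value such as `diff:91222` comes from, the following sketch reproduces the calculation for user02 from the two timestamps printed above. The helper name `idle_seconds` is hypothetical; the logic mirrors the comparison done in the Lambda function.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9), "JST")


def idle_seconds(last_event_ms: int, now: datetime) -> int:
    """Whole seconds elapsed since the last log event (epoch milliseconds)."""
    last_event = datetime.fromtimestamp(last_event_ms / 1000, JST)
    return int((now - last_event).total_seconds())


# Timestamps taken from the user02 output above
last_event = datetime(2023, 11, 17, 11, 4, 42, 316000, tzinfo=JST)
now = datetime(2023, 11, 18, 12, 25, 4, 657602, tzinfo=JST)
last_event_ms = round(last_event.timestamp() * 1000)

elapsed = idle_seconds(last_event_ms, now)
print(elapsed)          # 91222 (1 day + 1 h 20 min 22 s)
print(elapsed > 86400)  # True, so user02 is reported
```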

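Conclusion

If you integrate this Lambda function with EventBridge and SNS and run it on a regular schedule, administrators and users can check for forgotten resources. Feel free to adapt it in various ways.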
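As one possible starting point for the SNS part, the sketch below formats the list of idle users and publishes it to a topic. This is only an illustration under assumptions: the helper names `build_message` and `notify_idle_users`, the subject text, and the topic ARN are hypothetical, and the EventBridge schedule that triggers the Lambda function would be configured separately.

```python
def build_message(domain_id, idle_users, threshold_seconds):
    """Format a plain-text notification listing the idle users."""
    hours = threshold_seconds // 3600
    lines = [f"Domain {domain_id}: KernelGateway apps idle for more than {hours} hours"]
    lines.extend(f"- {user}" for user in idle_users)
    return "\n".join(lines)


def notify_idle_users(sns_client, topic_arn, domain_id, idle_users,
                      threshold_seconds=86400):
    """Publish the idle-user list to an SNS topic; do nothing if the list is empty.

    `sns_client` is expected to be boto3.client("sns", region_name).
    """
    if not idle_users:
        return None
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject="SageMaker Studio idle resources",
        Message=build_message(domain_id, idle_users, threshold_seconds),
    )
```

Calling `notify_idle_users(...)` with the `target_user` list at the end of `lambda_handler` would send one message per run, and only when at least one user exceeds the threshold.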

I hope this helps someone.
Thank you for reading this far!
