
Practicing Automotive Fluid Design with AWS AI Surrogate Models

Introduction

In the previous article, we explained the overview and three prediction types of AWS's AI Surrogate Model solution: AI Surrogate Models in Engineering on AWS (MLSimKit).

https://zenn.dev/aws_japan/articles/788fa50caa13be

[AI Surrogate Model Concept]

In this article, we will explain the procedures for actually running MLSimKit on AWS EC2 and predicting vehicle lift coefficients using the KPI prediction method.

[KPI Prediction (Key Performance Indicator Prediction) Concept]

The official tutorial for KPI prediction explains procedures such as MLSimKit installation, dataset download, and model training, but does not cover EC2 environment setup or troubleshooting when actually running it.

Therefore, this article summarizes the recommended configuration and procedure for low-cost, rapid execution that the author arrived at after several trials.

Prerequisites:

  1. Clone MLSimKit Repository (Local environment)

    git clone https://github.com/awslabs/ai-surrogate-models-in-engineering-on-aws.git
    cd ai-surrogate-models-in-engineering-on-aws
    
  2. [Optional] Install Amazon Q Developer CLI

    • The trials in this article used the AI coding agent Amazon Q Developer CLI to speed up AWS environment operations.
    • Installation method: Amazon Q Developer CLI

Execution Procedure Overview:

  1. Launch EC2
  2. SSH Connection & GPU Verification
  3. Install MLSimKit
  4. Download Data
  5. Create Manifest
  6. Execute Preprocessing
  7. Execute Training
  8. Test Prediction

Chapter 1: Preparing the Execution Environment

1-0. Architecture Overview

The AWS architecture to be built in this tutorial is as follows:

**Network Configuration:**
- **VPC**: Use default VPC (automatically created in each AWS account)
  - Example: 172.31.0.0/16 (value in this verification)
- **Subnet**: Public subnet (internet accessible)
  - Example: 172.31.32.0/20 (value in this verification)
  - Allow only your IP in security group
- **Availability Zone**: Any AZ
  - Example: us-east-1c (value in this verification)
- **Internet Gateway**: Required for SSH connection (automatically attached to default VPC)
- **Security Group**: Allow only SSH (22), source is your IP address only
- **External Connection**: Download WindsorML dataset (1.7GB) from Hugging Face

**Recommended Configuration for Production:**
- Private Subnet + NAT Gateway + Bastion Host
- Use when higher security is required

1-1. Launching EC2 Instance

**Instance Type: g5.2xlarge**
- GPU: NVIDIA A10G × 1 (24GB VRAM)
- vCPU: 8 cores
- Memory: 32GB
- Cost: $1.212/hour (us-east-1)
**AMI: Deep Learning Base GPU AMI (Ubuntu 22.04)**
- CUDA 12.1 pre-installed
- NVIDIA driver configured
- Python 3.10 included
**Storage: 100GB EBS gp3**
- WindsorML: 1.7GB
- System + work area: approximately 67GB used in practice
- 100GB recommended for headroom (leaves roughly 30GB free)

Launch Procedure

1. Create Key Pair and Save to Secrets Manager

# Create key pair and save to Secrets Manager
KEY_MATERIAL=$(aws ec2 create-key-pair \
  --key-name mlsimkit-tutorial-key \
  --query 'KeyMaterial' \
  --output text)

aws secretsmanager create-secret \
  --name mlsimkit-tutorial-key \
  --secret-string "$KEY_MATERIAL" \
  --region us-east-1

echo "Private key saved to Secrets Manager"

2. Create Security Group

# Get current IP address
MY_IP=$(curl -s https://checkip.amazonaws.com)

# Create security group
SG_ID=$(aws ec2 create-security-group \
  --group-name mlsimkit-tutorial-sg \
  --description "Security group for MLSimKit tutorial" \
  --query 'GroupId' \
  --output text)

# Allow SSH connection
aws ec2 authorize-security-group-ingress \
  --group-id $SG_ID \
  --protocol tcp \
  --port 22 \
  --cidr ${MY_IP}/32

3. Launch EC2 Instance

# Launch instance
INSTANCE_ID=$(aws ec2 run-instances \
  --image-id ami-0601999f27e2188a7 \
  --instance-type g5.2xlarge \
  --key-name mlsimkit-tutorial-key \
  --security-group-ids $SG_ID \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100,"VolumeType":"gp3"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=MLSimKit-Windsor-Tutorial}]' \
  --query 'Instances[0].InstanceId' \
  --output text)

echo "Instance ID: $INSTANCE_ID"

# Wait for instance to start
aws ec2 wait instance-running --instance-ids $INSTANCE_ID

# Get public IP
PUBLIC_IP=$(aws ec2 describe-instances \
  --instance-ids $INSTANCE_ID \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text)

echo "Public IP: $PUBLIC_IP"

1-2. SSH Connection

Retrieve Private Key

# Retrieve private key from Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id mlsimkit-tutorial-key \
  --region us-east-1 \
  --query 'SecretString' \
  --output text > mlsimkit-tutorial-key.pem

# Set permissions
chmod 400 mlsimkit-tutorial-key.pem

Connect

# Connect to EC2 (using PUBLIC_IP obtained above)
ssh -i mlsimkit-tutorial-key.pem ubuntu@$PUBLIC_IP

# Or specify IP address directly
# ssh -i mlsimkit-tutorial-key.pem ubuntu@<EC2 Public IP>

Verify Successful Connection:

# Verify GPU
nvidia-smi

# Example output:
# +-----------------------------------------------------------------------------------------+
# | NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
# |-------------------------------+----------------------+--------------------------------+
# | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
# |===============================+======================+================================|
# |   0  NVIDIA A10G              On  | 00000000:00:1E.0 Off |                    0 |
# |  0%   23C    P8              11W / 300W |      0MiB / 23028MiB |      0%      Default |
# +-------------------------------+----------------------+--------------------------------+

Important: Verify GPU Recognition

Before starting training, always verify that the GPU is properly recognized.

# Verify GPU recognition
nvidia-smi | grep "NVIDIA A10G"

Expected output:

|   0  NVIDIA A10G              On  | 00000000:00:1E.0 Off |                    0 |

If GPU is not recognized:

  • Training falls back to CPU and becomes dramatically slower
  • Refer to "GPU Recognition Error" below to resolve
  • After resolution, always re-verify GPU recognition with nvidia-smi

GPU Recognition Error

# Symptom
nvidia-smi
# NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

# Cause
# Mismatch between kernel version and NVIDIA driver
# New kernel (6.8.0-1041) not supported by driver

# Verify
uname -r  # Current kernel
dkms status  # Kernel with installed driver

# Solution 1: Reboot with older kernel
sudo grub-reboot '1>2'
sudo reboot

# Solution 2: Rebuild driver for current kernel (time-consuming)
sudo dkms install nvidia/580.95.05 -k $(uname -r)

# Verify
nvidia-smi  # Verify GPU recognition

1-3. Install MLSimKit

For an explanation of MLSimKit itself, see the previous article linked in the Introduction.

Install Prerequisites

# Install python3-venv
sudo apt-get update
sudo apt-get install -y python3.10-venv

Download MLSimKit

# Download from GitHub
wget https://github.com/awslabs/ai-surrogate-models-in-engineering-on-aws/archive/refs/heads/main.tar.gz

# Extract
tar -xzf main.tar.gz

# Change directory
cd ai-surrogate-models-in-engineering-on-aws-main

Execute Installation

# Install MLSimKit in editable mode (into the AMI's system Python)
pip install -e .

Verify Installation:

# Check version
mlsimkit-learn --version

# Example output:
# ML for Simulation Toolkit, version 0.1.0

Chapter 2: Preparing WindsorML Dataset

2-1. Download Dataset

For an explanation of the WindsorML dataset, see the previous article linked in the Introduction.

cd tutorials/kpi/windsor
ls -la

Available Scripts:

  • download-dataset - Dataset downloader
  • run-create-manifest-training - Create training manifest
  • run-training-pipeline - Execute training pipeline
  • run-create-manifest-prediction - Create prediction manifest
  • run-prediction - Execute prediction pipeline

Execute Dataset Download

# Create data storage directory
mkdir -p ~/datasets

# Execute download
./download-dataset ~/datasets

Respond to Prompt:

Download all runs? (y/n): 

→ Enter y and press Enter (download all data)

Download Contents:

  • Dataset: WindsorML
  • Size: 1.7GB
  • Cases: 355 cases (run_0 ~ run_354)
  • Source: Hugging Face (neashton/windsorml)

If Hugging Face Rate Limit Error Occurs

# Symptom
HTTP 429 Too Many Requests

# Cause
# Too many simultaneous downloads hit the rate limit

# Solution 1: Wait and retry
# Hugging Face uses local cache, so
# already downloaded files won't be re-downloaded
./download-dataset ~/datasets

# Solution 2: Download one file at a time (recommended)
# Edit download-dataset script to add max_workers=1

2-2. Verify Data Structure

Downloaded Files

# Verify directory structure
ls ~/datasets/

# Example output:
# run_0  run_1  run_2  ... run_354

Contents of One Case

# Check contents of run_0
ls ~/datasets/run_0/

# Example output:
# windsor_0.stl      ← Vehicle 3D shape
# force_mom_0.csv    ← CFD results (KPI values)

Verify CFD Results

# Check KPI values
cat ~/datasets/run_0/force_mom_0.csv

# Example output:
# cd, cs, cl, cmy
# 0.2818169578178322,0.0008234405065462456,0.48822197919945154,-0.07294317299334006

Meaning of KPI Values:

  • cd: Drag Coefficient - Air resistance
  • cs: Side force Coefficient - Crosswind effects
  • cl: Lift Coefficient - Lifting force
  • cmy: Pitching Moment Coefficient - Front-to-back rotational force

These values are results calculated by actual CFD simulation (takes several hours). MLSimKit aims to predict these values in a short time.
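Since each force_mom CSV is just a header row plus one row of values, it can be parsed with the Python standard library. A small sketch using the run_0 values shown above:

```python
import csv
import io

# Sample contents of force_mom_0.csv, as shown above
sample = """cd, cs, cl, cmy
0.2818169578178322,0.0008234405065462456,0.48822197919945154,-0.07294317299334006
"""

reader = csv.reader(io.StringIO(sample))
header = [name.strip() for name in next(reader)]  # ['cd', 'cs', 'cl', 'cmy']
values = [float(v) for v in next(reader)]

kpis = dict(zip(header, values))
print(f"Lift coefficient Cl = {kpis['cl']:.4f}")  # the value the model will learn to predict
```

To read the real file, replace `io.StringIO(sample)` with `open('~/datasets/run_0/force_mom_0.csv')` expanded to the actual path.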


Chapter 3: Creating Manifest

3-1. Create Training Manifest

Execute Manifest Creation

./run-create-manifest-training ~/datasets

Processing:

  • Scan 355 cases of STL and CSV files
  • Link shape files and KPI values for each case
  • Generate manifest file in JSON Lines format

Generated File:

# Check manifest file
ls -lh training.manifest

# Example output:
# -rw-rw-r-- 1 ubuntu ubuntu 59K Nov 14 06:00 training.manifest

The manifest links each case's geometry file to its KPI values, one JSON line per case, across all 355 cases:

Verify Manifest Contents

# Display first 3 lines
head -3 training.manifest

Example output:

{"geometry_files": ["file:///home/ubuntu/datasets/run_0/windsor_0.stl"], "kpi": [0.2818169578178322, 0.0008234405065462456, 0.48822197919945154, -0.07294317299334006]}
{"geometry_files": ["file:///home/ubuntu/datasets/run_1/windsor_1.stl"], "kpi": [0.32251110051821463, -0.059431832381329826, -0.061135997912917385, -0.04094381732630274]}

Role of Manifest:

  • Defines training data inputs (geometry_files) and outputs (kpi)
  • Lists every downloaded case; preprocessing later splits them into training/validation/test sets
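For illustration, a manifest line can be reconstructed from a case directory with standard-library Python. This is a sketch of the file format only (the helper `manifest_line` is hypothetical, not part of MLSimKit), assuming the run_N/windsor_N.stl and force_mom_N.csv layout from Chapter 2:

```python
import json
from pathlib import Path

def manifest_line(run_dir: Path, run_id: int) -> str:
    """Build one JSON Lines manifest entry linking a geometry file to its KPI values."""
    stl = run_dir / f"windsor_{run_id}.stl"
    csv_path = run_dir / f"force_mom_{run_id}.csv"
    # The second line of the CSV holds the four coefficients: cd, cs, cl, cmy
    values_line = csv_path.read_text().splitlines()[1]
    kpi = [float(v) for v in values_line.split(",")]
    return json.dumps({"geometry_files": [stl.resolve().as_uri()], "kpi": kpi})
```

Writing one such line per case reproduces the structure of training.manifest shown above.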

3-2. Verify Training Configuration

Default Configuration File

Verify the default configuration (training.yaml) from the official tutorial:

cat training.yaml

The WindsorML dataset contains four KPI values (Cd, Cs, Cl, CMy), but the default training.yaml predicts only one KPI (Cl: lift coefficient).

Important Settings:

kpi:
  manifest_uri: training.manifest

  train:
    output_kpi_indices: "2"  # Predict only Cl (lift coefficient)
    pooling_type: max        
    epochs: 100
    opt:
      learning_rate: 0.003

  predict:
    compare-groundtruth: true

Meaning of output_kpi_indices:

  • index 0: Cd (drag coefficient)
  • index 1: Cs (side force coefficient)
  • index 2: Cl (lift coefficient) ← Default
  • index 3: CMy (pitching moment coefficient)

If you want to predict all four coefficients, you need to train a separate model for each one, i.e. four training runs instead of the single run in this article.
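Conceptually, `output_kpi_indices` just selects a column from the kpi vector stored in the manifest. A toy illustration (this is not MLSimKit's actual parsing code):

```python
# Column order of the kpi vector in the WindsorML manifest
KPI_NAMES = ["cd", "cs", "cl", "cmy"]

def select_target(kpi: list[float], output_kpi_index: int) -> tuple[str, float]:
    """Pick the single coefficient a model is trained on (illustration only)."""
    return KPI_NAMES[output_kpi_index], kpi[output_kpi_index]

kpi = [0.2818, 0.0008, 0.4882, -0.0729]
name, value = select_target(kpi, 2)  # index 2 = Cl, the default in training.yaml
print(name, value)  # cl 0.4882
```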


Chapter 4: Execute Preprocessing

4-1. Execute Preprocessing (First Time Only)

KPI training reuses preprocessed data, but preprocessing must be executed the first time:

# Execute only preprocessing with default configuration
mlsimkit-learn --config training.yaml kpi preprocess

Processing:

  • Convert 350 STL files to graph structures that MGN (MeshGraphNet) can handle
  • Split data for training/validation/testing (60%/20%/20%)
  • Save preprocessed data to outputs/training/preprocessed_data

For MGN (MeshGraphNet) explanation, refer to here


4-2. Verify Data Split

# Check manifest files
wc -l outputs/training/*.manifest

# Example output:
#  210 outputs/training/train.manifest      ← Training (60%)
#   70 outputs/training/validate.manifest   ← Validation (20%)
#   70 outputs/training/test.manifest       ← Testing (20%)

Meaning of Data Split:

  • Training (60%): Used for model learning
  • Validation (20%): Performance evaluation during learning and hyperparameter tuning
  • Testing (20%): Final performance evaluation (not used in training)
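The split itself is a plain shuffle-and-slice. A toy sketch (MLSimKit performs the split internally during preprocessing, and its shuffling details may differ):

```python
import random

def split_dataset(cases: list, seed: int = 0) -> tuple[list, list, list]:
    """Shuffle and split cases 60%/20%/20% into train/validate/test."""
    cases = cases.copy()
    random.Random(seed).shuffle(cases)  # deterministic shuffle for reproducibility
    n_train = int(len(cases) * 0.6)
    n_val = int(len(cases) * 0.2)
    return (cases[:n_train],
            cases[n_train:n_train + n_val],
            cases[n_train + n_val:])

train, val, test = split_dataset([f"run_{i}" for i in range(350)])
print(len(train), len(val), len(test))  # 210 70 70
```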

4-3. Verify Preprocessed Data Size

du -sh outputs/training/preprocessed_data

# Example output:
# 3.7G    outputs/training/preprocessed_data

Chapter 5: Model Training

5-1. Execute Training Pipeline

Start Training

# Execute KPI training
nohup mlsimkit-learn --config training.yaml kpi train > training.log 2>&1 &

# Record process ID
echo $!

Execution:

  • Train only Cl (lift coefficient)
  • Execute 100 epochs

5-2. Monitor Progress

# Monitor log in real-time
tail -f training.log

# Exit with Ctrl+C

Training Flow:

[INFO] Training:   0%|          | 0/100 [00:00<?, ?epochs/s]
[INFO] Epoch 0: train loss = 0.988; validation loss = 1.209
...
[INFO] Training: 100%|██████████| 100/100 [13:50<00:00, 8.30s/epochs]
[INFO] Epoch 21: train loss = 0.405; validation loss = 0.420

Measured Time: 14 minutes
Speed: 8.3 seconds/epoch

5-3. Verify Training Results

After training completes, review the loss progression recorded in training.log:

Key Points:

  • Both Train Loss and Validation Loss decrease (learning progresses) and converge
  • Best validation loss (0.356) achieved at epoch 96
  • Small gap between Train Loss (0.359) and Validation Loss (0.356) → No overfitting, high accuracy prediction possible on new data

Train Loss: Prediction error on training data
Validation Loss: Prediction error on validation data

Formula:

Loss = MSE(pred, actual)
     = Mean((predicted value - actual value)²)
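Written as code, the loss is just the mean of squared differences. A minimal sketch with made-up Cl values (MLSimKit computes this on normalized data, so absolute magnitudes will differ):

```python
def mse_loss(pred: list[float], actual: list[float]) -> float:
    """Mean squared error: Mean((predicted value - actual value)^2)."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)

# Two hypothetical samples: (0.01^2 + (-0.03)^2) / 2 = 0.0005
print(round(mse_loss([0.50, 0.45], [0.49, 0.48]), 6))  # 0.0005
```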

Chapter 6: Evaluation on Test Data

6-1. Execute Prediction on Test Data

After training completes, evaluate final accuracy on test data (70 samples):

mlsimkit-learn --output-dir outputs/training --config training.yaml kpi predict --manifest-path outputs/training/test.manifest

Execution:

  • Predict on 70 test data samples
  • Calculate Cl (lift coefficient) accuracy
  • Measured time: 18 seconds

6-2. Verify Test Data Accuracy

# Check accuracy metrics
cat outputs/training/predictions/dataset_prediction_error_metrics.csv

[Predicted vs. actual Cl scatter plot]

How to Read the Graph:

  • X-axis: AI surrogate model predicted values
  • Y-axis: CFD simulation actual values
  • Gray diagonal line: Perfect match line

Evaluation:

  • Points distributed along diagonal line → High accuracy
  • Prediction possible across full range of Cl = -0.4 ~ 0.8
  • Almost no outliers

Accuracy on Test Data (70 samples):

| Metric | Value | Evaluation |
|---|---|---|
| MAPE | 1.87% | Excellent |
| Directional Correctness | 89.4% | Excellent |

Supplementary Explanation: MAPE (Mean Absolute Percentage Error)

Definition

Mean Absolute Percentage Error - Displays relative error as percentage

Formula
MAPE = Mean(|predicted value - actual value| / |actual value|) × 100%
Concrete Example (Drag Coefficient Cd)
Sample 1: Actual 0.30, Predicted 0.32 → |0.32-0.30|/0.30 = 0.067 = 6.7%
Sample 2: Actual 0.31, Predicted 0.28 → |0.28-0.31|/0.31 = 0.097 = 9.7%
Sample 3: Actual 0.29, Predicted 0.30 → |0.30-0.29|/0.29 = 0.034 = 3.4%

MAPE = (6.7% + 9.7% + 3.4%) / 3 = 6.6%
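The worked example above can be reproduced directly:

```python
def mape(pred: list[float], actual: list[float]) -> float:
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs(p - a) / abs(a) for p, a in zip(pred, actual)) / len(pred)

# The three Cd samples from the concrete example above
predicted = [0.32, 0.28, 0.30]
actual    = [0.30, 0.31, 0.29]
print(f"MAPE = {mape(predicted, actual):.1f}%")  # MAPE = 6.6%
```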

Supplementary Explanation: Directional Correctness

Definition

Percentage of matching increase/decrease trends

Calculation Method
For all sample pairs (A, B):
  Actual A < B and Predicted A < B → Match ✓
  Actual A > B and Predicted A > B → Match ✓
  Otherwise → Mismatch ✗

Directional Correctness = Number of matching pairs / Total pairs
Concrete Example (Drag Coefficient Cd)
Sample A: Actual 0.30, Predicted 0.28
Sample B: Actual 0.32, Predicted 0.31
Sample C: Actual 0.28, Predicted 0.33

Pair(A,B): Actual A<B, Predicted A<B → Match ✓
Pair(A,C): Actual A>C, Predicted A<C → Mismatch ✗
Pair(B,C): Actual B>C, Predicted B<C → Mismatch ✗

Directional Correctness = 1/3 = 33.3%
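The pairwise check can be written in a few lines; running it on the three samples above reproduces the 1/3 result:

```python
from itertools import combinations

def directional_correctness(pred: list[float], actual: list[float]) -> float:
    """Fraction of sample pairs where the prediction preserves the actual ordering."""
    pairs = list(combinations(range(len(pred)), 2))
    matches = sum(
        1 for i, j in pairs
        # Same sign of difference means the increase/decrease trend matches
        if (actual[i] - actual[j]) * (pred[i] - pred[j]) > 0
    )
    return matches / len(pairs)

actual    = [0.30, 0.32, 0.28]  # samples A, B, C from the example above
predicted = [0.28, 0.31, 0.33]
print(f"{directional_correctness(predicted, actual):.1%}")  # 33.3%
```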

Chapter 7: Cleanup

7-1. Stop EC2 Instance

Stop Instance

# Stop instance (execute on local machine)
aws ec2 stop-instances --instance-ids $INSTANCE_ID

# Or specify instance ID directly
# aws ec2 stop-instances --instance-ids i-XXXXX

Important:

  • Always stop after use
  • Charges continue if not stopped ($1.212/hour)

7-2. Resource Cleanup

Delete Secrets Manager Private Key (if not using this EC2 in the future)

# Delete private key
aws secretsmanager delete-secret \
  --secret-id mlsimkit-tutorial-key \
  --force-delete-without-recovery \
  --region us-east-1

# Delete local private key file
rm mlsimkit-tutorial-key.pem

Delete EC2 Key Pair and Security Group (if not using this EC2 in the future)

# Delete key pair
aws ec2 delete-key-pair --key-name mlsimkit-tutorial-key

# Delete security group
aws ec2 delete-security-group --group-id $SG_ID

Important: If you don't delete the Secrets Manager private key, charges of $0.40/month will continue.


Summary

In this article, we performed KPI prediction of fluid simulations on EC2 using AWS's AI surrogate model solution MLSimKit, targeting the 355-case WindsorML vehicle aerodynamics dataset.

Traditionally, evaluating the aerodynamic performance of a single vehicle shape required hours of CFD simulation, but using AI surrogate models achieved:

Training time: Just 14 minutes (learning from 210 cases)
Prediction time: Just 18 seconds (testing on 70 cases)
Prediction accuracy: MAPE 1.87% (prediction error for lift coefficient Cl - excellent)

This enables high-accuracy prediction of aerodynamic performance for new vehicle shapes in seconds, making it possible to quickly and cost-effectively evaluate numerous shape proposals in the early design stages.

For a comprehensive overview of MLSimKit, please refer to this article:
https://zenn.dev/aws_japan/articles/788fa50caa13be


Amazon Web Services Japan (volunteer authors)
