Fetching Baseball Savant Bat Tracking Data Directly and Publishing to Kaggle
Introduction
MLB introduced Bat Tracking, a new Statcast feature, in the 2024 season. It measures each batter's swing with high-speed cameras and provides new metrics such as bat speed and swing length.
In this article, I document how I fetched this data, which pybaseball does not yet support, directly from Baseball Savant and published it as a Kaggle dataset.
📊 Created Dataset
MLB Bat Tracking Leaderboard 2024-2025
- Number of rows: 452 batter-seasons (2024: 226, 2025: 226)
- Number of columns: 19
- File size: 107 KB
- Seasons: 2024–2025 (minimum 50 swings)
🔍 What is Bat Tracking?
Bat Tracking is a Statcast feature that MLB introduced in the 2024 season. It reports the following metrics:
- avg_bat_speed: Average bat speed (mph)
- swing_length: Swing path length (feet)
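- squared_up_per_bat_contact: Squared-up rate per contact (exit velocity reaches at least 80% of the maximum possible for the bat speed and pitch speed)
- hard_swing_rate: Fast-swing rate (share of swings at 75 mph or faster)
- blast_per_bat_contact: Blast rate per contact (squared-up contact made with a fast swing)
- swords: Number of swords (awkward, non-competitive swings)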
🚨 Limitations of pybaseball
Normally, MLB Statcast data can be retrieved using the Python library pybaseball. However, Bat Tracking is a new feature for 2024 and is not yet implemented as of pybaseball 2.2.7.
from pybaseball import statcast_batter_bat_tracking
# ❌ ImportError: cannot import name 'statcast_batter_bat_tracking'
So I downloaded the CSVs directly from Baseball Savant instead.
🛠️ How to Retrieve Directly from Baseball Savant
Baseball Savant CSV Endpoint
Baseball Savant leaderboard pages accept a csv=true query parameter that returns the leaderboard as a CSV download.
URL Structure:
https://baseballsavant.mlb.com/leaderboard/{type}?year={year}&csv=true
For Bat Tracking:
https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2024&csv=true
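The URL pattern above can be assembled programmatically. This is a minimal sketch: the function name `build_leaderboard_url` and the constant `BASE_URL` are my own illustrative names, and it assumes the leaderboard only needs the `year` and `csv` parameters shown in this article.

```python
from urllib.parse import urlencode

# Base path taken from the URL structure shown in this article.
BASE_URL = "https://baseballsavant.mlb.com/leaderboard"


def build_leaderboard_url(leaderboard: str, year: int) -> str:
    """Return the CSV download URL for one leaderboard and one season."""
    # urlencode keeps the dict's insertion order: year first, then csv.
    query = urlencode({"year": year, "csv": "true"})
    return f"{BASE_URL}/{leaderboard}?{query}"


print(build_leaderboard_url("bat-tracking", 2024))
```

Passing a different leaderboard slug or year produces the matching URL without editing strings by hand.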
Method 1: Download with curl
The easiest way is to use curl.
# 2024 data
curl -s "https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2024&csv=true" \
-o mlb_bat_tracking_2024.csv
# 2025 data
curl -s "https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2025&csv=true" \
-o mlb_bat_tracking_2025.csv
Method 2: Download with Python
In Python, fetch the CSV with the requests library and parse the response text with pandas via StringIO.
import pandas as pd
import requests
from io import StringIO
# 2024 data
url_2024 = "https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2024&csv=true"
response_2024 = requests.get(url_2024, timeout=30)
response_2024.raise_for_status()  # stop early on an HTTP error
df_2024 = pd.read_csv(StringIO(response_2024.text))
df_2024['season'] = 2024
# 2025 data
url_2025 = "https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2025&csv=true"
response_2025 = requests.get(url_2025, timeout=30)
response_2025.raise_for_status()  # stop early on an HTTP error
df_2025 = pd.read_csv(StringIO(response_2025.text))
df_2025['season'] = 2025
# Combine both seasons and save them as a single CSV
df_combined = pd.concat([df_2024, df_2025], ignore_index=True)
df_combined.to_csv('mlb_bat_tracking_2024_2025.csv', index=False)
# One row per batter per season, so a batter can appear twice
print(f"Total rows: {len(df_combined)}")
print(f"Columns: {len(df_combined.columns)}")
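The parse-and-combine step can also be written as a loop so that adding a season is a one-line change. The sketch below is only an illustration: `combine_seasons` is a hypothetical helper name, and the sample CSV strings are made up to stand in for each season's `response.text`.

```python
from io import StringIO

import pandas as pd


def combine_seasons(csv_text_by_year: dict[int, str]) -> pd.DataFrame:
    """Parse one CSV string per season and stack them with a season column."""
    frames = []
    for year, csv_text in sorted(csv_text_by_year.items()):
        frame = pd.read_csv(StringIO(csv_text))
        frame["season"] = year  # tag every row with its season
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Made-up sample rows; real input would be the downloaded CSV text per season.
sample = {
    2024: "id,name,avg_bat_speed\n1,Player A,72.5\n",
    2025: "id,name,avg_bat_speed\n1,Player A,73.1\n2,Player B,70.0\n",
}
combined = combine_seasons(sample)
print(combined.shape)
```

Keeping the parsing separate from the download makes it possible to check the combining logic without hitting the network.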
📦 Creating the Kaggle Dataset
1. Creating dataset-metadata.json
{
"title": "MLB Bat Tracking Leaderboard 2024-2025",
"id": "your-username/mlb-bat-tracking-2024-2025",
"licenses": [{"name": "CC0-1.0"}],
"keywords": ["baseball", "sports"]
}
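If you prefer to generate the metadata file rather than write it by hand, something like the following may work. It is a sketch under assumptions: `write_dataset_metadata` is a hypothetical helper, and the field values simply mirror the JSON above.

```python
import json
import tempfile
from pathlib import Path


def write_dataset_metadata(folder: str, username: str, slug: str, title: str) -> Path:
    """Write dataset-metadata.json into the upload folder and return its path."""
    metadata = {
        "title": title,
        "id": f"{username}/{slug}",
        "licenses": [{"name": "CC0-1.0"}],
        "keywords": ["baseball", "sports"],
    }
    target = Path(folder) / "dataset-metadata.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target


# Demonstration in a throwaway directory; "your-username" is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    path = write_dataset_metadata(
        tmp,
        "your-username",
        "mlb-bat-tracking-2024-2025",
        "MLB Bat Tracking Leaderboard 2024-2025",
    )
    loaded = json.loads(path.read_text(encoding="utf-8"))
print(loaded["id"])
```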
2. Uploading with Kaggle CLI
kaggle datasets create -p mlb-bat-tracking/
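Before running the upload, it can help to confirm that the folder contains what the command expects. The check below is a rough sketch, not part of the Kaggle CLI: `find_upload_problems` is a hypothetical helper, and the required keys are an assumption based on the metadata example in this article.

```python
import json
import tempfile
from pathlib import Path

# Keys used in this article's metadata example (assumed to be the minimum).
REQUIRED_KEYS = ("title", "id", "licenses")


def find_upload_problems(folder: str) -> list[str]:
    """Return human-readable problems; an empty list means the folder looks ready."""
    root = Path(folder)
    problems = []
    metadata_path = root / "dataset-metadata.json"
    if not metadata_path.is_file():
        problems.append("dataset-metadata.json is missing")
    else:
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
        for key in REQUIRED_KEYS:
            if key not in metadata:
                problems.append(f"metadata key missing: {key}")
    if not any(root.glob("*.csv")):
        problems.append("no CSV file found in the folder")
    return problems


# An empty temporary folder triggers both checks.
with tempfile.TemporaryDirectory() as tmp:
    problems = find_upload_problems(tmp)
print(problems)
```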
📁 Data Contents
Key Columns
| Column | Description |
|---|---|
| id | Baseball Savant player ID |
| name | Player name |
| swings_competitive | Number of competitive swings |
| avg_bat_speed | Average bat speed (mph) |
| swing_length | Swing path length (feet) |
| hard_swing_rate | Fast-swing rate (share of swings at 75 mph or faster) |
| squared_up_per_bat_contact | Squared-up rate per contact |
| blast_per_bat_contact | Blast rate per contact (squared-up contact made with a fast swing) |
| swords | Number of swords (awkward, non-competitive swings) |
| whiff_per_swing | Whiff rate |
| batted_ball_event_per_swing | Batted ball event rate |
| season | Season year (2024 or 2025) |
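Since Baseball Savant may change its CSV layout, a quick header check before analysis can catch missing columns early. This is a minimal sketch: `missing_columns` is a hypothetical helper, the column list is copied from the table above, and the sample header is made up for illustration.

```python
import csv
from io import StringIO

# Key columns listed in this article's table.
KEY_COLUMNS = [
    "id", "name", "swings_competitive", "avg_bat_speed", "swing_length",
    "hard_swing_rate", "squared_up_per_bat_contact", "blast_per_bat_contact",
    "swords", "whiff_per_swing", "batted_ball_event_per_swing", "season",
]


def missing_columns(csv_text: str, expected=KEY_COLUMNS) -> list[str]:
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Made-up CSV with only four of the key columns.
sample = "id,name,avg_bat_speed,season\n1,Player A,72.5,2024\n"
print(missing_columns(sample))
```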
🔧 How to Use in a Kaggle Notebook
1. Specifying the Dataset in kernel-metadata.json
{
"id": "your-username/your-notebook-slug",
"title": "Your Notebook Title",
"code_file": "your-notebook.ipynb",
"language": "python",
"kernel_type": "notebook",
"dataset_sources": ["yasunorim/mlb-bat-tracking-2024-2025"]
}
2. Loading Data within the Notebook
import pandas as pd
# Loading the dataset
df = pd.read_csv('/kaggle/input/mlb-bat-tracking-2024-2025/mlb_bat_tracking_2024_2025.csv')
print(f'Total records: {len(df)}')
print(f'Unique batters: {df["id"].nunique()}')
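Once the data is loaded, a typical first step is to rank batters within one season. The following is a small sketch, assuming the `name`, `avg_bat_speed`, and `season` columns described earlier; `fastest_swingers` is a hypothetical helper and the demo rows are invented.

```python
import pandas as pd


def fastest_swingers(df: pd.DataFrame, season: int, n: int = 3) -> pd.DataFrame:
    """Return the n batters with the highest avg_bat_speed in one season."""
    one_season = df[df["season"] == season]
    return one_season.nlargest(n, "avg_bat_speed")[["name", "avg_bat_speed"]]


# Made-up rows that only mimic the dataset's column names.
demo = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "avg_bat_speed": [72.5, 75.0, 70.1],
    "season": [2024, 2024, 2025],
})
top = fastest_swingers(demo, season=2024, n=1)
print(top)
```

In a Kaggle notebook you would pass the `df` loaded from `/kaggle/input/...` instead of the demo frame.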
🔗 Analysis Notebooks Using This Dataset
I have published a Bat Tracking analysis notebook for three players with ties to Japan (Shohei Ohtani, Seiya Suzuki, and Lars Nootbaar).
💡 Other Baseball Savant Leaderboards
You can also retrieve other leaderboards using the same method.
# Expected Stats (xBA, xwOBA, etc.)
curl -s "https://baseballsavant.mlb.com/leaderboard/expected_statistics?year=2024&csv=true"
# Outs Above Average (Fielding metrics)
curl -s "https://baseballsavant.mlb.com/leaderboard/outs_above_average?year=2024&csv=true"
# Pitch Arsenal (Pitch type data)
curl -s "https://baseballsavant.mlb.com/leaderboard/pitch-arsenal-stats?year=2024&csv=true"
📝 What I Learned
How to Deal with Features Not Implemented in pybaseball
- It takes time for the latest MLB statistical features to be reflected in pybaseball.
- Since Baseball Savant provides CSV endpoints, data can be retrieved directly.
- User-Agent issues can be avoided by using requests + StringIO.
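If a request is ever rejected because of its default User-Agent, one option is to set the header explicitly. This is a sketch under assumptions: `fetch_csv_text` is a hypothetical helper, the header value is an arbitrary example string, and I have not confirmed which values Baseball Savant accepts.

```python
import requests

# Hypothetical header value; an assumption, not a documented requirement.
HEADERS = {"User-Agent": "bat-tracking-fetch-example/0.1"}


def fetch_csv_text(url: str, get=requests.get) -> str:
    """Download a CSV endpoint and return its body as text.

    `get` is injectable so the function can be exercised without network access.
    """
    response = get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.text
```

The returned text can then be parsed as before with `pd.read_csv(StringIO(fetch_csv_text(url)))`.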
The Value of Kaggle Datasets
- New metrics not found in existing datasets attract attention (Bat Tracking is a new feature for 2024).
- Even a small dataset (107KB) can be valuable if it is topical.
- Publishing a dataset and an analysis notebook as a set creates a synergistic effect.
🔗 Links
- Dataset: https://www.kaggle.com/datasets/yasunorim/mlb-bat-tracking-2024-2025
- Analysis Notebook: https://www.kaggle.com/code/yasunorim/bat-tracking-japanese-mlb-batters-2024-2025
- GitHub Repository: https://github.com/yasumorishima/kaggle-datasets
- Baseball Savant Bat Tracking: https://baseballsavant.mlb.com/leaderboard/bat-tracking