Fetching Baseball Savant Bat Tracking Data Directly and Publishing to Kaggle

Introduction

MLB introduced a new Statcast feature called "Bat Tracking" in the 2024 season. It measures batters' swings with high-speed cameras and provides new metrics such as bat speed and swing length.

In this article, I record how I fetched data that pybaseball does not yet support directly from Baseball Savant and published it as a Kaggle dataset.


📊 Created Dataset

https://www.kaggle.com/datasets/yasunorim/mlb-bat-tracking-2024-2025

MLB Bat Tracking Leaderboard 2024-2025

  • Number of Rows: 452 batter-seasons (2024: 226, 2025: 226)
  • Number of Columns: 19
  • File Size: 107KB
  • Period: 2024–2025 (Minimum swings: 50)

🔍 What is Bat Tracking?

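Bat Tracking is a Statcast feature that MLB introduced in the 2024 season. The leaderboard reports the following metrics:

  • avg_bat_speed: Average bat speed (mph)
  • swing_length: Swing length, the distance the bat head travels up to contact (feet)
  • squared_up_per_bat_contact: Squared-up rate per contact (contact that reaches at least 80% of the maximum exit velocity possible for the given bat speed and pitch speed)
  • hard_swing_rate: Fast-swing rate (share of swings with a bat speed of 75 mph or more)
  • blast_per_bat_contact: Blast rate per contact (contact that is both squared up and made with a fast swing)
  • swords: Number of "swords" (awkward, incomplete swings and misses forced by the pitcher, as defined by Baseball Savant)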

🚨 Limitations of pybaseball

Normally, MLB Statcast data can be retrieved using the Python library pybaseball. However, Bat Tracking is a new feature for 2024 and is not yet implemented as of pybaseball 2.2.7.

from pybaseball import statcast_batter_bat_tracking

# ❌ ImportError: cannot import name 'statcast_batter_bat_tracking'

So instead, I downloaded the CSVs directly from Baseball Savant.


🛠️ How to Retrieve Directly from Baseball Savant

Baseball Savant CSV Endpoint

Baseball Savant leaderboard pages accept a csv=true query parameter, which returns the leaderboard as a CSV file instead of an HTML page.

URL Structure:

https://baseballsavant.mlb.com/leaderboard/{type}?year={year}&csv=true

For Bat Tracking:

https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2024&csv=true
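The URL pattern above can be wrapped in a small helper so that the leaderboard type and year are the only things that change. This is a minimal sketch: the function name `leaderboard_csv_url` and the constant `SAVANT_BASE` are my own, and only the `year` and `csv` parameters shown in this article are assumed to exist.

```python
from urllib.parse import urlencode

# Base path of the leaderboard pages, taken from the URL structure above
SAVANT_BASE = "https://baseballsavant.mlb.com/leaderboard"


def leaderboard_csv_url(board: str, year: int) -> str:
    """Build the CSV download URL for one leaderboard and season (hypothetical helper)."""
    query = urlencode({"year": year, "csv": "true"})
    return f"{SAVANT_BASE}/{board}?{query}"


print(leaderboard_csv_url("bat-tracking", 2024))
```

Using `urlencode` keeps the query string correctly escaped if more parameters are added later.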

Method 1: Download with curl

The easiest way is to use curl.

# 2024 data
curl -s "https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2024&csv=true" \
  -o mlb_bat_tracking_2024.csv

# 2025 data
curl -s "https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2025&csv=true" \
  -o mlb_bat_tracking_2025.csv

Method 2: Download with Python

In Python, fetch the CSV with the requests library and pass the response text to pandas through StringIO.

import pandas as pd
import requests
from io import StringIO

BASE_URL = "https://baseballsavant.mlb.com/leaderboard/bat-tracking"

frames = []
for season in (2024, 2025):
    # Request one season of the leaderboard as CSV
    response = requests.get(BASE_URL, params={"year": season, "csv": "true"}, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    df = pd.read_csv(StringIO(response.text))
    df["season"] = season  # tag every row with its season
    frames.append(df)

# Combine both seasons and save them as a single CSV
df_combined = pd.concat(frames, ignore_index=True)
df_combined.to_csv("mlb_bat_tracking_2024_2025.csv", index=False)

print(f"Total rows: {len(df_combined)}")
print(f"Columns: {len(df_combined.columns)}")
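The combined frame has one row per batter per season, so its length counts batter-seasons rather than distinct players. The sketch below shows one way to sanity-check the result before publishing. It assumes the frame has the `id` and `season` columns described in this article; the function name `summarize_by_season` is hypothetical.

```python
import pandas as pd


def summarize_by_season(df: pd.DataFrame) -> pd.DataFrame:
    """Return row counts and unique player IDs per season (hypothetical helper)."""
    # A batter should appear at most once per season in a leaderboard export
    duplicates = int(df.duplicated(subset=["id", "season"]).sum())
    if duplicates:
        raise ValueError(f"{duplicates} duplicate (id, season) rows found")
    return df.groupby("season").agg(rows=("id", "size"), unique_ids=("id", "nunique"))


# Toy data standing in for df_combined: player 1 appears in both seasons
toy = pd.DataFrame({"id": [1, 2, 1, 3], "season": [2024, 2024, 2025, 2025]})
print(summarize_by_season(toy))
print(f"Rows: {len(toy)}, distinct players: {toy['id'].nunique()}")
```

On the real data, a gap between the row count and the number of distinct players simply reflects batters who qualified in both seasons.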

📦 Creating the Kaggle Dataset

1. Creating dataset-metadata.json

{
  "title": "MLB Bat Tracking Leaderboard 2024-2025",
  "id": "your-username/mlb-bat-tracking-2024-2025",
  "licenses": [{"name": "CC0-1.0"}],
  "keywords": ["baseball", "sports"]
}
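If you prefer to generate the metadata file from a script, for example to keep the slug and title in one place, something like the following would work. This is a sketch: `build_dataset_metadata` and `write_dataset_metadata` are hypothetical helpers, and the fields are limited to the four shown above.

```python
import json
from pathlib import Path


def build_dataset_metadata(username: str, slug: str, title: str) -> dict:
    """Assemble the same fields as the dataset-metadata.json shown above."""
    return {
        "title": title,
        "id": f"{username}/{slug}",
        "licenses": [{"name": "CC0-1.0"}],
        "keywords": ["baseball", "sports"],
    }


def write_dataset_metadata(meta: dict, folder: str) -> Path:
    """Write dataset-metadata.json into the folder that also holds the CSV."""
    path = Path(folder) / "dataset-metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return path


meta = build_dataset_metadata(
    "your-username", "mlb-bat-tracking-2024-2025", "MLB Bat Tracking Leaderboard 2024-2025"
)
print(json.dumps(meta, indent=2))
```

Calling `write_dataset_metadata(meta, "mlb-bat-tracking")` would place the file next to the CSV in the folder that is later passed to the Kaggle CLI.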

2. Uploading with Kaggle CLI

kaggle datasets create -p mlb-bat-tracking/

📁 Data Contents

Key Columns

Column                       Description
id                           Baseball Savant player ID
name                         Player name
swings_competitive           Number of competitive swings
avg_bat_speed                Average bat speed (mph)
swing_length                 Swing length (feet)
hard_swing_rate              Fast-swing rate (swings at 75 mph or more)
squared_up_per_bat_contact   Squared-up rate per contact
blast_per_bat_contact        Blast rate per contact (squared up with a fast swing)
swords                       Number of swords (awkward, incomplete swings and misses)
whiff_per_swing              Whiff rate per swing
batted_ball_event_per_swing  Batted ball event rate per swing
season                       Season year (2024 or 2025)
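Because the CSV comes from a web page rather than a versioned API, the column names could change without notice. A quick header check before publishing catches that early. The sketch below uses only the standard library; `EXPECTED_COLUMNS` lists the key columns from the table above (minus `season`, which is added locally), and `missing_columns` is a hypothetical helper.

```python
import csv
import io

# Key columns from the table above; "season" is added after download, so it is not checked
EXPECTED_COLUMNS = {
    "id", "name", "swings_competitive", "avg_bat_speed", "swing_length",
    "hard_swing_rate", "squared_up_per_bat_contact", "blast_per_bat_contact",
    "swords", "whiff_per_swing", "batted_ball_event_per_swing",
}


def missing_columns(csv_text: str, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from the CSV header."""
    # Strip a byte-order mark defensively, in case the download starts with one
    reader = csv.reader(io.StringIO(csv_text.lstrip("\ufeff")))
    header = next(reader, [])
    present = {name.strip() for name in header}
    return sorted(set(expected) - present)


sample = "id,name,avg_bat_speed\n1,Example Batter,72.5\n"
print(missing_columns(sample, {"id", "avg_bat_speed", "swords"}))
```

An empty list means every expected column is present; an HTML error page saved as a CSV would report nearly all of them as missing.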

🔧 How to Use in a Kaggle Notebook

1. Specifying the Dataset in kernel-metadata.json

{
  "id": "your-username/your-notebook-slug",
  "title": "Your Notebook Title",
  "code_file": "your-notebook.ipynb",
  "language": "python",
  "kernel_type": "notebook",
  "dataset_sources": ["yasunorim/mlb-bat-tracking-2024-2025"]
}

2. Loading Data within the Notebook

import pandas as pd

# Loading the dataset
df = pd.read_csv('/kaggle/input/mlb-bat-tracking-2024-2025/mlb_bat_tracking_2024_2025.csv')

print(f'Total records: {len(df)}')
print(f'Unique batters: {df["id"].nunique()}')
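As an example of what the two-season layout makes possible, the sketch below compares each batter's average bat speed between 2024 and 2025. It assumes one row per `id` and `season` and the `avg_bat_speed` column described above; `bat_speed_change` is a hypothetical helper, and the toy values are made up for illustration.

```python
import pandas as pd


def bat_speed_change(df: pd.DataFrame, first: int = 2024, last: int = 2025) -> pd.DataFrame:
    """Year-over-year change in avg_bat_speed for batters present in both seasons."""
    # One row per batter, one column per season (requires unique id/season pairs)
    wide = df.pivot(index="id", columns="season", values="avg_bat_speed")
    both = wide.dropna(subset=[first, last])
    result = both.assign(change=both[last] - both[first])
    return result.sort_values("change", ascending=False)


# Made-up values: batter 3 only appears in 2024 and is dropped from the comparison
toy_speeds = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "season": [2024, 2025, 2024, 2025, 2024],
    "avg_bat_speed": [70.0, 72.0, 75.0, 74.0, 68.0],
})
print(bat_speed_change(toy_speeds))
```

In the notebook, passing the loaded `df` instead of the toy frame gives the same comparison for the real leaderboard.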

🔗 Analysis Notebooks Using This Dataset

I have published a Bat Tracking analysis notebook for three players with ties to Japan (Shohei Ohtani, Seiya Suzuki, and Lars Nootbaar).

https://www.kaggle.com/code/yasunorim/bat-tracking-japanese-mlb-batters-2024-2025


💡 Other Baseball Savant Leaderboards

You can also retrieve other leaderboards using the same method.

# Expected Stats (xBA, xwOBA, etc.)
curl -s "https://baseballsavant.mlb.com/leaderboard/expected_statistics?year=2024&csv=true"

# Outs Above Average (Fielding metrics)
curl -s "https://baseballsavant.mlb.com/leaderboard/outs_above_average?year=2024&csv=true"

# Pitch Arsenal (Pitch type data)
curl -s "https://baseballsavant.mlb.com/leaderboard/pitch-arsenal-stats?year=2024&csv=true"

📝 What I Learned

How to Deal with Features Not Implemented in pybaseball

  • It takes time for the latest MLB statistical features to be reflected in pybaseball.
  • Since Baseball Savant provides CSV endpoints, data can be retrieved directly.
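  • Passing the URL straight to pd.read_csv can fail because of the default User-Agent it sends; fetching with requests and parsing the response text through StringIO avoids this.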
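If a download is still rejected, setting the User-Agent header explicitly is one thing to try. The sketch below uses the standard library's `urllib.request` rather than requests, so it runs without extra dependencies. The User-Agent string is an arbitrary placeholder, and I have not confirmed which values Baseball Savant accepts or rejects.

```python
import urllib.request

# Placeholder identifier; replace with something that describes your script
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; bat-tracking-fetch/0.1)"


def build_request(url: str, user_agent: str = DEFAULT_USER_AGENT) -> urllib.request.Request:
    """Create a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_csv_text(url: str, timeout: float = 30.0) -> str:
    """Download the CSV and return it as text (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # utf-8-sig drops a leading byte-order mark if one is present
        return response.read().decode("utf-8-sig")


request = build_request(
    "https://baseballsavant.mlb.com/leaderboard/bat-tracking?year=2024&csv=true"
)
print(request.get_header("User-agent"))
```

The returned text can be passed to `pd.read_csv(StringIO(text))` exactly as in the requests version.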

The Value of Kaggle Datasets

  • New metrics not found in existing datasets attract attention (Bat Tracking is a new feature for 2024).
  • Even a small dataset (107KB) can be valuable if it is topical.
  • Publishing a dataset together with an analysis notebook makes both easier to discover and use.
