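Jupyter Notebook code memo

I often get stuck connecting what I want to do with how to implement it, so I'm collecting snippets here as a memo.
Note: imports are omitted from the individual snippets.
Common imports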

import japanize_matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

torohash

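Building a correlogram

Useful for looking at autocorrelation across many lags at once.

abs_ror = df['ror'].abs()
corr = {}
for i in range(1, 30):
    # Correlation between |ror| and itself i periods earlier
    corr[i] = abs_ror.corr(abs_ror.shift(i))
plt.figure(figsize=(10, 3))
plt.title('Autocorrelation of 1-hour volatility')
pd.Series(corr).plot(kind='bar')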
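
As a small cross-check of the loop above, the sketch below computes the same lagged correlations with `Series.autocorr`, which is pandas' built-in shortcut for correlating a series with a shifted copy of itself. The return series here is synthetic and exists only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic returns, for illustration only (not real market data).
rng = np.random.default_rng(0)
ror = pd.Series(rng.normal(0, 0.01, 500))
abs_ror = ror.abs()

# Explicit version: correlate the series with a copy shifted by `lag`.
corr_loop = {lag: abs_ror.corr(abs_ror.shift(lag)) for lag in range(1, 30)}

# Shortcut: Series.autocorr(lag) computes the same Pearson correlation.
corr_autocorr = {lag: abs_ror.autocorr(lag) for lag in range(1, 30)}

correlogram = pd.Series(corr_autocorr)
print(correlogram.head())
```

Either dictionary can be passed to `pd.Series(...).plot(kind='bar')` as in the snippet above.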


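Building a cumulative correlogram

For when you want the correlation between the current return and the cumulative return over the previous k periods.

corr = {}
for i in range(1, 30):
    # Sum of the previous i returns, shifted so the current bar is excluded
    ror = df['ror'].rolling(i).sum().shift(1)
    corr[i] = df['ror'].corr(ror)
plt.figure(figsize=(10, 3))
plt.title('Cumulative correlogram: return vs. return over the previous K periods')
pd.Series(corr).plot(kind='bar')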


Drawing a correlation scatter plot

alpha is the point transparency and s is the point size.
corr is Pearson's correlation coefficient.

abs_ror = df['ror'].abs()
corr = abs_ror.corr(abs_ror.shift(1))
plt.title(f'Autocorrelation of volatility (corr={corr:.3f})')
plt.xlabel('Volatility (t)')
plt.ylabel('Volatility (t-1)')
plt.scatter(abs_ror, abs_ror.shift(1), alpha=0.5, s=8, label=f'{corr:.3f}')
plt.legend()


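Simple linear regression (OLS)

Parts of the summary worth checking:

R-squared (coefficient of determination): the share of the variance in y explained by X. In a simple regression with an intercept it equals the square of Pearson's correlation coefficient.
coef (regression coefficient): the change in y for each one-unit increase in X.
p-value (P>|t| in the summary): tests the null hypothesis that the coefficient is zero; below 0.05 is the usual threshold for rejecting it.
I don't understand the rest yet.

X = ohlcv_1d['bybit']['bybit_volume'].shift(1)
y = ohlcv_1d['bybit']['bybit_ror'].abs()
# Drop the first row, which is NaN after shift(1)
X = X.iloc[1:]
y = y.iloc[1:]
# Add the intercept column
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())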
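
The relationship between R-squared and Pearson's correlation coefficient can be checked numerically. The following is a minimal sketch on synthetic data that uses `np.polyfit` for the least-squares fit instead of statsmodels, so it runs with NumPy alone; the variable names and data are made up for illustration.

```python
import numpy as np

# Synthetic data, for illustration only: y depends linearly on x plus noise.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# Pearson's correlation coefficient between x and y.
pearson_r = np.corrcoef(x, y)[0, 1]

print(f'R-squared: {r_squared:.4f}, r^2: {pearson_r**2:.4f}')
```

With statsmodels, the same quantities can be read from the fitted results without parsing the summary: `results.rsquared`, `results.params` and `results.pvalues`.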


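ADF test (stationarity check)

The ADF test is a common way to check stationarity (roughly, whether the mean and variance stay stable over time). Its null hypothesis is that the series has a unit root, i.e. is non-stationary, so a p-value below 0.05 rejects that and suggests the series is stationary.
Raw financial price series (closing prices and so on) are typically non-stationary, but converting them to returns or log returns (log differences) usually makes them stationary.

# Drop NaNs first (e.g. the first row of a returns series)
statistic, pvalue, _, _, _, _ = adfuller(df['ror'].dropna())
print(f'ADF statistic: {statistic:.3f}, p-value: {pvalue:.3f}')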


Example backtest pattern

The signal can also be built with np.sign() to take the sign of a value.

# Build the signal (float column so fractional values can be assigned)
strategy['signal'] = 0.0
strategy.loc[(strategy['ror'] < 0.0), 'signal'] = -0.5
strategy.loc[(strategy['ror'] < -0.01), 'signal'] = -1.0
strategy.loc[(strategy['ror'] > 0.0), 'signal'] = 0.5
strategy.loc[(strategy['ror'] > 0.01), 'signal'] = 1.0
print(strategy['signal'].value_counts())
# Fees (charged on every change in position)
fee = 0.001 # one way
strategy['fee'] = np.abs(strategy['signal'].diff()) * fee
# PnL
# Note: this signal is derived from the same bar's ror; with a real signal, use only
# information available before the bar (e.g. shift the signal) to avoid look-ahead.
strategy['pnl'] = strategy['signal'] * strategy['ror']
# PnL after fees
strategy['pnl_with_fee'] = strategy['pnl'] - strategy['fee']
# Drawdown (its maximum is the max DD)
strategy['dd'] = strategy['pnl'].cumsum().cummax() - strategy['pnl'].cumsum()
plt.figure(figsize=(10, 3))
plt.plot(strategy['pnl'].cumsum(), label='Strategy', color='magenta')
plt.plot(strategy['pnl_with_fee'].cumsum(), label='Strategy (after fees)', color='magenta', linestyle=':')
plt.plot(strategy['ror'].cumsum(), label='HODL', color='lightgray')
plt.plot(strategy['dd'], label='DD', color='k')
plt.legend();
sr_daily = strategy["pnl_with_fee"].mean()/strategy["pnl_with_fee"].std()
# Annualize assuming daily bars and 365 trading days a year
sr_anual = sr_daily * np.sqrt(365)
print(f'SR(daily): {sr_daily:.5f}')
print(f'SR(yearly): {sr_anual:.5f}')
print(f'Max DD: {strategy["dd"].max():.5f}')
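
The `np.sign()` variant mentioned above could look like the following. This is a minimal sketch on a tiny hand-made return series, and it assumes a momentum-style rule (follow the sign of the previous bar's return); the one-bar shift is my addition so that the signal only uses information available before the bar it trades.

```python
import numpy as np
import pandas as pd

# Hand-made returns, for illustration only.
ror = pd.Series([0.01, -0.02, 0.03, -0.01, 0.02])

# Signal: sign of the previous bar's return (+1, 0 or -1).
# shift(1) keeps the current bar's return out of its own signal.
signal = np.sign(ror).shift(1).fillna(0.0)

# One-way fee charged on every change in position size.
fee = 0.001
fees = signal.diff().abs().fillna(0.0) * fee

# PnL after fees, cumulative PnL, and drawdown from the running peak.
pnl = signal * ror - fees
equity = pnl.cumsum()
dd = equity.cummax() - equity

print(f'Max DD: {dd.max():.5f}')
```

On this toy series the rule loses on every bar, which is a reminder that the shift matters: without it, the signal would match the sign of each return and the PnL would be positive by construction.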


Data cleansing

# Missing-value check (all zeros means OK)
df.isnull().sum()

# Check for rows with a duplicated index
df[df.index.duplicated()]
# Drop them (the first occurrence is kept)
df = df[~df.index.duplicated()]

# Gap check (e.g. whether a daily series really has a row for every day)
datetime_index_full = pd.date_range(start=df.index.min(), end=df.index.max(), freq='D')
missing_index = datetime_index_full.difference(df.index)
len(missing_index)
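
To make the duplicate and gap checks concrete, here is a small sketch on a hand-made daily frame with one duplicated date and two missing days. The data and column name are invented for illustration.

```python
import pandas as pd

# Hand-made daily data: 2024-01-02 appears twice, 01-03 and 01-04 are missing.
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-05'])
df = pd.DataFrame({'close': [100.0, 101.0, 101.0, 104.0]}, index=idx)

# Drop rows whose index value was already seen (the first occurrence is kept).
df = df[~df.index.duplicated()]

# Compare against a complete daily index to find the missing dates.
full_index = pd.date_range(start=df.index.min(), end=df.index.max(), freq='D')
missing_index = full_index.difference(df.index)

print(missing_index)
```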


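Resampling from a shorter to a longer timeframe

The example resamples 1-hour bars to daily bars.

def resample_ohlcv(df, timeframe):
    new_df = df.resample(timeframe).agg({
        'open': 'first',
        'high': 'max',
        'low': 'min',
        'close': 'last',
        'volume': 'sum',
        'ror': 'sum'  # exact for log returns, an approximation for simple returns
    })
    return new_df
df = resample_ohlcv(df, '1D')
# Convert the index to day-level periods and back to timestamps (midnight of each day)
df.index = df.index.to_period('D').to_timestamp()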
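
The aggregation rules can be verified on synthetic hourly bars. The sketch below builds two days of made-up 1-hour data, assumes `ror` is a log return (so that summing it is exact), and resamples to daily bars with the same rules as above.

```python
import numpy as np
import pandas as pd

# Two days of synthetic 1-hour bars, for illustration only.
idx = pd.date_range('2024-01-01', periods=48, freq=pd.Timedelta(hours=1))
close = pd.Series(np.arange(1.0, 49.0), index=idx)
hourly = pd.DataFrame({
    'open': close - 0.5,
    'high': close + 1.0,
    'low': close - 1.0,
    'close': close,
    'volume': 10.0,
})
# Assumption: ror is the hourly log return.
hourly['ror'] = np.log(hourly['close']).diff()

daily = hourly.resample('1D').agg({
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum',
    'ror': 'sum',
})
print(daily)
```

Because log returns add up, the second day's `ror` equals the log of that day's last close over the previous day's last close.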


Filling missing values with a linear model when data has gaps

If some series (A) has missing values and you can get another series (B) that is strongly correlated with it, you can fill the gaps by predicting A with B as the explanatory variable.

from sklearn.linear_model import LinearRegression

# Reindex to a complete daily index so the missing dates appear as NaN rows
# (assumption: df holds B as 'open_price', df2 holds A as 'value', and df2 has the same complete index)
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max(), freq='D'))

# Train the linear model on data near the gap.
model = LinearRegression()
X_train = df['2018-02-01':'2018-04-13'][['open_price']]
y_train = df2['2018-02-01':'2018-04-13']['value']
model.fit(X_train, y_train)

# Predict the missing values
y_predict = model.predict(df['2018-04-14':'2018-04-16'][['open_price']])
df2.loc['2018-04-14':'2018-04-16', 'value'] = y_predict
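
The same idea can be tried end to end on toy data. The sketch below is not the scikit-learn version above: it swaps in `np.polyfit` for the linear fit so it runs with NumPy and pandas alone, and the series `a` and `b` are invented (with `a` an exact linear function of `b`) purely to make the result easy to check.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2018-04-01', periods=10, freq='D')

# B: the explanatory series, complete. A: the target series, with a gap.
b = pd.Series(np.arange(10, dtype=float), index=idx, name='b')
a = (2.0 * b + 1.0).rename('a')
a.iloc[6:8] = np.nan  # simulate two missing days

# Fit A = slope * B + intercept on the rows where A is known.
known = a.notna()
slope, intercept = np.polyfit(b[known].to_numpy(), a[known].to_numpy(), 1)

# Fill only the missing rows with the model's prediction.
a_filled = a.copy()
a_filled.loc[~known] = slope * b.loc[~known] + intercept

print(a_filled)
```

On real data the relationship is noisy, so the filled values are estimates whose quality depends on how strongly the two series are correlated around the gap.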


Creating a timestamp index from a time column

Needs small adjustments depending on the data.

df['date'] = pd.to_numeric(df['date'], errors='coerce') # convert to numeric (unparseable values become NaN)
df['timestamp'] = pd.to_datetime(df['date'], unit='s', utc=True) # epoch seconds to UTC timestamps
df = df.drop(columns=['date'])
df = df.set_index('timestamp').sort_index()
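
As an example of the kind of adjustment that may be needed, the sketch below handles a made-up input where the epoch seconds arrive as strings, out of order, with one unparseable value. Dropping the unparseable row is my assumption about the desired behavior; other data may call for a different policy.

```python
import pandas as pd

# Made-up raw data: epoch seconds as strings, unsorted, one bad value.
raw = pd.DataFrame({
    'date': ['1704153600', 'bad', '1704067200'],
    'close': [2.0, 3.0, 1.0],
})

df = (
    raw.assign(
        # Bad values become NaN, then NaT after the datetime conversion.
        timestamp=pd.to_datetime(
            pd.to_numeric(raw['date'], errors='coerce'), unit='s', utc=True
        )
    )
    .dropna(subset=['timestamp'])  # assumption: discard rows without a valid time
    .drop(columns=['date'])
    .set_index('timestamp')
    .sort_index()
)
print(df)
```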


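Analyzing with a Z-score

For data with weak stationarity, or indicators with a mean-reverting tendency, my impression is that switching to a Z-score does not improve the correlation coefficient, but it can improve the indicator as a trading signal.
It may be worth trying deliberately when a feature looks like it has explanatory power and shows a weak but non-zero correlation.

As for the window length, people with more experience say it is a judgment call: if you want to see how a value compares with the recent past, ask how long "recent" is on that timeframe. For daily bars that might be about two weeks; for 1-hour bars, about one to three days.

df['z_score'] = (df['value'] - df['value'].rolling(14).mean()) / df['value'].rolling(14).std()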
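
Since the window length is the main thing to tune, it can help to wrap the calculation in a function. This is a minimal sketch; the function name `rolling_z_score` and the sample values are hypothetical.

```python
import pandas as pd

def rolling_z_score(series: pd.Series, window: int) -> pd.Series:
    """Z-score of each value against the trailing `window` observations."""
    rolling = series.rolling(window)
    # rolling.std() uses the sample standard deviation (ddof=1) by default.
    return (series - rolling.mean()) / rolling.std()

# Hand-made values, for illustration only.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
z = rolling_z_score(values, window=3)
print(z)
```

The first `window - 1` entries are NaN because the rolling window is not yet full.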


Creating a df from only specific columns

df = origin_df.filter(regex='_close|_open')

You can also concat several dfs while keeping only specific columns:

df = pd.concat([origin_df.filter(regex='_close|_open') for origin_df in origin_dfs.values()], axis=1, join='inner')
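
Here is a small runnable sketch of the concat pattern. The exchange names, column names and values are made up; the two frames deliberately cover different dates so the effect of `join='inner'` is visible.

```python
import pandas as pd

idx_a = pd.date_range('2024-01-01', periods=3, freq='D')
idx_b = pd.date_range('2024-01-02', periods=3, freq='D')

# Hypothetical per-exchange frames with prefixed column names.
origin_dfs = {
    'bybit': pd.DataFrame(
        {'bybit_open': [1.0, 2.0, 3.0], 'bybit_close': [1.5, 2.5, 3.5], 'bybit_volume': [10, 20, 30]},
        index=idx_a,
    ),
    'binance': pd.DataFrame(
        {'binance_open': [2.1, 3.1, 4.1], 'binance_close': [2.6, 3.6, 4.6], 'binance_volume': [11, 21, 31]},
        index=idx_b,
    ),
}

# Keep only the *_open / *_close columns and align on the shared dates.
df = pd.concat(
    [d.filter(regex='_close|_open') for d in origin_dfs.values()],
    axis=1,
    join='inner',
)
print(df)
```

Only the dates present in every frame survive the inner join, and the volume columns are dropped by the regex filter.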


Adding a suffix or prefix to every column

df = df.add_suffix('_suffix')

df = df.add_prefix('prefix_')

For a more customized rename, use the following:

df.rename(columns=lambda x: f"{x}_custom_suffix", inplace=True)
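
One case where the lambda form is handy is renaming only some of the columns. The sketch below prefixes every column except `timestamp`; the prefix `bybit_` and the column names are just examples.

```python
import pandas as pd

df = pd.DataFrame({'timestamp': [1, 2], 'open': [1.0, 2.0], 'close': [1.5, 2.5]})

# Without inplace=True, rename returns a new frame and leaves df untouched.
renamed = df.rename(columns=lambda c: c if c == 'timestamp' else f'bybit_{c}')

print(list(renamed.columns))
```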