MM (market making) and related topics

Reading this.
LOBSTER: the dataset that papers often use as a benchmark. Taking this opportunity to look at what is actually in it.

The AAPL data looks like this:

34200.004241176,1,16113575,18,5853300,1
34200.00426064,1,16113584,18,5853200,1
34200.004447484,1,16113594,18,5853100,1
34200.025551909,1,16120456,18,5859100,-1

34200 (seconds after midnight) is 9:30 a.m., the NY market open.

1.) Time: Seconds after midnight, with decimal precision of at least milliseconds and up to nanoseconds depending on the requested period
2.) Type:
  1: Submission of a new limit order
  2: Cancellation (partial deletion of a limit order)
  3: Deletion (total deletion of a limit order)
  4: Execution of a visible limit order
  5: Execution of a hidden limit order
  7: Trading halt indicator (detailed information below)
3.) Order ID: Unique order reference number (assigned in order flow)
4.) Size: Number of shares
5.) Price: Dollar price times 10000 (i.e., a stock price of $91.14 is given by 911400)
6.) Direction:
  -1: Sell limit order
  1: Buy limit order

Note: Execution of a sell (buy) limit order corresponds to a buyer (seller) initiated trade, i.e. a Buy (Sell) trade.
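
As a quick check of the field layout above, here is a minimal sketch that parses one message line into named fields. The names `Message`, `parse_message`, `clock_time` and `EVENT_TYPES` are made up for this note and are not part of any LOBSTER tooling.

```python
from dataclasses import dataclass

# Event type codes as listed in the field description above.
EVENT_TYPES = {
    1: "new limit order",
    2: "partial cancellation",
    3: "total deletion",
    4: "execution (visible)",
    5: "execution (hidden)",
    7: "trading halt",
}


@dataclass
class Message:
    time: float      # seconds after midnight
    event_type: int  # key into EVENT_TYPES
    order_id: int
    size: int        # number of shares
    price: float     # dollars (raw value divided by 10000)
    direction: int   # 1 = buy limit order, -1 = sell limit order


def parse_message(line: str) -> Message:
    """Split one comma-separated message line into typed fields."""
    t, typ, oid, size, price, direction = line.strip().split(",")
    return Message(float(t), int(typ), int(oid), int(size),
                   int(price) / 10000, int(direction))


def clock_time(seconds: float) -> str:
    """Render seconds after midnight as HH:MM:SS (fraction dropped)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


msg = parse_message("34200.004241176,1,16113575,18,5853300,1")
print(clock_time(msg.time), EVENT_TYPES[msg.event_type], msg.size, msg.price)
```

So the first sample row is a new buy limit order for 18 shares, timestamped right at the open.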

richwomanbtc

out_ds['log_return_mid_price'] = np.log(mid_price.pct_change() + 1).shift(-1)

This is used as the target. Since

\log(m_t/m_{t-1}-1 + 1) = \log(m_t)-\log(m_{t-1})

the same target (after the one-step shift) can be written as

out_ds['log_return_mid_price'] = np.log(mid_price.shift(-1)) - np.log(mid_price)

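As the name says, this is the log return of the mid price. Because of the shift, row $t$ holds $\log(m_{t+1})-\log(m_t)$.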

Incidentally, with $r_t=m_t/m_{t-1}-1$ we have $\log(r_t + 1) = r_t + O(r_t^2)$,
so when $r_t$ is small the log return and $r_t$ roughly coincide.
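
A small numerical check of the two points above, on made-up mid prices (the values are illustrative, not taken from the dataset): both ways of writing the target give the same series, and the gap to the simple return is of order $r_t^2$.

```python
import numpy as np
import pandas as pd

# Made-up mid prices, for illustration only.
mid_price = pd.Series([585.33, 585.32, 585.31, 585.91, 585.90])

# Target as originally written, and the equivalent log-difference form.
via_pct_change = np.log(mid_price.pct_change() + 1).shift(-1)
via_log_diff = np.log(mid_price.shift(-1)) - np.log(mid_price)

# Simple return over the same step, for comparison with the log return.
simple_return = mid_price.pct_change().shift(-1)

same = np.allclose(via_pct_change, via_log_diff, equal_nan=True)
max_gap = (via_log_diff - simple_return).abs().max()
print(same, max_gap)
```

The last row is NaN in both versions, since there is no next mid price to compare against.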

richwomanbtc

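A comparison between predicting all zeros and using the predictions of a model trained with catboost.
Resample with the bootstrap → assume the resampled statistic (here, the mean of the log returns) follows a t distribution and take its 95% point → compute the 95% confidence interval.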
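
The procedure above can be sketched roughly as follows. The note does not spell out exactly how the t distribution is applied, so this is one plausible reading and an assumption: the interval is the mean of the bootstrap means plus or minus a t quantile times their standard deviation. `bootstrap_mean_ci` is a made-up helper name, and the input data is synthetic.

```python
import numpy as np
from scipy import stats


def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=0):
    """Bootstrap the sample mean and return a t-based confidence interval.

    Assumption: the bootstrap means are treated as t-distributed around
    their own mean, with n_boot - 1 degrees of freedom.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row of idx is one resample (with replacement) of the data.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    center = boot_means.mean()
    spread = boot_means.std(ddof=1)
    t_crit = stats.t.ppf(0.5 + level / 2, df=n_boot - 1)
    return center - t_crit * spread, center + t_crit * spread


# Synthetic stand-in for a series of log returns.
sample = np.random.default_rng(1).normal(0.0, 1e-4, size=500)
low, high = bootstrap_mean_ci(sample)
print(low, high)
```

To compare the all-zero baseline with the model, the same function could be applied to whichever per-sample quantity is being compared for each of the two predictors.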

For names with a well-behaved return distribution, such as AAPL, the model gives a better estimate than predicting all zeros.

richwomanbtc

NOTE

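No horizon is set: the model predicts the return up to the next best bid/ask update. That interval can be as short as microseconds, so lightgbm inference would not keep up, and in practice I think an appropriate horizon has to be set.
In crypto as well, extra care seems necessary for symbols whose mid-price return distribution is distorted, for example because of low liquidity.
Features are built only up to depth = 1, so one of the next steps is to pull in a few more levels and build features from them.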
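
One way to set an explicit horizon, as a minimal sketch rather than what was actually run: snap the mid price onto a regular time grid and take the log return a fixed number of grid steps ahead. The function name, the 500 ms step, the two-step horizon, the date and the prices are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def fixed_horizon_log_return(mid_price: pd.Series, step: str,
                             horizon_steps: int) -> pd.Series:
    """Log return of the mid price `horizon_steps` grid steps ahead."""
    # Last observed mid price in each bin; empty bins carry the previous value.
    grid = mid_price.resample(step).last().ffill()
    return np.log(grid.shift(-horizon_steps)) - np.log(grid)


# Made-up event times (seconds after midnight) on an arbitrary date.
seconds = [34200.004, 34200.0042, 34200.35, 34200.9, 34201.6]
index = pd.Timestamp("2012-06-21") + pd.to_timedelta(seconds, unit="s")
mid = pd.Series([585.33, 585.32, 585.31, 585.91, 585.90], index=index)

# 500 ms grid, target = log return one second (two steps) ahead.
target = fixed_horizon_log_return(mid, step="500ms", horizon_steps=2)
print(target)
```

The grid step then bounds how often a prediction is needed, which is the point of fixing a horizon when updates arrive microseconds apart.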

richwomanbtc

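Next up is this.

Skimmed it, but it is not something an individual can do at that scale.
The Overview and the discussion of cloud selection may be useful for reference.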

richwomanbtc

Next up is this.

Let the returns of symbol $A$ and symbol $B$ be $r^{\eta}_{t}, \eta=A,B$, and run a linear regression:

r^A_t = \beta_t r^B_t + e_t

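Recompute beta on a rolling window and pair trade (buy 1 unit of A and sell $\beta_t$ units of B).
In short, $e_t$ is assumed to be mean-reverting, and a position is taken when $e_t$ swings far from its mean.
Parameters such as how large a swing triggers a position and where to put the stop-loss are chosen with a genetic algorithm.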
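
A minimal sketch of the rolling regression and entry rule described above, on synthetic returns. The no-intercept OLS estimate follows the regression equation as written. The window length and the `entry_z` threshold are placeholder values standing in for the parameters that would be tuned by the genetic algorithm, and the z-score entry rule is an assumption; stop-loss handling is left out.

```python
import numpy as np
import pandas as pd


def rolling_beta_residual(r_a: pd.Series, r_b: pd.Series, window: int):
    """Rolling no-intercept OLS of r_a on r_b, and the residual e_t."""
    # beta_t = sum(r_a * r_b) / sum(r_b ** 2) over the trailing window.
    beta = (r_a * r_b).rolling(window).sum() / (r_b ** 2).rolling(window).sum()
    resid = r_a - beta * r_b
    return beta, resid


def entry_signal(resid: pd.Series, window: int, entry_z: float) -> pd.Series:
    """+1: buy A / sell beta_t of B, -1: the reverse, 0: stay flat."""
    z = (resid - resid.rolling(window).mean()) / resid.rolling(window).std()
    signal = pd.Series(0, index=resid.index)
    signal[z < -entry_z] = 1   # e_t far below its mean: A looks cheap vs B
    signal[z > entry_z] = -1   # e_t far above its mean: A looks rich vs B
    return signal


# Synthetic returns with a true beta of 1.5 (illustration only).
rng = np.random.default_rng(0)
r_b = pd.Series(rng.normal(0.0, 0.01, 500))
r_a = 1.5 * r_b + pd.Series(rng.normal(0.0, 0.001, 500))

beta, resid = rolling_beta_residual(r_a, r_b, window=100)
signal = entry_signal(resid, window=100, entry_z=2.0)
print(beta.iloc[-1], signal.value_counts().to_dict())
```

Note that `beta` at time t uses the returns at t itself, so a backtest would want to lag it by one step before trading on it.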