Repairing a Ceph PG stuck in activating+remapped

Current situation

Ceph environment

Host: Ubuntu 22.04 LTS x1 (set up with cephadm to run as a single node)
OSDs: 9
Ceph version: 18.2.1 (Reef)

Ceph alerts (critical)

CephHealthError
CephPGsInactive
CephPGUnavilableBlockingIO

ceph health detail

# ceph health detail
HEALTH_ERR full ratio(s) out of order; Reduced data availability: 1 pg inactive; 1 slow ops, oldest one blocked for 173867 sec, osd.1 has slow ops; too many PGs per OSD (277 > max 250)
[ERR] OSD_OUT_OF_ORDER_FULL: full ratio(s) out of order
    backfillfull_ratio (0.75) < nearfull_ratio (0.85), increased
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 15.2 is stuck inactive for 2d, current state activating+remapped, last acting [1,4,0,5,6,3]
[WRN] SLOW_OPS: 1 slow ops, oldest one blocked for 173867 sec, osd.1 has slow ops
[WRN] TOO_MANY_PGS: too many PGs per OSD (277 > max 250)
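Before restarting anything, the stuck PG can be inspected directly. A minimal sketch follows; `inspect_pg` is a hypothetical helper name, while `ceph pg dump_stuck`, `ceph pg map` and `ceph pg <pgid> query` are standard Ceph CLI subcommands.

```shell
# Hypothetical helper: gather the usual diagnostics for one stuck PG.
# Usage: inspect_pg 15.2
inspect_pg() {
  pgid="$1"
  # List every PG currently stuck in the inactive state
  ceph pg dump_stuck inactive
  # Show the up and acting OSD sets for this PG
  ceph pg map "$pgid"
  # Dump the full peering/recovery state of the PG as JSON
  ceph pg "$pgid" query
}
```

For this cluster, `inspect_pg 15.2` would show which of OSDs [1,4,0,5,6,3] the PG is waiting on during activation.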

Yuta Takahashi

[ERR] OSD_OUT_OF_ORDER_FULL: full ratio(s) out of order

This error has already been resolved by following the information below.

OSD_OUT_OF_ORDER_FULL

The utilization thresholds for nearfull, backfillfull, full, and/or failsafe_full are not ascending. In particular, the following pattern is expected: nearfull < backfillfull, backfillfull < full, and full < failsafe_full.

To adjust these utilization thresholds, run the following commands:

ceph osd set-nearfull-ratio <ratio>
ceph osd set-backfillfull-ratio <ratio>
ceph osd set-full-ratio <ratio>
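Here the cause was `backfillfull_ratio` (0.75) sitting below `nearfull_ratio` (0.85). A sketch that checks the ordering before applying new values; `ratios_in_order` and `apply_ratios` are hypothetical helper names, and 0.85 / 0.90 / 0.95 are Ceph's default nearfull / backfillfull / full thresholds.

```shell
# Hypothetical helper: succeed only if nearfull < backfillfull < full.
ratios_in_order() {
  awk -v n="$1" -v b="$2" -v f="$3" \
    'BEGIN { exit !((n + 0) < (b + 0) && (b + 0) < (f + 0)) }'
}

# Hypothetical helper: apply the three ratios only when they are ascending.
# Usage: apply_ratios <nearfull> <backfillfull> <full>
apply_ratios() {
  if ! ratios_in_order "$1" "$2" "$3"; then
    echo "refusing: need nearfull < backfillfull < full" >&2
    return 1
  fi
  ceph osd set-nearfull-ratio "$1"
  ceph osd set-backfillfull-ratio "$2"
  ceph osd set-full-ratio "$3"
}
```

`ceph osd dump | grep ratio` shows the current values; `apply_ratios 0.85 0.90 0.95` would then restore the defaults, while `apply_ratios 0.85 0.75 0.95` is rejected.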

Yuta Takahashi

For now, I'll try restarting Ceph once.

sudo systemctl restart ceph.target
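`ceph.target` covers every Ceph daemon on this single host, including all 9 OSDs. A narrower option, sketched here as an alternative and not what was done in this log, is to restart only the OSD reporting slow ops through the cephadm orchestrator. `restart_osd` is a hypothetical helper name; `ceph orch daemon restart` and `ceph orch ps` are cephadm orchestrator commands.

```shell
# Hypothetical helper: restart one OSD daemon via the cephadm orchestrator
# instead of every Ceph daemon on the host.
# Usage: restart_osd osd.1
restart_osd() {
  daemon="$1"
  ceph orch daemon restart "$daemon"
  # List daemons so the restarted one can be confirmed as running
  ceph orch ps
}
```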

Yuta Takahashi

After the restart, the PG state changed.

It went from
activating+remapped
to
active+remapped+backfilling and active+remapped+backfill_wait.

The critical items also disappeared from the Ceph alerts.

I'll leave it overnight and see how it goes.
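The overnight wait could also be scripted. A sketch, assuming it is enough to poll `ceph health detail` until the `PG_AVAILABILITY` check from the output above disappears (the `TOO_MANY_PGS` warning remains, so waiting for `HEALTH_OK` would never finish). `wait_for_pg_availability` and its 600-second default interval are hypothetical choices.

```shell
# Hypothetical helper: poll until PG_AVAILABILITY is no longer reported,
# printing a one-line PG summary on every iteration.
# Usage: wait_for_pg_availability [interval_seconds]
wait_for_pg_availability() {
  interval="${1:-600}"
  while ceph health detail | grep -q 'PG_AVAILABILITY'; do
    # Summary of PG states (backfilling, backfill_wait, ...)
    ceph pg stat
    sleep "$interval"
  done
  echo "PG_AVAILABILITY cleared"
}
```

For example, `wait_for_pg_availability 300` checks every 5 minutes and returns once the inactive PG has recovered.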

Yuta Takahashi

After leaving it alone, the PG problem resolved itself.

I suspect that an I/O problem on osd.1, the OSD that reported the slow ops, left the PG stuck.
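The osd.1 hypothesis could be checked with a few commands. A sketch, assuming that in Reef `ceph tell osd.<id>` accepts the admin-socket command `dump_ops_in_flight`; if it does not, running `ceph daemon osd.1 dump_ops_in_flight` inside `cephadm shell --name osd.1` should be the older route. `check_osd_io` is a hypothetical helper name.

```shell
# Hypothetical helper: collect I/O-related information for one OSD.
# Usage: check_osd_io 1
check_osd_io() {
  osd_id="$1"
  # Commit/apply latency of every OSD; compare osd.$osd_id with the others
  ceph osd perf
  # Operations currently in flight (including blocked ones) on this OSD
  ceph tell "osd.$osd_id" dump_ops_in_flight
  # Physical device(s) backing this OSD, for a follow-up health check
  ceph device ls-by-daemon "osd.$osd_id"
}
```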

Yuta Takahashi

I plan to deal with the TOO_MANY_PGS alert in this scrap.
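Roughly, the TOO_MANY_PGS check compares the total number of PG replicas (each pool's `pg_num` multiplied by its `size`) divided by the number of OSDs against `mon_max_pg_per_osd` (250 here). With 9 OSDs, 277 per OSD means roughly 277 × 9 ≈ 2,500 PG replicas in total. A sketch of that arithmetic; `pool_pg_sizes` and `pgs_per_osd` are hypothetical helpers, and the parser assumes each pool line of `ceph osd pool ls detail` contains `size N` and `pg_num N` tokens.

```shell
# Hypothetical helper: print "pg_num size" for every pool.
# Assumes pool lines look like "pool 2 'name' replicated size 3 ... pg_num 512 ...".
pool_pg_sizes() {
  ceph osd pool ls detail | awk '$1 == "pool" {
    for (i = 1; i < NF; i++) {
      if ($i == "size") s = $(i + 1)
      if ($i == "pg_num") p = $(i + 1)
    }
    print p, s
  }'
}

# Hypothetical helper: average PG replicas per OSD from "pg_num size" lines on stdin.
# Usage: pool_pg_sizes | pgs_per_osd 9
pgs_per_osd() {
  awk -v osds="$1" '
    { total += $1 * $2 }
    END { if (osds <= 0) exit 1; printf "%d\n", total / osds }'
}
```

For example, two 3-replica pools with `pg_num` 256 and 32 on 9 OSDs give (256 + 32) × 3 / 9 = 96 PGs per OSD.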