Curious about eBPF, I tried EC2 Rescue for Linux, which supports eBPF-based troubleshooting
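Hello.
How are you doing?
This is Ryo Yoshii, a big fan of "No human labor is no human error".
OpsJAWS, which I help run, holds study sessions on operations built on AWS services. I am thinking of making "State-of-the-art monitoring with EC2" the theme of the next session, so here is a related EC2 post.
This time I looked into EC2 Rescue for Linux.
This tool diagnoses and troubleshoots common problems on EC2 Linux instances. It collects a wide range of information that helps with diagnosis, and it can also upload what it collects to a location specified by AWS Support.
When something goes wrong, running this tool first to capture the state of the Linux server may be a good idea.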
Installing BPF Compiler Collection (BCC)
EC2 Rescue for Linux works without it, but I install BCC so that more detailed information can be collected.
On Amazon Linux 2023 it can be installed with dnf. For other distributions, see Installing BCC.
sudo dnf install bcc
Add the BCC tools directory to the PATH of your user on the Linux server.
export PATH=$PATH:/usr/share/bcc/tools
To be safe, add it to bashrc as well.
# BCC Tools
export PATH=$PATH:/usr/share/bcc/tools
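Because the same export now lives in both the interactive shell and bashrc, the directory can end up on PATH more than once. The following is a minimal sketch of an idempotent variant; add_to_path is a hypothetical helper name, not something BCC or EC2 Rescue for Linux provides.

```shell
# Hypothetical helper: append a directory to PATH only if it is not already there.
add_to_path() {
  local dir="$1"
  case ":${PATH}:" in
    *":${dir}:"*) ;;                  # already on PATH, nothing to do
    *) PATH="${PATH}:${dir}" ;;       # append once
  esac
  export PATH
}

# Same directory as in the export line above.
add_to_path /usr/share/bcc/tools
```

Calling it repeatedly leaves PATH unchanged after the first call, so it is safe to keep in bashrc.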
Some EC2 Rescue for Linux modules require sudo, so the tools need to be on PATH under sudo as well.
Append the directory to secure_path in /etc/sudoers.
(Before)
Defaults secure_path = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/lib/snapd/snap/bin
(After)
Defaults secure_path = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/lib/snapd/snap/bin:/usr/share/bcc/tools
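The before/after edit can be expressed as a small string transformation. This is only a sketch: append_secure_path is a hypothetical helper that prints the new line and does not touch /etc/sudoers itself. Editing the real file is still best done through visudo, and sudo visudo -c can check the syntax afterwards.

```shell
# Hypothetical helper: print a "Defaults secure_path" line with one more
# directory appended, unless the directory is already listed.
append_secure_path() {
  local line="$1" dir="$2"
  case "${line}:" in
    *":${dir}:"*|*"= ${dir}:"*|*"=${dir}:"*)
      printf '%s\n' "$line" ;;            # already present, keep as is
    *)
      printf '%s:%s\n' "$line" "$dir" ;;  # append at the end
  esac
}

# Example with a shortened secure_path value (assumed for illustration).
append_secure_path 'Defaults secure_path = /usr/local/sbin:/usr/bin' /usr/share/bcc/tools
```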
Downloading EC2 Rescue for Linux
Quoting the official documentation, the prerequisites are as follows.
- Supported operating systems
  - Amazon Linux 2
  - Amazon Linux 2016.09+
  - SUSE Linux Enterprise Server 12+
  - RHEL 7+
  - Ubuntu 16.04+
- Software requirements
  - Python 2.7.9+ or 3.2+
Download the archive and verify its hash.
$ wget https://s3.amazonaws.com/ec2rescuelinux/ec2rl.tgz
$ wget https://s3.amazonaws.com/ec2rescuelinux/ec2rl.tgz.sha256
$ sha256sum -c ec2rl.tgz.sha256
ec2rl.tgz: OK
If it prints OK, extract the tarball.
$ tar -zxf ec2rl.tgz
$ ls
ec2rl-1.1.6 ec2rl.tgz ec2rl.tgz.sha256
$ ls ec2rl-1.1.6/
LICENSE NOTICE README.md docs ec2rl ec2rl.py ec2rlcore example_configs example_modules functions.bash lib mod.d post.d pre.d requirements.txt ssmdocs
It extracted without problems. The download is complete.
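The check-then-extract steps can be tied together so that extraction never happens when the checksum fails. This is a minimal sketch; verify_and_extract is a hypothetical helper, and it assumes the .sha256 file sits in the same directory as the archive and refers to it by bare file name, as in the download above.

```shell
# Hypothetical helper: extract an archive only if its SHA-256 checksum verifies.
verify_and_extract() {
  local archive="$1" sumfile="$2" dest="$3"
  # sha256sum -c resolves file names relative to the current directory,
  # so run it from the directory that holds the archive.
  ( cd "$(dirname "$archive")" && sha256sum -c "$(basename "$sumfile")" ) || {
    echo "checksum verification failed: $archive" >&2
    return 1
  }
  mkdir -p "$dest" && tar -zxf "$archive" -C "$dest"
}

# Usage (not executed here):
#   verify_and_extract ./ec2rl.tgz ./ec2rl.tgz.sha256 .
```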
Viewing the help
I had no idea how to use the tool, so I started with the help.
$ ./ec2rl list
Here is a list of available modules that apply to the current host:
S P R Module Name Class Domain Description
amazonlinuxextras collect os Collect amazon-linux-extras list output
* aptlog gather os Gather /var/log/apt and /var/log/dpkg.log files
* arpcache diagnose net Determines if aggressive arp caching is enabled
* arpignore diagnose net Determines if any interfaces have been set to ignore arp requests
arptable collect net Collect output from ip neighbor show for system analysis
* arptablesrules collect net Collect output from arptables-save for system analysis
asymmetricroute diagnose net Check for asymmetric routing
atop collect performance Collect output from atop for system analysis
atophistory gather performance Gather /var/log/atop history files
* bccbiolatency collect performance Collect output from biolatency for system analysis
* bccbiosnoop collect performance Collect output from biosnoop for system analysis.
* bccbiotop collect performance Collect output from biotop for system analysis
* bccbitesize collect performance Collect output from bitesize for system analysis.
* bcccachestat collect performance Collect output from cachestat for system analysis.
* bccdcsnoop collect performance Collect output from dcsnoop for system analysis.
* bccdcstat collect performance Collect output from dcstat for system analysis.
* bccexecsnoop collect performance Collect output from execsnoop for system analysis.
* bccext4dist collect performance Collect output from ext4dist for system analysis
* bccext4slower collect performance Collect output from ext4slower for system analysis
* bccfilelife collect performance Collect output from filelife for system analysis
* bccfileslower collect performance Collect output from fileslower for system analysis
* bccfiletop collect performance Collect output from filetop for system analysis
* bccgethostlatency collect performance Collect output from gethostlatency for system analysis
* bcchardirqs collect performance Collect output from hardirqs for system analysis
* bcckillsnoop collect performance Collect output from killsnoop for system analysis
* bccmysqldqslower performa net Collect output from mysqld_qslower for system analysis
* bccopensnoop collect performance Collect output from opensnoop for system analysis
* bccpidpersec collect performance Collect output from pidspersec for system analysis
* bccrunqlat collect performance Collect output from runqlat for system analysis
* bccslabratetop collect performance Collect output from slabratetop for system analysis
* bccsoftirqs collect performance Collect output from softirqs for system analysis
* bccstatsnoop collect performance Collect output from statsnoop for system analysis
* bccsyncsnoop collect performance Collect output from syncsnoop for system analysis
* bcctcpaccept collect net Collect output from tcpaccept for system analysis
* bcctcpconnect collect net Collect output from tcpconnect for system analysis
* bcctcpconnlat collect net Collect output from tcpconnlat for system analysis
* bcctcplife collect net Collect output from tcplife for system analysis
* bcctcpretrans collect net Collect output from tcpretrans for system analysis
* bcctcptop collect net Collect output from tcptop for system analysis
* bccvfscount collect performance Collect output from vfscount for system analysis
* bccvfsstat collect performance Collect output from vfsstat for system analysis
* bccxfsdist collect performance Collect output from xfsdist for system analysis
* bccxfsslower collect performance Collect output from xfsslower for system analysis
blkid collect os Collect blkid output
* bpfccbiolatency collect performance Collect output from biolatency for system analysis
* bpfccbiosnoop collect performance Collect output from biosnoop for system analysis.
* bpfccbiotop collect performance Collect output from biotop for system analysis
* bpfccbitesize collect performance Collect output from bitesize for system analysis.
* bpfcccachestat collect performance Collect output from cachestat for system analysis.
* bpfccdcsnoop collect performance Collect output from dcsnoop for system analysis.
* bpfccdcstat collect performance Collect output from dcstat for system analysis.
* bpfccexecsnoop collect performance Collect output from execsnoop for system analysis.
* bpfccext4dist collect performance Collect output from ext4dist for system analysis
* bpfccext4slower collect performance Collect output from ext4slower for system analysis
* bpfccfilelife collect performance Collect output from filelife for system analysis
* bpfccfileslower collect performance Collect output from fileslower for system analysis
* bpfccfiletop collect performance Collect output from filetop for system analysis
* bpfccgethostlatenc collect performance Collect output from gethostlatency for system analysis
* bpfcchardirqs collect performance Collect output from hardirqs for system analysis
* bpfcckillsnoop collect performance Collect output from killsnoop for system analysis
* bpfccmysqldqslower performa net Collect output from mysqld_qslower for system analysis
* bpfccopensnoop collect performance Collect output from opensnoop for system analysis
* bpfccpidpersec collect performance Collect output from pidspersec for system analysis
* bpfccrunqlat collect performance Collect output from runqlat for system analysis
* bpfccslabratetop collect performance Collect output from slabratetop for system analysis
* bpfccsoftirqs collect performance Collect output from softirqs for system analysis
* bpfccstatsnoop collect performance Collect output from statsnoop for system analysis
* bpfccsyncsnoop collect performance Collect output from syncsnoop for system analysis
* bpfcctcpaccept collect net Collect output from tcpaccept for system analysis
* bpfcctcpconnect collect net Collect output from tcpconnect for system analysis
* bpfcctcpconnlat collect net Collect output from tcpconnlat for system analysis
* bpfcctcplife collect net Collect output from tcplife for system analysis
* bpfcctcpretrans collect net Collect output from tcpretrans for system analysis
* bpfcctcptop collect net Collect output from tcptop for system analysis
* bpfccvfscount collect performance Collect output from vfscount for system analysis
* bpfccvfsstat collect performance Collect output from vfsstat for system analysis
* bpfccxfsdist collect performance Collect output from xfsdist for system analysis
* bpfccxfsslower collect performance Collect output from xfsslower for system analysis
cgroups collect os Collect /proc/cgroups info
clocksource collect os Collect details on current clocksource
cloudinitlog gather os Gather /etc/cloud-init* log files
collectl collect performance Collect output from collectl for system analysis
collectlhistory gather performance Gather /var/log/collectl history files
conntrackfull diagnose net Attempts to detect ip_conntrack full
consoleoverload diagnose os Attempts to detect console overload
cpuinfo collect os Collect /proc/cpuinfo output
* cron gather os Gather /etc/cron* files
date collect os Collect output from date to get system time and timezone
dhclientleases gather application Gather a copy of the /var/lib/dhclient/*.lease files
dig collect net Collect output from dig for dns troubleshooting
dmesg collect os Collect output from dmesg
dmesgfiles gather os Gather /var/log/dmesg* files
dpkgpackages collect os Collect list of installed packages using dpkg -l
* duplicatefslabels diagnose os Search for duplicate filesystem labels.
* duplicatefsuuid diagnose os Find duplicate filesystem UUIDs
* duplicatepartuuid diagnose os Find duplicate partition UUIDs
* ebtablesrules collect net Collect output from ebtables-save for system analysis
enadiag diagnose net Checks the ethool -S output for ENA specific statistics to diagnose i
entropy collect performance Collect output from entropy_avail for system analysis.
environment gather os Gather /etc/environment file
ethtool collect net Collect output from ethtool for system analysis
ethtoolg collect net Collect output from ethtool -g for system analysis
ethtooli collect net Collect output from ethtool -i for system analysis
ethtoolk collect net Collect output from ethtool -k for system analysis
ethtools collect net Collect output from ethtool -S for system analysis
fstab gather os Gather /etc/fstab file
* * fstabfailures diagnose os Disables fsck and sets nofail in /etc/fstab for all volumes
* * gcore gather performance Collect output from gcore for application analysis.
hosts gather os Gather /etc/hosts file
* httpdlogs gather application Gather Apache /var/log/httpd/* or /var/log/apache2/* log files
* hungtasks diagnose os Detects hung tasks
ifconfig collect net Collect output from ip addr show for system analysis
inittab gather os Gather /etc/inittab file
interrupts collect performance Collect output from /proc/interrupts for system analysis
iomem collect os Collect /proc/iomem output
iostat collect performance Collect output from iostat -x for system analysis
iproute collect net Collect output from ip route show all for system analysis
ipslink collect net Collect output from ip -s link for system analysis
* iptablesrules collect net Collect output from iptables-save for system analysis
ixgbevfversion diagnose net Determines if ixgbevf version is below recommended value
journal collect os Collect journalctl output
kerberosconfig gather os Gather the Kerberos configuration file
* kernelbug diagnose os Detects kernel bugs
kernelcmdline collect os Collect kernel command line (boot) options
kernelconfig gather os Collect /boot/config details
* kerneldereference diagnose os Detects kernel null pointer dereferences
* kernelpanic diagnose os Detects kernel panics
kernelversion collect os Collect output from uname -r for system analysis
kpatch collect os Collect kpatch list output
* kpti collect os Determine status of Kernel Page Table Isolation.
last collect os Collect last -x output
libtirpcnetconfig gather os Gather libtirpc's /etc/netconfig file
localtime collect os Collect zdump /etc/localtime output
lsblk collect os Collect lsblk output
lsmod collect os Collect lsmod output
lspci collect os Collect lspci output
* * ltrace gather performance Gather output from ltrace -fp for application analysis
* * ltracec collect performance Collect output from ltrace -cfp for application analysis
* lvmarchives gather application Gather /etc/lvm/archives/* files
lvmconf gather os Gather /etc/lvm/lvm.conf file
mdstat collect os Collect /proc/mdstat output
meminfo collect os Collect /proc/meminfo output
* messages gather os Gather /var/log/messages* or /var/log/syslog* files
mounts collect os Collect /proc/mounts output
mpstati collect performance Collect output from mpstat -I for system analysis
mpstatp collect performance Collect output from mpstat -P for system analysis
* mysqldlog gather application Gather /var/log/mysqld.log* files
ncport collect net Test TCP network connectivity to port/destination
netstatanp collect performance Collect output from ss -anp for system analysis
netstats collect performance Collect output from netstat -s for system analysis
networkmanagerstat collect net Collect output from systemctl status NetworkManager for system analys
* nginxlogs gather application Gather /var/log/nginx/* log files
* nping collect net Collect output from nping for network troubleshooting.
* npingtraceroute collect net Collect output from nping traceroute for network troubleshooting
nsswitch gather os Gather /etc/nsswitch.conf file
nstat collect performance Collect output from nstat for system analysis
ntpconf gather os Gather /etc/ntp.conf
ntpstat collect os Collect ntpstat output
numastat collect os Collect numastat output
* oomkiller diagnose os Detects oom-killer invocations
openfiles collect performance Collect output from lsof | wc -l for system analysis
* * openssh diagnose os Verify OpenSSH configuration for faults that could prevent remote acc
osrelease collect os Collect details on os release for system analysis
pagetypeinfo collect os Collect /proc/pagetypeinfo output
partitions collect os Collect /proc/partitions output
* * perf collect performance Collect CPU profiling statistics
* perfstat collect performance Collect output from perf for system analysis
procstat collect os Collect /proc/stat output
profile gather os Gather /etc/profile file
ps collect performance Collect output from ps for system analysis
* * rebuildinitrd diagnose os Rebuilds the system initial ramdisk
resolvconf gather net Gather the /etc/resolv.conf file
* retpoline collect os Determine status of kernel retpoline replacements.
rpmpackages collect os Collect list of installed packages using rpm -qa
sarhistory gather performance Gather /var/log/sa (sar) history files
scheddebug collect os Collect /proc/sched_debug output
* * selinuxpermissive diagnose os Sets selinux to permissive mode
* slabinfo collect os Collect /proc/slabinfo output
* slabtop collect performance Collect output from slabtop for system analysis
softirqs collect performance Collect output from /proc/softirqs for system analysis
* softlockup diagnose os Detects CPU soft lockups
* sosreport gather os Gather a sosreport
* * strace gather performance Gather output from strace -fp for application analysis
* * stracec collect performance Collect output from strace -cfp for application analysis
* supportconfig gather os Gather a supportconfig
sysctl collect os Collect sysctl -a output
sysctlconf gather os Collect /etc/sysctl.conf and /etc/sysctl.d files
* systemsmanager gather os Gather AWS Systems Manager logs and configuration
* * tcpdump gather net Gather packet capture for network troubleshooting.
* tcprecycle diagnose net Determines if aggressive TCP recycling is enabled
* tcptraceroute collect net Collect traceroute output on TCP traffic to a network destination.
top collect performance Collect output from top for system analysis
udev gather os Gather /etc/udev rules and configuration
* * udevpersistentnet diagnose net Comments out lines in /etc/udev/rules.d/70-persistent-net.rules
* vmallocinfo collect os Collect /proc/vmallocinfo output
vmstat collect performance Collect output from vmstat for system analysis
vmstatdisk collect performance Collect output from vmstat -d for system analysis
vmstatforks collect performance Collect output from vmstat -f for system analysis
* vmstatslab collect performance Collect output from vmstat -m for system analysis
w collect performance Collect output from w for system analysis
* workspacelogs gather os Gather AWS Linux Workspace log files
* xenfeatures collect os Collect details on xen features for system analysis
xennetrocket diagnose net Attempts to detect xennet issue
* xennetsgmtu diagnose net Attempts to detect possibility of xennet scattergather/mtu issue
yumconfiguration collect os Collect yum related configuration file under /etc/yum* output
* yumlog gather os Gather /var/log/yum.log file
zoneinfo collect os Collect /proc/zoneinfo output
* zypperlog gather os Gather /var/log/zypp and zypper.log log files
S: Requires sudo/root to run
P: Requires --perfimpact=true to run (can potentially cause performance impact)
R: Supports remediation if --remediate is given
Classes refer to the type of task the module performs
Diagnose: success/fail/warn conditions determined by module.
Gather: create a copy of a local file for inspection.
Collect: collect command output
Domains are defined per module and refer to the general area of investigation for the module.
To see module help, you can run:
ec2rl help [MODULEa ... MODULEx]
ec2rl help [--only-modules=MODULEa ... MODULEx] [--only-domains=DOMAINa ... DOMAINx]
EC2 Rescue for Linux ships with quite a lot of modules.
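With this many modules, it helps to narrow the list to one domain. The sketch below filters the list output with awk; modules_in_domain is a hypothetical helper, and it assumes the output is whitespace-separated in the order shown above (optional * markers, module name, class, domain, description).

```shell
# Hypothetical helper: print the names of modules that belong to one domain.
# Reads `ec2rl list` output on stdin.
modules_in_domain() {
  local domain="$1"
  awk -v domain="$domain" '
    {
      i = 1
      while (i <= NF && $i == "*") i++   # skip the S/P/R marker columns
      # Remaining fields: name, class, domain, description...
      if (i + 2 <= NF && $(i + 2) == domain) print $i
    }'
}

# Usage (not executed here):
#   ./ec2rl list | modules_in_domain net
```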
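Running ec2rl with the run subcommand executes every module whose prerequisite software is installed. As the legend above notes, modules marked S need sudo/root and modules marked P additionally need --perfimpact=true.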
sudo ./ec2rl run
To run only specific modules, use the --only-modules option. Some modules also require arguments, so pass the appropriate ones.
Note that some modules need sudo.
$ ./ec2rl run --only-modules=module_name1,module_name2 --arguments=value
or
$ sudo ./ec2rl run --only-modules=module_name1,module_name2 --arguments=value
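The value of --only-modules is a comma-separated list, which is easy to get wrong by hand when picking many modules. Here is a small sketch that builds it from separate names; join_modules is a hypothetical helper, and the three modules are ones from the list above that the table below shows as needing neither sudo nor arguments.

```shell
# Hypothetical helper: join module names with commas for --only-modules.
join_modules() {
  local IFS=,
  printf '%s\n' "$*"
}

# Print the command instead of running it, so it can be reviewed first.
echo "./ec2rl run --only-modules=$(join_modules kernelversion lsblk meminfo)"
```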
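The table below summarizes each module: its description, the arguments it takes, whether it needs sudo, and whether it can affect performance.
Module Name | Description shown in help | Supplementary note (where the description alone is unclear) | Arguments | Requires sudo | Performance impact |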
---|---|---|---|---|---|
amazonlinuxextras | Collect amazon-linux-extras list output | | distro | | |
aptlog | Gather /var/log/apt and /var/log/dpkg.log files | | distro | * | |
arpcache | Determines if aggressive arp caching is enabled | ||||
arpignore | Determines if any interfaces have been set to ignore arp requests | ||||
arptable | Collect output from ip neighbor show for system analysis | ||||
arptablesrules | Collect output from arptables-save for system analysis | | | * | |
asymmetricroute | Check for asymmetric routing | ||||
atop | Collect output from atop for system analysis | | times | | |
atophistory | Gather /var/log/atop history files | | times | | |
bccbiolatency | Collect output from biolatency for system analysis | Summarize block device I/O latency as a histogram. | distro | * | |
bccbiosnoop | Collect output from biosnoop for system analysis | Trace block device I/O with PID and latency. | distro | * | |
bccbiotop | Collect output from biotop for system analysis | Top for disks: Summarize block device I/O by process. | times | * | |
bccbitesize | Collect output from bitesize for system analysis | Show per process I/O size histogram. | period | * | |
bcccachestat | Collect output from cachestat for system analysis | Trace page cache hit/miss ratio. | times | * | |
bccdcsnoop | Collect output from dcsnoop for system analysis | Trace directory entry cache (dcache) lookups. | period | * | |
bccdcstat | Collect output from dcstat for system analysis | Directory entry cache (dcache) stats. | times | * | |
bccexecsnoop | Collect output from execsnoop for system analysis | Trace new processes via exec() syscalls. | period | * | |
bccext4dist | Collect output from ext4dist for system analysis | Summarize ext4 operation latency distribution as a histogram. | times | * | |
bccext4slower | Collect output from ext4slower for system analysis | Trace slow ext4 operations. | period | * | |
bccfilelife | Collect output from filelife for system analysis | Trace the lifespan of short-lived files. | period | * | |
bccfileslower | Collect output from fileslower for system analysis | Trace slow synchronous file reads and writes. | period | * | |
bccfiletop | Collect output from filetop for system analysis | File reads and writes by filename and process. Top for files. | times | * | |
bccgethostlatency | Collect output from gethostlatency for system analysis | Show latency for getaddrinfo/gethostbyname[2] calls. | period | * | |
bcchardirqs | Collect output from hardirqs for system analysis | Measure hard IRQ (hard interrupt) event time. | times | * | |
bcckillsnoop | Collect output from killsnoop for system analysis | Trace signals issued by the kill() syscall. | period | * | |
bccmysqldqslower | Collect output from mysqld_qslower for system analysis | Trace MySQL server queries slower than a threshold. | threshold | * | |
bccopensnoop | Collect output from opensnoop for system analysis | Trace open() syscalls. | period | * | |
bccpidpersec | Collect output from pidspersec for system analysis | Count new processes (via fork). | period | * | |
bccrunqlat | Collect output from runqlat for system analysis | Run queue (scheduler) latency as a histogram. | times | * | |
bccslabratetop | Collect output from slabratetop for system analysis | Kernel SLAB/SLUB memory cache allocation rate top. | times | * | |
bccsoftirqs | Collect output from softirqs for system analysis | Measure soft IRQ (soft interrupt) event time. | times | * | |
bccstatsnoop | Collect output from statsnoop for system analysis | Trace stat() syscalls. | period | * | |
bccsyncsnoop | Collect output from syncsnoop for system analysis | Trace sync() syscall. | period | * | |
bcctcpaccept | Collect output from tcpaccept for system analysis | Trace TCP passive connections (accept()). | period | * | |
bcctcpconnect | Collect output from tcpconnect for system analysis | Trace TCP active connections (connect()). | period | * | |
bcctcpconnlat | Collect output from tcpconnlat for system analysis | Trace TCP active connection latency (connect()). | period | * | |
bcctcplife | Collect output from tcplife for system analysis | Trace TCP sessions and summarize lifespan. | period | * | |
bcctcpretrans | Collect output from tcpretrans for system analysis | Trace TCP retransmits and TLPs. | period | * | |
bcctcptop | Collect output from tcptop for system analysis | Summarize TCP send/recv throughput by host. Top for TCP. | times | * | |
bccvfscount | Collect output from vfscount for system analysis | Count VFS calls. | period | * | |
bccvfsstat | Collect output from vfsstat for system analysis | Count some VFS calls, with column output. | times | * | |
bccxfsdist | Collect output from xfsdist for system analysis | Summarize XFS operation latency distribution as a histogram. | times | * | |
bccxfsslower | Collect output from xfsslower for system analysis | Trace slow XFS operations. | period | * | |
blkid | Collect blkid output | ||||
bpfccbiolatency | Collect output from biolatency for system analysis | (Ubuntu) Summarize block device I/O latency as a histogram. | distro | * | |
bpfccbiosnoop | Collect output from biosnoop for system analysis | (Ubuntu) Trace block device I/O with PID and latency. | distro | * | |
bpfccbiotop | Collect output from biotop for system analysis | (Ubuntu) Top for disks: Summarize block device I/O by process. | distro | * | |
bpfccbitesize | Collect output from bitesize for system analysis | (Ubuntu) Show per process I/O size histogram. | distro | * | |
bpfcccachestat | Collect output from cachestat for system analysis | (Ubuntu) Trace page cache hit/miss ratio. | distro | * | |
bpfccdcsnoop | Collect output from dcsnoop for system analysis | (Ubuntu) Trace directory entry cache (dcache) lookups. | distro | * | |
bpfccdcstat | Collect output from dcstat for system analysis | (Ubuntu) Directory entry cache (dcache) stats. | distro | * | |
bpfccexecsnoop | Collect output from execsnoop for system analysis | (Ubuntu) Trace new processes via exec() syscalls. | distro | * | |
bpfccext4dist | Collect output from ext4dist for system analysis | (Ubuntu) Summarize ext4 operation latency distribution as a histogram. | distro | * | |
bpfccext4slower | Collect output from ext4slower for system analysis | (Ubuntu) Trace slow ext4 operations. | distro | * | |
bpfccfilelife | Collect output from filelife for system analysis | (Ubuntu) Trace the lifespan of short-lived files. | distro | * | |
bpfccfileslower | Collect output from fileslower for system analysis | (Ubuntu) Trace slow synchronous file reads and writes. | distro | * | |
bpfccfiletop | Collect output from filetop for system analysis | (Ubuntu) File reads and writes by filename and process. Top for files. | distro | * | |
bpfccgethostlatency | Collect output from gethostlatency for system analysis | (Ubuntu) Show latency for getaddrinfo/gethostbyname[2] calls. | distro | * | |
bpfcchardirqs | Collect output from hardirqs for system analysis | (Ubuntu) Measure hard IRQ (hard interrupt) event time. | distro | * | |
bpfcckillsnoop | Collect output from killsnoop for system analysis | (Ubuntu) Trace signals issued by the kill() syscall. | distro | * | |
bpfccmysqldqslower | Collect output from mysqld_qslower for system analysis | (Ubuntu) Trace MySQL server queries slower than a threshold. | distro | * | |
bpfccopensnoop | Collect output from opensnoop for system analysis | (Ubuntu) Trace open() syscalls. | distro | * | |
bpfccpidpersec | Collect output from pidspersec for system analysis | (Ubuntu) Count new processes (via fork). | distro | * | |
bpfccrunqlat | Collect output from runqlat for system analysis | (Ubuntu) Run queue (scheduler) latency as a histogram. | distro | * | |
bpfccslabratetop | Collect output from slabratetop for system analysis | (Ubuntu) Kernel SLAB/SLUB memory cache allocation rate top. | distro | * | |
bpfccsoftirqs | Collect output from softirqs for system analysis | (Ubuntu) Measure soft IRQ (soft interrupt) event time. | distro | * | |
bpfccstatsnoop | Collect output from statsnoop for system analysis | (Ubuntu) Trace stat() syscalls. | distro | * | |
bpfccsyncsnoop | Collect output from syncsnoop for system analysis | (Ubuntu) Trace sync() syscall. | distro | * | |
bpfcctcpaccept | Collect output from tcpaccept for system analysis | (Ubuntu) Trace TCP passive connections (accept()). | distro | * | |
bpfcctcpconnect | Collect output from tcpconnect for system analysis | (Ubuntu) Trace TCP active connections (connect()). | distro | * | |
bpfcctcpconnlat | Collect output from tcpconnlat for system analysis | (Ubuntu) Trace TCP active connection latency (connect()). | distro | * | |
bpfcctcplife | Collect output from tcplife for system analysis | (Ubuntu) Trace TCP sessions and summarize lifespan. | distro | * | |
bpfcctcpretrans | Collect output from tcpretrans for system analysis | (Ubuntu) Trace TCP retransmits and TLPs. | distro | * | |
bpfcctcptop | Collect output from tcptop for system analysis | (Ubuntu) Summarize TCP send/recv throughput by host. Top for TCP. | distro | * | |
bpfccvfscount | Collect output from vfscount for system analysis | (Ubuntu) Count VFS calls. | distro | * | |
bpfccvfsstat | Collect output from vfsstat for system analysis | (Ubuntu) Count some VFS calls, with column output. | distro | * | |
bpfccxfsdist | Collect output from xfsdist for system analysis | (Ubuntu) Summarize XFS operation latency distribution as a histogram. | distro | * | |
bpfccxfsslower | Collect output from xfsslower for system analysis | (Ubuntu) Trace slow XFS operations. | distro | * | |
cgroups | Collect /proc/cgroups info | ||||
clocksource | Collect details on current clocksource | ||||
cloudinitlog | Gather /etc/cloud-init* log files | ||||
collectl | Collect output from collectl for system analysis | | times | | |
collectlhistory | Gather /var/log/collectl history files | ||||
conntrackfull | Attempts to detect ip_conntrack full | ||||
consoleoverload | Attempts to detect console overload | ||||
cpuinfo | Collect /proc/cpuinfo output | ||||
cron | Gather /etc/cron* files | | | * | |
date | Collect output from date to get system time and timezone | ||||
dhclientleases | Gather a copy of the /var/lib/dhclient/*.lease files | ||||
dig | Collect output from dig for dns troubleshooting | ||||
dmesg | Collect output from dmesg | ||||
dmesgfiles | Gather /var/log/dmesg* files | ||||
dpkgpackages | Collect list of installed packages using dpkg -l | ||||
duplicatefslabels | Search for duplicate filesystem labels | | | * | |
duplicatefsuuid | Find duplicate filesystem UUIDs | | | * | |
duplicatepartuuid | Find duplicate partition UUIDs | | | * | |
ebtablesrules | Collect output from ebtables-save for system analysis | | | * | |
enadiag | Checks the ethool -S output for ENA specific statistics to diagnose i | ||||
entropy | Collect output from entropy_avail for system analysis | | period | | |
environment | Gather /etc/environment file | ||||
ethtool | Collect output from ethtool for system analysis | ||||
ethtoolg | Collect output from ethtool -g for system analysis | ||||
ethtooli | Collect output from ethtool -i for system analysis | ||||
ethtoolk | Collect output from ethtool -k for system analysis | ||||
ethtools | Collect output from ethtool -S for system analysis | ||||
fstab | Gather /etc/fstab file | ||||
fstabfailures | Disables fsck and sets nofail in /etc/fstab for all volumes | | | * | |
gcore | Collect output from gcore for application analysis. | | | * | * |
hosts | Gather /etc/hosts file | ||||
httpdlogs | Gather Apache /var/log/httpd/* or /var/log/apache2/* log files | | | * | |
hungtasks | Detects hung tasks | | | * | |
ifconfig | Collect output from ip addr show for system analysis | ||||
inittab | Gather /etc/inittab file | ||||
interrupts | Collect output from /proc/interrupts for system analysis | | times | | |
iomem | Collect /proc/iomem output | ||||
iostat | Collect output from iostat -x for system analysis | | times | | |
iproute | Collect output from ip route show all for system analysis | ||||
ipslink | Collect output from ip -s link for system analysis | ||||
iptablesrules | Collect output from iptables-save for system analysis | | | * | |
ixgbevfversion | Determines if ixgbevf version is below recommended value | ||||
journal | Collect journalctl output | ||||
kerberosconfig | Gather the Kerberos configuration file | ||||
kernelbug | Detects kernel bugs | | | * | |
kernelcmdline | Collect kernel command line (boot) options | ||||
kernelconfig | Collect /boot/config details | ||||
kerneldereference | Detects kernel null pointer dereferences | | | * | |
kernelpanic | Detects kernel panics | | | * | |
kernelversion | Collect output from uname -r for system analysis | ||||
kpatch | Collect kpatch list output | ||||
kpti | Determine status of Kernel Page Table Isolation | | | * | |
last | Collect last -x output | ||||
libtirpcnetconfig | Gather libtirpc's /etc/netconfig file | ||||
localtime | Collect zdump /etc/localtime output | ||||
lsblk | Collect lsblk output | ||||
lsmod | Collect lsmod output | ||||
lspci | Collect lspci output | ||||
ltrace | Gather output from ltrace -fp for application analysis | | | * | * |
ltracec | Collect output from ltrace -cfp for application analysis | | | * | * |
lvmarchives | Gather /etc/lvm/archives/* files | | | * | |
lvmconf | Gather /etc/lvm/lvm.conf file | ||||
mdstat | Collect /proc/mdstat output | ||||
meminfo | Collect /proc/meminfo output | ||||
messages | Gather /var/log/messages* or /var/log/syslog* files | | | * | |
mounts | Collect /proc/mounts output | ||||
mpstati | Collect output from mpstat -I for system analysis | | times | | |
mpstatp | Collect output from mpstat -P for system analysis | | times | | |
mysqldlog | Gather /var/log/mysqld.log* files | | | * | |
ncport | Test TCP network connectivity to port/destination | | port | | |
netstatanp | Collect output from ss -anp for system analysis | | times | | |
netstats | Collect output from netstat -s for system analysis | | times | | |
networkmanagerstat | Collect output from systemctl status NetworkManager for system analysis | ||||
nginxlogs | Gather /var/log/nginx/* log files | | | * | |
nping | Collect output from nping for network troubleshooting | | port | * | |
npingtraceroute | Collect output from nping traceroute for network troubleshooting | | port | * | |
nsswitch | Gather /etc/nsswitch.conf file | ||||
nstat | Collect output from nstat for system analysis | | times | | |
ntpconf | Gather /etc/ntp.conf | ||||
ntpstat | Collect ntpstat output | ||||
numastat | Collect numastat output | ||||
oomkiller | Detects oom-killer invocations | | | * | |
openfiles | Collect output from lsof \| wc -l for system analysis | | times | | |
openssh | Verify OpenSSH configuration for faults that could prevent remote access | | | * | |
osrelease | Collect details on os release for system analysis | ||||
pagetypeinfo | Collect /proc/pagetypeinfo output | ||||
partitions | Collect /proc/partitions output | ||||
perf | Collect CPU profiling statistics | * | * | ||
perfstat | Collect output from perf for system analysis | times | * | ||
procstat | Collect /proc/stat output | ||||
profile | Gather /etc/profile file | ||||
ps | Collect output from ps for system analysis | times | |||
rebuildinitrd | Rebuilds the system initial ramdisk | remediate | * | ||
resolvconf | Gather the /etc/resolv.conf file | ||||
retpoline | Determine status of kernel retpoline replacements | * | |||
rpmpackages | Collect list of installed packages using rpm -qa | ||||
sarhistory | Gather /var/log/sa (sar) history files | ||||
scheddebug | Collect /proc/sched_debug output | ||||
selinuxpermissive | Sets selinux to permissive mode | remediate | * | ||
slabinfo | Collect /proc/slabinfo output | * | |||
slabtop | Collect output from slabtop for system analysis | times | * | ||
softirqs | Collect output from /proc/softirqs for system analysis | times | |||
softlockup | Detects CPU soft lockups | * | |||
sosreport | Gather a sosreport | * | |||
strace | Gather output from strace -fp for application analysis | * | * | ||
stracec | Collect output from strace -cfp for application analysis | * | * | ||
supportconfig | Gather a supportconfig | distro | * | ||
sysctl | Collect sysctl -a output | ||||
sysctlconf | Collect /etc/sysctl.conf and /etc/sysctl.d files | ||||
systemsmanager | Gather AWS Systems Manager logs and configuration | * | |||
tcpdump | Gather packet capture for network troubleshooting | * | * | ||
tcprecycle | Determines if aggressive TCP recycling is enabled | ||||
tcptraceroute | Collect traceroute output on TCP traffic to a network destination | port | * | ||
top | Collect output from top for system analysis | times | |||
udev | Gather /etc/udev rules and configuration | ||||
udevpersistentnet | Comments out lines in /etc/udev/rules.d/70-persistent-net.rules | remediate | * | ||
vmallocinfo | Collect /proc/vmallocinfo output | * | |||
vmstat | Collect output from vmstat for system analysis | times | |||
vmstatdisk | Collect output from vmstat -d for system analysis | times | |||
vmstatforks | Collect output from vmstat -f for system analysis | times | |||
vmstatslab | Collect output from vmstat -m for system analysis | times | * | ||
w | Collect output from w for system analysis | times | |||
workspacelogs | Gather AWS Linux Workspace log files | distro | * | ||
xenfeatures | Collect details on xen features for system analysis | * | |||
xennetrocket | Attempts to detect xennet issue | ||||
xennetsgmtu | Attempts to detect possibility of xennet scattergather/mtu issue | * | |||
yumconfiguration | Collect yum related configuration file under /etc/yum* output | ||||
yumlog | Gather /var/log/yum.log file | * | |||
zoneinfo | Collect /proc/zoneinfo output | ||||
zypperlog | Gather /var/log/zypp and zypper.log log files | distro | * |
Running EC2 Rescue for Linux
As a trial, let's run the bcctcpconnect module, which traces active TCP connections.
According to the table above, this module requires sudo and takes a period argument, which specifies how many seconds to keep tracing.
sudo ./ec2rl run --only-modules=bcctcpconnect --period=10
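The results are saved to files. The command output shows where they were saved, so look at the log files under that directory. The directory is readable only by root, so running chmod -R 777 on it makes browsing easier (or simply read it with sudo).
(output omitted)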
-------------[Output Logs]-------------
The output logs are located in:
/var/tmp/ec2rl/2024-01-09T05_46_01.266443
(output omitted)
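Each run writes to a new timestamped directory, so a small helper can locate the most recent one. The following is a minimal sketch, not part of ec2rl itself: the function name latest_ec2rl_dir is hypothetical, and it assumes the run directories sit directly under /var/tmp/ec2rl and that their names sort chronologically, as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ec2rl): print the newest run directory.
# Assumes run directories are named with sortable timestamps such as
# 2024-01-09T05_46_01.266443, as in the output shown above.
latest_ec2rl_dir() {
    local base="${1:-/var/tmp/ec2rl}"
    local newest=""
    local dir
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue      # skip when the glob matches nothing
        newest="${dir%/}"              # glob results are sorted, so the last one wins
    done
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}

# Usage (the logs are root-owned, so sudo is needed to read them):
#   sudo ls -R "$(latest_ec2rl_dir)"
```

If the base directory itself is not listable by your user, define and call the function from a root shell instead.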
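Which modules to use
Now that downloading and running the tool are clear, which modules should you use when actually troubleshooting?
I have no experience here at all, so I'll follow the bcc Tutorial.
Basic Linux commands
Before reaching for EC2 Rescue for Linux or BCC, start the performance analysis with basic Linux commands.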
- uptime
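- Shows how long the system has been up, the number of logged-in users, and the 1-, 5-, and 15-minute load averages
- dmesg | tail
- Shows recent system (kernel) messages
- Look for events that could be causing problems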
- vmstat 1
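- Virtual memory statistics
- r: number of runnable processes (running or waiting for CPU time)
- free: free memory
- si, so: memory swapped in from disk and swapped out to disk
- us, sy, id, wa, st: CPU time spent in user, system, idle, I/O wait, and steal (time taken from the virtual machine by the hypervisor)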
- mpstat -P ALL 1
- Shows CPU utilization per core
- pidstat 1
- Shows resource usage per process
- Unlike top, it prints a rolling output (appending new lines instead of clearing the screen)
- iostat -xz 1
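- Kernel I/O statistics
- r/s, w/s, rkB/s, wkB/s: reads, writes, kilobytes read, and kilobytes written per second
- await: average wait time for I/O requests
- avgqu-sz: average I/O queue length (shown as aqu-sz in newer sysstat releases)
- %util: percentage of time the device was busy (not idle)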
- free -m
- Memory usage
- Look at the buffers and cache figures rather than only at free memory
- sar -n DEV 1
- Network interface usage
- sar -n TCP,ETCP 1
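- Shows key TCP metrics
- active/s: locally initiated TCP connections per second (outbound, via connect())
- passive/s: remotely initiated TCP connections per second (inbound, via accept())
- retrans/s: TCP retransmits per second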
- top
- Shows resource usage per process
Some of these commands are also included as EC2 Rescue for Linux modules.
The Netflix blog introduces a few more commands that are useful here.
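The checklist above can be run non-interactively so that the output is captured while a problem is happening. The following is only a sketch under stated assumptions: the function names checklist_commands and run_checklist and the SAMPLES variable are hypothetical, and five one-second samples is an arbitrary default. The interval commands get an explicit count so that they terminate, and top runs in batch mode.

```shell
#!/usr/bin/env bash
# Sketch: run the checklist above non-interactively, with bounded sampling.
# SAMPLES is an assumption (5 one-second samples); adjust as needed.
SAMPLES="${SAMPLES:-5}"

# Print one command per line, in the order listed above.
checklist_commands() {
    cat <<EOF
uptime
dmesg | tail
vmstat 1 $SAMPLES
mpstat -P ALL 1 $SAMPLES
pidstat 1 $SAMPLES
iostat -xz 1 $SAMPLES
free -m
sar -n DEV 1 $SAMPLES
sar -n TCP,ETCP 1 $SAMPLES
top -b -n 1
EOF
}

# Run each command, skipping tools that are not installed
# (mpstat, pidstat, iostat, and sar come from the sysstat package).
run_checklist() {
    local cmd tool
    checklist_commands | while IFS= read -r cmd; do
        tool="${cmd%% *}"              # first word of the command line
        if command -v "$tool" >/dev/null 2>&1; then
            printf '\n### %s\n' "$cmd"
            sh -c "$cmd" </dev/null 2>&1
        else
            printf '\n### %s (skipped: %s not found)\n' "$cmd" "$tool"
        fi
    done
}

# Usage:
#   run_checklist > "checklist-$(date +%Y%m%dT%H%M%S).log"
```

On some systems dmesg is restricted to root, in which case that step prints an error unless the script runs with sudo.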
BCC
Here is a summary of the tools introduced in the bcc Tutorial.
In EC2 Rescue for Linux, the corresponding modules carry a bcc or bpfcc prefix.
- execsnoop
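- Traces newly executed processes/commands
- Shows the parent process/command name, PID, return value, and the full path with arguments
- opensnoop
- Traces the open() syscall
- Shows which files an application is opening
- An application that keeps trying to open the wrong files (for example, ones that do not exist) can suffer performance problems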
- ext4slower
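- Traces the latency of ext4 file system operations and reports the slow ones
- Helps uncover I/O latency as seen through the file system
- biolatency
- Traces block device I/O latency and summarizes it as a histogram
- biosnoop
- Traces block device latency for each individual I/O
- Shows the process name and PID, so you can tell which process is behind the slow I/O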
- cachestat
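- File system cache statistics
- tcpconnect
- Traces active TCP connections (outbound, initiated with connect())
- Helps spot inefficient connection patterns or an unexpected intruder
- tcpaccept
- Traces passive TCP connections (inbound, accepted with accept())
- Helps spot inefficient connection patterns or an unexpected intruder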
- tcpretrans
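- Traces TCP retransmits
- TCP retransmits can cause latency and throughput problems
- If retransmits on ESTABLISHED connections are frequent, review the network
- If retransmits in SYN_SENT are frequent, suspect kernel CPU saturation or packet drops on the target
- runqlat
- Traces how long threads wait on the CPU run queue
- Long waits mean the CPUs cannot keep up with the work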
- profile
- CPU profiler
- Shows sampled stack traces
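To make the list above easier to act on, the tools can be grouped by the kind of bottleneck you suspect. This is a rough sketch: the function suggest_bcc_tools and its four categories are this example's own assumptions, not something defined by bcc or ec2rl. Since the exact ec2rl module names are not assumed here, the usage example looks them up with ./ec2rl list.

```shell
#!/usr/bin/env bash
# Sketch: map a suspected bottleneck to the BCC tools summarized above.
# The grouping into categories is this example's own assumption.
suggest_bcc_tools() {
    case "$1" in
        process) echo "execsnoop opensnoop" ;;
        disk)    echo "ext4slower biolatency biosnoop cachestat" ;;
        network) echo "tcpconnect tcpaccept tcpretrans" ;;
        cpu)     echo "runqlat profile" ;;
        *)       echo "usage: suggest_bcc_tools process|disk|network|cpu" >&2
                 return 1 ;;
    esac
}

# Usage: list the ec2rl modules whose names contain a suggested tool name.
#   for tool in $(suggest_bcc_tools network); do
#       ./ec2rl list | grep -i "$tool"
#   done
```

Any module found this way can then be passed to --only-modules in the same way as bcctcpconnect earlier.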
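Summary
When a problem occurs, the investigation often drags on with the cause still unknown. It helps to know in advance what information to collect, and how, to get to the root cause faster.
The cause might lie in the application or, on the infrastructure side, in the CPU, device I/O, the network, or the kernel, so there is a lot of ground to cover. Using tools like these, or at least knowing about them, should help you find the cause sooner.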
References
aws-ec2rescue-linux
bcc
I want to troubleshoot performance bottlenecks in my Amazon EC2 Linux instance. What advanced tools can I use with EC2Rescue for Linux to do that?
Use EC2Rescue for Linux