
I Was Curious About eBPF, So I Tried EC2 Rescue for Linux, Which Can Troubleshoot with eBPF

Published 2024/01/10

Hello.
How are you doing?
I'm 吉井 亮, a big fan of the phrase "No human labor is no human error".

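OpsJAWS, which I help run, holds study sessions on operations with AWS services. I'm thinking of making "The latest monitoring with EC2" the theme of the next session, so this post covers a related EC2 topic.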

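This time I looked into EC2 Rescue for Linux.
It is a tool for diagnosing and troubleshooting common problems on EC2 Linux instances. It collects a wide range of information that is useful for diagnosis, and it can also upload the collected information to a location specified by AWS Support.
When a problem occurs, running this tool first to save the state of the Linux server may be a good idea.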

Installing the BPF Compiler Collection (BCC)

EC2 Rescue for Linux works without it, but I install BCC so that it can collect more detailed information.
On Amazon Linux 2023 you can install it with dnf. For other distributions, see Installing BCC.

sudo dnf install bcc

Add the BCC tools directory to PATH for your user on the Linux server.

export PATH=$PATH:/usr/share/bcc/tools

To be safe, add it to bashrc as well so it persists across logins.

~/.bashrc
# BCC Tools
export PATH=$PATH:/usr/share/bcc/tools
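
Because the same export now runs both interactively and from bashrc, the directory can end up on PATH more than once. A minimal sketch of an idempotent variant follows; the helper name path_append is hypothetical and not part of BCC or EC2 Rescue for Linux.

```shell
# Append a directory to PATH only if it is not already listed.
# path_append is a hypothetical helper name used for illustration.
path_append() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already on PATH, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
}

path_append /usr/share/bcc/tools
export PATH
```

Sourcing this repeatedly leaves PATH unchanged after the first call, so it is safe to keep in ~/.bashrc.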

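Some EC2 Rescue for Linux modules require sudo, so the BCC tools must also be reachable when running under sudo.
Append the directory to secure_path in /etc/sudoers (editing it with visudo is the safer route).

(Before)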
Defaults    secure_path = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/lib/snapd/snap/bin

(After)
Defaults    secure_path = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/lib/snapd/snap/bin:/usr/share/bcc/tools
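
The before/after change above can also be expressed as a small text filter. This is only a sketch: append_secure_path is a hypothetical helper, and it assumes an unquoted secure_path value written on a single line, as in the example above.

```shell
# Read sudoers text on stdin and append DIR to the secure_path line,
# unless DIR is already listed. Other lines pass through unchanged.
# append_secure_path is a hypothetical helper name.
append_secure_path() {
  awk -v dir="$1" '
    /^Defaults[ \t]+secure_path/ {
      sub(/[ \t]+$/, "")                 # drop trailing whitespace
      paths = $0
      sub(/^[^=]*=[ \t]*/, "", paths)    # keep only the colon-separated list
      if (index(":" paths ":", ":" dir ":") == 0)
        $0 = $0 ":" dir
    }
    { print }
  '
}

# Example (writes to a copy, never to /etc/sudoers directly):
#   sudo cat /etc/sudoers | append_secure_path /usr/share/bcc/tools > sudoers.new
```

Writing to a copy first lets you check the result with visudo -c -f sudoers.new before installing it, since a syntax error in /etc/sudoers can lock you out of sudo.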

Downloading EC2 Rescue for Linux

Quoting from the official documentation, the prerequisites are as follows.

  • Supported operating systems
    • Amazon Linux 2
    • Amazon Linux 2016.09+
    • SUSE Linux Enterprise Server 12+
    • RHEL 7+
    • Ubuntu 16.04+
  • Software requirements
    • Python 2.7.9+ or 3.2+
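
The Python requirement can be checked before downloading anything. The sketch below encodes only the "2.7.9+ or 3.2+" rule quoted above; python_version_ok is a hypothetical helper, and it assumes a purely numeric dotted version string.

```shell
# Return 0 if a version string such as "3.9.16" satisfies "2.7.9+ or 3.2+".
# python_version_ok is a hypothetical helper name.
python_version_ok() {
  IFS=. read -r major minor patch <<EOF
$1
EOF
  patch=${patch:-0}            # treat "3.11" as "3.11.0"
  case $major in
    2) [ "$minor" -eq 7 ] && [ "$patch" -ge 9 ] ;;
    3) [ "$minor" -ge 2 ] ;;
    *) return 1 ;;             # anything else is outside the documented range
  esac
}

# Example:
#   python_version_ok "$(python3 -c 'import platform; print(platform.python_version())')" \
#     && echo "Python version is supported"
```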

Download the archive and its checksum file, then verify the hash.

$ wget https://s3.amazonaws.com/ec2rescuelinux/ec2rl.tgz
$ wget https://s3.amazonaws.com/ec2rescuelinux/ec2rl.tgz.sha256
$ sha256sum -c ec2rl.tgz.sha256 
ec2rl.tgz: OK

If it prints OK, extract the tarball.

$ tar -zxf ec2rl.tgz 
$ ls
ec2rl-1.1.6  ec2rl.tgz  ec2rl.tgz.sha256

$ ls ec2rl-1.1.6/
LICENSE  NOTICE  README.md  docs  ec2rl  ec2rl.py  ec2rlcore  example_configs  example_modules  functions.bash  lib  mod.d  post.d  pre.d  requirements.txt  ssmdocs 

The archive extracted without problems, so the download step is complete.
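
The verify-then-extract steps above can be chained so that extraction only happens when the checksum matches. A minimal sketch, assuming the checksum file sits next to the archive and is named with a .sha256 suffix as in the download above; verify_and_extract is a hypothetical helper.

```shell
# Extract ARCHIVE only if ARCHIVE.sha256 verifies against it.
# verify_and_extract is a hypothetical helper name.
verify_and_extract() {
  archive=$1
  if sha256sum -c "${archive}.sha256" >/dev/null 2>&1; then
    tar -zxf "$archive"
  else
    echo "checksum mismatch for $archive; not extracting" >&2
    return 1
  fi
}

# Example, run in the directory that holds both downloaded files:
#   verify_and_extract ec2rl.tgz
```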

Viewing the help

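I had no idea how to use it, so I started with the built-in help by listing the available modules from inside the extracted ec2rl-1.1.6 directory.
The output is quite long.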

$ ./ec2rl list
Here is a list of available modules that apply to the current host:

S P R Module Name         Class     Domain       Description                                                            
      amazonlinuxextras   collect   os           Collect amazon-linux-extras list output                                
*     aptlog              gather    os           Gather /var/log/apt and /var/log/dpkg.log files                        
    * arpcache            diagnose  net          Determines if aggressive arp caching is enabled                        
    * arpignore           diagnose  net          Determines if any interfaces have been set to ignore arp requests      
      arptable            collect   net          Collect output from ip neighbor show for system analysis               
*     arptablesrules      collect   net          Collect output from arptables-save for system analysis                 
      asymmetricroute     diagnose  net          Check for asymmetric routing                                           
      atop                collect   performance  Collect output from atop for system analysis                           
      atophistory         gather    performance  Gather /var/log/atop history files                                     
*     bccbiolatency       collect   performance  Collect output from biolatency for system analysis                     
*     bccbiosnoop         collect   performance  Collect output from biosnoop for system analysis.                      
*     bccbiotop           collect   performance  Collect output from biotop for system analysis                         
*     bccbitesize         collect   performance  Collect output from bitesize for system analysis.                      
*     bcccachestat        collect   performance  Collect output from cachestat for system analysis.                     
*     bccdcsnoop          collect   performance  Collect output from dcsnoop for system analysis.                       
*     bccdcstat           collect   performance  Collect output from dcstat for system analysis.                        
*     bccexecsnoop        collect   performance  Collect output from execsnoop for system analysis.                     
*     bccext4dist         collect   performance  Collect output from ext4dist for system analysis                       
*     bccext4slower       collect   performance  Collect output from ext4slower for system analysis                     
*     bccfilelife         collect   performance  Collect output from filelife for system analysis                       
*     bccfileslower       collect   performance  Collect output from fileslower for system analysis                     
*     bccfiletop          collect   performance  Collect output from filetop for system analysis                        
*     bccgethostlatency   collect   performance  Collect output from gethostlatency for system analysis                 
*     bcchardirqs         collect   performance  Collect output from hardirqs for system analysis                       
*     bcckillsnoop        collect   performance  Collect output from killsnoop for system analysis                      
*     bccmysqldqslower    performa  net          Collect output from mysqld_qslower for system analysis                 
*     bccopensnoop        collect   performance  Collect output from opensnoop for system analysis                      
*     bccpidpersec        collect   performance  Collect output from pidspersec for system analysis                     
*     bccrunqlat          collect   performance  Collect output from runqlat for system analysis                        
*     bccslabratetop      collect   performance  Collect output from slabratetop for system analysis                    
*     bccsoftirqs         collect   performance  Collect output from softirqs for system analysis                       
*     bccstatsnoop        collect   performance  Collect output from statsnoop for system analysis                      
*     bccsyncsnoop        collect   performance  Collect output from syncsnoop for system analysis                      
*     bcctcpaccept        collect   net          Collect output from tcpaccept for system analysis                      
*     bcctcpconnect       collect   net          Collect output from tcpconnect for system analysis                     
*     bcctcpconnlat       collect   net          Collect output from tcpconnlat for system analysis                     
*     bcctcplife          collect   net          Collect output from tcplife for system analysis                        
*     bcctcpretrans       collect   net          Collect output from tcpretrans for system analysis                     
*     bcctcptop           collect   net          Collect output from tcptop for system analysis                         
*     bccvfscount         collect   performance  Collect output from vfscount for system analysis                       
*     bccvfsstat          collect   performance  Collect output from vfsstat for system analysis                        
*     bccxfsdist          collect   performance  Collect output from xfsdist for system analysis                        
*     bccxfsslower        collect   performance  Collect output from xfsslower for system analysis                      
      blkid               collect   os           Collect blkid output                                                   
*     bpfccbiolatency     collect   performance  Collect output from biolatency for system analysis                     
*     bpfccbiosnoop       collect   performance  Collect output from biosnoop for system analysis.                      
*     bpfccbiotop         collect   performance  Collect output from biotop for system analysis                         
*     bpfccbitesize       collect   performance  Collect output from bitesize for system analysis.                      
*     bpfcccachestat      collect   performance  Collect output from cachestat for system analysis.                     
*     bpfccdcsnoop        collect   performance  Collect output from dcsnoop for system analysis.                       
*     bpfccdcstat         collect   performance  Collect output from dcstat for system analysis.                        
*     bpfccexecsnoop      collect   performance  Collect output from execsnoop for system analysis.                     
*     bpfccext4dist       collect   performance  Collect output from ext4dist for system analysis                       
*     bpfccext4slower     collect   performance  Collect output from ext4slower for system analysis                     
*     bpfccfilelife       collect   performance  Collect output from filelife for system analysis                       
*     bpfccfileslower     collect   performance  Collect output from fileslower for system analysis                     
*     bpfccfiletop        collect   performance  Collect output from filetop for system analysis                        
*     bpfccgethostlatenc  collect   performance  Collect output from gethostlatency for system analysis                 
*     bpfcchardirqs       collect   performance  Collect output from hardirqs for system analysis                       
*     bpfcckillsnoop      collect   performance  Collect output from killsnoop for system analysis                      
*     bpfccmysqldqslower  performa  net          Collect output from mysqld_qslower for system analysis                 
*     bpfccopensnoop      collect   performance  Collect output from opensnoop for system analysis                      
*     bpfccpidpersec      collect   performance  Collect output from pidspersec for system analysis                     
*     bpfccrunqlat        collect   performance  Collect output from runqlat for system analysis                        
*     bpfccslabratetop    collect   performance  Collect output from slabratetop for system analysis                    
*     bpfccsoftirqs       collect   performance  Collect output from softirqs for system analysis                       
*     bpfccstatsnoop      collect   performance  Collect output from statsnoop for system analysis                      
*     bpfccsyncsnoop      collect   performance  Collect output from syncsnoop for system analysis                      
*     bpfcctcpaccept      collect   net          Collect output from tcpaccept for system analysis                      
*     bpfcctcpconnect     collect   net          Collect output from tcpconnect for system analysis                     
*     bpfcctcpconnlat     collect   net          Collect output from tcpconnlat for system analysis                     
*     bpfcctcplife        collect   net          Collect output from tcplife for system analysis                        
*     bpfcctcpretrans     collect   net          Collect output from tcpretrans for system analysis                     
*     bpfcctcptop         collect   net          Collect output from tcptop for system analysis                         
*     bpfccvfscount       collect   performance  Collect output from vfscount for system analysis                       
*     bpfccvfsstat        collect   performance  Collect output from vfsstat for system analysis                        
*     bpfccxfsdist        collect   performance  Collect output from xfsdist for system analysis                        
*     bpfccxfsslower      collect   performance  Collect output from xfsslower for system analysis                      
      cgroups             collect   os           Collect /proc/cgroups info                                             
      clocksource         collect   os           Collect details on current clocksource                                 
      cloudinitlog        gather    os           Gather /etc/cloud-init* log files                                      
      collectl            collect   performance  Collect output from collectl for system analysis                       
      collectlhistory     gather    performance  Gather /var/log/collectl history files                                 
      conntrackfull       diagnose  net          Attempts to detect ip_conntrack full                                   
      consoleoverload     diagnose  os           Attempts to detect console overload                                    
      cpuinfo             collect   os           Collect /proc/cpuinfo output                                           
*     cron                gather    os           Gather /etc/cron* files                                                
      date                collect   os           Collect output from date to get system time and timezone               
      dhclientleases      gather    application  Gather a copy of the /var/lib/dhclient/*.lease files                   
      dig                 collect   net          Collect output from dig for dns troubleshooting                        
      dmesg               collect   os           Collect output from dmesg                                              
      dmesgfiles          gather    os           Gather /var/log/dmesg* files                                           
      dpkgpackages        collect   os           Collect list of installed packages using dpkg -l                       
*     duplicatefslabels   diagnose  os           Search for duplicate filesystem labels.                                
*     duplicatefsuuid     diagnose  os           Find duplicate filesystem UUIDs                                        
*     duplicatepartuuid   diagnose  os           Find duplicate partition UUIDs                                         
*     ebtablesrules       collect   net          Collect output from ebtables-save for system analysis                  
      enadiag             diagnose  net          Checks the ethool -S output for ENA specific statistics to diagnose i  
      entropy             collect   performance  Collect output from entropy_avail for system analysis.                 
      environment         gather    os           Gather /etc/environment file                                           
      ethtool             collect   net          Collect output from ethtool for system analysis                        
      ethtoolg            collect   net          Collect output from ethtool -g for system analysis                     
      ethtooli            collect   net          Collect output from ethtool -i for system analysis                     
      ethtoolk            collect   net          Collect output from ethtool -k for system analysis                     
      ethtools            collect   net          Collect output from ethtool -S for system analysis                     
      fstab               gather    os           Gather /etc/fstab file                                                 
*   * fstabfailures       diagnose  os           Disables fsck and sets nofail in /etc/fstab for all volumes            
* *   gcore               gather    performance  Collect output from gcore for application analysis.                    
      hosts               gather    os           Gather /etc/hosts file                                                 
*     httpdlogs           gather    application  Gather Apache /var/log/httpd/* or /var/log/apache2/* log files         
*     hungtasks           diagnose  os           Detects hung tasks                                                     
      ifconfig            collect   net          Collect output from ip addr show for system analysis                   
      inittab             gather    os           Gather /etc/inittab file                                               
      interrupts          collect   performance  Collect output from /proc/interrupts for system analysis               
      iomem               collect   os           Collect /proc/iomem output                                             
      iostat              collect   performance  Collect output from iostat -x for system analysis                      
      iproute             collect   net          Collect output from ip route show all for system analysis              
      ipslink             collect   net          Collect output from ip -s link for system analysis                     
*     iptablesrules       collect   net          Collect output from iptables-save for system analysis                  
      ixgbevfversion      diagnose  net          Determines if ixgbevf version is below recommended value               
      journal             collect   os           Collect journalctl output                                              
      kerberosconfig      gather    os           Gather the Kerberos configuration file                                 
*     kernelbug           diagnose  os           Detects kernel bugs                                                    
      kernelcmdline       collect   os           Collect kernel command line (boot) options                             
      kernelconfig        gather    os           Collect /boot/config details                                           
*     kerneldereference   diagnose  os           Detects kernel null pointer dereferences                               
*     kernelpanic         diagnose  os           Detects kernel panics                                                  
      kernelversion       collect   os           Collect output from uname -r for system analysis                       
      kpatch              collect   os           Collect kpatch list output                                             
*     kpti                collect   os           Determine status of Kernel Page Table Isolation.                       
      last                collect   os           Collect last -x output                                                 
      libtirpcnetconfig   gather    os           Gather libtirpc's /etc/netconfig file                                  
      localtime           collect   os           Collect zdump /etc/localtime output                                    
      lsblk               collect   os           Collect lsblk output                                                   
      lsmod               collect   os           Collect lsmod output                                                   
      lspci               collect   os           Collect lspci output                                                   
* *   ltrace              gather    performance  Gather output from ltrace -fp for application analysis                 
* *   ltracec             collect   performance  Collect output from ltrace -cfp for application analysis               
*     lvmarchives         gather    application  Gather /etc/lvm/archives/* files                                       
      lvmconf             gather    os           Gather /etc/lvm/lvm.conf file                                          
      mdstat              collect   os           Collect /proc/mdstat output                                            
      meminfo             collect   os           Collect /proc/meminfo output                                           
*     messages            gather    os           Gather /var/log/messages* or /var/log/syslog* files                    
      mounts              collect   os           Collect /proc/mounts output                                            
      mpstati             collect   performance  Collect output from mpstat -I for system analysis                      
      mpstatp             collect   performance  Collect output from mpstat -P for system analysis                      
*     mysqldlog           gather    application  Gather /var/log/mysqld.log* files                                      
      ncport              collect   net          Test TCP network connectivity to port/destination                      
      netstatanp          collect   performance  Collect output from ss -anp for system analysis                        
      netstats            collect   performance  Collect output from netstat -s for system analysis                     
      networkmanagerstat  collect   net          Collect output from systemctl status NetworkManager for system analys  
*     nginxlogs           gather    application  Gather /var/log/nginx/* log files                                      
*     nping               collect   net          Collect output from nping for network troubleshooting.                 
*     npingtraceroute     collect   net          Collect output from nping traceroute for network troubleshooting       
      nsswitch            gather    os           Gather /etc/nsswitch.conf file                                         
      nstat               collect   performance  Collect output from nstat for system analysis                          
      ntpconf             gather    os           Gather /etc/ntp.conf                                                   
      ntpstat             collect   os           Collect ntpstat output                                                 
      numastat            collect   os           Collect numastat output                                                
*     oomkiller           diagnose  os           Detects oom-killer invocations                                         
      openfiles           collect   performance  Collect output from lsof | wc -l for system analysis                   
*   * openssh             diagnose  os           Verify OpenSSH configuration for faults that could prevent remote acc  
      osrelease           collect   os           Collect details on os release for system analysis                      
      pagetypeinfo        collect   os           Collect /proc/pagetypeinfo output                                      
      partitions          collect   os           Collect /proc/partitions output                                        
* *   perf                collect   performance  Collect CPU profiling statistics                                       
*     perfstat            collect   performance  Collect output from perf for system analysis                           
      procstat            collect   os           Collect /proc/stat output                                              
      profile             gather    os           Gather /etc/profile file                                               
      ps                  collect   performance  Collect output from ps for system analysis                             
*   * rebuildinitrd       diagnose  os           Rebuilds the system initial ramdisk                                    
      resolvconf          gather    net          Gather the /etc/resolv.conf file                                       
*     retpoline           collect   os           Determine status of kernel retpoline replacements.                     
      rpmpackages         collect   os           Collect list of installed packages using rpm -qa                       
      sarhistory          gather    performance  Gather /var/log/sa (sar) history files                                 
      scheddebug          collect   os           Collect /proc/sched_debug output                                       
*   * selinuxpermissive   diagnose  os           Sets selinux to permissive mode                                        
*     slabinfo            collect   os           Collect /proc/slabinfo output                                          
*     slabtop             collect   performance  Collect output from slabtop for system analysis                        
      softirqs            collect   performance  Collect output from /proc/softirqs for system analysis                 
*     softlockup          diagnose  os           Detects CPU soft lockups                                               
*     sosreport           gather    os           Gather a sosreport                                                     
* *   strace              gather    performance  Gather output from strace -fp for application analysis                 
* *   stracec             collect   performance  Collect output from strace -cfp for application analysis               
*     supportconfig       gather    os           Gather a supportconfig                                                 
      sysctl              collect   os           Collect sysctl -a output                                               
      sysctlconf          gather    os           Collect /etc/sysctl.conf and /etc/sysctl.d files                       
*     systemsmanager      gather    os           Gather AWS Systems Manager logs and configuration                      
* *   tcpdump             gather    net          Gather packet capture for network troubleshooting.                     
    * tcprecycle          diagnose  net          Determines if aggressive TCP recycling is enabled                      
*     tcptraceroute       collect   net          Collect traceroute output on TCP traffic to a network destination.     
      top                 collect   performance  Collect output from top for system analysis                            
      udev                gather    os           Gather /etc/udev rules and configuration                               
*   * udevpersistentnet   diagnose  net          Comments out lines in /etc/udev/rules.d/70-persistent-net.rules        
*     vmallocinfo         collect   os           Collect /proc/vmallocinfo output                                       
      vmstat              collect   performance  Collect output from vmstat for system analysis                         
      vmstatdisk          collect   performance  Collect output from vmstat -d for system analysis                      
      vmstatforks         collect   performance  Collect output from vmstat -f for system analysis                      
*     vmstatslab          collect   performance  Collect output from vmstat -m for system analysis                      
      w                   collect   performance  Collect output from w for system analysis                              
*     workspacelogs       gather    os           Gather AWS Linux Workspace log files                                   
*     xenfeatures         collect   os           Collect details on xen features for system analysis                    
      xennetrocket        diagnose  net          Attempts to detect xennet issue                                        
*     xennetsgmtu         diagnose  net          Attempts to detect possibility of xennet scattergather/mtu issue       
      yumconfiguration    collect   os           Collect yum related configuration file under /etc/yum* output          
*     yumlog              gather    os           Gather /var/log/yum.log file                                           
      zoneinfo            collect   os           Collect /proc/zoneinfo output                                          
*     zypperlog           gather    os           Gather /var/log/zypp and zypper.log log files                          

S: Requires sudo/root to run
P: Requires --perfimpact=true to run (can potentially cause performance impact)
R: Supports remediation if --remediate is given

Classes refer to the type of task the module performs
 Diagnose: success/fail/warn conditions determined by module.
 Gather: create a copy of a local file for inspection.
 Collect: collect command output

Domains are defined per module and refer to the general area of investigation for the module.

To see module help, you can run:

ec2rl help [MODULEa ... MODULEx]
ec2rl help [--only-modules=MODULEa ... MODULEx] [--only-domains=DOMAINa ... DOMAINx]
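
With this many modules, it helps to filter the list by flag. The sketch below relies on an assumption read off the output above: the S, P, and R flags sit in character columns 1, 3, and 5, and the module name starts at column 7. modules_with_flag is a hypothetical helper, not an ec2rl feature.

```shell
# Read `ec2rl list` output on stdin and print the names of modules that have
# "*" in the given flag column (S = sudo, P = perfimpact, R = remediation).
# modules_with_flag is a hypothetical helper name.
modules_with_flag() {
  case $1 in
    S) col=1 ;;
    P) col=3 ;;
    R) col=5 ;;
    *) echo "usage: modules_with_flag S|P|R" >&2; return 2 ;;
  esac
  awk -v col="$col" '
    substr($0, col, 1) == "*" {
      split(substr($0, 7), f, " ")   # module name is the first word from column 7
      print f[1]
    }
  '
}

# Example:
#   ./ec2rl list | modules_with_flag R
```

If a future ec2rl version changes the column layout, the fixed offsets would need adjusting.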

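EC2 Rescue for Linux ships with a large number of modules.
Running it with the run subcommand executes every module whose prerequisite software is installed. As the legend above notes, modules marked P additionally require --perfimpact=true.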

sudo ./ec2rl run

To run only specific modules, use the --only-modules option. Some modules also take arguments, so pass the appropriate ones.
Note that some modules require sudo.

$ ./ec2rl run --only-modules=module_name1,module_name2 --arguments=value
or
$ sudo ./ec2rl run --only-modules=module_name1,module_name2 --arguments=value
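
When the module list gets long, building the comma-separated value by hand is error-prone. A small sketch that joins names into one --only-modules argument; only_modules_arg is a hypothetical helper, and the module names in the example are taken from the list output above.

```shell
# Join module names into a single --only-modules=a,b,c argument.
# only_modules_arg is a hypothetical helper name.
only_modules_arg() {
  # "$*" joins the arguments with the first character of IFS;
  # the subshell keeps the IFS change local.
  (IFS=,; printf -- '--only-modules=%s\n' "$*")
}

# Example:
#   sudo ./ec2rl run "$(only_modules_arg bcctcplife bcctcpretrans)"
```

Any module-specific arguments still have to be supplied separately.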

The table below summarizes the modules: their descriptions, the arguments they take, whether they need sudo, and whether they affect performance.

| Module Name | Description shown in help | Supplementary note (where the description alone is unclear) | Argument | Requires sudo | Performance impact |
| --- | --- | --- | --- | --- | --- |
| amazonlinuxextras | Collect amazon-linux-extras list output | | distro | | |
| aptlog | Gather /var/log/apt and /var/log/dpkg.log files | | distro | * | |
| arpcache | Determines if aggressive arp caching is enabled | | | | |
| arpignore | Determines if any interfaces have been set to ignore arp requests | | | | |
| arptable | Collect output from ip neighbor show for system analysis | | | | |
| arptablesrules | Collect output from arptables-save for system analysis | | | * | |
| asymmetricroute | Check for asymmetric routing | | | | |
| atop | Collect output from atop for system analysis | | times | | |
| atophistory | Gather /var/log/atop history files | | times | | |
| bccbiolatency | Collect output from biolatency for system analysis | Summarize block device I/O latency as a histogram. | distro | * | |
| bccbiosnoop | Collect output from biosnoop for system analysis | Trace block device I/O with PID and latency. | distro | * | |
| bccbiotop | Collect output from biotop for system analysis | Top for disks: Summarize block device I/O by process. | times | * | |
| bccbitesize | Collect output from bitesize for system analysis | Show per process I/O size histogram. | period | * | |
| bcccachestat | Collect output from cachestat for system analysis | Trace page cache hit/miss ratio. | times | * | |
| bccdcsnoop | Collect output from dcsnoop for system analysis | Trace directory entry cache (dcache) lookups. | period | * | |
| bccdcstat | Collect output from dcstat for system analysis | Directory entry cache (dcache) stats. | times | * | |
| bccexecsnoop | Collect output from execsnoop for system analysis | Trace new processes via exec() syscalls. | period | * | |
| bccext4dist | Collect output from ext4dist for system analysis | Summarize ext4 operation latency distribution as a histogram. | times | * | |
| bccext4slower | Collect output from ext4slower for system analysis | Trace slow ext4 operations. | period | * | |
| bccfilelife | Collect output from filelife for system analysis | Trace the lifespan of short-lived files. | period | * | |
| bccfileslower | Collect output from fileslower for system analysis | Trace slow synchronous file reads and writes. | period | * | |
| bccfiletop | Collect output from filetop for system analysis | File reads and writes by filename and process. Top for files. | times | * | |
| bccgethostlatency | Collect output from gethostlatency for system analysis | Show latency for getaddrinfo/gethostbyname[2] calls. | period | * | |
| bcchardirqs | Collect output from hardirqs for system analysis | Measure hard IRQ (hard interrupt) event time. | times | * | |
| bcckillsnoop | Collect output from killsnoop for system analysis | Trace signals issued by the kill() syscall. | period | * | |
| bccmysqldqslower | Collect output from mysqld_qslower for system analysis | Trace MySQL server queries slower than a threshold. | threshold | * | |
| bccopensnoop | Collect output from opensnoop for system analysis | Trace open() syscalls. | period | * | |
| bccpidpersec | Collect output from pidspersec for system analysis | Count new processes (via fork). | period | * | |
| bccrunqlat | Collect output from runqlat for system analysis | Run queue (scheduler) latency as a histogram. | times | * | |
| bccslabratetop | Collect output from slabratetop for system analysis | Kernel SLAB/SLUB memory cache allocation rate top. | times | * | |
| bccsoftirqs | Collect output from softirqs for system analysis | Measure soft IRQ (soft interrupt) event time. | times | * | |
| bccstatsnoop | Collect output from statsnoop for system analysis | Trace stat() syscalls. | period | * | |
| bccsyncsnoop | Collect output from syncsnoop for system analysis | Trace sync() syscall. | period | * | |
| bcctcpaccept | Collect output from tcpaccept for system analysis | Trace TCP passive connections (accept()). | period | * | |
| bcctcpconnect | Collect output from tcpconnect for system analysis | Trace TCP active connections (connect()). | period | * | |
| bcctcpconnlat | Collect output from tcpconnlat for system analysis | Trace TCP active connection latency (connect()). | period | * | |
| bcctcplife | Collect output from tcplife for system analysis | Trace TCP sessions and summarize lifespan. | period | * | |
| bcctcpretrans | Collect output from tcpretrans for system analysis | Trace TCP retransmits and TLPs. | period | * | |
| bcctcptop | Collect output from tcptop for system analysis | Summarize TCP send/recv throughput by host. Top for TCP. | times | * | |
| bccvfscount | Collect output from vfscount for system analysis | Count VFS calls. | period | * | |
| bccvfsstat | Collect output from vfsstat for system analysis | Count some VFS calls, with column output. | times | * | |
| bccxfsdist | Collect output from xfsdist for system analysis | Summarize XFS operation latency distribution as a histogram. | times | * | |
| bccxfsslower | Collect output from xfsslower for system analysis | Trace slow XFS operations. | period | * | |
| blkid | Collect blkid output | | | | |
| bpfccbiolatency | Collect output from biolatency for system analysis | (Ubuntu) Summarize block device I/O latency as a histogram. | distro | * | |
| bpfccbiosnoop | Collect output from biosnoop for system analysis | (Ubuntu) Trace block device I/O with PID and latency. | distro | * | |
| bpfccbiotop | Collect output from biotop for system analysis | (Ubuntu) Top for disks: Summarize block device I/O by process. | distro | * | |
| bpfccbitesize | Collect output from bitesize for system analysis | (Ubuntu) Show per process I/O size histogram. | distro | * | |
| bpfcccachestat | Collect output from cachestat for system analysis | (Ubuntu) Trace page cache hit/miss ratio. | distro | * | |
| bpfccdcsnoop | Collect output from dcsnoop for system analysis | (Ubuntu) Trace directory entry cache (dcache) lookups. | distro | * | |
| bpfccdcstat | Collect output from dcstat for system analysis | (Ubuntu) Directory entry cache (dcache) stats. | distro | * | |
| bpfccexecsnoop | Collect output from execsnoop for system analysis | (Ubuntu) Trace new processes via exec() syscalls. | distro | * | |
| bpfccext4dist | Collect output from ext4dist for system analysis | (Ubuntu) Summarize ext4 operation latency distribution as a histogram. | distro | * | |
bpfccext4slower Collect output from ext4slower for system analysis (Ubuntu) Trace slow ext4 operations. distro *
bpfccfilelife Collect output from filelife for system analysis (Ubuntu) Trace the lifespan of short-lived files. distro *
bpfccfileslower Collect output from fileslower for system analysis (Ubuntu) Trace slow synchronous file reads and writes. distro *
bpfccfiletop Collect output from filetop for system analysis (Ubuntu) File reads and writes by filename and process. Top for files. distro *
bpfccgethostlatency Collect output from gethostlatency for system analysis (Ubuntu) Show latency for getaddrinfo/gethostbyname[2] calls. distro *
bpfcchardirqs Collect output from hardirqs for system analysis (Ubuntu) Measure hard IRQ (hard interrupt) event time. distro *
bpfcckillsnoop Collect output from killsnoop for system analysis (Ubuntu) Trace signals issued by the kill() syscall. distro *
bpfccmysqldqslower Collect output from mysqld_qslower for system analysis (Ubuntu) Trace MySQL server queries slower than a threshold. distro *
bpfccopensnoop Collect output from opensnoop for system analysis (Ubuntu) Trace open() syscalls. distro *
bpfccpidpersec Collect output from pidspersec for system analysis (Ubuntu) Count new processes (via fork). distro *
bpfccrunqlat Collect output from runqlat for system analysis (Ubuntu) Run queue (scheduler) latency as a histogram. distro *
bpfccslabratetop Collect output from slabratetop for system analysis (Ubuntu) Kernel SLAB/SLUB memory cache allocation rate top. distro *
bpfccsoftirqs Collect output from softirqs for system analysis (Ubuntu) Measure soft IRQ (soft interrupt) event time. distro *
bpfccstatsnoop Collect output from statsnoop for system analysis (Ubuntu) Trace stat() syscalls. distro *
bpfccsyncsnoop Collect output from syncsnoop for system analysis (Ubuntu) Trace sync() syscall. distro *
bpfcctcpaccept Collect output from tcpaccept for system analysis (Ubuntu) Trace TCP passive connections (accept()). distro *
bpfcctcpconnect Collect output from tcpconnect for system analysis (Ubuntu) Trace TCP active connections (connect()). distro *
bpfcctcpconnlat Collect output from tcpconnlat for system analysis (Ubuntu) Trace TCP active connection latency (connect()). distro *
bpfcctcplife Collect output from tcplife for system analysis (Ubuntu) Trace TCP sessions and summarize lifespan. distro *
bpfcctcpretrans Collect output from tcpretrans for system analysis (Ubuntu) Trace TCP retransmits and TLPs. distro *
bpfcctcptop Collect output from tcptop for system analysis (Ubuntu) Summarize TCP send/recv throughput by host. Top for TCP. distro *
bpfccvfscount Collect output from vfscount for system analysis (Ubuntu) Count VFS calls. distro *
bpfccvfsstat Collect output from vfsstat for system analysis (Ubuntu) Count some VFS calls, with column output. distro *
bpfccxfsdist Collect output from xfsdist for system analysis (Ubuntu) Summarize XFS operation latency distribution as a histogram. distro *
bpfccxfsslower Collect output from xfsslower for system analysis (Ubuntu) Trace slow ZFS operations. distro *
cgroups Collect /proc/cgroups info
clocksource Collect details on current clocksource
cloudinitlog Gather /etc/cloud-init* log files
collectl Collect output from collectl for system analysis times
collectlhistory Gather /var/log/collectl history files
conntrackfull Attempts to detect ip_conntrack full
consoleoverload Attempts to detect console overload
cpuinfo Collect /proc/cpuinfo output
cron Gather /etc/cron* files *
date Collect output from date to get system time and timezone
dhclientleases Gather a copy of the /var/lib/dhclient/*.lease files
dig Collect output from dig for dns troubleshooting
dmesg Collect output from dmesg
dmesgfiles Gather /var/log/dmesg* files
dpkgpackages Collect list of installed packages using dpkg -l
duplicatefslabels Search for duplicate filesystem labels *
duplicatefsuuid Find duplicate filesystem UUIDs *
duplicatepartuuid Find duplicate partition UUIDs *
ebtablesrules Collect output from ebtables-save for system analysis *
enadiag Checks the ethool -S output for ENA specific statistics to diagnose i
entropy Collect output from entropy_avail for system analysis period
environment Gather /etc/environment file
ethtool Collect output from ethtool for system analysis
ethtoolg Collect output from ethtool -g for system analysis
ethtooli Collect output from ethtool -i for system analysis
ethtoolk Collect output from ethtool -k for system analysis
ethtools Collect output from ethtool -S for system analysis
fstab Gather /etc/fstab file
fstabfailures Disables fsck and sets nofail in /etc/fstab for all volumes *
gcore Collect output from gcore for application analysis. * *
hosts Gather /etc/hosts file
httpdlogs Gather Apache /var/log/httpd/* or /var/log/apache2/* log files *
hungtasks Detects hung tasks *
ifconfig Collect output from ip addr show for system analysis
inittab Gather /etc/inittab file
interrupts Collect output from /proc/interrupts for system analysis times
iomem Collect /proc/iomem output
iostat Collect output from iostat -x for system analysis times
iproute Collect output from ip route show all for system analysis
ipslink Collect output from ip -s link for system analysis
iptablesrules Collect output from iptables-save for system analysis *
ixgbevfversion Determines if ixgbevf version is below recommended value
journal Collect journalctl output
kerberosconfig Gather the Kerberos configuration file
kernelbug Detects kernel bugs *
kernelcmdline Collect kernel command line (boot) options
kernelconfig Collect /boot/config details
kerneldereference Detects kernel null pointer dereferences *
kernelpanic Detects kernel panics *
kernelversion Collect output from uname -r for system analysis
kpatch Collect kpatch list output
kpti Determine status of Kernel Page Table Isolation *
last Collect last -x output
libtirpcnetconfig Gather libtirpc's /etc/netconfig file
localtime Collect zdump /etc/localtime output
lsblk Collect lsblk output
lsmod Collect lsmod output
lspci Collect lspci output
ltrace Gather output from ltrace -fp for application analysis * *
ltracec Collect output from ltrace -cfp for application analysis * *
lvmarchives Gather /etc/lvm/archives/* files *
lvmconf Gather /etc/lvm/lvm.conf file
mdstat Collect /proc/mdstat output
meminfo Collect /proc/meminfo output
messages Gather /var/log/messages* or /var/log/syslog* files *
mounts Collect /proc/mounts output
mpstati Collect output from mpstat -I for system analysis times
mpstatp Collect output from mpstat -P for system analysis times
mysqldlog Gather /var/log/mysqld.log* files *
ncport Test TCP network connectivity to port/destination port
netstatanp Collect output from ss -anp for system analysis times
netstats Collect output from netstat -s for system analysis times
networkmanagerstat Collect output from systemctl status NetworkManager for system analysis
nginxlogs Gather /var/log/nginx/* log files *
nping Collect output from nping for network troubleshooting port *
npingtraceroute Collect output from nping traceroute for network troubleshooting port *
nsswitch Gather /etc/nsswitch.conf file
nstat Collect output from nstat for system analysis times
ntpconf Gather /etc/ntp.conf
ntpstat Collect ntpstat output
numastat Collect numastat output
oomkiller Detects oom-killer invocations *
openfiles Collect output from lsof | wc -l for system analysis times
openssh Verify OpenSSH configuration for faults that could prevent remote access *
osrelease Collect details on os release for system analysis
pagetypeinfo Collect /proc/pagetypeinfo output
partitions Collect /proc/partitions output
perf Collect CPU profiling statistics * *
perfstat Collect output from perf for system analysis times *
procstat Collect /proc/stat output
profile Gather /etc/profile file
ps Collect output from ps for system analysis times
rebuildinitrd Rebuilds the system initial ramdisk remediate *
resolvconf Gather the /etc/resolv.conf file
retpoline Determine status of kernel retpoline replacements *
rpmpackages Collect list of installed packages using rpm -qa
sarhistory Gather /var/log/sa (sar) history files
scheddebug Collect /proc/sched_debug output
selinuxpermissive Sets selinux to permissive mode remediate *
slabinfo Collect /proc/slabinfo output *
slabtop Collect output from slabtop for system analysis times *
softirqs Collect output from /proc/softirqs for system analysis times
softlockup Detects CPU soft lockups *
sosreport Gather a sosreport *
strace Gather output from strace -fp for application analysis * *
stracec Collect output from strace -cfp for application analysis * *
supportconfig Gather a supportconfig distro *
sysctl Collect sysctl -a output
sysctlconf Collect /etc/sysctl.conf and /etc/sysctl.d files
systemsmanager Gather AWS Systems Manager logs and configuration *
tcpdump Gather packet capture for network troubleshooting * *
tcprecycle Determines if aggressive TCP recycling is enabled
tcptraceroute Collect traceroute output on TCP traffic to a network destination port *
top Collect output from top for system analysis times
udev Gather /etc/udev rules and configuration
udevpersistentnet Comments out lines in /etc/udev/rules.d/70-persistent-net.rules remediate *
vmallocinfo Collect /proc/vmallocinfo output *
vmstat Collect output from vmstat for system analysis times
vmstatdisk Collect output from vmstat -d for system analysis times
vmstatforks Collect output from vmstat -f for system analysis times
vmstatslab Collect output from vmstat -m for system analysis times *
w Collect output from w for system analysis times
workspacelogs Gather AWS Linux Workspace log files distro *
xenfeatures Collect details on xen features for system analysis *
xennetrocket Attempts to detect xennet issue
xennetsgmtu Attempts to detect possibility of xennet scattergather/mtu issue *
yumconfiguration Collect yum related configuration file under /etc/yum* output
yumlog Gather /var/log/yum.log file *
zoneinfo Collect /proc/zoneinfo output
zypperlog Gather /var/log/zypp and zypper.log log files distro *

Running EC2 Rescue for Linux

As a test, let's run the bcctcpconnect module, which traces active TCP connections.
According to the table above, this module requires sudo and takes a period argument. period specifies how many seconds to keep tracing.

sudo ./ec2rl run --only-modules=bcctcpconnect --period=10
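Several modules can be traced in a single run. The following is a minimal sketch, assuming that `--only-modules` accepts a comma-separated list and that every selected module takes the same `period` argument (both bcctcpconnect and bcctcpaccept are listed with `period` in the table above). `build_ec2rl_cmd` is a hypothetical helper name, not part of ec2rl. It only prints the command line so that it can be reviewed before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ec2rl): print an ec2rl command line
# for a trace period and a list of modules. Nothing is executed here.
build_ec2rl_cmd() {
    local period="$1"
    shift
    local IFS=','   # "$*" joins the remaining arguments with commas
    printf 'sudo ./ec2rl run --only-modules=%s --period=%s\n' "$*" "$period"
}

# Example: trace outbound and inbound TCP connections for 10 seconds.
build_ec2rl_cmd 10 bcctcpconnect bcctcpaccept
```

The last line prints `sudo ./ec2rl run --only-modules=bcctcpconnect,bcctcpaccept --period=10`, which can then be pasted into the shell from the ec2rl directory.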

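The results are saved to files. The command output shows where they were written, so look at the log files under that directory. The directory can only be read by root, so either read the files with sudo or run chmod -R 777 on it to make browsing easier.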

(omitted)

-------------[Output  Logs]-------------

The output logs are located in:
/var/tmp/ec2rl/2024-01-09T05_46_01.266443

(omitted)
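Instead of copying the path from the output each time, the newest run directory can be looked up. This is a minimal sketch, assuming that every run creates a timestamp-named directory under /var/tmp/ec2rl like the one shown above, so that sorting the names also sorts them by time. `latest_ec2rl_dir` and `list_ec2rl_logs` are hypothetical helper names, and the `SUDO` variable is an assumption of this sketch: set it to `sudo` when the directory is readable only by root.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ec2rl).

# Print the newest run directory under the ec2rl output root.
# Names look like 2024-01-09T05_46_01.266443, so a plain sort is chronological.
latest_ec2rl_dir() {
    local root="${1:-/var/tmp/ec2rl}"
    local name
    name="$(${SUDO:-} ls -1 "$root" | sort | tail -n 1)"
    [ -n "$name" ] && printf '%s/%s\n' "$root" "$name"
}

# List every file below a run directory without changing its permissions.
list_ec2rl_logs() {
    ${SUDO:-} find "$1" -type f | sort
}

# Usage on the instance (uncomment to run):
#   SUDO=sudo
#   dir="$(latest_ec2rl_dir)"
#   list_ec2rl_logs "$dir"
```

Each listed file can then be opened with `sudo less`, which leaves the directory permissions untouched.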

Which modules to use

The download and execution steps are now clear. So which modules should you use when you actually troubleshoot?
I have no prior experience here, so I'll follow the bcc Tutorial.

Basic Linux commands

Before reaching for EC2 Rescue for Linux or BCC, start the performance analysis with basic Linux commands.

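  • uptime
    • Shows how long the system has been up, the number of logged-in users, and the 1-, 5-, and 15-minute load averages
  • dmesg | tail
    • Shows the most recent kernel messages
    • Look for events that could be causing the problem
  • vmstat 1
    • Virtual memory statistics
    • r: number of runnable processes (running or waiting for a CPU)
    • free: free memory
    • si, so: swap-ins (disk to memory) and swap-outs (memory to disk)
    • us, sy, id, wa, st: CPU time breakdown into user, system, idle, I/O wait, and steal (time taken by the hypervisor)
  • mpstat -P ALL 1
    • Shows CPU utilization per CPU
  • pidstat 1
    • Shows resource usage per process
    • Unlike top, it prints rolling output (lines are appended instead of the screen being cleared)
  • iostat -xz 1
    • Block device I/O statistics
    • r/s, w/s, rkB/s, wkB/s: reads, writes, kilobytes read, and kilobytes written per second
    • await: average time an I/O request takes, including time spent queued
    • avgqu-sz: average length of the I/O queue (shown as aqu-sz in recent sysstat releases)
    • %util: percentage of time the device was not idle
  • free -m
    • Memory usage
    • Check the buffer and page cache sizes, not just free memory
  • sar -n DEV 1
    • Network interface throughput
  • sar -n TCP,ETCP 1
    • Shows key TCP metrics
    • active/s: locally initiated TCP connections per second (outbound, via connect())
    • passive/s: remotely initiated TCP connections per second (inbound, via accept())
    • retrans/s: TCP retransmits per second
  • top
    • Shows resource usage per process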
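The checklist above can be captured in one pass so that the state of the server is preserved for later review. This is a minimal sketch, not part of EC2 Rescue for Linux: `snapshot_cmd` and `collect_basics` are hypothetical names, and the sample count of 5 for the interval commands is an arbitrary choice made here so that each command terminates on its own. Commands that are not installed (mpstat, pidstat, iostat, and sar come from the sysstat package) are skipped.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for saving the basic checks to text files.

# Run one command and save its output to <outdir>/<name>.txt.
# A command that is not installed is skipped instead of aborting the run.
snapshot_cmd() {
    local outdir="$1" name="$2"
    shift 2
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "skip: $1 not found" >&2
        return 0
    fi
    "$@" >"$outdir/$name.txt" 2>&1
}

# Run the whole checklist; interval commands take 5 one-second samples.
collect_basics() {
    local outdir="$1"
    mkdir -p "$outdir"
    snapshot_cmd "$outdir" uptime   uptime
    snapshot_cmd "$outdir" dmesg    sh -c 'dmesg | tail'
    snapshot_cmd "$outdir" vmstat   vmstat 1 5
    snapshot_cmd "$outdir" mpstat   mpstat -P ALL 1 5
    snapshot_cmd "$outdir" pidstat  pidstat 1 5
    snapshot_cmd "$outdir" iostat   iostat -xz 1 5
    snapshot_cmd "$outdir" free     free -m
    snapshot_cmd "$outdir" sar_dev  sar -n DEV 1 5
    snapshot_cmd "$outdir" sar_tcp  sar -n TCP,ETCP 1 5
    snapshot_cmd "$outdir" top      top -b -n 1
}

# Usage (uncomment to run; takes roughly 30 seconds):
#   collect_basics "/tmp/basics-$(date +%Y%m%dT%H%M%S)"
```

Writing one file per command keeps the outputs easy to compare between a healthy run and a run taken during an incident.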

Some of these commands are also included as EC2 Rescue for Linux modules.
The Netflix blog introduces further commands that are useful here.

BCC

Here is a summary of the tools introduced in the bcc Tutorial.
In EC2 Rescue for Linux, the corresponding modules carry the prefix bcc or bpfcc.

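  • execsnoop
    • Traces new processes and commands as they are executed
    • Shows the parent command name, PID, return value, and the full path with arguments
  • opensnoop
    • Traces open() syscalls
    • Shows which files an application opens
    • Repeatedly opening the wrong files can hurt performance
  • ext4slower
    • Traces slow ext4 file system operations and how long they took
    • Helps find I/O latency that goes through the file system
  • biolatency
    • Summarizes block device I/O latency
  • biosnoop
    • Traces block device latency for each I/O
    • Shows the process name and PID, so you can tell which process is behind the latency
  • cachestat
    • File system cache statistics
  • tcpconnect
    • Traces active TCP connections (outbound, via connect())
    • Helps detect inefficient sessions and unexpected intruders
  • tcpaccept
    • Traces passive TCP connections (inbound, via accept())
    • Helps detect inefficient sessions and unexpected intruders
  • tcpretrans
    • Traces TCP retransmits
    • TCP retransmits can cause latency and throughput problems
    • If many retransmits occur in the ESTABLISHED state, review the network design
    • If many retransmits occur in the SYN_SENT state, suspect kernel CPU saturation or packet drops
  • runqlat
    • Measures how long threads wait on the CPU run queue
    • A long wait means the CPUs are not keeping up with the work
  • profile
    • CPU profiler
    • Shows sampled stack traces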
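The tool names above can be mapped to ec2rl module names mechanically. This is a minimal sketch based on the help output earlier in this article, where bpfcc modules are marked "(Ubuntu)" and bcc modules cover the other distributions. `ec2rl_bcc_prefix` and `ec2rl_bcc_modules` are hypothetical helper names. The sketch also assumes that `--only-modules` accepts a comma-separated list. Note that the `profile` module in that help output is described as gathering /etc/profile, so BCC's profile does not appear to have a matching bcc-prefixed module, and tools whose names contain an underscore (such as mysqld_qslower) do not follow this simple rule.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ec2rl).

# Choose the module prefix from the distribution ID.
# With no argument, the ID is read from /etc/os-release.
ec2rl_bcc_prefix() {
    local id="${1:-}"
    if [ -z "$id" ] && [ -r /etc/os-release ]; then
        id="$(. /etc/os-release && printf '%s' "${ID:-}")"
    fi
    case "$id" in
        ubuntu) echo bpfcc ;;   # modules marked "(Ubuntu)" in the help output
        *)      echo bcc ;;
    esac
}

# Join BCC tool names into a comma-separated list of ec2rl module names.
ec2rl_bcc_modules() {
    local prefix="$1"
    shift
    local out="" tool
    for tool in "$@"; do
        out="${out:+$out,}${prefix}${tool}"
    done
    printf '%s\n' "$out"
}

# Usage (uncomment to run from the ec2rl directory):
#   mods="$(ec2rl_bcc_modules "$(ec2rl_bcc_prefix)" tcpconnect tcpaccept tcpretrans)"
#   sudo ./ec2rl run --only-modules="$mods" --period=10
```

The three TCP modules in the usage example are all listed with the `period` argument in the bcc rows of the help output, which is why they can share one `--period` value there.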

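Summary

When a problem occurs, the investigation often drags on while the cause remains unknown. Knowing what information to collect, and how to collect it, is a shortcut to finding the cause.
The problem may lie in the application or, on the infrastructure side, in the CPU, device I/O, the network, or the kernel, so there is a lot of ground to cover. Using tools like these, or simply knowing that they exist, should help you find the cause sooner.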

References

aws-ec2rescue-linux
bcc
I want to troubleshoot performance bottlenecks in my Amazon EC2 Linux instance. Which advanced tools can I use for this with EC2Rescue for Linux?
Use EC2Rescue for Linux
