Machine Learning Model Building Series Step 4: Checking the Data

Published on 2024/03/09

Python

Machine Learning

tech

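This is Step 4 of the machine learning model building series. The previous step covered the data formats used to handle loaded data.
This step explains how to check the data.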

4 Checking data with Pandas

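After loading data, you normally inspect it and use summary statistics such as the mean and median to check for anomalous values. The goals are to understand the data at a high level, remove outliers, and correct erroneous entries.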

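The example data is the EEG data from the Kaggle HMS competition (HMS - Harmful Brain Activity Classification).
The data describes EEG recordings taken when harmful brain activity occurred. It includes id columns (identifiers), expert_consensus (the experts' judgment), and vote columns such as seizure_vote (the number of experts who judged the recording to be that pattern).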

This article loads the data and walks through the pandas methods used to check it.

4.0 Loading the data

Load the CSV data as a DataFrame.

import pandas as pd

train = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')

4.1 .shape

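Check the shape of the data. A two-dimensional array returns a tuple like (x, y), and a three-dimensional array returns one like (a, b, c). For a DataFrame the tuple is (number of rows, number of columns).
・Example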

print(train.shape)

# Output
# (106800, 15)

4.2 .groupby()

Group the data and check the number of unique entries.
・Example

print(train.groupby("spectrogram_id").head(1).reset_index(drop=True).shape)

# Output
# (11138, 15)

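groupby groups together the rows that share the same value in the "spectrogram_id" column.
groupby is used together with an operation on the groups. Here that operation is .head(1), so when a "spectrogram_id" value appears in several rows, only the first of those rows is kept.
reset_index(drop=True) then assigns a new row index (0, 1, 2, ...) to the rows that were kept. The index can be used to reference rows and as a key when joining with other DataFrames.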

・groupby example

import pandas as pd

# Create sample data
data = {'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
        'Values': [100, 150, 200, 250, 300, 350]}
df = pd.DataFrame(data)

print(df)
# Output
#   Category  Values
# 0        A     100
# 1        A     150
# 2        B     200
# 3        B     250
# 4        C     300
# 5        C     350

# Get the first row of each category (the original row index is preserved)
first_in_group = df.groupby('Category').head(1)
print(first_in_group)
# Output
#   Category  Values
# 0        A     100
# 2        B     200
# 4        C     300

# Compute the mean of Values per category. The result is a new Series whose index is the Category column
mean_values = df.groupby('Category')['Values'].mean()
print(mean_values)
# Output
# Category
# A    125.0
# B    225.0
# C    325.0
# Name: Values, dtype: float64

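Unlike head(1), which only filters rows, mean() aggregates each group into a single value.
groupby can do a great deal, so I plan to cover it in a separate article as well.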
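
As a cross-check on the "first row per key" pattern above, here is a small sketch on the same toy Category/Values data (not the competition data). It assumes that only the first row per key is wanted, in which case drop_duplicates gives the same rows, and nunique() is enough when only the count of distinct keys matters.

```python
import pandas as pd

# Toy data (illustrative only), same shape as the Category/Values example
df = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "C", "C"],
    "Values": [100, 150, 200, 250, 300, 350],
})

# First row of each group, with a fresh 0..n-1 row index
first_by_groupby = df.groupby("Category").head(1).reset_index(drop=True)

# Same rows via drop_duplicates, which keeps the first occurrence by default
first_by_dedup = df.drop_duplicates(subset="Category").reset_index(drop=True)

# If only the number of distinct keys is needed, nunique() is enough
n_groups = df["Category"].nunique()

print(first_by_groupby.equals(first_by_dedup))
print(n_groups)
```

Which form to use is mostly a matter of taste; groupby becomes the better fit once you need per-group operations other than "take the first row".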

4.3 .info()

Check the non-null count and the data type of each column.

print(train.info())

# Output
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 106800 entries, 0 to 106799
# Data columns (total 15 columns):
#  #   Column                            Non-Null Count   Dtype  
# ---  ------                            --------------   -----  
#  0   eeg_id                            106800 non-null  int64  
#  1   eeg_sub_id                        106800 non-null  int64  
#  2   eeg_label_offset_seconds          106800 non-null  float64
#  3   spectrogram_id                    106800 non-null  int64  
#  4   spectrogram_sub_id                106800 non-null  int64  
#  5   spectrogram_label_offset_seconds  106800 non-null  float64
#  6   label_id                          106800 non-null  int64  
#  7   patient_id                        106800 non-null  int64  
#  8   expert_consensus                  106800 non-null  object 
#  9   seizure_vote                      106800 non-null  int64  
#  10  lpd_vote                          106800 non-null  int64  
#  11  gpd_vote                          106800 non-null  int64  
#  12  lrda_vote                         106800 non-null  int64  
#  13  grda_vote                         106800 non-null  int64  
#  14  other_vote                        106800 non-null  int64  
# dtypes: float64(2), int64(12), object(1)
# memory usage: 12.2+ MB
# None  # return value of info(), printed by print()

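There are no missing values, and the expert_consensus column is of object type while the other columns are int or float.
The object type is assigned to, for example:
・String (str) data
・Columns containing dates or times that Pandas could not recognize as a specific date/time type
・Python compound data types such as lists and dictionaries
・Arbitrary Python objects
Most machine learning models accept only numeric input, so this column needs some kind of processing, such as encoding, before training.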
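
As one possible way to turn such a column into numbers, here is a minimal sketch using pd.factorize. The label list is a small hand-made sample, not the real column, and this is only an illustration of the idea, not necessarily the encoding used later in this series.

```python
import pandas as pd

# Hand-made sample of labels (illustrative; the real column has 106,800 rows)
labels = pd.Series(
    ["Seizure", "GPD", "LRDA", "Other", "GRDA", "LPD", "GPD"],
    name="expert_consensus",
)

# factorize() assigns integer codes in order of first appearance
codes, uniques = pd.factorize(labels)

print(codes)          # integer code for each row
print(list(uniques))  # the label that each code stands for
```

Integer codes imply an ordering that the labels do not have, so for some models one-hot encoding may be the safer choice.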

4.4 .dtypes

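Check the data types. This shows the same type information as info(), but dtypes is an attribute (written without parentheses) that returns a pd.Series, so the result can be used directly in further processing.
※info() returns None.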

print(train.dtypes)

# Output
# eeg_id                                int64
# eeg_sub_id                            int64
# eeg_label_offset_seconds            float64
# spectrogram_id                        int64
# spectrogram_sub_id                    int64
# spectrogram_label_offset_seconds    float64
# label_id                              int64
# patient_id                            int64
# expert_consensus                     object
# seizure_vote                          int64
# lpd_vote                              int64
# gpd_vote                              int64
# lrda_vote                             int64
# grda_vote                             int64
# other_vote                            int64
# dtype: object
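
To illustrate using the dtypes Series in further processing, here is a small sketch that splits column names into numeric and non-numeric. The three-row DataFrame is a made-up miniature borrowing three column names from train.csv.

```python
import pandas as pd

# Made-up miniature of train.csv with three of its columns (illustrative only)
df = pd.DataFrame({
    "eeg_id": [1, 2, 3],
    "eeg_label_offset_seconds": [0.0, 6.0, 8.0],
    "expert_consensus": ["Seizure", "GPD", "Other"],
})

dtypes = df.dtypes  # attribute access, no parentheses

# Iterate over (column name, dtype) pairs and keep the numeric ones
numeric_cols = [
    col for col, dtype in dtypes.items()
    if pd.api.types.is_numeric_dtype(dtype)
]
other_cols = [col for col in df.columns if col not in numeric_cols]

# select_dtypes offers the same selection in one call
numeric_cols_builtin = list(df.select_dtypes(include="number").columns)

print(numeric_cols)
print(other_cols)
```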

4.5 .describe()

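Check summary statistics. By default this applies only to numeric (int or float) columns.
The return value is a DataFrame.
※Statistics are computed after excluding missing values.
・count: number of non-null values
・mean: the mean, i.e. the sum divided by the number of elements
・std: standard deviation, i.e. how widely the data is spread around the mean
・min: minimum value
・25%: first quartile (Q1), the value 25% of the way up when the data is sorted by size
・50%: median (Q2), the value at the 50% position when the data is sorted by size
・75%: third quartile (Q3), the value 75% of the way up when the data is sorted by size
・max: maximum value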

print(train.describe())

# Output
# 	eeg_id	eeg_sub_id	eeg_label_offset_seconds	spectrogram_id	spectrogram_sub_id	spectrogram_label_offset_seconds	label_id	patient_id	seizure_vote	lpd_vote	gpd_vote	lrda_vote	grda_vote	other_vote
# count	1.068000e+05	106800.000000	106800.000000	1.068000e+05	106800.000000	106800.000000	1.068000e+05	106800.000000	106800.000000	106800.000000	106800.000000	106800.000000	106800.000000	106800.000000
# mean	2.104387e+09	26.286189	118.817228	1.067262e+09	43.733596	520.431404	2.141415e+09	32304.428493	0.878024	1.138783	1.264925	0.948296	1.059185	1.966283
# std	1.233371e+09	69.757658	314.557803	6.291475e+08	104.292116	1449.759868	1.241670e+09	18538.196252	1.538873	2.818845	3.131889	2.136799	2.228492	3.621180
# min	5.686570e+05	0.000000	0.000000	3.537330e+05	0.000000	0.000000	3.380000e+02	56.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
# 25%	1.026896e+09	1.000000	6.000000	5.238626e+08	2.000000	12.000000	1.067419e+09	16707.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
# 50%	2.071326e+09	5.000000	26.000000	1.057904e+09	8.000000	62.000000	2.138332e+09	32068.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
# 75%	3.172787e+09	16.000000	82.000000	1.623195e+09	29.000000	394.000000	3.217816e+09	48036.000000	1.000000	1.000000	0.000000	1.000000	1.000000	2.000000
# max	4.294958e+09	742.000000	3372.000000	2.147388e+09	1021.000000	17632.000000	4.294934e+09	65494.000000	19.000000	18.000000	16.000000	15.000000	15.000000	25.000000
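
The quartiles from describe() can be used to flag outlier candidates. Here is a minimal sketch of the common 1.5 × IQR rule of thumb on made-up vote counts. Both the data and the 1.5 threshold are assumptions for illustration; in the real data a large vote count is not necessarily an error.

```python
import pandas as pd

# Made-up vote counts; 19 is deliberately far from the rest
votes = pd.Series([0, 0, 0, 1, 1, 2, 3, 3, 19], name="seizure_vote")

stats = votes.describe()
q1, q3 = stats["25%"], stats["75%"]
iqr = q3 - q1  # interquartile range

# Rule of thumb: values more than 1.5 * IQR outside the quartiles are candidates
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = votes[(votes < lower) | (votes > upper)]

print(outliers.tolist())
```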

4.5 .describe(include=['O'])

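Check summary statistics for object-type columns only.
※Statistics are computed after excluding missing values.
※include='all' applies it to both numeric and object columns.

・count: number of non-null values
・unique: number of distinct values in the column
・top: the most frequent value in the column
・freq: number of times the top value appears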

print(train.describe(include=['O'])) # uppercase letter O, not zero

# Output
#	expert_consensus
# count	106800
# unique	6
# top	Seizure
# freq	20933
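
For reference, here is a small sketch of include='all' on a made-up two-column DataFrame. Statistics that do not apply to a column's type (for example mean for a string column, or top for a numeric one) come back as NaN.

```python
import pandas as pd

# Made-up data with one string column and one numeric column (illustrative only)
df = pd.DataFrame({
    "expert_consensus": ["Seizure", "GPD", "Seizure", "Other"],
    "seizure_vote": [3, 0, 5, 0],
})

# One table containing both the categorical and the numeric statistics
summary = df.describe(include="all")
print(summary)
```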

4.6 ['column_name'].unique()

Check the distinct values in a column.
Returns a NumPy ndarray.

print(train['expert_consensus'].unique())

# Output (as rendered by display(); print() shows the same values without the array(...) wrapper)
# array(['Seizure', 'GPD', 'LRDA', 'Other', 'GRDA', 'LPD'], dtype=object)

4.7 ['column_name'].nunique()

Check the number of distinct values in a column. When the column has no missing values, this equals the length of unique().

print(train['expert_consensus'].nunique())

# Output
# 6
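
One thing to watch for: when a column does contain missing values, unique() and nunique() can disagree, because nunique() skips NaN by default. A small sketch on a made-up Series (the competition data has no missing values, so this does not affect the result above):

```python
import numpy as np
import pandas as pd

# Made-up Series with one missing value (illustrative only)
s = pd.Series(["Seizure", "GPD", np.nan, "GPD"])

n_from_unique = len(s.unique())            # NaN counts as a distinct value here
n_default = s.nunique()                    # NaN is excluded (dropna=True by default)
n_with_nan = s.nunique(dropna=False)       # NaN is included

print(n_from_unique, n_default, n_with_nan)
```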

4.8 .isnull().sum()

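Check for missing values.
isnull(): creates a DataFrame of the same shape that is True where a value is missing
sum(): counts the True values in each column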

print(train.isnull().sum())

# Output
# eeg_id                              0
# eeg_sub_id                          0
# eeg_label_offset_seconds            0
# spectrogram_id                      0
# spectrogram_sub_id                  0
# spectrogram_label_offset_seconds    0
# label_id                            0
# patient_id                          0
# expert_consensus                    0
# seizure_vote                        0
# lpd_vote                            0
# gpd_vote                            0
# lrda_vote                           0
# grda_vote                           0
# other_vote                          0
# dtype: int64

This shows that no column contains missing values.
Because sum() reduces the DataFrame column by column, the result is a Series.

4.9 .isnull().sum().sum()

Check for missing values.
This is the sum of .isnull().sum(), so it is the total number of missing values in the entire dataset.

print(train.isnull().sum().sum())

# Output
# 0
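
Since train.csv has no missing values, the counts above are all zero. To see what these calls return when values are missing, here is a sketch on a made-up DataFrame; the column names a, b, c are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up data with some missing values (illustrative only)
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
    "c": [1, 2, 3],
})

per_column = df.isnull().sum()   # missing count per column (Series)
total = df.isnull().sum().sum()  # missing count over the whole DataFrame
ratio = df.isnull().mean()       # fraction of missing values per column

print(per_column)
print(total)
print(ratio)
```

The mean() variant is handy on large data, where a ratio is often easier to read than a raw count.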

4.10 .value_counts()

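Check which values a column contains and how many times each appears.
The count of each distinct value is returned as a Series, sorted by count in descending order.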

print(train['expert_consensus'].value_counts())

# Output
# expert_consensus
# Seizure    20933
# GRDA       18861
# Other      18808
# GPD        16702
# LRDA       16640
# LPD        14856
# Name: count, dtype: int64

print(train['seizure_vote'].value_counts())

# Output
# seizure_vote
# 0     73906
# 3     19520
# 1      6475
# 2      2329
# 5      1825
# 4      1745
# 6       336
# 7       313
# 8        91
# 9        57
# 10       54
# 15       36
# 13       30
# 11       29
# 14       25
# 12       22
# 19        4
# 16        3
# Name: count, dtype: int64
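
When class balance matters more than raw counts, value_counts(normalize=True) returns proportions instead. A small sketch on made-up labels:

```python
import pandas as pd

# Made-up labels (illustrative only)
df = pd.DataFrame({
    "expert_consensus": ["Seizure", "GPD", "Seizure", "Other", "Seizure", "GPD"],
})

counts = df["expert_consensus"].value_counts()                # raw counts
shares = df["expert_consensus"].value_counts(normalize=True)  # fractions summing to 1

print(counts)
print(shares)
```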

Full code

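# Running this in a Kaggle notebook environment is recommended
import pandas as pd
from IPython.display import display  # available by default in notebooks; imported explicitly for clarity
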
class preprocessing():
    def __init__(self, train):
        self.train = train
    def grouping(self):
        # keep only the first row for each unique spectrogram_id
        train = self.train.groupby("spectrogram_id").head(1).reset_index(drop=True)
        print(train.shape)
    def data_check(self):
        print("\n" + "/"*10 + "shape" + "/"*20)
        print(self.train.shape)
        print("\n" + "/"*10 + "shape in unique spectrogram_id" + "/"*20)
        self.grouping()
        print("\n" + "/"*10 + "info()" + "/"*20)
        display(self.train.info())
        print("\n" + "/"*10 + "dtypes" + "/"*20)
        display(self.train.dtypes)
        print("\n" + "/"*10 + "describe()" + "/"*20)
        display(self.train.describe())
        # describe categorical data
        print("\n" + "/"*10 + "describe(include=['O'])" + "/"*20)
        display(self.train.describe(include=['O']))
        print("\n" + "/"*10 + "unique() expert_consensus" + "/"*20)
        display(self.train['expert_consensus'].unique())
        print("\n" + "/"*10 + "nunique() expert_consensus" + "/"*20)
        display(self.train['expert_consensus'].nunique())
        print("\n" + "/"*10 + "unique() seizure_vote" + "/"*20)
        display(self.train['seizure_vote'].unique())
        print("\n" + "/"*10 + "nunique() seizure_vote" + "/"*20)
        display(self.train['seizure_vote'].nunique())
        # num of missing value in every column
        print("\n" + "/"*10 + "isnull().sum()" + "/"*20)
        display(self.train.isnull().sum())
        # num of missing value whole data
        print("\n" + "/"*10 + "isnull().sum().sum()" + "/"*20)
        display(self.train.isnull().sum().sum())
        # check appearance frequency of specify column
        print("\n" + "/"*10 + "value_counts() expert_consensus" + "/"*20)
        display(self.train['expert_consensus'].value_counts())
        print("\n" + "/"*10 + "value_counts() seizure_vote" + "/"*20)
        display(self.train['seizure_vote'].value_counts())

def load_data():
    train = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')
    prep = preprocessing(train)
    prep.data_check()

load_data()

Output (in this log, the line after each result is the type of the displayed object)

//////////shape////////////////////
(106800, 15)
tuple

//////////shape in unique spectrogram_id////////////////////
(11138, 15)
tuple

//////////info()////////////////////
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106800 entries, 0 to 106799
Data columns (total 15 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   eeg_id                            106800 non-null  int64  
 1   eeg_sub_id                        106800 non-null  int64  
 2   eeg_label_offset_seconds          106800 non-null  float64
 3   spectrogram_id                    106800 non-null  int64  
 4   spectrogram_sub_id                106800 non-null  int64  
 5   spectrogram_label_offset_seconds  106800 non-null  float64
 6   label_id                          106800 non-null  int64  
 7   patient_id                        106800 non-null  int64  
 8   expert_consensus                  106800 non-null  object 
 9   seizure_vote                      106800 non-null  int64  
 10  lpd_vote                          106800 non-null  int64  
 11  gpd_vote                          106800 non-null  int64  
 12  lrda_vote                         106800 non-null  int64  
 13  grda_vote                         106800 non-null  int64  
 14  other_vote                        106800 non-null  int64  
dtypes: float64(2), int64(12), object(1)
memory usage: 12.2+ MB
None
NoneType

//////////dtypes////////////////////
eeg_id                                int64
eeg_sub_id                            int64
eeg_label_offset_seconds            float64
spectrogram_id                        int64
spectrogram_sub_id                    int64
spectrogram_label_offset_seconds    float64
label_id                              int64
patient_id                            int64
expert_consensus                     object
seizure_vote                          int64
lpd_vote                              int64
gpd_vote                              int64
lrda_vote                             int64
grda_vote                             int64
other_vote                            int64
dtype: object
pandas.core.series.Series

//////////describe()////////////////////
eeg_id	eeg_sub_id	eeg_label_offset_seconds	spectrogram_id	spectrogram_sub_id	spectrogram_label_offset_seconds	label_id	patient_id	seizure_vote	lpd_vote	gpd_vote	lrda_vote	grda_vote	other_vote
count	1.068000e+05	106800.000000	106800.000000	1.068000e+05	106800.000000	106800.000000	1.068000e+05	106800.000000	106800.000000	106800.000000	106800.000000	106800.000000	106800.000000	106800.000000
mean	2.104387e+09	26.286189	118.817228	1.067262e+09	43.733596	520.431404	2.141415e+09	32304.428493	0.878024	1.138783	1.264925	0.948296	1.059185	1.966283
std	1.233371e+09	69.757658	314.557803	6.291475e+08	104.292116	1449.759868	1.241670e+09	18538.196252	1.538873	2.818845	3.131889	2.136799	2.228492	3.621180
min	5.686570e+05	0.000000	0.000000	3.537330e+05	0.000000	0.000000	3.380000e+02	56.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	1.026896e+09	1.000000	6.000000	5.238626e+08	2.000000	12.000000	1.067419e+09	16707.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	2.071326e+09	5.000000	26.000000	1.057904e+09	8.000000	62.000000	2.138332e+09	32068.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	3.172787e+09	16.000000	82.000000	1.623195e+09	29.000000	394.000000	3.217816e+09	48036.000000	1.000000	1.000000	0.000000	1.000000	1.000000	2.000000
max	4.294958e+09	742.000000	3372.000000	2.147388e+09	1021.000000	17632.000000	4.294934e+09	65494.000000	19.000000	18.000000	16.000000	15.000000	15.000000	25.000000
pandas.core.frame.DataFrame

//////////describe(include=['O'])////////////////////
expert_consensus
count	106800
unique	6
top	Seizure
freq	20933
pandas.core.frame.DataFrame

//////////unique() expert_consensus////////////////////
array(['Seizure', 'GPD', 'LRDA', 'Other', 'GRDA', 'LPD'], dtype=object)
numpy.ndarray

//////////nunique() expert_consensus////////////////////
6
int

//////////unique() seizure_vote////////////////////
array([ 3,  0,  1,  4,  5,  6, 13,  2, 12, 14,  7, 10, 15,  9,  8, 11, 16,
       19])
numpy.ndarray

//////////nunique() seizure_vote////////////////////
18
int

//////////isnull().sum()////////////////////
eeg_id                              0
eeg_sub_id                          0
eeg_label_offset_seconds            0
spectrogram_id                      0
spectrogram_sub_id                  0
spectrogram_label_offset_seconds    0
label_id                            0
patient_id                          0
expert_consensus                    0
seizure_vote                        0
lpd_vote                            0
gpd_vote                            0
lrda_vote                           0
grda_vote                           0
other_vote                          0
dtype: int64
pandas.core.series.Series

//////////isnull().sum().sum()////////////////////
0
numpy.int64

//////////value_counts() expert_consensus////////////////////
expert_consensus
Seizure    20933
GRDA       18861
Other      18808
GPD        16702
LRDA       16640
LPD        14856
Name: count, dtype: int64
pandas.core.series.Series

//////////value_counts() seizure_vote////////////////////
seizure_vote
0     73906
3     19520
1      6475
2      2329
5      1825
4      1745
6       336
7       313
8        91
9        57
10       54
15       36
13       30
11       29
14       25
12       22
19        4
16        3
Name: count, dtype: int64
pandas.core.series.Series

Step 4 Summary

This step introduced ways to check loaded data.
Step 5 will cover data preprocessing.
Thank you for reading.