👌

【E資格対策】k-meansの実装方法について詳しく見る

2023/01/22に公開

Python

E資格では、numpyだけで実装したk-means法のコードが出題されます。
k-means法の理解のために、それぞれのコードが何を計算しているのか、をまとめてみました。

k-means法の実装（全体）

まず、全体感から眺めていきます。
k-means法は、大きく三つの関数で構成されます。

重心の初期化
重心からの距離の計算
クラスタリング(k-means)の実行

import numpy as np

#1 重心の初期化
def init_centroid(X, n_data, k):
    idx = np.random.permutation(n_data)[:k]
    centroids = X[idx]
    return centroids

#2 重心からの距離の計算
def compute_distance(X, k, n_data, centroids):
    distances = np.zeros((n_data, k))
    for idx_centroid in range(k):
        dist = np.sqrt(np.sum((X-centroids[idx_centroid])**2, axis=1))
        distances[:, idx_centroid] = dist
    return distances

#3 クラスタリングの実行
def k_means(k, X, max_iter=300):
    n_data, n_features = X.shape
    centorids = init_centroid(X, n_data, k)
    new_cluster = np.zeros(n_data)
    cluster = np.zeros(n_data)
    for epoch in range(max_iter):
        distance = compute_distance(X, k, n_data, centroids)
        new_cluster = np.argmin(distance, axis=1)
        for idx_centroids in range(k):
            centorids[idx_centroids] = X[new_cluster == idx_centroids].mean(axis=0)      
        if (new_cluster == cluster).all():
            break        
        cluster = new_cluster
    return cluster

引数の設定

k-meansに与える引数を設定します。
必要な引数は、X（クラスタリングの対象となる行列）、n_data（Xのデータ数）、k(クラスタリング数）となります。

Xについて

ここでは、以下のように9×3の2次元配列を考えます。
0〜8の9つの数字がリストのインデックスに対応します。

X = np.arange(27).reshape(3,-1).T
<output>
array([[ 0,  9, 18],
       [ 1, 10, 19],
       [ 2, 11, 20],
       [ 3, 12, 21],
       [ 4, 13, 22],
       [ 5, 14, 23],
       [ 6, 15, 24],
       [ 7, 16, 25],
       [ 8, 17, 26]])

n_dataについて

計算しなくてもわかりますが、データ数は9になります。

n_data = X.shape[0]
<output>
9

kについて

今回はk =3としたいと思います。
したがって、Xを3つのクラスに分類することになります。

1. 重心の初期化

この関数は、最初の重心となるベクトルを決めるものです。
ベクトルは、Xの中からK（＝3）個をランダムに抽出します。

np.random.permutation(n_data)は、9つの数字をランダムに並び替えます。

np.random.permutation(n_data)

<output>
array([8, 2, 7, 5, 6, 4, 1, 0, 3])

np.random.permutation(n_data)[:k]は、リストのスライスを使って、ランダムに並び替えた数字の前からk個を取り出します。

np.random.permutation(n_data)[:k]

<output>
array([8, 2, 7,])

X[idx]とすることで、Xからidxに対応するインデックスの行を取り出します。
それをcentroidsに格納します。
outputでは、Xからインデックス[8,2,7]に対応するリスト（行）が取り出されました。

centroids = X[idx]
centroids
<output>
array([[ 8, 17, 26],
       [ 2, 11, 20],
       [ 7, 16, 25]])

2. 重心からの距離の計算

3つの重心と、Xの9つの値との距離を計算します。
そのために、9×3の空の行列をまずは作ります。ここの3は、重心の数の3です。
この行列に距離を入れていきます。

distances = np.zeros((n_data, k))
distances

<output>
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

for文で、3つの重心からの距離を計算していきます。
まずは最初の重心0です。
centroids[0]には[8,17,26]が格納されています。
Xの9つのリスト（行）と[8,17,26]の差をとって二乗して、列方向に足し算し平方根をとったものをdistに格納します。

dist = np.sqrt(np.sum((X-centroids[0])**2, axis=1))
dist 
<output>
array([13.85640646, 12.12435565, 10.39230485,  8.66025404,  6.92820323,
        5.19615242,  3.46410162,  1.73205081,  0.        ])

重心0との距離distをゼロ行列だったdistanceの0列目に格納します。
最初のループを終了した時点では、distanceは以下の通りになります。

distance[:, 0] = dist
distance
<output>
array([[13.85640646,  0.        ,  0.        ],
       [12.12435565,  0.        ,  0.        ],
       [10.39230485,  0.        ,  0.        ],
       [ 8.66025404,  0.        ,  0.        ],
       [ 6.92820323,  0.        ,  0.        ],
       [ 5.19615242,  0.        ,  0.        ],
       [ 3.46410162,  0.        ,  0.        ],
       [ 1.73205081,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ]])

compute_distanceを実行すると、次の通りになります。

compute_distance(X, k, n_data, centroids)
<output>
array([[13.85640646,  3.46410162, 12.12435565],
       [12.12435565,  1.73205081, 10.39230485],
       [10.39230485,  0.        ,  8.66025404],
       [ 8.66025404,  1.73205081,  6.92820323],
       [ 6.92820323,  3.46410162,  5.19615242],
       [ 5.19615242,  5.19615242,  3.46410162],
       [ 3.46410162,  6.92820323,  1.73205081],
       [ 1.73205081,  8.66025404,  0.        ],
       [ 0.        , 10.39230485,  1.73205081]])

3. クラスタリングの実行

clusterは、Xのリスト（行）ごとに、どこの重心が近いか、つまりクラスターの番号を収納するベクトルになります。最初はゼロベクトルになります。

cluster = np.zeros(n_data)
cluster
array([0., 0., 0., 0., 0., 0., 0., 0., 0.])

comute_distance()で計算されたdistanceの行列には、Xの各インデックスと各重心との距離が格納されています。Xの各インデックスごとに、最も距離が小さい重心（列）を求めにいきます。
new_clusterには、新しい重心が格納されます。

distance = compute_distance(X, k, n_data, centoroids)
new_cluster = np.argmin(distance, axis=1)
new_cluster
<output>
array([1, 1, 1, 1, 1, 2, 2, 2, 0])

イメージ

Xのリストから、new_clusterと重心の番号が一致するベクトルを取り出します。

X[new_cluster == idx_centroids]
<output>
idx_centroidsが0の時
array([[ 8, 17, 26]])

idx_centroidsが1の時
array([[ 0,  9, 18],
       [ 1, 10, 19],
       [ 2, 11, 20],
       [ 3, 12, 21],
       [ 4, 13, 22]])

idx_centroidsが2の時
array([[ 5, 14, 23],
       [ 6, 15, 24],
       [ 7, 16, 25]])

重心ごとに取り出したベクトルについて、行方向に平均します。

X[new_cluster == idx_centroids].mean(axis=0)
<output>
idx_centroidsが0の時
array([ 8., 17., 26.])
idx_centroidsが1の時
array([ 2., 11., 20.])
idx_centroidsが2の時
array([ 6., 15., 24.])

平均したベクトルが、新しい重心になります。

centorids[idx_centroids] 
<output>
array([[ 8., 17., 26.],
            [ 2., 11., 20.],
           [ 6., 15., 24.]])

これをfor文でイテレータの数だけ繰り返します。