Exploring Kernel SVM

Purpose

Quantum feature maps and kernels covers "quantum kernels," but since it looks quite daunting at first glance, I want to work through classical kernel SVM first. The article Thinking about Optimization (2) — SVM was written as an even earlier step, covering naive SVM without the kernel method, and I would like to continue from there.

Definition of the Classification Problem and Dataset

The reference Quantum feature maps and kernels has very clear illustrations, so I will borrow its illustrations and its dataset construction directly.

I don't know whether it has a standard name, but there is a dataset, seen in many machine learning textbooks including [PML], for which no separating hyperplane can be found by linear methods, as shown below.

Note that this can be easily created with the following code. It is a dataset scattered on two concentric circles, where the points on the outer circle (data A) receive label 0 and the points on the inner circle (data B) receive label 1. The goal is to perform classification using some form of supervised machine learning.

We define data A and B as follows and prepare the data list X and label list y.

import numpy as np
from sklearn.datasets import make_circles

X, Y = make_circles(n_samples=200, noise=0.05, factor=0.4)

A = X[np.where(Y==0)]
B = X[np.where(Y==1)]

A_label = np.zeros(A.shape[0], dtype=int)
B_label = np.ones(B.shape[0], dtype=int)
X = np.concatenate([A, B])
y = np.concatenate([A_label, B_label])
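As a quick sanity check (my own addition, not in the original article), we can confirm which ring received which label: with `factor=0.4`, `make_circles` puts label 0 on the outer ring and label 1 on the inner one, which a comparison of mean radii makes visible.

```python
import numpy as np
from sklearn.datasets import make_circles

X, Y = make_circles(n_samples=200, noise=0.05, factor=0.4)
A = X[Y == 0]  # label 0: the outer ring (radius ~ 1.0)
B = X[Y == 1]  # label 1: the inner ring (radius ~ 0.4, scaled by factor)

# compare the mean radius of each class
r_A = np.linalg.norm(A, axis=1).mean()
r_B = np.linalg.norm(B, axis=1).mean()
print(r_A, r_B)  # roughly 1.0 and 0.4
```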

Kernel SVM

This is where the kernel method comes in. Let \Omega be the space where the data exists. In this case, \Omega = \mathbb{R}^2. The idea is to map this to a convenient (generally higher-dimensional) vector space \mathcal{H} using a map called a feature map \Phi: \Omega \to \mathcal{H}, and then perform SVM within \mathcal{H}.

Below, for example, if we set \mathcal{H} = \mathbb{R}^3 and take \Phi: \mathbb{R}^2 \to \mathbb{R}^3 as \Phi(x,y) = (x,y,x^2+y^2), the image looks like a pot formed by spinning a parabola around the z-axis on a potter's wheel. The red ring moves toward the bottom of the pot, and the blue ring moves to the middle. In this state, if we consider a separating hyperplane that slices the pot horizontally in \mathbb{R}^3, we can successfully classify the dataset:

Mapping this back to the original \mathbb{R}^2, the separating hyperplane corresponds to the black ring:
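To see the pot picture numerically (a sketch I added, using the same \Phi as the illustration), we can lift the dataset with the feature map and check that the two rings land at clearly separated heights, so some horizontal plane z = const classifies them perfectly.

```python
import numpy as np
from sklearn.datasets import make_circles

def phi(x, y):
    # the feature map from the pot illustration
    return np.array([x, y, x**2 + y**2])

X, Y = make_circles(n_samples=200, noise=0.05, factor=0.4)
Z = np.array([phi(x, y) for x, y in X])

z_inner = Z[Y == 1][:, 2]  # inner ring: z around 0.4**2 = 0.16
z_outer = Z[Y == 0][:, 2]  # outer ring: z around 1.0

# every lifted inner point sits below every lifted outer point
print(z_inner.max(), z_outer.min())
```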

Brief Overview of the Kernel Trick

In Thinking about Optimization (2) — SVM # Dual Problem, we considered the following dual problem for executing SVM:

\begin{align*} \tilde{L}(\lambda) &= \sum_i \lambda_i - \frac{1}{2} \sum_{i, j} \lambda_i \lambda_j t_i t_j \xi_i^T \xi_j \\ &= \sum_i \lambda_i - \frac{1}{2} \sum_{i, j} \lambda_i \lambda_j t_i t_j \braket{\xi_i, \xi_j} \tag{1} \end{align*}
\begin{align*} \max \tilde{L}(\lambda) \quad\text{subject to}\quad \lambda_i \geq 0,\ i = 1,2,\cdots \tag{2} \end{align*}

Here \braket{\xi_i, \xi_j} is the inner product of the data points \xi_i and \xi_j. This is the "vanilla" SVM posed for \xi_i \in \Omega; this time we will instead solve the dual problem in \mathcal{H}, the image of the feature map. In other words:

\begin{align*} \tilde{L}(\lambda) = \sum_i \lambda_i - \frac{1}{2} \sum_{i, j} \lambda_i \lambda_j t_i t_j \braket{\Phi(\xi_i), \Phi(\xi_j)} \tag{1'} \end{align*}

We want to solve this under the constraints. However, in general, calculating the inner product \braket{\Phi(\xi_i), \Phi(\xi_j)} in a higher-dimensional vector space is computationally expensive and unrealistic[1].

Leaving the details to proper textbooks such as [Kernel Method], it is known that for a suitably chosen \mathcal{H} there exists a function k: \Omega \times \Omega \to \mathbb{R}, called a kernel function, such that

\begin{align*} \braket{\Phi(\xi_i), \Phi(\xi_j)} = k(\xi_i, \xi_j) \tag{3} \end{align*}

The point is that if k itself is cheap to evaluate, the high-dimensional inner products can be computed quickly. This is the so-called "kernel trick."
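Here is a minimal numerical illustration of equation (3) (my own example, using the textbook quadratic feature map rather than the pot map): the inner product of explicit features agrees with evaluating the kernel k(u, v) = (u^T v)^2 directly in \Omega.

```python
import numpy as np

def phi(v):
    # explicit feature map whose inner product reproduces (u^T v)^2
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

u = np.array([0.3, -1.2])
v = np.array([0.7, 0.5])

lhs = phi(u) @ phi(v)  # inner product computed in the feature space H
rhs = (u @ v) ** 2     # kernel evaluated directly in Omega
print(lhs, rhs)        # the two values coincide
```

The kernel side needs only a 2-dimensional dot product, while the feature-space side needs the (here 3-dimensional, in general much larger) lifted vectors; that asymmetry is exactly what the trick exploits.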

Trying to Solve It

While the theoretical details are covered in references like [PRML], I would like to try solving the problem using well-known kernel functions. Referencing [PRML] and [PML], I will use the "polynomial kernel" and "RBF kernel." This time, I will use scikit-learn as per [sklearn.svm.SVC] to solve it.

Polynomial Kernel

The form of the kernel function is

\begin{align*} k(\xi_i, \xi_j) = (\xi_i^T \xi_j + c)^d \end{align*}

and so on[2]. Recalling the pot illustration earlier, the problem looked solvable in \mathcal{H} = \mathbb{R}^3. Writing out the inner product in the feature space for the map \Phi(x,y) = (x,y,x^2+y^2) used in the pot example:

\begin{align*} \begin{pmatrix} \xi_i & \eta_i & \xi_i^2 + \eta_i^2 \end{pmatrix} \begin{pmatrix} \xi_j \\ \eta_j \\ \xi_j^2 + \eta_j^2 \end{pmatrix} = \xi_i \xi_j + \eta_i \eta_j + \left( \xi_i^2 + \eta_i^2 \right) \left( \xi_j^2 + \eta_j^2 \right) \end{align*}

With this in mind and looking at equation (3), we can get the intuition that using a polynomial kernel function with d=2 should work[3]. Indeed, as shown below, we can see that the decision boundary is determined quite well.

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions

svm = SVC(kernel='poly', degree=2)
svm.fit(X, y)
svm.fit(X, y)

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
fig = plot_decision_regions(X, y, clf=svm, ax=ax)
plt.legend(loc='upper left')
ax.set_aspect('equal')
plt.grid()
plt.show()

RBF Kernel

Since we understand the dataset well this time, it was easy to predict that a polynomial kernel would solve it. However, it can be solved with other kernels as well, so next, I will use the RBF kernel. The form of the RBF kernel function is

\begin{align*} k(\xi_i, \xi_j) = \exp \left( - \frac{1}{2 \sigma^2} \| \xi_i - \xi_j \|^2 \right) \end{align*}
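One caveat (my addition): scikit-learn parameterizes the RBF kernel with \gamma rather than \sigma, where \gamma = 1/(2\sigma^2). The correspondence can be checked against sklearn.metrics.pairwise.rbf_kernel, which computes \exp(-\gamma \| u - v \|^2).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

sigma = 0.8
gamma = 1.0 / (2 * sigma**2)  # sklearn's gamma expressed in terms of sigma

u = np.array([[0.3, -1.2]])
v = np.array([[0.7, 0.5]])

manual = np.exp(-np.sum((u - v)**2) / (2 * sigma**2))
via_sklearn = rbf_kernel(u, v, gamma=gamma)[0, 0]
print(manual, via_sklearn)  # the two values coincide
```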

Solving with this kernel is again just an API call: change to kernel='rbf', and it works with essentially no tuning.

svm = SVC(kernel='rbf')
svm.fit(X, y)

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
fig = plot_decision_regions(X, y, clf=svm, ax=ax)
plt.legend(loc='upper left')
ax.set_aspect('equal')
plt.grid()
plt.show()

The resulting decision boundary looks a bit more distorted than the one from the polynomial kernel, but that is not essential.

Using a Custom Kernel

So far this is just what common machine learning books describe, so it is not especially interesting. Following reference [QSVM], I would like to try a custom kernel. For a dataset \{ \xi_i \}_{i=1}^n, a custom kernel is supplied as the matrix of its values:

\begin{align*} (k(\xi_i, \xi_j))_{1 \leq i,j \leq n} = \begin{pmatrix} k(\xi_1, \xi_1) & \cdots & k(\xi_1, \xi_n) \\ \vdots & \ddots & \vdots \\ k(\xi_n, \xi_1) & \cdots & k(\xi_n, \xi_n) \end{pmatrix} \tag{4} \end{align*}

k(\xi_i, \xi_j) represents, in some sense, the similarity between \xi_i and \xi_j. When using kernel='precomputed' as described in reference [sklearn.svm.SVC], you construct such a matrix of pairwise similarities between data points and pass it in yourself[4].

Here, I will again create a custom kernel using the kernel trick \braket{\Phi(\xi_i), \Phi(\xi_j)} = k(\xi_i, \xi_j) of equation (3), and this time actually use \Phi(x,y) = (x,y,x^2+y^2). The resulting matrix of inner products is what is known as a Gram matrix. If each vector \Phi(\xi) is normalized to length 1, the (i,j)-component is exactly the cosine similarity between \Phi(\xi_i) and \Phi(\xi_j); even without normalization, the Gram matrix can be regarded as expressing the similarity structure of the dataset.

Now, the definition of the custom kernel is as follows. It simply involves creating a Gram matrix using this specific feature map:

def default_feature_map(x, y):
    return np.array([x, y, x**2 + y**2])

def calculate_kernel(x_data, y_data=None, feature_map=default_feature_map):
    # Returns the Gram matrix of feature-space inner products, with shape
    # (len(y_data), len(x_data)); with y_data=None it is the square
    # train-vs-train kernel.
    if y_data is None:
        y_data = x_data
    x_matrix, y_matrix = [], []
    for x0, x1 in x_data:
        x_matrix.append(feature_map(x0, x1))
    for y0, y1 in y_data:
        y_matrix.append(feature_map(y0, y1))
    x_matrix, y_matrix = np.array(x_matrix), np.array(y_matrix)
    # conjugate() is a no-op for real features, but anticipates the
    # complex-valued feature maps of the planned quantum-kernel sequel
    kernel = y_matrix.conjugate() @ x_matrix.transpose()
    return kernel
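As a sanity check on calculate_kernel (my addition; the definitions are repeated so the snippet runs on its own): the train-vs-train kernel is a symmetric Gram matrix, and the train-vs-test kernel has shape (n_test, n_train), which is what SVC(kernel='precomputed') expects for fit and predict respectively.

```python
import numpy as np

def default_feature_map(x, y):
    return np.array([x, y, x**2 + y**2])

def calculate_kernel(x_data, y_data=None, feature_map=default_feature_map):
    # same logic as in the article, repeated for self-containedness
    if y_data is None:
        y_data = x_data
    x_matrix = np.array([feature_map(x0, x1) for x0, x1 in x_data])
    y_matrix = np.array([feature_map(y0, y1) for y0, y1 in y_data])
    return y_matrix.conjugate() @ x_matrix.transpose()

rng = np.random.default_rng(0)
train = rng.normal(size=(8, 2))
test = rng.normal(size=(3, 2))

K_train = calculate_kernel(train)       # shape (8, 8), symmetric
K_test = calculate_kernel(train, test)  # shape (3, 8) = (n_test, n_train)
print(K_train.shape, K_test.shape)
```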

Splitting the dataset into training and testing sets:

def make_train_test_sets(test_ratio=.3):
    def split(arr, test_ratio):
        sep = int(arr.shape[0]*(1-test_ratio))
        return arr[:sep], arr[sep:]

    A_label = np.zeros(A.shape[0], dtype=int)
    B_label = np.ones(B.shape[0], dtype=int)
    A_train, A_test = split(A, test_ratio)
    B_train, B_test = split(B, test_ratio)
    A_train_label, A_test_label = split(A_label, test_ratio)
    B_train_label, B_test_label = split(B_label, test_ratio)
    X_train = np.concatenate([A_train, B_train])
    y_train = np.concatenate([A_train_label, B_train_label])
    X_test = np.concatenate([A_test, B_test])
    y_test = np.concatenate([A_test_label, B_test_label])
    return X_train, y_train, X_test, y_test

train_data, train_labels, test_data, test_labels = make_train_test_sets()

Then, the training with train_data and train_labels is as follows:

train_kernel = calculate_kernel(train_data)

model = SVC(kernel='precomputed')
model.fit(train_kernel, train_labels)

Verification

For prediction, kernel='precomputed' expects the kernel between the test set and the training set, i.e. a matrix of shape (n_test, n_train), so I pass exactly that:

test_kernel = calculate_kernel(train_data, test_data)

pred = model.predict(test_kernel)

Visualizing these prediction results looks something like this:

import matplotlib.patches
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)
ax.set_aspect('equal')
ax.set_title("Predicted data classification")
ax.set_ylim(-2, 2)
ax.set_xlim(-2, 2)
for (x, y), pred_label in zip(test_data, pred):
    c = 'C0' if pred_label == 0 else 'C3'
    ax.add_patch(matplotlib.patches.Circle((x, y), radius=.01,
                 fill=True, linestyle='solid', linewidth=4.0,
                 color=c))
plt.grid()
plt.show()

It looks correct. To be sure, taking the score via the API yields a good value[5].

model.score(test_kernel, test_labels)

1.0

Summary

I applied Kernel SVM to a commonly seen linearly inseparable dataset and solved the classification problem using a polynomial kernel, RBF kernel, and a custom kernel.

I haven't written it yet, but I plan a follow-up article along the lines of Playing with Quantum Kernel SVM — Qiskit, and this article was written as a prelude to it, following my mental roadmap. In the next article, by greatly simplifying Quantum feature maps and kernels, I hope to reduce its black-box feel and aim for simple content that solves this toy dataset, without the mysterious and unsettling patterns seen in the original source.

Everything needed for the next article has been prepared here: the concepts and tools are all things normally used in classical machine learning, nothing special. From this base, a Quantum Kernel SVM can be implemented simply by bringing in a minimal quantum circuit. In other words, I want to show in the next article that "Quantum Kernel SVM" may sound imposing, but there is nothing to be afraid of.

To be continued...

References

Footnotes
  1. For example, when considering something like a GAN, if we take a space \Omega consisting of 2D noise and use a generator function to turn it into image data in a 28x28 = 784-dimensional space \mathcal{H}, and then try to classify that image data, it would involve inner products in a 784-dimensional space. One can sense that the cost of calculating the inner product would be greater than in 2D. ↩︎

  2. For the actual implementation in scikit-learn, refer to svm.cpp#L342-L345. The coefficients differ slightly, but they are not essential, so I will ignore them. ↩︎

  3. In reality, as explained in [Kernel Method], \mathcal{H} is constructed as a function space where the feature map maps to, and in this case, it would be a (d+1=3)-dimensional function space (formed by polynomials). We would need to map the basis of the function space to the standard basis of Euclidean space to return to \mathbb{R}^3, but this time I'll proceed broadly with a "the end justifies the means" attitude... ↩︎

  4. From what I've searched, there were cases where a matrix was constructed using similarity based on Levenshtein distance. I hesitate to link to it without permission, but you can find it by searching for something like "custom kernel scikit-learn SVM," and I think it's quite interesting. ↩︎

  5. Since I'm creating the dataset without fixing the seed, it might fluctuate slightly, but in my tests, it was at least 0.9 or higher. ↩︎
