Translated by AI
Exploring Kernel SVM
Purpose
The Qiskit textbook chapter Quantum feature maps and kernels covers "quantum kernels," but since it looks rather daunting at first glance, I want to work through classical kernel SVM first. An even earlier article, Thinking about Optimization (2) — SVM, dealt with a naive SVM without the kernel method; I would like to continue from there.
Definition of the Classification Problem and Dataset
Since the literature Quantum feature maps and kernels has very clear illustrations, I will borrow its illustrations and its method of creating the dataset directly.
While I don't know its specific name, there is a dataset often seen in machine learning textbooks, including [PML], for which a separating hyperplane cannot be determined by linear methods, as shown below.

Note that this can be easily created with the following code. It is a dataset scattered on two concentric circles; with make_circles, data A on the outer circle receives label 0 and data B on the inner circle receives label 1. The goal is to perform classification on it using some form of supervised machine learning.
We define data A and B as follows and prepare the data list X and label list y.
import numpy as np
from sklearn.datasets import make_circles

X, Y = make_circles(n_samples=200, noise=0.05, factor=0.4)
A = X[np.where(Y == 0)]
B = X[np.where(Y == 1)]
A_label = np.zeros(A.shape[0], dtype=int)
B_label = np.ones(B.shape[0], dtype=int)
X = np.concatenate([A, B])
y = np.concatenate([A_label, B_label])
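For reference, the two classes can be plotted with a short snippet like this (my own sketch, assuming matplotlib is installed; the Agg backend and savefig call are my additions so it also runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

X, Y = make_circles(n_samples=200, noise=0.05, factor=0.4)
A = X[Y == 0]  # class 0: the outer circle in make_circles
B = X[Y == 1]  # class 1: the inner circle (scaled by factor=0.4)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(A[:, 0], A[:, 1], label="A (label 0)")
ax.scatter(B[:, 0], B[:, 1], label="B (label 1)")
ax.set_aspect("equal")
ax.legend(loc="upper left")
ax.grid(True)
fig.savefig("circles.png")
```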
Kernel SVM
This is where the kernel method comes in. Let $\phi: \mathbb{R}^2 \to \mathbb{R}^3$ be a feature map that lifts each data point into a higher-dimensional space. Below, for example, if we set

$$\phi(x_1, x_2) = (x_1, x_2, x_1^2 + x_2^2)$$

then the third coordinate is the squared distance from the origin, so in the lifted space the two circles can be separated by a plane $x_3 = c$.

Mapping this plane back to the original 2D space gives the circle $x_1^2 + x_2^2 = c$ as the decision boundary.
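As a quick numerical check of this lifting (my own sketch; the feature map $\phi(x_1, x_2) = (x_1, x_2, x_1^2 + x_2^2)$ is the same one used later in the custom-kernel section, and I fix random_state here only for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_circles

X, Y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

def phi(x):
    # lift R^2 -> R^3 by appending the squared distance from the origin
    return np.array([x[0], x[1], x[0]**2 + x[1]**2])

Z = np.array([phi(x) for x in X])

# in make_circles, label 1 is the inner circle (scaled by factor=0.4)
inner = Z[Y == 1][:, 2]
outer = Z[Y == 0][:, 2]
# the third coordinates occupy disjoint ranges, so a plane x3 = const
# separates the classes in the lifted space
print(inner.max(), outer.min())
```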
Brief Overview of the Kernel Trick
In Thinking about Optimization (2) — SVM # Dual Problem, we considered the following dual problem for executing SVM:

$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle$$

where $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$. We want to solve this under these constraints. However, in general, calculating the inner product $\langle \phi(x_i), \phi(x_j) \rangle$ in a high-dimensional feature space is costly[1].
Leaving the details to proper books like [Kernel Method], it is known that when we choose a truly convenient space $\mathcal{H}$ as the target of the feature map, there is a kernel function $k$ satisfying $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ that can be evaluated directly in the original space[3].
The story is that if the calculation of $k$ is cheap, the high-dimensional inner products never need to be computed explicitly. This is the kernel trick.
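To make the trick concrete, here is a small check (my own example): for the degree-2 polynomial kernel $k(x, x') = (\langle x, x' \rangle + 1)^2$ in 2D, the corresponding feature map can be written out explicitly, and both sides agree numerically, even though the left-hand side only ever touches 2D vectors.

```python
import numpy as np

def poly_kernel(x, y, c=1.0):
    # degree-2 polynomial kernel, computed entirely in the original 2D space
    return (x @ y + c) ** 2

def phi(x):
    # explicit feature map for the same kernel: a 6-dimensional lift
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
lhs = poly_kernel(x, y)   # cheap: one 2D inner product, then square
rhs = phi(x) @ phi(y)     # expensive route: inner product in 6D
print(lhs, rhs)
```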
Trying to Solve It
While the theoretical details are covered in references like [PRML], I would like to try solving the problem using well-known kernel functions. Referencing [PRML] and [PML], I will use the "polynomial kernel" and "RBF kernel." This time, I will use scikit-learn as per [sklearn.svm.SVC] to solve it.
Polynomial Kernel
The form of the kernel function is

$$k(x, x') = (\langle x, x' \rangle + c)^d$$

and so on[2]. Recalling the bowl-shaped illustration earlier, the problem seemed solvable in a space containing the quadratic terms. With this in mind and looking back at the feature map $\phi(x_1, x_2) = (x_1, x_2, x_1^2 + x_2^2)$, we can get the intuition that a polynomial kernel with $d = 2$ should solve this problem.
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions

svm = SVC(kernel='poly', degree=2)
svm.fit(X, y)

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
fig = plot_decision_regions(X, y, clf=svm, ax=ax)
plt.legend(loc='upper left')
ax.set_aspect('equal')
plt.grid()
plt.show()
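As a quick sanity check (my addition, not in the original), the training accuracy can be read off via the usual API; here I regenerate the dataset with a fixed random_state so the number is reproducible:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
svm = SVC(kernel='poly', degree=2)
svm.fit(X, y)
print(svm.score(X, y))  # fraction of training points classified correctly
```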

RBF Kernel
Since we understand the dataset well this time, it was easy to predict that a polynomial kernel would solve it. However, it can be solved with other kernels as well, so next I will use the RBF kernel. The form of the RBF kernel function is

$$k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$$
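A tiny sanity check of this formula (my own snippet, with $\gamma = 1$): the similarity of a point with itself is 1, and it decays as points move apart.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x0 = np.array([0.0, 0.0])
print(rbf_kernel(x0, x0))  # -> 1.0: a point is maximally similar to itself

near = rbf_kernel(x0, np.array([0.5, 0.0]))
far = rbf_kernel(x0, np.array([1.0, 0.0]))
print(near, far)  # similarity shrinks with distance
```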
Solving this is again just a matter of calling the API: simply change to kernel='rbf', and it works almost without any fine-tuning.
svm = SVC(kernel='rbf')
svm.fit(X, y)
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
fig = plot_decision_regions(X, y, clf=svm, ax=ax)
plt.legend(loc='upper left')
ax.set_aspect('equal')
plt.grid()
plt.show()
The resulting decision boundary looks a bit more distorted than the polynomial kernel's, but that is not essential.

Using a Custom Kernel
So far, this is just as described in common machine learning books, so it's not particularly interesting. Following reference [QSVM], I would like to try using a custom kernel. For a dataset $\{x_i\}$, a custom kernel is defined via a Gram matrix whose entries $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$ are the similarities between data points.
With kernel='precomputed' in reference [sklearn.svm.SVC], it seems you should construct such a matrix yourself and pass it in[4].
Here, I will again build the custom kernel from the feature map $\phi(x_1, x_2) = (x_1, x_2, x_1^2 + x_2^2)$ used in the kernel trick discussion.
Now, the definition of the custom kernel is as follows. It simply builds a Gram matrix using this specific feature map:
def default_feature_map(x, y):
    return np.array([x, y, x**2 + y**2])

def calculate_kernel(x_data, y_data=None, feature_map=default_feature_map):
    if y_data is None:
        y_data = x_data
    x_matrix, y_matrix = [], []
    for x0, x1 in x_data:
        x_matrix.append(feature_map(x0, x1))
    for y0, y1 in y_data:
        y_matrix.append(feature_map(y0, y1))
    x_matrix, y_matrix = np.array(x_matrix), np.array(y_matrix)
    # Gram matrix: rows correspond to y_data, columns to x_data
    kernel = y_matrix.conjugate() @ x_matrix.transpose()
    return kernel
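A small self-contained check of the Gram matrix this produces (my own test; the function bodies are restated compactly, and random_state is fixed only for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_circles

def default_feature_map(x, y):
    return np.array([x, y, x**2 + y**2])

def calculate_kernel(x_data, y_data=None, feature_map=default_feature_map):
    if y_data is None:
        y_data = x_data
    x_matrix = np.array([feature_map(x0, x1) for x0, x1 in x_data])
    y_matrix = np.array([feature_map(y0, y1) for y0, y1 in y_data])
    return y_matrix.conjugate() @ x_matrix.transpose()

X, Y = make_circles(n_samples=20, noise=0.05, factor=0.4, random_state=0)
K = calculate_kernel(X)

# square, symmetric, and K[i, j] is exactly <phi(x_i), phi(x_j)>
assert K.shape == (20, 20)
assert np.allclose(K, K.T)
assert np.isclose(K[3, 7],
                  default_feature_map(*X[3]) @ default_feature_map(*X[7]))
```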
Splitting the dataset into training and testing sets:
def make_train_test_sets(test_ratio=.3):
    def split(arr, test_ratio):
        sep = int(arr.shape[0]*(1-test_ratio))
        return arr[:sep], arr[sep:]

    A_label = np.zeros(A.shape[0], dtype=int)
    B_label = np.ones(B.shape[0], dtype=int)
    A_train, A_test = split(A, test_ratio)
    B_train, B_test = split(B, test_ratio)
    A_train_label, A_test_label = split(A_label, test_ratio)
    B_train_label, B_test_label = split(B_label, test_ratio)
    X_train = np.concatenate([A_train, B_train])
    y_train = np.concatenate([A_train_label, B_train_label])
    X_test = np.concatenate([A_test, B_test])
    y_test = np.concatenate([A_test_label, B_test_label])
    return X_train, y_train, X_test, y_test
train_data, train_labels, test_data, test_labels = make_train_test_sets()
Then, the training with train_data and train_labels is as follows:
train_kernel = calculate_kernel(train_data)
model = SVC(kernel='precomputed')
model.fit(train_kernel, train_labels)
Verification
I'm doing this with only a rough understanding that this is how it works: for prediction, I pass the matrix obtained by computing the similarities between the training set and the test set (with kernel='precomputed', predict expects a matrix of shape (n_test_samples, n_train_samples)):
test_kernel = calculate_kernel(train_data, test_data)
pred = model.predict(test_kernel)
Visualizing these prediction results looks something like this:
import matplotlib.patches

fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)
ax.set_aspect('equal')
ax.set_title("Predicted data classification")
ax.set_ylim(-2, 2)
ax.set_xlim(-2, 2)
for (x, y), pred_label in zip(test_data, pred):
    c = 'C0' if pred_label == 0 else 'C3'
    ax.add_patch(matplotlib.patches.Circle((x, y), radius=.01,
                                           fill=True, linestyle='solid',
                                           linewidth=4.0, color=c))
plt.grid()
plt.show()

It looks correct. To be sure, taking the score via the API yields a good value[5].
model.score(test_kernel, test_labels)
1.0
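Putting the custom-kernel pipeline together in one self-contained script (my compact restatement: it uses sklearn's train_test_split instead of the hand-rolled split above, and fixes random_state so the score is reproducible):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def feature_map(x, y):
    return np.array([x, y, x**2 + y**2])

def calculate_kernel(x_data, y_data=None):
    # Gram matrix of <phi(y_i), phi(x_j)>; rows index y_data, columns x_data
    if y_data is None:
        y_data = x_data
    x_m = np.array([feature_map(a, b) for a, b in x_data])
    y_m = np.array([feature_map(a, b) for a, b in y_data])
    return y_m @ x_m.T

X, Y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.3,
                                          random_state=0)

model = SVC(kernel='precomputed')
model.fit(calculate_kernel(X_tr), y_tr)     # shape (n_train, n_train)
test_kernel = calculate_kernel(X_tr, X_te)  # shape (n_test, n_train)
print(model.score(test_kernel, y_te))
```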
Summary
I applied Kernel SVM to a commonly seen linearly inseparable dataset and solved the classification problem using a polynomial kernel, RBF kernel, and a custom kernel.
I haven't written it yet, but I plan a follow-up along the lines of Playing with Quantum Kernel SVM — Qiskit, and this article was written as a stepping stone toward it. In the next article, by significantly simplifying Quantum feature maps and kernels, I plan to reduce the "black box" feel and aim for simple content that solves this toy dataset without the mysterious, unsettling patterns seen in the original source.
Everything needed for the next article has been prepared here. The concepts and tools are all things normally used in classical machine learning, nothing special at all. From here, a quantum kernel SVM can be implemented simply by bringing in a minimal quantum circuit. In short, I want to show in the next article that "Quantum Kernel SVM" may sound like a big deal, but there is nothing to be afraid of.
To be continued...
References
- [PRML] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
- [PML] S. Raschka, Python Machine Learning, Impress, 2020
- [Kernel Method] Kenji Fukumizu, Introduction to Kernel Methods, Asakura Publishing, 2010
- [QSVM] Quantum feature maps and kernels, Qiskit Textbook
- [sklearn.svm.SVC] sklearn.svm.SVC
1. For example, when considering something like a GAN, if we take a space $\Omega$ consisting of 2D noise and use a generator function to turn it into image data in a 28×28 = 784-dimensional space $\mathcal{H}$, then trying to classify that image data would involve inner products in the 784-dimensional space. One can sense that the cost of calculating the inner product would be greater than in 2D. ↩︎
2. For the actual implementation in scikit-learn, refer to svm.cpp#L342-L345. The coefficients differ slightly, but they are not essential, so I will ignore them. ↩︎
3. In reality, as explained in [Kernel Method], $\mathcal{H}$ is constructed as a function space that the feature map maps into; in this case it would be a $(d+1=3)$-dimensional function space (formed by polynomials). We would need to map the basis of that function space to the standard basis of Euclidean space to return to $\mathbb{R}^3$, but this time I'll proceed broadly with a "the end justifies the means" attitude... ↩︎
4. From what I've searched, there were cases where a matrix was constructed using similarity based on Levenshtein distance. I hesitate to link to it without permission, but you can find it by searching for something like "custom kernel scikit-learn SVM," and I think it's quite interesting. ↩︎
5. Since I'm creating the dataset without fixing the seed, the score might fluctuate slightly, but in my tests it was at least 0.9. ↩︎