Translated by AI
Trying out Japanese OCR with Google Cloud Vision API
Recently, I wrote an entry titled "Trying out the Japanese OCR function of the Azure Computer Vision API." At the end of that post, I mentioned that Google Cloud's Vision API also supports Japanese OCR. I had tried that feature once, about two and a half years ago, but the functionality and accuracy may well have evolved since then. So I decided to test the Japanese OCR feature of the Google Cloud Vision API again and measure its accuracy under the same conditions as in the Azure Computer Vision API entry.
Using the Google Cloud Vision API OCR Feature
For how to set up the Google Cloud Vision API, please refer to the quickstart "Set up the Vision API."
After creating a project and ensuring that billing is enabled, enable the Vision API. Create a service account with the role "Project > Owner" and download the service account's key file (in JSON format) to your local machine.
The code is created based on the sample code from @google-cloud/vision.
First, install the following npm package.
$ npm install @google-cloud/vision
Next, point an environment variable at the service account key file downloaded above. Adjust the value to match your file name and storage location. The command below is for macOS and Linux; for Windows, please refer to the command examples in the quickstart mentioned above.
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/yourserviceaccountkey.json"
I created index.js with the following code. Pass the URL of the image file as the argument when calling the detectText function. I tested several image files this time, changing the URL argument for each run.
// Based on the sample code from @google-cloud/vision
const vision = require('@google-cloud/vision');

async function detectText(fileName) {
// Creates a client
const client = new vision.ImageAnnotatorClient();
// Performs text detection; the helper accepts a local path or an image URL
const [result] = await client.textDetection(fileName);
const annotations = result.textAnnotations;
if (annotations && annotations.length > 0) {
console.log(annotations[0].description);
} else {
console.log('No text detected.');
}
}
detectText('https://sample.com/images/01.jpg').catch(console.error);
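As far as I can tell from the library's helper layer, a string argument beginning with http:// or gs:// is converted into a remote-image request, while any other string is read as a local file. The sketch below shows the explicit request object that corresponds to a URL string; buildTextDetectionRequest is my own illustrative name, not part of the library.

```javascript
// Builds the explicit request object for a remote image. Passing this
// object to client.textDetection() should be equivalent to passing the
// URL string directly to the helper.
function buildTextDetectionRequest(imageUri) {
  return {image: {source: {imageUri}}};
}

console.log(JSON.stringify(buildTextDetectionRequest('https://sample.com/images/01.jpg')));
// prints {"image":{"source":{"imageUri":"https://sample.com/images/01.jpg"}}}
```

Spelling the request out this way also lets you add other fields, such as language hints, if you need them.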
When you are ready, run the script with node index.js. First, as a test, let's perform text recognition on this image.
$ node index.js
一行目おはようございます
二行目こんにちは
三行目こんばんは
I confirmed that the text recognition results ("Line one: Good morning," "Line two: Hello," "Line three: Good evening") were returned correctly.
Text Recognition Test and Accuracy Measurement
As in the previous entry, I used the opening sentences of "The Selfish Giant" (original by Oscar Wilde, translated by Hiroshi Yuki). I used the exact same images as in the previous entry, rendered in four fonts: Yu Gothic, Meiryo, MS P Mincho, and Anzu-Moji 2020. The recognition rate was also calculated with the same formula: (number of characters in the correct text − Levenshtein distance between the correct text and the recognition result) / number of characters in the correct text.
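For reference, this formula can be computed with a standard dynamic-programming Levenshtein distance. A minimal sketch (the function names are my own, not from the entry):

```javascript
// Character-level Levenshtein distance via dynamic programming.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  // dp[i][j] = edit distance between a[0..i) and b[0..j)
  const dp = Array.from({length: m + 1}, (_, i) =>
    Array.from({length: n + 1}, (_, j) => (i === 0 ? j : j === 0 ? i : 0)));
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,      // deletion
        dp[i][j - 1] + 1,      // insertion
        dp[i - 1][j - 1] + cost // substitution
      );
    }
  }
  return dp[m][n];
}

// Recognition rate = (correct length - distance) / correct length.
function recognitionRate(correct, recognized) {
  return (correct.length - levenshtein(correct, recognized)) / correct.length;
}

console.log(levenshtein('こんにちは', 'こんばんは')); // 2
```

In this example, two of five characters differ, so the recognition rate would be (5 − 2) / 5 = 60%.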
The results are as shown in the table below.
| Font | Image | Recognition Rate |
|---|---|---|
| Yu Gothic | Image | 98.9% (378/382) |
| Meiryo | Image | 99.2% (379/382) |
| MS P Mincho | Image | 99.2% (379/382) |
| Anzu-Moji 2020 | Image | 98.6% (377/382) |
The reason the recognition rates fall slightly below 100% in all cases is that three characters in the original text (numbers and an exclamation mark) are full-width, while the recognition results return them as half-width. Since these differences are negligible in terms of meaning, the recognition rates for Meiryo and MS P Mincho could essentially be considered 100%.
In the Yu Gothic image, an extra character was inserted in the phrase "鳥たちは" (Tori-tachi wa), making it "鳥島たちは" (Torishima-tachi wa). In the Anzu-Moji 2020 image, the kanji for the number ten "十" in the phrase "十二本" (Juni-hon) was recognized as a half-width plus sign "+". While a human could judge that the probability of a plus sign appearing there is low based on the context, it is understandable that it was recognized as "+" based solely on the appearance of the characters in the image file.
These are the results of using the Japanese OCR feature of the Google Cloud Vision API.
When I first tried the Japanese OCR feature of the Google Cloud Vision API two and a half years ago, there were many failures in text recognition for images containing special fonts or handwritten Japanese, giving me the impression that its use cases were quite limited. However, the recognition rate for the handwriting-style font this time truly showed the evolution of the API. I was so surprised by the initial results that I thought I might have used the wrong image file for recognition.
In AI, continuous learning is crucial. One strength of cloud-based AI APIs is that the cloud vendor continuously updates the models, so users always benefit from the latest versions. I believe the results of this verification offer a glimpse of that growth potential.
Note that Anzu-Moji 2020 is a handwriting-style font, and actual handwritten characters have different characteristics, such as the writer's habits or not being perfectly aligned. Therefore, these results do not directly lead to the conclusion that the Google Cloud Vision API is suitable for actual handwriting. Furthermore, since AI APIs from various companies are being improved continuously, a comparative evaluation at a single point in time does not definitively determine the superiority of one AI API over another.
Discussion