Trying Out Mistral AI's OCR Feature
Hi, this is Sakai!
I'm the CEO of EGGHEAD Inc. (株式会社EGGHEAD / エッグヘッド), a company that builds systems leveraging generative AI for the manufacturing industry.
Mistral AI's OCR feature has been making the rounds on X, so I tried it out locally and want to share what I found.
TL;DR
- Mistral AI released its latest OCR API (March 2025)
- Higher accuracy than major competitors such as Google, Azure, OpenAI, and Gemini (overall accuracy: 94.89%)
- Processes up to 2,000 pages per minute
- Pricing: $1 per 1,000 pages (roughly 2,000 pages per $1 with batch processing)
- Structured JSON data can be extracted from the OCR output
- The original document can be reconstructed from the OCR output with high fidelity
Performance
Accuracy comparison (overall score)
First, Mistral OCR tops the benchmark, pulling nearly 5 points ahead of Gemini, ChatGPT, and the other models.
Performance comparison by category
It is not only the overall leader; it also comes out on top in every category.
| Category | Mistral OCR | Runner-up (accuracy) |
| --- | --- | --- |
| Math expressions | 94.29% | 89.11% (Gemini) |
| Multilingual | 89.55% | 87.52% (Azure) |
| Scanned documents | 98.96% | 96.15% (Gemini Pro) |
| Tables | 96.12% | 91.70% (GPT-4o) |
Processing speed
It is also very fast, handling up to 2,000 pages per minute.
- Processes up to 2,000 pages per minute on a single node
- Lighter and faster than other models in its category
Cost efficiency
- $1 per 1,000 pages
- Roughly 2,000 pages per $1 with batch processing (see the rough calculation below)
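To put the pricing in perspective, here is a quick back-of-the-envelope calculation. The 50,000-page archive is a made-up workload, not a figure from Mistral:
```python
# Rough cost estimate from the published rates; 50,000 pages is a hypothetical workload.
pages = 50_000
standard_cost = pages / 1_000 * 1.0  # $1 per 1,000 pages
batch_cost = pages / 2_000 * 1.0     # roughly 2,000 pages per $1 with batch processing
print(f"standard: ${standard_cost:.0f}, batch: ~${batch_cost:.0f}")
# -> standard: $50, batch: ~$25
```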
Self-hosting available
Notably, it can also be self-hosted.
- Cloud API: available right away on La Plateforme
- On-premises: offered to organizations with strict data privacy or security requirements
Structured output
The data read by OCR can be output as structured JSON.
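The snippet that follows refers to `client`, `image_response`, and `base64_data_url`, which come from first running OCR on a single image. A minimal sketch of that preliminary step, assuming a placeholder file name (`receipt.png`) and the import paths as I understand them in the current Python SDK:
```python
import base64
import os

from mistralai import Mistral
from mistralai.models import ImageURLChunk, TextChunk

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Encode a local image as a base64 data URL ("receipt.png" is a placeholder)
image_bytes = open("receipt.png", "rb").read()
base64_data_url = f"data:image/png;base64,{base64.b64encode(image_bytes).decode()}"

# Run OCR on the single image; its markdown feeds the next snippet
image_response = client.ocr.process(
    model="mistral-ocr-latest",
    document=ImageURLChunk(image_url=base64_data_url),
)
```
With those variables in place, the OCR markdown can be turned into JSON: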
```python
import json

# `client`, `image_response`, and `base64_data_url` come from the OCR step above
image_ocr_markdown = image_response.pages[0].markdown

chat_response = client.chat.complete(
    model="pixtral-12b-latest",
    messages=[
        {
            "role": "user",
            "content": [
                ImageURLChunk(image_url=base64_data_url),
                TextChunk(text=f"This is image's OCR in markdown:\n<BEGIN_IMAGE_OCR>\n{image_ocr_markdown}\n<END_IMAGE_OCR>.\nConvert this into a sensible structured json response. The output should be strictly be json with no extra commentary"),
            ],
        },
    ],
    # Ask the model to return a JSON object only
    response_format={"type": "json_object"},
    temperature=0,
)

response_dict = json.loads(chat_response.choices[0].message.content)
json_string = json.dumps(response_dict, indent=4)
print(json_string)
```
Running this produces the output below. Pretty handy.
```json
{
"parking_receipt": {
"header": {
"instructions": "PLACE FACE UP ON DASH",
"city": "CITY OF PALO ALTO",
"validity": "NOT VALID FOR ONSTREET PARKING"
},
"expiration": {
"date_time": "11:59 PM",
"date": "AUG 19, 2024"
},
"purchase": {
"date_time": "01:34pm Aug 19, 2024"
},
"payment": {
"total_due": "$15.00",
"rate": "Daily Parking",
"total_paid": "$15.00",
"payment_type": "CC (Swipe)"
},
"details": {
"ticket_number": "00005883",
"serial_number": "520117260957",
"setting": "Permit Machines",
"machine_name": "Civic Center"
},
"footer": {
"card_number": "#^^^^-1224",
"card_type": "Visa",
"instructions": "DISPLAY FACE UP ON DASH",
"expiration_reminder": "PERMIT EXPIRES AT MIDNIGHT"
}
}
}
```
Trying it out
Now let's actually try it.
SDKs are available for Python and TypeScript, and this is all it takes to run OCR:
```python
import os

from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

# Run OCR on a publicly reachable PDF and include page images as base64
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_image_base64=True
)
```
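The example above points at a publicly accessible PDF URL. To OCR a local PDF, the SDK also supports uploading the file and passing a signed URL. The following is a rough sketch based on my reading of the SDK documentation, with `my_document.pdf` as a placeholder path:
```python
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Upload a local PDF for OCR ("my_document.pdf" is a placeholder path)
uploaded = client.files.upload(
    file={
        "file_name": "my_document.pdf",
        "content": open("my_document.pdf", "rb"),
    },
    purpose="ocr",
)

# Exchange the file id for a signed URL and run OCR against it
signed_url = client.files.get_signed_url(file_id=uploaded.id)
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": signed_url.url},
    include_image_base64=True,
)
```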
It's long, so I've collapsed it below, but you can see the content comes back as structured markdown. Images are Base64-encoded and stored in the `images` property, and their dimensions and position within the document are preserved as well, which makes it possible to reconstruct the original PDF.
However, the information inside the images is lost as text, so in a RAG use case the system cannot properly answer questions about content that appears only in an image. Ideally, the OCR output would also keep a text rendering of each image.
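If you need that, one workaround (not something the OCR API provides) is to caption each extracted image yourself, for example with Pixtral, and keep the descriptions alongside the markdown. A rough sketch that reuses `client`, the chunk classes, and an `ocr_response` from the snippets in this post:
```python
# Ask Pixtral to describe each extracted figure so the text can be indexed for RAG.
# This is a do-it-yourself workaround, not something the OCR API returns by itself.
image_descriptions = {}
for page in ocr_response.pages:
    for img in page.images:
        resp = client.chat.complete(
            model="pixtral-12b-latest",
            messages=[
                {
                    "role": "user",
                    "content": [
                        ImageURLChunk(image_url=img.image_base64),
                        TextChunk(text="Describe this figure in a few sentences for search indexing."),
                    ],
                },
            ],
        )
        image_descriptions[img.id] = resp.choices[0].message.content
```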
OCR result
```json
{
"pages": [
{
"index": 0,
"markdown": "# LEVERAGING UNLABELED DATA TO PREDICT OUT-OF-DISTRIBUTION PERFORMANCE \n\nSaurabh Garg*<br>Carnegie Mellon University<br>sgarg2@andrew.cmu.edu<br>Sivaraman Balakrishnan<br>Carnegie Mellon University<br>sbalakri@andrew.cmu.edu<br>Zachary C. Lipton<br>Carnegie Mellon University<br>zlipton@andrew.cmu.edu\n\n## Behnam Neyshabur\n\nGoogle Research, Blueshift team\nneyshabur@google.com\n\nHanie Sedghi<br>Google Research, Brain team<br>hsedghi@google.com\n\n\n#### Abstract\n\nReal-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (WILDS, ImageNet, BREEDS, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2-4 \\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works ${ }^{1}$.\n\n\n## 1 INTRODUCTION\n\nMachine learning models deployed in the real world typically encounter examples from previously unseen distributions. While the IID assumption enables us to evaluate models using held-out data from the source distribution (from which training data is sampled), this estimate is no longer valid in presence of a distribution shift. Moreover, under such shifts, model accuracy tends to degrade (Szegedy et al., 2014; Recht et al., 2019; Koh et al., 2021). Commonly, the only data available to the practitioner are a labeled training set (source) and unlabeled deployment-time data which makes the problem more difficult. In this setting, detecting shifts in the distribution of covariates is known to be possible (but difficult) in theory (Ramdas et al., 2015), and in practice (Rabanser et al., 2018). However, producing an optimal predictor using only labeled source and unlabeled target data is well-known to be impossible absent further assumptions (Ben-David et al., 2010; Lipton et al., 2018).\n\nTwo vital questions that remain are: (i) the precise conditions under which we can estimate a classifier's target-domain accuracy; and (ii) which methods are most practically useful. To begin, the straightforward way to assess the performance of a model under distribution shift would be to collect labeled (target domain) examples and then to evaluate the model on that data. However, collecting fresh labeled data from the target distribution is prohibitively expensive and time-consuming, especially if the target distribution is non-stationary. Hence, instead of using labeled data, we aim to use unlabeled data from the target distribution, that is comparatively abundant, to predict model performance. 
Note that in this work, our focus is not to improve performance on the target but, rather, to estimate the accuracy on the target for a given classifier.\n\n[^0]\n[^0]: * Work done in part while Saurabh Garg was interning at Google\n ${ }^{1}$ Code is available at https://github.com/saurabhgarg1996/ATC_code.",
"images": [],
"dimensions": {
"dpi": 200,
"height": 2200,
"width": 1700
}
},
{
"index": 1,
"markdown": "\n\nFigure 1: Illustration of our proposed method ATC. Left: using source domain validation data, we identify a threshold on a score (e.g. negative entropy) computed on model confidence such that fraction of examples above the threshold matches the validation set accuracy. ATC estimates accuracy on unlabeled target data as the fraction of examples with the score above the threshold. Interestingly, this threshold yields accurate estimates on a wide set of target distributions resulting from natural and synthetic shifts. Right: Efficacy of ATC over previously proposed approaches on our testbed with a post-hoc calibrated model. To obtain errors on the same scale, we rescale all errors with Average Confidence (AC) error. Lower estimation error is better. See Table 1 for exact numbers and comparison on various types of distribution shift. See Sec. 5 for details on our testbed.\n\nRecently, numerous methods have been proposed for this purpose (Deng \\& Zheng, 2021; Chen et al., 2021b; Jiang et al., 2021; Deng et al., 2021; Guillory et al., 2021). These methods either require calibration on the target domain to yield consistent estimates (Jiang et al., 2021; Guillory et al., 2021) or additional labeled data from several target domains to learn a linear regression function on a distributional distance that then predicts model performance (Deng et al., 2021; Deng \\& Zheng, 2021; Guillory et al., 2021). However, methods that require calibration on the target domain typically yield poor estimates since deep models trained and calibrated on source data are not, in general, calibrated on a (previously unseen) target domain (Ovadia et al., 2019). Besides, methods that leverage labeled data from target domains rely on the fact that unseen target domains exhibit strong linear correlation with seen target domains on the underlying distance measure and, hence, can be rendered ineffective when such target domains with labeled data are unavailable (in Sec. 5.1 we demonstrate such a failure on a real-world distribution shift problem). Therefore, throughout the paper, we assume access to labeled source data and only unlabeled data from target domain(s).\nIn this work, we first show that absent assumptions on the source classifier or the nature of the shift, no method of estimating accuracy will work generally (even in non-contrived settings). To estimate accuracy on target domain perfectly, we highlight that even given perfect knowledge of the labeled source distribution (i.e., $p_{s}(x, y)$ ) and unlabeled target distribution (i.e., $p_{t}(x)$ ), we need restrictions on the nature of the shift such that we can uniquely identify the target conditional $p_{t}(y \\mid x)$. Thus, in general, identifying the accuracy of the classifier is as hard as identifying the optimal predictor.\nSecond, motivated by the superiority of methods that use maximum softmax probability (or logit) of a model for Out-Of-Distribution (OOD) detection (Hendrycks \\& Gimpel, 2016; Hendrycks et al., 2019), we propose a simple method that leverages softmax probability to predict model performance. Our method, Average Thresholded Confidence (ATC), learns a threshold on a score (e.g., maximum confidence or negative entropy) of model confidence on validation source data and predicts target domain accuracy as the fraction of unlabeled target points that receive a score above that threshold. 
ATC selects a threshold on validation source data such that the fraction of source examples that receive the score above the threshold match the accuracy of those examples. Our primary contribution in ATC is the proposal of obtaining the threshold and observing its efficacy on (practical) accuracy estimation. Importantly, our work takes a step forward in positively answering the question raised in Deng \\& Zheng (2021); Deng et al. (2021) about a practical strategy to select a threshold that enables accuracy prediction with thresholded model confidence.",
"images": [
{
"id": "img-0.jpeg",
"top_left_x": 292,
"top_left_y": 217,
"bottom_right_x": 1405,
"bottom_right_y": 649,
// truncated because the encoded data would be too large
"image_base64": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAGwBFkDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/"
}
],
"dimensions": {
"dpi": 200,
"height": 2200,
"width": 1700
}
},
{
"index": 2,
"markdown": "ATC is simple to implement with existing frameworks, compatible with arbitrary model classes, and dominates other contemporary methods. Across several model architectures on a range of benchmark vision and language datasets, we verify that ATC outperforms prior methods by at least $2-4 \\times$ in predicting target accuracy on a variety of distribution shifts. In particular, we consider shifts due to common corruptions (e.g., ImageNet-C), natural distribution shifts due to dataset reproduction (e.g., ImageNet-v2, ImageNet-R), shifts due to novel subpopulations (e.g., BREEDS), and distribution shifts faced in the wild (e.g., WILDS).\n\nAs a starting point for theory development, we investigate ATC on a simple toy model that models distribution shift with varying proportions of the population with spurious features, as in Nagarajan et al. (2020). Finally, we note that although ATC achieves superior performance in our empirical evaluation, like all methods, it must fail (returns inconsistent estimates) on certain types of distribution shifts, per our impossibility result.\n\n# 2 PRIOR WORK \n\nOut-of-distribution detection. The main goal of OOD detection is to identify previously unseen examples, i.e., samples out of the support of training distribution. To accomplish this, modern methods utilize confidence or features learned by a deep network trained on some source data. Hendrycks \\& Gimpel (2016); Geifman \\& El-Yaniv (2017) used the confidence score of an (already) trained deep model to identify OOD points. Lakshminarayanan et al. (2016) use entropy of an ensemble model to evaluate prediction uncertainty on OOD points. To improve OOD detection with model confidence, Liang et al. (2017) propose to use temperature scaling and input perturbations. Jiang et al. (2018) propose to use scores based on the relative distance of the predicted class to the second class. Recently, residual flow-based methods were used to obtain a density model for OOD detection (Zhang et al., 2020). Ji et al. (2021) proposed a method based on subfunction error bounds to compute unreliability per sample. Refer to Ovadia et al. (2019); Ji et al. (2021) for an overview and comparison of methods for prediction uncertainty on OOD data.\n\nPredicting model generalization. Understanding generalization capabilities of overparameterized models on in-distribution data using conventional machine learning tools has been a focus of a long line of work; representative research includes Neyshabur et al. (2015; 2017); Neyshabur (2017); Neyshabur et al. (2018); Dziugaite \\& Roy (2017); Bartlett et al. (2017); Zhou et al. (2018); Long \\& Sedghi (2019); Nagarajan \\& Kolter (2019a). At a high level, this line of research bounds the generalization gap directly with complexity measures calculated on the trained model. However, these bounds typically remain numerically loose relative to the true generalization error (Zhang et al., 2016; Nagarajan \\& Kolter, 2019b). On the other hand, another line of research departs from complexitybased approaches to use unseen unlabeled data to predict in-distribution generalization (Platanios et al., 2016; 2017; Garg et al., 2021; Jiang et al., 2021).\n\nRelevant to our work are methods for predicting the error of a classifier on OOD data based on unlabeled data from the target (OOD) domain. 
These methods can be characterized into two broad categories: (i) Methods which explicitly predict correctness of the model on individual unlabeled points (Deng \\& Zheng, 2021; Jiang et al., 2021; Deng et al., 2021; Chen et al., 2021a); and (ii) Methods which directly obtain an estimate of error with unlabeled OOD data without making a point-wise prediction (Chen et al., 2021b; Guillory et al., 2021; Chuang et al., 2020).\nTo achieve a consistent estimate of the target accuracy, Jiang et al. (2021); Guillory et al. (2021) require calibration on target domain. However, these methods typically yield poor estimates as deep models trained and calibrated on some source data are seldom calibrated on previously unseen domains (Ovadia et al., 2019). Additionally, Deng \\& Zheng (2021); Guillory et al. (2021) derive model-based distribution statistics on unlabeled target set that correlate with the target accuracy and propose to use a subset of labeled target domains to learn a (linear) regression function that predicts model performance. However, there are two drawbacks with this approach: (i) the correlation of these distribution statistics can vary substantially as we consider different nature of shifts (refer to Sec. 5.1, where we empirically demonstrate this failure); (ii) even if there exists a (hypothetical) statistic with strong correlations, obtaining labeled target domains (even simulated ones) with strong correlations would require significant a priori knowledge about the nature of shift that, in general, might not be available before models are deployed in the wild. Nonetheless, in our work, we only assume access to labeled data from the source domain presuming no access to labeled target domains or information about how to simulate them.",
"images": [],
"dimensions": {
"dpi": 200,
"height": 2200,
"width": 1700
}
},
{
"index": 3,
"markdown": "Moreover, unlike the parallel work of Deng et al. (2021), we do not focus on methods that alter the training on source data to aid accuracy prediction on the target data. Chen et al. (2021b) propose an importance re-weighting based approach that leverages (additional) information about the axis along which distribution is shifting in form of \"slicing functions\". In our work, we make comparisons with importance re-weighting baseline from Chen et al. (2021b) as we do not have any additional information about the axis along which the distribution is shifting.\n\n# 3 Problem Setup \n\nNotation. By $\\|\\cdot|$, and $\\langle\\cdot, \\cdot\\rangle$ we denote the Euclidean norm and inner product, respectively. For a vector $v \\in \\mathbb{R}^{d}$, we use $v_{j}$ to denote its $j^{\\text {th }}$ entry, and for an event $E$ we let $\\mathbb{I}[E]$ denote the binary indicator of the event.\nSuppose we have a multi-class classification problem with the input domain $\\mathcal{X} \\subseteq \\mathbb{R}^{d}$ and label space $\\mathcal{Y}=\\{1,2, \\ldots, k\\}$. For binary classification, we use $\\mathcal{Y}=\\{0,1\\}$. By $\\mathcal{D}^{\\mathcal{S}}$ and $\\mathcal{D}^{\\mathrm{T}}$, we denote source and target distribution over $\\mathcal{X} \\times \\mathcal{Y}$. For distributions $\\mathcal{D}^{\\mathcal{S}}$ and $\\mathcal{D}^{\\mathrm{T}}$, we define $p_{\\mathcal{S}}$ or $p_{\\mathrm{T}}$ as the corresponding probability density (or mass) functions. A dataset $S:=\\left\\{\\left(x_{i}, y_{i}\\right)\\right\\}_{i=1}^{n} \\sim\\left(\\mathcal{D}^{\\mathcal{S}}\\right)^{n}$ contains $n$ points sampled i.i.d. from $\\mathcal{D}^{\\mathcal{S}}$. Let $\\mathcal{F}$ be a class of hypotheses mapping $\\mathcal{X}$ to $\\Delta^{k-1}$ where $\\Delta^{k-1}$ is a simplex in $k$ dimensions. Given a classifier $f \\in \\mathcal{F}$ and datum $(x, y)$, we denote the $0-1$ error (i.e., classification error) on that point by $\\mathcal{E}(f(x), y):=\\mathbb{I}\\left[y \\notin \\arg \\max _{j \\in \\mathcal{Y}} f_{j}(x)\\right]$. Given a model $f \\in \\mathcal{F}$, our goal in this work is to understand the performance of $f$ on $\\mathcal{D}^{\\mathrm{T}}$ without access to labeled data from $\\mathcal{D}^{\\mathrm{T}}$. Note that our goal is not to adapt the model to the target data. Concretely, we aim to predict accuracy of $f$ on $\\mathcal{D}^{\\mathrm{T}}$. Throughout this paper, we assume we have access to the following: (i) model $f$; (ii) previously-unseen (validation) data from $\\mathcal{D}^{\\mathcal{S}}$; and (iii) unlabeled data from target distribution $\\mathcal{D}^{\\mathrm{T}}$.\n\n### 3.1 Accuracy Estimation: Possibility and Impossibility Results\n\nFirst, we investigate the question of when it is possible to estimate the target accuracy of an arbitrary classifier, even given knowledge of the full source distribution $p_{s}(x, y)$ and target marginal $p_{t}(x)$. Absent assumptions on the nature of shift, estimating target accuracy is impossible. Even given access to $p_{s}(x, y)$ and $p_{t}(x)$, the problem is fundamentally unidentifiable because $p_{t}(y \\mid x)$ can shift arbitrarily. In the following proposition, we show that absent assumptions on the classifier $f$ (i.e., when $f$ can be any classifier in the space of all classifiers on $\\mathcal{X}$ ), we can estimate accuracy on the target data iff assumptions on the nature of the shift, together with $p_{s}(x, y)$ and $p_{t}(x)$, uniquely identify the (unknown) target conditional $p_{t}(y \\mid x)$. 
We relegate proofs from this section to App. A.\nProposition 1. Absent further assumptions, accuracy on the target is identifiable iff $p_{t}(y \\mid x)$ is uniquely identified given $p_{s}(x, y)$ and $p_{t}(x)$.\n\nProposition 1 states that we need enough constraints on nature of shift such that $p_{s}(x, y)$ and $p_{t}(x)$ identifies unique $p_{t}(y \\mid x)$. It also states that under some assumptions on the nature of the shift, we can hope to estimate the model's accuracy on target data. We will illustrate this on two common assumptions made in domain adaptation literature: (i) covariate shift (Heckman, 1977; Shimodaira, 2000) and (ii) label shift (Saerens et al., 2002; Zhang et al., 2013; Lipton et al., 2018). Under covariate shift assumption, that the target marginal support $\\operatorname{supp}\\left(p_{t}(x)\\right)$ is a subset of the source marginal support $\\operatorname{supp}\\left(p_{s}(x)\\right)$ and that the conditional distribution of labels given inputs does not change within support, i.e., $p_{s}(y \\mid x)=p_{t}(y \\mid x)$, which, trivially, identifies a unique target conditional $p_{t}(y \\mid x)$. Under label shift, the reverse holds, i.e., the class-conditional distribution does not change $\\left(p_{s}(x \\mid y)=p_{t}(x \\mid y)\\right)$ and, again, information about $p_{t}(x)$ uniquely determines the target conditional $p_{t}(y \\mid x)$ (Lipton et al., 2018; Garg et al., 2020). In these settings, one can estimate an arbitrary classifier's accuracy on the target domain either by using importance re-weighting with the ratio $p_{t}(x) / p_{s}(x)$ in case of covariate shift or by using importance re-weighting with the ratio $p_{t}(y) / p_{s}(y)$ in case of label shift. While importance ratios in the former case can be obtained directly when $p_{t}(x)$ and $p_{s}(x)$ are known, the importance ratios in the latter case can be obtained by using techniques from Saerens et al. (2002); Lipton et al. (2018); Azizzadenesheli et al. (2019); Alexandari et al. (2019). In App. B, we explore accuracy estimation in the setting of these shifts and present extensions to generalized notions of label shift (Tachet des Combes et al., 2020) and covariate shift (Rojas-Carulla et al., 2018).\n\nAs a corollary of Proposition 1, we now present a simple impossibility result, demonstrating that no single method can work for all families of distribution shift.",
"images": [],
"dimensions": {
"dpi": 200,
"height": 2200,
"width": 1700
}
},
{
"index": 4,
"markdown": "Corollary 1. Absent assumptions on the classifier $f$, no method of estimating accuracy will work in all scenarios, i.e., for different nature of distribution shifts.\n\nIntuitively, this result states that every method of estimating accuracy on target data is tied up with some assumption on the nature of the shift and might not be useful for estimating accuracy under a different assumption on the nature of the shift. For illustration, consider a setting where we have access to distribution $p_{s}(x, y)$ and $p_{t}(x)$. Additionally, assume that the distribution can shift only due to covariate shift or label shift without any knowledge about which one. Then Corollary 1 says that it is impossible to have a single method that will simultaneously for both label shift and covariate shift as in the following example (we spell out the details in App. A):\n\nExample 1. Assume binary classification with $p_{s}(x)=\\alpha \\cdot \\phi\\left(\\mu_{1}\\right)+(1-\\alpha) \\cdot \\phi\\left(\\mu_{2}\\right)$, $p_{s}(x \\mid y=0)=\\phi\\left(\\mu_{1}\\right), p_{s}(x \\mid y=1)=\\phi\\left(\\mu_{2}\\right)$, and $p_{t}(x)=\\beta \\cdot \\phi\\left(\\mu_{1}\\right)+(1-\\beta) \\cdot \\phi\\left(\\mu_{2}\\right)$ where $\\phi(\\mu)=\\mathcal{N}(\\mu, 1), \\alpha, \\beta \\in(0,1)$, and $\\alpha \\neq \\beta$. Error of a classifier $f$ on target data is given by $\\mathcal{E}_{1}=\\mathbb{E}_{(x, y) \\sim p_{s}(x, y)}\\left[\\frac{p_{t}(x)}{p_{s}(x)} \\mathbb{I}[f(x) \\neq y]\\right]$ under covariate shift and by $\\mathcal{E}_{2}=$ $\\mathbb{E}_{(x, y) \\sim p_{s}(x, y)}\\left[\\left(\\frac{\\beta}{\\alpha} \\mathbb{I}[y=0]+\\frac{1-\\beta}{1-\\alpha} \\mathbb{I}[y=1]\\right) \\mathbb{I}[f(x) \\neq y]\\right]$ under label shift. In App. A, we show that $\\mathcal{E}_{1} \\neq \\mathcal{E}_{2}$ for all $f$. Thus, given access to $p_{s}(x, y)$, and $p_{t}(x)$, any method that consistently estimates error of a classifer under covariate shift will give an incorrect estimate of error under label shift and vice-versa. The reason is that the same $p_{t}(x)$ and $p_{s}(x, y)$ can correspond to error $\\mathcal{E}_{1}$ (under covariate shift) or error $\\mathcal{E}_{2}$ (under label shift) and determining which scenario one faces requires further assumptions on the nature of shift.\n\n# 4 Predicting accuracy with Average Thresholded CONFIDENCE \n\nIn this section, we present our method ATC that leverages a black box classifier $f$ and (labeled) validation source data to predict accuracy on target domain given access to unlabeled target data. Throughout the discussion, we assume that the classifier $f$ is fixed.\nBefore presenting our method, we introduce some terminology. Define a score function $s: \\Delta^{k-1} \\rightarrow$ $\\mathbb{R}$ that takes in the softmax prediction of the function $f$ and outputs a scalar. We want a score function such that if the score function takes a high value at a datum $(x, y)$ then $f$ is likely to be correct. In this work, we explore two such score functions: (i) Maximum confidence, i.e., $s(f(x))=\\max _{j \\in \\mathcal{Y}} f_{j}(x)$; and (ii) Negative Entropy, i.e., $s(f(x))=\\sum_{j} f_{j}(x) \\log \\left(f_{j}(x)\\right)$. 
Our method identifies a threshold $t$ on source data $\\mathcal{D}^{\\mathbb{S}}$ such that the expected number of points that obtain a score less than $t$ match the error of $f$ on $\\mathcal{D}^{\\mathbb{S}}$, i.e.,\n\n$$\n\\mathbb{E}_{x \\sim \\mathcal{D}^{\\mathbb{S}}}[\\mathbb{I}[s(f(x))<t]]=\\mathbb{E}_{(x, y) \\sim \\mathcal{D}^{\\mathbb{S}}}\\left[\\mathbb{I}\\left[\\arg \\max _{j \\in \\mathcal{Y}} f_{j}(x) \\neq y\\right]\\right]\n$$\n\nand then our error estimate $\\mathrm{ATC}_{\\mathcal{D}^{\\mathrm{T}}}(s)$ on the target domain $\\mathcal{D}^{\\mathrm{T}}$ is given by the expected number of target points that obtain a score less than $t$, i.e.,\n\n$$\n\\operatorname{ATC}_{\\mathcal{D}^{\\mathrm{T}}}(s)=\\mathbb{E}_{x \\sim \\mathcal{D}^{\\mathrm{T}}}[\\mathbb{I}[s(f(x))<t]]\n$$\n\nIn short, in (1), ATC selects a threshold on the score function such that the error in the source domain matches the expected number of points that receive a score below $t$ and in (2), ATC predicts error on the target domain as the fraction of unlabeled points that obtain a score below that threshold $t$. Note that, in principle, there exists a different threshold $t^{\\prime}$ on the target distribution $\\mathcal{D}^{\\mathrm{T}}$ such that (1) is satisfied on $\\mathcal{D}^{\\mathrm{T}}$. However, in our experiments, the same threshold performs remarkably well. The main empirical contribution of our work is to show that the threshold obtained with (1) might be used effectively in condunction with modern deep networks in a wide range of settings to estimate error on the target data. In practice, to obtain the threshold with ATC, we minimize the difference between the expression on two sides of (1) using finite samples. In the next section, we show that ATC precisely predicts accuracy on the OOD data on the desired line $y=x$. In App. C, we discuss an alternate interpretation of the method and make connections with OOD detection methods.\n\n## 5 EXPERIMENTS\n\nWe now empirical evaluate ATC and compare it with existing methods. In each of our main experiment, keeping the underlying model fixed, we vary target datasets and make a prediction",
"images": [],
"dimensions": {
"dpi": 200,
"height": 2200,
"width": 1700
}
},
{
"index": 5,
"markdown": "\n\nFigure 2: Scatter plot of predicted accuracy versus (true) OOD accuracy. Each point denotes a different OOD dataset, all evaluated with the same DenseNet121 model. We only plot the best three methods. With ATC (ours), we refer to ATC-NE. We observe that ATC significantly outperforms other methods and with ATC, we recover the desired line $y=x$ with a robust linear fit. Aggregated estimation error in Table 1 and plots for other datasets and architectures in App. H.\nof the target accuracy with various methods given access to only unlabeled data from the target. Unless noted otherwise, all models are trained only on samples from the source distribution with the main exception of pre-training on a different distribution. We use labeled examples from the target distribution to only obtain true error estimates.\n\nDatasets. First, we consider synthetic shifts induced due to different visual corruptions (e.g., shot noise, motion blur etc.) under ImageNet-C (Hendrycks \\& Dietterich, 2019). Next, we consider natural shifts due to differences in the data collection process of ImageNet (Russakovsky et al., 2015), e.g, ImageNetv2 (Recht et al., 2019). We also consider images with artistic renditions of object classes, i.e., ImageNet-R (Hendrycks et al., 2021) and ImageNet-Sketch (Wang et al., 2019). Note that renditions dataset only contains a subset 200 classes from ImageNet. To include renditions dataset in our testbed, we include results on ImageNet restricted to these 200 classes (which we call ImageNet-200) along with full ImageNet.\n\nSecond, we consider BREEDs (Santurkar et al., 2020) to assess robustness to subpopulation shifts, in particular, to understand how accuracy estimation methods behave when novel subpopulations not observed during training are introduced. BREEDS leverages class hierarchy in ImageNet to create 4 datasets Entity-13, Entity-30, Living-17, Non-Living-26. We focus on natural and synthetic shifts as in ImageNet on same and different subpopulations in BREEDs. Third, from Wilds (Koh et al., 2021) benchmark, we consider FMoW-WILDS (Christie et al., 2018), RxRx1-WILDS (Taylor et al., 2019), Amazon-WILDS (Ni et al., 2019), CivilComments-WILDS (Borkan et al., 2019) to consider distribution shifts faced in the wild.\n\nFinally, similar to ImageNet, we consider (i) synthetic shifts (CIFAR-10-C) due to common corruptions; and (ii) natural shift (i.e., CIFARv2 (Recht et al., 2018)) on CIFAR-10 (Krizhevsky \\& Hinton, 2009). On CIFAR-100, we just have synthetic shifts due to common corruptions. For completeness, we also consider natural shifts on MNIST (LeCun et al., 1998) as in the prior work (Deng \\& Zheng, 2021). We use three real shifted datasets, i.e., USPS (Hull, 1994), SVHN (Netzer et al., 2011) and QMNIST (Yadav \\& Bottou, 2019). We give a detailed overview of our setup in App. F.\nArchitectures and Evaluation. For ImageNet, BREEDs, CIFAR, FMoW-WILDS, RxRx1-WILDS datasets, we use DenseNet121 (Huang et al., 2017) and ResNet50 (He et al., 2016) architectures. For Amazon-WILDS and CivilComments-WILDS, we fine-tune a DistilBERT-base-uncased (Sanh et al., 2019) model. For MNIST, we train a fully connected multilayer perceptron. We use standard training with benchmarked hyperparameters. To compare methods, we report average absolute difference between the true accuracy on the target data and the estimated accuracy on the same unlabeled examples. We refer to this metric as Mean Absolute estimation Error (MAE). 
Along with MAE, we also show scatter plots to visualize performance at individual target sets. Refer to App. G for additional details on the setup.\nMethods With ATC-NE, we denote ATC with negative entropy score function and with ATC-MC, we denote ATC with maximum confidence score function. For all methods, we implement post-hoc calibration on validation source data with Temperature Scaling (TS; Guo et al. (2017)). Below we briefly discuss baselines methods compared in our work and relegate details to App. E.",
"images": [
{
"id": "img-1.jpeg",
"top_left_x": 294,
"top_left_y": 176,
"bottom_right_x": 1390,
"bottom_right_y": 563,
// truncated because the encoded data would be too large
"image_base64": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAGDBEgDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD3+iiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAprcU6oLtZHtZk"
}
],
"dimensions": {
"dpi": 200,
"height": 2200,
"width": 1700
}
},
{
"index": 6,
"markdown": "| Dataset | Shift | IM | | AC | | DOC | | GDE | ATC-MC (Ours) | | ATC-NE (Ours) | |\n| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n| | | Pre T | Post T | Pre T | Post T | Pre T | Post T | Post T | Pre T | Post T | Pre T | Post T |\n| CIFAR10 | Natural | 6.60 | 5.74 | 9.88 | 6.89 | 7.25 | 6.07 | 4.77 | 3.21 | 3.02 | 2.99 | 2.85 |\n| | Synthetic | 12.33 | 10.20 | 16.50 | 11.91 | 13.87 | 11.08 | 6.55 | 4.65 | 4.25 | 4.21 | 3.87 |\n| CIFAR100 | Synthetic | 13.69 | 11.51 | 23.61 | 13.10 | 14.60 | 10.14 | 9.85 | 5.50 | 4.75 | 4.72 | 4.94 |\n| ImageNet200 | Natural | 12.37 | 8.19 | 22.07 | 8.61 | 15.17 | 7.81 | 5.13 | 4.37 | 2.04 | 3.79 | 1.45 |\n| | Synthetic | 19.86 | 12.94 | 32.44 | 13.35 | 25.02 | 12.38 | 5.41 | 5.93 | 3.09 | 5.00 | 2.68 |\n| ImageNet | Natural | 7.77 | 6.50 | 18.13 | 6.02 | 8.13 | 5.76 | 6.23 | 3.88 | 2.17 | 2.06 | 0.80 |\n| | Synthetic | 13.39 | 10.12 | 24.62 | 8.51 | 13.55 | 7.90 | 6.32 | 3.34 | 2.53 | 2.61 | 4.89 |\n| FMoW-WILDS | Natural | 5.53 | 4.31 | 33.53 | 12.84 | 5.94 | 4.45 | 5.74 | 3.06 | 2.70 | 3.02 | 2.72 |\n| RxRx1-WILDS | Natural | 5.80 | 5.72 | 7.90 | 4.84 | 5.98 | 5.98 | 6.03 | 4.66 | 4.56 | 4.41 | 4.47 |\n| Amazon-WILDS | Natural | 2.40 | 2.29 | 8.01 | 2.38 | 2.40 | 2.28 | 17.87 | 1.65 | 1.62 | 1.60 | 1.50 |\n| CivilCom.-WILDS | Natural | 12.64 | 10.80 | 16.76 | 11.03 | 13.31 | 10.99 | 16.65 | 7.14 | | | |\n| MNIST | Natural | 18.48 | 15.99 | 21.17 | 14.81 | 20.19 | 14.56 | 24.42 | 5.02 | 2.40 | 3.14 | 3.50 |\n| EntitY-13 | Same | 16.23 | 11.14 | 24.97 | 10.88 | 19.08 | 10.47 | 10.71 | 5.39 | 3.88 | 4.58 | 4.19 |\n| | Novel | 28.53 | 22.02 | 38.33 | 21.64 | 32.43 | 21.22 | 20.61 | 13.58 | 10.28 | 12.25 | 6.63 |\n| EntitY-30 | Same | 18.59 | 14.46 | 28.82 | 14.30 | 21.63 | 13.46 | 12.92 | 9.12 | 7.75 | 8.15 | 7.64 |\n| | Novel | 32.34 | 26.85 | 44.02 | 26.27 | 36.82 | 25.42 | 23.16 | 17.75 | 14.30 | 15.60 | 10.57 |\n| NONLIVING-26 | Same | 18.66 | 17.17 | 26.39 | 16.14 | 19.86 | 15.58 | 16.63 | 10.87 | 10.24 | 10.07 | 10.26 |\n| | Novel | 33.43 | 31.53 | 41.66 | 29.87 | 35.13 | 29.31 | 29.56 | 21.70 | 20.12 | 19.08 | 18.26 |\n| LIVING-17 | Same | 12.63 | 11.05 | 18.32 | 10.46 | 14.43 | 10.14 | 9.87 | 4.57 | 3.95 | 3.81 | 4.21 |\n| | Novel | 29.03 | 26.96 | 35.67 | 26.11 | 31.73 | 25.73 | 23.53 | 16.15 | 14.49 | 12.97 | 11.39 |\n\nTable 1: Mean Absolute estimation Error (MAE) results for different datasets in our setup grouped by the nature of shift. 'Same' refers to same subpopulation shifts and 'Novel' refers novel subpopulation shifts. We include details about the target sets considered in each shift in Table 2. Post T denotes use of TS calibration on source. Across all datasets, we observe that ATC achieves superior performance (lower MAE is better). For language datasets, we use DistilBERT-base-uncased, for vision dataset we report results with DenseNet model with the exception of MNIST where we use FCN. We include results on other architectures in App. H. For GDE post T and pre T estimates match since TS doesn’t alter the argmax prediction. Results reported by aggregating MAE numbers over 4 different seeds. We include results with standard deviation values in Table 3.\n\nAverage Confidence (AC). Error is estimated as the expected value of the maximum softmax confidence on the target data, i.e, $\\mathrm{AC}_{\\mathcal{D}^{\\dagger}}=\\mathbb{E}_{x \\sim \\mathcal{D}^{\\dagger}}\\left[\\max _{j \\in \\mathcal{Y}} f_{j}(x)\\right]$.\nDifference Of Confidence (DOC). 
We estimate error on target by subtracting difference of confidences on source and target (as a surrogate to distributional distance Guillory et al. (2021)) from the error on source distribution, i.e, $\\mathrm{DOC}_{\\mathcal{D}^{\\dagger}}=\\mathbb{E}_{x \\sim \\mathcal{D}^{\\delta}}\\left[\\mathbb{I}\\left[\\arg \\max _{j \\in \\mathcal{Y}} f_{j}(x) \\neq y\\right]\\right]+\\mathbb{E}_{x \\sim \\mathcal{D}^{\\dagger}}\\left[\\max _{j \\in \\mathcal{Y}} f_{j}(x)\\right]-$ $\\mathbb{E}_{x \\sim \\mathcal{D}^{\\delta}}\\left[\\max _{j \\in \\mathcal{Y}} f_{j}(x)\\right]$. This is referred to as DOC-Feat in (Guillory et al., 2021).\nImportance re-weighting (IM). We estimate the error of the classifier with importance re-weighting of $0-1$ error in the pushforward space of the classifier. This corresponds to MANDOLIN using one slice based on the underlying classifier confidence Chen et al. (2021b).\n\nGeneralized Disagreement Equality (GDE). Error is estimated as the expected disagreement of two models (trained on the same training set but with different randomization) on target data (Jiang et al., 2021), i.e., $\\operatorname{GDE}_{\\mathcal{D}^{\\dagger}}=\\mathbb{E}_{x \\sim \\mathcal{D}^{\\dagger}}\\left[\\mathbb{I}\\left[f(x) \\neq f^{\\prime}(x)\\right]\\right]$ where $f$ and $f^{\\prime}$ are the two models. Note that GDE requires two models trained independently, doubling the computational overhead while training.\n\n### 5.1 RESULTS\n\nIn Table 1, we report MAE results aggregated by the nature of the shift in our testbed. In Fig. 2 and Fig. 1(right), we show scatter plots for predicted accuracy versus OOD accuracy on several datasets. We include scatter plots for all datasets and parallel results with other architectures in App. H. In App. H.1, we also perform ablations on CIFAR using a pre-trained model and observe that pre-training doesn't change the efficacy of ATC.",
"images": [],
"dimensions": {
"dpi": 200,
"height": 2200,
"width": 1700
}
},
// remaining pages omitted
],
"model": "mistral-ocr-2503-completion",
"usage_info": {
"pages_processed": 29,
"doc_size_bytes": 3002783
}
}
```
Looking at the images below, before and after reconstruction, you can see the document is reproduced almost identically.
(Before reconstruction)
(After reconstruction)
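For reference, this kind of reconstruction can be approximated in a few lines: swap each image reference in the page markdown for its Base64 data URL and join the pages. A minimal sketch, assuming the `ocr_response` from earlier and that pages reference images as `![img-0.jpeg](img-0.jpeg)`, which is how the API appears to emit them:
```python
def embed_images(markdown: str, images) -> str:
    # Swap each ![img-0.jpeg](img-0.jpeg) style reference for its base64 data URL
    for img in images:
        markdown = markdown.replace(f"![{img.id}]({img.id})", f"![{img.id}]({img.image_base64})")
    return markdown

# Stitch every page back into one markdown document with the images inlined
full_markdown = "\n\n".join(
    embed_images(page.markdown, page.images) for page in ocr_response.pages
)

with open("reconstructed.md", "w", encoding="utf-8") as f:
    f.write(full_markdown)
```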
Impressions
The OCR accuracy is very high. I was especially impressed that the output can be converted into structured JSON and that the original document can be almost fully restored from the OCR data.
The speed (up to 2,000 pages per minute) and the price ($1 per 1,000 pages) are also very appealing.
Security requirements can be cleared by self-hosting, so it looks workable for enterprise use as well.