Translated by AI
Why Switching to MobileNetV2 Lowered Accuracy in DCASE 2026 Anomaly Detection: The Limits of Autoencoders and 3 Lessons Learned
Get Straight to the Point
Switching from a Dense AutoEncoder (AUC 0.57) to a MobileNetV2-based CNN AE caused the AUC to drop to 0.55.
I attempted to improve the baseline for the DCASE 2026 Challenge Task 2 (Bearing Emu) with industrial predictive maintenance in mind. While my goal was an AUC of 0.65+, I encountered the counterintuitive result that switching to a CNN with strong pre-training actually decreased accuracy.
This article is a record of that failure. I will write from a practical, on-site perspective about "why it dropped" and "why AEs themselves have limitations."
Premise: What is DCASE 2026 Task 2 Bearing Emu?
DCASE (Detection and Classification of Acoustic Scenes and Events) is an international competition for acoustic event detection. Task 2 is an unsupervised anomaly detection task where models are trained only on normal machine sounds and must detect anomalous sounds.
The 2026 Bearing (Emu) dataset consists of:
- 10-second monaural audio (16kHz)
- Training data: 1,000 normal sound clips
- Test data: 100 normal + 100 anomalous clips
- Distribution source: Zenodo
Attempt 1: Dense AutoEncoder (Baseline)
I built a Dense AutoEncoder close to the official DCASE baseline.
# Input: log-mel spectrogram 128mel × 5 consecutive frames = 640-dim vector
# Encoder: 640 → 512 → 256 → 128 → 64 → 32 (latent)
# Decoder: 32 → 64 → 128 → 256 → 512 → 640
I trained it only on normal sounds and used file-level reconstruction error (MSE) as the anomaly score.
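The input construction and file-level scoring described above can be sketched roughly as follows. This is a minimal NumPy sketch, not the code actually used: `reconstruct` is a hypothetical stand-in for the trained Dense AE, and averaging the per-vector MSE over the whole file is an assumption about how the file-level score is aggregated.

```python
import numpy as np

N_MELS = 128    # mel bins (from the article)
N_FRAMES = 5    # consecutive frames stacked into one input vector (from the article)


def stack_frames(log_mel: np.ndarray, n_frames: int = N_FRAMES) -> np.ndarray:
    """Convert a (n_mels, T) log-mel spectrogram into (T - n_frames + 1, n_mels * n_frames) vectors."""
    n_mels, total = log_mel.shape
    if total < n_frames:
        raise ValueError("spectrogram is shorter than the context window")
    # Shape: (n_mels, num_windows, n_frames)
    windows = np.lib.stride_tricks.sliding_window_view(log_mel, n_frames, axis=1)
    # Reorder to (num_windows, n_frames, n_mels), then flatten each window frame by frame.
    return windows.transpose(1, 2, 0).reshape(-1, n_mels * n_frames)


def file_anomaly_score(log_mel: np.ndarray, reconstruct) -> float:
    """File-level anomaly score: mean reconstruction MSE over all stacked vectors.

    `reconstruct` is a hypothetical callable mapping (N, 640) inputs to (N, 640) outputs.
    """
    vectors = stack_frames(log_mel)
    per_vector_mse = np.mean((vectors - reconstruct(vectors)) ** 2, axis=1)
    return float(per_vector_mse.mean())
```

With 128 mel bins and 5 frames, each vector has 640 dimensions, matching the encoder input size listed above.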
Results:
| Metric | Value |
|---|---|
| AUC-ROC | 0.5659 |
| Partial AUC (FPR≤0.1) | 0.5732 |
| Normal score average | 0.4258 |
| Anomalous score average | 0.4513 |
| Separability | 0.0255 |
DCASE Bearing Emu is a difficult dataset; past official baselines typically see AUCs in the 0.55-0.65 range. I would say this is a reasonably valid starting point.
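For reference, the metrics in the table above could be computed along these lines. This is a sketch assuming scikit-learn is available and that labels are 0 for normal and 1 for anomalous; the function name `evaluate_scores` is hypothetical. Note that scikit-learn returns a standardized partial AUC when `max_fpr` is set, and I am assuming that convention here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def evaluate_scores(y_true, scores, max_fpr: float = 0.1) -> dict:
    """Summarize anomaly scores: AUC, partial AUC, class means, and separability."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    normal_mean = float(scores[y_true == 0].mean())
    anomalous_mean = float(scores[y_true == 1].mean())
    return {
        "auc": float(roc_auc_score(y_true, scores)),
        # With max_fpr set, scikit-learn returns the standardized partial AUC.
        "pauc": float(roc_auc_score(y_true, scores, max_fpr=max_fpr)),
        "normal_mean": normal_mean,
        "anomalous_mean": anomalous_mean,
        # Separability as used in this article: anomalous mean minus normal mean.
        "separability": anomalous_mean - normal_mean,
    }
```

As a sanity check on the table, the separability is simply 0.4513 - 0.4258 = 0.0255.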
Attempt 2: MobileNetV2-based CNN AutoEncoder
I switched to a MobileNetV2-based model, hypothesizing that a more powerful model would improve accuracy.
import torch.nn as nn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights


class MobileNetAutoEncoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Encoder: MobileNetV2 pre-trained on ImageNet
        backbone = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
        self.features = backbone.features  # (B, 1280, H/32, W/32)
        self.pool = nn.AdaptiveAvgPool2d((4, 2))
        self.bottleneck = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1280 * 4 * 2, 512),
            nn.BatchNorm1d(512), nn.ReLU(inplace=True),
            nn.Linear(512, latent_dim),
        )
        # Decoder: Step-wise upsampling with ConvTranspose2d
        ...

    def forward(self, x):
        x_3ch = x.repeat(1, 3, 1, 1)  # 1ch → 3ch (for ImageNet)
        # Encode with the modules defined above: features → pool → bottleneck
        z = self.bottleneck(self.pool(self.features(x_3ch)))
        # self.decoder is defined in the elided decoder section
        return self.decoder(z)
The input was created by slicing log-mel spectrograms into 128×64 patches, broadcasting 1ch to 3ch, and feeding them into MobileNetV2.
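The patch preparation could look roughly like the sketch below. It uses NumPy rather than the training code itself, and the non-overlapping hop of 64 frames is an assumption, since the actual overlap between patches is not stated here. The function name `make_patches` is hypothetical.

```python
import numpy as np

PATCH_FRAMES = 64  # patch width in frames (from the article)


def make_patches(log_mel: np.ndarray,
                 patch_frames: int = PATCH_FRAMES,
                 hop: int = PATCH_FRAMES) -> np.ndarray:
    """Slice a (n_mels, T) log-mel spectrogram into (num_patches, 3, n_mels, patch_frames).

    hop == patch_frames gives non-overlapping patches (an assumption);
    trailing frames that do not fill a whole patch are dropped.
    """
    n_mels, total = log_mel.shape
    if total < patch_frames:
        raise ValueError("spectrogram is shorter than one patch")
    starts = range(0, total - patch_frames + 1, hop)
    patches = np.stack([log_mel[:, s:s + patch_frames] for s in starts])
    # Add a channel axis and copy the single channel three times,
    # mirroring x.repeat(1, 3, 1, 1) in the model's forward pass.
    return np.repeat(patches[:, np.newaxis, :, :], 3, axis=1)
```

For a hypothetical clip of 313 frames, this yields 4 patches of shape (3, 128, 64) and discards the trailing 57 frames.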
Training was configured for 80 epochs on a GPU (RTX 2070) and ended at 75 epochs due to Early Stopping. The results were:
Results:
| Metric | Dense AE | MobileNetV2 CNN |
|---|---|---|
| AUC-ROC | 0.5659 | 0.5466 |
| Partial AUC | 0.5732 | 0.5374 |
| Separability | 0.0255 | 0.0206 |
It actually got worse.
Failure Reason 1: The Information Leak Problem
This was the biggest cause of failure.
The premise of AE-based anomaly detection is that "an AE trained only on normal sounds can reconstruct normal sounds cleanly, but reconstruction collapses for anomalous sounds." However, the more powerful the CNN AE is, the better it can "vaguely" reconstruct even unknown anomalous sounds.
This is a known issue in the world of visual anomaly detection called Information Leak or the identity shortcut.
Looking at the separability (anomalous mean - normal mean) across three trials:
| Model | Separability |
|---|---|
| Dense AE (smaller model) | 0.0255 |
| MobileNetV2 CPU | 0.0264 |
| MobileNetV2 GPU 75 epochs | 0.0206 ← The gap is shrinking instead of improving |
The gap between normal and anomalous scores is smaller in the longest-trained run (75 epochs on GPU) than in the shorter CPU run. This is consistent with the AE "over-generalizing."
Lesson: In AE-based anomaly detection, making the model stronger is not necessarily the correct answer.
Failure Reason 2: Domain Mismatch of ImageNet Pre-training
The pre-trained weights of MobileNetV2 were acquired from natural images in ImageNet—edge detectors for dogs, cats, cars, and buildings.
On the other hand, a mel-spectrogram "image" has time on the horizontal axis, frequency on the vertical axis, and log-power as the value. The statistical properties are fundamentally different from natural images:
- Natural images: spatial continuity, edges, textures
- Mel-spectrograms: frequency-dependent vertical structures, rhythmic patterns in the time direction
Transfer learning tends to help most when the source domain (ImageNet) and the target domain (mel-spectrograms) have similar distributions. My hypothesis is that these two were simply too far apart.
In fact, if I had used a pre-trained model for spectrograms (like VGGish or PANNs trained on AudioSet), the situation would likely have been different, but that is a separate topic.
Lesson: Pre-training is not a panacea. If the domains are distant, it can even be harmful.
Failure Reason 3: Environmental Issues (The "torchvision installed as CPU version" incident)
This is more of an operational mistake than a technical failure, but I will expose it anyway.
During the initial MobileNetV2 run, I simply ran pip install torchvision. This installed the CPU build of torchvision and, at the same time, replaced my existing GPU build of torch with a CPU build (2.11.0+cpu).
Result:
- The log showed "Learning device: cpu" (I noticed too late)
- About 80 seconds per epoch, 40 minutes for 30 epochs
- The model never fully converged in the first place
When using a GPU, always install both from the same index:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
Lesson: Always check the "device: cpu/cuda" at the beginning of the logs.
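One way to avoid noticing too late is to fail fast at startup instead of relying on reading the log. The helper below is a hypothetical sketch, not part of the original code; it takes the version string and CUDA availability as arguments so it can be exercised without PyTorch installed. Treating a "+cpu" version suffix as a sign of a CPU-only wheel is an assumption based on the version string shown above.

```python
def check_torch_build(version: str, cuda_available: bool, require_gpu: bool = True) -> str:
    """Return the device name to train on, raising early if a GPU was expected but is unusable."""
    if cuda_available:
        return "cuda"
    if require_gpu:
        # A "+cpu" local version tag suggests a CPU-only wheel was installed (assumption).
        hint = " (CPU-only wheel)" if version.endswith("+cpu") else ""
        raise RuntimeError(
            f"CUDA is not available with torch {version}{hint}; "
            "reinstall torch and torchvision from the same CUDA index"
        )
    return "cpu"
```

In a training script, this might be called as check_torch_build(torch.__version__, torch.cuda.is_available()) before any data loading starts.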
Didn't Reach AUC 0.65, But Learned Something
Looking only at the results, it was a failure. The AUC dropped from 0.57 to 0.55.
However, re-examining past SOTA methods for DCASE Task 2, I realized that the winning teams are not using AEs.
SOTA approaches fall into two main categories:
1. Mahalanobis distance-based (PaDiM-style)
Fit a Gaussian to the intermediate-layer features that a pre-trained CNN produces for normal sounds, then score each test sound by its Mahalanobis distance from that distribution. Because nothing is reconstructed, no Information Leak occurs.
2. Self-supervised classification-based
Apply data augmentation (pitch shift, noise addition) to normal sounds, train a classifier to tell the augmented variants apart as separate classes, and use embedding distances from the trained model as the anomaly score.
In other words, the biggest takeaway this time was that the very idea of "AE reconstruction error" is outdated.
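The Mahalanobis distance-based approach listed first above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a full PaDiM implementation: it fits a single Gaussian to one embedding vector per clip, the embeddings are assumed to come from some pre-trained feature extractor that is not shown, and the regularization value `eps` is an arbitrary choice.

```python
import numpy as np


def fit_gaussian(normal_embeddings: np.ndarray, eps: float = 0.01):
    """Fit mean and inverse covariance to (N, D) embeddings of normal sounds (D >= 2)."""
    mean = normal_embeddings.mean(axis=0)
    cov = np.cov(normal_embeddings, rowvar=False)
    # Add a small multiple of the identity so the covariance stays invertible.
    cov = cov + eps * np.eye(normal_embeddings.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis_scores(embeddings: np.ndarray, mean: np.ndarray, precision: np.ndarray) -> np.ndarray:
    """Anomaly score per row: Mahalanobis distance from the fitted normal distribution."""
    diff = embeddings - mean
    squared = np.einsum("ij,jk,ik->i", diff, precision, diff)
    # Clip tiny negative values caused by floating-point rounding before the square root.
    return np.sqrt(np.maximum(squared, 0.0))
```

Only the mean and covariance of normal data are estimated, so there is no decoder that could learn to reproduce anomalous inputs.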
Implications for Industrial Predictive Maintenance
I have worked in the quality control department of a machinery manufacturer for 10 years. Anomaly detection using vibration and acoustic data is the area I expect to be my main focus for DX consulting once I reach Coast FIRE.
What I keenly felt through this experiment is:
- "Reconstruction error with AE" is for beginners, not for production. If you bring an AE into an on-site PoC thinking "this might work," there is a high probability you will fall short of the customer's expectations.
- Know the failure patterns before building a baseline. Information Leak, domain mismatch, traps in evaluation metrics: you will fail if you take these on without knowing them.
- Do not blindly trust AUC values from public datasets. Even with the same AUC of 0.6, the false positive rate (FPR) is critical for actual factory machines. Evaluating with Partial AUC is appropriate for field operations.
Next time, I will retry with a PaDiM-style Mahalanobis distance base. I will write again once I cross the AUC 0.65+ barrier.
Code Used
The full code is available on GitHub:
References
- DCASE 2026 Challenge Task 2 Development Dataset (Zenodo)
- Zavrtanik et al., "Reconstruction by inpainting for visual anomaly detection" (Discussion on Information Leak)
- Defard et al., "PaDiM: a Patch Distribution Modeling Framework for Anomaly Detection"
Discussion
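I don't think you can judge whether ImageNet pre-training did any harm without comparing, for the same model architecture, training from random initialization against training from the pre-trained weights. In my experience, pre-trained weights tend to give better results even when there is a domain gap from the pre-training data.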
You are right. Since I did not compare against random initialization, I could not conclude that pre-training was the cause. I will add a comparison experiment. Thank you.