Why Switching to MobileNetV2 Lowered Accuracy in DCASE 2026 Anomaly Detection: The Limits of Autoencoders and 3 Lessons Learned

Get Straight to the Point

Switching from a Dense AutoEncoder (AUC 0.57) to a MobileNetV2-based CNN AE caused the AUC to drop to 0.55.

I set out to improve on the baseline for DCASE 2026 Challenge Task 2 (Bearing Emu) with industrial predictive maintenance in mind. My target was an AUC of 0.65 or higher, but I ran into a counterintuitive result: switching to a CNN with ImageNet pre-training actually lowered accuracy.

This article is a record of that failure. I will write from a practical, on-site perspective about "why it dropped" and "why AEs themselves have limitations."

Premise: What is DCASE 2026 Task 2 Bearing Emu?

DCASE (Detection and Classification of Acoustic Scenes and Events) is an international competition for acoustic event detection. Task 2 is an unsupervised anomaly detection task where models are trained only on normal machine sounds and must detect anomalous sounds.

The 2026 Bearing (Emu) dataset consists of:

  • 10-second monaural audio (16kHz)
  • Training data: 1,000 normal sound clips
  • Test data: 100 normal + 100 anomalous clips
  • Distribution source: Zenodo

Attempt 1: Dense AutoEncoder (Baseline)

I built a Dense AutoEncoder close to the official DCASE baseline.

# Input: log-mel spectrogram 128mel × 5 consecutive frames = 640-dim vector
# Encoder: 640 → 512 → 256 → 128 → 64 → 32 (latent)
# Decoder: 32 → 64 → 128 → 256 → 512 → 640

I trained it only on normal sounds and used file-level reconstruction error (MSE) as the anomaly score.
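The input construction and scoring described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the experiment: `stack_frames`, `file_level_score`, and the `reconstruct` callable (standing in for the trained autoencoder) are hypothetical names, and the frame-major ordering of the 640-dim vector is an assumption.

```python
import numpy as np


def stack_frames(log_mel: np.ndarray, n_frames: int = 5) -> np.ndarray:
    """Turn a (n_mels, T) log-mel spectrogram into (T - n_frames + 1, n_mels * n_frames) vectors."""
    n_mels, total = log_mel.shape
    n_vectors = total - n_frames + 1
    if n_vectors < 1:
        raise ValueError("clip is shorter than the context window")
    # For each start index, concatenate n_frames consecutive frames (frame-major order, an assumption)
    return np.stack(
        [log_mel[:, i:i + n_frames].T.reshape(-1) for i in range(n_vectors)]
    )


def file_level_score(vectors: np.ndarray, reconstruct) -> float:
    """Anomaly score for one file: MSE between every input vector and its reconstruction."""
    recon = reconstruct(vectors)  # `reconstruct` is a placeholder for the trained AE
    return float(np.mean((vectors - recon) ** 2))
```

With 128 mel bins and 5 frames, each row of the stacked matrix has 128 × 5 = 640 values, which matches the encoder's input width. A file whose vectors are reconstructed poorly gets a higher score.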

Results:

| Metric | Value |
| --- | --- |
| AUC-ROC | 0.5659 |
| Partial AUC (FPR≤0.1) | 0.5732 |
| Normal score average | 0.4258 |
| Anomalous score average | 0.4513 |
| Separability | 0.0255 |
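To make the relationship between these numbers concrete, here is a small sketch of how the two quantities can be computed from per-file scores. The function names are hypothetical; the AUC is computed by brute-force pairwise comparison, which is fine for 100 + 100 test clips but is only one of several equivalent ways to get the same value.

```python
def separability(normal_scores, anomalous_scores) -> float:
    """Mean anomalous score minus mean normal score."""
    mean_normal = sum(normal_scores) / len(normal_scores)
    mean_anomalous = sum(anomalous_scores) / len(anomalous_scores)
    return mean_anomalous - mean_normal


def auc_roc(normal_scores, anomalous_scores) -> float:
    """Probability that a random anomalous clip outscores a random normal clip (ties count as 0.5)."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(normal_scores) * len(anomalous_scores))


# The reported averages reproduce the separability in the table
print(round(0.4513 - 0.4258, 4))
```

An AUC of 0.5 means the two score distributions are indistinguishable by rank, so a value of 0.5659 says the small gap in means translates into only a slight ranking advantage.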

DCASE Bearing Emu is a difficult dataset; past official baselines typically see AUCs in the 0.55-0.65 range. I would say this is a reasonably valid starting point.

Attempt 2: MobileNetV2-based CNN AutoEncoder

I switched to a MobileNetV2-based model, hypothesizing that a more powerful model would improve accuracy.

import torch.nn as nn
from torchvision.models import MobileNet_V2_Weights, mobilenet_v2


class MobileNetAutoEncoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Encoder: MobileNetV2 pre-trained on ImageNet
        backbone = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
        self.features = backbone.features  # (B, 1280, H/32, W/32)
        self.pool = nn.AdaptiveAvgPool2d((4, 2))
        self.bottleneck = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1280 * 4 * 2, 512),
            nn.BatchNorm1d(512), nn.ReLU(inplace=True),
            nn.Linear(512, latent_dim),
        )
        # Decoder: Step-wise upsampling with ConvTranspose2d
        ...

    def forward(self, x):
        x_3ch = x.repeat(1, 3, 1, 1)  # 1ch → 3ch (for ImageNet)
        # Encode with the layers defined above: features → pool → bottleneck
        z = self.bottleneck(self.pool(self.features(x_3ch)))
        # Assumes self.decoder is set in the elided decoder section of __init__
        return self.decoder(z)

The input was created by slicing log-mel spectrograms into 128×64 patches, broadcasting 1ch to 3ch, and feeding them into MobileNetV2.
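The patching step could look something like the sketch below. It is an illustration under assumptions: the helper names are hypothetical, and the hop of 64 frames (non-overlapping patches) is my choice for the example, since the overlap is not stated above.

```python
import numpy as np


def slice_patches(log_mel: np.ndarray, patch_width: int = 64, hop: int = 64) -> np.ndarray:
    """Cut a (n_mels, T) log-mel spectrogram into patches of shape (N, 1, n_mels, patch_width)."""
    n_mels, total = log_mel.shape
    if total < patch_width:
        raise ValueError("clip is shorter than one patch")
    # hop=64 (no overlap) is an assumption made for this example
    starts = range(0, total - patch_width + 1, hop)
    patches = np.stack([log_mel[:, s:s + patch_width] for s in starts])
    return patches[:, np.newaxis, :, :]  # add a channel axis


def to_three_channels(patches: np.ndarray) -> np.ndarray:
    """Copy the single channel three times, like x.repeat(1, 3, 1, 1) in PyTorch."""
    return np.repeat(patches, 3, axis=1)
```

A 128×64 patch passed through a backbone that downsamples by 32 yields a 4×2 feature map, which is consistent with the `AdaptiveAvgPool2d((4, 2))` in the model above. Note that the three channels carry identical information, so the copy adds no new signal; it only satisfies the input shape that the ImageNet weights expect.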

I trained on a GPU (RTX 2070) with a limit of 80 epochs; early stopping ended the run at epoch 75.

Results:

| Metric | Dense AE | MobileNetV2 CNN |
| --- | --- | --- |
| AUC-ROC | 0.5659 | 0.5466 |
| Partial AUC | 0.5732 | 0.5374 |
| Separability | 0.0255 | 0.0206 |

It actually got worse.

Failure Reason 1: The Information Leak Problem

This was the biggest cause of failure.

The premise of AE-based anomaly detection is that "an AE trained only on normal sounds can reconstruct normal sounds cleanly, but reconstruction collapses for anomalous sounds." However, the more powerful the CNN AE is, the better it can "vaguely" reconstruct even unknown anomalous sounds.

This is a known issue in the world of visual anomaly detection called Information Leak or the identity shortcut.

Looking at the separability (anomalous mean - normal mean) across three trials:

| Model | Separability |
| --- | --- |
| Dense AE (smaller model) | 0.0255 |
| MobileNetV2 (CPU) | 0.0264 |
| MobileNetV2 (GPU, 75 epochs) | 0.0206 ← The gap is shrinking instead of improving |

As training progresses, the gap between the normal and anomalous scores shrinks. This is consistent with the AE "over-generalizing": it learns to reconstruct anomalous sounds almost as well as normal ones.

Lesson: In AE-based anomaly detection, making the model stronger is not necessarily the correct answer.

Failure Reason 2: Domain Mismatch of ImageNet Pre-training

The pre-trained weights of MobileNetV2 were acquired from natural images in ImageNet—edge detectors for dogs, cats, cars, and buildings.

On the other hand, a mel-spectrogram "image" has time on the horizontal axis, frequency on the vertical axis, and log-power as the value. The statistical properties are fundamentally different from natural images:

  • Natural images: spatial continuity, edges, textures
  • Mel-spectrograms: frequency-dependent vertical structures, rhythmic patterns in the time direction

To benefit from transfer learning, the source domain (ImageNet) and target domain (mel-spectrogram) must have similar distributions. These two were simply too far apart.

In fact, if I had used a pre-trained model for spectrograms (like VGGish or PANNs trained on AudioSet), the situation would likely have been different, but that is a separate topic.

Lesson: Pre-training is not a panacea. If the domains are distant, it can even be harmful.

Failure Reason 3: Environmental Issues (The "torchvision installed as CPU version" incident)

This is more of an operational mistake than a technical failure, but I will share it anyway.

During the initial MobileNetV2 run, I simply ran pip install torchvision. This installed the CPU build of torchvision and, at the same time, replaced my existing GPU build of torch with the CPU build (2.11.0+cpu).

Result:

  • The log showed "Training device: cpu" (I noticed too late)
  • About 80 seconds per epoch, 40 minutes for 30 epochs
  • The model never fully converged in the first place

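When using a GPU, always install both packages from the same index (the cu124 suffix selects the CUDA 12.4 builds, so use the one that matches your environment):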

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

Lesson: Always check the "device: cpu/cuda" at the beginning of the logs.
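One way to make this check hard to miss is to print it before training starts. The sketch below is a suggestion, not part of the original training script; `is_cpu_only_build` and `describe_environment` are hypothetical helper names, and the "+cpu" suffix test is only a heuristic based on the version string seen above (2.11.0+cpu).

```python
def is_cpu_only_build(version: str) -> bool:
    """Heuristic: CPU-only wheels from the PyTorch index carry a '+cpu' local version tag."""
    return version.endswith("+cpu")


def describe_environment() -> str:
    """Return a one-line summary of the installed torch build and the device it will use."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    note = " (CPU-only wheel)" if is_cpu_only_build(torch.__version__) else ""
    return f"torch {torch.__version__}{note}, device: {device}"


# Print this before the first epoch so a silent CPU fallback is visible immediately
print(describe_environment())
```

If a GPU is required, it may be safer to abort when the reported device is "cpu" rather than only logging it.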

Didn't Reach AUC 0.65, But Learned Something

Looking only at the results, it was a failure. The AUC dropped from 0.57 to 0.55.

However, re-examining past SOTA methods for DCASE Task 2, I realized that the winning teams are not using AEs.

SOTA approaches fall into two main categories:

1. Mahalanobis distance-based (PaDiM-style)
Fit a Gaussian to the intermediate-layer features that a pre-trained CNN produces for normal sounds, then score each test sound by its Mahalanobis distance from that distribution. Because this does not rely on reconstruction, the Information Leak problem does not arise.
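As a rough idea of what the scoring step involves, here is a minimal sketch. It is a simplification: it fits a single Gaussian to one embedding vector per clip, whereas PaDiM proper fits a Gaussian per patch position. The function names and the regularization constant `eps` are assumptions for this example, and feature extraction from the CNN is left out.

```python
import numpy as np


def fit_gaussian(normal_features: np.ndarray, eps: float = 1e-3):
    """Fit mean and inverse covariance to (N, D) features extracted from normal clips."""
    mean = normal_features.mean(axis=0)
    cov = np.atleast_2d(np.cov(normal_features, rowvar=False))
    # Add a small ridge so the covariance stays invertible (eps is an assumed value)
    cov = cov + eps * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis_score(features: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each (D,) row in `features` from the fitted Gaussian."""
    diff = features - mean
    # Quadratic form diff^T * cov_inv * diff, evaluated row by row
    return np.sqrt(np.einsum("nd,de,ne->n", diff, cov_inv, diff))
```

The larger the distance, the less the clip resembles the normal training data. No decoder is involved, so there is nothing that can learn to reconstruct anomalies.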

2. Self-supervised classification-based
Apply data augmentations (pitch shift, noise addition) to normal sounds, train a classifier to tell the resulting variants apart as separate classes, and use distances in the trained model's embedding space as the anomaly score.

In other words, the biggest takeaway this time was that the very idea of "AE reconstruction error" is outdated.

Implications for Industrial Predictive Maintenance

I have worked in the quality control department of a machinery manufacturer for 10 years. Anomaly detection using vibration and acoustic data is an area I expect to be a core focus of my DX (digital transformation) consulting work once I reach Coast FIRE.

What I keenly felt through this experiment is:

  1. "Reconstruction error with AE" is for beginners, not for production.
    If you bring an AE into an on-site PoC thinking "this might work," there is a high probability you will fall short of the customer's expectations.

  2. Know the failure patterns before building a baseline.
    Information Leak, domain mismatch, pitfalls in evaluation metrics: if you take on the task without knowing about these, you will fail.

  3. Do not blindly trust AUC values from public datasets.
    Even with the same AUC of 0.6, the false positive rate (FPR) is critical for actual factory machines. Evaluating with Partial AUC is appropriate for field operations.
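To illustrate the third point, the sketch below computes a partial AUC restricted to FPR ≤ 0.1. It is written in plain Python for clarity, with hypothetical function names. The final rescaling follows the McClish standardization (0.5 for a chance-level detector, 1.0 for a perfect one), which is what scikit-learn's `roc_auc_score` applies when `max_fpr` is set; whether it matches the exact evaluation script used for the numbers above is an assumption.

```python
def roc_points(normal_scores, anomalous_scores):
    """ROC curve as (fpr, tpr) points, sweeping the threshold from high to low."""
    labeled = sorted(
        [(s, 0) for s in normal_scores] + [(s, 1) for s in anomalous_scores],
        key=lambda pair: -pair[0],
    )
    n_neg, n_pos = len(normal_scores), len(anomalous_scores)
    points, fp, tp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(labeled):
        threshold = labeled[i][0]
        # Consume every clip tied at this score before emitting a point
        while i < len(labeled) and labeled[i][0] == threshold:
            tp += labeled[i][1]
            fp += 1 - labeled[i][1]
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(normal_scores, anomalous_scores, max_fpr: float = 0.1) -> float:
    """Standardized area under the ROC curve for FPR in [0, max_fpr]."""
    points = roc_points(normal_scores, anomalous_scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the segment at max_fpr by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    min_area = 0.5 * max_fpr ** 2  # area of a chance-level detector in this range
    return 0.5 * (1.0 + (area - min_area) / (max_fpr - min_area))
```

For example, a detector that ranks a single normal clip above every anomalous clip can still have a high full AUC, yet its partial AUC at FPR ≤ 0.1 falls below 0.5 because that one false alarm uses up the entire low-FPR budget.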

Next time, I will retry with a PaDiM-style Mahalanobis distance approach. I will write again once I clear an AUC of 0.65.

Code Used

The full code is available on GitHub:

Discussion

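Harui

Whether ImageNet pre-training actually hurt cannot be determined without comparing the same architecture trained from random initialization against one trained from the pre-trained weights. In my experience, pre-trained weights tend to give better results even when there is a domain gap from the pre-training data.

ゆーの

You are right. Since I did not compare against random initialization, I cannot conclude that pre-training was the cause. I will add a comparison experiment. Thank you.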