
【Method for Robustness】OUSM in machine learning

Published on 2024/09/04

1. What is the OUSM?

In this article, OUSM stands for Online Uncertainty Sampling Method. It is a technique used in active learning, a subfield of machine learning in which the algorithm selectively queries the most informative data points for labeling, so that model performance improves with fewer labeled instances.

Here, "informative" means the following:
When a model is uncertain about a particular data point (e.g., it gives a low-confidence prediction or is undecided between multiple classes), the model has not yet learned that region of the data space well, so a label for that point is likely to teach it something new.

2. Features:

  1. Uncertainty Sampling: The method focuses on selecting data points for which the model is most uncertain. Uncertainty is typically measured using metrics like entropy, margin, or variance of the model's predictions. These uncertain data points are then queried for labels to improve the model's performance.

  2. Efficiency: OUSM aims to improve model performance with less labeling effort. Instead of labeling large amounts of data, it strategically selects the most impactful data points for labeling.
    This makes OUSM particularly useful when labeling is expensive or time-consuming, as in medical imaging or natural language processing.
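
The uncertainty metrics mentioned above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the function names (`least_confidence`, `margin`, `entropy`) and the toy probabilities are assumptions made for this example, and `probs` is assumed to be an array of class probabilities with one row per sample.

```python
import numpy as np

def least_confidence(probs):
    # 1 - (highest class probability); higher means more uncertain
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # Gap between the two most likely classes; smaller means more uncertain
    sorted_probs = np.sort(probs, axis=1)
    return sorted_probs[:, -1] - sorted_probs[:, -2]

def entropy(probs, eps=1e-12):
    # Shannon entropy of the predicted distribution; higher means more uncertain
    return -np.sum(probs * np.log(probs + eps), axis=1)

# Toy predictions (hypothetical values): one confident row, one uncertain row
probs = np.array([
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
])

print("least confidence:", least_confidence(probs))
print("margin:          ", margin(probs))
print("entropy:         ", entropy(probs))
```

Note that the direction differs by metric: a sample is more uncertain when its least-confidence score or entropy is high, but when its margin is low.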

In summary, OUSM is a strategy within active learning that helps to optimize the learning process by focusing on data points that are most uncertain, thereby improving model performance with less labeled data.

3. Example of OUSM

The following example simulates uncertainty sampling on synthetic data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split data into labeled (small) and unlabeled (large) pools
X_train, X_unlabeled, y_train, y_unlabeled = train_test_split(X, y, test_size=0.9, random_state=42)

# Step 2: Train an initial model on the small labeled set
model = LogisticRegression()
model.fit(X_train, y_train)

# Function to select uncertain samples based on prediction probabilities
def select_uncertain_samples(model, X_unlabeled, percent):
    # Predict probabilities for the unlabeled data
    probs = model.predict_proba(X_unlabeled)
    
    # Distance of the positive-class probability from 0.5.
    # A smaller distance means the model is more uncertain about the sample.
    distance_from_boundary = np.abs(probs[:, 1] - 0.5)

    # Determine the number of samples to select based on the percentage
    n_samples = int(percent * len(X_unlabeled))
    
    # Select the n_samples with the highest uncertainty (smallest distance from 0.5)
    uncertain_indices = np.argsort(distance_from_boundary)[:n_samples]
    
    return uncertain_indices

# Step 3: Simulate the active learning process
n_iterations = 10  # Number of active learning iterations
percent_samples_per_iter = 0.1  # Fraction of the remaining unlabeled pool to label in each iteration (0.1 = 10%)

for i in range(n_iterations):
    print(f"Iteration {i+1}/{n_iterations}")
    
    # Select the most uncertain samples
    uncertain_indices = select_uncertain_samples(model, X_unlabeled, percent_samples_per_iter)
    
    # Simulate labeling the uncertain samples (reveal their true labels)
    X_new = X_unlabeled[uncertain_indices]
    y_new = y_unlabeled[uncertain_indices]
    
    # Remove the selected samples from the unlabeled pool
    X_unlabeled = np.delete(X_unlabeled, uncertain_indices, axis=0)
    y_unlabeled = np.delete(y_unlabeled, uncertain_indices, axis=0)
    
    # Step 4: Add the newly labeled data to the training set
    X_train = np.vstack([X_train, X_new])
    y_train = np.hstack([y_train, y_new])
    
    # Step 5: Retrain the model with the updated training set
    model.fit(X_train, y_train)
    
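    # Evaluate on the training set (optional). Note that this is training accuracy,
    # which is optimistic; a held-out test set would give a fairer estimate.
    y_pred = model.predict(X_train)
    accuracy = accuracy_score(y_train, y_pred)
    print(f"Updated training accuracy: {accuracy:.4f}\n")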

# Final model performance
print("Final model trained on more data using uncertainty sampling.")

In the select_uncertain_samples function, this code finds the unlabeled samples whose predictions are most uncertain; only those samples are labeled and added to the training set.
This is a simplified example, but I think it conveys the idea behind OUSM.
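
The loop above reports accuracy on the training data itself, so it cannot show whether the selected samples help on unseen data. As a rough sketch, the variant below keeps a held-out test set and runs the same procedure with either uncertainty-based or random selection, so the two can be compared. The function name `run_active_learning`, the initial labeled-set size, the batch size, and the random baseline are all assumptions made for this illustration and are not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def run_active_learning(strategy, n_iterations=10, batch_size=20, seed=42):
    """Return held-out accuracy before querying and after each query round."""
    rng = np.random.default_rng(seed)
    X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                               random_state=seed)

    # Held-out test set that is never queried for labels
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)

    # Small initial labeled set; the rest stays in the unlabeled pool
    X_train, X_pool, y_train, y_pool = train_test_split(
        X_pool, y_pool, train_size=20, random_state=seed, stratify=y_pool)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies = [accuracy_score(y_test, model.predict(X_test))]

    for _ in range(n_iterations):
        if strategy == "uncertainty":
            # Samples whose positive-class probability is closest to 0.5
            probs = model.predict_proba(X_pool)[:, 1]
            query = np.argsort(np.abs(probs - 0.5))[:batch_size]
        else:
            # Baseline: label randomly chosen samples
            query = rng.choice(len(X_pool), size=batch_size, replace=False)

        # Move the queried samples from the pool to the training set
        X_train = np.vstack([X_train, X_pool[query]])
        y_train = np.hstack([y_train, y_pool[query]])
        X_pool = np.delete(X_pool, query, axis=0)
        y_pool = np.delete(y_pool, query, axis=0)

        # Retrain and evaluate on the held-out test set
        model.fit(X_train, y_train)
        accuracies.append(accuracy_score(y_test, model.predict(X_test)))

    return accuracies

for strategy in ("uncertainty", "random"):
    acc = run_active_learning(strategy)
    print(f"{strategy:>11}: start={acc[0]:.3f}, end={acc[-1]:.3f}")
```

Which strategy ends up ahead depends on the dataset, the model, and the labeling budget, so the printed numbers should be read as one run on one synthetic dataset rather than as a general result.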

In practice, OUSM is sometimes implemented so that mini-batches with high uncertainty are selected.
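
One possible reading of mini-batch selection is sketched below: the pool is split into consecutive mini-batches, each batch is scored by the mean entropy of its predictions, and the highest-scoring batches are kept. The function name `select_uncertain_batches`, the use of mean entropy as the batch score, and the toy probabilities are assumptions made for this illustration; actual implementations may score or build batches differently.

```python
import numpy as np

def select_uncertain_batches(probs, batch_size, n_batches):
    """Return the sample indices of the n_batches mini-batches with the highest mean entropy."""
    # Per-sample entropy of the predicted class distribution
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

    # Split sample indices into consecutive mini-batches (the last one may be smaller)
    batch_indices = [
        np.arange(start, min(start + batch_size, len(probs)))
        for start in range(0, len(probs), batch_size)
    ]

    # Score each mini-batch by its mean entropy and keep the most uncertain ones
    batch_scores = np.array([entropy[idx].mean() for idx in batch_indices])
    top_batches = np.argsort(batch_scores)[::-1][:n_batches]
    return [batch_indices[b] for b in top_batches]

# Toy predictions (hypothetical values): 6 samples, i.e. 3 mini-batches of size 2
probs = np.array([
    [0.99, 0.01],
    [0.95, 0.05],
    [0.55, 0.45],
    [0.50, 0.50],
    [0.80, 0.20],
    [0.70, 0.30],
])

selected = select_uncertain_batches(probs, batch_size=2, n_batches=1)
print(selected[0])  # indices of the samples in the most uncertain mini-batch
```

In this toy input, the second mini-batch (samples 2 and 3) has predictions closest to 50/50, so it is the one selected.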

4. Summary

In this article, I explained OUSM.
Its main features are:
・It trains on uncertain data (samples the model is not confident about) to maximize the value of each labeled instance.
・It is effective for datasets that are expensive to label, such as those in medical imaging or natural language processing.

Reference

[1] @charmq, "Kaggleへの取り組み方 ~validation編~" (How to Approach Kaggle: Validation Edition)
