Adversarial Training

What is Adversarial Training?

Adversarial Training is a technique used to improve the robustness of machine learning models, particularly deep learning models, against adversarial examples. It augments the training set with adversarial examples and trains the model on the augmented data. This encourages the model to learn features that are invariant to adversarial perturbations, making it less susceptible to attacks.

How does Adversarial Training work?

The main steps of adversarial training are:

  1. Generate adversarial examples for the current training batch using an attack method, such as FGSM or PGD (a PGD sketch is given just after this list).
  2. Combine the original training batch with the generated adversarial examples.
  3. Train the model on the combined dataset, updating its weights based on the loss computed on both original and adversarial examples.
  4. Repeat the process for each training batch until the model converges.
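
In step 1, a single-step attack like FGSM perturbs each input once along the sign of the loss gradient, while PGD repeats smaller steps and projects the result back into an epsilon-ball around the original input. Below is a rough, illustrative PGD sketch; the epsilon, step size, and step count are assumed values, and loss_fn is any suitable loss such as sparse categorical cross-entropy.

import tensorflow as tf

def pgd_attack(x, y, model, loss_fn, epsilon=0.1, alpha=0.02, steps=10):
    # Iteratively step along the sign of the input gradient
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    x_adv = x
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_fn(y, model(x_adv, training=False))
        gradient = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * tf.sign(gradient)
        # Project back into the epsilon-ball around x and the valid pixel range
        x_adv = tf.clip_by_value(x_adv, x - epsilon, x + epsilon)
        x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)
    return x_adv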

Example of adversarial training in Python with TensorFlow:

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0
X_train = X_train[..., tf.newaxis]
X_test = X_test[..., tf.newaxis]

# Define the model architecture
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10)
])

# Compile the model
model.compile(optimizer='adam', 
              loss=SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])
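
# The training loop below calls generate_adversarial_examples, which the original
# snippet does not define. The following is a minimal sketch using FGSM; the epsilon
# value and the extra label argument are assumptions made here for illustration.
loss_object = SparseCategoricalCrossentropy(from_logits=True)

def generate_adversarial_examples(x, y, model, epsilon=0.1):
    # Compute the gradient of the loss with respect to the input
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_object(y, model(x, training=False))
    gradient = tape.gradient(loss, x)
    # Step in the direction of the gradient's sign and clip back to the valid range
    x_adv = x + epsilon * tf.sign(gradient)
    return tf.clip_by_value(x_adv, 0.0, 1.0)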

# Define the adversarial training loop (hyperparameter values chosen for illustration)
epochs = 5
batch_size = 64
num_batches = len(X_train) // batch_size

for epoch in range(epochs):
    for batch in range(num_batches):
        # Slice out the current training batch
        start = batch * batch_size
        x_batch = tf.convert_to_tensor(X_train[start:start + batch_size], dtype=tf.float32)
        y_batch = tf.convert_to_tensor(y_train[start:start + batch_size], dtype=tf.int64)

        # Generate adversarial examples for the current training batch
        x_batch_adv = generate_adversarial_examples(x_batch, y_batch, model)

        # Combine the original training batch with the adversarial examples
        x_combined = tf.concat([x_batch, x_batch_adv], axis=0)
        y_combined = tf.concat([y_batch, y_batch], axis=0)

        # Train the model on the combined batch
        model.train_on_batch(x_combined, y_combined)

# Evaluate the model on the test set
model.evaluate(X_test, y_test)
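
# Accuracy on the clean test set does not measure robustness. As an additional,
# illustrative check, the FGSM helper above can also be applied to the test set
# (for brevity this runs on the full test set at once).
X_test_adv = generate_adversarial_examples(X_test, y_test, model)
model.evaluate(X_test_adv, y_test)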

More resources on Adversarial Training: