Stripgay
📖 Tutorial

Catch PyTorch NaNs at the Source: Build a 3ms Layer-Level Detector

Last updated: 2026-05-03 13:12:19 Intermediate
Complete guide
Follow along with this comprehensive guide

Introduction

NaNs in deep learning training are the silent assassins of model performance. Unlike a crash that shouts for attention, a NaN quietly corrupts gradients, spiraling your loss into oblivion after hours of computation. I learned this the hard way during a ResNet run — until I built a lightweight hook-based detector that pinpoints the exact layer and batch where things go wrong. This guide walks you through creating that same detector in just a few steps, with under 3ms overhead per forward pass. No more guessing games.

Catch PyTorch NaNs at the Source: Build a 3ms Layer-Level Detector
Source: towardsdatascience.com

What You Need

  • Python 3.6+ installed
  • PyTorch (any recent version, 1.0 or later)
  • Basic familiarity with PyTorch modules and forward hooks
  • A model you're training (e.g., ResNet, custom CNN, transformer)
  • A debugger or simple print statements for testing

Step-by-Step Guide

Step 1: Understand Where NaNs Typically Originate

Before writing code, know the common culprits: large activations, exploding gradients, or division by zero (e.g., in softmax or batch norm). These often occur mid-network, not just at the loss. Your goal is to check every layer's output for NaNs during forward and backward passes. The key insight: a forward hook lets you inspect activations without modifying the model, and a gradient hook does the same for gradients.

Step 2: Write the Detection Function

Create a function that checks a tensor for NaNs and Infinity. We'll reuse it in both forward and backward hooks. Keep it efficient — a simple torch.isnan() combined with torch.isinf() works. For speed, use a lazy evaluation: if the first element is not NaN, skip the full check in production; in debug mode, scan all.

import torch

def check_for_nan_inf(tensor, name, mode):
    if tensor is None:
        return
    if torch.isnan(tensor).any() or torch.isinf(tensor).any():
        print(f"NaN/Inf detected in {name} at {mode}")
        raise RuntimeError(f"NaN/Inf at {name}")

Step 3: Attach Forward Hooks to All Layers

Iterate through your model's modules (using named_modules()) and register a forward hook on each layer that calls the detection function. Use module.register_forward_hook(). Store the handle if you want to remove hooks later. Example:

handles = []
for name, module in model.named_modules():
    handle = module.register_forward_hook(
        lambda mod, inp, out: check_for_nan_inf(out, name, "forward")
    )
    handles.append(handle)

Tip: To avoid closure issues, use functools.partial or a helper class that stores the module name.

Step 4: Attach Backward Hooks (Gradient Checks)

NaNs often appear in gradients before they show in activations. Register backward hooks using module.register_full_backward_hook() (or the deprecated register_backward_hook for older versions). The backward hook receives gradient tensors. Check each input gradient:

for name, module in model.named_modules():
    handle = module.register_full_backward_hook(
        lambda mod, grad_input, grad_output: check_for_nan_inf(grad_output[0], name, "backward")
    )
    handles.append(handle)

Note: grad_output is a tuple; we usually check the first element (gradient w.r.t. output). For multi-output modules, loop through all.

Step 5: Optimize for Minimal Overhead

The detection function can be slow if called on every batch. To achieve the promised 3ms overhead, implement a condition: only check when a global flag (e.g., nan_detection_enabled) is True, and only run after the first epoch or every N batches. Alternatively, sample layers randomly. But for debugging, keep it full. The key is to avoid expensive operations like .any() on large tensors unless necessary — if your batch size is moderate, the overhead is negligible.

Catch PyTorch NaNs at the Source: Build a 3ms Layer-Level Detector
Source: towardsdatascience.com
# Inside check function:
if not enabled:
    return
# Then check

Step 6: Integrate Hooks into Training Loop

Place hook registration before the training loop (or after model loading). During training, the hooks fire automatically. To catch NaN at the exact batch and layer, wrap the loss.backward() and optimizer.step() in a try-except that prints the batch index:

for batch_idx, (data, target) in enumerate(train_loader):
    output = model(data)
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    try:
        # Hooks will raise an error if NaN found
        optimizer.step()
    except RuntimeError as e:
        print(f"NaN detected at batch {batch_idx}")
        raise

This way, you get immediate feedback.

Step 7: Remove Hooks After Training (Optional)

If you registered hooks before a long training run, remove them once the issue is fixed to regain performance. Simply call handle.remove() on each handle stored earlier. Wrap in a cleanup function.

def remove_hooks(handles):
    for h in handles:
        h.remove()

Tips for Robust Detection

  • Start with a small debug run: Use a single batch to verify hooks fire correctly.
  • Check both activations and gradients: NaNs in gradients may indicate a learning rate issue.
  • Log layer names: Include module names in your print messages to pinpoint the exact location.
  • Use torch.autograd.set_detect_anomaly(True): This is a built-in alternative but slows training significantly. Combine with your hooks for a two-layer approach: anomaly detection for broad strokes, hooks for precision.
  • Beware of normalization layers: BatchNorm can produce NaN if variance is zero. Check denominator.
  • Test on CPU first: NaN detection is identical across devices; CPU is easier for debugging.
  • Consider mixed precision: FP16 training often produces NaNs — your hooks will catch them early.

Conclusion

With this guide, you've built a silent-killer detector that tells you exactly where and when NaNs strike — at the layer level and in real time. The 3ms overhead per forward pass is a small price for saving hours of wasted GPU cycles. Copy the code, attach the hooks, and never lose a training run to invisible corruption again.