
Mastering Controller Resilience: A Guide to Staleness Mitigation and Observability in Kubernetes v1.36


Overview

Kubernetes controllers are the backbone of automation, continuously reconciling desired and actual cluster states. However, they are susceptible to staleness: an internal cache that has fallen behind the API server can lead to incorrect actions, missed reconciliations, or delayed responses. In production, staleness often goes unnoticed until a controller takes a wrong turn, like scaling down a deployment prematurely. Kubernetes v1.36 introduces two key enhancements to address this: atomic FIFO processing in client-go and improved observability for controllers. This tutorial provides a practical guide to understanding, enabling, and leveraging these features to build more reliable controllers.


Prerequisites

  • A Kubernetes cluster running v1.36 or later (or a control plane upgraded to v1.36).
  • kubectl access with cluster-admin privileges.
  • Basic understanding of Kubernetes controllers and the reconciliation loop.
  • Familiarity with YAML manifests and command-line operations.
  • Optional: A development environment with Go 1.20+ if you plan to test custom controllers.

Step-by-Step Instructions

1. Understand Staleness in Controllers

Controllers maintain a local cache (informer store) populated via watches. Staleness occurs when this cache diverges from the actual API server state. Common causes include:

  • Controller restarts: Cache rebuilds can lag behind live changes.
  • API server outages: Watch connections drop and updates are missed.
  • Out-of-order events: During initial list operations, batch processing could introduce temporary inconsistencies.

Before v1.36, the FIFO queue in client-go processed events in the order received, which could result in a cache state that never existed in the API server (e.g., applying an update before the corresponding create). The new AtomicFIFO feature ensures that a batch of initial events is processed atomically, maintaining consistency.
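
To make the ordering problem concrete, here is a small illustrative Go sketch. The types are invented for this example and are not client-go code; it contrasts per-event application, where a concurrent reader can observe a half-applied batch, with atomic application of the whole batch under one lock:

    package main

    import (
        "fmt"
        "sync"
    )

    // event is a simplified stand-in for a client-go delta (add/update/delete).
    type event struct {
        key string
        val string
    }

    // store is a toy informer cache guarded by a mutex.
    type store struct {
        mu    sync.Mutex
        items map[string]string
    }

    // applyOneByOne releases the lock between events, so a concurrent reader
    // can observe a half-applied batch: a state that never existed upstream.
    func (s *store) applyOneByOne(batch []event) {
        for _, e := range batch {
            s.mu.Lock()
            s.items[e.key] = e.val
            s.mu.Unlock()
        }
    }

    // applyAtomically holds the lock across the whole batch, so readers see
    // either the state before the batch or the state after it, never between.
    func (s *store) applyAtomically(batch []event) {
        s.mu.Lock()
        defer s.mu.Unlock()
        for _, e := range batch {
            s.items[e.key] = e.val
        }
    }

    func main() {
        s := &store{items: map[string]string{}}
        s.applyAtomically([]event{{"pod-a", "v1"}, {"pod-b", "v1"}})
        fmt.Println(s.items) // map[pod-a:v1 pod-b:v1]
    }

The atomic shape is what the new initial-list batching provides.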

2. Enable AtomicFIFO in client-go

The AtomicFIFO feature gate is alpha in v1.36. To use it in a custom controller, you must enable the feature gate in the controller process.

  1. Set the environment variable or flag:

    KUBE_FEATURE_GATES=AtomicFIFO=true go run main.go
    

    Or add it to your deployment manifest:

    env:
    - name: KUBE_FEATURE_GATES
      value: "AtomicFIFO=true"
    
  2. Verify the feature is active in your controller logs:

    // assumes utilfeature = k8s.io/apiserver/pkg/util/feature and a features package defining the AtomicFIFO gate
    log.Printf("Using AtomicFIFO: %v", utilfeature.DefaultFeatureGate.Enabled(features.AtomicFIFO))
    

Once enabled, the informer’s FIFO queue will process initial list events as a single atomic unit. This prevents the cache from seeing intermediate states that never existed.
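
For reference, a minimal informer setup with standard client-go APIs looks like the sketch below; nothing in the setup code itself changes when the gate is on, since the batching happens inside the informer's queue (the 30-second resync period here is an arbitrary choice):

    package main

    import (
        "log"
        "time"

        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/cache"
    )

    func main() {
        // In-cluster config; use clientcmd for out-of-cluster development.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        clientset, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        // The FIFO queue this factory builds internally is where the atomic
        // batching described above applies.
        factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
        podInformer := factory.Core().V1().Pods().Informer()
        podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) { log.Printf("add: %T", obj) },
        })

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        if !cache.WaitForCacheSync(stop, podInformer.HasSynced) {
            log.Fatal("cache failed to sync")
        }
        select {} // real controllers launch workers here instead
    }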

3. Add Observability for Controller Actions

v1.36 also introduces new metrics and events to help monitor controller behavior. These are emitted automatically by the updated controllers in kube-controller-manager; custom controllers can expose them by integrating the latest client-go, as shown below.

  • Metric: controller_staleness_seconds – measures how stale the cache is (time since last sync).
  • Event: ControllerCacheStale – triggered when staleness exceeds a threshold (default: 10 seconds).

To expose these in your custom controller:

  1. Import the metrics and registry packages:

    import "k8s.io/component-base/metrics"
    import "k8s.io/component-base/metrics/legacyregistry"
    
  2. Register the staleness metric:

    // Registration goes through the legacy registry helper, which backs the
    // /metrics endpoint served by component-base.
    var stalenessMetric = metrics.NewGauge(
        &metrics.GaugeOpts{
            Name:           "controller_staleness_seconds",
            Help:           "Time since last successful cache sync.",
            StabilityLevel: metrics.ALPHA,
        },
    )
    legacyregistry.MustRegister(stalenessMetric)
    
  3. Update the metric in your reconciliation loop:

    staleness := time.Since(lastSyncTime).Seconds()
    stalenessMetric.Set(staleness)
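
The snippet above references lastSyncTime without defining it. One hedged way to wire it up: a bookkeeping variable of our own (not a client-go field), updated by the reconciliation loop and read by a background ticker that refreshes the gauge every five seconds (an arbitrary interval):

    // Illustrative wiring; assumes the stalenessMetric gauge registered above,
    // plus "sync" and "time" imports. lastSyncTime is our own bookkeeping.
    var (
        syncMu       sync.Mutex
        lastSyncTime = time.Now()
    )

    // recordSync is called from the reconciliation loop after each successful sync.
    func recordSync() {
        syncMu.Lock()
        defer syncMu.Unlock()
        lastSyncTime = time.Now()
    }

    // reportStaleness runs in its own goroutine and refreshes the gauge.
    func reportStaleness(stop <-chan struct{}) {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                syncMu.Lock()
                stalenessMetric.Set(time.Since(lastSyncTime).Seconds())
                syncMu.Unlock()
            case <-stop:
                return
            }
        }
    }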
    

4. Test Staleness Mitigation

Simulate a scenario where staleness could cause incorrect behavior and verify that AtomicFIFO prevents it.

  1. Deploy a sample controller (e.g., a replica set scaler) without AtomicFIFO enabled.

  2. Force a cache rebuild by restarting the controller while simultaneously creating new objects:

    kubectl run test-pod --image=nginx &
    sleep 1
    kubectl rollout restart deployment/my-controller
    
  3. Check if the controller took an incorrect action (e.g., scaling down too early).

  4. Enable AtomicFIFO (as in step 2) and repeat the test. Observe that the controller now waits for the initial batch to be applied atomically before acting.

Use the new metric to confirm. Note that kubectl get --raw /metrics returns the API server's own metrics; a custom controller serves controller_staleness_seconds from its metrics endpoint, so port-forward to the controller instead (the deployment name and port below are illustrative):

    kubectl port-forward deployment/my-controller 8080:8080 &
    curl -s localhost:8080/metrics | grep controller_staleness

5. Set Up Monitoring and Alerts

To catch staleness in production:

  • Scrape the controller_staleness_seconds metric with Prometheus.
  • Create an alert if the value exceeds a threshold (e.g., > 30 seconds).
  • Watch for ControllerCacheStale events:

    kubectl get events --field-selector reason=ControllerCacheStale
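
The same field selector works programmatically through client-go, which can be handy in smoke tests or health checks. This sketch assumes you already have a typed clientset, as in the informer example above:

    import (
        "context"
        "log"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // logStaleEvents lists ControllerCacheStale events in all namespaces.
    func logStaleEvents(clientset kubernetes.Interface) error {
        events, err := clientset.CoreV1().Events("").List(context.TODO(),
            metav1.ListOptions{FieldSelector: "reason=ControllerCacheStale"})
        if err != nil {
            return err
        }
        for _, e := range events.Items {
            log.Printf("%s/%s: %s", e.Namespace, e.InvolvedObject.Name, e.Message)
        }
        return nil
    }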

Common Mistakes

  • Assuming the cache always reflects reality: Even with AtomicFIFO, the cache can still be stale during network partitions or prolonged API server outages. Always implement fallback logic, such as direct API server calls for critical decisions (a minimal sketch follows this list).
  • Enabling AtomicFIFO without updating dependencies: Ensure your client-go version matches Kubernetes v1.36; an older version will silently ignore the feature gate.
  • Overlooking observability for custom controllers: The built-in metrics only apply if you explicitly integrate them. Do not assume they appear automatically.
  • Incorrect feature gate syntax: Use AtomicFIFO=true (not AtomicFIFO=1 or true alone).
  • Not testing under realistic race conditions: Develop test scenarios where objects are created, updated, and deleted while the controller restarts; see the second sketch after this list.
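
On the first point above, fallback logic for critical decisions can be as simple as a live read that bypasses the informer cache. A minimal sketch using the standard typed clientset (imports as in the earlier snippets; the Pod type is just an example resource):

    // getLive reads the object straight from the API server, bypassing the
    // informer cache. Reserve this for decisions where acting on stale data
    // would be destructive, since every call adds load to the API server.
    func getLive(ctx context.Context, clientset kubernetes.Interface, ns, name string) (*corev1.Pod, error) {
        return clientset.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
    }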
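
And for the last point, the fake clientset that ships with client-go makes it easy to script create/update/delete churn around a controller restart. A minimal sketch, with illustrative object names and ordering:

    package main

    import (
        "context"
        "log"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes/fake"
    )

    func main() {
        clientset := fake.NewSimpleClientset()
        ctx := context.TODO()

        // Create, update, and delete a pod in quick succession, simulating
        // the churn a controller sees while its cache rebuilds after a restart.
        pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "test-pod", Namespace: "default"}}
        if _, err := clientset.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
            log.Fatal(err)
        }
        pod.Labels = map[string]string{"phase": "updated"}
        if _, err := clientset.CoreV1().Pods("default").Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
            log.Fatal(err)
        }
        if err := clientset.CoreV1().Pods("default").Delete(ctx, "test-pod", metav1.DeleteOptions{}); err != nil {
            log.Fatal(err)
        }
        // Point the controller's informers at this clientset and restart them
        // between steps to reproduce the race windows described above.
        log.Println("churn complete")
    }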

Summary

Staleness in Kubernetes controllers can cause subtle, hard-to-diagnose failures. With v1.36, you can enable AtomicFIFO processing to ensure cache consistency during initial sync, and leverage new observability features to monitor cache health. This combination reduces the risk of incorrect controller actions, improves response times, and gives operators visibility into potential issues before they escalate. Implement these practices in your custom controllers to make your cluster management more resilient.