Rate Limiting Implementation Plan (M-3)¶
- Status: 📝 Planned (Documentation Complete)
- Compliance: Basel III Availability, Operational Resilience
- Effort: 1-2 weeks
- Priority: MEDIUM
Overview¶
This document provides a comprehensive implementation plan for rate limiting in the Bindy operator to prevent runaway reconciliation loops, API server overload, and BIND9 server exhaustion.
Problem: Without rate limiting, the operator can:
- Overwhelm the Kubernetes API server with excessive requests
- Flood BIND9 servers with RNDC commands during cascading failures
- Exhaust pod CPU/memory during reconciliation storms
- Cause cluster-wide performance degradation
Solution: Implement multi-layer rate limiting at:
1. Reconciliation Loop Level - Limit reconciliation frequency per resource
2. Kubernetes API Client Level - Set QPS/burst limits for API calls
3. RNDC Client Level - Implement circuit breakers for BIND9 communication
4. Pod Resource Level - CPU/memory throttling via resource limits
1. Reconciliation Loop Rate Limiting¶
Current Behavior¶
The operator currently uses the default kube-rs reconciliation behavior (sketched below):
- Reconcile immediately on resource changes (watch events)
- Requeue on errors with exponential backoff (1s, 2s, 4s, 8s, ...)
- No global rate limit across all resources
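For reference, a minimal sketch of that default wiring, assuming the operator follows the standard kube-rs `Controller::run` pattern; the function name and the fixed-delay `error_policy` shown here are illustrative, not the operator's actual code:

```rust
use std::{sync::Arc, time::Duration};

use futures::StreamExt;
use kube::{
    runtime::{
        controller::{Action, Controller},
        watcher,
    },
    Api, Client,
};

async fn run_dnszone_controller(client: Client, ctx: Arc<Context>) {
    let zones: Api<DNSZone> = Api::all(client);

    Controller::new(zones, watcher::Config::default())
        // reconcile() fires immediately on every watch event; nothing here
        // limits how many reconciliations run per second across all zones.
        .run(reconcile, error_policy, ctx)
        .for_each(|_| futures::future::ready(()))
        .await;
}

// Simplified error policy: requeue after a fixed delay. The operator's real
// policy backs off exponentially (1s, 2s, 4s, ...), but still has no global cap.
fn error_policy(_zone: Arc<DNSZone>, _err: &ReconcileError, _ctx: Arc<Context>) -> Action {
    Action::requeue(Duration::from_secs(1))
}
```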
Problem:
- 1,000 DNS zones × reconcile every 5 minutes ≈ 3.3 reconciliations/second
- If all zones fail simultaneously → 1,000 immediate retries → API server overload
Proposed Solution¶
use governor::{Quota, RateLimiter};
use lazy_static::lazy_static;
use nonzero_ext::nonzero;

// Global rate limiter shared by all reconciliations:
// a sustained rate of 10 reconciliations per second, with bursts of up to 50.
lazy_static! {
    static ref RECONCILE_RATE_LIMITER: RateLimiter<
        governor::state::direct::NotKeyed,
        governor::state::InMemoryState,
        governor::clock::DefaultClock,
    > = RateLimiter::direct(
        Quota::per_second(nonzero!(10u32)).allow_burst(nonzero!(50u32))
    );
}
pub async fn reconcile(zone: Arc<DNSZone>, ctx: Arc<Context>) -> Result<Action, ReconcileError> {
// Wait for rate limiter before proceeding
RECONCILE_RATE_LIMITER.until_ready().await;
// ... existing reconciliation logic ...
}
Benefits:
- Prevents reconciliation storms (max 10/sec globally)
- Allows bursts for normal operations (up to 50)
- Protects the Kubernetes API server and BIND9 servers
- Provides automatic backpressure (reconciliations queue up)
Configuration:
Add to ConfigMap (bindy-config):
data:
# Reconciliation rate limiting
reconcile-rate-limit-per-second: "10"
reconcile-rate-limit-burst: "50"
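A sketch of how these ConfigMap values could feed the limiter at startup; `RateLimitConfig` and `build_reconcile_limiter` are illustrative names rather than existing Bindy code, and parsing the ConfigMap into the struct is assumed to happen elsewhere:

```rust
use governor::{Quota, RateLimiter};
use std::num::NonZeroU32;

/// Hypothetical configuration struct populated from the `bindy-config` ConfigMap.
struct RateLimitConfig {
    reconcile_rate_limit_per_second: u32,
    reconcile_rate_limit_burst: u32,
}

/// Build the global reconciliation limiter from configuration, falling back to
/// the documented defaults (10/sec, burst 50) when a value is missing or zero.
fn build_reconcile_limiter(
    cfg: &RateLimitConfig,
) -> RateLimiter<
    governor::state::direct::NotKeyed,
    governor::state::InMemoryState,
    governor::clock::DefaultClock,
> {
    let per_second = NonZeroU32::new(cfg.reconcile_rate_limit_per_second)
        .unwrap_or_else(|| NonZeroU32::new(10).unwrap());
    let burst = NonZeroU32::new(cfg.reconcile_rate_limit_burst)
        .unwrap_or_else(|| NonZeroU32::new(50).unwrap());
    RateLimiter::direct(Quota::per_second(per_second).allow_burst(burst))
}
```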
2. Kubernetes API Client Rate Limiting¶
Current Behavior¶
kube-rs uses default Kubernetes client-go rate limits:
- QPS (Queries Per Second): 5
- Burst: 10
Problem:
- Too low for 1,000+ DNS zones (~3.3 QPS is needed just for normal reconciliation)
- Can cause artificial delays and reconciliation lag
- Client-side rate limiting should match server capacity
Proposed Solution¶
use kube::{Client, Config};

pub async fn create_kubernetes_client() -> Result<Client> {
    let mut config = Config::infer().await?;

    // Set API client rate limits:
    //   QPS:   50  (allow 50 API calls per second on average)
    //   Burst: 100 (allow short bursts of up to 100 calls)
    //
    // NOTE: these field names are what this plan assumes; if the kube-rs
    // version in use does not expose client-side QPS/burst settings, the same
    // effect can be achieved by wrapping the client's HTTP service in a
    // tower rate-limiting layer.
    config.api_client_qps = 50.0;
    config.api_client_burst = 100;

    let client = Client::try_from(config)?;
    Ok(client)
}
Configuration:
Add to ConfigMap (bindy-config):
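For example (the key names below are placeholders chosen to mirror the reconciliation keys above; the final names are not fixed by this plan):

```yaml
data:
  # Kubernetes API client rate limiting (key names illustrative)
  kube-api-qps: "50"
  kube-api-burst: "100"
```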
Tuning Guidelines:
| Cluster Size | DNS Zones | Recommended QPS | Recommended Burst |
|---|---|---|---|
| Small (< 100 zones) | < 100 | 10 | 20 |
| Medium (100-1000 zones) | 100-1000 | 50 | 100 |
| Large (1000-5000 zones) | 1000-5000 | 100 | 200 |
| Extra Large (> 5000 zones) | > 5000 | 200 | 400 |
Monitoring:
Add Prometheus metrics for API client throttling:
// Increment when rate limited
api_client_throttled_total.inc();
// Track wait time
api_client_throttle_wait_seconds.observe(wait_duration.as_secs_f64());
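These two metrics are not declared anywhere else in the plan; a sketch of their registration with the prometheus crate, following the same `lazy_static!` pattern used for the reconciliation metrics later in this document (metric names are illustrative):

```rust
use lazy_static::lazy_static;
use prometheus::{register_histogram, register_int_counter, Histogram, IntCounter};

lazy_static! {
    // Count of API requests that were delayed by the client-side rate limiter.
    static ref API_CLIENT_THROTTLED_TOTAL: IntCounter = register_int_counter!(
        "bindy_api_client_throttled_total",
        "Number of Kubernetes API requests delayed by client-side rate limiting"
    ).unwrap();

    // Time spent waiting on the client-side rate limiter.
    static ref API_CLIENT_THROTTLE_WAIT_SECONDS: Histogram = register_histogram!(
        "bindy_api_client_throttle_wait_seconds",
        "Seconds spent waiting for the client-side rate limiter"
    ).unwrap();
}
```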
3. RNDC Circuit Breaker¶
Current Behavior¶
The operator makes RNDC calls to BIND9 servers without circuit breakers:
- Retries failed RNDC calls indefinitely
- No timeout between retries
- Can overwhelm a failing BIND9 server
Problem:
- A BIND9 server crashes → 1,000 zones retry RNDC → the server never recovers
- Cascading failures spread across all BIND9 instances
Proposed Solution¶
Implement retries with exponential backoff using `tokio-retry`, with a lightweight circuit breaker layered on top:
use tokio_retry::strategy::{ExponentialBackoff, jitter};
use tokio_retry::Retry;
use std::time::Duration;
const MAX_RNDC_RETRIES: usize = 3;
const RNDC_INITIAL_BACKOFF_MS: u64 = 100;
const RNDC_MAX_BACKOFF_MS: u64 = 5000;
pub async fn execute_rndc_with_circuit_breaker(
    cmd: &str,
    server: &str,
) -> Result<String, RndcError> {
    // Fail fast while the circuit for this server is open
    // (state is tracked in the SERVER_HEALTH map defined below).
    if SERVER_HEALTH.get(server).map(|h| h.circuit_open).unwrap_or(false) {
        return Err(RndcError::CircuitOpen {
            server: server.to_string(),
            last_error: "circuit breaker open; skipping RNDC call".to_string(),
        });
    }

    // Exponentially increasing delays with jitter, capped at RNDC_MAX_BACKOFF_MS,
    // for at most MAX_RNDC_RETRIES attempts.
    let retry_strategy = ExponentialBackoff::from_millis(RNDC_INITIAL_BACKOFF_MS)
        .max_delay(Duration::from_millis(RNDC_MAX_BACKOFF_MS))
        .map(jitter)
        .take(MAX_RNDC_RETRIES);

    // Retry the RNDC call; if every attempt fails, the error path below opens the circuit.
    let result = Retry::spawn(retry_strategy, || async {
        execute_rndc(cmd, server).await
    }).await;
match result {
Ok(output) => Ok(output),
Err(e) => {
// Circuit breaker: mark server as unhealthy
mark_server_unhealthy(server).await;
Err(RndcError::CircuitOpen {
server: server.to_string(),
last_error: e.to_string(),
})
}
}
}
use dashmap::DashMap;
use once_cell::sync::Lazy;
use std::time::Instant;
use tracing::{info, warn}; // or log::{info, warn}, depending on the logging crate in use

// Track per-server health for the circuit breaker
static SERVER_HEALTH: Lazy<DashMap<String, ServerHealth>> = Lazy::new(DashMap::new);
#[derive(Clone)]
struct ServerHealth {
failures: usize,
last_failure: Instant,
circuit_open: bool,
}
async fn mark_server_unhealthy(server: &str) {
    let mut health = SERVER_HEALTH
        .entry(server.to_string())
        .or_insert(ServerHealth {
            failures: 0,
            last_failure: Instant::now(),
            circuit_open: false,
        });

    // Reset the failure count if the previous failure is outside the 1-minute window.
    if health.last_failure.elapsed() > Duration::from_secs(60) {
        health.failures = 0;
    }
    health.failures += 1;
    health.last_failure = Instant::now();

    // Open the circuit breaker after 5 failures within the window.
    if health.failures >= 5 && !health.circuit_open {
        health.circuit_open = true;
        warn!("Circuit breaker OPEN for server {}", server);

        // Close the circuit again after a 60-second cool-down.
        let server = server.to_string();
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_secs(60)).await;
            if let Some(mut health) = SERVER_HEALTH.get_mut(&server) {
                health.circuit_open = false;
                health.failures = 0;
                info!("Circuit breaker CLOSED for server {}", server);
            }
        });
    }
}
Benefits:
- Prevents RNDC retry storms
- Gives failing servers a 60-second cool-down period
- Recovers automatically when the server comes back
- Protects BIND9 from operator-induced overload
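Phase 2 of the implementation checklist below also calls for Prometheus metrics on circuit breaker state; a minimal sketch, with an assumed metric name and helper functions that `mark_server_unhealthy` and the cool-down task would call:

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

lazy_static! {
    // 1 while the circuit for a given BIND9 server is open, 0 otherwise.
    static ref RNDC_CIRCUIT_OPEN: IntGaugeVec = register_int_gauge_vec!(
        "bindy_rndc_circuit_open",
        "Whether the RNDC circuit breaker is open for a BIND9 server (1 = open)",
        &["server"]
    ).unwrap();
}

// Call when the breaker opens (from mark_server_unhealthy).
fn record_circuit_open(server: &str) {
    RNDC_CIRCUIT_OPEN.with_label_values(&[server]).set(1);
}

// Call when the cool-down task closes the breaker.
fn record_circuit_closed(server: &str) {
    RNDC_CIRCUIT_OPEN.with_label_values(&[server]).set(0);
}
```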
4. Pod Resource Limits (CPU/Memory Throttling)¶
Current Configuration¶
Analysis:
- CPU limit (500m): 0.5 CPU cores, reasonable for most workloads
- Memory limit (512Mi): may be too low for large clusters (1,000+ zones)
- No runtime throttling: the pod is OOMKilled if the memory limit is exceeded
Proposed Configuration¶
resources:
limits:
cpu: 1000m # Increase to 1 CPU core (runaway reconciliation protection)
memory: 1Gi # Increase to 1GB (large cluster support)
requests:
cpu: 200m # Increase to 0.2 CPU cores (better scheduling)
memory: 256Mi # Increase to 256MB (avoid OOMKill on startup)
Tuning Guidelines:
| Cluster Size | DNS Zones | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|---|
| Small (< 100 zones) | < 100 | 100m | 500m | 128Mi | 512Mi |
| Medium (100-1000 zones) | 100-1000 | 200m | 1000m | 256Mi | 1Gi |
| Large (1000-5000 zones) | 1000-5000 | 500m | 2000m | 512Mi | 2Gi |
| Extra Large (> 5000 zones) | > 5000 | 1000m | 4000m | 1Gi | 4Gi |
Monitoring:
Add Prometheus alerts for resource exhaustion:
- alert: OperatorHighCPUUsage
expr: rate(container_cpu_usage_seconds_total{pod=~"bindy-.*"}[5m]) > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Operator CPU usage > 80% for 10 minutes"
- alert: OperatorHighMemoryUsage
expr: container_memory_working_set_bytes{pod=~"bindy-.*"} / container_spec_memory_limit_bytes{pod=~"bindy-.*"} > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Operator memory usage > 80% for 10 minutes"
5. Runaway Reconciliation Detection¶
Proposed Monitoring¶
Add Prometheus metrics to detect reconciliation loops:
use lazy_static::lazy_static;
use prometheus::{
    register_counter_vec, register_histogram_vec, register_int_gauge_vec,
    CounterVec, HistogramVec, IntGaugeVec,
};

lazy_static! {
// Total reconciliations (by resource type and result)
static ref RECONCILE_TOTAL: CounterVec = register_counter_vec!(
"bindy_reconcile_total",
"Total number of reconciliations",
&["resource_type", "result"] // result: success, error, requeue
).unwrap();
// Reconciliation duration
static ref RECONCILE_DURATION: HistogramVec = register_histogram_vec!(
"bindy_reconcile_duration_seconds",
"Reconciliation duration in seconds",
&["resource_type"]
).unwrap();
// Reconciliations in progress
static ref RECONCILE_IN_PROGRESS: IntGaugeVec = register_int_gauge_vec!(
"bindy_reconcile_in_progress",
"Number of reconciliations currently in progress",
&["resource_type"]
).unwrap();
// Requeue rate (indicator of issues)
static ref RECONCILE_REQUEUE_RATE: CounterVec = register_counter_vec!(
"bindy_reconcile_requeue_rate",
"Rate of reconciliation requeues (errors or pending work)",
&["resource_type", "reason"] // reason: error, pending, rate_limit
).unwrap();
}
pub async fn reconcile(zone: Arc<DNSZone>, ctx: Arc<Context>) -> Result<Action, ReconcileError> {
    // Track in-flight reconciliations (the prometheus crate has no RAII guard
    // for gauges, so increment and decrement explicitly).
    let in_progress = RECONCILE_IN_PROGRESS.with_label_values(&["dnszone"]);
    in_progress.inc();
    let timer = RECONCILE_DURATION.with_label_values(&["dnszone"]).start_timer();

    let result = reconcile_inner(zone, ctx).await;

    // kube-rs `Action` is opaque (no public variants to match on), so requeue
    // outcomes ("requeue"/"pending"/"rate_limit") are recorded inside
    // reconcile_inner where the Action is constructed; here we only classify
    // success vs. error.
    match &result {
        Ok(_) => {
            RECONCILE_TOTAL.with_label_values(&["dnszone", "success"]).inc();
        }
        Err(_) => {
            RECONCILE_TOTAL.with_label_values(&["dnszone", "error"]).inc();
            RECONCILE_REQUEUE_RATE.with_label_values(&["dnszone", "error"]).inc();
        }
    }

    timer.observe_duration();
    in_progress.dec();
    result
}
Prometheus Alerts for Runaway Reconciliation:
- alert: RunawayReconciliation
expr: rate(bindy_reconcile_requeue_rate{reason="error"}[5m]) > 10
for: 5m
labels:
severity: critical
annotations:
summary: "Runaway reconciliation loop detected"
description: "{{ $value }} reconciliation errors per second (> 10/sec threshold)"
runbook_url: "https://github.com/firestoned/bindy/blob/main/docs/operations/runaway-reconciliation.md"
Implementation Checklist¶
Phase 1: Rate Limiting (Week 1)¶
- Add `governor` crate dependency to `Cargo.toml`
- Implement global reconciliation rate limiter (10/sec)
- Add ConfigMap keys for rate limit configuration
- Update Kubernetes API client with QPS/burst limits
- Add Prometheus metrics for rate limiting
- Test with 1,000 DNS zones (load testing)
- Document rate limit tuning guidelines
Phase 2: Circuit Breakers (Week 1)¶
- Add `tokio-retry` crate dependency to `Cargo.toml`
- Implement RNDC circuit breaker with exponential backoff
- Track server health (failures, last_failure, circuit_open)
- Add circuit breaker open/close logging
- Test circuit breaker with failing BIND9 server
- Add Prometheus metrics for circuit breaker state
Phase 3: Resource Tuning (Week 2)¶
- Update deployment.yaml with increased resource limits
- Add resource tuning guidelines to documentation
- Create Prometheus alerts for resource exhaustion
- Test with various cluster sizes (100, 1000, 5000 zones)
- Document memory/CPU usage per DNS zone
Phase 4: Monitoring & Alerting (Week 2)¶
- Add reconciliation metrics (total, duration, in_progress, requeue_rate)
- Create Grafana dashboard for reconciliation health
- Add runaway reconciliation alert
- Create runbook for runaway reconciliation incidents
- Document troubleshooting procedures
Testing Plan¶
1. Load Testing (1,000 DNS Zones)¶
# Create 1,000 DNS zones
for i in {1..1000}; do
kubectl apply -f - <<EOF
apiVersion: bindy.firestoned.io/v1beta1
kind: DNSZone
metadata:
name: zone-$i
namespace: default
spec:
zoneName: "zone-$i.example.com"
clusterRef:
name: test-cluster
EOF
done
# Monitor reconciliation rate
kubectl top pod -n dns-system
watch -n 1 'kubectl get events -n dns-system --sort-by=.lastTimestamp | tail -20'
# Check Prometheus metrics
curl -s localhost:8080/metrics | grep bindy_reconcile
Expected Results:
- Reconciliation rate ≤ 10/sec (rate limiter working)
- CPU usage < 80% of limit
- Memory usage < 80% of limit
- No OOMKilled pods
2. Circuit Breaker Testing (Failing BIND9 Server)¶
# Kill BIND9 pod
kubectl delete pod -n dns-system bind9-primary-0
# Watch circuit breaker logs
kubectl logs -n dns-system -l app=bindy --follow | grep -i circuit
# Expected logs:
# WARN Circuit breaker OPEN for server bind9-primary-0.bind9-primary:9530
# INFO Circuit breaker CLOSED for server bind9-primary-0.bind9-primary:9530 (after 60s)