# Error Handling and Retry Logic

Bindy implements robust error handling for DNS record reconciliation, ensuring the operator never crashes when it encounters failures. Instead, it updates status conditions, creates Kubernetes Events, and automatically retries with exponential backoff.
## Overview

When reconciling DNS records, several failure scenarios can occur:

- **DNSZone not found**: No matching DNSZone resource exists
- **RNDC key loading fails**: Cannot load the RNDC authentication Secret
- **BIND9 connection fails**: Unable to connect to the BIND9 server
- **Record operation fails**: BIND9 rejects the record operation
- **HTTP API errors**: Temporary failures from the bindcar sidecar (503, 502, etc.)
- **Kubernetes API errors**: Rate limiting (429), server errors (5xx), network failures

Bindy handles all of these scenarios gracefully with:

- ✅ Status condition updates following Kubernetes conventions
- ✅ Kubernetes Events for visibility
- ✅ Automatic retry with exponential backoff for HTTP and Kubernetes API calls
- ✅ Configurable retry intervals for reconciliation loops
- ✅ Idempotent operations safe for multiple retries
- ✅ Smart error classification (retryable vs. permanent errors)
## Automatic Retry Infrastructure

Bindy includes automatic retry logic with exponential backoff for transient failures in HTTP and Kubernetes API calls. This ensures resilience to temporary issues without manual intervention.
### HTTP API Retry (Bindcar Operations)

All BIND9 zone management operations (via the bindcar HTTP API sidecar) automatically retry on transient failures.

**Retryable HTTP status codes:**

- `429` - Too Many Requests (rate limiting)
- `500` - Internal Server Error
- `502` - Bad Gateway (proxy/load balancer issues)
- `503` - Service Unavailable (temporary unavailability)
- `504` - Gateway Timeout
- Network errors - connection failures, DNS resolution errors

**Non-retryable HTTP status codes:**

- `4xx` (except 429) - client errors (400 Bad Request, 404 Not Found, 401 Unauthorized, etc.). These indicate permanent errors that retrying won't fix.

**HTTP retry configuration:**

- Initial retry: 50ms
- Max interval: 10 seconds between retries
- Max duration: 2 minutes total retry time
- Backoff multiplier: 2.0 (exponential growth)
- Randomization: ±10% (prevents thundering herd)
**Retry schedule example:**

```
Attempt 1:   immediate (0ms)
Attempt 2:   50ms after failure
Attempt 3:   ~100ms after failure
Attempt 4:   ~200ms after failure
Attempt 5:   ~400ms after failure
Attempt 6:   ~800ms after failure
Attempt 7:   ~1.6s after failure
Attempt 8:   ~3.2s after failure
Attempt 9:   ~6.4s after failure
Attempt 10+: 10s intervals until 2 minutes have elapsed
```
**Affected operations:**

All bindcar HTTP API operations automatically benefit from retry:

- Zone creation (`create_zone()`)
- Zone updates (`update_zone()`)
- Zone deletion (`delete_zone()`)
- Zone reload (`reload_zone()`)
- Zone listing (`list_zones()`)
- Zone status checks (`get_zone_status()`)
- Primary/secondary zone operations
**Example log output:**

```
WARN Retryable HTTP API error, will retry
    method=POST
    url=http://bind9-primary-api:8080/api/v1/zones
    attempt=3
    retry_after=200ms
    error=HTTP request 'POST http://...' failed with status 503: Service temporarily unavailable
INFO HTTP API call succeeded after retries
    method=POST
    url=http://bind9-primary-api:8080/api/v1/zones
    attempt=4
    elapsed=850ms
```
### Kubernetes API Retry

Kubernetes API calls will automatically retry on transient failures (implementation planned for the reconcilers).

**Retryable Kubernetes errors:**

- HTTP 429 - Too Many Requests (rate limiting)
- HTTP 5xx - server errors (API server temporary issues)
- Network errors - connection failures, timeouts

**Non-retryable Kubernetes errors:**

- HTTP 4xx (except 429) - client errors (not found, unauthorized, forbidden, etc.)

**Kubernetes retry configuration:**

- Initial retry: 100ms
- Max interval: 30 seconds between retries
- Max duration: 5 minutes total retry time
- Backoff multiplier: 2.0 (exponential growth)
- Randomization: ±10% (prevents thundering herd)

**Why this differs from the HTTP API retry:**

- The Kubernetes API server is a remote endpoint with variable network latency
- API server issues may take longer to resolve
- Hence a longer max interval (30s vs 10s) and total retry time (5min vs 2min)
- And a slightly slower initial retry (100ms vs 50ms)
### Benefits of Automatic Retry

- **Resilience to transient failures**: Temporary network blips, pod restarts, and service unavailability don't cause permanent failures
- **No manual intervention**: Operators don't need to manually requeue or restart failed operations
- **Smart error classification**: Permanent errors (404, 400, etc.) fail immediately without wasting time on retries
- **Exponential backoff**: Prevents overwhelming services during recovery while still providing fast retries for quick issues
- **Thundering herd prevention**: ±10% randomization prevents all operators from retrying simultaneously
- **Observability**: Structured logging provides visibility into retry attempts, duration, and outcomes
## Configuration

### Retry Interval

Control how long to wait before retrying failed DNS record operations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bindy-operator
  namespace: bindy-system
spec:
  template:
    spec:
      containers:
        - name: bindy
          image: ghcr.io/firestoned/bindy:latest
          env:
            - name: BINDY_RECORD_RETRY_SECONDS
              value: "60"  # Default: 30 seconds
```

**Recommendations:**

- Development: 10-15 seconds for faster iteration
- Production: 30-60 seconds to avoid overwhelming the API server
- High-load environments: 60-120 seconds to reduce reconciliation pressure
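For a quick change without editing the manifest, `kubectl set env` updates the variable in place (deployment name and namespace taken from the example above) and triggers a rolling restart:

```bash
kubectl set env deployment/bindy-operator -n bindy-system BINDY_RECORD_RETRY_SECONDS=60
```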
## Error Scenarios

### 1. DNSZone Not Found

**Scenario**: A DNS record references a zone that doesn't exist.

```yaml
apiVersion: bindy.firestoned.io/v1beta1
kind: ARecord
metadata:
  name: www-example
  namespace: dns-system
spec:
  zone: example.com  # No DNSZone with zoneName: example.com exists
  name: www
  ipv4Address: 192.0.2.1
```
**Status:**

```yaml
status:
  conditions:
    - type: Ready
      status: "False"
      reason: ZoneNotFound
      message: "No DNSZone found for zone example.com in namespace dns-system"
      lastTransitionTime: "2025-11-29T23:45:00Z"
      observedGeneration: 1
```
**Event:**

```
Type     Reason        Message
Warning  ZoneNotFound  No DNSZone found for zone example.com in namespace dns-system
```
**Resolution**: Create the DNSZone resource:

```yaml
apiVersion: bindy.firestoned.io/v1beta1
kind: DNSZone
metadata:
  name: example-com
  namespace: dns-system
spec:
  zoneName: example.com
  clusterRef: bind9-primary
```
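After applying the manifest, waiting for the zone to become Ready (the same pattern as in Best Practices below) confirms that dependent records can now reconcile:

```bash
kubectl apply -f dnszone.yaml
kubectl wait --for=condition=Ready dnszone/example-com -n dns-system --timeout=60s
```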
### 2. RNDC Key Load Failed

**Scenario**: Cannot load the RNDC authentication Secret.

**Status:**

```yaml
status:
  conditions:
    - type: Ready
      status: "False"
      reason: RndcKeyLoadFailed
      message: "Failed to load RNDC key for cluster bind9-primary: Secret bind9-primary-rndc-key not found"
      lastTransitionTime: "2025-11-29T23:45:00Z"
```
**Event:**

```
Type     Reason             Message
Warning  RndcKeyLoadFailed  Failed to load RNDC key for cluster bind9-primary: Secret bind9-primary-rndc-key not found
```
**Resolution**:

1. Check if the Secret exists (see the sketch after this list).
2. Verify the Bind9Instance is running and has created its Secret.
3. If missing, the Bind9Instance reconciler should create it automatically.
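A minimal sketch of steps 1 and 2, using the Secret and cluster names from the status message above (the lowercase `bind9instance` resource name is an assumption based on the CRD kind):

```bash
# Step 1: does the RNDC Secret exist?
kubectl get secret bind9-primary-rndc-key -n dns-system

# Step 2: is the Bind9Instance present and Ready?
kubectl get bind9instance bind9-primary -n dns-system
```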
### 3. BIND9 Connection Failed
**Scenario**: Cannot connect to the BIND9 server (network issue, pod not ready, etc.).

**Status:**

```yaml
status:
  conditions:
    - type: Ready
      status: "False"
      reason: RecordAddFailed
      message: "Cannot connect to BIND9 server at bind9-primary.dns-system.svc.cluster.local:9530: connection refused. Will retry in 30s"
      lastTransitionTime: "2025-11-29T23:45:00Z"
```
**Event:**

```
Type     Reason           Message
Warning  RecordAddFailed  Cannot connect to BIND9 server at bind9-primary.dns-system.svc.cluster.local:9530
```
**Resolution**:

1. Check BIND9 pod status (see the sketch after this list).
2. Check the BIND9 logs.
3. Verify network connectivity:

   ```bash
   kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
     nc -zv bind9-primary.dns-system.svc.cluster.local 9530
   ```
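For steps 1 and 2, something like the following works; the `app=bind9` label selector is an assumption, so substitute whatever labels your BIND9 pods actually carry:

```bash
# Step 1: is the BIND9 pod running? (label selector is hypothetical)
kubectl get pods -n dns-system -l app=bind9

# Step 2: recent BIND9 logs
kubectl logs -n dns-system -l app=bind9 --tail=50
```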
### 4. Record Created Successfully

**Scenario**: DNS record successfully created in BIND9.

**Status:**

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
      reason: RecordCreated
      message: "A record www.example.com created successfully"
      lastTransitionTime: "2025-11-29T23:45:00Z"
      observedGeneration: 1
```
**Event:**

```
Type    Reason         Message
Normal  RecordCreated  A record www.example.com created successfully
```
## Monitoring

### View Record Status

```bash
# List all DNS records with status
kubectl get arecords,aaaarecords,cnamerecords,mxrecords,txtrecords -A

# Check a specific record's status
kubectl get arecord www-example -n dns-system -o jsonpath='{.status.conditions[0]}' | jq .

# Find failing records
kubectl get arecords -A -o json | \
  jq -r '.items[] | select(.status.conditions[0].status == "False") |
    "\(.metadata.namespace)/\(.metadata.name): \(.status.conditions[0].reason) - \(.status.conditions[0].message)"'
```
### View Events

```bash
# Recent events in the namespace
kubectl get events -n dns-system --sort-by='.lastTimestamp' | tail -20

# Watch events in real time
kubectl get events -n dns-system --watch

# Filter for DNS record events
kubectl get events -n dns-system --field-selector involvedObject.kind=ARecord
```
### Prometheus Metrics

Bindy exposes reconciliation metrics (if enabled):

```promql
# Reconciliation errors by reason
bindy_reconcile_errors_total{resource="ARecord", reason="ZoneNotFound"}

# Reconciliation duration (p95)
histogram_quantile(0.95, rate(bindy_reconcile_duration_seconds_bucket{resource="ARecord"}[5m]))
```
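As a sketch of turning these metrics into alerting (this assumes the metric names above and the Prometheus Operator's PrometheusRule CRD; the rule and alert names are made up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bindy-alerts
  namespace: bindy-system
spec:
  groups:
    - name: bindy.reconcile
      rules:
        - alert: BindyReconcileErrors
          # Fires if reconciliation errors keep occurring for 10 minutes
          expr: rate(bindy_reconcile_errors_total[5m]) > 0
          for: 10m
          labels:
            severity: warning
```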
## Status Reason Codes

| Reason | Status | Meaning | Action Required |
|---|---|---|---|
| `RecordCreated` | `Ready=True` | DNS record successfully created in BIND9 | None - record is operational |
| `ZoneNotFound` | `Ready=False` | No matching DNSZone resource exists | Create DNSZone or fix zone reference |
| `RndcKeyLoadFailed` | `Ready=False` | Cannot load RNDC key Secret | Verify Bind9Instance is running and Secret exists |
| `RecordAddFailed` | `Ready=False` | Failed to communicate with BIND9 or add record | Check BIND9 pod status and network connectivity |
## Idempotent Operations

All BIND9 operations are idempotent, making them safe for operator retries.

### add_zones / add_primary_zone / add_secondary_zone

- `add_zones`: Centralized dispatcher that routes to `add_primary_zone` or `add_secondary_zone` based on zone type
- `add_primary_zone`: Checks if the zone exists before attempting to add a primary zone
- `add_secondary_zone`: Checks if the zone exists before attempting to add a secondary zone
- All functions return success if the zone already exists
- Safe to call multiple times (idempotent)
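As a rough illustration (only the endpoint URL is taken from the log example earlier; the request shape and `zone.json` payload are hypothetical), idempotency means repeating the same zone-creation call is harmless:

```bash
# First call creates the zone; the second finds it already exists
# and still reports success rather than an error.
curl -fsS -X POST http://bind9-primary-api:8080/api/v1/zones \
  -H 'Content-Type: application/json' -d @zone.json
curl -fsS -X POST http://bind9-primary-api:8080/api/v1/zones \
  -H 'Content-Type: application/json' -d @zone.json
```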
### reload_zone

- Returns a clear error if the zone doesn't exist
- Otherwise performs the reload operation
- Safe to call multiple times

### Record Operations

- All record add/update operations are idempotent
- Retrying a failed operation won't create duplicates
- The operator can safely requeue failed reconciliations
## Best Practices

### 1. Monitor Status Conditions

Always check status conditions when debugging DNS record issues. Look for the `Status` section showing the current conditions.
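For example, `kubectl describe` shows both the conditions and the associated Events (record name and namespace from the earlier scenarios):

```bash
# Prints the Status conditions and recent Events for the record
kubectl describe arecord www-example -n dns-system
```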
### 2. Use Events for Troubleshooting

Events provide a timeline of what happened:
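For example, scoped to a single record (field-selector values taken from the ARecord example above):

```bash
kubectl get events -n dns-system \
  --field-selector involvedObject.kind=ARecord,involvedObject.name=www-example \
  --sort-by='.lastTimestamp'
```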
### 3. Adjust the Retry Interval for Your Needs

- Fast feedback during development: `BINDY_RECORD_RETRY_SECONDS=10`
- Production stability: `BINDY_RECORD_RETRY_SECONDS=60`
- High-load clusters: `BINDY_RECORD_RETRY_SECONDS=120`
### 4. Create DNSZones Before Records

To avoid ZoneNotFound errors, always create DNSZone resources before creating DNS records:

```bash
# 1. Create the DNSZone
kubectl apply -f dnszone.yaml

# 2. Wait for it to be ready
kubectl wait --for=condition=Ready dnszone/example-com -n dns-system --timeout=60s

# 3. Create the DNS records
kubectl apply -f records/
```
### 5. Use Labels for Organization

Tag related resources for easier monitoring:

```yaml
apiVersion: bindy.firestoned.io/v1beta1
kind: ARecord
metadata:
  name: www-example
  namespace: dns-system
  labels:
    app: web-frontend
    environment: production
spec:
  zone: example.com
  name: www
  ipv4Address: 192.0.2.1
```

Then filter:
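A label selector query along these lines works (labels from the manifest above):

```bash
kubectl get arecords -n dns-system -l app=web-frontend,environment=production
```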
## Troubleshooting Guide

### Record Stuck in "ZoneNotFound"

- Verify the DNSZone exists
- Check that the zone name matches the record's `spec.zone`
- Ensure the record and the DNSZone are in the same namespace
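Commands along these lines cover all three checks (resource and record names are the ones used in the examples on this page; adjust to yours):

```bash
# Does a DNSZone exist in the namespace?
kubectl get dnszones -n dns-system

# Does its zoneName match the record's spec.zone?
kubectl get dnszone example-com -n dns-system -o jsonpath='{.spec.zoneName}'
kubectl get arecord www-example -n dns-system -o jsonpath='{.spec.zone}'
```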
Record Stuck in "RndcKeyLoadFailed"¶
- Check Secret exists:
- Verify Bind9Instance is Ready:
- Check Bind9Instance logs:
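For instance (the Secret and instance names come from the RNDC scenario above; the lowercase `bind9instance` resource name and the `app=bind9` label selector are assumptions):

```bash
# Secret present?
kubectl get secret bind9-primary-rndc-key -n dns-system

# Bind9Instance Ready?
kubectl get bind9instance bind9-primary -n dns-system

# Logs from the BIND9 pods (label selector is hypothetical)
kubectl logs -n dns-system -l app=bind9 --tail=100
```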
Record Stuck in "RecordAddFailed"¶
- Check BIND9 pod is running:
- Test network connectivity:
- Check BIND9 logs for errors:
- Verify RNDC is listening on port 9530:
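A possible sequence; the netshoot connectivity check is the one shown in the connection-failure scenario above, while the label selector and the `deploy/bind9-primary` exec target are guesses to adapt to your deployment:

```bash
# Pod running? (label selector is hypothetical)
kubectl get pods -n dns-system -l app=bind9

# Connectivity to the RNDC port
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
  nc -zv bind9-primary.dns-system.svc.cluster.local 9530

# BIND9 logs
kubectl logs -n dns-system -l app=bind9 --tail=100

# RNDC listener check from inside the pod (exec target is hypothetical)
kubectl exec -n dns-system deploy/bind9-primary -- rndc -p 9530 status
```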
## See Also

- Debugging Guide - detailed debugging procedures
- Logging Configuration - configure operator logging levels
- Bind9Instance Reference - BIND9 instance configuration
- DNSZone Reference - DNS zone configuration