Metrics¶

Monitor performance and health metrics for Bindy DNS infrastructure.

Operator Metrics¶

Bindy exposes Prometheus-compatible metrics on port 8080 at /metrics. They cover reconciliation outcomes and latency, resource lifecycle events, errors, leader election, and observation lag.

Accessing Metrics¶

The metrics endpoint is exposed on all operator pods:

# Port forward to the operator
kubectl port-forward -n dns-system deployment/bindy-operator 8080:8080

# View metrics
curl http://localhost:8080/metrics

Available Metrics¶

All metrics use the namespace prefix bindy_firestoned_io_.
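Because every operator metric shares this prefix, a scrape can be reduced to the list of Bindy metric names with standard text tools. The following is a minimal sketch; the function name `bindy_metric_names` is hypothetical and not part of Bindy.

```shell
# Print the distinct Bindy metric names found in Prometheus text-format
# input on stdin. HELP/TYPE comment lines and metrics without the
# bindy_firestoned_io_ prefix are dropped.
bindy_metric_names() {
  grep '^bindy_firestoned_io_' |
    sed 's/[{ ].*//' |   # cut the label set and the sample value
    sort -u
}
```

With the port-forward from the previous section running, `curl -s http://localhost:8080/metrics | bindy_metric_names` lists what the operator currently exports. Histograms appear as separate `_bucket`, `_sum`, and `_count` names.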

Reconciliation Metrics¶

bindy_firestoned_io_reconciliations_total (Counter) Total number of reconciliation attempts by resource type and outcome.

Labels:
- resource_type: Kind of resource (Bind9Cluster, Bind9Instance, DNSZone, ARecord, AAAARecord, TXTRecord, CNAMERecord, MXRecord, NSRecord, SRVRecord, CAARecord)
- status: Outcome (success, error, requeue)

# Successful reconciliations per second
rate(bindy_firestoned_io_reconciliations_total{status="success"}[5m])

# Error rate by resource type
sum(rate(bindy_firestoned_io_reconciliations_total{status="error"}[5m])) by (resource_type)
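When no Prometheus server is available, a rough success ratio can be read from a single scrape. The sketch below uses a hypothetical helper `reconcile_success_ratio` and assumes samples carry no trailing timestamp, so the value is the last field on each line.

```shell
# Read Prometheus text-format samples on stdin and print the fraction of
# reconciliations labelled status="success", across all resource types.
# Counters are cumulative, so this is the ratio since the scraped pod
# started, not a windowed rate like the PromQL queries.
reconcile_success_ratio() {
  awk '
    /^bindy_firestoned_io_reconciliations_total[{]/ {
      total += $NF
      if ($0 ~ /status="success"/) ok += $NF
    }
    END {
      if (total > 0) printf "%.4f\n", ok / total
      else print "no samples"
    }
  '
}
```

For example, `curl -s http://localhost:8080/metrics | reconcile_success_ratio` prints a value between 0 and 1.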

bindy_firestoned_io_reconciliation_duration_seconds (Histogram) Duration of reconciliation operations in seconds.

Labels:
- resource_type: Kind of resource

Buckets: 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0

# Average reconciliation duration
rate(bindy_firestoned_io_reconciliation_duration_seconds_sum[5m])
/ rate(bindy_firestoned_io_reconciliation_duration_seconds_count[5m])

# 95th percentile latency
histogram_quantile(0.95, sum(rate(bindy_firestoned_io_reconciliation_duration_seconds_bucket[5m])) by (le))
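The average-duration query divides the histogram's `_sum` series by its `_count` series. The same arithmetic can be applied to one scrape, as in this sketch; the helper name `mean_reconcile_seconds` is hypothetical, and the result is a lifetime mean rather than a 5-minute window.

```shell
# Print the mean reconciliation duration in seconds for one resource type,
# computed from the histogram's _sum and _count series on stdin.
# Usage: mean_reconcile_seconds DNSZone < scrape.txt
mean_reconcile_seconds() {
  awk -v want="resource_type=\"$1\"" '
    index($0, want) == 0 { next }   # skip other resource types
    /^bindy_firestoned_io_reconciliation_duration_seconds_sum/   { sum += $NF }
    /^bindy_firestoned_io_reconciliation_duration_seconds_count/ { count += $NF }
    END {
      if (count > 0) printf "%.3f\n", sum / count
      else print "no samples"
    }
  '
}
```

A mean hides tail latency, so it complements the percentile query rather than replacing it.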

bindy_firestoned_io_requeues_total (Counter) Total number of requeue operations.

Labels:
- resource_type: Kind of resource
- reason: Reason for requeue (error, rate_limit, dependency_wait)

# Requeue rate by reason
sum(rate(bindy_firestoned_io_requeues_total[5m])) by (reason)

Resource Lifecycle Metrics¶

bindy_firestoned_io_resources_created_total (Counter) Total number of resources created.

Labels:
- resource_type: Kind of resource

bindy_firestoned_io_resources_updated_total (Counter) Total number of resources updated.

Labels:
- resource_type: Kind of resource

bindy_firestoned_io_resources_deleted_total (Counter) Total number of resources deleted.

Labels:
- resource_type: Kind of resource

bindy_firestoned_io_resources_active (Gauge) Currently active resources being tracked.

Labels:
- resource_type: Kind of resource

# Resource creation rate
rate(bindy_firestoned_io_resources_created_total[5m])

# Active resources by type
bindy_firestoned_io_resources_active

Error Metrics¶

bindy_firestoned_io_errors_total (Counter) Total number of errors by resource type and category.

Labels:
- resource_type: Kind of resource
- error_type: Category (api_error, validation_error, network_error, timeout, reconcile_error)

# Error rate by type
sum(rate(bindy_firestoned_io_errors_total[5m])) by (error_type)

# Errors by resource type
sum(rate(bindy_firestoned_io_errors_total[5m])) by (resource_type)
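The `sum(...) by (label)` grouping can be approximated on a single scrape to get cumulative totals per label value. This is a sketch only: `sum_by_label` is a hypothetical helper, and it assumes label values contain no escaped quotes and that samples have no trailing timestamp.

```shell
# Sum a metric's samples grouped by one label, reading text-format input
# on stdin. Output is "<label value> <total>", one line per value, sorted.
# Usage: sum_by_label bindy_firestoned_io_errors_total resource_type < scrape.txt
sum_by_label() {
  awk -v metric="$1" -v label="$2" '
    index($0, metric "{") == 1 {
      # Match {label="value" or ,label="value" so that a label such as
      # "type" is not found inside "resource_type".
      pat = "[{,]" label "=\"[^\"]*\""
      if (match($0, pat)) {
        value = substr($0, RSTART + length(label) + 3, RLENGTH - length(label) - 4)
        totals[value] += $NF
      }
    }
    END { for (v in totals) printf "%s %s\n", v, totals[v] }
  ' | sort
}
```

Passing `error_type` instead of `resource_type` as the second argument gives the breakdown by error category.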

Leader Election Metrics¶

bindy_firestoned_io_leader_elections_total (Counter) Total number of leader election events.

Labels:
- status: Event type (acquired, lost, renewed)

bindy_firestoned_io_leader_status (Gauge) Current leader election status (1 = leader, 0 = follower).

Labels:
- pod_name: Name of the pod

# Current leader
bindy_firestoned_io_leader_status == 1

# Leader election rate
rate(bindy_firestoned_io_leader_elections_total[5m])
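Outside Prometheus, the leader can be identified from raw scrape text. The sketch below uses a hypothetical helper `current_leader`. It only sees the pods whose scrapes are fed to it, so a port-forward to a single pod reports that pod's status alone.

```shell
# Print the pod_name label of every sample reporting leader_status 1.
# Reads Prometheus text-format input on stdin.
current_leader() {
  awk '
    /^bindy_firestoned_io_leader_status[{]/ && $NF == 1 {
      # pod_name=" is 10 characters; strip it and the closing quote.
      if (match($0, /pod_name="[^"]*"/))
        print substr($0, RSTART + 10, RLENGTH - 11)
    }
  '
}
```

More than one name in the output across all pods would suggest that two replicas both believe they hold the lease.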

Performance Metrics¶

bindy_firestoned_io_generation_observation_lag_seconds (Histogram) Lag between resource spec generation change and operator observation.

Labels:
- resource_type: Kind of resource

Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0

# Average observation lag
rate(bindy_firestoned_io_generation_observation_lag_seconds_sum[5m])
/ rate(bindy_firestoned_io_generation_observation_lag_seconds_count[5m])

Prometheus Configuration¶

The operator deployment includes Prometheus scrape annotations:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"

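Prometheus discovers and scrapes these pods only when its scrape configuration uses Kubernetes service discovery with relabel rules that honor the prometheus.io/* annotations. The annotations are a convention and have no effect on their own; Prometheus Operator setups use ServiceMonitor or PodMonitor resources instead.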
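A minimal sketch of a scrape job that honors these annotations, assuming pod-level discovery and a hypothetical job name `kubernetes-pods`. It is a fragment of a Prometheus configuration, not something shipped with Bindy.

```yaml
scrape_configs:
  - job_name: kubernetes-pods        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when it is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the prometheus.io/port value
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```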

Example Queries¶

# Reconciliation success rate (last 5 minutes)
sum(rate(bindy_firestoned_io_reconciliations_total{status="success"}[5m]))
/ sum(rate(bindy_firestoned_io_reconciliations_total[5m]))

# DNSZone reconciliation p95 latency
histogram_quantile(0.95,
  sum(rate(bindy_firestoned_io_reconciliation_duration_seconds_bucket{resource_type="DNSZone"}[5m])) by (le)
)

# Top 10 resource types by error rate (last hour)
topk(10,
  sum(rate(bindy_firestoned_io_errors_total[1h])) by (resource_type)
)

# Active resources per type
sum(bindy_firestoned_io_resources_active) by (resource_type)

# Requeue rate by resource type and reason
sum(rate(bindy_firestoned_io_requeues_total[5m])) by (resource_type, reason)

Grafana Dashboard¶

Import the Bindy operator dashboard (coming soon) or create custom panels using the queries above.

Recommended panels:
1. Reconciliation Rate - Total reconciliations/sec by resource type
2. Reconciliation Latency - P50, P95, P99 latencies
3. Error Rate - Errors/sec by resource type and error category
4. Active Resources - Gauge showing current active resources
5. Leader Status - Current leader pod and election events
6. Resource Lifecycle - Created/Updated/Deleted rates

Resource Metrics¶

Pod Metrics¶

View CPU and memory usage:

# All DNS pods
kubectl top pods -n dns-system

# Specific instance
kubectl top pods -n dns-system -l instance=primary-dns

# Sort by CPU
kubectl top pods -n dns-system --sort-by=cpu

# Sort by memory
kubectl top pods -n dns-system --sort-by=memory

Node Metrics¶

# Node resource usage
kubectl top nodes

# Detailed node info
kubectl describe node <node-name>

DNS Query Metrics¶

Using BIND9 Statistics¶

Enable the BIND9 statistics channel (future enhancement):

spec:
  config:
    statisticsChannels:
      - address: "127.0.0.1"
        port: 8053

Query Counters¶

Monitor query rate and types:
- Total queries received
- Queries by record type (A, AAAA, MX, etc.)
- Successful vs failed queries
- NXDOMAIN responses

Performance Metrics¶

Query Latency¶

Measure DNS query response time:

# Test query latency
time dig @<dns-server-ip> example.com

# Multiple queries for average
for i in {1..10}; do time dig @<dns-server-ip> example.com +short; done
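`time` measures the whole dig process, including startup. dig also reports its own measurement on a line such as `;; Query time: 12 msec`, which can be averaged over several lookups. A minimal sketch, with `avg_query_msec` as a hypothetical helper name:

```shell
# Average the "Query time" values from dig statistics output on stdin.
# Each matching line looks like ";; Query time: 12 msec", so the
# millisecond value is the fourth whitespace-separated field.
avg_query_msec() {
  awk '
    /Query time:/ { total += $4; n++ }
    END {
      if (n > 0) printf "%.1f msec over %d queries\n", total / n, n
      else print "no Query time lines found"
    }
  '
}
```

Feed it the statistics section of repeated lookups, for example `for i in $(seq 1 10); do dig @<dns-server-ip> example.com +noall +stats; done | avg_query_msec`. The `+short` option suppresses the statistics line, so it is not used here. dig reports whole milliseconds, so fast answers often show 0 msec.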

Zone Transfer Metrics¶

Monitor zone transfer performance:
- Transfer duration
- Transfer size
- Transfer failures
- Lag between primary and secondary
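Lag between primary and secondary can be approximated by comparing SOA serial numbers. The sketch below uses hypothetical helpers `soa_serial` and `serial_lag`, and assumes serials only increase, ignoring the wrap-around rules of DNS serial arithmetic.

```shell
# Extract the serial (third field) from `dig +short SOA` output on stdin.
# A short-form SOA answer has seven fields:
#   mname rname serial refresh retry expire minimum
soa_serial() {
  awk 'NF >= 7 { print $3; exit }'
}

# Print how many serial increments the secondary is behind the primary.
# Usage: serial_lag <primary-serial> <secondary-serial>
serial_lag() {
  echo $(( $1 - $2 ))
}
```

A typical use is to run `dig @<primary-ip> example.com SOA +short | soa_serial` against each server (the address placeholders are illustrative) and pass both results to `serial_lag`. A result of 0 means the secondary has caught up.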

Kubernetes Metrics¶

Resource Utilization¶

# View resource requests vs limits
kubectl describe pod -n dns-system <pod-name> | grep -A5 "Limits:\|Requests:"

Pod Health¶

# Pod status and restarts
kubectl get pods -n dns-system -o wide

# Events
kubectl get events -n dns-system --sort-by='.lastTimestamp'
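To pick out pods that restart often, the RESTARTS column can be filtered. A minimal sketch, with `pods_restarted_over` as a hypothetical helper; it compares the total restart count kubectl reports, not restarts per hour.

```shell
# List pods whose RESTARTS column exceeds a threshold.
# Reads `kubectl get pods --no-headers` output on stdin, where the columns
# are NAME READY STATUS RESTARTS AGE. Newer kubectl versions append
# "(10m ago)" after the count; "$4 + 0" keeps only the number.
# Usage: kubectl get pods -n dns-system --no-headers | pods_restarted_over 3
pods_restarted_over() {
  awk -v limit="$1" '$4 + 0 > limit + 0 { print $1, $4 }'
}
```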

Prometheus Integration¶

BIND9 Exporter¶

Deploy bind_exporter as a sidecar (future enhancement):

containers:
- name: bind-exporter
  image: prometheuscommunity/bind-exporter:latest
  args:
    - "--bind.stats-url=http://localhost:8053"
  ports:
    - name: metrics
      containerPort: 9119

Service Monitor¶

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: bindy-metrics
spec:
  selector:
    matchLabels:
      app: bind9
  endpoints:
  - port: metrics
    interval: 30s

Key Metrics to Monitor¶

Query Rate - Queries per second
Query Latency - Response time
Error Rate - Failed queries percentage
Cache Hit Ratio - Cache effectiveness
Zone Transfer Status - Success/failure of transfers
Resource Usage - CPU and memory utilization
Pod Health - Running vs desired replicas

Grafana Dashboards¶

Create dashboards for:

DNS Overview¶

Total query rate
Average latency
Error rate
Top queried domains

Instance Health¶

Pod status
CPU/memory usage
Restart count
Network I/O

Zone Management¶

Zones count
Records per zone
Zone transfer status
Serial numbers

Alerting Thresholds¶

Recommended alert thresholds:

Metric	Warning	Critical
CPU Usage	> 70%	> 90%
Memory Usage	> 70%	> 90%
Query Latency	> 100ms	> 500ms
Error Rate	> 1%	> 5%
Pod Restarts	> 3/hour	> 10/hour
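For ad hoc checks in scripts, a measured value can be compared against a row of this table. The sketch below uses a hypothetical helper `alert_level` and treats both thresholds as strict "greater than", matching the table.

```shell
# Classify a measured value against warning and critical thresholds.
# Prints "critical", "warning", or "ok". Comparisons are numeric and strict.
# Usage: alert_level <value> <warning> <critical>
alert_level() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}
```

For example, `alert_level 75 70 90` reports a warning for 75% CPU usage. In production the same thresholds would normally live in Prometheus alerting rules rather than in shell.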

Best Practices¶

Baseline metrics - Establish normal operating ranges
Set appropriate alerts - Avoid alert fatigue
Monitor trends - Look for gradual degradation
Capacity planning - Use metrics to plan scaling
Regular review - Review dashboards weekly