Audience: Platform Operations teams and SREs responsible for HyperFleet Sentinel deployments.
This runbook provides operational guidance for teams deploying and managing HyperFleet Sentinel in production environments. It serves as the primary reference for:
- Understanding built-in reliability features
- Configuring health probes and monitoring
- Diagnosing and recovering from common failure modes
The Sentinel service is designed with multiple layers of reliability to ensure continuous reconciliation of HyperFleet resources.
What: Sentinel maintains no persistent state between polling cycles.
Implementation:
- All reconciliation decisions are made based on current resource state from the HyperFleet API
- No local databases or persistent storage requirements
- Configuration loaded once at startup from YAML files and environment variables
- Each polling cycle starts fresh from API data
Benefits:
- Simple horizontal scaling (no state coordination needed)
- Fast recovery after restarts (no state reconstruction)
- Eliminates state corruption issues
- Simplified deployment (no persistent volumes)
Operational Impact: Sentinel instances can be stopped/started without data loss. Resource reconciliation continues from the last adapter-reported status.
What: Sentinel responds to SIGTERM/SIGINT signals with controlled shutdown.
Implementation:
- Listens for termination signals during main polling loop
- Completes current polling cycle before exiting
- Maximum shutdown time: 20 seconds for HTTP server shutdown
- Cleans up broker connections gracefully
Configuration:
spec:
template:
spec:
terminationGracePeriodSeconds: 30Operational Impact: Graceful shutdown minimizes event loss by attempting to publish pending events before exit, subject to the grace period.
What: Automatic retry with exponential backoff for HyperFleet API calls.
Implementation:
- Timeout: 5 seconds per API call (configurable via
hyperfleet_api.timeout) - Initial interval: 500ms (first retry after 500ms)
- Max interval: 8 seconds (maximum retry interval)
- Multiplier: 2.0 (doubles interval each retry: 500ms → 1s → 2s → 4s → 8s)
- Randomization: 10% jitter added to prevent thundering herd
- Max elapsed time: 30 seconds total (time-based retry, not attempt-based)
- Failure handling: Logs errors, continues with next resource after max elapsed time
Configuration:
hyperfleet_api:
endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8000
timeout: 5sMetrics: Failed API calls tracked via hyperfleet_sentinel_api_errors_total metric.
Operational Impact: Transient API issues don't stop reconciliation. Service continues polling after API recovery.
What: Automatic retry for message broker publishing failures.
Implementation:
- External library: Retry behavior handled by
hyperfleet-brokerlibrary - Broker support: GCP Pub/Sub and RabbitMQ with library-managed retry logic
- Failure isolation: Failed events logged but don't stop processing of other resources
- Error handling: Log error, record metric, continue to next resource
Note: Specific retry parameters (attempts, timeouts, backoff strategy) are implemented in the external hyperfleet-broker library and not configurable at the Sentinel level.
Configuration Example (GCP Pub/Sub):
# Via broker config file (broker.yaml)
broker:
type: googlepubsub
googlepubsub:
project_id: hyperfleet-prodMetrics: Publishing failures tracked via hyperfleet_sentinel_broker_errors_total metric.
Operational Impact: Temporary broker outages don't cause event loss. Events are retried by the broker library, but durability depends on broker availability and Sentinel remaining active.
What: Failures processing one resource don't affect processing of other resources.
Implementation:
- Each resource evaluated independently in the polling loop
- Decision engine errors logged but processing continues
- Event publishing failures logged but don't stop the polling cycle
- A single API fetch failure affects the entire cycle, but the next cycle retries automatically
Example Flow:
Polling Cycle:
├── Fetch 100 clusters from API
├── Process cluster-1 → Event published
├── Process cluster-2 → Log error, continue
├── Process cluster-3 → Event published
└── Complete cycle, sleep, repeat
Operational Impact: Problematic resources (e.g., malformed data) don't prevent reconciliation of healthy resources.
What: Kubernetes readiness and liveness probes that verify actual service functionality.
Implementation:
Liveness Probe (/healthz):
- Checks poll staleness (dead man's switch)
- Returns 200 OK if last successful poll is within threshold (3 × poll_interval)
- Returns 200 OK before first poll completes (grace period)
- Failure threshold: 3 consecutive failures
- Period: 20 seconds
Readiness Probe (/readyz):
- Checks broker connection health
- Verifies at least one successful poll cycle has completed
- Returns 200 OK when both checks pass
- Returns 200 OK when ready to process traffic
- Period: 10 seconds
Configuration:
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
readinessProbe:
httpGet:
path: /readyz
port: 8080
initialDelaySeconds: 5
periodSeconds: 10Operational Impact: Kubernetes automatically restarts unhealthy pods and removes unready pods from service.
Sentinel supports OpenTelemetry distributed tracing, which is useful for debugging event flow across service boundaries.
Tracing is disabled by default. Enable it by setting the HYPERFLEET_TRACING_ENABLED environment variable:
HYPERFLEET_TRACING_ENABLED=trueConfigure the exporter endpoint to send traces to your collector:
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317If no endpoint is set, traces are written to stdout (useful for local debugging).
By default, Sentinel uses parentbased_traceidratio sampling. To control sampling:
# Sample all traces (useful for debugging a specific issue)
OTEL_TRACES_SAMPLER=always_on
# Sample 10% of traces (reduce overhead in production)
OTEL_TRACES_SAMPLER_ARG=0.1- Find the
trace_idin structured logs (JSON format):{"level":"info","message":"Publishing event","trace_id":"4bf92f3577b34da6a...","span_id":"00f067aa0ba902b7",...} - Search for that
trace_idin your tracing backend (Jaeger, Tempo, etc.) to see the full request flow
Sentinel creates the following spans during each polling cycle:
sentinel.poll— Top-level span for the entire poll cycleGET— HTTP span auto-created byotelhttpfor the HyperFleet API callsentinel.evaluate— One per resource, includeshyperfleet.resource_idandhyperfleet.decision_reasonattributes{topic} publish— Created when an event is published, includesmessaging.system,messaging.destination.name, andmessaging.message.idattributes
Sentinel propagates W3C traceparent context in two ways:
- Outbound API calls: HTTP client is instrumented with
otelhttp, so calls to the HyperFleet API automatically carry trace context - CloudEvents: Published events include a
traceparentextension, allowing downstream adapters to continue the trace
Traces span service boundaries only when the upstream caller includes a traceparent header. Without it, Sentinel starts a new trace root.
Symptoms: Pod restart count increasing, CrashLoopBackOff status
Diagnosis: Check pod logs, resource constraints, configuration errors
Recovery:
- Check logs:
kubectl logs -l app.kubernetes.io/name=sentinel --previous - Verify resource limits:
kubectl describe pods -l app.kubernetes.io/name=sentinel - Validate configuration:
kubectl get configmap -l app.kubernetes.io/name=sentinel -o yaml
Alternative commands for specific deployment:
# If you know the Helm release name (e.g., "my-sentinel")
kubectl logs deployment/my-sentinel-sentinel --previous
kubectl get configmap my-sentinel-sentinel-config -o yamlSymptoms: High API error rate, no events published
Diagnosis: API health, network connectivity, authentication
Recovery:
- Test API connectivity:
kubectl exec -l app.kubernetes.io/name=sentinel -- curl hyperfleet-api:8000/health - Check API pod status:
kubectl get pods -l app.kubernetes.io/name=hyperfleet-api - Verify service endpoints:
kubectl get endpoints hyperfleet-api - Check API service:
kubectl get service hyperfleet-api
Note: API endpoint uses port 8000 as configured in values.yaml
Symptoms: High broker error rate, events not reaching adapters
Diagnosis: Broker connectivity, credentials, topic configuration
Recovery:
- Check broker credentials:
kubectl get secret -l app.kubernetes.io/name=sentinel -o yaml - Test RabbitMQ connectivity:
kubectl exec -l app.kubernetes.io/name=sentinel -- nslookup rabbitmq.hyperfleet-system.svc.cluster.local - Check broker health:
kubectl get pods -l app.kubernetes.io/name=rabbitmq - Validate broker config:
kubectl exec -l app.kubernetes.io/name=sentinel -- cat /etc/sentinel/broker.yaml
For specific secret (if you know the Helm release name):
kubectl get secret my-sentinel-sentinel-broker-credentials -o yaml