Audience: Operators deploying and configuring Sentinel service.
This comprehensive guide teaches operators how to deploy, configure, and operate the HyperFleet Sentinel service—a polling-based event publisher that drives cluster lifecycle orchestration.
Appendices:
Sentinel is the component within the HyperFleet system responsible for triggering reconciliation events for changes in managed API resources. It acts as the centralized trigger mechanism for the orchestration platform.
Key Benefits:
- Single source of truth - Centralized decision logic for when reconciliation should occur
- Configurable polling strategy - Different reconciliation rates for stable vs transitional resources
- Broker abstraction - Supports multiple message broker backends
- Horizontal scalability - Multiple instances can distribute workload through resource filtering
Core Responsibilities:
- Poll HyperFleet API for resource updates at configurable intervals
- Evaluate resources using a decision engine to determine when events should be published
- Publish CloudEvents to a message broker
- Track metrics for observability and debugging
Architecture Overview:
graph LR
A[Sentinel] -->|Poll| B[HyperFleet API]
B -->|Resources| A
A -->|Evaluate| C[Decision Engine]
C -->|Publish Events| D[Message Broker]
D -->|Consume Events| E[Adapters]
E -->|Update Status| B
Sentinel publishes events to a message broker, which fans out messages to downstream adapters. It uses a fully configurable CEL-based decision engine (message_decision) that determines when to publish. The default configuration covers four triggers:
- New resource: Publish immediately for resources that have never been processed (generation 1, not reconciled)
- Generation mismatch: Publish immediately when the resource spec has changed but not yet been fully processed
- Reconciled and stale: Publish periodically for stable resources to ensure eventual consistency
- Not reconciled, debounced: Publish more frequently for transitional resources, with a debounce to prevent event storms
Deploy Sentinel when you need:
- Event-driven orchestration for cluster lifecycle management
- Centralized reconciliation logic instead of distributed polling by each adapter
- Configurable polling intervals with different rates for reconciled vs not-reconciled resources
- Horizontal scaling through resource filtering
- Broker abstraction to support multiple message broker backends
Sentinel's decision engine is fully CEL-based: all publish/skip logic is expressed as named CEL parameters and a boolean result expression configured in message_decision. There is no hardcoded Go-level decision logic — the entire strategy is defined in configuration.
Each poll cycle, the engine:
- Evaluates named
paramsin authored order, accumulating intermediate values - Evaluates the
resultexpression using all param values - Publishes an event if
resultistrue; otherwise skips
How Sentinel Reads Resource State:
When Sentinel polls the HyperFleet API, it retrieves cluster or nodepool resources with their current state. Two CEL variables are always available during evaluation:
resource— the API resource as a map (id,kind,href,generation,created_time,updated_time,labels,owner_references,metadata)now— the current evaluation timestamp (timestamptype)
The condition(name) CEL function looks up a status condition by type name (e.g., condition("Reconciled")). Each condition exposes: status, observed_generation, last_updated_time, last_transition_time, reason, message. If the condition is absent, all fields are zero values (empty strings, 0 for observed_generation), so CEL expressions can guard safely with ref_time != "".
message_decision has two parts:
params— a list of named CEL expressions evaluated in order. Each param can referenceresource,now,condition(), and any previously defined param. Params are declared in dependency order: ifBusesA,Amust appear first.result— a boolean CEL expression evaluated after all params, combining them into the final publish decision.
Outcomes:
| Result | ShouldPublish |
Reason logged |
|---|---|---|
result is true |
true |
"message decision matched" |
result is false |
false |
"message decision result is false" |
If any param or the result expression fails to evaluate (type error, missing variable, etc.) the resource is skipped and the error is logged with the failing param name.
When message_decision is omitted from the config file, Sentinel uses the following default. It implements four triggers:
| Param | CEL Expression | Trigger |
|---|---|---|
ref_time |
condition("Reconciled").last_updated_time |
Reference timestamp from Reconciled condition |
is_reconciled |
condition("Reconciled").status == "True" |
Resource reconciled status |
has_ref_time |
ref_time != "" |
Guard: Reconciled condition exists |
is_new_resource |
resource.generation == 1 && !has_ref_time |
New resource: no Reconciled condition yet |
generation_mismatch |
resource.generation > condition("Reconciled").observed_generation |
Spec changed but not yet processed |
reconciled_and_stale |
is_reconciled && has_ref_time && now - timestamp(ref_time) > duration("30m") |
Stable resource drifting past 30 min |
not_reconciled_and_debounced |
!is_reconciled && has_ref_time && now - timestamp(ref_time) > duration("10s") |
Transitional resource debounced past 10 s |
Result: is_new_resource || generation_mismatch || reconciled_and_stale || not_reconciled_and_debounced
What each trigger covers:
is_new_resource— Catches brand-new resources withgeneration == 1that have no Reconciled condition yet (adapter has never seen them). Publishes within one poll interval of creation. Once the adapter creates the Reconciled condition,not_reconciled_and_debouncedtakes over with its 10 s debounce.generation_mismatch— Catches spec changes immediately: the HyperFleet API incrementsgenerationon every spec update; adapters incrementobserved_generationon the Reconciled condition as they process it. A gap means unprocessed changes.reconciled_and_stale— Ensures eventual consistency on stable resources by re-publishing periodically even when the spec is in sync, handling external drift and transient failures.not_reconciled_and_debounced— Drives faster re-publishing for transitional resources, while the debounce prevents event storms.
Customization: Override any threshold by providing your own
message_decisionin the config file. See the dev example (configs/dev-example.yaml) which uses2m/5sinstead of30m/10sfor faster local iteration.
graph TD
START[Poll Resource] --> EVAL[Evaluate CEL params in order]
EVAL --> RESULT{result expression}
RESULT -->|true| PUB[Publish event<br/>reason: message decision matched]
RESULT -->|false| SKIP[Skip<br/>reason: message decision result is false]
RESULT -->|eval error| ERR[Skip<br/>reason: param/result evaluation failed]
With the default config, the result expression is:
is_new_resource || generation_mismatch || reconciled_and_stale || not_reconciled_and_debounced
The params feed into it in this logical order:
ref_time,is_reconciled,has_ref_time— extract condition stateis_new_resource,generation_mismatch— immediate triggers (no time check needed)reconciled_and_stale,not_reconciled_and_debounced— time-based triggers (guarded byhas_ref_time)
Key Takeaways:
- All logic is configurable — the engine itself has no hardcoded triggers
- Params are short-circuit safe —
has_ref_timeguards time expressions against missing conditions - Single outcome — regardless of which param fired, the reason is always
"message decision matched"
Scalability note: The CEL decision engine evaluates all resources matching the label selector in-memory each poll cycle. At large scale (thousands of resources), use
resource_selectorto shard resources across multiple Sentinel instances. A futureserver_filtersconfig field is planned to allow server-side pre-filtering before CEL evaluation.
Resource filtering enables horizontal scaling by allowing operators to distribute resources across multiple Sentinel instances using label-based selectors.
How It Works:
The resource_selector field defines label key-value pairs that filter which resources a Sentinel instance watches:
resource_selector:
- label: region
value: us-east-1
- label: environment
value: productionSelection Logic:
- Empty selector (
[]): Watch ALL resources (default behavior) - Single selector: Match resources with that label
- Multiple selectors: Match resources with ALL labels (AND logic)
Common Filtering Patterns:
| Pattern | Example | Use Case |
|---|---|---|
| Regional | region: us-east, region: eu-west |
Geographic distribution |
| Environment | environment: prod, environment: staging |
Isolation by environment |
| Index-based | index: "1", index: "2", index: "3" |
Numeric distribution for high volume |
| Combined | region: us-east + environment: prod |
Multi-dimensional filtering |
Architecture Diagram:
graph TB
API[HyperFleet API<br/>All Resources]
API --> S1[Sentinel US-East Clusters<br/>resource_type: clusters<br/>selector: region=us-east]
API --> S2[Sentinel US-West Clusters<br/>resource_type: clusters<br/>selector: region=us-west]
API --> S3[Sentinel EU-Central NodePools<br/>resource_type: nodepools<br/>selector: region=eu-central]
S1 --> B1[Broker Topic:<br/>us-east-clusters]
S2 --> B2[Broker Topic:<br/>us-west-clusters]
S3 --> B3[Broker Topic:<br/>eu-central-nodepools]
B1 --> A1a[Adapter US-East-1]
B1 --> A1b[Adapter US-East-2]
B2 --> A2a[Adapter US-West-1]
B2 --> A2b[Adapter US-West-2]
B3 --> A3[Adapter EU-Central]
Important Considerations:
- No Coordination: Sentinel instances operate independently with no coordination
- Coverage Responsibility: Operators must ensure all resources are covered by selectors
- Overlap Allowed: Multiple instances can watch the same resource (events will be duplicated)
- Gaps Dangerous: Resources not matching any selector will never reconcile
Broker Topic Isolation:
When using multiple filtered instances, consider using separate broker topics to enable independent processing:
# US-East instance
broker:
topic: hyperfleet-us-east-clusters
# US-West instance
broker:
topic: hyperfleet-us-west-clustersFor multi-region deployment examples using resource_selector, see Resource Selector
Strategies.
For detailed deployment examples, see docs/multi-instance-deployment.md.
Sentinel uses YAML-based configuration with environment variable overrides for sensitive data. Configuration is loaded from a file specified via the --config flag.
Basic Structure:
# Required: Resource type to watch
resource_type: clusters
# Optional: Polling interval
poll_interval: 5s
# Optional: Resource filtering
resource_selector:
- label: region
value: us-east-1
# Required: HyperFleet API configuration
hyperfleet_api:
endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8000
timeout: 5s
# Optional: CloudEvent payload definition
message_data:
id: "resource.id"
kind: "resource.kind"
href: "resource.href"
generation: "resource.generation"Configuration Precedence:
- Environment variables (highest)
- Configuration file
- Built-in defaults (lowest)
These fields MUST be present in the configuration file or Sentinel will fail to start:
| Field | Type | Description | Example |
|---|---|---|---|
resource_type |
string | Resource to watch (clusters or nodepools) |
clusters |
hyperfleet_api.endpoint |
string | HyperFleet API base URL | http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8000 |
These fields have sensible defaults and can be omitted:
| Field | Type | Default | Description |
|---|---|---|---|
poll_interval |
duration | 5s |
How often to poll the API for resource updates |
message_decision |
object | See defaults | CEL-based decision logic (params + result expression) |
hyperfleet_api.timeout |
duration | 5s |
Request timeout for API calls |
resource_selector |
array | [] |
Label selectors for filtering (empty = all resources) |
message_data |
map | {} |
CEL expressions for CloudEvents payload |
topic |
string | "" |
Override broker topic name (defaults to Helm template) |
The resource_selector field enables horizontal scaling by filtering resources based on labels.
Structure:
resource_selector:
- label: <label-key>
value: <label-value>
- label: <another-key>
value: <another-value>Behavior:
- Empty (
[]or omitted): Watch ALL resources - Single selector: Match resources with that specific label
- Multiple selectors: Match resources with ALL labels (AND logic)
The message_data field defines the CloudEvents payload structure using Common Expression Language (CEL) expressions.
How It Works:
- Each key-value pair in
message_databecomes a field in the CloudEventdatapayload - Values are CEL expressions evaluated with access to two variables:
resource- The HyperFleet resource object (cluster or nodepool)reason- The decision reason string ("message decision matched"when publishing)
- Nested maps create nested objects in the payload
Available CEL Variables:
| Variable | Type | Description | Example Fields |
|---|---|---|---|
resource |
Resource | The HyperFleet resource | id, kind, href, generation, labels, created_time, updated_time, owner_references, metadata |
reason |
string | Decision engine reason | "message decision matched", "message decision result is false" |
CEL Expression Syntax:
message_data:
# Field access
id: "resource.id"
# Nested field access
cluster_id: "resource.owner_references.id"
# Conditionals (ternary operator) — use condition() function to access status conditions
reconciled_status: 'condition("Reconciled").status == "True" ? "Reconciled" : "NotReconciled"'
# String literals (must use quotes inside CEL expression)
source: '"hyperfleet-sentinel"'
# Numeric/boolean literals
generation: "resource.generation"
# Nested objects
owner_references: "resource.owner_references"Cluster Pattern:
message_data:
id: "resource.id"
kind: "resource.kind"
href: "resource.href"
generation: "resource.generation"NodePool Pattern with Owner References:
message_data:
id: "resource.id"
kind: "resource.kind"
href: "resource.href"
generation: "resource.generation"
owner_references: "resource.owner_references"Validation:
-
All leaf values MUST be non-empty CEL expression strings
-
Empty values or
nilwill cause configuration validation failure:Error: invalid config: message_data.id: empty CEL expression is not allowed
CloudEvents Output:
The message_data configuration produces CloudEvents with the following structure:
{
"specversion": "1.0",
"type": "com.redhat.hyperfleet.cluster.reconcile",
"source": "hyperfleet-sentinel",
"id": "uuid-generated",
"time": "2025-01-01T10:00:00Z",
"datacontenttype": "application/json",
"data": {
// Your message_data CEL expressions evaluated here
"id": "cluster-abc123",
"kind": "Cluster",
"href": "/api/v1/clusters/cluster-abc123",
"generation": 5
}
}Broker configuration is managed by the hyperfleet-broker library. Configuration is split between:
- broker.yaml - Non-sensitive broker settings (type, project ID, etc.)
- Environment variables - Sensitive credentials and connection strings
Configuration File: broker.yaml
broker:
type: rabbitmq # or googlepubsub
# RabbitMQ-specific settings
rabbitmq:
exchange_type: topic
# URL should be set via BROKER_RABBITMQ_URL env var
# Google Pub/Sub-specific settings
googlepubsub:
project_id: my-gcp-project
# Credentials via GOOGLE_APPLICATION_CREDENTIALS or ADCEnvironment Variables:
| Variable | Broker | Description | Example |
|---|---|---|---|
BROKER_RABBITMQ_URL |
RabbitMQ | Complete connection URL with credentials | amqp://user:pass@localhost:5672/vhost |
BROKER_GOOGLEPUBSUB_PROJECT_ID |
Pub/Sub | GCP project ID | my-gcp-project |
GOOGLE_APPLICATION_CREDENTIALS |
Pub/Sub | Service account key path (local dev/testing only, production GKE uses Workload Identity) | /path/to/key.json |
HYPERFLEET_BROKER_TOPIC |
Both | Topic name for publishing events | hyperfleet-prod-clusters |
BROKER_CONFIG_FILE |
Both | Path to broker config file | /app/broker.yaml |
PUBSUB_EMULATOR_HOST |
Pub/Sub | Pub/Sub emulator endpoint (local dev/testing only) | localhost:8085 |
Topic Naming:
The HYPERFLEET_BROKER_TOPIC environment variable sets the topic name where events are published. You can use any naming convention that fits your deployment requirements.
Example topic names:
hyperfleet-prod-clustersus-east-clusters
When using the provided Helm chart, the default template uses {namespace}-{resource_type} (e.g., hyperfleet-dev-clusters), but this can be overridden by setting the HYPERFLEET_BROKER_TOPIC environment variable or the broker.topic Helm value.
Broker Type: RabbitMQ
# broker.yaml
broker:
type: rabbitmq
rabbitmq:
exchange_type: topic# Environment variables
export BROKER_RABBITMQ_URL="amqp://guest:guest@localhost:5672/"
export HYPERFLEET_BROKER_TOPIC="hyperfleet-dev-clusters"Broker Type: Google Pub/Sub
# broker.yaml
broker:
type: googlepubsub
googlepubsub:
project_id: hcm-hyperfleetLocal Development Authentication:
For local development and testing, use one of these authentication methods:
# Set the topic name
export HYPERFLEET_BROKER_TOPIC="hyperfleet-dev-clusters"
# Option 1: Use personal Application Default Credentials (local development)
gcloud auth application-default login
# Option 2: Use service account key file (local development/testing)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"Production GKE Authentication:
When deploying to production GKE, use Workload Identity Federation for authentication. This method does not require GOOGLE_APPLICATION_CREDENTIALS or gcloud auth commands - credentials are automatically provided to the pod:
# Only the topic name is required via environment variable
export HYPERFLEET_BROKER_TOPIC="hyperfleet-prod-clusters"
# No GOOGLE_APPLICATION_CREDENTIALS needed - Workload Identity handles authenticationFor Workload Identity setup instructions, see Configure Workload Identity.
Broker Configuration Reference:
For complete broker configuration options, see the hyperfleet-broker documentation.
Follow this checklist to ensure successful Sentinel deployment and operation.
Define Resource Monitoring Scope
- Determine
resource_typeto monitor:clustersornodepools - Define
resource_selectorlabels for filtering resources- Leave empty (
[]) to monitor all resources - Use label selectors for horizontal scaling
- Reference: Resource Filtering
- Leave empty (
Configure Reconciliation Parameters
- Review and adjust polling and decision parameters:
poll_interval- How often to poll the HyperFleet API (default:5s)message_decision- CEL-based decision logic with configurable thresholds- Reference: Optional Fields
Design CloudEvents Payload
- Define
message_dataCEL expressions for event payload structure - Reference: Message Data (CEL Expressions)
Configure HyperFleet API Connection
- Ensure HyperFleet API is deployed and accessible
- Critical: Sentinel performs connectivity verification at startup and will fail to start if the API is unavailable
- Rolling updates will not succeed until API connectivity is restored
- Set
hyperfleet_api.endpointto HyperFleet API URL - Adjust
hyperfleet_api.timeoutif needed (default:5s) - Reference: Required Fields
Select and Configure Broker
- Choose broker type:
rabbitmqorgooglepubsub - Reference: Broker Configuration
Provision Broker Infrastructure
- RabbitMQ:
- Create exchange and queues
- Configure topic routing keys
- Set retention and delivery policies
- Google Pub/Sub:
- Create topic with your chosen naming convention (e.g.,
hyperfleet-prod-clusters) - Configure message retention duration
- Set up dead-letter topic (optional)
- Create topic with your chosen naming convention (e.g.,
Configure Authentication and Permissions
- RabbitMQ:
- Create credentials for Sentinel service
- Grant publish permissions to exchange
- Prepare
BROKER_RABBITMQ_URLconnection string
- Google Pub/Sub:
- Configure Workload Identity Federation (GKE) or service account
- Grant
roles/pubsub.publisherrole to Sentinel service account
Deploy with Helm
-
Update Helm chart
values.yamlwith:- Sentinel configuration (
configsection) - Broker configuration (
brokersection) - Sensitive credentials in
secretsor reference to existing secrets
- Sentinel configuration (
-
Install Sentinel using Helm chart:
helm install sentinel oci://quay.io/redhat-services-prod/hyperfleet-tenant/hyperfleet/hyperfleet-sentinel-chart \ --namespace hyperfleet-system \ --values values.yaml
-
Verify deployment:
kubectl get deployment -n hyperfleet-system sentinel kubectl get pods -n hyperfleet-system -l app.kubernetes.io/name=sentinel
-
Reference: Helm Chart README
Verify Service Health
-
Check health endpoint:
curl http://<sentinel-service>:8080/healthz -
Check readiness endpoint:
curl http://<sentinel-service>:8080/readyz- Note: The
/readyzendpoint returnsfalseuntil the first successful poll completes and broker health checks pass. Pods intentionally stay unready during initial startup. - If startup latency causes false-positive readiness probe failures, tune the Kubernetes readiness probe timing (e.g., increase
initialDelaySecondsorperiodSeconds) in your Helm values.
- Note: The
-
Review pod logs for startup errors:
kubectl logs -n hyperfleet-system -l app.kubernetes.io/name=sentinel
-
Verify Sentinel is publishing events:
kubectl logs -n hyperfleet-system -l app.kubernetes.io/name=sentinel | grep -E "Fetched resources|Trigger cycle completed"
Expected log output when Sentinel is operating correctly:
Fetched resources count=15 label_selectors=1 topic=hyperfleet-dev-clusters subset=clusters Trigger cycle completed total=15 published=3 skipped=12 duration=0.125s topic=hyperfleet-dev-clusters subset=clusterscount- Number of resources fetched from the API matching the resource selectorpublished- Number of events published (message_decisionresult wastrue)skipped- Number of resources skipped (no reconciliation needed)
For detailed deployment guidance, see docs/deployment.md
- Running Sentinel - Detailed guide for GKE deployments
- Metrics Documentation - Complete metrics catalog with PromQL examples
- Multi-Instance Deployment - Horizontal scaling strategies
- Testcontainers Documentation - Integration testing with testcontainers
- Helm Chart README - Helm chart configuration reference
- Common Expression Language (CEL) - CEL specification for
message_data - CloudEvents - CloudEvents specification
| Symptom | Likely Cause | Solution |
|---|---|---|
| Events not published, resources not found | Resource selector mismatch | Verify resource_selector matches resource labels. Empty selector watches ALL resources. Check logs: kubectl logs -n hyperfleet-system -l app.kubernetes.io/name=sentinel |
| Events not published, resources found but skipped | Decision result is false | Normal behavior (reason: "message decision result is false"). Events publish when the message_decision result expression evaluates to true. With the default config: new resource (generation==1 && !reconciled), generation mismatch, reconciled and stale >30m, or not reconciled and debounced >10s. |
| API connection errors, DNS lookup fails | Wrong service name or namespace | Verify endpoint format: http://<service>.<namespace>.svc.cluster.local:8000. Check API is running: kubectl get pods -n hyperfleet-system -l app=hyperfleet-api |
| API returns 401 Unauthorized | Missing authentication | Add auth headers to hyperfleet_api config if API requires authentication. |
| API returns 404 Not Found | Wrong API version in path | Verify endpoint uses correct API version: /api/v1/clusters or /api/hyperfleet/v1/clusters |
| Broker PermissionDenied (Pub/Sub) | Missing publisher role | Grant role: gcloud projects add-iam-policy-binding ${GCP_PROJECT} --role="roles/pubsub.publisher" --member="principal://iam.googleapis.com/..." |
| Broker Topic not found (Pub/Sub) | Topic doesn't exist | Create topic: gcloud pubsub topics create hyperfleet-prod-clusters --project=${GCP_PROJECT} |
| Broker type mismatch | Config doesn't match actual broker | Ensure broker.type matches: rabbitmq or googlepubsub. Check: kubectl get configmap sentinel -o jsonpath='{.data.broker\.yaml}' |
| High CPU/memory usage | Too many resources or slow API | Check kubectl top pod -n hyperfleet-system -l app=sentinel. Consider horizontal scaling with resource_selector or increase poll intervals. |
| Error: resource_type is required | Missing required config field | Add resource_type: clusters or resource_type: nodepools to configuration. |
| Error: invalid resource_type | Invalid value | Use only clusters or nodepools. |
| Error: hyperfleet_api.endpoint is required | Missing required config field | Add hyperfleet_api.endpoint: http://hyperfleet-api.hyperfleet-system.svc.cluster.local:8000 |
| Error: poll_interval must be positive | Zero or negative interval | Set poll_interval: 5s (must be > 0). |
| Error: OpenAPI client not generated | Missing generated code | Run make generate && make build before starting Sentinel. |
| Pods stay unready after startup | Normal startup behavior | The /readyz endpoint returns false until the first successful poll completes and broker health checks pass. This is expected. If readiness probe failures persist beyond initial startup, check pod logs and broker connectivity. Tune probe timing (e.g., increase initialDelaySeconds) in Helm values if needed. |
| Health/readiness endpoints return errors | Configuration validation failed | Check pod logs for startup errors: kubectl logs -n hyperfleet-system -l app.kubernetes.io/name=sentinel. Verify all required config fields. |