Rhobs implementation#767
Conversation
|
Skipping CI for Draft Pull Request. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: AlexSmithGH The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThe changes introduce a new RHOBS (Red Hat Observability Stack) Loki client package and integrate it into the investigation framework. The controller now manages a Grafana token from environment variables, the OCM client retrieves RHOBS cell endpoints from cluster labels, and investigation resource builders construct RHOBS clients for log querying. The HCP etcd analysis investigation migrates from Dynatrace URL construction to fetching and displaying logs from Loki. Changes
Sequence DiagramsequenceDiagram
participant Ctrl as Controller
participant Dep as Dependencies
participant ResB as ResourceBuilder
participant OCM as OCM Client
participant RHOBS as RHOBS Client
participant Loki as Loki Server
Ctrl->>Dep: initializeDependencies()
Dep->>Dep: Read CAD_GRAFANA_TOKEN env
Ctrl->>ResB: NewResourceBuilder(..., grafanaToken)
ResB->>ResB: Store grafanaToken
Ctrl->>ResB: WithRHOBSClient()
Ctrl->>ResB: Build()
ResB->>OCM: GetClusterInfo(managementClusterID)
OCM-->>ResB: Cluster details
ResB->>OCM: GetRHOBSCell(clusterID)
OCM->>OCM: Query ext config labels<br/>for rhobs-cell
OCM-->>ResB: RHOBSCell endpoint
ResB->>RHOBS: NewClient(BaseURL=RHOBSCell,<br/>Token=grafanaToken)
RHOBS-->>ResB: RHOBS Client
ResB-->>Ctrl: Resources with RHOBSClient
Ctrl->>RHOBS: QueryLogs(logQLQuery, timeRange, limit)
RHOBS->>Loki: HTTP GET query_range<br/>(with auth headers)
Loki-->>RHOBS: QueryRangeResponse JSON
RHOBS->>RHOBS: Parse streams and entries
RHOBS-->>Ctrl: LogQueryResult
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 8 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (8 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
pkg/investigations/etcddatabasequotalowspace/etcddatabasequotalowspace.go (1)
610-612: Consider making the time window configurable or dynamic.The hardcoded 30-minute time window may be insufficient if the analysis job takes longer to complete or if there's significant delay before log ingestion. Consider:
- Using the job creation timestamp as the start time
- Making the window configurable
This is a minor concern since most jobs should complete well within 30 minutes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/investigations/etcddatabasequotalowspace/etcddatabasequotalowspace.go` around lines 610 - 612, Replace the hardcoded 30-minute window (variables now and start computed as now.Add(-30 * time.Minute)) with a configurable or dynamic duration: add a parameter or config value (e.g., windowDuration or etcdQueryWindow) and use that to compute start, and if a job or resource creation timestamp is available (e.g., job.CreationTimestamp or the investigation object’s CreatedAt), use max(job.CreationTimestamp, now.Add(-windowDuration)) as the start time; update any callers to pass the new config/parameter or default to 30 minutes when unspecified.test/set_stage_env.sh (1)
9-9: Shellcheck warnings apply to this and all similar vault export loops.The static analysis hints (SC2163, SC2086) are valid but apply equally to lines 6-8. The pattern is consistent with existing code, so this is acceptable for now. Consider a future refactor to handle values with spaces safely:
💡 Safer alternative pattern (optional)
while IFS='=' read -r key value; do export "$key=$value" done < <(vault kv get -format=json osd-sre/configuration-anomaly-detection/grafana/stg | jq -r '.data.data | to_entries[] | "\(.key)=\(.value)"')🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/set_stage_env.sh` at line 9, The current for-loop that reads vault output into variable v via `for v in $(vault kv get -format=json ... | jq -r "...")` is unsafe for values with spaces and triggers shellcheck warnings SC2163/SC2086; replace this loop in set_stage_env.sh with a while-read pattern that uses a safe IFS and read -r to split key and value (e.g., `while IFS='=' read -r key value; do export "$key=$value"; done < <(vault kv get ... | jq -r '...')`) so the `vault kv get ... | jq -r` pipeline is consumed safely and keys/values with spaces are preserved, and ensure the loop uses the same variable names (key, value) instead of `v`.pkg/investigations/investigation/investigation.go (1)
212-216:WithRHOBSClient()creates an unnecessary dependency on management REST config.The RHOBS client only needs the management cluster name (to look up the RHOBS cell label via OCM API), but
WithManagementRestConfig()forces creation of a backplane REST config to the management cluster. If backplane connectivity to the management cluster fails, RHOBS client creation will fail even though it doesn't actually require that connection.Consider creating a lighter dependency that only ensures the cluster is loaded and HCP info is fetched without requiring the management REST config.
♻️ Suggested approach
func (r *ResourceBuilderT) WithRHOBSClient() ResourceBuilder { - r.WithManagementRestConfig() + r.WithCluster() // Only need cluster info to determine if HCP and get management cluster name r.buildRHOBSClient = true return r }Then ensure
buildRHOBSClientResource()fetches the management cluster name if not already available (similar logic to what's inbuildManagementClusterResourcesbut without creating REST configs).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/investigations/investigation/investigation.go` around lines 212 - 216, WithRHOBSClient currently forces a management REST config via WithManagementRestConfig(); remove that call so WithRHOBSClient() only sets buildRHOBSClient = true. Update buildRHOBSClientResource() to lazily ensure the management cluster name is available (reusing the name-fetching logic from buildManagementClusterResources without creating a REST config or backplane connection) and fail only if the cluster name/ HCP info cannot be resolved, thereby avoiding unnecessary backplane REST config creation when building the RHOBS client.pkg/rhobs/rhobs.go (1)
69-71: Minor:Content-Typeheader is unnecessary for GET requests.
Content-Typedescribes the body of the request, but GET requests don't have a body. While this won't break anything (Loki will ignore it), it could be removed for clarity. Consider usingAccept: application/jsoninstead if you want to indicate the expected response format.♻️ Optional cleanup
req.Header.Set("Authorization", "Bearer "+c.token) - req.Header.Set("Content-Type", "application/json") + req.Header.Set("Accept", "application/json") req.Header.Set("User-Agent", "configuration-anomaly-detection")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/rhobs/rhobs.go` around lines 69 - 71, Remove the unnecessary Content-Type header on GET requests by deleting the req.Header.Set("Content-Type", "application/json") call and, if you want to declare the expected response format, replace it with req.Header.Set("Accept", "application/json"); update the header handling where req is prepared (the existing req.Header.Set("Authorization", ...), req.Header.Set("Content-Type", ...), req.Header.Set("User-Agent", ...) block) so GET requests no longer set Content-Type and instead set Accept when appropriate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/controller.go`:
- Around line 141-145: The controller now requires the environment variable
CAD_GRAFANA_TOKEN (referenced as grafanaToken) but deployments and secrets are
not provisioned; add CAD_GRAFANA_TOKEN to the deployment template as an env var
sourced from a Kubernetes Secret (e.g., secretRef to cad-grafana-token)
following the pattern used for cad-pd-token and cad-ocm-client-secret,
create/update the cad-grafana-token Secret in all environments/CI pipelines, and
ensure the CI/CD manifests and secret provisioning steps are updated so the
controller can read grafanaToken at startup.
In `@pkg/investigations/investigation/investigation.go`:
- Around line 417-426: The error message in buildRHOBSClientResource() is
ambiguous when ManagementClusterName is empty because
buildManagementClusterResources() returned early for non-HCP clusters; update
the check in buildRHOBSClientResource (and any callers like WithRHOBSClient) to
return a clearer error indicating RHOBS/Loki is only supported for HCP clusters
(e.g., mention non-HCP cluster and that management cluster name is unavailable),
or alternatively detect the non-HCP case before attempting RHOBS setup and
return that specific message; reference the buildRHOBSClientResource function
and the builtResources.ManagementClusterName field so you change the error text
and/or add a guard for non-HCP clusters.
---
Nitpick comments:
In `@pkg/investigations/etcddatabasequotalowspace/etcddatabasequotalowspace.go`:
- Around line 610-612: Replace the hardcoded 30-minute window (variables now and
start computed as now.Add(-30 * time.Minute)) with a configurable or dynamic
duration: add a parameter or config value (e.g., windowDuration or
etcdQueryWindow) and use that to compute start, and if a job or resource
creation timestamp is available (e.g., job.CreationTimestamp or the
investigation object’s CreatedAt), use max(job.CreationTimestamp,
now.Add(-windowDuration)) as the start time; update any callers to pass the new
config/parameter or default to 30 minutes when unspecified.
In `@pkg/investigations/investigation/investigation.go`:
- Around line 212-216: WithRHOBSClient currently forces a management REST config
via WithManagementRestConfig(); remove that call so WithRHOBSClient() only sets
buildRHOBSClient = true. Update buildRHOBSClientResource() to lazily ensure the
management cluster name is available (reusing the name-fetching logic from
buildManagementClusterResources without creating a REST config or backplane
connection) and fail only if the cluster name/ HCP info cannot be resolved,
thereby avoiding unnecessary backplane REST config creation when building the
RHOBS client.
In `@pkg/rhobs/rhobs.go`:
- Around line 69-71: Remove the unnecessary Content-Type header on GET requests
by deleting the req.Header.Set("Content-Type", "application/json") call and, if
you want to declare the expected response format, replace it with
req.Header.Set("Accept", "application/json"); update the header handling where
req is prepared (the existing req.Header.Set("Authorization", ...),
req.Header.Set("Content-Type", ...), req.Header.Set("User-Agent", ...) block) so
GET requests no longer set Content-Type and instead set Accept when appropriate.
In `@test/set_stage_env.sh`:
- Line 9: The current for-loop that reads vault output into variable v via `for
v in $(vault kv get -format=json ... | jq -r "...")` is unsafe for values with
spaces and triggers shellcheck warnings SC2163/SC2086; replace this loop in
set_stage_env.sh with a while-read pattern that uses a safe IFS and read -r to
split key and value (e.g., `while IFS='=' read -r key value; do export
"$key=$value"; done < <(vault kv get ... | jq -r '...')`) so the `vault kv get
... | jq -r` pipeline is consumed safely and keys/values with spaces are
preserved, and ensure the loop uses the same variable names (key, value) instead
of `v`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 1bc8288f-7d58-4609-8a56-e846417515fe
📒 Files selected for processing (12)
CLAUDE.mdpkg/controller/controller.gopkg/investigations/etcddatabasequotalowspace/etcddatabasequotalowspace.gopkg/investigations/etcddatabasequotalowspace/etcddatabasequotalowspace_test.gopkg/investigations/investigation/investigation.gopkg/ocm/mock/ocmmock.gopkg/ocm/ocm.gopkg/rhobs/mock/rhobsmock.gopkg/rhobs/rhobs.gopkg/rhobs/rhobs_test.gopkg/rhobs/types.gotest/set_stage_env.sh
| // Load Grafana/RHOBS token for HCP log fetching | ||
| grafanaToken := os.Getenv("CAD_GRAFANA_TOKEN") | ||
| if grafanaToken == "" { | ||
| return nil, fmt.Errorf("missing required environment variable CAD_GRAFANA_TOKEN") | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if CAD_GRAFANA_TOKEN is referenced in deployment manifests or documentation
echo "=== Searching for CAD_GRAFANA_TOKEN references in deployment configs ==="
rg -l "CAD_GRAFANA_TOKEN" --type yaml --type json -g '!*_test*' || echo "No YAML/JSON references found"
echo ""
echo "=== Checking for environment variable documentation ==="
rg -n "CAD_GRAFANA_TOKEN" -g '*.md' -g '!CLAUDE.md'Repository: openshift/configuration-anomaly-detection
Length of output: 244
🏁 Script executed:
#!/bin/bash
echo "=== All references to GrafanaToken in codebase ==="
rg -n "GrafanaToken" --type go
echo ""
echo "=== All references to CAD_GRAFANA_TOKEN ==="
rg -n "CAD_GRAFANA_TOKEN"
echo ""
echo "=== Search for deployment manifests in common locations ==="
find . -type f \( -name "*.yaml" -o -name "*.yml" \) | head -20
echo ""
echo "=== Check for Dockerfile or helm configs ==="
find . -type f \( -name "Dockerfile*" -o -name "Chart.yaml" -o -name "values*.yaml" \) | head -20
echo ""
echo "=== Check if token is used conditionally anywhere ==="
rg -B3 -A3 "GrafanaToken" --type go | head -50Repository: openshift/configuration-anomaly-detection
Length of output: 6638
🏁 Script executed:
#!/bin/bash
echo "=== Check openshift/template.yaml for CAD_GRAFANA_TOKEN ==="
rg -n "CAD_GRAFANA_TOKEN" openshift/template.yaml || echo "Not found in template.yaml"
echo ""
echo "=== Check if template.yaml defines environment variables ==="
rg -B2 -A2 "env:" openshift/template.yaml | head -40
echo ""
echo "=== Check full template for GRAFANA references ==="
rg -n "GRAFANA\|Grafana" openshift/template.yaml
echo ""
echo "=== Check build Dockerfile for env vars ==="
grep -n "ENV CAD_\|ARG CAD_" build/Dockerfile || echo "No CAD env vars in Dockerfile"
echo ""
echo "=== Full context of template.yaml around deployment spec ==="
head -100 openshift/template.yaml | tail -50Repository: openshift/configuration-anomaly-detection
Length of output: 2500
🏁 Script executed:
#!/bin/bash
echo "=== Search for any secret definitions in template or related configs ==="
rg -n "name:.*token\|name:.*grafana\|secretRef" openshift/template.yaml | head -20
echo ""
echo "=== Check if there are parameter definitions for CAD_GRAFANA_TOKEN ==="
rg -n "CAD_GRAFANA_TOKEN\|parameters:" openshift/template.yaml | head -20Repository: openshift/configuration-anomaly-detection
Length of output: 229
Critical: CAD_GRAFANA_TOKEN is required but not defined in deployment configuration.
The code now enforces CAD_GRAFANA_TOKEN as a required environment variable (lines 141-145). The controller will fail to start if this variable is missing. However, openshift/template.yaml does not define this environment variable or configure the corresponding Kubernetes secret, which will cause immediate deployment failures.
Required actions before merging:
- Add
CAD_GRAFANA_TOKENto the deployment template (e.g., as asecretRefforcad-grafana-tokensecret, similar to existingcad-pd-tokenandcad-ocm-client-secret) - Ensure the secret is provisioned in all deployment environments
- Update CI/CD pipelines to provide this token
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/controller.go` around lines 141 - 145, The controller now
requires the environment variable CAD_GRAFANA_TOKEN (referenced as grafanaToken)
but deployments and secrets are not provisioned; add CAD_GRAFANA_TOKEN to the
deployment template as an env var sourced from a Kubernetes Secret (e.g.,
secretRef to cad-grafana-token) following the pattern used for cad-pd-token and
cad-ocm-client-secret, create/update the cad-grafana-token Secret in all
environments/CI pipelines, and ensure the CI/CD manifests and secret
provisioning steps are updated so the controller can read grafanaToken at
startup.
| // buildRHOBSClientResource creates a RHOBS client for fetching logs from RHOBS/Loki | ||
| func (r *ResourceBuilderT) buildRHOBSClientResource() error { | ||
| if r.builtResources.RHOBSClient != nil { | ||
| return nil | ||
| } | ||
|
|
||
| r.builtResources.DynatraceManagementClusterURL = dynatraceURL | ||
| if r.builtResources.RHOBSCell == "" { | ||
| if r.builtResources.ManagementClusterName == "" { | ||
| return fmt.Errorf("management cluster name not available - cannot determine RHOBS cell") | ||
| } |
There was a problem hiding this comment.
RHOBS client fails with unclear error for non-HCP clusters.
If WithRHOBSClient() is called for a non-HCP cluster, buildManagementClusterResources() returns early at lines 311-314 without setting ManagementClusterName. Then buildRHOBSClientResource() fails with "management cluster name not available - cannot determine RHOBS cell" which doesn't clearly indicate that RHOBS is only supported for HCP clusters.
🛠️ Proposed fix for clearer error messaging
func (r *ResourceBuilderT) buildRHOBSClientResource() error {
if r.builtResources.RHOBSClient != nil {
return nil
}
+ if !r.builtResources.IsHCP {
+ return fmt.Errorf("RHOBS client is only available for HCP clusters")
+ }
+
if r.builtResources.RHOBSCell == "" {
if r.builtResources.ManagementClusterName == "" {
return fmt.Errorf("management cluster name not available - cannot determine RHOBS cell")
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // buildRHOBSClientResource creates a RHOBS client for fetching logs from RHOBS/Loki | |
| func (r *ResourceBuilderT) buildRHOBSClientResource() error { | |
| if r.builtResources.RHOBSClient != nil { | |
| return nil | |
| } | |
| r.builtResources.DynatraceManagementClusterURL = dynatraceURL | |
| if r.builtResources.RHOBSCell == "" { | |
| if r.builtResources.ManagementClusterName == "" { | |
| return fmt.Errorf("management cluster name not available - cannot determine RHOBS cell") | |
| } | |
| // buildRHOBSClientResource creates a RHOBS client for fetching logs from RHOBS/Loki | |
| func (r *ResourceBuilderT) buildRHOBSClientResource() error { | |
| if r.builtResources.RHOBSClient != nil { | |
| return nil | |
| } | |
| if !r.builtResources.IsHCP { | |
| return fmt.Errorf("RHOBS client is only available for HCP clusters") | |
| } | |
| if r.builtResources.RHOBSCell == "" { | |
| if r.builtResources.ManagementClusterName == "" { | |
| return fmt.Errorf("management cluster name not available - cannot determine RHOBS cell") | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/investigations/investigation/investigation.go` around lines 417 - 426,
The error message in buildRHOBSClientResource() is ambiguous when
ManagementClusterName is empty because buildManagementClusterResources()
returned early for non-HCP clusters; update the check in
buildRHOBSClientResource (and any callers like WithRHOBSClient) to return a
clearer error indicating RHOBS/Loki is only supported for HCP clusters (e.g.,
mention non-HCP cluster and that management cluster name is unavailable), or
alternatively detect the non-HCP case before attempting RHOBS setup and return
that specific message; reference the buildRHOBSClientResource function and the
builtResources.ManagementClusterName field so you change the error text and/or
add a guard for non-HCP clusters.
What type of PR is this?
(feature/bug/documentation/other)
What this PR does / Why we need it?
Special notes for your reviewer
Test Coverage
Guidelines for CAD investigations
Test coverage checks
Pre-checks (if applicable)
Summary by CodeRabbit
New Features
Configuration
CAD_GRAFANA_TOKENmust be set for Grafana/Loki API access during investigations