Draft
4 changes: 3 additions & 1 deletion CLAUDE.md
@@ -71,8 +71,9 @@ Pre-initialized clients available in investigation resources:
- **PagerDuty** (`pkg/pagerduty`) - Alert info, incident management, notes
- **K8s** (`pkg/k8s`) - Kubernetes API client
- **osd-network-verifier** (`pkg/networkverifier`) - Network verification
- **RHOBS** (`pkg/rhobs`) - RHOBS Grafana Loki API for HCP log fetching

For HCP clusters, when using `WithManagementRestConfig()`, `WithManagementK8sClient()`, or `WithManagementOCClient()`, the Dynatrace management cluster URL is automatically fetched and available in `r.DynatraceManagementClusterURL`.
For HCP clusters, when using `WithManagementRestConfig()`, `WithManagementK8sClient()`, or `WithManagementOCClient()`, the RHOBS cell endpoint is automatically fetched from the management cluster's external configuration labels and available in `r.RHOBSCell`. A RHOBS client can be created using the `RHOBSCell` endpoint and `CAD_GRAFANA_TOKEN` to fetch logs from Loki.

### Workflow

@@ -99,6 +100,7 @@ For local development (available via `source test/set_stage_env.sh`):
- `PD_SIGNATURE` - PagerDuty webhook signature validation
- `BACKPLANE_URL`, `BACKPLANE_INITIAL_ARN` - Backplane access
- `CAD_PROMETHEUS_PUSHGATEWAY` - Metrics endpoint
- `CAD_GRAFANA_TOKEN` - Service account token for RHOBS Grafana/Loki API access (HCP log fetching)

Optional:
- `BACKPLANE_PROXY` - Required for local development
10 changes: 9 additions & 1 deletion pkg/controller/controller.go
@@ -79,6 +79,7 @@ type Dependencies struct {
BackplaneURL string
BackplaneProxy string
AWSProxy string
GrafanaToken string
ExperimentalEnabled bool
}

@@ -137,6 +138,12 @@ func initializeDependencies() (*Dependencies, error) {
experimentalEnabledVar := os.Getenv("CAD_EXPERIMENTAL_ENABLED")
experimentalEnabled, _ := strconv.ParseBool(experimentalEnabledVar)

// Load Grafana/RHOBS token for HCP log fetching
grafanaToken := os.Getenv("CAD_GRAFANA_TOKEN")
if grafanaToken == "" {
return nil, fmt.Errorf("missing required environment variable CAD_GRAFANA_TOKEN")
}
Comment on lines +141 to +145


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if CAD_GRAFANA_TOKEN is referenced in deployment manifests or documentation
echo "=== Searching for CAD_GRAFANA_TOKEN references in deployment configs ==="
rg -l "CAD_GRAFANA_TOKEN" --type yaml --type json -g '!*_test*' || echo "No YAML/JSON references found"

echo ""
echo "=== Checking for environment variable documentation ==="
rg -n "CAD_GRAFANA_TOKEN" -g '*.md' -g '!CLAUDE.md'

Repository: openshift/configuration-anomaly-detection

Length of output: 244


🏁 Script executed:

#!/bin/bash

echo "=== All references to GrafanaToken in codebase ==="
rg -n "GrafanaToken" --type go

echo ""
echo "=== All references to CAD_GRAFANA_TOKEN ==="
rg -n "CAD_GRAFANA_TOKEN"

echo ""
echo "=== Search for deployment manifests in common locations ==="
find . -type f \( -name "*.yaml" -o -name "*.yml" \) | head -20

echo ""
echo "=== Check for Dockerfile or helm configs ==="
find . -type f \( -name "Dockerfile*" -o -name "Chart.yaml" -o -name "values*.yaml" \) | head -20

echo ""
echo "=== Check if token is used conditionally anywhere ==="
rg -B3 -A3 "GrafanaToken" --type go | head -50

Repository: openshift/configuration-anomaly-detection

Length of output: 6638


🏁 Script executed:

#!/bin/bash

echo "=== Check openshift/template.yaml for CAD_GRAFANA_TOKEN ==="
rg -n "CAD_GRAFANA_TOKEN" openshift/template.yaml || echo "Not found in template.yaml"

echo ""
echo "=== Check if template.yaml defines environment variables ==="
rg -B2 -A2 "env:" openshift/template.yaml | head -40

echo ""
echo "=== Check full template for GRAFANA references ==="
rg -n "GRAFANA\|Grafana" openshift/template.yaml

echo ""
echo "=== Check build Dockerfile for env vars ==="
grep -n "ENV CAD_\|ARG CAD_" build/Dockerfile || echo "No CAD env vars in Dockerfile"

echo ""
echo "=== Full context of template.yaml around deployment spec ==="
head -100 openshift/template.yaml | tail -50

Repository: openshift/configuration-anomaly-detection

Length of output: 2500


🏁 Script executed:

#!/bin/bash

echo "=== Search for any secret definitions in template or related configs ==="
rg -n "name:.*token\|name:.*grafana\|secretRef" openshift/template.yaml | head -20

echo ""
echo "=== Check if there are parameter definitions for CAD_GRAFANA_TOKEN ==="
rg -n "CAD_GRAFANA_TOKEN\|parameters:" openshift/template.yaml | head -20

Repository: openshift/configuration-anomaly-detection

Length of output: 229


Critical: CAD_GRAFANA_TOKEN is required but not defined in deployment configuration.

The code now enforces CAD_GRAFANA_TOKEN as a required environment variable (lines 141-145). The controller will fail to start if this variable is missing. However, openshift/template.yaml does not define this environment variable or configure the corresponding Kubernetes secret, which will cause immediate deployment failures.

Required actions before merging:

  • Add CAD_GRAFANA_TOKEN to the deployment template (e.g., as a secretRef for cad-grafana-token secret, similar to existing cad-pd-token and cad-ocm-client-secret)
  • Ensure the secret is provisioned in all deployment environments
  • Update CI/CD pipelines to provide this token
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/controller.go` around lines 141 - 145, The controller now
requires the environment variable CAD_GRAFANA_TOKEN (referenced as grafanaToken)
but deployments and secrets are not provisioned; add CAD_GRAFANA_TOKEN to the
deployment template as an env var sourced from a Kubernetes Secret (e.g.,
secretRef to cad-grafana-token) following the pattern used for cad-pd-token and
cad-ocm-client-secret, create/update the cad-grafana-token Secret in all
environments/CI pipelines, and ensure the CI/CD manifests and secret
provisioning steps are updated so the controller can read grafanaToken at
startup.


// Create OCM client
ocmClient, err := ocm.New(ocmClientID, ocmClientSecret, ocmURL)
if err != nil {
@@ -160,6 +167,7 @@ func initializeDependencies() (*Dependencies, error) {
BackplaneURL: backplaneURL,
BackplaneProxy: backplaneProxy,
AWSProxy: awsProxy,
GrafanaToken: grafanaToken,
ExperimentalEnabled: experimentalEnabled,
}, nil
}
@@ -250,7 +258,7 @@ func NewController(opts ControllerOptions, deps *Dependencies) (Controller, erro
func (c *investigationRunner) runInvestigation(ctx context.Context, clusterId string, inv investigation.Investigation, pdClient *pagerduty.SdkClient) error {
metrics.Inc(metrics.Alerts, inv.Name())

builder, err := investigation.NewResourceBuilder(c.ocmClient, c.bpClient, clusterId, inv.Name(), c.dependencies.BackplaneURL)
builder, err := investigation.NewResourceBuilder(c.ocmClient, c.bpClient, clusterId, inv.Name(), c.dependencies.BackplaneURL, c.dependencies.GrafanaToken)
if pdClient != nil {
builder.WithPdClient(pdClient)
}
Original file line number Diff line number Diff line change
@@ -3,7 +3,6 @@ package etcddatabasequotalowspace

import (
"context"
"encoding/json"
"fmt"
"net/url"
"strings"
@@ -19,6 +18,7 @@ import (
"github.com/openshift/configuration-anomaly-detection/pkg/logging"
"github.com/openshift/configuration-anomaly-detection/pkg/metrics"
"github.com/openshift/configuration-anomaly-detection/pkg/notewriter"
"github.com/openshift/configuration-anomaly-detection/pkg/rhobs"
"github.com/openshift/configuration-anomaly-detection/pkg/types"
)

@@ -229,6 +229,7 @@ func (i *Investigation) runHCPEtcdAnalysis(ctx context.Context, rb investigation
r, err := rb.
WithManagementRestConfig().
WithManagementK8sClient().
WithRHOBSClient().
Build()
if err != nil {
if msg, ok := investigation.ClusterAccessErrorMessage(err); ok {
@@ -316,17 +317,6 @@ func (i *Investigation) runHCPEtcdAnalysis(ctx context.Context, rb investigation
r.Notes.AppendAutomation("Created HCP analysis job: %s in namespace %s", etcdAnalysisJob.Name, r.HCPNamespace)

err = waitForJobCompletion(ctx, r.ManagementK8sClient, etcdAnalysisJob.Name, r.HCPNamespace, analysisJobTimeout)

// Add Dynatrace logs query URL to notes if available
if r.DynatraceManagementClusterURL != "" && r.ManagementClusterName != "" {
dynatraceLogsURL := buildDynatraceLogsURL(
r.DynatraceManagementClusterURL,
r.HCPNamespace,
etcdAnalysisJob.Name,
)
r.Notes.AppendSuccess("Note: Click 'Show full note' to access the full URL. Logs may take up to 5 minutes to appear in Dynatrace.\n\nDynatrace Logs: %s", dynatraceLogsURL)
}

if err != nil {
if investigation.IsInfrastructureError(err) {
return result, err
@@ -344,14 +334,34 @@ func (i *Investigation) runHCPEtcdAnalysis(ctx context.Context, rb investigation
return result, nil
}

r.Notes.AppendSuccess("Analysis job completed successfully, fetching logs from RHOBS")

// Fetch logs from RHOBS
logs, err := fetchRHOBSLogs(ctx, r, r.HCPNamespace, etcdAnalysisJob.Name)
if err != nil {
r.Notes.AppendWarning("Failed to fetch RHOBS logs: %v", err)
logging.Errorf("failed to fetch RHOBS logs: %v", err)
result.EtcdDatabaseAnalysis = investigation.InvestigationStep{
Performed: true,
Labels: []string{"failure", "rhobs_logs_failed"},
}
result.Actions = append(
executor.NoteAndReportFrom(r.Notes, r.Cluster.ID(), i.Name()),
executor.Escalate("Failed to fetch RHOBS logs - manual investigation required"),
)
return result, nil
}

r.Notes.AppendSuccess("Successfully fetched logs from RHOBS\n\n%s", logs)

result.EtcdDatabaseAnalysis = investigation.InvestigationStep{
Performed: true,
Labels: []string{"success", "completed"},
}

result.Actions = append(
executor.NoteAndReportFrom(r.Notes, r.Cluster.ID(), i.Name()),
executor.Escalate("HCP etcd analysis complete - see dynatrace logs for details"),
executor.Escalate("HCP etcd analysis complete - see logs above for details"),
)
return result, nil
}
@@ -576,41 +586,49 @@ func getEtcdctlContainerImage(pod *corev1.Pod) (string, error) {
return "", fmt.Errorf("etcdctl container image not found in pod: %s", pod.Name)
}

// buildDynatraceLogsURL constructs a Dynatrace UI URL with a DQL query for the analysis job logs
func buildDynatraceLogsURL(baseURL, namespace, jobId string) string {
query := fmt.Sprintf(
`fetch logs, from:now()-1h | filter matchesValue(event.type, "LOG") and (matchesValue(k8s.namespace.name, "%s")) and (matchesValue(k8s.pod.name, "%s*")) | sort timestamp desc | limit 1000`,
namespace,
jobId,
)
// buildRHOBSLogsURL constructs a Grafana/RHOBS explore URL with a LogQL query for the analysis job logs
func buildRHOBSLogsURL(rhobsCell, namespace, jobId string) string {
logQLQuery := fmt.Sprintf(`{kubernetes_namespace_name="%s", kubernetes_pod_name=~"%s.*"}`, namespace, jobId)

leftParam := url.QueryEscape(fmt.Sprintf(
`{"datasource":"Loki","queries":[{"refId":"A","expr":"%s","queryType":"range"}],"range":{"from":"now-30m","to":"now"}}`,
logQLQuery,
))

return fmt.Sprintf("https://%s/explore?left=%s", rhobsCell, leftParam)
}

// fetchRHOBSLogs fetches logs from RHOBS Grafana Loki for the analysis job
func fetchRHOBSLogs(ctx context.Context, r *investigation.Resources, namespace, jobName string) (string, error) {
if r.RHOBSClient == nil {
return "", fmt.Errorf("RHOBS client not available")
}

logging.Infof("Fetching logs from RHOBS for job %s in namespace %s", jobName, namespace)

// Build the state object for Dynatrace logs UI
// The order of fields matters for some Dynatrace UI versions
state := map[string]interface{}{
"version": 2,
"dt.timeframe": map[string]string{
"from": "now()-30m",
"to": "now()",
},
"tableConfig": map[string]interface{}{
"columns": []string{"timestamp", "status", "Log message"},
},
"showDqlEditor": true,
"filterFieldQuery": query,
"dt.query": query,
"facetsCollapse": true,
}

jsonBytes, err := json.Marshal(state)
logQLQuery := fmt.Sprintf(`{kubernetes_namespace_name="%s", kubernetes_pod_name=~"%s.*"}`, namespace, jobName)

now := time.Now()
start := now.Add(-30 * time.Minute)

logging.Debugf("Querying RHOBS with LogQL: %s (time range: %s to %s)", logQLQuery, start.Format(time.RFC3339), now.Format(time.RFC3339))

result, err := r.RHOBSClient.QueryLogs(ctx, logQLQuery, start, now, 1000)
if err != nil {
jsonBytes = []byte("{}")
logging.Warnf("failed to marshal Dynatrace state to JSON: %v", err)
return "", fmt.Errorf("failed to query RHOBS logs: %w", err)
}

// URL encode the JSON state
// Note: QueryEscape uses + for spaces, but hash fragments need %20
encodedState := url.QueryEscape(string(jsonBytes))
encodedState = strings.ReplaceAll(encodedState, "+", "%20")
if result.TotalLines == 0 {
logging.Warnf("No logs found in RHOBS for job %s", jobName)
return "No logs found in RHOBS for this job. Logs may take up to 5 minutes to appear.", nil
}

logging.Infof("Successfully fetched %d log lines from RHOBS", result.TotalLines)

formattedLogs := rhobs.FormatLogsForDisplay(result, 100)

exploreURL := buildRHOBSLogsURL(r.RHOBSCell, namespace, jobName)
formattedLogs += fmt.Sprintf("\n\nView full logs in Grafana: %s", exploreURL)

return fmt.Sprintf("%sui/apps/dynatrace.logs/#%s", baseURL, encodedState)
return formattedLogs, nil
}