CCXDEV-1560 - Optimize the CPU usage of insights-runtime-extractor#1220
CCXDEV-1560 - Optimize the CPU usage of insights-runtime-extractor#1220jmesnil wants to merge 2 commits into
Conversation
|
Skipping CI for Draft Pull Request. |
| Leverage the new POST endpoint in insights-runtime-extractor that accepts a list of container IDs to scan. The insights-operator will: | ||
|
|
||
| 1. Determine which container IDs are needed based on the pods it's processing | ||
| 2. Send only those container IDs to the insights-runtime-extractor via POST request |
There was a problem hiding this comment.
Would batching those requests, instead of sending all 8000 IDs in a single POST, add here any benefits?
There was a problem hiding this comment.
POSTing at most 8000 ids would send a request of maximum 600KB.
I'm not sure that it is worth splitting that into several requests though... But if you want to make that a requirement from the Insights Operator, it's doable
827f2b7 to
18ed0ab
Compare
|
/retest |
2 similar comments
|
/retest |
|
/retest |
|
/retest |
|
/test e2e-gcp-ovn-techpreview |
|
/retest |
|
Warning Review limit reached
More reviews will be available in 20 minutes and 16 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughBuild workload shapes first while collecting container IDs grouped by node, POST those container IDs to the runtime extractor per node, then merge the returned runtime info into the pre-built shapes using an internal runtimeKey. ChangesContainer IDs filtering and deferred runtime merge
Sequence Diagram(s)sequenceDiagram
participant Gatherer as Workload Gatherer
participant PodProc as Pod Processor
participant Shapes as Shape Builder
participant IDs as containerIDsByNode
participant Extractor as insights-runtime-extractor
participant Merger as mergeRuntimeInfoIntoShapes
Gatherer->>PodProc: iterate and process pods
PodProc->>Shapes: calculate container shapes (store runtimeKey, no runtime info)
PodProc->>IDs: addPodContainers (collect container IDs per node)
IDs-->>Gatherer: containerIDsByNode per node
Gatherer->>Extractor: per-node POST /gather_runtime_info {containerIds: [...]}
Extractor-->>Gatherer: workloadRuntimes (filtered)
Gatherer->>Merger: mergeRuntimeInfoIntoShapes(shapes, workloadRuntimes)
Merger-->>Gatherer: shapes populated with RuntimeInfo
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 12 | ❌ 3❌ Failed checks (1 warning, 2 inconclusive)
✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/gatherers/workloads/gather_workloads_runtime_infos.go (2)
80-90:⚠️ Potential issue | 🟠 MajorPropagate node setup failures instead of only logging.
When HTTP client creation or token read fails, the goroutine returns without sending an error result. That makes runtime collection failures invisible to the caller and can mask partial data loss.
Suggested patch
httpCli, err := createHTTPClient() if err != nil { - klog.Errorf("Failed to initialize the HTTP client: %v", err) + nodeWorkloadCh <- workloadRuntimesResult{ + Error: fmt.Errorf("failed to initialize HTTP client for node %s: %w", podInfo.nodeName, err), + } return } tokenData, err := readToken() if err != nil { - klog.Errorf("Failed to read the serviceaccount token: %v", err) + nodeWorkloadCh <- workloadRuntimesResult{ + Error: fmt.Errorf("failed to read serviceaccount token for node %s: %w", podInfo.nodeName, err), + } return }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/gatherers/workloads/gather_workloads_runtime_infos.go` around lines 80 - 90, The goroutine currently swallows failures from createHTTPClient() and readToken() by logging and returning, so callers never learn about errors; change the goroutine to propagate errors over nodeWorkloadCh instead of returning silently: when createHTTPClient() or readToken() returns an error, send a failure result (e.g., a NodeWorkloadRuntimeInfo value that includes an error field or a separate error wrapper) on nodeWorkloadCh referencing createHTTPClient and readToken, then return; ensure existing consumer of nodeWorkloadCh (which expects values from getNodeWorkloadRuntimeInfos) is updated/compatible to handle the error-carrying message.
210-224:⚠️ Potential issue | 🔴 CriticalClose
resp.Bodybefore checking status code.In Go, HTTP non-2xx responses are not errors—
client.Do()returns a valid response even for 4xx/5xx status codes. On non-200 responses (lines 216–223), the function returns beforedefer resp.Body.Close()is registered at line 224, leaking the response body and preventing connection reuse.Suggested patch
resp, err := httpCli.Do(request) if err != nil { return workloadRuntimesResult{ Error: err, } } + defer resp.Body.Close() if resp.StatusCode != 200 { return workloadRuntimesResult{ Error: insightsclient.HttpError{ StatusCode: resp.StatusCode, Err: fmt.Errorf("received unexpected status code %s from %s", resp.Status, url), }, } } - defer resp.Body.Close()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/gatherers/workloads/gather_workloads_runtime_infos.go` around lines 210 - 224, The response body is not closed on non-200 HTTP responses in the routine that calls httpCli.Do(request), leaking connections; move or add resp.Body.Close() so the body is always closed before any early return: after successfully obtaining resp (i.e., after the err == nil check where resp is valid) call defer resp.Body.Close() (or explicitly call resp.Body.Close() before returning the workloadRuntimesResult on non-200 status) to ensure the body is closed when returning an error; update the code paths around httpCli.Do(request), the resp.StatusCode check, and the workloadRuntimesResult returns to use this pattern.
🧹 Nitpick comments (1)
pkg/gatherers/workloads/gather_workloads_info.go (1)
299-307: Consider removing init-containerruntimeKeypopulation if init runtime is intentionally out of scope.
runtimeKeyis set for init containers, but runtime merge currently applies only toshape.Containers. Either merge init runtime too, or avoid storing init keys to reduce ambiguity and minor overhead.Also applies to: 573-582
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/gatherers/workloads/gather_workloads_info.go` around lines 299 - 307, The init-container shapes are being assigned runtimeKey even though runtime merge only applies to shape.Containers; change calculateWorkloadContainerShapes to accept a flag (e.g., disableRuntimeKey or isInit bool) and use it to skip populating runtimeKey when called for init containers, then update the calls that build podShape.InitContainers (where calculateWorkloadContainerShapes is invoked for InitContainers) to pass that flag; leave the regular container call (podShape.Containers) unchanged. Also apply the same change to the other call site mentioned (lines 573-582) so init-container keys are not stored when runtime merge is intentionally out of scope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/prd/prd-0001-container-ids-filtering.md`:
- Around line 152-169: The example mixes variable names and an incompatible
return shape; standardize on a single request variable (use req or request
consistently), use the same http client variable (client or httpCli) when
calling client.Do, and ensure error returns match the function’s return type
(e.g., return workloadRuntimesResult{Error: fmt.Errorf(...)} rather than nil,
fmt.Errorf). Specifically update the block that constructs
gatherRuntimeInfoRequest/reqBody to marshal, create the HTTP request into the
same req variable you set headers on, call the same client variable for Do, and
replace the mismatched return statement with a workloadRuntimesResult value
containing the error.
- Around line 28-92: Add explicit language identifiers to each fenced code block
in the document to satisfy markdownlint MD040 and improve readability: label the
ASCII sequence diagrams as "text" (e.g., the blocks showing the
insights-operator ↔ insights-runtime-extractor flows), label the HTTP
request/response examples as "http" (the GET and POST request blocks and their
bodies), and label the JSON payload/response examples as "json" (the
{"containerIds": [...]} body and any JSON response snippets); apply these
changes to every triple-backtick block in the file including the "Current flow"
and "New flow" code fences.
In `@manifests/10-insights-runtime-extractor-proxy-configmap.yaml`:
- Around line 19-22: The kube-rbac-proxy rule for the path
"/gather_runtime_info" currently omits an explicit HTTP verb, which is treated
as a wildcard; update the config for that rule (the block containing path:
/gather_runtime_info, resourceRequest: false, user.name:
system:serviceaccount:openshift-insights:operator) to include verb: post so only
POST requests are allowed and least-privilege authorization is enforced.
In `@pkg/gatherers/workloads/gather_workloads_info_test.go`:
- Around line 901-949: The test case name "merges runtime info into init
containers" contradicts the assertions (initContainer.RuntimeInfo is expected to
be nil while appContainer.RuntimeInfo is populated); either rename the case to
reflect the actual behavior (e.g., "does not merge runtime info into init
containers") or change the assertions to expect initContainer.RuntimeInfo to be
populated; locate the table entry containing the name string and the checkResult
that references initContainer, appContainer and RuntimeInfo and update the name
or assertions consistently so intent is unambiguous.
---
Outside diff comments:
In `@pkg/gatherers/workloads/gather_workloads_runtime_infos.go`:
- Around line 80-90: The goroutine currently swallows failures from
createHTTPClient() and readToken() by logging and returning, so callers never
learn about errors; change the goroutine to propagate errors over nodeWorkloadCh
instead of returning silently: when createHTTPClient() or readToken() returns an
error, send a failure result (e.g., a NodeWorkloadRuntimeInfo value that
includes an error field or a separate error wrapper) on nodeWorkloadCh
referencing createHTTPClient and readToken, then return; ensure existing
consumer of nodeWorkloadCh (which expects values from
getNodeWorkloadRuntimeInfos) is updated/compatible to handle the error-carrying
message.
- Around line 210-224: The response body is not closed on non-200 HTTP responses
in the routine that calls httpCli.Do(request), leaking connections; move or add
resp.Body.Close() so the body is always closed before any early return: after
successfully obtaining resp (i.e., after the err == nil check where resp is
valid) call defer resp.Body.Close() (or explicitly call resp.Body.Close() before
returning the workloadRuntimesResult on non-200 status) to ensure the body is
closed when returning an error; update the code paths around
httpCli.Do(request), the resp.StatusCode check, and the workloadRuntimesResult
returns to use this pattern.
---
Nitpick comments:
In `@pkg/gatherers/workloads/gather_workloads_info.go`:
- Around line 299-307: The init-container shapes are being assigned runtimeKey
even though runtime merge only applies to shape.Containers; change
calculateWorkloadContainerShapes to accept a flag (e.g., disableRuntimeKey or
isInit bool) and use it to skip populating runtimeKey when called for init
containers, then update the calls that build podShape.InitContainers (where
calculateWorkloadContainerShapes is invoked for InitContainers) to pass that
flag; leave the regular container call (podShape.Containers) unchanged. Also
apply the same change to the other call site mentioned (lines 573-582) so
init-container keys are not stored when runtime merge is intentionally out of
scope.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1b9b547c-36a5-4439-9295-a586e65055bb
📒 Files selected for processing (7)
docs/prd/prd-0001-container-ids-filtering.mdmanifests/10-insights-runtime-extractor-proxy-configmap.yamlpkg/gatherers/workloads/gather_workloads_info.gopkg/gatherers/workloads/gather_workloads_info_test.gopkg/gatherers/workloads/gather_workloads_runtime_infos.gopkg/gatherers/workloads/gather_workloads_runtime_infos_test.gopkg/gatherers/workloads/types.go
Add the PRD that describe the required changes in the Insights Operator. The changes to the insights-runtime-extractor are handled in openshift/insights-runtime-extractor#60. JIRA: https://issues.redhat.com/browse/CCXDEV-15960 Signed-off-by: Jeff Mesnil <jmesnil@redhat.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jmesnil The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/gatherers/workloads/gather_workloads_info.go (1)
283-299: ⚡ Quick winDocument why init containers are excluded from runtime info merging.
The function only iterates over regular containers (
shape.Containers), not init containers. This design is consistent withaddPodContainers(line 64) which only processesContainerStatuses, but the rationale isn't documented. Init containers are typically terminated by gathering time, so excluding them is sensible, but a brief comment would clarify the design for future maintainers.📝 Suggested comment addition
// mergeRuntimeInfoIntoShapes updates the shapes in workloadPods with runtime info. // This is called after shapes are built and runtime info is gathered. +// Note: Only regular containers receive runtime info. Init containers are excluded because +// they are typically terminated by gathering time and not included in containerIDsByNode. func mergeRuntimeInfoIntoShapes(info *workloadPods, runtimeInfos workloadRuntimes) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/gatherers/workloads/gather_workloads_info.go` around lines 283 - 299, The mergeRuntimeInfoIntoShapes function only merges runtime info for regular containers (iterating shape.Containers) and skips init containers; add a concise comment above mergeRuntimeInfoIntoShapes explaining this design: init containers are excluded because addPodContainers only builds runtime keys from ContainerStatuses (regular containers) and init containers are usually terminated by collection time so their runtime info isn’t available or relevant—refer to addPodContainers and ContainerStatuses in the comment so maintainers understand the rationale.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/gatherers/workloads/gather_workloads_info.go`:
- Line 132: Run gofmt on the source to fix the formatting issue in the workloads
package: format pkg/gatherers/workloads/gather_workloads_info.go (the
gatherWorkloadsInfo-related code) with gofmt (or gofmt -w) and commit the change
so the file conforms to gofmt rules and static analysis passes.
---
Nitpick comments:
In `@pkg/gatherers/workloads/gather_workloads_info.go`:
- Around line 283-299: The mergeRuntimeInfoIntoShapes function only merges
runtime info for regular containers (iterating shape.Containers) and skips init
containers; add a concise comment above mergeRuntimeInfoIntoShapes explaining
this design: init containers are excluded because addPodContainers only builds
runtime keys from ContainerStatuses (regular containers) and init containers are
usually terminated by collection time so their runtime info isn’t available or
relevant—refer to addPodContainers and ContainerStatuses in the comment so
maintainers understand the rationale.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 2f88b308-11c6-4827-bb5f-715ef94a87a4
📒 Files selected for processing (7)
docs/prd/prd-0001-container-ids-filtering.mdmanifests/10-insights-runtime-extractor-proxy-configmap.yamlpkg/gatherers/workloads/gather_workloads_info.gopkg/gatherers/workloads/gather_workloads_info_test.gopkg/gatherers/workloads/gather_workloads_runtime_infos.gopkg/gatherers/workloads/gather_workloads_runtime_infos_test.gopkg/gatherers/workloads/types.go
🚧 Files skipped from review as they are similar to previous changes (3)
- manifests/10-insights-runtime-extractor-proxy-configmap.yaml
- pkg/gatherers/workloads/gather_workloads_runtime_infos.go
- pkg/gatherers/workloads/types.go
Implementation Signed-off-by: Jeff Mesnil <jmesnil@redhat.com>
|
/retest |
|
/retest |
|
@jmesnil: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
The errors in ci/prow/insights-runtime-extractor-tests are genuine: Indeed, the PR has changed the HTTP API that is now doing a |
I will update our test suite and then retest it |
Add the PRD that describe the required changes in the Insights Operator.
The changes to the insights-runtime-extractor are handled in openshift/insights-runtime-extractor#60.
JIRA: https://issues.redhat.com/browse/CCXDEV-15960
Changelog
Integration of the insights-runtime-extractor and its API have been changed to collect only runtime data
from the subset of containers that are gathered by the Insights Operator
Breaking Changes
No
References
https://issues.redhat.com/browse/CCXDEV-15960
Summary by CodeRabbit
New Features
Updates
Documentation
Tests