Releases: kubescape/node-agent
Release v0.3.132
Summary
This PR addresses the critical eBPF agent deadlock and secondary OOM crashes under high load.
Key Improvements
- Deadlock Resolution:
  - Decoupled `RBCache` mutation from notifier channel sends by introducing an internal `notificationQueue` (capacity 10,000). This breaks the circular dependency (the `RBCache.mutex` write lock blocking while sending, while readers are blocked on `RLock`), eliminating the deadlock entirely.
- FIFO Ordered Notifications:
  - Notifications are queued inside the write lock using a fast, non-blocking `select` write. This guarantees that notifications are queued in the exact chronological order of cache mutations, preserving strict FIFO delivery to downstream consumers (like `ContainerWatcher.startRunningContainers`) and preventing race conditions/state desynchronization.
  - The background queue processor uses a non-blocking fast path to avoid channel allocation/timer overhead, falling back to a `100ms` timeout to defensively isolate slow/stalled notifiers without blocking healthy ones.
- Event Queue OOM Protection:
  - Capped the unbounded `OrderedEventQueue` at `maxBufferSize` (100,000 events) and added event dropping.
  - Dropped events are cleanly returned to the memory pool using `event.Release()` to prevent memory leaks and cgroup OOM kills under high system loads.
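The notification decoupling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the node-agent code: `Cache`, `Notification`, `Set`, `Get`, and `Dispatch` are hypothetical names, and only the locking pattern (enqueue with a non-blocking `select` inside the write lock, deliver from a separate goroutine with a fast path and a timeout fallback) is taken from the release notes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Notification describes one cache mutation (hypothetical type).
type Notification struct{ Seq int }

// Cache stands in for RBCache; only the locking pattern matters here.
type Cache struct {
	mu        sync.RWMutex
	data      map[string]int
	seq       int
	queue     chan Notification   // internal notification queue
	notifiers []chan Notification // downstream consumers
}

func NewCache(queueCap int, notifiers ...chan Notification) *Cache {
	return &Cache{data: map[string]int{}, queue: make(chan Notification, queueCap), notifiers: notifiers}
}

// Set mutates the cache and enqueues a notification while still holding the
// write lock, so queue order matches mutation order. The select never blocks:
// if the queue is full, the notification is dropped and false is returned.
func (c *Cache) Set(key string, v int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = v
	c.seq++
	select {
	case c.queue <- Notification{Seq: c.seq}:
		return true
	default:
		return false
	}
}

// Get only needs the read lock, which is never held across a channel send.
func (c *Cache) Get(key string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key]
}

// Dispatch drains the queue without touching the cache lock. It first tries a
// non-blocking send (no timer allocated); only if that fails does it wait up
// to timeout before skipping the slow notifier for this notification.
func (c *Cache) Dispatch(timeout time.Duration) {
	for n := range c.queue {
		for _, ch := range c.notifiers {
			select {
			case ch <- n:
				continue // fast path succeeded, move to the next notifier
			default:
			}
			t := time.NewTimer(timeout)
			select {
			case ch <- n:
			case <-t.C: // stalled notifier: give up on this notification
			}
			t.Stop()
		}
	}
}

func main() {
	healthy := make(chan Notification, 8)
	stalled := make(chan Notification) // nobody ever reads this one
	c := NewCache(10000, healthy, stalled)
	done := make(chan struct{})
	go func() { c.Dispatch(100 * time.Millisecond); close(done) }()
	for i := 0; i < 3; i++ {
		c.Set("k", i)
	}
	close(c.queue) // demo only: no Set may be called after this
	<-done
	close(healthy)
	for n := range healthy {
		fmt.Println("delivered", n.Seq)
	}
}
```

In this sketch a stalled notifier costs at most one timeout per notification, and the healthy consumer still receives every notification in mutation order.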
Release v0.3.129
Summary
Phase 1 — Traces, logs, drop counters
- New `pkg/otelsetup` package: `InitProviders` wires up TracerProvider, LoggerProvider, and MeterProvider over OTLP gRPC; injects ARMO `X-API-Key` / `X-Customer-GUID` auth headers when the endpoint matches `otel.armosec.io`; returns no-op providers when no endpoint is configured
- Container profile lifecycle tracing: `ProfileLifecycleTracker` maintains one long-running span per container learning period (bounded at 10k entries with LRU eviction), recording `profile.entry.saved`, `learning.completed`, `learning.terminated`, and eviction events
- Alert log records: `EmitAlertLogRecord` emits structured OTEL log records for every fired rule and malware detection; includes 60s/1000-entry dedup LRU to avoid flooding on hot rules
- eBPF drop counters: `node_agent.ebpf.events_dropped.total` incremented in container watcher and event handler factory drop paths, labelled by `reason`
- Slow-eval spans: rule evaluations exceeding `OTEL_SLOW_EVAL_THRESHOLD_MS` emit a `rule.evaluate` span
- Ring-buffer log processor: 7500-entry ring buffer retains recent log records; flush endpoint activates automatically when `KS_LOGGER_LEVEL=debug`
- sbommanager: attaches `otelgrpc.NewClientHandler()` for automatic trace propagation
Phase 2 — Replace Prometheus metrics with OTEL SDK
- New `pkg/metricsmanager/otel/`: full `MetricsManager` interface backed by OTEL SDK; attribute-set caching on all hot paths (2× faster, 10× less memory vs Prometheus on the histogram path)
- Collapsed eBPF counters: 17 individual per-event-type counters → single `node_agent.ebpf.events.total{event_type}`
- Prometheus scrape mode: `OTEL_METRICS_EXPORTER=prometheus` installs an OTEL→Prometheus bridge and starts `:8080/metrics` listener
- `rule.ID` standardisation: all metric call sites now use the stable rule ID (e.g. `R1001`) instead of the display name; malware alerts use constant `"malware"` to bound cardinality
- `docs/metrics-migration.md`: full mapping of old Prometheus names → new OTEL names with dashboard update checklist
- A/B benchmarks: hard gate passes — OTEL allocs/op ≤ Prometheus allocs/op, ns/op ≤ 1.1× Prometheus on `BenchmarkReportRuleEvaluationTime`
New env vars
| Variable | Default | Purpose |
|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | — | Base OTLP gRPC endpoint |
| `OTEL_METRICS_EXPORTER` | — | Set to `prometheus` to enable scrape endpoint on `:8080/metrics` |
| `OTEL_SLOW_EVAL_THRESHOLD_MS` | 0 (disabled) | Threshold for slow-eval spans |
| `OTEL_DEBUG_PORT` | 6060 | Debug listener port |
`OTEL_COLLECTOR_SVC` is now deprecated (superseded by `OTEL_EXPORTER_OTLP_ENDPOINT`).
Breaking change
Metric names changed. See `docs/metrics-migration.md` for the full mapping and dashboard update checklist.
Test plan
- `go build ./...` — passes
- `go test ./pkg/otelsetup/... ./pkg/metricsmanager/...` — all pass
- A/B benchmark: OTEL `ReportRuleEvaluationTime` ~95 ns/op / 32 B / 2 allocs vs Prometheus ~200 ns/op / 336 B / 2 allocs — gate passes
- `ProfileLifecycleTracker` and `RingBufferLogProcessor` unit tests pass
🤖 Generated with Claude Code
Summary by CodeRabbit
- New Features
  - Provider-based OpenTelemetry init, OTEL-backed metrics manager replacing prior Prometheus path; expanded metrics (events, rules, SBOM, alerts), gRPC instrumentation, profile lifecycle spans, alert deduplication and suppression reporting
- Documentation
  - Expanded OTEL configuration reference, runtime notes, and Prometheus→OTEL migration guide
- Tests
  - New unit tests and benchmarks for OTEL, lifecycle tracking, thresholds, and metrics
Release v0.3.128
Closes #825. Updates the inspektor-gadget dependency to the forked version that includes the fix for concurrent map writes.
Release v0.3.127
Summary
This PR introduces a container meta-context so that rules can be written once and fire on any containerized workload — both Kubernetes pods and ECS/standalone containers — without duplicating tags.
What changed
- `pkg/contextdetection/types.go`: Added `Container EventSourceContext = "container"` alongside the existing `Kubernetes`, `Host`, and `Standalone` constants.
- `pkg/rulemanager/rulepolicy.go`: Updated `RuleAppliesToContext` to treat `context:container` as a meta-context that matches when the agent is running in either a `kubernetes` or a `container` context.
How the meta-context works
context:container → matches kubernetes OR container runtime
context:kubernetes → matches kubernetes only (unchanged)
context:host → matches host only (unchanged)
context:standalone → matches standalone only (unchanged)
A rule tagged only context:container will fire on Kubernetes AND on ECS/standalone container nodes. A rule that needs Kubernetes-specific behaviour still uses context:kubernetes exclusively.
Backward compatibility
Rules that should fire on both Kubernetes and ECS nodes should carry both tags:
    tags:
      - context:container   # fires on any containerised workload (new agents)
      - context:kubernetes  # still fires on old node-agents that don't know "container"

Old node-agents that don't recognise `container` will skip that tag and match `context:kubernetes` as before, so no existing rule behaviour is broken.
Logic walk-through
    isContainerContext := currentContext == Kubernetes || currentContext == Container
    for _, tag := range rule.Tags {
        // ctx is the context name carried by tag, e.g. "container" for "context:container"
        if ctx == string(currentContext) { return true }            // exact match (all contexts)
        if ctx == "container" && isContainerContext { return true } // meta-context match
    }

No-context rules continue to default to Kubernetes-only (unchanged).
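For readers who want to try the matching rules, here is a self-contained sketch of the walk-through. It is not the code in `pkg/rulemanager/rulepolicy.go`: the lower-case function `ruleAppliesToContext`, the string values of the `Kubernetes`, `Host`, and `Standalone` constants, and the way the context name is parsed out of a `context:<name>` tag are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

type EventSourceContext string

const (
	Kubernetes EventSourceContext = "kubernetes" // value assumed
	Host       EventSourceContext = "host"       // value assumed
	Standalone EventSourceContext = "standalone" // value assumed
	Container  EventSourceContext = "container"  // value given in the release notes
)

// ruleAppliesToContext is a simplified stand-in for RuleAppliesToContext.
// Assumption: context tags have the form "context:<name>".
func ruleAppliesToContext(tags []string, current EventSourceContext) bool {
	isContainerContext := current == Kubernetes || current == Container
	hasContextTag := false
	for _, tag := range tags {
		if !strings.HasPrefix(tag, "context:") {
			continue // not a context tag
		}
		hasContextTag = true
		ctx := strings.TrimPrefix(tag, "context:")
		if ctx == string(current) {
			return true // exact match (all contexts)
		}
		if ctx == string(Container) && isContainerContext {
			return true // meta-context match
		}
	}
	// Rules without any context tag default to Kubernetes-only.
	return !hasContextTag && current == Kubernetes
}

func main() {
	rule := []string{"context:container"}
	for _, c := range []EventSourceContext{Kubernetes, Container, Host, Standalone} {
		fmt.Printf("%-10s -> %v\n", c, ruleAppliesToContext(rule, c))
	}
}
```

Under these assumptions a rule tagged only `context:container` matches the `kubernetes` and `container` contexts and nothing else, while a rule tagged only `context:kubernetes` does not match the `container` context.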
Summary by CodeRabbit
- New Features
  - Added support for ECS (Amazon Elastic Container Service) context detection.
  - Runtime alerts now include ECS-specific metadata (cluster, task, launch type, availability zone).
  - Improved rule matching to support Container context across Kubernetes, Standalone, Container, and ECS environments.
- Chores
  - Updated Go module dependencies.
Release v0.3.124
Summary
- Bumps `github.com/cilium/cilium` v1.17.14 → v1.17.15 (fixes GHSA-gj49-89wh-h4gj High)
- Bumps `golang.org/x/crypto` v0.50.0 → v0.52.0 (fixes GO-2026-5005 through GO-2026-5033)
- Bumps `golang.org/x/net` v0.53.0 → v0.55.0 (fixes GO-2026-5024 through GO-2026-5030)
- Bumps `golang.org/x/sys` v0.43.0 → v0.45.0 (fixes GO-2026-5024)
Not yet fixable
GHSA-x744-4wpc-v9h2 (High) affects github.com/moby/moby and github.com/docker/docker — the fixed version (v29.3.1) is not yet published to the Go module proxy. Will address in a follow-up once it becomes available.
Test plan
- CI passes (build + tests)
- No new dependency conflicts introduced (`go mod tidy` clean)
Release v0.3.122
Bumps github.com/containerd/containerd from 1.7.30 to 1.7.32.
Release notes
Sourced from github.com/containerd/containerd's releases.
containerd 1.7.32
Welcome to the v1.7.32 release of containerd!
The thirty-second patch release for containerd 1.7 contains various fixes and updates including a security patch.
containerd
- Allow hosts.toml to contain only root-level fields without an explicit [host] section (#10028)
- Fix handling of out-of-range USER values in OCI spec to avoid unexpected username/group lookups (#13450)
- Apply hardening to block AF_ALG in default socket policy (#13406)
- Support both "volatile" and "fsync=volatile" mount options for volatile snapshotter (#13299)
- Set AppArmor abi conditionally to support versions < 3.0 (#13273)
Please try out the release binaries and report any issues at https://github.com/containerd/containerd/issues.
Contributors
- Maksym Pavlenko
- Chris Henzie
- Derek McGowan
- Paweł Gronowski
- Samuel Karp
- Wei Fu
- Brad Davidson
- Brian Goff
- LEI WANG
- Phil Estes
- bc87d865c Prepare release notes for v1.7.32
- oci: return explicit error for out-of-range USER values (#13450)
  - 503f47946 oci: return explicit error for out-of-range USER values
- seccomp: Block AF_ALG in default socket policy (#13406)
- Fix issue with empty host tree in hosts.toml (#10028)
  - 24007441d Fix error parsing hosts.toml without any `host` tree
- Support both styles of volatile mount option (#13299)
  - 940733149 Support both styles of volatile mount option
- apparmor: Set abi conditionally (#13273)
  - 2b732c892 apparmor: Set abi conditionally
- Add GitHub Action for k8s node e2e tests (#13258)
  - 0db1e143a Add GitHub Action for k8s node e2e tests
- Update release process after 1.7 (#13236)
  - 3223a75c2 Update for latest updates to release tool
... (truncated)
Commits
- 180a7b7 Merge pull request #13452 from samuelkarp/prepare-1.7.32
- bc87d86 Prepare release notes for v1.7.32
- 6a05ddd Merge pull request #13450 from samuelkarp/oci-withuser-errrange-1.7
- 9c3d01b Merge pull request #13406 from k8s-infra-cherrypick-robot/cherry-pick-13327-t...
- e55b747 seccomp: Block AF_ALG in default socket policy
- 4627a65 seccomp: Document socket rule scope and socketcall limitation
- 33d9e24 Merge pull request #10028 from brandond/fix-hosts-toml
- 503f479 oci: return explicit error for out-of-range USER values
- 4393e22 Merge pull request #13299 from chrishenzie/release/1.7-volatile
- 9407331 Support both styles of volatile mount option
- Additional commits viewable in compare view
Release v0.3.119
Summary
- Timeout: Increase default HTTP client timeout from 5s to 30s. Both node-agent and synchronizer had the same 5s limit, causing a race where the synchronizer's `ReadTimeout` could close the connection before responding, producing spurious `context deadline exceeded (Client.Timeout exceeded while awaiting headers)` errors; no high CPU is required to trigger this.
- Mutex stall: `eventsStorageMutex` was held for the entire HTTP round-trip (up to 5s on timeout), blocking all `ReportEvent`/`handleNetworkEvent`/`handleDnsEvent` goroutines. Fixed by snapshotting under the lock and releasing before the HTTP call. A new `Entities` map with entity structs copied by value is sufficient: the clearing loop never mutates `Inbound`/`Outbound` maps in place, so the old maps become exclusively owned by the snapshot once the lock is released.
- Empty-stream skip: `sendNetworkEvent` logged "skipping" for empty streams but still sent the HTTP request. Added the missing `return nil`.

Deploy note: the timeout fix requires the matching change in `kubescape/synchronizer` (raise `ReadTimeout` to 30s) to be deployed together.
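The mutex-stall fix follows a common pattern: copy what you need under the lock, release it, then do the slow I/O. The sketch below illustrates that pattern with hypothetical names (`Store`, `Entity`, `Report`, `Flush`) and a callback standing in for the HTTP request; it assumes, as the summary states, that clearing replaces the `Inbound`/`Outbound` maps rather than mutating them in place.

```go
package main

import (
	"fmt"
	"sync"
)

// Entity holds per-workload traffic counters (simplified, hypothetical shape).
type Entity struct {
	Inbound  map[string]int
	Outbound map[string]int
}

type Store struct {
	mu       sync.Mutex
	entities map[string]Entity
}

func NewStore() *Store { return &Store{entities: map[string]Entity{}} }

// Report records one inbound event; it only needs the lock briefly.
func (s *Store) Report(name, peer string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entities[name]
	if !ok {
		e = Entity{Inbound: map[string]int{}, Outbound: map[string]int{}}
		s.entities[name] = e
	}
	e.Inbound[peer]++
}

// snapshot copies each Entity by value into a new map and gives the live
// entry fresh maps. The old maps are never mutated again, so after the lock
// is released they belong exclusively to the snapshot.
func (s *Store) snapshot() (map[string]Entity, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	snap := make(map[string]Entity, len(s.entities))
	events := 0
	for name, e := range s.entities {
		snap[name] = e
		events += len(e.Inbound) + len(e.Outbound)
		s.entities[name] = Entity{Inbound: map[string]int{}, Outbound: map[string]int{}}
	}
	return snap, events
}

// Flush sends the snapshot with the lock already released. Empty snapshots
// return early instead of issuing a request.
func (s *Store) Flush(send func(map[string]Entity) error) error {
	snap, events := s.snapshot()
	if events == 0 {
		return nil // nothing recorded: skip the request entirely
	}
	return send(snap)
}

func main() {
	s := NewStore()
	s.Report("pod-a", "10.0.0.1")
	err := s.Flush(func(snap map[string]Entity) error {
		// Reporting during the send would deadlock if Flush held the lock.
		s.Report("pod-a", "10.0.0.2")
		fmt.Println("sending", len(snap["pod-a"].Inbound), "inbound peer(s)")
		return nil
	})
	fmt.Println("flush error:", err)
}
```

Because the send callback runs outside the critical section, a slow or timed-out request delays only the flusher, not the goroutines recording events.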
Test plan
- Verify no `context deadline exceeded` errors in node-agent logs after deploying both PRs together
- Confirm empty network stream intervals no longer generate HTTP requests to the synchronizer
- Confirm network/DNS event recording is not stalled during a synchronizer outage
Summary by CodeRabbit
- Bug Fixes
  - HTTP request timeout increased from 5 to 30 seconds for improved stability.
- Performance
  - Network event processing now skips unnecessary operations when no events are present.
  - Improved concurrency handling in network event storage and transmission.
Release v0.3.115
Summary by CodeRabbit
- Bug Fixes
  - Prevents partial or non-terminal profiles from replacing cached entries; entries remain pending until a terminal status (Completed/TooLarge) and are retried safely.
  - Refresh now preserves existing cache when a fetched profile is not terminal.
- New Features
  - Cache now accepts and promotes entries immediately after a profile reaches Completed, reducing delay.
- Tests
  - Updated tests and fixtures to reflect the refined completion-status behavior.
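The completion-status rule can be pictured with the sketch below. It is only an illustration of the behaviour listed in the summary: the types `Status`, `Profile`, and `ProfileCache`, the method `Offer`, and the status string values are hypothetical and do not come from the node-agent source.

```go
package main

import "fmt"

type Status string

const (
	Learning  Status = "learning"  // non-terminal (value assumed)
	Completed Status = "completed" // terminal (value assumed)
	TooLarge  Status = "too-large" // terminal (value assumed)
)

func (s Status) terminal() bool { return s == Completed || s == TooLarge }

type Profile struct {
	Name    string
	Status  Status
	Version int
}

// ProfileCache keeps the last terminal profile per name and remembers which
// names still need a retry.
type ProfileCache struct {
	entries map[string]Profile
	pending map[string]bool
}

func NewProfileCache() *ProfileCache {
	return &ProfileCache{entries: map[string]Profile{}, pending: map[string]bool{}}
}

// Offer promotes a fetched profile into the cache only if its status is
// terminal. Otherwise the existing entry is preserved and the name is marked
// pending so a later refresh can retry.
func (c *ProfileCache) Offer(p Profile) bool {
	if !p.Status.terminal() {
		c.pending[p.Name] = true
		return false
	}
	c.entries[p.Name] = p
	delete(c.pending, p.Name)
	return true
}

func main() {
	c := NewProfileCache()
	c.Offer(Profile{Name: "nginx", Status: Completed, Version: 1})
	c.Offer(Profile{Name: "nginx", Status: Learning, Version: 2}) // ignored, marked pending
	fmt.Println(c.entries["nginx"].Version, c.pending["nginx"])
	c.Offer(Profile{Name: "nginx", Status: Completed, Version: 3}) // promoted immediately
	fmt.Println(c.entries["nginx"].Version, c.pending["nginx"])
}
```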
Release v0.3.113
Release v0.3.112
Summary by CodeRabbit
- Chores
  - Service discovery now supports the `API_URL` environment variable for dynamic endpoint configuration, defaulting to `api.armosec.io` when unset.