Releases: kubescape/node-agent

Release v0.3.132

02 Jun 07:09
2038a1b

Summary

This PR addresses the critical eBPF agent deadlock and secondary OOM crashes under high load.

Key Improvements

  1. Deadlock Resolution:

    • Decoupled RBCache mutation from notifier channel sends by introducing an internal notificationQueue (capacity 10,000). This breaks the circular wait in which a writer blocked on a channel send while holding the RBCache.mutex write lock, while readers were blocked on RLock, and so eliminates the deadlock.
  2. FIFO Ordered Notifications:

    • Notifications are queued inside the write lock using a fast, non-blocking select write. This guarantees that notifications are queued in the exact chronological order of cache mutations, preserving strict FIFO delivery to downstream consumers (like ContainerWatcher.startRunningContainers) and preventing race conditions/state desynchronization.
    • The background queue processor first attempts a non-blocking send, which avoids channel allocation/timer overhead on the fast path, and falls back to a send with a 100ms timeout so that slow or stalled notifiers are isolated without blocking healthy ones.
  3. Event Queue OOM Protection:

    • Capped the previously unbounded OrderedEventQueue at maxBufferSize (100,000 events) and added event dropping once the cap is reached.
    • Dropped events are cleanly returned to the memory pool using event.Release() to prevent memory leaks and cgroup OOM kills under high system loads.
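
The notification path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the node-agent implementation: `Cache`, `Notification`, `Set`, `Dispatch`, and `Close` are hypothetical names, and only the ideas (enqueue with a non-blocking select inside the write lock, deliver outside the lock with a fast path and a 100ms fallback) come from the notes.

```go
package main

import (
	"sync"
	"time"
)

// notifierTimeout mirrors the 100ms fallback mentioned in the notes.
const notifierTimeout = 100 * time.Millisecond

// Notification is a hypothetical payload describing one cache mutation.
type Notification struct {
	Key string
	Seq int
}

// Cache is an illustrative stand-in for RBCache, not its real definition.
type Cache struct {
	mu        sync.RWMutex
	data      map[string]int
	seq       int
	dropped   int
	queue     chan Notification   // plays the role of notificationQueue
	notifiers []chan Notification // downstream consumers
}

func NewCache(queueCap int, notifiers ...chan Notification) *Cache {
	return &Cache{
		data:      make(map[string]int),
		queue:     make(chan Notification, queueCap),
		notifiers: notifiers,
	}
}

// Set mutates the cache and enqueues the notification while still holding the
// write lock, so queue order equals mutation order. The send never blocks.
func (c *Cache) Set(key string, value int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
	c.seq++
	select {
	case c.queue <- Notification{Key: key, Seq: c.seq}:
	default:
		c.dropped++ // queue full: drop rather than block under the lock
	}
}

// Get only needs the read lock.
func (c *Cache) Get(key string) (int, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

// Dropped reports how many notifications were discarded because the queue was full.
func (c *Cache) Dropped() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.dropped
}

// Dispatch delivers queued notifications without holding the cache lock.
// It returns once Close has been called and the queue is drained.
func (c *Cache) Dispatch() {
	for n := range c.queue {
		for _, ch := range c.notifiers {
			select {
			case ch <- n: // fast path: no timer is allocated
				continue
			default:
			}
			timer := time.NewTimer(notifierTimeout)
			select {
			case ch <- n:
			case <-timer.C: // notifier is stalled: skip it for this event
			}
			timer.Stop()
		}
	}
}

// Close ends Dispatch after the queue drains. Set must not be called afterwards.
func (c *Cache) Close() { close(c.queue) }
```

In a long-running process `Dispatch` would run in its own goroutine. Dropping on a full queue is the trade-off made in this sketch so that a writer never waits while holding the lock; the OrderedEventQueue cap follows the same drop-when-full idea.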

Release v0.3.129

28 May 14:33
eb5d48e

Summary

Phase 1 — Traces, logs, drop counters

  • New `pkg/otelsetup` package: `InitProviders` wires up TracerProvider, LoggerProvider, and MeterProvider over OTLP gRPC; injects ARMO `X-API-Key` / `X-Customer-GUID` auth headers when the endpoint matches `otel.armosec.io`; returns no-op providers when no endpoint is configured
  • Container profile lifecycle tracing: `ProfileLifecycleTracker` maintains one long-running span per container learning period (bounded at 10k entries with LRU eviction), recording `profile.entry.saved`, `learning.completed`, `learning.terminated`, and eviction events
  • Alert log records: `EmitAlertLogRecord` emits structured OTEL log records for every fired rule and malware detection; includes 60s/1000-entry dedup LRU to avoid flooding on hot rules
  • eBPF drop counters: `node_agent.ebpf.events_dropped.total` incremented in container watcher and event handler factory drop paths, labelled by `reason`
  • Slow-eval spans: rule evaluations exceeding `OTEL_SLOW_EVAL_THRESHOLD_MS` emit a `rule.evaluate` span
  • Ring-buffer log processor: 7500-entry ring buffer retains recent log records; flush endpoint activates automatically when KS_LOGGER_LEVEL=debug
  • sbommanager: attaches `otelgrpc.NewClientHandler()` for automatic trace propagation
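
The alert dedup behaviour (a 60s window with at most 1000 tracked entries) could look roughly like the sketch below. It is an assumption-laden illustration: `Deduper`, `NewDeduper`, and `ShouldEmit` are hypothetical names, the dedup key is assumed to be a plain string such as a rule ID, and time is passed in as Unix seconds to keep the example deterministic. The real `EmitAlertLogRecord` code may differ.

```go
package main

import "container/list"

// dedupEntry remembers when a key last produced an emitted record.
type dedupEntry struct {
	key  string
	seen int64 // Unix seconds
}

// Deduper suppresses repeats of the same key inside a time window and keeps
// at most maxSize keys, evicting the least recently used one.
// It is not safe for concurrent use; a real implementation would add a mutex.
type Deduper struct {
	windowSec int64
	maxSize   int
	order     *list.List // front = most recently used
	items     map[string]*list.Element
}

func NewDeduper(windowSec int64, maxSize int) *Deduper {
	if maxSize < 1 {
		maxSize = 1
	}
	return &Deduper{
		windowSec: windowSec,
		maxSize:   maxSize,
		order:     list.New(),
		items:     make(map[string]*list.Element),
	}
}

// ShouldEmit reports whether a record for key should be emitted at nowSec.
func (d *Deduper) ShouldEmit(key string, nowSec int64) bool {
	if el, ok := d.items[key]; ok {
		e := el.Value.(*dedupEntry)
		d.order.MoveToFront(el)
		if nowSec-e.seen < d.windowSec {
			return false // duplicate inside the window: suppress
		}
		e.seen = nowSec // window expired: emit and restart the window
		return true
	}
	if d.order.Len() >= d.maxSize {
		oldest := d.order.Back()
		delete(d.items, oldest.Value.(*dedupEntry).key)
		d.order.Remove(oldest)
	}
	d.items[key] = d.order.PushFront(&dedupEntry{key: key, seen: nowSec})
	return true
}
```

With `NewDeduper(60, 1000)` a hot rule emits at most one record per 60 seconds, and memory stays bounded because only 1000 keys are tracked at a time.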

Phase 2 — Replace Prometheus metrics with OTEL SDK

  • New `pkg/metricsmanager/otel/`: full `MetricsManager` interface backed by OTEL SDK; attribute-set caching on all hot paths (2× faster, 10× less memory vs Prometheus on the histogram path)
  • Collapsed eBPF counters: 17 individual per-event-type counters → single `node_agent.ebpf.events.total{event_type}`
  • Prometheus scrape mode: `OTEL_METRICS_EXPORTER=prometheus` installs an OTEL→Prometheus bridge and starts `:8080/metrics` listener
  • `rule.ID` standardisation: all metric call sites now use the stable rule ID (e.g. `R1001`) instead of the display name; malware alerts use constant `"malware"` to bound cardinality
  • `docs/metrics-migration.md`: full mapping of old Prometheus names → new OTEL names with dashboard update checklist
  • A/B benchmarks: hard gate passes — OTEL allocs/op ≤ Prometheus allocs/op, ns/op ≤ 1.1× Prometheus on `BenchmarkReportRuleEvaluationTime`

New env vars

| Variable | Default | Purpose |
|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | | Base OTLP gRPC endpoint |
| `OTEL_METRICS_EXPORTER` | | Set to `prometheus` to enable scrape endpoint on `:8080/metrics` |
| `OTEL_SLOW_EVAL_THRESHOLD_MS` | 0 (disabled) | Threshold for slow-eval spans |
| `OTEL_DEBUG_PORT` | 6060 | Debug listener port |

`OTEL_COLLECTOR_SVC` is now deprecated (superseded by `OTEL_EXPORTER_OTLP_ENDPOINT`).

Breaking change

Metric names changed. See `docs/metrics-migration.md` for the full mapping and dashboard update checklist.

Test plan

  • `go build ./...` — passes
  • `go test ./pkg/otelsetup/... ./pkg/metricsmanager/...` — all pass
  • A/B benchmark: OTEL `ReportRuleEvaluationTime` ~95 ns/op / 32 B / 2 allocs vs Prometheus ~200 ns/op / 336 B / 2 allocs — gate passes
  • `ProfileLifecycleTracker` and `RingBufferLogProcessor` unit tests pass

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Provider-based OpenTelemetry init, OTEL-backed metrics manager replacing prior Prometheus path; expanded metrics (events, rules, SBOM, alerts), gRPC instrumentation, profile lifecycle spans, alert deduplication and suppression reporting
  • Documentation

    • Expanded OTEL configuration reference, runtime notes, and Prometheus→OTEL migration guide
  • Tests

    • New unit tests and benchmarks for OTEL, lifecycle tracking, thresholds, and metrics

Release v0.3.128

27 May 17:17
a9bde81

Closes #825. Updates the inspektor-gadget dependency to the forked version that includes the concurrent map writes fix.

Release v0.3.127

27 May 13:20
0a63d5e

Summary

This PR introduces a container meta-context so that rules can be written once and fire on any containerized workload — both Kubernetes pods and ECS/standalone containers — without duplicating tags.

What changed

  • pkg/contextdetection/types.go: Added Container EventSourceContext = "container" alongside the existing Kubernetes, Host, and Standalone constants.

  • pkg/rulemanager/rulepolicy.go: Updated RuleAppliesToContext to treat context:container as a meta-context that matches when the agent is running in either a kubernetes or a container context.

How the meta-context works

context:container  →  matches kubernetes OR container runtime
context:kubernetes →  matches kubernetes only  (unchanged)
context:host       →  matches host only         (unchanged)
context:standalone →  matches standalone only   (unchanged)

A rule tagged only context:container will fire on Kubernetes AND on ECS/standalone container nodes. A rule that needs Kubernetes-specific behaviour still uses context:kubernetes exclusively.

Backward compatibility

Rules that should fire on both Kubernetes and ECS nodes should carry both tags:

tags:
  - context:container    # fires on any containerised workload (new agents)
  - context:kubernetes   # still fires on old node-agents that don't know "container"

Old node-agents that don't recognise container will skip that tag and match context:kubernetes as before, so no existing rule behaviour is broken.

Logic walk-through

isContainerContext := currentContext == Kubernetes || currentContext == Container

for _, tag := range rule.Tags {
    // ctx is the context value carried by the tag, e.g. "container" for context:container (parsing elided)
    if ctx == string(currentContext)              { return true } // exact match (all contexts)
    if ctx == "container" && isContainerContext   { return true } // meta-context match
}

No-context rules continue to default to Kubernetes-only (unchanged).
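
For readers who want to experiment with the matching rules, the walk-through can be expanded into a self-contained sketch. Several details here are assumptions rather than the code in pkg/rulemanager/rulepolicy.go: the function name `ruleAppliesToContext`, the `context:` tag prefix parsing, and the string values of the Kubernetes, Host, and Standalone constants (only `Container = "container"` is stated in the notes).

```go
package main

import "strings"

type EventSourceContext string

const (
	Kubernetes EventSourceContext = "kubernetes" // assumed value
	Host       EventSourceContext = "host"       // assumed value
	Standalone EventSourceContext = "standalone" // assumed value
	Container  EventSourceContext = "container"  // stated in the notes
)

// contextTagPrefix is assumed from the tag examples (context:container, ...).
const contextTagPrefix = "context:"

// ruleAppliesToContext is a hypothetical, simplified version of the check.
func ruleAppliesToContext(tags []string, current EventSourceContext) bool {
	isContainerContext := current == Kubernetes || current == Container
	hasContextTag := false

	for _, tag := range tags {
		if !strings.HasPrefix(tag, contextTagPrefix) {
			continue // not a context tag
		}
		hasContextTag = true
		ctx := strings.TrimPrefix(tag, contextTagPrefix)
		if ctx == string(current) {
			return true // exact match (all contexts)
		}
		if ctx == string(Container) && isContainerContext {
			return true // meta-context match
		}
	}

	// Rules without any context tag default to Kubernetes-only.
	return !hasContextTag && current == Kubernetes
}
```

Under these assumptions, a rule tagged only context:container matches an agent in the Kubernetes or Container context, while a rule tagged only context:kubernetes does not match an agent in the Container context.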

Summary by CodeRabbit

  • New Features

    • Added support for ECS (Amazon Elastic Container Service) context detection.
    • Runtime alerts now include ECS-specific metadata (cluster, task, launch type, availability zone).
    • Improved rule matching to support Container context across Kubernetes, Standalone, Container, and ECS environments.
  • Chores

    • Updated Go module dependencies.

Release v0.3.124

26 May 17:18
cbfefb2

Summary

  • Bumps github.com/cilium/cilium v1.17.14 → v1.17.15 (fixes GHSA-gj49-89wh-h4gj High)
  • Bumps golang.org/x/crypto v0.50.0 → v0.52.0 (fixes GO-2026-5005 through GO-2026-5033)
  • Bumps golang.org/x/net v0.53.0 → v0.55.0 (fixes GO-2026-5024 through GO-2026-5030)
  • Bumps golang.org/x/sys v0.43.0 → v0.45.0 (fixes GO-2026-5024)

Not yet fixable

GHSA-x744-4wpc-v9h2 (High) affects github.com/moby/moby and github.com/docker/docker — the fixed version (v29.3.1) is not yet published to the Go module proxy. Will address in a follow-up once it becomes available.

Test plan

  • CI passes (build + tests)
  • No new dependency conflicts introduced (go mod tidy clean)

🤖 Generated with Claude Code

Release v0.3.122

26 May 08:19
ff24606

Bumps github.com/containerd/containerd from 1.7.30 to 1.7.32.

Release notes

Sourced from github.com/containerd/containerd's releases.

containerd 1.7.32

Welcome to the v1.7.32 release of containerd!


The thirty-second patch release for containerd 1.7 contains various fixes and updates including a security patch.

  • containerd

  • Allow hosts.toml to contain only root-level fields without an explicit [host] section (#10028)

  • Fix handling of out-of-range USER values in OCI spec to avoid unexpected username/group lookups (#13450)

  • Apply hardening to block AF_ALG in default socket policy (#13406)

  • Support both "volatile" and "fsync=volatile" mount options for volatile snapshotter (#13299)

  • Set AppArmor abi conditionally to support versions < 3.0 (#13273)

Please try out the release binaries and report any issues at https://github.com/containerd/containerd/issues.

Contributors

  • Maksym Pavlenko
  • Chris Henzie
  • Derek McGowan
  • Paweł Gronowski
  • Samuel Karp
  • Wei Fu
  • Brad Davidson
  • Brian Goff
  • LEI WANG
  • Phil Estes
Changes

  • bc87d865c Prepare release notes for v1.7.32
  • oci: return explicit error for out-of-range USER values (#13450)
    • 503f47946 oci: return explicit error for out-of-range USER values
  • seccomp: Block AF_ALG in default socket policy (#13406)
    • e55b747d3 seccomp: Block AF_ALG in default socket policy
    • 4627a65f8 seccomp: Document socket rule scope and socketcall limitation
  • Fix issue with empty host tree in hosts.toml (#10028)
    • 24007441d Fix error parsing hosts.toml without any host tree
  • Support both styles of volatile mount option (#13299)
    • 940733149 Support both styles of volatile mount option
  • apparmor: Set abi conditionally (#13273)
  • Add GitHub Action for k8s node e2e tests (#13258)
    • 0db1e143a Add GitHub Action for k8s node e2e tests
  • Update release process after 1.7 (#13236)
    • 3223a75c2 Update for latest updates to release tool

... (truncated)

Commits
  • 180a7b7 Merge pull request #13452 from samuelkarp/prepare-1.7.32
  • bc87d86 Prepare release notes for v1.7.32
  • 6a05ddd Merge pull request #13450 from samuelkarp/oci-withuser-errrange-1.7
  • 9c3d01b Merge pull request #13406 from k8s-infra-cherrypick-robot/cherry-pick-13327-t...
  • e55b747 seccomp: Block AF_ALG in default socket policy
  • 4627a65 seccomp: Document socket rule scope and socketcall limitation
  • 33d9e24 Merge pull request #10028 from brandond/fix-hosts-toml
  • 503f479 oci: return explicit error for out-of-range USER values
  • 4393e22 Merge pull request #13299 from chrishenzie/release/1.7-volatile
  • 9407331 Support both styles of volatile mount option
  • Additional commits viewable in compare view

Release v0.3.119

19 May 15:42
bf71679

Summary

  • Timeout: Increase default HTTP client timeout from 5s to 30s. Both node-agent and synchronizer had the same 5s limit, causing a race where the synchronizer's ReadTimeout could close the connection before responding, producing spurious context deadline exceeded (Client.Timeout exceeded while awaiting headers) errors — no high CPU required to trigger this.
  • Mutex stall: eventsStorageMutex was held for the entire HTTP round-trip (up to 5s on timeout), blocking all ReportEvent/handleNetworkEvent/handleDnsEvent goroutines. Fixed by snapshotting under the lock and releasing before the HTTP call. A new Entities map with entity structs copied by value is sufficient — the clearing loop never mutates Inbound/Outbound maps in place, so the old maps become exclusively owned by the snapshot once the lock is released.
  • Empty-stream skip: sendNetworkEvent logged "skipping" for empty streams but still sent the HTTP request. Added the missing return nil.
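
The mutex-stall fix and the empty-stream skip can be illustrated with a small sketch. Everything below is hypothetical scaffolding (`Store`, `Entity`, `ReportInbound`, `Flush`, and the `send` callback standing in for the HTTP call); only the pattern comes from the notes: copy entity structs by value under the lock, hand the store fresh maps instead of clearing the old ones in place, release the lock, and only then perform the slow send.

```go
package main

import "sync"

// Entity is a hypothetical per-workload record with two event maps.
type Entity struct {
	Inbound  map[string]int
	Outbound map[string]int
}

func newEntity() Entity {
	return Entity{Inbound: map[string]int{}, Outbound: map[string]int{}}
}

// Store is an illustrative stand-in for the events storage.
type Store struct {
	mu       sync.Mutex // plays the role of eventsStorageMutex
	entities map[string]Entity
}

func NewStore() *Store { return &Store{entities: map[string]Entity{}} }

// ReportInbound records one inbound event for the entity id.
func (s *Store) ReportInbound(id, peer string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entities[id]
	if !ok {
		e = newEntity()
		s.entities[id] = e
	}
	e.Inbound[peer]++ // e shares its maps with the stored entity
}

// snapshotAndClear copies entity structs by value and replaces the stored maps
// with fresh ones, so the old maps are owned only by the snapshot afterwards.
func (s *Store) snapshotAndClear() (map[string]Entity, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	snap := make(map[string]Entity, len(s.entities))
	entries := 0
	for id, e := range s.entities {
		snap[id] = e
		entries += len(e.Inbound) + len(e.Outbound)
		s.entities[id] = newEntity() // never mutate the old maps in place
	}
	return snap, entries
}

// Flush sends the snapshot without holding the lock and skips empty snapshots.
func (s *Store) Flush(send func(map[string]Entity) error) error {
	snap, entries := s.snapshotAndClear()
	if entries == 0 {
		return nil // nothing recorded: do not send a request
	}
	return send(snap) // slow call (e.g. HTTP round-trip) runs unlocked
}
```

Because `send` runs after the lock is released, reporting goroutines keep making progress even if the round-trip takes the full client timeout.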

Deploy note: the timeout fix requires the matching change in kubescape/synchronizer (raise ReadTimeout to 30s) to be deployed together.

Test plan

  • Verify no context deadline exceeded errors in node-agent logs after deploying both PRs together
  • Confirm empty network stream intervals no longer generate HTTP requests to the synchronizer
  • Confirm network/DNS event recording is not stalled during a synchronizer outage

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • HTTP request timeout increased from 5 to 30 seconds for improved stability.
  • Performance

    • Network event processing now skips unnecessary operations when no events are present.
    • Improved concurrency handling in network event storage and transmission.

Release v0.3.115

18 May 16:12
5cf899c

Summary by CodeRabbit

  • Bug Fixes

    • Prevents partial or non-terminal profiles from replacing cached entries; entries remain pending until a terminal status (Completed/TooLarge) and are retried safely.
    • Refresh now preserves existing cache when a fetched profile is not terminal.
  • New Features

    • Cache now accepts and promotes entries immediately after a profile reaches Completed, reducing delay.
  • Tests

    • Updated tests and fixtures to reflect the refined completion-status behavior.
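
The terminal-status rule summarised above can be sketched as follows. This is only an illustration of the idea: `ProfileCache`, `Profile`, `Refresh`, the `Learning` status, and all string values are hypothetical; the notes name only Completed and TooLarge as terminal statuses.

```go
package main

type Status string

const (
	Learning  Status = "learning"  // hypothetical non-terminal status
	Completed Status = "completed" // terminal (string value assumed)
	TooLarge  Status = "too-large" // terminal (string value assumed)
)

// Profile is a hypothetical, trimmed-down profile record.
type Profile struct {
	Name   string
	Status Status
	Rules  int
}

func isTerminal(s Status) bool { return s == Completed || s == TooLarge }

// ProfileCache keeps the last terminal profile per name and tracks names that
// are still waiting for a terminal status.
type ProfileCache struct {
	entries map[string]Profile
	pending map[string]bool
}

func NewProfileCache() *ProfileCache {
	return &ProfileCache{entries: map[string]Profile{}, pending: map[string]bool{}}
}

// Refresh stores the fetched profile only if its status is terminal.
// Otherwise the existing entry is preserved and the name stays pending so a
// later refresh can retry. It reports whether the cache was updated.
func (c *ProfileCache) Refresh(fetched Profile) bool {
	if !isTerminal(fetched.Status) {
		c.pending[fetched.Name] = true
		return false
	}
	c.entries[fetched.Name] = fetched
	delete(c.pending, fetched.Name)
	return true
}

// Get returns the cached terminal profile, if any.
func (c *ProfileCache) Get(name string) (Profile, bool) {
	p, ok := c.entries[name]
	return p, ok
}
```

In this sketch a partial profile never overwrites a cached terminal one, and a profile is promoted on the first refresh that observes a terminal status.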

Release v0.3.113

12 May 12:22
cc59fa0

Summary by CodeRabbit

  • Bug Fixes
    • Improved validation messaging for rules with missing profile configurations. Consolidated multiple individual error logs into a single aggregated warning message for clearer feedback and reduced log noise.

Release v0.3.112

06 May 15:44
2d768cb

Summary by CodeRabbit

  • Chores
    • Service discovery now supports the API_URL environment variable for dynamic endpoint configuration, defaulting to api.armosec.io when unset.