feat(ecs-fargate): EventBridge lifecycle forwarder + OTel collector improvements #154

Open
prathamesh-sonpatki wants to merge 8 commits into main from prathamesh/ecs-eventbridge-lifecycle-forwarder

Conversation

@prathamesh-sonpatki
Member

Summary

  • New: eventbridge-task-lifecycle/ captures killed or stopped ECS task events (stop reason, exit codes) and forwards them to Last9 as searchable logs through an EventBridge API Destination, with no Lambda required
  • Fix: nodejs-express-large-logs/ — OTel collector config improvements for better visibility of stopped tasks

EventBridge lifecycle forwarder

When ECS kills a task (OOM, health check failure, scale-in), the OTel metrics receiver stops emitting immediately. The stop reason and exit codes are never captured. This forwarder bridges that gap.

How it works:

ECS Task stops → EventBridge "ECS Task State Change" event
  → API Destination (no Lambda) → POST to Last9 /json/v2
      Body: [{"taskArn":"...","stopCode":"EssentialContainerExited","containers":[{"exitCode":137}],...}]

Two CloudFormation templates:

  • cloudformation-no-lambda.yaml (recommended): EventBridge API Destination posts directly to Last9 /json/v2 using an input transformer [<detail>]. Zero code.
  • cloudformation.yaml: Lambda variant that wraps events in OTLP format (useful if structured OTLP attributes are needed for metric correlation)

Correlation key: taskArn in log body ↔ aws_ecs_task_arn label in ECS metrics
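
To make the payload shape and the correlation key concrete, here is a minimal Python sketch of what the no-Lambda path is intended to deliver: the event's detail object wrapped in a one-element JSON array. The sample event values and the helper name `build_json_v2_body` are illustrative assumptions, not taken from the templates.

```python
import json

# Hypothetical, trimmed "ECS Task State Change" event; all values are made up.
SAMPLE_EVENT = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {
        "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/demo-cluster/0123456789abcdef",
        "lastStatus": "STOPPED",
        "stopCode": "EssentialContainerExited",
        "stoppedReason": "Essential container in task exited",
        "containers": [{"name": "app", "exitCode": 137}],
    },
}


def build_json_v2_body(event: dict) -> str:
    """Wrap the event's detail object in a JSON array, as in the flow above."""
    return json.dumps([event["detail"]])


body = build_json_v2_body(SAMPLE_EVENT)
records = json.loads(body)

# taskArn is the field to join against the aws_ecs_task_arn metric label.
correlation_key = records[0]["taskArn"]
exit_codes = [c["exitCode"] for c in records[0]["containers"]]
print(correlation_key, exit_codes)
```

Exit code 137 is 128 + 9, meaning the container was terminated by SIGKILL, which is the kind of detail that disappears once the metrics receiver stops emitting.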

OTel config fixes (nodejs-express-large-logs)

| Change | Why |
| --- | --- |
| collection_interval: 60s → 10s | Captures task state during 30s ECS shutdown window before SIGKILL |
| Added memory_limiter (256 MiB) | Prevents collector OOM under log bursts |
| Added traces pipeline | OTLP receiver was declared but traces were silently dropped |
| Added otlp to logs pipeline | Enables OTLP log ingestion alongside FluentBit |
| batch.timeout: 10s → 5s | Faster flush during graceful shutdown |
| Pin collector to 0.149.0 | Reproducible builds |
| Add stopTimeout: 30 | Gives collector 30s to flush before SIGKILL |
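
The reasoning behind the collection_interval change can be checked with a little arithmetic. The sketch below assumes scrapes fire on a fixed period with an arbitrary phase and that the collector keeps running for the whole stop window; the function name `guaranteed_scrapes` is hypothetical.

```python
def guaranteed_scrapes(window_s: int, interval_s: int) -> int:
    """Minimum number of periodic scrapes that must fall inside a window.

    With a fixed period and unknown phase, a window of window_s seconds
    always contains at least floor(window_s / interval_s) scrapes.
    """
    return window_s // interval_s


STOP_WINDOW_S = 30  # stopTimeout from the table above

for interval in (60, 10):
    count = guaranteed_scrapes(STOP_WINDOW_S, interval)
    print(f"interval={interval}s -> at least {count} scrape(s)")
```

Under these assumptions a 60s interval guarantees no scrape inside the 30s window, while a 10s interval guarantees three.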

Validation

  • Lambda handler logic validated with sample ECS stop event (Python)
  • Input transformer output [<detail>] validated as valid JSON array
  • Go lifecycle reporter passes go vet

🤖 Generated with Claude Code

prathamesh-sonpatki and others added 7 commits April 2, 2026 10:07
Add session management (15m inactivity / 4h max, file persistence),
automatic UIKit view tracking (swizzle), SwiftUI .trackView(name:) modifier,
and user identification API — aligned with browser SDK and Datadog iOS patterns.

New files: SessionManager, SessionStore, SessionSpanProcessor, ViewManager, UserInfo
Linear: FDE-104

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HangDetector: background thread pings main thread every 1s, captures
Mach thread stack trace if main thread blocks >2s, emits hang start/end
log events with duration.

WatchdogTerminationDetector: persists app state to disk, detects abnormal
termination (no clean shutdown + no crash handler) on next launch, emits
FATAL log event linking to previous session.

Linear: FDE-104

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rash handling

App launch: cold/warm start time (process uptime → first viewDidAppear).
Performance: per-view CPU/memory (mach_task_basic_info) + frame rate monitoring
(CADisplayLink with slow >25ms and frozen >700ms frame detection).
Interactions: tap tracking via UIApplication.sendEvent swizzle with target info.
Network timing: DNS/TLS/TTFB breakdown from URLSessionTaskMetrics.
Signal crashes: POSIX signal handlers (SIGSEGV, SIGABRT, SIGBUS, SIGFPE, SIGILL,
SIGTRAP) with pre-allocated crash markers and handler chaining.

Closes competitive gaps with Datadog and Sentry iOS SDKs (except session replay).
Linear: FDE-104

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the 12 internal SDK source files from the examples directory.
The instrumentation now lives in the published last9-ios-swift-sdk package
(v0.1.0). The example retains only AppDelegate.swift and ExampleUsage.swift.

- Package.swift now depends on last9/last9-ios-swift-sdk instead of
  opentelemetry-swift directly
- All Last9OTel.* references updated to Last9RUM.*
- README updated with SPM installation instructions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Picks up macOS build fixes (OtlpConfiguration import, LoggerProviderSdk
lifecycle, platform declaration).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove Last9OTel.swift (manual OTel setup — replaced by the SDK)
- Update Package.swift to depend on last9/last9-ios-swift-sdk v0.1.2
- Add macOS(.v12) platform to satisfy SDK's platform requirements
- Update README with DHH-style setup guide and correct version

SDK handles sessions, views, crashes, ANR, user identification, and
all network tracing automatically via one initialize call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ctor improvements

- Add eventbridge-task-lifecycle/ with two CloudFormation templates:
  - cloudformation-no-lambda.yaml: EventBridge API Destination → Last9 /json/v2
    (no Lambda; input transformer wraps ECS detail in JSON array)
  - cloudformation.yaml: EventBridge → Lambda → Last9 OTLP /v1/logs
    (richer OTLP attributes for structured correlation)
- Fix otel-config-tcp.yaml: collection_interval 60s → 10s (captures task state
  during 30s shutdown window), add memory_limiter, add traces pipeline,
  add otlp receiver to logs pipeline, batch timeout 10s → 5s
- Fix task-definition-with-otel.json: pin collector to 0.149.0, add
  stopTimeout: 30, sync embedded OTEL_CONFIG env var with same changes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    Type: AWS::Events::ApiDestination
    Properties:
      Name: !Sub 'last9-ecs-lifecycle-${AWS::StackName}'
      Description: >
Contributor

The YAML folded block scalar > here leaves a trailing newline in the resolved string. That violates the AWS API Destination description regex constraint (Member must satisfy regular expression pattern: .*), because . does not match a newline, so CloudFormation fails with CREATE_FAILED on this resource.

Fix: use a single-line string instead:

Description: Posts ECS task stop events to Last9 HTTP log ingestion endpoint.
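
The failure mode can be illustrated in Python. This is only a sketch: AWS does not use Python's `re` module, and `DESCRIPTION_PATTERN` is an assumed stand-in for the constraint quoted in the error message. It shows why a value ending in a newline cannot satisfy `.*` when the whole value has to match.

```python
import re

# Assumed stand-in for the constraint in the error message: the entire value
# must match ".*". In Python, as in most engines, "." excludes "\n".
DESCRIPTION_PATTERN = re.compile(r".*")

# What the folded scalar resolves to: the text plus one trailing newline.
folded = "Posts ECS task stop events to Last9 HTTP log ingestion endpoint.\n"
# What the single-line string resolves to.
single_line = "Posts ECS task stop events to Last9 HTTP log ingestion endpoint."


def is_valid_description(value: str) -> bool:
    """Return True only if the whole value matches the pattern."""
    return DESCRIPTION_PATTERN.fullmatch(value) is not None


print(is_valid_description(folded))       # False
print(is_valid_description(single_line))  # True
```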

          lastStatus:
            - STOPPED
        resources: !If
          - FilterByCluster
Contributor

The resources prefix filter here uses the cluster ARN format (arn:aws:ecs:REGION:ACCOUNT:cluster/CLUSTER-NAME), but the resources array in ECS Task State Change events contains task ARNs (arn:aws:ecs:REGION:ACCOUNT:task/CLUSTER-NAME/TASK-ID).

Since a task ARN continues with task/ where the cluster ARN continues with cluster/, this prefix can never match, and the rule silently drops all events when ECSClusterArn is provided.

Verified by deploying this template with a real ECS cluster — TriggeredRules stayed at 0 until the resources filter was removed.

Fix: filter on detail.clusterArn instead, which contains the actual cluster ARN:

      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - ECS Task State Change
        detail:
          lastStatus:
            - STOPPED
          clusterArn: !If
            - FilterByCluster
            - - !Ref ECSClusterArn
            - !Ref "AWS::NoValue"
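
The mismatch is easy to reproduce outside AWS. The sketch below is a simplified model of the two filters, not EventBridge's actual matcher; the ARNs are made up and the function names are hypothetical.

```python
from typing import Any, Dict

CLUSTER_ARN = "arn:aws:ecs:us-east-1:123456789012:cluster/demo-cluster"

# Hypothetical, trimmed ECS Task State Change event.
EVENT: Dict[str, Any] = {
    "resources": [
        "arn:aws:ecs:us-east-1:123456789012:task/demo-cluster/0123456789abcdef"
    ],
    "detail": {"clusterArn": CLUSTER_ARN, "lastStatus": "STOPPED"},
}


def matches_resources_prefix(event: Dict[str, Any], prefix: str) -> bool:
    """Simplified model of a prefix filter on the top-level resources array."""
    return any(arn.startswith(prefix) for arn in event["resources"])


def matches_detail_cluster(event: Dict[str, Any], cluster_arn: str) -> bool:
    """Simplified model of an exact-match filter on detail.clusterArn."""
    return event["detail"].get("clusterArn") == cluster_arn


print(matches_resources_prefix(EVENT, CLUSTER_ARN))  # False
print(matches_detail_cluster(EVENT, CLUSTER_ARN))    # True
```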

          InputTransformer:
            InputPathsMap:
              detail: '$.detail'
            InputTemplate: '[<detail>]'
Contributor

The InputTransformer with template [<detail>] produces the wrong payload shape for Last9's /json/v2 endpoint. EventBridge serializes the <detail> variable as a JSON string (double-encoded), resulting in ["{\"taskArn\":...}"] instead of [{"taskArn":...}]. Every invocation fails silently.

Verified by deploying two rules targeting the same API Destination — one with InputTransformer and one without:

  • With InputTransformer: TriggeredRules: 7, FailedInvocations: 7
  • Without InputTransformer: TriggeredRules: 7, FailedInvocations: 0

Fix: remove the InputTransformer entirely. The raw EventBridge envelope (which includes source, detail-type, time, region, and detail) is a valid JSON object that /json/v2 accepts as-is. This actually provides richer data in Last9 since it includes event metadata alongside the ECS detail.
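
The difference between the two payload shapes can be shown with plain `json`. This sketch reproduces the shapes reported above rather than EventBridge's internals; the event values are made up.

```python
import json

# Hypothetical, trimmed EventBridge envelope.
envelope = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "region": "us-east-1",
    "detail": {
        "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/demo-cluster/0123456789abcdef",
        "lastStatus": "STOPPED",
    },
}

# Shape reported for the transformer: detail serialized to a string first,
# then placed in an array, so the array element is a JSON string.
double_encoded_body = json.dumps([json.dumps(envelope["detail"])])

# Shape the PR description intended: an array holding the detail object.
intended_body = json.dumps([envelope["detail"]])

# Shape after the fix: the raw envelope, sent without any transformer.
raw_body = json.dumps(envelope)

print(type(json.loads(double_encoded_body)[0]).__name__)  # str
print(type(json.loads(intended_body)[0]).__name__)        # dict
print(sorted(json.loads(raw_body).keys()))
```

Both array bodies parse as JSON, but only the second carries an object whose fields a receiver can index; the double-encoded one carries an opaque string.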

…cycle events

Three bugs fixed (verified by deploying to real ECS cluster):
- ApiDestination Description: YAML block scalar newline caused CREATE_FAILED
- Cluster filter: resources prefix used cluster ARN but events have task ARNs;
  moved to detail.clusterArn with conditional EventPattern to avoid empty detail
- InputTransformer: [<detail>] double-encoded JSON as string, causing 100%
  FailedInvocations; removed InputTransformer, raw EventBridge envelope works

Expanded from STOPPED-only to full ECS lifecycle:
- ECS Task State Change (PROVISIONING → RUNNING → STOPPED)
- ECS Service Action (deployments, scaling, steady-state)
- ECS Deployment State Change (IN_PROGRESS → COMPLETED/FAILED)
Each event type has its own toggleable EventBridge rule.

Lambda handler updated to v0.2.0: generic event handling with dynamic
OTLP attributes, EVENT_NAMES mapping, and sample events for all 3 types.

Added README.md, .env.example, .gitignore per repo conventions.
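
For readers unfamiliar with the Lambda variant, the generic handling described in this commit (an EVENT_NAMES mapping and dynamic OTLP attributes) might look roughly like the sketch below. It is not the handler from this PR: the mapping values, the `aws.ecs.` attribute prefix, the `service.name` value, the scope name, and the function name `to_otlp_logs` are all assumptions. Only the overall OTLP/JSON logs layout (resourceLogs → scopeLogs → logRecords) follows the OTLP specification.

```python
import json
import time
from typing import Any, Dict, Optional

# Hypothetical mapping from EventBridge detail-type to a short event name.
EVENT_NAMES = {
    "ECS Task State Change": "ecs.task.state_change",
    "ECS Service Action": "ecs.service.action",
    "ECS Deployment State Change": "ecs.deployment.state_change",
}


def to_otlp_logs(event: Dict[str, Any], now_ns: Optional[int] = None) -> Dict[str, Any]:
    """Wrap one EventBridge event in an OTLP/JSON logs payload."""
    detail = event.get("detail", {})
    attributes = {
        "event.name": EVENT_NAMES.get(event.get("detail-type", ""), "ecs.unknown")
    }
    # Promote scalar detail fields to attributes; nested values stay in the body.
    for key, value in detail.items():
        if isinstance(value, (str, int, float, bool)):
            attributes[f"aws.ecs.{key}"] = str(value)

    record = {
        # 64-bit integers are encoded as strings in OTLP/JSON.
        "timeUnixNano": str(now_ns if now_ns is not None else time.time_ns()),
        "severityText": "INFO",
        "body": {"stringValue": json.dumps(detail)},
        "attributes": [
            {"key": k, "value": {"stringValue": v}} for k, v in attributes.items()
        ],
    }
    resource_attrs = [{"key": "service.name", "value": {"stringValue": "ecs-lifecycle"}}]
    return {
        "resourceLogs": [
            {
                "resource": {"attributes": resource_attrs},
                "scopeLogs": [{"scope": {"name": "ecs-lifecycle-sketch"}, "logRecords": [record]}],
            }
        ]
    }


sample = {
    "detail-type": "ECS Task State Change",
    "detail": {"taskArn": "arn:aws:ecs:us-east-1:123456789012:task/demo/abc", "lastStatus": "STOPPED"},
}
payload = to_otlp_logs(sample, now_ns=1)
log_record = payload["resourceLogs"][0]["scopeLogs"][0]["logRecords"][0]
attrs = {a["key"]: a["value"]["stringValue"] for a in log_record["attributes"]}
print(attrs["event.name"], attrs["aws.ecs.lastStatus"])
```

Keeping the full detail object in the log body while promoting only scalar fields to attributes is one possible trade-off between searchability and attribute cardinality.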