last9/last9-k8s-observability

Last9 OpenTelemetry Operator Setup

Automated setup script for deploying OpenTelemetry Operator, Collector, Kubernetes monitoring, and Events collection to your Kubernetes cluster with Last9 integration.

Features

  • ✅ One-command installation - Deploy everything with a single command
  • ✅ Flexible deployment options - Install only what you need (logs, traces, metrics, events)
  • ✅ Auto-instrumentation - Automatic instrumentation for Java, Python, Node.js, and more
  • ✅ Kubernetes monitoring - Full cluster observability with kube-prometheus-stack
  • ✅ Events collection - Capture and forward Kubernetes events
  • ✅ Cluster identification - Automatic cluster name detection and attribution
  • ✅ Tolerations support - Deploy on tainted nodes (control-plane, spot instances, etc.)
  • ✅ Environment customization - Override deployment environment and cluster name

Quick Start

Prerequisites

  • kubectl configured to access your Kubernetes cluster
  • helm (v3+) installed

Option 1: Install Everything (Recommended)

Installs OpenTelemetry Operator, Collector, Kubernetes monitoring stack, and Events agent:

./last9-otel-setup.sh \
  token="Basic <your-base64-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"
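
The token value has the shape "Basic <base64>". In standard HTTP Basic authentication the base64 part encodes username:password. The sketch below shows that encoding under the assumption that your Last9 token follows the standard scheme; make_basic_token is a hypothetical helper, and a ready-made token from your Last9 account should be preferred where one is provided.

```shell
# Hypothetical helper: builds an HTTP Basic auth header value
# from a username ($1) and password ($2).
make_basic_token() {
  # printf avoids the trailing newline that echo would add to the encoded data;
  # tr strips the line wrapping that some base64 implementations insert.
  printf 'Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}
```

The result can then be passed as token="$(make_basic_token "$user" "$pass")".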

Quick Install (One-liner)

curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh | bash -s -- \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<user>" \
  password="<pass>"

Installation Options

Option 2: Traces Only (Operator + Collector)

For applications that need distributed tracing:

./last9-otel-setup.sh operator-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"

Option 3: Logs Only (Collector without Operator)

For log collection use cases:

./last9-otel-setup.sh logs-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"

Option 4: Metrics Only (Kubernetes Monitoring)

For cluster metrics and monitoring. This is a fully standalone path: no prior install steps are needed.

Step 1: Download the script (skip if you already have it):

curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh \
  -o last9-otel-setup.sh && chmod +x last9-otel-setup.sh

Step 2: Run the monitoring-only install:

./last9-otel-setup.sh monitoring-only \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"

What gets installed (in the last9 namespace):

| Component | Purpose |
| --- | --- |
| kube-prometheus-stack | Prometheus Operator + AlertManager |
| PrometheusAgent | Scrapes cluster metrics, remote-writes to Last9 |
| kube-state-metrics | Kubernetes object state metrics |
| node-exporter | Per-node CPU/memory/disk metrics |

Verify the install:

kubectl get pods -n last9
kubectl get prometheusagent -n last9
kubectl get secrets -n last9 last9-remote-write-secret

Pre-existing Prometheus CRDs (Terraform / prior Helm installs)

If your cluster already has monitoring.coreos.com CRDs managed by Terraform or another Helm release, the script detects this automatically and handles the conflict by upgrading CRD schemas to the required version and skipping conflicting CRD installation steps. No manual intervention needed.
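
To see for yourself which Prometheus Operator CRDs are already present before installing, a check along these lines may help. It is a sketch only and says nothing about how the script performs its own detection; list_prometheus_crds is a hypothetical helper name.

```shell
# Hypothetical helper: lists Prometheus Operator CRDs already in the cluster.
# Filters `kubectl get crd -o name` output down to monitoring.coreos.com entries.
list_prometheus_crds() {
  kubectl get crd -o name 2>/dev/null | grep 'monitoring\.coreos\.com$'
}
```

An empty result means no monitoring.coreos.com CRDs were found in the current context.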

Option 5: Kubernetes Events Only

For Kubernetes events collection:

./last9-otel-setup.sh events-only \
  endpoint="<your-otlp-endpoint>" \
  token="Basic <your-base64-token>" \
  monitoring-endpoint="<your-metrics-endpoint>"

Advanced Configuration

Specify kubectl Context (Multi-Cluster)

On shared machines with multiple clusters, pass context= to pin all operations to a specific kubectl context. Without it, the script uses the currently active context, which could change mid-install on a shared machine.

./last9-otel-setup.sh monitoring-only \
  context="prod-us-east-1" \
  monitoring-endpoint="..." \
  username="..." \
  password="..."

Works with all install modes (monitoring-only, logs-only, operator-only, events-only, full install, uninstall). All kubectl and helm calls use the specified context.
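
Pinning every call to one context can be pictured as thin wrappers around kubectl and helm. This is a minimal sketch, not the script's actual code: kctl, hlm, and the KUBE_CONTEXT variable are hypothetical names, while --context and --kube-context are the standard kubectl and helm flags.

```shell
# Hypothetical wrappers; KUBE_CONTEXT is an illustrative variable name.
# When KUBE_CONTEXT is empty, both commands fall back to the active context.
kctl() {
  if [ -n "${KUBE_CONTEXT:-}" ]; then
    kubectl --context "$KUBE_CONTEXT" "$@"
  else
    kubectl "$@"
  fi
}

hlm() {
  if [ -n "${KUBE_CONTEXT:-}" ]; then
    helm --kube-context "$KUBE_CONTEXT" "$@"
  else
    helm "$@"
  fi
}
```

With KUBE_CONTEXT=prod-us-east-1 set, kctl get pods -n last9 targets that cluster even if someone switches the active context in the meantime.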

List available contexts:

kubectl config get-contexts

Override Cluster Name

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  cluster="prod-us-east-1"

If not provided, the cluster name is auto-detected from kubectl config current-context (or from context= if that was passed).
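
The fallback order described above can be sketched as follows. This illustrates the documented behaviour and is not the script's actual implementation; resolve_cluster_name is a hypothetical helper.

```shell
# Hypothetical sketch of the fallback order; not the script's actual code.
# $1 = value of cluster=, $2 = value of context= (either may be empty).
resolve_cluster_name() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"                 # an explicit cluster= wins
  elif [ -n "$2" ]; then
    printf '%s\n' "$2"                 # otherwise use context=
  else
    kubectl config current-context     # otherwise the active context
  fi
}
```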

Set Deployment Environment

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  env="production"

Default: staging for collector, local for auto-instrumentation.

Deploy with Tolerations

For deploying on nodes with taints (e.g., control-plane, monitoring nodes):

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  tolerations-file=/path/to/tolerations.yaml

Example tolerations files are provided in the examples/ directory:

  • tolerations-all-nodes.yaml - Deploy on all nodes including control-plane
  • tolerations-monitoring-nodes.yaml - Deploy on dedicated monitoring nodes
  • tolerations-spot-instances.yaml - Deploy on spot/preemptible instances
  • tolerations-multi-taint.yaml - Handle multiple taints
  • tolerations-nodeSelector-only.yaml - Use nodeSelector without tolerations
  • tolerations-gpu-nodes.yaml - Deploy on GPU nodes with nvidia.com/gpu taints

TLS Server Name Override (SNI)

Use server-name= when your OTLP / metrics endpoint terminates TLS behind a proxy or load balancer whose certificate name (CN/SAN) differs from the host in the connection URL. For example, the endpoint is a private NLB last9nlb.example.com:443 but the certificate is issued for ingest-internal.<region>.last9.cloud. Without it, TLS verification fails with a hostname mismatch.

./last9-otel-setup.sh \
  token="..." \
  endpoint="last9nlb.example.com:443" \
  monitoring-endpoint="https://last9nlb.example.com/v1/metrics/<cluster-id>/sender/last9/write" \
  server-name="ingest-internal.example.last9.cloud"

When set, it injects (only into the files being installed):

  • tls.server_name_override (with insecure: false) under the otlp/last9 exporter in the collector and events-agent configs.
  • tlsConfig.serverName under the Prometheus remoteWrite entry in the monitoring config.

When omitted (the default), no TLS block is added and the values files are unchanged β€” fully backward compatible.

Note: injection is idempotent β€” re-running with the same server-name= is a no-op. To change an already-injected SNI, edit the TLS block in the values file (or remove it) before re-running; the script will not overwrite an existing override.
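
Before re-running, it can be useful to check whether a values file already carries an override. The sketch below assumes the override appears under one of the two key names listed above (server_name_override or serverName); has_sni_override and sni_action are hypothetical helpers, and the script's own check may work differently.

```shell
# Hypothetical check, assuming the override appears as one of the two keys
# named above; this is not the script's actual implementation.
has_sni_override() {
  grep -Eq '^[[:space:]]*(server_name_override|serverName):' "$1"
}

# Reports what a guarded injection step would do for a values file.
sni_action() {
  if has_sni_override "$1"; then
    echo "skip: override already present in $1"
  else
    echo "inject: no override found in $1"
  fi
}
```

Running sni_action last9-otel-collector-values.yaml shows whether the file needs a manual edit before a different server-name= can take effect.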

Applying SNI to an already-installed cluster

If a release is already deployed, extract its live values, add the TLS block, and upgrade. Pass --reuse-values=false because helm get values already returns the release's complete set of user-supplied values, so there is nothing left to merge from the previous revision:

# Collector
helm get values last9-opentelemetry-collector -n last9 -o yaml > collector-values.yaml
# Under config.exporters.otlp/last9 add:
#   tls:
#     insecure: false
#     server_name_override: ingest-internal.example.last9.cloud
helm upgrade last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  -n last9 -f collector-values.yaml --reuse-values=false

# Events agent (same tls block under config.exporters.otlp/last9)
helm get values last9-kube-events-agent -n last9 -o yaml > events-values.yaml
helm upgrade last9-kube-events-agent open-telemetry/opentelemetry-collector \
  -n last9 -f events-values.yaml --reuse-values=false

# Monitoring (Prometheus): under prometheus.prometheusSpec.remoteWrite[0] add:
#   tlsConfig:
#     serverName: ingest-internal.example.last9.cloud
helm get values last9-k8s-monitoring -n last9 -o yaml > monitoring-values.yaml
helm upgrade last9-k8s-monitoring prometheus-community/kube-prometheus-stack \
  -n last9 -f monitoring-values.yaml --reuse-values=false

Configuration Files

| File | Description |
| --- | --- |
| last9-otel-collector-values.yaml | OpenTelemetry Collector configuration for logs and traces |
| last9-otel-collector-metrics-values.yaml | Optional: Application metrics scraping (Prometheus SD) |
| last9-otel-collector-gpu-values.yaml | Optional: GPU (DCGM) + Ray metrics scraping (includes app metrics) |
| k8s-monitoring-values.yaml | Kube-prometheus-stack configuration for metrics |
| last9-kube-events-agent-values.yaml | Events collection agent configuration |
| collector-svc.yaml | Collector service for application instrumentation |
| instrumentation.yaml | Auto-instrumentation configuration |
| deploy.yaml | Sample application deployment with auto-instrumentation |
| tolerations.yaml | Sample tolerations configuration |

Placeholders

The following placeholders are automatically replaced during installation:

  • {{AUTH_TOKEN}} - Your Last9 authorization token
  • {{OTEL_ENDPOINT}} - Your OTEL endpoint URL
  • {{MONITORING_ENDPOINT}} - Your metrics endpoint URL
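
Placeholder substitution of this kind can be sketched with sed. This is an illustration under the assumption of plain textual replacement; the script's actual mechanism may differ, and escape_sed and render_values are hypothetical helpers.

```shell
# Hypothetical sketch of placeholder substitution with sed.

# Escapes characters that are special in a sed replacement:
# backslash, ampersand, and the | delimiter used below.
escape_sed() {
  printf '%s' "$1" | sed 's/[\\&|]/\\&/g'
}

# Usage: render_values <template> <auth-token> <otlp-endpoint> <metrics-endpoint>
# Writes the rendered file to stdout and leaves the template untouched.
render_values() {
  sed \
    -e "s|{{AUTH_TOKEN}}|$(escape_sed "$2")|g" \
    -e "s|{{OTEL_ENDPOINT}}|$(escape_sed "$3")|g" \
    -e "s|{{MONITORING_ENDPOINT}}|$(escape_sed "$4")|g" \
    "$1"
}
```

Using | as the sed delimiter keeps the slashes in endpoint URLs from needing escapes, and escaping & prevents sed from expanding it to the matched placeholder.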

Uninstallation

Uninstall Everything

./last9-otel-setup.sh uninstall-all

Uninstall Specific Components

# Uninstall only monitoring stack
./last9-otel-setup.sh uninstall function="uninstall_last9_monitoring"

# Uninstall only events agent
./last9-otel-setup.sh uninstall function="uninstall_events_agent"

# Uninstall OpenTelemetry components (operator + collector)
./last9-otel-setup.sh uninstall

Verification

After installation, verify the deployment:

# Check all pods in last9 namespace
kubectl get pods -n last9

# Check collector logs
kubectl logs -n last9 -l app.kubernetes.io/name=opentelemetry-collector

# Check monitoring stack
kubectl get prometheus -n last9

# Check events agent
kubectl get pods -n last9 -l app.kubernetes.io/name=last9-kube-events-agent

Auto-Instrumentation

The script automatically sets up instrumentation for:

  • ☕ Java - Automatic OTLP export
  • 🐍 Python - Automatic OTLP export
  • 🟢 Node.js - Automatic OTLP export
  • 🔵 Go - Manual instrumentation supported
  • 💎 Ruby - Coming soon

Application Metrics Scraping (Optional)

The OpenTelemetry Collector can automatically discover and scrape application metrics using Kubernetes service discovery with Prometheus-compatible scraping.

Note: This is an optional feature. Use last9-otel-collector-metrics-values.yaml to enable metrics scraping.

Enable Metrics Scraping

To enable application metrics scraping, deploy with the additional metrics configuration file:

# Deploy with metrics scraping enabled
helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-metrics-values.yaml

Configure Last9 Metrics Endpoint:

Before deploying, update these placeholders in last9-otel-collector-metrics-values.yaml:

  • {{LAST9_METRICS_ENDPOINT}} - Your Last9 Prometheus remote write URL
  • {{LAST9_METRICS_USERNAME}} - Your Last9 metrics username
  • {{LAST9_METRICS_PASSWORD}} - Your Last9 metrics password

Quick Start

Add these annotations to your pod template or service to enable automatic metrics scraping:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"  # Optional, defaults to /metrics

That's it! Your application metrics will be automatically:

  • Discovered - No manual configuration needed
  • Scraped - Every 30 seconds by default
  • Enriched - With pod, namespace, node labels
  • Exported - To Last9 via Prometheus remote write
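
For a Deployment that is already running, the same annotations could be applied with a merge patch instead of editing the manifest. The sketch below is illustrative: scrape_patch is a hypothetical helper and my-app is the example name from the manifest above. Changing pod template annotations triggers a rollout of the Deployment.

```shell
# Hypothetical helper: prints a merge patch that adds the scrape annotations
# to a Deployment's pod template.
# Usage: scrape_patch <port> [path]
scrape_patch() {
  printf '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true","prometheus.io/port":"%s","prometheus.io/path":"%s"}}}}}\n' \
    "$1" "${2:-/metrics}"
}

# Example (requires cluster access):
#   kubectl patch deployment my-app --type merge -p "$(scrape_patch 8080)"
```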

How It Works

  1. Automatic Discovery - OTel Collector watches Kubernetes API for all pods/services
  2. Annotation-Based Filtering - Only scrapes resources with prometheus.io/scrape: "true"
  3. Metadata Enrichment - Adds Kubernetes labels automatically (pod, namespace, node, app)
  4. Direct Export - Sends metrics to Last9 Prometheus endpoint

Supported Annotations

| Annotation | Required | Default | Description |
| --- | --- | --- | --- |
| prometheus.io/scrape | Yes | - | Set to "true" to enable scraping |
| prometheus.io/port | Yes | - | Port number exposing /metrics |
| prometheus.io/path | No | /metrics | HTTP path for metrics endpoint |
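
How the three annotations combine into a scrape target can be sketched as a small function. This only mirrors the rules listed above; scrape_target is a hypothetical helper, and the plain http scheme is an assumption, since the annotations above do not specify one.

```shell
# Hypothetical helper mirroring the annotation rules above.
# Usage: scrape_target <pod-ip> <scrape> <port> [path]
# Prints the URL that would be scraped, or nothing if the pod is not opted in.
scrape_target() {
  ip=$1 scrape=$2 port=$3 path=${4:-/metrics}
  [ "$scrape" = "true" ] || return 0   # prometheus.io/scrape must be "true"
  [ -n "$port" ] || return 0           # prometheus.io/port is required
  printf 'http://%s:%s%s\n' "$ip" "$port" "$path"   # http scheme is assumed
}
```

For example, a pod at 10.0.0.5 annotated with scrape "true" and port "8080" and no path annotation resolves to http://10.0.0.5:8080/metrics.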

Scaling

This setup scales automatically:

  • 1 service → Automatically scraped
  • 1000 services → Automatically scraped
  • No configuration changes needed when adding new services

Configuration Files

Base Configuration: last9-otel-collector-values.yaml

  • Traces and logs collection
  • Basic OTLP receiver
  • No metrics scraping

App Metrics Configuration: last9-otel-collector-metrics-values.yaml

  • Prometheus receiver with kubernetes_sd_configs for auto-discovery
  • prometheusremotewrite exporter for sending to Last9
  • RBAC for Kubernetes API access
  • Increased resource limits for collector pods
  • BasicAuth extension for Last9 metrics endpoint

GPU + App Metrics Configuration: last9-otel-collector-gpu-values.yaml

  • Everything in the app metrics configuration, plus:
  • DCGM GPU metrics scraping (NVIDIA GPU Operator or GKE-managed DCGM; see variants in values file)
  • Ray head/worker metrics scraping (KubeRay Operator)
  • Cardinality control via metric keep-list for DCGM

Choose ONE metrics overlay:

  • App metrics only: --values last9-otel-collector-values.yaml --values last9-otel-collector-metrics-values.yaml
  • App + GPU metrics: --values last9-otel-collector-values.yaml --values last9-otel-collector-gpu-values.yaml

Verification

Check if metrics are being scraped:

# Check collector logs for scraping
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep kubernetes-pods

# Port-forward to collector metrics endpoint (blocks; leave it running)
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888

# Check scrape status (from a second terminal)
curl -s http://localhost:8888/metrics | grep scrape_samples_scraped

GPU Metrics (DCGM) & Ray Metrics Scraping

GPU and Ray metrics collection is opt-in. Use last9-otel-collector-gpu-values.yaml instead of the base metrics file to enable these scrape jobs. They use label-based discovery, so no annotation changes are needed on DCGM or Ray pods.

Enable GPU Metrics

helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-gpu-values.yaml

Note: Use last9-otel-collector-gpu-values.yaml instead of last9-otel-collector-metrics-values.yaml; the GPU file already includes all application metrics scrape jobs.

Prerequisites

  • DCGM metrics (pick one):
    • Self-managed (EKS, AKS, bare-metal): NVIDIA GPU Operator installed (includes DCGM Exporter with app.kubernetes.io/name=nvidia-dcgm-exporter label)
    • GKE: GPU node pools enabled. GKE auto-deploys the DCGM exporter in the gke-managed-system namespace (label: app.kubernetes.io/name=gke-managed-dcgm-exporter)
  • Ray metrics: KubeRay Operator installed (Ray pods carry ray.io/node-type and ray.io/cluster labels)

Important: The values file ships with two DCGM variants (A and B). Variant A (self-managed NVIDIA GPU Operator) is enabled by default. If you're on GKE, comment out Variant A and uncomment Variant B. See the inline comments in last9-otel-collector-gpu-values.yaml for details.

Scrape Jobs

| Job | Target | Label Selector | Namespace | Port | Interval |
| --- | --- | --- | --- | --- | --- |
| dcgm-gpu-metrics (Variant A) | DCGM Exporter pods | app.kubernetes.io/name=nvidia-dcgm-exporter | All (auto-discovered) | 9400 | 15s |
| dcgm-gpu-metrics (Variant B) | GKE DCGM Exporter pods | app.kubernetes.io/name=gke-managed-dcgm-exporter | gke-managed-system | 9400 | 15s |
| ray-head | Ray head nodes | ray.io/node-type=head | All | 8080 | 30s |
| ray-workers | Ray worker nodes | ray.io/node-type=worker | All | 8080 | 30s |

DCGM Metrics Collected

The DCGM job includes a cardinality keep-list limiting collection to 18 key metrics:

  • Utilization: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_ENC_UTIL, DCGM_FI_DEV_DEC_UTIL
  • Memory: DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_TOTAL
  • Temperature & Power: DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_MEMORY_TEMP, DCGM_FI_DEV_POWER_USAGE
  • Errors: DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
  • PCIe: DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT
  • Clock & Performance: DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_MEM_CLOCK, DCGM_FI_DEV_PSTATE

To add more DCGM metrics, extend the metric_relabel_configs regex in last9-otel-collector-gpu-values.yaml.
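
A keep-list regex of this kind is typically an alternation of metric names. The sketch below shows how such an alternation could be assembled; keep_regex is a hypothetical helper, and the exact form of the regex in the values file may differ, so check the file before editing it.

```shell
# Hypothetical helper: joins metric names with "|" into a regex alternation.
# Usage: keep_regex NAME [NAME...]
keep_regex() {
  # The subshell keeps the IFS change from leaking into the caller.
  ( IFS='|'; printf '%s\n' "$*" )
}
```

For instance, keep_regex DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_FB_USED prints DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED. Appending a further metric name to the argument list extends the alternation in the same way.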

GPU Node Tolerations

If your GPU nodes have nvidia.com/gpu taints, the OTel Collector and node exporter need tolerations to schedule on those nodes:

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  monitoring-endpoint="..." \
  username="..." \
  password="..." \
  tolerations-file=examples/tolerations-gpu-nodes.yaml

See examples/tolerations-gpu-nodes.yaml for the toleration configuration.

Resource Scaling for Large GPU Fleets

For clusters with many GPU nodes, the collector handles additional scrape targets. Suggested resource scaling:

| GPU Nodes | CPU Request/Limit | Memory Request/Limit |
| --- | --- | --- |
| 1-10 | 250m / 500m | 512Mi / 1Gi |
| 10-50 | 500m / 1000m | 1Gi / 2Gi |
| 50-100 | 1000m / 2000m | 2Gi / 4Gi |

Override resources in your Helm values or pass a custom values file.
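
The sizing tiers can be expressed as a lookup, which may be handy in deployment automation. This is a sketch: gpu_collector_resources is a hypothetical helper, the handling of the overlapping boundaries (10 and 50 fall in the lower tier) is an assumption, and fleets above 100 nodes are not covered by the guidance above, so the top tier is only a starting point there.

```shell
# Hypothetical helper encoding the sizing tiers above.
# Usage: gpu_collector_resources <gpu-node-count>
# Prints: <cpu-request> <cpu-limit> <memory-request> <memory-limit>
# Boundary handling (10 and 50 fall in the lower tier) is an assumption.
gpu_collector_resources() {
  if [ "$1" -le 10 ]; then
    echo "250m 500m 512Mi 1Gi"
  elif [ "$1" -le 50 ]; then
    echo "500m 1000m 1Gi 2Gi"
  else
    echo "1000m 2000m 2Gi 4Gi"
  fi
}
```

The four values can then be written into the resources section of a custom values file.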

Detect Your DCGM Variant

Before deploying, check which DCGM exporter is running in your cluster to pick the right variant:

# Check for GKE-managed DCGM exporter (Variant B)
kubectl get pods -n gke-managed-system -l app.kubernetes.io/name=gke-managed-dcgm-exporter

# Check for self-managed NVIDIA GPU Operator DCGM exporter (Variant A)
kubectl get pods -A -l app.kubernetes.io/name=nvidia-dcgm-exporter

  • If the first command returns pods → use Variant B (uncomment it, comment out Variant A)
  • If the second command returns pods → use Variant A (the default, no changes needed)
  • If neither returns pods → your DCGM exporter is not yet installed (see Prerequisites)
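
The two checks and the decision rules can be combined into one helper. This is a minimal sketch that reuses the label selectors shown above; detect_dcgm_variant is a hypothetical name.

```shell
# Hypothetical helper wrapping the two checks above.
# Prints "B" for GKE-managed DCGM, "A" for the NVIDIA GPU Operator, else "none".
detect_dcgm_variant() {
  gke=$(kubectl get pods -n gke-managed-system \
    -l app.kubernetes.io/name=gke-managed-dcgm-exporter -o name 2>/dev/null)
  nvidia=$(kubectl get pods -A \
    -l app.kubernetes.io/name=nvidia-dcgm-exporter -o name 2>/dev/null)
  if [ -n "$gke" ]; then
    echo "B"
  elif [ -n "$nvidia" ]; then
    echo "A"
  else
    echo "none"
  fi
}
```

The GKE check runs first, matching the order of the rules above, so a cluster that somehow has both exporters reports Variant B.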

GPU & Ray Verification

# Verify DCGM exporter pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep dcgm-gpu-metrics

# Verify Ray head/worker pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-head
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-workers

# Check DCGM metrics are being scraped
# (port-forward blocks; leave it running and use a second terminal for curl)
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888
curl -s http://localhost:8888/metrics | grep -c "DCGM_FI_DEV"

# Check Ray metrics are being scraped
curl -s http://localhost:8888/metrics | grep scrape_samples_scraped | grep ray
