Last9 OpenTelemetry Operator Setup

Automated setup script for deploying OpenTelemetry Operator, Collector, Kubernetes monitoring, and Events collection to your Kubernetes cluster with Last9 integration.

Features

✅ One-command installation - Deploy everything with a single command
✅ Flexible deployment options - Install only what you need (logs, traces, metrics, events)
✅ Auto-instrumentation - Automatic instrumentation for Java, Python, and Node.js
✅ Kubernetes monitoring - Full cluster observability with kube-prometheus-stack
✅ Events collection - Capture and forward Kubernetes events
✅ Cluster identification - Automatic cluster name detection and attribution
✅ Tolerations support - Deploy on tainted nodes (control-plane, spot instances, etc.)
✅ Environment customization - Override deployment environment and cluster name

Quick Start

Prerequisites

kubectl configured to access your Kubernetes cluster
helm (v3+) installed
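
The prerequisites can be checked before running the installer. The sketch below is not part of last9-otel-setup.sh; helm_major and check_prereqs are hypothetical helper names, and the version parsing assumes the `helm version --short` output looks like `v3.14.2+gc309b6f`.

```shell
# Extract the major version number from a helm version string.
helm_major() {
  printf '%s\n' "$1" | head -n 1 | sed -e 's/^[^0-9]*//' -e 's/[^0-9].*$//'
}

# Returns 0 when kubectl and helm are on PATH, helm is v3 or newer,
# and kubectl can reach the cluster of the current context.
check_prereqs() {
  for tool in kubectl helm; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      return 1
    fi
  done
  major=$(helm_major "$(helm version --short)")
  case "$major" in
    ''|*[!0-9]*) echo "could not parse helm version" >&2; return 1 ;;
  esac
  if [ "$major" -lt 3 ]; then
    echo "helm v3+ required, found major version $major" >&2
    return 1
  fi
  kubectl cluster-info >/dev/null
}

# Usage (needs a configured cluster):
#   check_prereqs && echo "ready to install"
```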

Option 1: Install Everything (Recommended)

Installs OpenTelemetry Operator, Collector, Kubernetes monitoring stack, and Events agent:

./last9-otel-setup.sh \
  token="Basic <your-base64-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"
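
The token value has the shape of an HTTP Basic credential. If Last9 gives you a ready-made token, pass it verbatim. The sketch below only applies under the assumption that your token is the standard Basic encoding, base64 of `username:password`; make_basic_token is a hypothetical helper, not something the script provides.

```shell
# Build "Basic <base64(user:pass)>" from a username and password.
# printf (not echo) keeps a trailing newline out of the encoded value,
# and tr removes the line wrapping some base64 implementations add.
make_basic_token() {
  printf 'Basic %s' "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}

# Usage:
#   ./last9-otel-setup.sh token="$(make_basic_token "$L9_USER" "$L9_PASS")" endpoint="<your-otlp-endpoint>"
```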

Quick Install (One-liner)

curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh | bash -s -- \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<user>" \
  password="<pass>"

Installation Options

Option 2: Traces Only (Operator + Collector)

For applications that need distributed tracing:

./last9-otel-setup.sh operator-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"

Option 3: Logs Only (Collector without Operator)

For log collection use cases:

./last9-otel-setup.sh logs-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"

Option 4: Metrics Only (Kubernetes Monitoring)

For cluster metrics and monitoring. This is a fully standalone path — no prior install steps needed.

Step 1 — Download the script (skip if you already have it):

curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh \
  -o last9-otel-setup.sh && chmod +x last9-otel-setup.sh

Step 2 — Run monitoring-only install:

./last9-otel-setup.sh monitoring-only \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"

What gets installed (in the last9 namespace):

Component	Purpose
kube-prometheus-stack	Prometheus Operator + AlertManager
PrometheusAgent	Scrapes cluster metrics, remote-writes to Last9
kube-state-metrics	Kubernetes object state metrics
node-exporter	Per-node CPU/memory/disk metrics

Verify the install:

kubectl get pods -n last9
kubectl get prometheusagent -n last9
kubectl get secrets -n last9 last9-remote-write-secret

Pre-existing Prometheus CRDs (Terraform / prior Helm installs)

If your cluster already has monitoring.coreos.com CRDs managed by Terraform or another Helm release, the script detects this automatically and handles the conflict by upgrading CRD schemas to the required version and skipping conflicting CRD installation steps. No manual intervention needed.
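
To see in advance whether your cluster is in this situation, you can list the existing Prometheus Operator CRDs. This is a minimal sketch; prometheus_crds is a hypothetical helper that only filters `kubectl get crd -o name` output and does not reproduce the script's own detection logic.

```shell
# Keep only CRDs in the monitoring.coreos.com API group.
# Reads `kubectl get crd -o name` output on stdin; prints nothing if none exist.
prometheus_crds() {
  grep -F '.monitoring.coreos.com' || true
}

# Usage (needs kubectl):
#   kubectl get crd -o name | prometheus_crds
```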

Option 5: Kubernetes Events Only

For Kubernetes events collection:

./last9-otel-setup.sh events-only \
  endpoint="<your-otlp-endpoint>" \
  token="Basic <your-base64-token>" \
  monitoring-endpoint="<your-metrics-endpoint>"

Advanced Configuration

Specify kubectl Context (Multi-Cluster)

On shared machines with multiple clusters, pass context= to pin all operations to a specific kubectl context. Without it, the current active context is used — which could change mid-install on shared instances.

./last9-otel-setup.sh monitoring-only \
  context="prod-us-east-1" \
  monitoring-endpoint="..." \
  username="..." \
  password="..."

Works with all install modes (monitoring-only, logs-only, operator-only, events-only, full install, uninstall). All kubectl and helm calls use the specified context.

List available contexts:

kubectl config get-contexts
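
A typo in context= is easier to catch before the install starts. The sketch below checks a name against the list from `kubectl config get-contexts -o name`; context_exists is a hypothetical helper, not part of the script.

```shell
# Succeeds when the wanted context ($1) appears as an exact line on stdin.
# -F treats the name literally, -x requires a whole-line match.
context_exists() {
  grep -Fxq -- "$1"
}

# Usage (needs kubectl):
#   if kubectl config get-contexts -o name | context_exists "prod-us-east-1"; then
#     ./last9-otel-setup.sh monitoring-only context="prod-us-east-1" monitoring-endpoint="..." username="..." password="..."
#   fi
```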

Override Cluster Name

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  cluster="prod-us-east-1"

If not provided, the cluster name is auto-detected from kubectl config current-context (or from context= if that was passed).

Set Deployment Environment

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  env="production"

Default: staging for collector, local for auto-instrumentation.

Deploy with Tolerations

For deploying on nodes with taints (e.g., control-plane, monitoring nodes):

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  tolerations-file=/path/to/tolerations.yaml

Example tolerations files are provided in the examples/ directory:

tolerations-all-nodes.yaml - Deploy on all nodes including control-plane
tolerations-monitoring-nodes.yaml - Deploy on dedicated monitoring nodes
tolerations-spot-instances.yaml - Deploy on spot/preemptible instances
tolerations-multi-taint.yaml - Handle multiple taints
tolerations-nodeSelector-only.yaml - Use nodeSelector without tolerations
tolerations-gpu-nodes.yaml - Deploy on GPU nodes with nvidia.com/gpu taints
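
For orientation, a toleration for the standard control-plane taint looks roughly like the file written below. The toleration entry uses ordinary Kubernetes syntax, but the top-level `tolerations:` layout is an assumption about what the script expects; treat the files in examples/ as the authoritative schema. write_control_plane_tolerations is a hypothetical helper.

```shell
# Write a minimal tolerations file to the path given as $1.
# ASSUMPTION: the script accepts a top-level "tolerations:" list.
write_control_plane_tolerations() {
  cat > "$1" <<'EOF'
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
EOF
}

# Usage:
#   write_control_plane_tolerations ./my-tolerations.yaml
#   ./last9-otel-setup.sh token="..." endpoint="..." tolerations-file=./my-tolerations.yaml
```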

TLS Server Name Override (SNI)

Use server-name= when your OTLP / metrics endpoint terminates TLS behind a proxy or load balancer whose certificate name (CN/SAN) differs from the host in the connection URL — for example, the endpoint is a private NLB last9nlb.example.com:443 but the certificate is issued for ingest-internal.<region>.last9.cloud. Without it, TLS verification fails with a hostname mismatch.

./last9-otel-setup.sh \
  token="..." \
  endpoint="last9nlb.example.com:443" \
  monitoring-endpoint="https://last9nlb.example.com/v1/metrics/<cluster-id>/sender/last9/write" \
  server-name="ingest-internal.example.last9.cloud"

When set, it injects (only into the files being installed):

tls.server_name_override (with insecure: false) under the otlp/last9 exporter in the collector and events-agent configs.
tlsConfig.serverName under the Prometheus remoteWrite entry in the monitoring config.

When omitted (the default), no TLS block is added and the values files are unchanged — fully backward compatible.

Note: injection is idempotent — re-running with the same server-name= is a no-op. To change an already-injected SNI, edit the TLS block in the values file (or remove it) before re-running; the script will not overwrite an existing override.
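
Because an existing override is never overwritten, it helps to check a values file before re-running with a different name. The sketch below looks for either key named above; has_sni_override is a hypothetical helper and does not mirror the script's internal check.

```shell
# Succeeds when the values file ($1) already carries an SNI override,
# either server_name_override (collector, events agent) or serverName (Prometheus).
has_sni_override() {
  grep -Eq '^[[:space:]]*(server_name_override|serverName):' "$1"
}

# Usage:
#   if has_sni_override last9-otel-collector-values.yaml; then
#     echo "edit or remove the existing TLS block before changing server-name=" >&2
#   fi
```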

Applying SNI to an already-installed cluster

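If a release is already deployed, extract its live values, add the TLS block, and upgrade. Pass --reuse-values=false because the file written by helm get values already contains every user-supplied value for the release, so nothing needs to be carried over from the previous revision: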

# Collector
helm get values last9-opentelemetry-collector -n last9 -o yaml > collector-values.yaml
# Under config.exporters.otlp/last9 add:
#   tls:
#     insecure: false
#     server_name_override: ingest-internal.example.last9.cloud
helm upgrade last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  -n last9 -f collector-values.yaml --reuse-values=false

# Events agent (same tls block under config.exporters.otlp/last9)
helm get values last9-kube-events-agent -n last9 -o yaml > events-values.yaml
helm upgrade last9-kube-events-agent open-telemetry/opentelemetry-collector \
  -n last9 -f events-values.yaml --reuse-values=false

# Monitoring (Prometheus): under prometheus.prometheusSpec.remoteWrite[0] add:
#   tlsConfig:
#     serverName: ingest-internal.example.last9.cloud
helm get values last9-k8s-monitoring -n last9 -o yaml > monitoring-values.yaml
helm upgrade last9-k8s-monitoring prometheus-community/kube-prometheus-stack \
  -n last9 -f monitoring-values.yaml --reuse-values=false
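
As an alternative to hand-editing the extracted collector values, the TLS block can live in a small overlay file passed as a second -f. Helm deep-merges maps from multiple values files, with later files taking precedence. This is a sketch under that approach; write_collector_sni_overlay and the file name sni-overlay.yaml are hypothetical. It does not suit the Prometheus remoteWrite entry, because Helm replaces lists wholesale rather than merging them.

```shell
# Write an overlay that sets only the TLS block of the otlp/last9 exporter.
# $1 = name the certificate is issued for, $2 = output file
write_collector_sni_overlay() {
  cat > "$2" <<EOF
config:
  exporters:
    otlp/last9:
      tls:
        insecure: false
        server_name_override: $1
EOF
}

# Usage (needs helm and the extracted collector-values.yaml):
#   write_collector_sni_overlay ingest-internal.example.last9.cloud sni-overlay.yaml
#   helm upgrade last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
#     -n last9 -f collector-values.yaml -f sni-overlay.yaml --reuse-values=false
```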

Configuration Files

File	Description
`last9-otel-collector-values.yaml`	OpenTelemetry Collector configuration for logs and traces
`last9-otel-collector-metrics-values.yaml`	Optional: Application metrics scraping (Prometheus SD)
`last9-otel-collector-gpu-values.yaml`	Optional: GPU (DCGM) + Ray metrics scraping (includes app metrics)
`k8s-monitoring-values.yaml`	Kube-prometheus-stack configuration for metrics
`last9-kube-events-agent-values.yaml`	Events collection agent configuration
`collector-svc.yaml`	Collector service for application instrumentation
`instrumentation.yaml`	Auto-instrumentation configuration
`deploy.yaml`	Sample application deployment with auto-instrumentation
`tolerations.yaml`	Sample tolerations configuration

Placeholders

The following placeholders are automatically replaced during installation:

{{AUTH_TOKEN}} - Your Last9 authorization token
{{OTEL_ENDPOINT}} - Your OTLP endpoint URL
{{MONITORING_ENDPOINT}} - Your metrics endpoint URL
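
The substitution can be pictured as a plain text replacement over each values file. The sketch below is an illustration with sed, not the script's actual implementation; render_placeholders and sed_escape are hypothetical names. Values are escaped first so that characters such as `&` in a URL are inserted literally.

```shell
# Escape the characters that are special in a sed replacement using "|" as delimiter.
sed_escape() {
  printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'
}

# Replace the three placeholders in the text on stdin.
# $1 = auth token, $2 = OTLP endpoint, $3 = metrics endpoint
render_placeholders() {
  sed -e "s|{{AUTH_TOKEN}}|$(sed_escape "$1")|g" \
      -e "s|{{OTEL_ENDPOINT}}|$(sed_escape "$2")|g" \
      -e "s|{{MONITORING_ENDPOINT}}|$(sed_escape "$3")|g"
}

# Usage:
#   render_placeholders "Basic <your-token>" "<your-otlp-endpoint>" "<your-metrics-endpoint>" \
#     < last9-otel-collector-values.yaml > rendered-values.yaml
```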

Uninstallation

Uninstall Everything

./last9-otel-setup.sh uninstall-all

Uninstall Specific Components

# Uninstall only monitoring stack
./last9-otel-setup.sh uninstall function="uninstall_last9_monitoring"

# Uninstall only events agent
./last9-otel-setup.sh uninstall function="uninstall_events_agent"

# Uninstall OpenTelemetry components (operator + collector)
./last9-otel-setup.sh uninstall

Verification

After installation, verify the deployment:

# Check all pods in last9 namespace
kubectl get pods -n last9

# Check collector logs
kubectl logs -n last9 -l app.kubernetes.io/name=opentelemetry-collector

# Check monitoring stack
kubectl get prometheus -n last9

# Check events agent
kubectl get pods -n last9 -l app.kubernetes.io/name=last9-kube-events-agent
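
For a scripted check, the pod listing can be reduced to the pods that still need attention. This is a minimal sketch that assumes the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS); unhealthy_pods is a hypothetical helper.

```shell
# Print the names of pods that are neither fully ready nor completed.
# Reads `kubectl get pods --no-headers` output on stdin.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column, e.g. "1/2"
    if ($3 == "Completed") next      # finished jobs are fine
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Usage (needs kubectl):
#   kubectl get pods -n last9 --no-headers | unhealthy_pods
```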

Auto-Instrumentation

Instrumentation support by language:

☕ Java - Automatic OTLP export
🐍 Python - Automatic OTLP export
🟢 Node.js - Automatic OTLP export
🔵 Go - Manual instrumentation supported
💎 Ruby - Coming soon

Application Metrics Scraping (Optional)

The OpenTelemetry Collector can automatically discover and scrape application metrics using Kubernetes service discovery with Prometheus-compatible scraping.

Note: This is an optional feature. Use last9-otel-collector-metrics-values.yaml to enable metrics scraping.

Enable Metrics Scraping

To enable application metrics scraping, deploy with the additional metrics configuration file:

# Deploy with metrics scraping enabled
helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-metrics-values.yaml

Configure Last9 Metrics Endpoint:

Before deploying, update these placeholders in last9-otel-collector-metrics-values.yaml:

{{LAST9_METRICS_ENDPOINT}} - Your Last9 Prometheus remote write URL
{{LAST9_METRICS_USERNAME}} - Your Last9 metrics username
{{LAST9_METRICS_PASSWORD}} - Your Last9 metrics password

Quick Start

Add these annotations to your pod template or service to enable automatic metrics scraping:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"  # Optional, defaults to /metrics

That's it! Your application metrics will be automatically:

Discovered - No manual configuration needed
Scraped - Every 30 seconds by default
Enriched - With pod, namespace, node labels
Exported - To Last9 via Prometheus remote write

How It Works

Automatic Discovery - OTel Collector watches Kubernetes API for all pods/services
Annotation-Based Filtering - Only scrapes resources with prometheus.io/scrape: "true"
Metadata Enrichment - Adds Kubernetes labels automatically (pod, namespace, node, app)
Direct Export - Sends metrics to Last9 Prometheus endpoint

Supported Annotations

Annotation	Required	Default	Description
`prometheus.io/scrape`	Yes	-	Set to "true" to enable scraping
`prometheus.io/port`	Yes	-	Port number exposing /metrics
`prometheus.io/path`	No	/metrics	HTTP path for metrics endpoint
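
For a workload that is already deployed, the same annotations can be added with a merge patch instead of editing the manifest. The sketch below only builds the patch document; scrape_patch is a hypothetical helper, and my-app is the example Deployment name used above.

```shell
# Print a JSON merge patch that adds the scrape annotations to a pod template.
# $1 = metrics port, $2 = metrics path (optional, defaults to /metrics)
scrape_patch() {
  printf '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true","prometheus.io/port":"%s","prometheus.io/path":"%s"}}}}}' \
    "$1" "${2:-/metrics}"
}

# Usage (needs kubectl; changing the pod template triggers a rollout):
#   kubectl patch deployment my-app --type merge -p "$(scrape_patch 8080)"
```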

Scaling

Discovery needs no per-service configuration:

1 service → Automatically scraped
1000 services → Automatically scraped
No configuration changes needed when adding new services

Configuration Files

Base Configuration: last9-otel-collector-values.yaml

Traces and logs collection
Basic OTLP receiver
No metrics scraping

App Metrics Configuration: last9-otel-collector-metrics-values.yaml

Prometheus receiver with kubernetes_sd_configs for auto-discovery
prometheusremotewrite exporter for sending to Last9
RBAC for Kubernetes API access
Increased resource limits for collector pods
BasicAuth extension for Last9 metrics endpoint

GPU + App Metrics Configuration: last9-otel-collector-gpu-values.yaml

Everything in the app metrics configuration, plus:
DCGM GPU metrics scraping (NVIDIA GPU Operator or GKE-managed DCGM — see variants in values file)
Ray head/worker metrics scraping (KubeRay Operator)
Cardinality control via metric keep-list for DCGM

Choose ONE metrics overlay:

App metrics only: --values last9-otel-collector-values.yaml --values last9-otel-collector-metrics-values.yaml
App + GPU metrics: --values last9-otel-collector-values.yaml --values last9-otel-collector-gpu-values.yaml

Verification

Check if metrics are being scraped:

# Check collector logs for scraping
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep kubernetes-pods

# Port-forward to collector metrics endpoint (this command blocks; run the curl from a second terminal)
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888

# Check scrape status
curl http://localhost:8888/metrics | grep scrape_samples_scraped

GPU Metrics (DCGM) & Ray Metrics Scraping

GPU and Ray metrics collection is opt-in. To enable these scrape jobs, use last9-otel-collector-gpu-values.yaml in place of last9-otel-collector-metrics-values.yaml. The jobs use label-based discovery, so DCGM and Ray pods need no annotation changes.

Enable GPU Metrics

helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-gpu-values.yaml

Note: Use last9-otel-collector-gpu-values.yaml instead of last9-otel-collector-metrics-values.yaml — the GPU file already includes all application metrics scrape jobs.

Prerequisites

DCGM metrics (pick one):
- Self-managed (EKS, AKS, bare-metal): NVIDIA GPU Operator installed (includes DCGM Exporter with app.kubernetes.io/name=nvidia-dcgm-exporter label)
- GKE: GPU node pools enabled — GKE auto-deploys DCGM exporter in gke-managed-system namespace (label: app.kubernetes.io/name=gke-managed-dcgm-exporter)
Ray metrics: KubeRay Operator installed (Ray pods carry ray.io/node-type and ray.io/cluster labels)

Important: The values file ships with two DCGM variants (A and B). Variant A (self-managed NVIDIA GPU Operator) is enabled by default. If you're on GKE, comment out Variant A and uncomment Variant B. See the inline comments in last9-otel-collector-gpu-values.yaml for details.

Scrape Jobs

Job	Target	Label Selector	Namespace	Port	Interval
`dcgm-gpu-metrics` (Variant A)	DCGM Exporter pods	`app.kubernetes.io/name=nvidia-dcgm-exporter`	All (auto-discovered)	9400	15s
`dcgm-gpu-metrics` (Variant B)	GKE DCGM Exporter pods	`app.kubernetes.io/name=gke-managed-dcgm-exporter`	`gke-managed-system`	9400	15s
`ray-head`	Ray head nodes	`ray.io/node-type=head`	All	8080	30s
`ray-workers`	Ray worker nodes	`ray.io/node-type=worker`	All	8080	30s

DCGM Metrics Collected

The DCGM job includes a cardinality keep-list limiting collection to 18 key metrics:

Utilization: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_ENC_UTIL, DCGM_FI_DEV_DEC_UTIL
Memory: DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_TOTAL
Temperature & Power: DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_MEMORY_TEMP, DCGM_FI_DEV_POWER_USAGE
Errors: DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
PCIe: DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT
Clock & Performance: DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_MEM_CLOCK, DCGM_FI_DEV_PSTATE

To add more DCGM metrics, extend the metric_relabel_configs regex in last9-otel-collector-gpu-values.yaml.
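
Before editing the keep-list, it can help to test the extended pattern locally. Prometheus relabel regexes are anchored at both ends, which `grep -x` imitates here. This is a sketch under the assumption that the keep-list is a plain alternation of metric names; check the values file for its exact form. keep_regex and is_kept are hypothetical helpers, and DCGM_FI_PROF_GR_ENGINE_ACTIVE is just an example of a metric you might add.

```shell
# Join metric names (one per line on stdin) into "A|B|C".
keep_regex() {
  paste -sd '|' -
}

# Succeeds when metric name $2 fully matches alternation $1.
is_kept() {
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Usage:
#   regex=$(printf '%s\n' DCGM_FI_DEV_GPU_UTIL DCGM_FI_PROF_GR_ENGINE_ACTIVE | keep_regex)
#   is_kept "$regex" DCGM_FI_PROF_GR_ENGINE_ACTIVE && echo kept
```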

GPU Node Tolerations

If your GPU nodes have nvidia.com/gpu taints, the OTel Collector and node exporter need tolerations to schedule on those nodes:

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  monitoring-endpoint="..." \
  username="..." \
  password="..." \
  tolerations-file=examples/tolerations-gpu-nodes.yaml

See examples/tolerations-gpu-nodes.yaml for the toleration configuration.

Resource Scaling for Large GPU Fleets

Each GPU node adds scrape targets, so the collector's resource needs grow with the size of the fleet. Suggested resource scaling:

GPU Nodes	CPU Request/Limit	Memory Request/Limit
1-10	250m / 500m	512Mi / 1Gi
11-50	500m / 1000m	1Gi / 2Gi
51-100	1000m / 2000m	2Gi / 4Gi

Override resources in your Helm values or pass a custom values file.
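
The table can be turned into a small lookup when sizing is scripted. The sketch below simply encodes the suggested tiers; gpu_fleet_resources is a hypothetical helper, boundary counts (10 and 50) are assigned to the lower tier, and fleets above 100 nodes are outside the table, so the function reports failure for them.

```shell
# Print "cpu=<request>/<limit> memory=<request>/<limit>" for a GPU node count.
gpu_fleet_resources() {
  n=$1
  if [ "$n" -lt 1 ] || [ "$n" -gt 100 ]; then
    echo "no suggestion for $n GPU nodes" >&2
    return 1
  elif [ "$n" -le 10 ]; then
    echo "cpu=250m/500m memory=512Mi/1Gi"
  elif [ "$n" -le 50 ]; then
    echo "cpu=500m/1000m memory=1Gi/2Gi"
  else
    echo "cpu=1000m/2000m memory=2Gi/4Gi"
  fi
}

# Usage:
#   gpu_fleet_resources 25
```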

Detect Your DCGM Variant

Before deploying, check which DCGM exporter is running in your cluster to pick the right variant:

# Check for GKE-managed DCGM exporter (Variant B)
kubectl get pods -n gke-managed-system -l app.kubernetes.io/name=gke-managed-dcgm-exporter

# Check for self-managed NVIDIA GPU Operator DCGM exporter (Variant A)
kubectl get pods -A -l app.kubernetes.io/name=nvidia-dcgm-exporter

If the first command returns pods → use Variant B (uncomment it, comment out Variant A)
If the second command returns pods → use Variant A (the default, no changes needed)
If neither returns pods → your DCGM exporter is not yet installed (see Prerequisites)
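
The decision above can be scripted from the two pod counts. This is a minimal sketch; pick_dcgm_variant is a hypothetical helper, and it gives the GKE-managed exporter precedence in the same order as the checks listed above.

```shell
# Print the DCGM variant to use: B, A, or none.
# $1 = number of GKE-managed exporter pods, $2 = number of GPU Operator exporter pods
pick_dcgm_variant() {
  if [ "$1" -gt 0 ]; then
    echo B
  elif [ "$2" -gt 0 ]; then
    echo A
  else
    echo none
  fi
}

# Usage (needs kubectl; tr strips the padding some wc implementations add):
#   gke=$(kubectl get pods -n gke-managed-system -l app.kubernetes.io/name=gke-managed-dcgm-exporter --no-headers 2>/dev/null | wc -l | tr -d ' ')
#   nvidia=$(kubectl get pods -A -l app.kubernetes.io/name=nvidia-dcgm-exporter --no-headers 2>/dev/null | wc -l | tr -d ' ')
#   pick_dcgm_variant "$gke" "$nvidia"
```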

GPU & Ray Verification

# Verify DCGM exporter pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep dcgm-gpu-metrics

# Verify Ray head/worker pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-head
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-workers

# Check DCGM metrics are being scraped
# (port-forward blocks; keep it running in one terminal and run the curl commands in another)
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888
curl -s http://localhost:8888/metrics | grep -c "DCGM_FI_DEV"

# Check Ray metrics are being scraped
curl -s http://localhost:8888/metrics | grep scrape_samples_scraped | grep ray
