Automated setup script for deploying OpenTelemetry Operator, Collector, Kubernetes monitoring, and Events collection to your Kubernetes cluster with Last9 integration.
- One-command installation - Deploy everything with a single command
- Flexible deployment options - Install only what you need (logs, traces, metrics, events)
- Auto-instrumentation - Automatic instrumentation for Java, Python, Node.js, and more
- Kubernetes monitoring - Full cluster observability with kube-prometheus-stack
- Events collection - Capture and forward Kubernetes events
- Cluster identification - Automatic cluster name detection and attribution
- Tolerations support - Deploy on tainted nodes (control-plane, spot instances, etc.)
- Environment customization - Override deployment environment and cluster name
- kubectl configured to access your Kubernetes cluster
- helm (v3+) installed
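A quick preflight check can confirm both prerequisites before running the script. This is a minimal sketch: the function names `require_cmd`, `helm_major`, and `preflight` are illustrative and are not part of last9-otel-setup.sh.

```shell
#!/bin/sh
# Hypothetical helpers (not part of last9-otel-setup.sh).

# Succeeds only if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Extracts the major version from a Helm version string such as
# "v3.14.2+gc309b6f". Prints nothing if the string has another shape.
helm_major() {
  printf '%s\n' "$1" | sed -n 's/^v\([0-9][0-9]*\)\..*/\1/p'
}

# Runs all checks; returns non-zero on the first failure.
preflight() {
  require_cmd kubectl || { echo "kubectl not found" >&2; return 1; }
  require_cmd helm || { echo "helm not found" >&2; return 1; }
  major=$(helm_major "$(helm version --short 2>/dev/null)")
  [ "${major:-0}" -ge 3 ] || { echo "helm v3+ required" >&2; return 1; }
  kubectl cluster-info >/dev/null 2>&1 || { echo "cluster not reachable" >&2; return 1; }
}
```

Calling `preflight && ./last9-otel-setup.sh ...` would then stop early if a tool is missing or the cluster cannot be reached.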
Installs OpenTelemetry Operator, Collector, Kubernetes monitoring stack, and Events agent:
./last9-otel-setup.sh \
token="Basic <your-base64-token>" \
endpoint="<your-otlp-endpoint>" \
monitoring-endpoint="<your-metrics-endpoint>" \
username="<your-username>" \
password="<your-password>"

curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh | bash -s -- \
token="Basic <your-token>" \
endpoint="<your-otlp-endpoint>" \
monitoring-endpoint="<your-metrics-endpoint>" \
username="<user>" \
password="<pass>"

For applications that need distributed tracing:
./last9-otel-setup.sh operator-only \
token="Basic <your-token>" \
endpoint="<your-otlp-endpoint>"

For log collection use cases:
./last9-otel-setup.sh logs-only \
token="Basic <your-token>" \
endpoint="<your-otlp-endpoint>"

For cluster metrics and monitoring. This path is fully standalone; no prior install steps are needed.
Step 1: Download the script (skip if you already have it):
curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh \
-o last9-otel-setup.sh && chmod +x last9-otel-setup.sh

Step 2: Run monitoring-only install:
./last9-otel-setup.sh monitoring-only \
monitoring-endpoint="<your-metrics-endpoint>" \
username="<your-username>" \
password="<your-password>"

What gets installed (in the last9 namespace):
| Component | Purpose |
|---|---|
| kube-prometheus-stack | Prometheus Operator + AlertManager |
| PrometheusAgent | Scrapes cluster metrics, remote-writes to Last9 |
| kube-state-metrics | Kubernetes object state metrics |
| node-exporter | Per-node CPU/memory/disk metrics |
Verify the install:
kubectl get pods -n last9
kubectl get prometheusagent -n last9
kubectl get secrets -n last9 last9-remote-write-secret

Pre-existing Prometheus CRDs (Terraform / prior Helm installs)
If your cluster already has monitoring.coreos.com CRDs managed by Terraform or another Helm release, the script detects this automatically and handles the conflict by upgrading CRD schemas to the required version and skipping conflicting CRD installation steps. No manual intervention needed.
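To see ahead of time whether this situation applies to a cluster, the existing Prometheus Operator CRDs can be listed. This is a minimal sketch: `filter_prometheus_crds` is an illustrative helper, not something the script exposes, and it only reports which CRDs exist, not who manages them.

```shell
# Hypothetical helper: keeps only CRD names in the monitoring.coreos.com
# API group. Expects `kubectl get crd -o name` output on stdin.
# `|| true` keeps the exit status zero when nothing matches.
filter_prometheus_crds() {
  grep '\.monitoring\.coreos\.com$' || true
}
```

Usage: `kubectl get crd -o name | filter_prometheus_crds`. Empty output suggests a clean cluster; any output means the conflict handling described above would likely come into play.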
For Kubernetes events collection:
./last9-otel-setup.sh events-only \
endpoint="<your-otlp-endpoint>" \
token="Basic <your-base64-token>" \
monitoring-endpoint="<your-metrics-endpoint>"

On shared machines with multiple clusters, pass context= to pin all operations to a specific kubectl context. Without it, the current active context is used, which could change mid-install on shared instances.
./last9-otel-setup.sh monitoring-only \
context="prod-us-east-1" \
monitoring-endpoint="..." \
username="..." \
password="..."

Works with all install modes (monitoring-only, logs-only, operator-only, events-only, full install, uninstall). All kubectl and helm calls use the specified context.
List available contexts:
kubectl config get-contexts

./last9-otel-setup.sh \
token="..." \
endpoint="..." \
cluster="prod-us-east-1"

If not provided, the cluster name is auto-detected from kubectl config current-context (or from context= if that was passed).
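The fallback order described above can be sketched as a small function. This is only an illustration of the precedence (explicit cluster=, then context=, then the active kubectl context); `resolve_cluster_name` is a hypothetical name, and the script may normalize the detected name further.

```shell
# Hypothetical helper illustrating the cluster-name precedence.
#   $1 = value of cluster= (may be empty)
#   $2 = value of context= (may be empty)
resolve_cluster_name() {
  cluster=$1
  context=$2
  if [ -n "$cluster" ]; then
    printf '%s\n' "$cluster"
  elif [ -n "$context" ]; then
    printf '%s\n' "$context"
  else
    # Falls back to whatever context kubectl currently points at.
    kubectl config current-context
  fi
}
```

For example, `resolve_cluster_name "" "prod-us-east-1"` prints the context name, because no explicit cluster name was given.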
./last9-otel-setup.sh \
token="..." \
endpoint="..." \
env="production"

Default: staging for collector, local for auto-instrumentation.
For deploying on nodes with taints (e.g., control-plane, monitoring nodes):
./last9-otel-setup.sh \
token="..." \
endpoint="..." \
tolerations-file=/path/to/tolerations.yaml

Example tolerations files are provided in the examples/ directory:
- tolerations-all-nodes.yaml - Deploy on all nodes including control-plane
- tolerations-monitoring-nodes.yaml - Deploy on dedicated monitoring nodes
- tolerations-spot-instances.yaml - Deploy on spot/preemptible instances
- tolerations-multi-taint.yaml - Handle multiple taints
- tolerations-nodeSelector-only.yaml - Use nodeSelector without tolerations
- tolerations-gpu-nodes.yaml - Deploy on GPU nodes with nvidia.com/gpu taints
Use server-name= when your OTLP / metrics endpoint terminates TLS behind a proxy or load balancer whose certificate name (CN/SAN) differs from the host in the connection URL. For example, the endpoint is a private NLB last9nlb.example.com:443 but the certificate is issued for ingest-internal.<region>.last9.cloud. Without it, TLS verification fails with a hostname mismatch.
./last9-otel-setup.sh \
token="..." \
endpoint="last9nlb.example.com:443" \
monitoring-endpoint="https://last9nlb.example.com/v1/metrics/<cluster-id>/sender/last9/write" \
server-name="ingest-internal.example.last9.cloud"

When set, it injects (only into the files being installed):
- tls.server_name_override (with insecure: false) under the otlp/last9 exporter in the collector and events-agent configs.
- tlsConfig.serverName under the Prometheus remoteWrite entry in the monitoring config.
When omitted (the default), no TLS block is added and the values files are unchanged, so the behavior is fully backward compatible.
Note: injection is idempotent; re-running with the same server-name= is a no-op. To change an already-injected SNI, edit the TLS block in the values file (or remove it) before re-running; the script will not overwrite an existing override.
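To confirm that a hostname mismatch is the actual cause before setting server-name=, the certificate presented by the endpoint can be inspected with openssl. This is a minimal sketch: `split_endpoint` and `show_cert_names` are illustrative helpers, the port is assumed to default to 443, and the `-ext` flag assumes OpenSSL 1.1.1 or newer.

```shell
# Hypothetical helper: prints "host port" for an endpoint such as
# "last9nlb.example.com:443" or "https://last9nlb.example.com/v1/...".
# The port defaults to 443 when the endpoint does not specify one.
split_endpoint() {
  hostport=${1#*://}        # drop the scheme, if any
  hostport=${hostport%%/*}  # drop the path, if any
  host=${hostport%%:*}
  case $hostport in
    *:*) port=${hostport##*:} ;;
    *)   port=443 ;;
  esac
  printf '%s %s\n' "$host" "$port"
}

# Hypothetical helper: prints the subject and SANs of the certificate
# served at endpoint $1 when SNI name $2 is requested.
show_cert_names() {
  sni=$2
  # Word splitting is intended here: sets $1=host and $2=port.
  set -- $(split_endpoint "$1")
  openssl s_client -connect "$1:$2" -servername "$sni" </dev/null 2>/dev/null |
    openssl x509 -noout -subject -ext subjectAltName
}
```

Running `show_cert_names last9nlb.example.com:443 ingest-internal.example.last9.cloud` should list the names the certificate covers; if the connection host is not among them, server-name= is the setting to reach for.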
If a release is already deployed, extract its live values, add the TLS block, and upgrade. Use --reuse-values=false because helm get values already returns the full merged value set:
# Collector
helm get values last9-opentelemetry-collector -n last9 -o yaml > collector-values.yaml
# Under config.exporters.otlp/last9 add:
#   tls:
#     insecure: false
#     server_name_override: ingest-internal.example.last9.cloud
helm upgrade last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
-n last9 -f collector-values.yaml --reuse-values=false
# Events agent (same tls block under config.exporters.otlp/last9)
helm get values last9-kube-events-agent -n last9 -o yaml > events-values.yaml
helm upgrade last9-kube-events-agent open-telemetry/opentelemetry-collector \
-n last9 -f events-values.yaml --reuse-values=false
# Monitoring (Prometheus): under prometheus.prometheusSpec.remoteWrite[0] add:
#   tlsConfig:
#     serverName: ingest-internal.example.last9.cloud
helm get values last9-k8s-monitoring -n last9 -o yaml > monitoring-values.yaml
helm upgrade last9-k8s-monitoring prometheus-community/kube-prometheus-stack \
-n last9 -f monitoring-values.yaml --reuse-values=false

| File | Description |
|---|---|
| last9-otel-collector-values.yaml | OpenTelemetry Collector configuration for logs and traces |
| last9-otel-collector-metrics-values.yaml | Optional: Application metrics scraping (Prometheus SD) |
| last9-otel-collector-gpu-values.yaml | Optional: GPU (DCGM) + Ray metrics scraping (includes app metrics) |
| k8s-monitoring-values.yaml | Kube-prometheus-stack configuration for metrics |
| last9-kube-events-agent-values.yaml | Events collection agent configuration |
| collector-svc.yaml | Collector service for application instrumentation |
| instrumentation.yaml | Auto-instrumentation configuration |
| deploy.yaml | Sample application deployment with auto-instrumentation |
| tolerations.yaml | Sample tolerations configuration |
The following placeholders are automatically replaced during installation:
- {{AUTH_TOKEN}} - Your Last9 authorization token
- {{OTEL_ENDPOINT}} - Your OTEL endpoint URL
- {{MONITORING_ENDPOINT}} - Your metrics endpoint URL
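Conceptually, this replacement is a template substitution over the values files. The sketch below shows one way to do the same thing by hand with sed; `render_values` is an illustrative function, and the script's actual implementation may differ.

```shell
# Hypothetical helper: substitutes the three placeholders in a values
# template read from stdin and writes the result to stdout.
#   $1 = auth token, $2 = OTLP endpoint, $3 = metrics endpoint
# "|" is used as the sed delimiter so URLs containing "/" need no escaping.
# Values that themselves contain "|", "&", or "\" would still need escaping.
render_values() {
  sed -e "s|{{AUTH_TOKEN}}|$1|g" \
      -e "s|{{OTEL_ENDPOINT}}|$2|g" \
      -e "s|{{MONITORING_ENDPOINT}}|$3|g"
}
```

For example, `render_values "Basic <token>" "<otlp-endpoint>" "<metrics-endpoint>" < last9-otel-collector-values.yaml > rendered.yaml` would produce a filled-in copy that can be reviewed before any Helm command touches the cluster.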
./last9-otel-setup.sh uninstall-all

# Uninstall only monitoring stack
./last9-otel-setup.sh uninstall function="uninstall_last9_monitoring"
# Uninstall only events agent
./last9-otel-setup.sh uninstall function="uninstall_events_agent"
# Uninstall OpenTelemetry components (operator + collector)
./last9-otel-setup.sh uninstall

After installation, verify the deployment:
# Check all pods in last9 namespace
kubectl get pods -n last9
# Check collector logs
kubectl logs -n last9 -l app.kubernetes.io/name=opentelemetry-collector
# Check monitoring stack
kubectl get prometheus -n last9
# Check events agent
kubectl get pods -n last9 -l app.kubernetes.io/name=last9-kube-events-agent

The script automatically sets up instrumentation for:
- Java - Automatic OTLP export
- Python - Automatic OTLP export
- Node.js - Automatic OTLP export
- Go - Manual instrumentation supported
- Ruby - Coming soon
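With the OpenTelemetry Operator, auto-instrumentation is typically switched on per workload through a pod-template annotation. The sketch below maps a language to the operator's standard annotation key. `inject_annotation` and `enable_autoinstrumentation` are illustrative helpers, and whether the annotation value should be "true" or the name of a specific Instrumentation resource depends on how instrumentation.yaml is set up, which is an assumption here.

```shell
# Hypothetical helper: prints the OpenTelemetry Operator injection
# annotation for the languages listed above as automatic.
inject_annotation() {
  case $1 in
    java)   echo 'instrumentation.opentelemetry.io/inject-java' ;;
    python) echo 'instrumentation.opentelemetry.io/inject-python' ;;
    nodejs) echo 'instrumentation.opentelemetry.io/inject-nodejs' ;;
    *)      echo "no automatic instrumentation for: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: adds the annotation to a Deployment's pod template.
#   $1 = deployment name, $2 = language (java, python, nodejs)
# Assumes "true" selects the Instrumentation resource in the same namespace.
enable_autoinstrumentation() {
  key=$(inject_annotation "$2") || return 1
  kubectl patch deployment "$1" --type merge \
    -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"$key\":\"true\"}}}}}"
}
```

Patching the pod template triggers a rollout, so new pods pick up the injected agent; `enable_autoinstrumentation my-app java` is the kind of call this enables.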
The OpenTelemetry Collector can automatically discover and scrape application metrics using Kubernetes service discovery with Prometheus-compatible scraping.
Note: This is an optional feature. Use last9-otel-collector-metrics-values.yaml to enable metrics scraping.
To enable application metrics scraping, deploy with the additional metrics configuration file:
# Deploy with metrics scraping enabled
helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
--namespace last9 \
--version 0.125.0 \
--values last9-otel-collector-values.yaml \
--values last9-otel-collector-metrics-values.yaml

Configure Last9 Metrics Endpoint:
Before deploying, update these placeholders in last9-otel-collector-metrics-values.yaml:
- {{LAST9_METRICS_ENDPOINT}} - Your Last9 Prometheus remote write URL
- {{LAST9_METRICS_USERNAME}} - Your Last9 metrics username
- {{LAST9_METRICS_PASSWORD}} - Your Last9 metrics password
Add these annotations to your pod template or service to enable automatic metrics scraping:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"  # Optional, defaults to /metrics

That's it! Your application metrics will be automatically:
- Discovered - No manual configuration needed
- Scraped - Every 30 seconds by default
- Enriched - With pod, namespace, node labels
- Exported - To Last9 via Prometheus remote write
- Automatic Discovery - OTel Collector watches Kubernetes API for all pods/services
- Annotation-Based Filtering - Only scrapes resources with prometheus.io/scrape: "true"
- Metadata Enrichment - Adds Kubernetes labels automatically (pod, namespace, node, app)
- Direct Export - Sends metrics to Last9 Prometheus endpoint
| Annotation | Required | Default | Description |
|---|---|---|---|
| prometheus.io/scrape | Yes | - | Set to "true" to enable scraping |
| prometheus.io/port | Yes | - | Port number exposing /metrics |
| prometheus.io/path | No | /metrics | HTTP path for metrics endpoint |
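The annotations combine with the pod IP into the URL that gets scraped. The sketch below illustrates that composition, including the /metrics default. `scrape_target` is an illustrative function, not part of the collector, and plain HTTP is assumed.

```shell
# Hypothetical helper: builds the scrape URL from a pod IP and the
# prometheus.io/port and prometheus.io/path annotation values.
#   $1 = pod IP, $2 = port, $3 = path (optional, defaults to /metrics)
scrape_target() {
  ip=$1
  port=$2
  path=${3:-/metrics}
  printf 'http://%s:%s%s\n' "$ip" "$port" "$path"
}
```

This is handy for checking an annotated pod by hand from inside the cluster, for instance `curl -s "$(scrape_target 10.0.0.12 8080)"`, before relying on automatic discovery.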
This setup scales automatically:
- 1 service → Automatically scraped
- 1000 services → Automatically scraped
- No configuration changes needed when adding new services
Base Configuration: last9-otel-collector-values.yaml
- Traces and logs collection
- Basic OTLP receiver
- No metrics scraping
App Metrics Configuration: last9-otel-collector-metrics-values.yaml
- Prometheus receiver with kubernetes_sd_configs for auto-discovery
- prometheusremotewrite exporter for sending to Last9
- RBAC for Kubernetes API access
- Increased resource limits for collector pods
- BasicAuth extension for Last9 metrics endpoint
GPU + App Metrics Configuration: last9-otel-collector-gpu-values.yaml
- Everything in the app metrics configuration, plus:
- DCGM GPU metrics scraping (NVIDIA GPU Operator or GKE-managed DCGM; see variants in values file)
- Ray head/worker metrics scraping (KubeRay Operator)
- Cardinality control via metric keep-list for DCGM
Choose ONE metrics overlay:
- App metrics only: --values last9-otel-collector-values.yaml --values last9-otel-collector-metrics-values.yaml
- App + GPU metrics: --values last9-otel-collector-values.yaml --values last9-otel-collector-gpu-values.yaml
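Because the two overlays are mutually exclusive, a wrapper can make the choice explicit instead of relying on hand-typed flags. This is a minimal sketch: `overlay_values_flags` is an illustrative helper, and the mode names none, app, and gpu are invented for this example.

```shell
# Hypothetical helper: prints the --values flags for exactly one overlay.
#   $1 = none | app | gpu
overlay_values_flags() {
  base='--values last9-otel-collector-values.yaml'
  case $1 in
    none) echo "$base" ;;
    app)  echo "$base --values last9-otel-collector-metrics-values.yaml" ;;
    gpu)  echo "$base --values last9-otel-collector-gpu-values.yaml" ;;
    *)    echo "unknown overlay: $1 (expected none, app, or gpu)" >&2; return 1 ;;
  esac
}
```

The output is meant to be expanded unquoted so it splits into separate arguments, for example `helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector --namespace last9 --version 0.125.0 $(overlay_values_flags gpu)`.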
Check if metrics are being scraped:
# Check collector logs for scraping
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep kubernetes-pods
# Port-forward to collector metrics endpoint
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888
# Check scrape status
curl http://localhost:8888/metrics | grep scrape_samples_scraped

GPU and Ray metrics collection is opt-in. Use last9-otel-collector-gpu-values.yaml instead of the base metrics file to enable these scrape jobs. They use label-based discovery, so no annotation changes are needed on DCGM or Ray pods.
helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
--namespace last9 \
--version 0.125.0 \
--values last9-otel-collector-values.yaml \
--values last9-otel-collector-gpu-values.yaml

Note: Use last9-otel-collector-gpu-values.yaml instead of last9-otel-collector-metrics-values.yaml; the GPU file already includes all application metrics scrape jobs.
- DCGM metrics (pick one):
  - Self-managed (EKS, AKS, bare-metal): NVIDIA GPU Operator installed (includes DCGM Exporter with app.kubernetes.io/name=nvidia-dcgm-exporter label)
  - GKE: GPU node pools enabled; GKE auto-deploys DCGM exporter in gke-managed-system namespace (label: app.kubernetes.io/name=gke-managed-dcgm-exporter)
- Ray metrics: KubeRay Operator installed (Ray pods carry ray.io/node-type and ray.io/cluster labels)
Important: The values file ships with two DCGM variants (A and B). Variant A (self-managed NVIDIA GPU Operator) is enabled by default. If you're on GKE, comment out Variant A and uncomment Variant B. See the inline comments in last9-otel-collector-gpu-values.yaml for details.
| Job | Target | Label Selector | Namespace | Port | Interval |
|---|---|---|---|---|---|
| dcgm-gpu-metrics (Variant A) | DCGM Exporter pods | app.kubernetes.io/name=nvidia-dcgm-exporter | All (auto-discovered) | 9400 | 15s |
| dcgm-gpu-metrics (Variant B) | GKE DCGM Exporter pods | app.kubernetes.io/name=gke-managed-dcgm-exporter | gke-managed-system | 9400 | 15s |
| ray-head | Ray head nodes | ray.io/node-type=head | All | 8080 | 30s |
| ray-workers | Ray worker nodes | ray.io/node-type=worker | All | 8080 | 30s |
The DCGM job includes a cardinality keep-list limiting collection to 18 key metrics:
- Utilization: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_ENC_UTIL, DCGM_FI_DEV_DEC_UTIL
- Memory: DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_TOTAL
- Temperature & Power: DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_MEMORY_TEMP, DCGM_FI_DEV_POWER_USAGE
- Errors: DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
- PCIe: DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT
- Clock & Performance: DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_MEM_CLOCK, DCGM_FI_DEV_PSTATE
To add more DCGM metrics, extend the metric_relabel_configs regex in last9-otel-collector-gpu-values.yaml.
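Prometheus relabel regexes are fully anchored, so a keep-list is usually an alternation of exact metric names. The sketch below builds such an alternation from a list of names. `build_keep_regex` is an illustrative helper, and the exact shape of the regex in the values file may differ, so treat the output as something to merge by hand.

```shell
# Hypothetical helper: joins metric names (one per argument) with "|"
# to form a regex alternation suitable for a keep rule.
build_keep_regex() {
  regex=''
  for name in "$@"; do
    # ${regex:+$regex|} adds the separator only after the first name.
    regex="${regex:+$regex|}$name"
  done
  printf '%s\n' "$regex"
}
```

For example, `build_keep_regex DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_FB_USED` prints `DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED`; appending another metric name to the argument list extends the keep-list the same way.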
If your GPU nodes have nvidia.com/gpu taints, the OTel Collector and node exporter need tolerations to schedule on those nodes:
./last9-otel-setup.sh \
token="..." \
endpoint="..." \
monitoring-endpoint="..." \
username="..." \
password="..." \
tolerations-file=examples/tolerations-gpu-nodes.yaml

See examples/tolerations-gpu-nodes.yaml for the toleration configuration.
For clusters with many GPU nodes, the collector handles additional scrape targets. Suggested resource scaling:
| GPU Nodes | CPU Request/Limit | Memory Request/Limit |
|---|---|---|
| 1-10 | 250m / 500m | 512Mi / 1Gi |
| 10-50 | 500m / 1000m | 1Gi / 2Gi |
| 50-100 | 1000m / 2000m | 2Gi / 4Gi |
Override resources in your Helm values or pass a custom values file.
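The table above can be turned into a small lookup for use in provisioning scripts. This is a sketch under stated assumptions: `suggest_resources` is an illustrative helper, boundary counts (10 and 50) are assigned to the smaller tier because the table's ranges overlap, and counts above 100 are rejected since the table gives no guidance for them.

```shell
# Hypothetical helper: prints suggested request/limit pairs for a given
# number of GPU nodes, following the table above.
#   $1 = number of GPU nodes (positive integer)
suggest_resources() {
  n=$1
  if [ "$n" -lt 1 ]; then
    echo "node count must be at least 1" >&2; return 1
  elif [ "$n" -le 10 ]; then
    echo 'cpu=250m/500m memory=512Mi/1Gi'
  elif [ "$n" -le 50 ]; then
    echo 'cpu=500m/1000m memory=1Gi/2Gi'
  elif [ "$n" -le 100 ]; then
    echo 'cpu=1000m/2000m memory=2Gi/4Gi'
  else
    echo "no suggestion for more than 100 GPU nodes" >&2; return 1
  fi
}
```

The printed pair is request/limit for each resource, matching the column layout of the table.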
Before deploying, check which DCGM exporter is running in your cluster to pick the right variant:
# Check for GKE-managed DCGM exporter (Variant B)
kubectl get pods -n gke-managed-system -l app.kubernetes.io/name=gke-managed-dcgm-exporter
# Check for self-managed NVIDIA GPU Operator DCGM exporter (Variant A)
kubectl get pods -A -l app.kubernetes.io/name=nvidia-dcgm-exporter

- If the first command returns pods → use Variant B (uncomment it, comment out Variant A)
- If the second command returns pods → use Variant A (the default, no changes needed)
- If neither returns pods → your DCGM exporter is not yet installed (see Prerequisites)
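The decision above can be scripted so the variant choice is not left to eyeballing command output. This is a minimal sketch: `count_pods`, `pick_dcgm_variant`, and `detect_dcgm_variant` are illustrative helpers, and when both exporters are present the sketch prefers Variant B simply because it follows the order of the list.

```shell
# Hypothetical helper: counts non-empty lines on stdin
# (one line per pod when fed `kubectl get pods ... -o name`).
count_pods() {
  grep -c . || true
}

# Hypothetical helper: maps pod counts to a variant letter.
#   $1 = GKE-managed exporter pods, $2 = NVIDIA GPU Operator exporter pods
pick_dcgm_variant() {
  if [ "$1" -gt 0 ]; then
    echo B
  elif [ "$2" -gt 0 ]; then
    echo A
  else
    echo none
  fi
}

# Runs the two checks from above against the current cluster.
detect_dcgm_variant() {
  gke=$(kubectl get pods -n gke-managed-system \
    -l app.kubernetes.io/name=gke-managed-dcgm-exporter -o name 2>/dev/null | count_pods)
  nvidia=$(kubectl get pods -A \
    -l app.kubernetes.io/name=nvidia-dcgm-exporter -o name 2>/dev/null | count_pods)
  pick_dcgm_variant "$gke" "$nvidia"
}
```

`detect_dcgm_variant` prints A, B, or none; the values file still has to be edited by hand to match.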
# Verify DCGM exporter pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep dcgm-gpu-metrics
# Verify Ray head/worker pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-head
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-workers
# Check DCGM metrics are being scraped
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888
curl -s http://localhost:8888/metrics | grep -c "DCGM_FI_DEV"
# Check Ray metrics are being scraped
curl -s http://localhost:8888/metrics | grep scrape_samples_scraped | grep ray