1 change: 1 addition & 0 deletions docs/observability/core_metrics.md
@@ -123,6 +123,7 @@ This file contains a list of metrics exported by NVIDIA Infra Controller (NICo).
<tr><td>carbide_site_explorer_created_power_shelves_count</td><td>gauge</td><td>The amount of Power Shelves that had been created by Site Explorer after being identified</td></tr>
<tr><td>carbide_site_explorer_enabled</td><td>gauge</td><td>Whether site-explorer is enabled (1) or paused (0)</td></tr>
<tr><td>carbide_site_explorer_iteration_latency_milliseconds</td><td>histogram</td><td>The time it took to perform one site explorer iteration</td></tr>
<tr><td>carbide_site_explorer_last_run_status</td><td>gauge</td><td>The status of the latest Site Explorer run</td></tr>
<tr><td>carbide_site_explorer_phase_latency_milliseconds</td><td>histogram</td><td>The time it took to perform one site explorer iteration phase</td></tr>
<tr><td>carbide_site_explorer_update_explored_endpoints_count</td><td>gauge</td><td>Counts from the last update_explored_endpoints phase by kind</td></tr>
<tr><td>carbide_switches_enqueuer_iteration_latency_milliseconds</td><td>histogram</td><td>The overall time it took to enqueue state handling tasks for all carbide_switches in the system</td></tr>
15 changes: 15 additions & 0 deletions helm/PREREQUISITES.md
@@ -41,6 +41,21 @@ If you want Prometheus metrics collection, install the [Prometheus Operator](htt
- **nico-hardware-health** also exposes an optional `telemetryServiceMonitor` (disabled by default) that scrapes `/telemetry` for per-machine sensor gauge data (temperature, power, fans, etc.) from the Prometheus sink. Use `serviceMonitor` for `/metrics` operational metrics only.
- NICo functions normally without the Prometheus Operator installed.

### Grafana Dashboard Sidecar (Optional)

The umbrella chart can install its packaged Grafana dashboards as a ConfigMap
when `grafanaDashboards.enabled=true`. This requires an existing Grafana
installation with a dashboard sidecar watching the ConfigMap namespace and
labels. The default `grafana_dashboard: "1"` label matches the common
`kube-prometheus-stack` selector. To place dashboards in the configured NICo
folder, the sidecar must read the `grafana_folder` annotation (or both sides
must be configured with another annotation key).

Grafana is not installed by the NICo chart. If `grafanaDashboards.namespace`
targets a namespace other than the NICo release namespace, create that
namespace first and configure the sidecar to search it. See
[`README.md`](./README.md#grafana-dashboards) for values and namespace examples.

---

## 2. PostgreSQL Database
43 changes: 43 additions & 0 deletions helm/README.md
@@ -74,6 +74,49 @@ Top-level `global:` values are automatically passed to all subcharts.
| `global.spiffe.trustDomain` | SPIFFE trust domain for mTLS | `nico.local` |
| `global.labels` | Common labels applied to all resources | See `values.yaml` |

### Grafana Dashboards

The chart packages three dashboards built from NICo's exported Prometheus
metrics: a site overview, object lifecycle diagnostics, and API performance.
They are disabled by default because this chart does not install Grafana.
The source JSON files live in [`dashboards/`](./dashboards/) and can also be
imported into Grafana directly.

To expose the dashboards to a Grafana dashboard sidecar in the release
namespace:

```yaml
grafanaDashboards:
  enabled: true
```

The default `grafana_dashboard: "1"` label matches the dashboard-sidecar
selector used by `kube-prometheus-stack`. The chart also adds the conventional
`grafana_folder: NICo` annotation; configure the Grafana sidecar's
`folderAnnotation` setting if it does not already read that key. If Grafana
watches a different namespace or selector, configure them explicitly:

```yaml
grafanaDashboards:
  enabled: true
  namespace: monitoring
  folder: Infrastructure/NICo
  folderAnnotation: grafana_folder
  labels:
    grafana_dashboard: "1"
  annotations: {}
```

The target namespace must exist before Helm runs, and the Helm identity must be
allowed to create ConfigMaps there. The Grafana sidecar must also watch that
namespace; for `kube-prometheus-stack`, configure
`grafana.sidecar.dashboards.searchNamespace` accordingly.
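
As a rough sketch of the Grafana side, the values below show how a
`kube-prometheus-stack` release might be configured to pick up the ConfigMap
from the example above. The `monitoring` namespace is carried over from that
example, and the keys are those of the bundled Grafana subchart as commonly
documented; treat them as assumptions and check them against the chart version
you actually run.

```yaml
# Illustrative kube-prometheus-stack values (not part of the NICo chart).
grafana:
  sidecar:
    dashboards:
      enabled: true
      # Must match the label set on the NICo dashboard ConfigMap.
      label: grafana_dashboard
      labelValue: "1"
      # Namespace holding the ConfigMap (assumed from the example above).
      searchNamespace: monitoring
      # Annotation the sidecar reads to choose the target folder.
      folderAnnotation: grafana_folder
      provider:
        # Typically needed for folder annotations to take effect.
        foldersFromFilesStructure: true
```

If the sidecar already watches all namespaces or uses a different label, only
the keys that differ from your existing configuration need to change.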

Each dashboard provides a Prometheus data-source selector, a NICo scrape-job
selector, and an editable metric-prefix variable. The prefix defaults to
`carbide`, which is the prefix currently emitted by NICo. Set it to `nico` (or
another configured value) when using the `alt_metric_prefix` site setting.

### Subchart Enable/Disable Flags

Each subchart can be independently enabled or disabled. All core NICo services are enabled by default. Infrastructure services (`unbound`) that may already be provided by the environment are disabled by default.