Commit ab4dddb

feat: device-api-server with NVML provider, client SDK, and gRPC services

Implement a Kubernetes-style device API server with the following components:

Core server (cmd/device-api-server):
- gRPC-based controlplane API server using apiserver-runtime patterns
- GPU service with full CRUD + Watch + UpdateStatus via storage.Interface
- Server lifecycle management with graceful shutdown
- Health and metrics endpoints with gRPC reflection

Storage (pkg/storage):
- In-memory storage.Interface implementation with watch support
- Configurable watch channel buffer sizes and event drop metrics
- Factory pattern for storage backend selection

NVML provider (cmd/nvml-provider):
- Sidecar that enumerates GPUs via NVML and registers them with the API
- XID error event monitoring with health condition updates
- Reconciliation loop with configurable intervals
- Environment-based driver root configuration (NVIDIA_DRIVER_ROOT)

Client SDK (pkg/client-go):
- Typed gRPC client with Get/List/Watch/Create/Update/UpdateStatus/Delete
- Fake client for testing with k8s.io/client-go/testing integration
- Informers and listers following Kubernetes client-go conventions
- Clientset pattern for versioned API access

Code generator (code-generator):
- Fork of k8s.io/code-generator/cmd/client-gen for gRPC backends
- Generates typed clients, fake clients, and expansion interfaces
- Full UpdateStatus template with proper gRPC implementation
- Integrated into hack/update-codegen.sh pipeline

Proto API (api/proto/device/v1alpha1):
- GPU resource with spec (UUID) and status (conditions, recommendedAction)
- Standard CRUD + Watch + UpdateStatus RPCs
- K8s-style request/response patterns with options

Security hardening:
- gRPC message size and stream limits
- Server error detail scrubbing for client responses
- Unix socket path validation and restrictive permissions
- Localhost-only enforcement for insecure credentials

Deployment (deployments/helm):
- Helm chart with configurable storage, gRPC, health, and metrics
- Static manifests with versioned image references
- Dockerfile with pinned base images

Testing:
- Unit tests for storage, services, providers, and utilities
- Integration tests for client-go with full gRPC stack
- Shared testutil with bufconn gRPC test helpers
- Fake client examples for consumer testing patterns

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
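
The storage item above mentions configurable watch channel buffer sizes and event drop metrics. The sketch below illustrates one common way such a design can work; it is not the code from pkg/storage. The names `Store`, `Event`, `NewStore`, `Watch`, `Create`, and `Dropped` are hypothetical, and a plain atomic counter stands in for whatever metrics library the commit actually uses.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Event is a hypothetical watch event; the real type is not shown in this commit view.
type Event struct {
	Type string // e.g. "ADDED"
	Name string
}

// Store is a hypothetical in-memory store that fans events out to watchers.
type Store struct {
	mu       sync.Mutex
	items    map[string]struct{}
	watchers []chan Event
	bufSize  int          // per-watcher channel buffer size (the configurable knob)
	dropped  atomic.Int64 // stands in for an "event drop" metric
}

// NewStore returns a store whose watchers each get a buffer of bufSize events.
func NewStore(bufSize int) *Store {
	return &Store{items: make(map[string]struct{}), bufSize: bufSize}
}

// Watch registers a new watcher backed by a bounded channel.
func (s *Store) Watch() <-chan Event {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan Event, s.bufSize)
	s.watchers = append(s.watchers, ch)
	return ch
}

// Create stores the item and notifies watchers without blocking.
// If a watcher's buffer is full, the event is dropped and counted.
func (s *Store) Create(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[name] = struct{}{}
	ev := Event{Type: "ADDED", Name: name}
	for _, ch := range s.watchers {
		select {
		case ch <- ev:
		default:
			s.dropped.Add(1)
		}
	}
}

// Dropped reports how many events were dropped because of full buffers.
func (s *Store) Dropped() int64 { return s.dropped.Load() }

func main() {
	s := NewStore(2)
	w := s.Watch()
	for _, n := range []string{"gpu-0", "gpu-1", "gpu-2"} {
		s.Create(n)
	}
	fmt.Println(len(w), s.Dropped()) // prints "2 1"
}
```

The non-blocking send keeps a slow watcher from stalling writers, at the cost of lost events. That trade-off is why a drop counter is useful: it makes undersized buffers visible.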
1 parent 6bf52de commit ab4dddb

174 files changed

Lines changed: 28098 additions & 3202 deletions

.gitattributes

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+ # ==============================================================================
+ # GIT ATTRIBUTES
+ # ==============================================================================
+ # Use 'linguist-generated=true' to hide generated code from GitHub PR diffs.
+ # ==============================================================================
+
+ # Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc)
+ zz_generated.*.go linguist-generated=true
+
+ # Hide generated Protobuf Go bindings
+ *.pb.go linguist-generated=true
+
+ # Hide generated client library
+ pkg/client-go/** linguist-generated=true
+
+ # Hide generated gRPC services
+ pkg/services/** linguist-generated=true
+
+ # Hide copied, unmodified upstream code
+ code-generator/cmd/client-gen/types/** linguist-generated=true
+ code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true
+ code-generator/cmd/client-gen/generators/util/** linguist-generated=true

.github/ISSUE_TEMPLATE/bug_report.yml

Lines changed: 11 additions & 11 deletions
@@ -41,10 +41,10 @@ body:
  attributes:
  label: Component
  options:
- - Health Monitor
- - Core Service
- - Fault Management
- - Deployment/Config
+ - Proto Definitions
+ - Generated Code
+ - Build/Tooling
+ - Documentation
  - Other
  validations:
  required: true
@@ -65,13 +65,13 @@ body:
  attributes:
  label: Environment
  placeholder: |
- - NVSentinel version:
- - Kubernetes version:
- - Deployment method:
+ - Go version:
+ - protoc version:
+ - OS:
  value: |
- - NVSentinel version:
- - Kubernetes version:
- - Deployment method:
+ - Go version:
+ - protoc version:
+ - OS:
  validations:
  required: true

@@ -83,4 +83,4 @@ body:
  placeholder: |
  ```
  # Logs here
- ```
+ ```

.github/ISSUE_TEMPLATE/config.yml

Lines changed: 4 additions & 7 deletions
@@ -15,11 +15,8 @@
  blank_issues_enabled: false
  contact_links:
  - name: 🔒 Security Issue
- url: https://github.com/NVIDIA/NVSentinel/security/advisories/new
- about: Report a security vulnerability (private disclosure)
- - name: 💬 Discussions
- url: https://github.com/NVIDIA/NVSentinel/discussions
- about: Ask questions, share ideas, and discuss with the community
+ url: https://www.nvidia.com/object/submit-security-vulnerability.html
+ about: Report a security vulnerability (private disclosure via NVIDIA PSIRT)
  - name: 📖 Documentation
- url: https://github.com/NVIDIA/NVSentinel/blob/main/README.md
- about: Read the documentation and guides
+ url: https://github.com/NVIDIA/device-api/blob/main/README.md
+ about: Read the documentation

.github/ISSUE_TEMPLATE/feature_request.yml

Lines changed: 7 additions & 8 deletions
@@ -56,12 +56,11 @@ body:
  attributes:
  label: Component
  options:
- - Health Monitor
- - Core Service
- - Fault Management
- - Deployment/Config
- - API/Interface
- - New Component
- - Multiple Components
+ - Proto Definitions
+ - New Resource Type
+ - API Enhancement
+ - Build/Tooling
+ - Documentation
+ - Other
  validations:
- required: true
+ required: true

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 5 additions & 17 deletions
@@ -4,26 +4,14 @@

  ## Type of Change
  - [ ] 🐛 Bug fix
- - [ ] ✨ New feature
+ - [ ] ✨ New feature (new message, field, or service method)
  - [ ] 💥 Breaking change
  - [ ] 📚 Documentation
- - [ ] 🔧 Refactoring
- - [ ] 🔨 Build/CI
-
- ## Component(s) Affected
- - [ ] Core Services
- - [ ] Documentation/CI
- - [ ] Fault Management
- - [ ] Health Monitors
- - [ ] Janitor
- - [ ] Other: ____________
-
- ## Testing
- - [ ] Tests pass locally
- - [ ] Manual testing completed
- - [ ] No breaking changes (or documented)
+ - [ ] 🔧 Build/Tooling

  ## Checklist
+ - [ ] Proto files compile successfully (`make protos-generate`)
+ - [ ] Generated code is up to date and committed
  - [ ] Self-review completed
  - [ ] Documentation updated (if needed)
- - [ ] Ready for review
+ - [ ] Signed-off commits (DCO)
