Commit 4c6d2ed

Merge pull request #69 from pfl/dranet
Dranet
2 parents 0031168 + a297a50 commit 4c6d2ed

30 files changed

Lines changed: 2150 additions & 639 deletions

Makefile

Lines changed: 1 addition & 1 deletion
@@ -152,7 +152,7 @@ BUILD_DATE = $(shell date -u +"%Y-%m-%dT%H:%M:%SZ")

 GO111MODULE=on
 CGOFLAGS=-trimpath -tags osusergo,netgo
-GCFLAGS=all=-spectre=all -N -l
+GCFLAGS=all=-spectre=all
 ASMFLAGS=all=-spectre=all
 LDFLAGS=all=-s -w -X ${PKG}/internal/version.gitInfo=${GIT_INFO} -X ${PKG}/internal/version.buildDate=${BUILD_DATE}

README.md

Lines changed: 52 additions & 9 deletions
@@ -2,15 +2,14 @@

 [![OpenSSF Scorecard](https://api.scorecard.dev/projects/github.com/intel/network-operator/badge)](https://scorecard.dev/viewer/?uri=github.com/intel/network-operator)

-CAUTION: This is an beta / non-production software, do not use on production clusters.
-
-Network Operator allows automatic configuring and easier use of RDMA NICs with Intel AI accelerators.
+Network Operator allows automatic configuring and easier use of NICs with Intel AI accelerators.

 ## Description

-Network operator currently supports Gaudi and its integrated scale-out network interfaces.
+Network operator supports Gaudi and its integrated scale-out network interfaces
+as well as host based network interfaces.

-### Intel® Gaudi®
+### Intel® Gaudi® integrated NICs

 Intel Gaudi and its integrated NICs are supported in two modes: L2 and L3.

@@ -28,16 +27,23 @@ The operator will deploy configuration Pods to the worker nodes which will liste

 More info on the switch topology and configurations is available [here](https://docs.habana.ai/en/v1.20.0/Management_and_Monitoring/Network_Configuration/Configure_E2E_Test_in_L3.html).

+### Host based network interface cards
+
+Network operator uses [DRANet](https://github.com/kubernetes-sigs/dranet) to configure
+host based network interface cards. Currently network cards supporting RDMA are
+requested with the [DRANet DeviceClass](config/deployments/dranet/deviceclass.yaml)
+with the name of the DeviceClass being configurable in the
+[HostNicScaleOutSpec](api/v1alpha1/networkconfiguration_types.go).
+
 ### Future work

-* Enable Host-NIC use in cluster
 * Support to install Host-NIC KMD
 * Configure RDMA NICs to be used with Intel AI accelerators

 ### Dependencies

 The operator depends on following Kubernetes components:
-* Intel Gaudi base operator
+* Intel Gaudi base operator (Gaudi accelerators)
 * Node Feature Discovery
 * Cert-manager
 * Prometheus (optional)
@@ -50,7 +56,7 @@ The operator depends on following Kubernetes components:
 - kubectl version v1.31+.
 - Access to a Kubernetes v1.31+ cluster.

-### Deploy operator using kubectl
+### Deploy operator with Gaudi support using kubectl

 Images are available at [dockerhub.io](https://hub.docker.com/r/intel/intel-network-operator).

@@ -76,6 +82,20 @@ sample with:
 kubectl apply -f config/operator/samples/gaudi-l3.yaml
 ```

+### Deploy operator with host based NIC support
+
+**Install operator into the cluster:**
+
+```sh
+kubectl apply -k config/operator/default/
+```
+
+**Create instance for host NIC**
+
+```sh
+kubectl apply -f config/operator/samples/hostnic-so.yaml
+```
+
 ### Remove operator using kubectl

 **Delete the instances (CRs) from the cluster:**
@@ -84,6 +104,12 @@ kubectl apply -f config/operator/samples/gaudi-l3.yaml
 kubectl delete -f config/operator/samples/gaudi-l3.yaml
 ```

+**OR**
+
+```sh
+kubectl delete -f config/operator/samples/hostnic-so.yaml
+```
+
 **Uninstall the controller from the cluster:**

 ```sh
@@ -99,7 +125,7 @@ kubectl delete -f config/nfd/gaudi-device-rule.yaml

 See the [README for Helm installation](charts/network-operator/README.md).

-### Prometheus scale-out network metrics
+### Prometheus scale-out network metrics for Gaudi

 In order to supply scale-out network metrics to Prometheus, enable them
 in the CR by setting `networkMetrics=true`. Use for example the
@@ -146,6 +172,14 @@ kubectl delete -f config/discovery/prometheus/metrics-service.yaml

 The most important Network Operator CRD properties are:

+* `configurationType` string
+
+Enable Gaudi network scale-out configuration with `gaudi-so` or host based NICs with `hostnic-so`.
+
+**Applicable for Gaudi Accelerators**
+
+Properties under `gaudiScaleOut`
+
 * `disableNetworkManager` boolean

 Disable Gaudi scale-out interfaces in NetworkManager. For nodes where NetworkManager tries
@@ -175,6 +209,15 @@ The most important Network Operator CRD properties are:

 Enable scale-out network metrics from an HTTP endpoint on the Pod. Prometheus can be configured to scrape the endpoint with [Service and ServiceMonitor objects](#prometheus-scale-out-network-metrics).

+**Applicable for host NIC**
+
+Properties under `hostNicScaleOut`
+
+* `installDranet` boolean
+
+Have Network Operator automatically install DRANet in the operator's namespace.
+If set to `false` it is assumed that the cluster admin already has DRANet set up.
+
 The full set of properties is available in the [NetworkClusterPolicy CRD definition](config/operator/crd/bases/intel.com_networkclusterpolicies.yaml).
 Examples of Network Operator CRDs are found in the [samples directory](config/operator/samples/).

api/v1alpha1/networkconfiguration_types.go

Lines changed: 37 additions & 5 deletions
@@ -21,19 +21,21 @@ import (
 // NOTE: json tags are required. Any new fields you add must have json tags for the fields to be serialized.

 // NetworkClusterPolicySpec defines the desired state of NetworkClusterPolicy
+// +kubebuilder:validation:XValidation:rule="self.configurationType != 'gaudi-so' || has(self.nodeSelector)",message="nodeSelector is required when configurationType is gaudi-so"
 type NetworkClusterPolicySpec struct {
-	// Configuration type that the operator will configure to the nodes. Possible options: gaudi-so.
-	// TODO: plausible other options: host-nic
-	// +kubebuilder:validation:Enum=gaudi-so
+	// Configuration type that the operator will configure to the nodes. Possible options: gaudi-so, hostnic-so.
+	// +kubebuilder:validation:Enum=gaudi-so;hostnic-so
 	ConfigurationType string `json:"configurationType"`

 	// Select which nodes the operator should target. Align with labels created by NFD.
-	// +kubebuilder:validation:Required
 	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

 	// Gaudi Scale-Out specific settings. Only valid when configuration type is 'gaudi-so'
 	GaudiScaleOut GaudiScaleOutSpec `json:"gaudiScaleOut,omitempty"`

+	// Host NIC Scale-Out specific settings, valid when configuration type is 'hostnic-so'
+	HostNicScaleOut HostNicScaleOutSpec `json:"hostNicScaleOut,omitempty"`
+
 	// LogLevel sets the operator's log level.
 	// +kubebuilder:validation:Minimum=0
 	// +kubebuilder:validation:Maximum=8
@@ -47,7 +49,6 @@ type GaudiScaleOutSpec struct {
 	DisableNetworkManager bool `json:"disableNetworkManager,omitempty"`

 	// Layer where the configuration should occur. Possible options: L2 and L3.
-	// +kubebuilder:validation:Required
 	// +kubebuilder:validation:Enum=L2;L3
 	Layer string `json:"layer,omitempty"`

@@ -78,6 +79,37 @@ type GaudiScaleOutSpec struct {
 	NetworkMetrics bool `json:"networkMetrics,omitempty"`
 }

+// RDMA device specification
+type RDMADeviceClassSpec struct {
+	// Name of the RDMA device class
+	Name string `json:"name"`
+}
+
+// Configuration specific for DRANet
+type DranetSpec struct {
+	// Dranet image to use.
+	Image string `json:"image,omitempty"`
+
+	// Pull policy for the dranet image.
+	// +kubebuilder:validation:Enum=Never;Always;IfNotPresent
+	PullPolicy string `json:"pullPolicy,omitempty"`
+
+	// Device Class for DRANet RDMA resources
+	RDMADeviceClass *RDMADeviceClassSpec `json:"rdmaDeviceClass,omitempty"`
+}
+
+// HostNicScaleOutSpec defines the desired state of HostNicScaleOut and will install
+// dranet, https://github.com/kubernetes-sigs/dranet/, to define interface resources
+type HostNicScaleOutSpec struct {
+	// Have the Network Operator install DRANet into its namespace. If set
+	// to false, the cluster admin is expected to provision DRANet by other
+	// means.
+	InstallDRANet bool `json:"installDranet"`
+
+	// Dranet specification, not used if installDranet is false
+	Dranet DranetSpec `json:"dranet,omitempty"`
+}
+
 // NetworkClusterPolicyStatus defines the observed state of NetworkClusterPolicy
 type NetworkClusterPolicyStatus struct {
 	Targets int32 `json:"targets"`
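
To make the JSON shape of the new `hostNicScaleOut` block concrete, here is a minimal sketch that marshals local copies of the three structs added above with the standard `encoding/json` package. The types are re-declared only so the program is self-contained, `marshalSpec` is a hypothetical helper, and the field values are illustrative. Note that `omitempty` has no effect on a struct-valued field in `encoding/json`, so `dranet` is still emitted (as an empty object) for a zero-value spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the types added in this commit, reduced to their JSON tags.
type RDMADeviceClassSpec struct {
	Name string `json:"name"`
}

type DranetSpec struct {
	Image           string               `json:"image,omitempty"`
	PullPolicy      string               `json:"pullPolicy,omitempty"`
	RDMADeviceClass *RDMADeviceClassSpec `json:"rdmaDeviceClass,omitempty"`
}

type HostNicScaleOutSpec struct {
	InstallDRANet bool       `json:"installDranet"`
	Dranet        DranetSpec `json:"dranet,omitempty"`
}

// marshalSpec is a hypothetical helper that renders a spec as compact JSON.
func marshalSpec(s HostNicScaleOutSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Zero value: installDranet is always present, dranet is an empty object.
	fmt.Println(marshalSpec(HostNicScaleOutSpec{}))

	// Illustrative values: operator-installed DRANet with a named device class.
	fmt.Println(marshalSpec(HostNicScaleOutSpec{
		InstallDRANet: true,
		Dranet: DranetSpec{
			RDMADeviceClass: &RDMADeviceClassSpec{Name: "dranet-rdma"},
		},
	}))
}
```

Because `rdmaDeviceClass` is a pointer, leaving it out of a manifest yields `nil` rather than an empty name, which is the case the webhook below treats as "no device class requested".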

api/v1alpha1/networkconfiguration_webhook.go

Lines changed: 27 additions & 4 deletions
@@ -29,7 +29,9 @@ import (
 var netpolicylog = logf.Log.WithName("nicclusterpolicy-resource")

 const (
-	gaudiScaleOut = "gaudi-so"
+	gaudiScaleOut          = "gaudi-so"
+	hostNicScaleOut        = "hostnic-so"
+	DefaultRDMADeviceClass = "dranet-rdma"
 )

 type emptyNodeSelectorError struct{}
@@ -50,6 +52,12 @@ func (e unknownConfigurationError) Error() string {
 	return "unknown error"
 }

+type missingDeviceClassNameError struct{}
+
+func (e missingDeviceClassNameError) Error() string {
+	return "missing device class name"
+}
+
 // SetupWebhookWithManager will setup the manager to manage the webhooks
 func (r *NetworkClusterPolicy) SetupWebhookWithManager(mgr ctrl.Manager) error {
 	return ctrl.NewWebhookManagedBy(mgr).
@@ -70,6 +78,12 @@ func (r *NetworkClusterPolicy) Default() {
 		if len(r.Spec.GaudiScaleOut.Image) == 0 {
 			r.Spec.GaudiScaleOut.Image = "intel/intel-network-linkdiscovery:latest"
 		}
+	case hostNicScaleOut:
+		rdmaDeviceClass := r.Spec.HostNicScaleOut.Dranet.RDMADeviceClass
+		if rdmaDeviceClass != nil && rdmaDeviceClass.Name == "" {
+			r.Spec.HostNicScaleOut.Dranet.RDMADeviceClass.Name = DefaultRDMADeviceClass
+		}
+
 	}
 }

@@ -118,14 +132,23 @@ func validateNodeSelector(nodeSelector map[string]string) error {
 	return nil
 }

-func validateSpec(s NetworkClusterPolicySpec) (admission.Warnings, error) {
-	if err := validateNodeSelector(s.NodeSelector); err != nil {
-		return nil, err
+func validateHostNicSoSpec(s HostNicScaleOutSpec) error {
+	if s.Dranet.RDMADeviceClass != nil && s.Dranet.RDMADeviceClass.Name == "" {
+		return missingDeviceClassNameError{}
 	}

+	return nil
+}
+
+func validateSpec(s NetworkClusterPolicySpec) (admission.Warnings, error) {
 	switch s.ConfigurationType {
 	case gaudiScaleOut:
+		if err := validateNodeSelector(s.NodeSelector); err != nil {
+			return nil, err
+		}
 		return nil, validateGaudiSoSpec(s.GaudiScaleOut)
+	case hostNicScaleOut:
+		return nil, validateHostNicSoSpec(s.HostNicScaleOut)
 	default:
 		return nil, unknownConfigurationError{}
 	}
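
The defaulting and validation changes above interact: `Default()` fills in an empty device class name, and `validateHostNicSoSpec` rejects the same condition if it is still present at validation time. The sketch below reproduces that flow without controller-runtime. The `policy` and `deviceClass` types, the error variables, and `applyDefaults`/`validate` are hypothetical stand-ins for the real types and webhook methods. The body of `validateNodeSelector` is not shown in this diff, so the empty-selector check is an assumption, and the Gaudi-specific validation is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	gaudiScaleOut          = "gaudi-so"
	hostNicScaleOut        = "hostnic-so"
	defaultRDMADeviceClass = "dranet-rdma"
)

// Stand-ins for the error types used by the webhook.
var (
	errEmptyNodeSelector = errors.New("empty node selector")
	errMissingClassName  = errors.New("missing device class name")
	errUnknownConfig     = errors.New("unknown configuration type")
)

type deviceClass struct{ Name string }

// policy is a flattened, hypothetical stand-in for NetworkClusterPolicySpec.
type policy struct {
	ConfigurationType string
	NodeSelector      map[string]string
	RDMADeviceClass   *deviceClass
}

// applyDefaults mirrors the hostnic-so branch of Default(): a device class
// that is present but unnamed receives the default name.
func applyDefaults(p *policy) {
	if p.ConfigurationType == hostNicScaleOut &&
		p.RDMADeviceClass != nil && p.RDMADeviceClass.Name == "" {
		p.RDMADeviceClass.Name = defaultRDMADeviceClass
	}
}

// validate mirrors the dispatch in validateSpec: the node selector is only
// checked for gaudi-so, and hostnic-so only checks the device class name.
func validate(p policy) error {
	switch p.ConfigurationType {
	case gaudiScaleOut:
		if len(p.NodeSelector) == 0 { // assumed behavior of validateNodeSelector
			return errEmptyNodeSelector
		}
		return nil
	case hostNicScaleOut:
		if p.RDMADeviceClass != nil && p.RDMADeviceClass.Name == "" {
			return errMissingClassName
		}
		return nil
	default:
		return errUnknownConfig
	}
}

func main() {
	// Hypothetical input: host NIC policy with a device class but no name.
	p := policy{ConfigurationType: hostNicScaleOut, RDMADeviceClass: &deviceClass{}}
	fmt.Println("before defaulting:", validate(p))
	applyDefaults(&p)
	fmt.Println("after defaulting:", validate(p), p.RDMADeviceClass.Name)
}
```

In a cluster, mutating admission webhooks run before validating ones, so the missing-name error would be expected only when defaulting has not been applied, for example when `validateSpec` is called directly in a unit test.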

api/v1alpha1/zz_generated.deepcopy.go

Lines changed: 52 additions & 0 deletions
Some generated files are not rendered by default.

build/Dockerfile.operator

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ RUN go mod download
 # Copy the go source
 COPY cmd/operator cmd/operator
 COPY api/ api/
+COPY config config
 COPY config/discovery config/discovery
 COPY internal/controller/ internal/controller/
 COPY internal/version/ internal/version/
