Commit

Merge pull request #354 from chaitanya1731/gaudi-updates
gaudi: Updated README and provisioning steps for v1.19
uMartinXu authored Jan 30, 2025
2 parents b39aaf5 + bf7c9ab commit 1e3bb58
Showing 5 changed files with 146 additions and 115 deletions.
73 changes: 26 additions & 47 deletions gaudi/README.md
@@ -1,41 +1,18 @@
# Setting up Intel Gaudi Base Operator
# Setting up Intel Gaudi AI Accelerator Operator

## Overview
[Intel Gaudi Base Operator](https://catalog.redhat.com/software/container-stacks/detail/6683b2cce45daa25e36bddcb) is used to provision Intel Gaudi Accelerator with OpenShift. The steps and yaml files mentioned in this document to provision the Gaudi accelerator are based on [Intel Gaudi Base Operator for OpenShift](https://docs.habana.ai/en/latest/Orchestration/Intel_Gaudi_Base_Operator/index.html).
[Intel Gaudi AI Accelerator Operator](https://catalog.redhat.com/software/container-stacks/detail/6683b2cce45daa25e36bddcb) is used to provision Intel Gaudi Accelerator with OpenShift. The steps and yaml files mentioned in this document to provision the Gaudi accelerator are based on [Intel Gaudi AI Accelerator Operator for OpenShift](https://docs.habana.ai/en/latest/Orchestration/Intel_Gaudi_Base_Operator/index.html).

If you are familiar with the steps here to manually provision the accelerator, the Red Hat certified Operator and Ansible based [One-Click](/one_click/README.md#reference-playbook-–-habana-gaudi-provisioning) solution can be used as a reference to provision the accelerator automatically.

## Prerequisites
- To Provision RHOCP cluster, follow steps [here](/README.md#provisioning-rhocp-cluster).
- To Install NFD Operator, follow steps [here](/nfd/README.md#install-nfd-operator).
- To Install KMM Operator, follow steps [here](/kmmo/README.md#install-kmm-operator).

## Update Kernel Firmware Search Path with MCO
**Note:** This step reboots the nodes, so it is recommended to perform it first.

The default kernel firmware search path `/lib/firmware` in RHCOS is not writable. The command below adds the path `/var/lib/firmware` to the firmware search path list.
```
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_firmware_path.yaml
```

## Label Gaudi Accelerator Nodes With NFD
NFD operator can be used to configure NFD to automatically detect the Gaudi accelerators and label the nodes for the following provisioning steps.
```
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_nfd_instance_openshift.yaml
```
Verify NFD has labelled the node correctly:
```
oc get no -o json | jq '.items[].metadata.labels' | grep pci-1da3
"feature.node.kubernetes.io/pci-1da3.present": "true",
```
NFD detects underlying Gaudi Accelerator using its PCI device class and the vendor ID.

## Install Intel Gaudi Base Operator on Red Hat OpenShift
## Install Intel Gaudi AI Accelerator Operator on Red Hat OpenShift
### Installation via web console
Follow the steps below to install Intel Gaudi Base Operator using OpenShift web console:
Follow the steps below to install Intel Gaudi AI Accelerator Operator using OpenShift web console:
1. In the OpenShift web console, navigate to **Operator** -> **OperatorHub**.
2. Search for **Intel Gaudi Base Operator** in all items field -> Click **Install**.
2. Search for **Intel Gaudi AI Accelerator Operator** in all items field -> Click **Install**.
### Verify Installation via web console
1. Go to **Operator** -> **Installed Operators**.
2. Verify that the status of the operator is **Succeeded**.
@@ -54,53 +31,55 @@ NAME READY STATUS RESTARTS AGE
controller-manager-6c8459d9cb-fqs8h 2/2 Running 0 25m
```

## Creating Intel Gaudi Base Operator DeviceConfig Instance
## Creating Intel Gaudi AI Accelerator Operator ClusterPolicy Instance
To create a Habana Gaudi device plugin CR, follow the steps below.

### Create CR via web console
1. Go to **Operator** -> **Installed Operators**.
2. Open **Intel Gaudi Base Operator**.
3. Navigate to tab **Device Config**.
4. Click **Create DeviceConfig** -> set correct parameters -> Click **Create**. To set correct parameters, refer to [Using RedHat OpenShift Console](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Intel_Gaudi_Base_Operator/Deploying_Intel_Gaudi_Base_Operator.html?highlight=openshift#id2).
2. Open **Intel Gaudi AI Accelerator Operator**.
3. Navigate to tab **Cluster Policy**.
4. Click **Create ClusterPolicy** -> set correct parameters -> Click **Create**. To set correct parameters, refer to [Using RedHat OpenShift Container Platform Console](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Kubernetes_Installation/Kubernetes_Operator.html#id1).

### Verify via web console
1. Verify CR by checking the status of **Workloads** -> **DaemonSet** -> **habana-ai-module-device-plugin-xxxxx**.
2. Now `DeviceConfig` is created.
1. Verify CR by checking the status of **Workloads** -> **DaemonSet** -> **habana-ai-device-plugin-ds**, **habana-ai-driver-rhel-9-4-xxxxx**, **habana-ai-feature-discovery-ds**, **habana-ai-metric-exporter-ds**, **habana-ai-runtime-ds**.
2. Now `ClusterPolicy` is created.

### Create CR via CLI
Apply the CR yaml file:
```
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_device_config.yaml
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_cluster_policy.yaml
```
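The `gaudi_cluster_policy.yaml` manifest carries descriptive values such as `cpu_str_or_int_optional` and `memory_str_optional`. Assuming these are placeholders meant to be replaced or removed before use, the following sketch lists the lines in a manifest that still end in `_optional`. The function name `list_placeholders` is hypothetical and not part of this repository; other descriptive values (for example `boolean_nfd_installed`) are not matched by this pattern.

```shell
#!/usr/bin/env bash
# Sketch only: list "key: value" lines whose value still ends in "_optional".
# Reads a manifest on stdin and prints matches as "<line number>:<line>".
# list_placeholders is a hypothetical helper name, not a repository script.
list_placeholders() {
  grep -nE ':[[:space:]]*[A-Za-z0-9_]+_optional[[:space:]]*$'
}
```

For example, `list_placeholders < gaudi_cluster_policy.yaml` prints every line that may still need a real value. The function returns a non-zero status when nothing matches.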

### Verify the DeviceConfig CR is created
You can use the command below to verify that the `DeviceConfig` CR has been created:
### Verify the ClusterPolicy CR is created
You can use the command below to verify that the `ClusterPolicy` CR has been created:
```
oc get pod -n habana-ai-operator
NAME READY STATUS RESTARTS AGE
controller-manager-6586758d54-qw644 2/2 Running 0 5d5h
habana-ai-habana-runtime-bqpvp 1/1 Running 0 5d6h
habana-ai-module-device-plugin-pljkf-kxgdj 1/1 Running 0 5d6h
habana-ai-node-metrics-rghlr 1/1 Running 0 5d6h
NAME READY STATUS RESTARTS AGE
habana-ai-device-plugin-ds-thj7b 1/1 Running 0 10d
habana-ai-driver-rhel-9-4-416-94-202412170927-0-ds-vqhzb 1/1 Running 2 10d
habana-ai-feature-discovery-ds-ztl2j 1/1 Running 5 10d
habana-ai-metric-exporter-ds-g5qqh 1/1 Running 0 10d
habana-ai-operator-controller-manager-6c995b5646-wl7cp 2/2 Running 0 10d
habana-ai-runtime-ds-x49lf 1/1 Running 0 10d
```
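The pod listing above can also be checked mechanically. This is a minimal sketch, assuming the default `oc get pod` table layout with `STATUS` in the third column; the function name `not_running_pods` is hypothetical. It reads the table on stdin so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Sketch only: print the names of pods whose STATUS column is not "Running".
# Input is the table printed by `oc get pod`; header rows (starting with
# NAME) and short lines are skipped.
not_running_pods() {
  awk '$1 != "NAME" && NF >= 3 && $3 != "Running" { print $1 }'
}
```

Against a live cluster this could be used as `oc get pod -n habana-ai-operator | not_running_pods`; empty output means every listed pod reports `Running`. Note that pods in other healthy states, such as `Completed`, would also be printed.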
Alternatively, you can also check the status of the `DeviceConfig` CR like below:
Alternatively, you can also check the status of the `ClusterPolicy` CR like below:
```
oc describe deviceconfig habana-ai -n habana-ai-operator
oc describe ClusterPolicy habana-ai -n habana-ai-operator
Name: habana-ai
Namespace: habana-ai-operator
.
.
Status:
Conditions:
Last Transition Time: 2024-07-24T14:05:11Z
Last Transition Time: 2025-01-21T18:50:46Z
Message: All resources have been successfully reconciled
Reason: Reconciled
Status: True
```
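The status check above can be scripted as well. The sketch below assumes the `oc describe` layout shown in the sample output, where a `Reason:` line precedes the `Status:` line of each condition; the function name `is_reconciled` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: succeed if the describe output on stdin contains a condition
# with "Reason: Reconciled" followed by "Status: True".
is_reconciled() {
  awk '
    /Last Transition Time:/ { reason = "" }           # start of a new condition
    $1 == "Reason:"         { reason = $2 }
    $1 == "Status:" && $2 == "True" && reason == "Reconciled" { ok = 1 }
    END { exit (ok ? 0 : 1) }
  '
}
```

A possible use is `oc describe ClusterPolicy habana-ai -n habana-ai-operator | is_reconciled && echo reconciled`.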
## Verify Gaudi Provisioning
After the `DeviceConfig` instance CR is created, it will take some time for the operator to download the Gaudi OOT driver source code and build it on-premise with the help of the KMM operator. The OOT driver module binaries will be loaded into the RHCOS kernel on each node with Gaudi cards labelled by NFD. Then, the Gaudi device plugin can advertise the Gaudi resources listed in the table for the pods on OpenShift to use. Run the command below to check the availability of Gaudi resources:
After the `ClusterPolicy` instance CR is created, it will take some time for the operator to download the Gaudi OOT driver source code and build it on-premise with the help of the KMM operator. The OOT driver module binaries will be loaded into the RHCOS kernel on each node with Gaudi cards labelled by feature discovery. Then, the Gaudi device plugin can advertise the Gaudi resources listed in the table for the pods on OpenShift to use. Run the command below to check the availability of Gaudi resources:
```
oc describe node | grep habana.ai/gaudi
@@ -119,4 +98,4 @@ The resources provided are the user interface for customers to claim and consume
| Habana Gaudi | `habana.ai/gaudi` | Number of Habana Gaudi Card resources ready to claim |
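
To illustrate how a workload might claim the `habana.ai/gaudi` resource from the table, here is a minimal sketch that prints a Pod manifest requesting a given number of cards. The function name `gaudi_pod_manifest`, the pod name, and the container image are hypothetical placeholders and do not come from this repository.

```shell
#!/usr/bin/env bash
# Sketch only: print a Pod manifest that requests Gaudi cards through the
# habana.ai/gaudi extended resource. The pod name and image below are
# placeholders (assumptions); substitute a real workload image.
gaudi_pod_manifest() {
  local count="${1:-1}"   # number of Gaudi cards to request, default 1
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-example
spec:
  restartPolicy: Never
  containers:
  - name: workload
    image: registry.example.com/your-gaudi-workload:latest
    resources:
      limits:
        habana.ai/gaudi: ${count}
EOF
}
```

The output could be piped to `oc apply -f -` once the placeholders are replaced. The request is placed under `limits` because Kubernetes requires extended resources to be specified there.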

## Upgrade Intel Gaudi SPI Firmware
Refer to [Upgrade Intel Gaudi SPI Firmware](/gaudi/Gaudi-SPI-Firmware-Upgrade.md) to upgrade the SPI Firmware on Intel Gaudi.
Refer to [Upgrade Intel Gaudi SPI Firmware](/gaudi/Gaudi-SPI-Firmware-Upgrade.md) to upgrade the SPI Firmware on Intel Gaudi.
120 changes: 120 additions & 0 deletions gaudi/gaudi_cluster_policy.yaml
@@ -0,0 +1,120 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Adapted from https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Kubernetes_Installation/Kubernetes_Operator.html#id2
#
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        ubuntu_22.04:
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.19.1-26
        rhel_8.6:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
          tag: 1.19.1-26
        rhel_9.2:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
          tag: 1.19.1-26
        rhel_9.4:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
          tag: 1.19.1-26
        tencentos_3.1:
          repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
          tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
      repo_server: vault.habana.ai
      repo_path: artifactory/gaudi-installer/repos
      mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
      mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
      hugepages: hugepages_number_int_optional
      external_ports: turn_on_external_port_bool_optional
      firmware_flush: flush_firmware_on_the_gaudi_cards_bool_optional
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
        tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.19.1
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    configuration:
      container_engine: one_of_containerd_docker_crio
      engine_container_runtime_configuration: container_engine_configuration_optional
      habana_container_runtime_configuration: container_runtime_configuration_optional
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    port: 41611
    interval: 20
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    nfd_plugin: boolean_nfd_installed
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.19.1-26
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  node_selector:
    key_optional: value_optional
22 changes: 0 additions & 22 deletions gaudi/gaudi_device_config.yaml

This file was deleted.

16 changes: 0 additions & 16 deletions gaudi/gaudi_firmware_path.yaml

This file was deleted.

30 changes: 0 additions & 30 deletions gaudi/gaudi_nfd_instance_openshift.yaml

This file was deleted.
