Merge pull request #282 from chaitanya1731/playbook
gaudi: Added Gaudi Provisioning on OpenShift details
Showing 5 changed files with 217 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
# Setting up HabanaAI Operator | ||
|
||
## Overview | ||
[Habana AI Operator](https://catalog.redhat.com/software/container-stacks/detail/64342b3bcbfbb9a6588ce8dd) provisions Intel Gaudi accelerators on OpenShift. The steps and YAML files in this document are based on [HabanaAI Operator for OpenShift](https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/index.html). | ||
|
||
Once you are familiar with the manual provisioning steps described here, you can use the Red Hat certified Operator and Ansible-based [One-Click](/one_click/README.md) solution as a reference to provision the accelerator automatically. | ||
|
||
## Prerequisites | ||
- To provision an RHOCP cluster, follow the steps [here](/README.md#provisioning-rhocp-cluster). | ||
- To install the NFD Operator, follow the steps [here](/nfd/README.md#install-nfd-operator). | ||
- To install the KMM Operator, follow the steps [here](/kmmo/README.md#install-kmm-operator). | ||
|
||
## Update Kernel Firmware Search Path with MCO | ||
**Note:** This step reboots the nodes, so it is recommended to perform it first. | ||
|
||
The default kernel firmware search path `/lib/firmware` in RHCOS is not writable. The command below adds `/var/lib/firmware` to the firmware search path list. | ||
``` | ||
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_firmware_path.yaml | ||
``` | ||
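
Once the worker nodes have rebooted (the rollout can be followed with `oc get mcp worker`), the new kernel argument should appear on each node's kernel command line. The sketch below is one way to check that; `has_firmware_path` is a hypothetical helper name, not part of this repository, and the `oc debug` invocation in the comment assumes you substitute a real worker node name.

```shell
# Hypothetical helper: succeeds if the kernel command line read from stdin
# contains the firmware search path set by the MachineConfig.
has_firmware_path() {
  grep -q 'firmware_class\.path=/var/lib/firmware'
}

# Assumed usage on a live cluster (replace <node-name> with a worker node):
#   oc debug node/<node-name> -- chroot /host cat /proc/cmdline | has_firmware_path \
#     && echo "firmware search path is set"
```

The helper only inspects text on stdin, so it can also be run against a saved copy of `/proc/cmdline`.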
|
||
## Label Gaudi Accelerator Nodes With NFD | ||
The NFD operator can be used to configure NFD to automatically detect the Gaudi accelerators and label the nodes for the following provisioning steps. | ||
``` | ||
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_nfd_instance_openshift.yaml | ||
``` | ||
Verify NFD has labelled the node correctly: | ||
``` | ||
oc get no -o json | jq '.items[].metadata.labels' | grep pci-1da3 | ||
"feature.node.kubernetes.io/pci-1da3.present": "true", | ||
``` | ||
NFD detects the underlying Gaudi accelerator using its PCI device class and vendor ID. | ||
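
As an alternative to filtering JSON with `jq`, the label shown above can be used directly as a selector. The sketch below counts the nodes that carry it; `count_nodes` is a hypothetical helper name introduced only for illustration.

```shell
# Hypothetical helper: counts node names read from stdin, one per line,
# ignoring blank lines. Prints 0 when no node matches.
count_nodes() {
  grep -c . || true
}

# Assumed usage on a live cluster:
#   oc get nodes -l feature.node.kubernetes.io/pci-1da3.present=true -o name | count_nodes
```

A result of 0 suggests that NFD has not labelled any node yet, or that no Gaudi accelerator was detected.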
|
||
## Install HabanaAI Operator on Red Hat OpenShift | ||
### Installation via web console | ||
Follow the steps below to install HabanaAI Operator using OpenShift web console: | ||
1. In the OpenShift web console, navigate to **Operators** -> **OperatorHub**. | ||
2. Search for **HabanaAI Operator** in the **All Items** field, then click **Install**. | ||
### Verify Installation via web console | ||
1. Go to **Operators** -> **Installed Operators**. | ||
2. Verify that the status of the operator is **Succeeded**. | ||
|
||
### Installation via Command Line Interface (CLI) | ||
``` | ||
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_install_operator.yaml | ||
``` | ||
|
||
### Verify Installation via CLI | ||
Verify that the operator controller manager pod is up and running: | ||
``` | ||
oc get pods -n habana-ai-operator | ||
NAME READY STATUS RESTARTS AGE | ||
controller-manager-6c8459d9cb-fqs8h 2/2 Running 0 25m | ||
``` | ||
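
For scripted checks, the table printed by `oc get pods` can be parsed instead of read by eye. The following is a minimal sketch, assuming the default column layout shown above (NAME, READY, STATUS, ...); `all_pods_ready` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `oc get pods` table output on stdin and
# succeeds only if at least one pod is listed and every pod is Running
# with all of its containers ready (READY column "n/n").
all_pods_ready() {
  awk '
    BEGIN   { total = 0; ok = 0 }
    NR == 1 { next }                      # skip the header row
    NF == 0 { next }                      # skip blank lines
    {
      total++
      split($2, ready, "/")               # READY column, e.g. "2/2"
      if ($3 == "Running" && ready[1] == ready[2]) ok++
    }
    END { exit !(total > 0 && ok == total) }
  '
}

# Assumed usage on a live cluster:
#   oc get pods -n habana-ai-operator | all_pods_ready && echo "operator is up"
```

Note that this sketch treats any status other than `Running` as not ready, so a namespace containing completed job pods would fail the check.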
|
||
## Creating Habana AI Operator DeviceConfig Instance | ||
To create a Habana Gaudi device plugin CR, follow the steps below. | ||
|
||
### Create CR via web console | ||
1. Go to **Operators** -> **Installed Operators**. | ||
2. Open **HabanaAI Operator**. | ||
3. Navigate to tab **Device Config**. | ||
4. Click **Create DeviceConfig** -> set the correct parameters -> click **Create**. For the correct parameter values, refer to [Using RedHat OpenShift Console](https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Deploying_HabanaAI_Operator.html#id2). | ||
|
||
### Verify via web console | ||
1. Verify CR by checking the status of **Workloads** -> **DaemonSet** -> **habana-ai-module-device-plugin-xxxxx**. | ||
2. Now `DeviceConfig` is created. | ||
|
||
### Create CR via CLI | ||
Apply the CR yaml file: | ||
``` | ||
oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/gaudi/gaudi_device_config.yaml | ||
``` | ||
|
||
### Verify the DeviceConfig CR is created | ||
Use the command below to verify that the `DeviceConfig` CR has been created and that its pods are running: | ||
``` | ||
oc get pod -n habana-ai-operator | ||
NAME READY STATUS RESTARTS AGE | ||
controller-manager-6586758d54-qw644 2/2 Running 0 5d5h | ||
habana-ai-habana-runtime-bqpvp 1/1 Running 0 5d6h | ||
habana-ai-module-device-plugin-pljkf-kxgdj 1/1 Running 0 5d6h | ||
habana-ai-node-metrics-rghlr 1/1 Running 0 5d6h | ||
``` | ||
Alternatively, check the status of the `DeviceConfig` CR: | ||
``` | ||
oc describe deviceconfig habana-ai -n habana-ai-operator | ||
Name: habana-ai | ||
Namespace: habana-ai-operator | ||
. | ||
. | ||
Status: | ||
Conditions: | ||
Last Transition Time: 2024-07-24T14:05:11Z | ||
Message: All resources have been successfully reconciled | ||
Reason: Reconciled | ||
Status: True | ||
``` | ||
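
The status check can also be automated. The sketch below scans `oc describe` output for the `Reconciled` reason together with a `True` status; `is_reconciled` is a hypothetical helper name, and the sketch assumes the CR reports a single condition as in the output above. With several conditions it could give a misleading result.

```shell
# Hypothetical helper: reads `oc describe deviceconfig` output on stdin and
# succeeds if it contains both "Reason: Reconciled" and "Status: True".
# Assumes a single condition in the Status section.
is_reconciled() {
  awk '
    BEGIN { reason = 0; status = 0 }
    $1 == "Reason:" && $2 == "Reconciled" { reason = 1 }
    $1 == "Status:" && $2 == "True"       { status = 1 }
    END { exit !(reason && status) }
  '
}

# Assumed usage on a live cluster:
#   oc describe deviceconfig habana-ai -n habana-ai-operator | is_reconciled \
#     && echo "DeviceConfig reconciled"
```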
## Verify Gaudi Provisioning | ||
After the `DeviceConfig` CR is created, the operator needs some time to download the Gaudi out-of-tree (OOT) driver source code and build it on-premise with the help of the KMM operator. The OOT driver module binaries are then loaded into the RHCOS kernel on each node with Gaudi cards labelled by NFD. After that, the Gaudi device plugin advertises the Gaudi resources listed in the table below so that pods on OpenShift can use them. Run the command below to check the availability of Gaudi resources: | ||
``` | ||
oc describe node | grep habana.ai/gaudi | ||
habana.ai/gaudi: 8 -> number of Gaudi cards on the cluster | ||
habana.ai/gaudi: 8 -> number of allocatable Gaudi cards on the cluster | ||
habana.ai/gaudi 4 4 -> Gaudi cards requested by running pods (Requests and Limits columns) | ||
``` | ||
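
The number of cards still free can be derived from this output as allocatable minus requested. The sketch below does that arithmetic; `free_gaudi_cards` is a hypothetical helper name, and the sketch assumes the input comes from a single node and does not carry the `->` annotations shown above.

```shell
# Hypothetical helper: reads the grep-filtered `oc describe node` output for
# ONE node on stdin and prints allocatable minus requested Gaudi cards.
# The first "habana.ai/gaudi:" line is Capacity, the second is Allocatable;
# the line without a colon comes from the "Allocated resources" table.
free_gaudi_cards() {
  awk '
    BEGIN { seen = 0; allocatable = 0; requested = 0 }
    $1 == "habana.ai/gaudi:" { seen++; if (seen == 2) allocatable = $2 }
    $1 == "habana.ai/gaudi"  { requested = $2 }
    END { print allocatable - requested }
  '
}

# Assumed usage on a live cluster (replace <node-name> with a Gaudi node):
#   oc describe node <node-name> | grep habana.ai/gaudi | free_gaudi_cards
```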
|
||
To view the metrics on a node with Gaudi cards, refer to [Collecting Metrics](https://docs.habana.ai/en/latest/Orchestration/Prometheus_Metric_Exporter.html?highlight=metrics#collecting-metrics). | ||
|
||
## Resources Provided by Habana Gaudi Device Plugin | ||
The resources provided are the interface through which user pods claim and consume the hardware features. See the table below for details: | ||
|
||
| Feature | Resources | Description | | ||
| ------- | --------- | ----------- | | ||
| Habana Gaudi | `habana.ai/gaudi` | Number of Habana Gaudi Card resources ready to claim | |
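
To illustrate how a pod claims this resource, the sketch below prints a minimal pod manifest that requests Gaudi cards through `resources.limits`. Everything in it other than the `habana.ai/gaudi` resource name is an assumption made for illustration: the helper name `gaudi_pod_manifest`, the pod name `gaudi-demo`, the `sleep` command, and the container image, which you must supply yourself.

```shell
# Hypothetical helper: prints a pod manifest that claims Gaudi cards.
#   $1 = container image (assumption: supplied by the user)
#   $2 = number of Gaudi cards to claim (defaults to 1)
gaudi_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-demo
spec:
  restartPolicy: Never
  containers:
    - name: workload
      image: $1
      command: ["sleep", "3600"]
      resources:
        limits:
          habana.ai/gaudi: ${2:-1}
EOF
}

# Assumed usage on a live cluster (replace <image> with a real image):
#   gaudi_pod_manifest <image> 1 | oc apply -f -
```

Extended resources such as `habana.ai/gaudi` are specified under `limits`; Kubernetes sets the request to the same value when it is omitted.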
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# Copyright (c) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# Adapted from https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Deploying_HabanaAI_Operator.html#id3 | ||
# | ||
apiVersion: habana.ai/v1 | ||
kind: DeviceConfig | ||
metadata: | ||
name: habana-ai | ||
namespace: habana-ai-operator | ||
spec: | ||
devicePlugin: | ||
image: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin | ||
version: 1.15.1 | ||
driver: | ||
image: image-registry.openshift-image-registry.svc:5000/habana-ai-operator/habana-ai-driver | ||
version: 1.15.1-15 | ||
habanaRuntime: | ||
image: vault.habana.ai/habana-ocp-operator/1.15.1/habana-runtime | ||
version: 1.15.1-15 | ||
nodeMetrics: | ||
image: vault.habana.ai/gaudi-metric-exporter/metric-exporter | ||
version: 1.15.1-15 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Copyright (c) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# Adapted from https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Environment_Setup.html#installing-intel-gaudi-firmware | ||
# | ||
apiVersion: machineconfiguration.openshift.io/v1 | ||
kind: MachineConfig | ||
metadata: | ||
labels: | ||
machineconfiguration.openshift.io/role: worker | ||
name: firmware-path | ||
spec: | ||
config: | ||
ignition: | ||
version: 3.2.0 | ||
kernelArguments: | ||
- 'firmware_class.path=/var/lib/firmware' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# Copyright (c) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# Adapted from https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Deploying_HabanaAI_Operator.html#using-cli | ||
# | ||
--- | ||
apiVersion: v1 | ||
kind: Namespace | ||
metadata: | ||
name: habana-ai-operator | ||
--- | ||
apiVersion: operators.coreos.com/v1 | ||
kind: OperatorGroup | ||
metadata: | ||
name: habana-ai-operator | ||
namespace: habana-ai-operator | ||
spec: | ||
targetNamespaces: | ||
- habana-ai-operator | ||
--- | ||
apiVersion: operators.coreos.com/v1alpha1 | ||
kind: Subscription | ||
metadata: | ||
name: habana-ai-operator | ||
namespace: habana-ai-operator | ||
spec: | ||
channel: stable | ||
installPlanApproval: Automatic | ||
name: habana-ai-operator | ||
source: certified-operators | ||
sourceNamespace: openshift-marketplace |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# Copyright (c) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# Adapted from https://docs.habana.ai/en/latest/Orchestration/HabanaAI_Operator/Environment_Setup.html#id2 | ||
# | ||
apiVersion: nfd.openshift.io/v1 | ||
kind: NodeFeatureDiscovery | ||
metadata: | ||
name: nfd-instance | ||
namespace: openshift-nfd | ||
spec: | ||
extraLabelNs: | ||
- habana.ai | ||
instance: '' | ||
operand: | ||
image: >- | ||
registry.redhat.io/openshift4/ose-node-feature-discovery@sha256:edd2adfdf423d6a1eb7e8c1e388d9cf5fbc829e7e66c7bc955e9b2a6f50d1a47 | ||
servicePort: 12000 | ||
topologyupdater: false | ||
workerConfig: | ||
configData: | | ||
core: | ||
sleepInterval: 60s | ||
sources: | ||
pci: | ||
deviceClassWhitelist: | ||
- "0200" | ||
- "03" | ||
- "12" | ||
deviceLabelFields: | ||
- "vendor" |