2023-02-09
Technical notes

A brief record of configuring GPUs and MIG on the OpenShift platform.

Install the Node Feature Discovery (NFD) Operator

https://docs.openshift.com/container-platform/4.10/hardware_enablement/psap-node-feature-discovery-operator.html

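Search for the operator in OperatorHub, choose to make it available only in a specific namespace, and click Install. If the installation stays stuck at UpgradePending and the install plan reports this error: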

Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline

it can be resolved by following https://access.redhat.com/solutions/6459071

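After the installation succeeds, open the operator details, click the NodeFeatureDiscovery tab, click Create, keep the default parameters, and confirm. The operator then creates two DaemonSets (ds), nfd-master and nfd-worker.

To verify, open the details of the GPU node. If the following label is present, the configuration succeeded: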

feature.node.kubernetes.io/pci-10de.present=true
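
The same check can be done from the command line instead of the console. A minimal sketch, assuming a logged-in `oc`; the helper name `has_gpu_label` is made up for this note, and it only greps whatever label text is piped into it:

```bash
# Label that NFD sets when a PCI device with vendor ID 10de (NVIDIA) is found.
NFD_GPU_LABEL='feature.node.kubernetes.io/pci-10de.present=true'

# has_gpu_label (hypothetical helper): reads label text on stdin, for example
# the output of `oc get node <name> --show-labels`, and succeeds if the
# NFD GPU label appears in it.
has_gpu_label() {
  grep -qF -- "$NFD_GPU_LABEL"
}

# Usage against a real cluster (not executed here):
#   oc get nodes -l "$NFD_GPU_LABEL"
#   oc get node "$NODE_NAME" --show-labels | has_gpu_label && echo "label present"
```

Listing nodes with the `-l` selector is usually enough; the helper is only handy inside a script.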

Install the GPU Operator

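Search for NVIDIA GPU Operator in OperatorHub and click Install. Once the installation completes, open the operator details, click the ClusterPolicy tab, and click Create. Because we run OKD, the open-source distribution, two points need attention: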

  1. Set use_ocp_driver_toolkit to false; otherwise nvidia-driver-daemonset later fails with the following error (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/troubleshooting-gpu-ocp.html#verify-the-nvidia-driver-deployment):
Unable to find a match: elfutils-libelf-devel.x86_64 failed to install elfutils packages. RHEL entitlement may be improperly deployed.

However, in our tests setting use_ocp_driver_toolkit to false still did not work. We then configured the entitlement by following https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/archive/1.8/openshift/cluster-entitlement.html#obtain-entitlement-1-8 and the driver still failed afterwards.

  2. Modify the os-release file on the GPU machine so that it passes the OS check performed by nvidia-driver-daemonset:
cp /etc/os-release /opt/os-release
unlink /etc/os-release

Edit /opt/os-release:

# Change the following line:
ID=rhcos
# Add the following line:
RHEL_VERSION="8.4"
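
For reference, the edit to the copied file could be scripted roughly as follows. This is a sketch only: `patch_os_release` is a hypothetical helper name, and it rewrites a file given as an argument rather than touching /etc directly.

```bash
# patch_os_release (hypothetical helper): set ID=rhcos and make sure
# RHEL_VERSION="8.4" is present in the given os-release copy.
patch_os_release() {
  local file=$1
  # Replace the ID= line; VERSION_ID= and ID_LIKE= are left alone
  # because the pattern is anchored at the start of the line.
  sed 's/^ID=.*/ID=rhcos/' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  # Append RHEL_VERSION only if no such line exists yet.
  grep -q '^RHEL_VERSION=' "$file" || printf 'RHEL_VERSION="8.4"\n' >> "$file"
}

# Usage on the GPU node (not executed here):
#   patch_os_release /opt/os-release
```

Running it twice leaves the file unchanged the second time.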

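In our tests neither of the two approaches above worked. We finally found Fedora CoreOS driver images on GitLab: https://gitlab.com/nvidia/container-images/driver/container_registry

However, only fedora36 images are published there, while our system is fedora35, and gpu-operator automatically appends the fedora35 suffix to the image. The only option is to wait until the DaemonSet has been created and then change the image tag. The following two images were tested and work: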

registry.gitlab.com/nvidia/container-images/driver:ff4d82c7-470.141.03-fedora36
registry.gitlab.com/nvidia/container-images/driver:4517fedd-510.85.02-fedora36
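
Changing the tag by hand is easy to get wrong, so a small helper could compute the fedora36 reference from the image the operator generated. This is only a sketch: `to_fedora36` is a made-up name, and the namespace (nvidia-gpu-operator) and container name (nvidia-driver-ctr) in the usage comment are assumptions about a typical install that were not verified on this cluster.

```bash
# to_fedora36 (hypothetical helper): swap a trailing -fedora35 suffix for
# -fedora36; any other image reference is printed unchanged.
to_fedora36() {
  local image=$1
  case $image in
    *-fedora35) printf '%s\n' "${image%-fedora35}-fedora36" ;;
    *)          printf '%s\n' "$image" ;;
  esac
}

# Usage once the DaemonSet exists (not executed here; names are assumptions):
#   NEW_IMAGE=$(to_fedora36 "registry.gitlab.com/nvidia/container-images/driver:4517fedd-510.85.02-fedora35")
#   oc -n nvidia-gpu-operator set image daemonset/nvidia-driver-daemonset \
#     nvidia-driver-ctr="$NEW_IMAGE"
```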

The ClusterPolicy configuration is as follows:

After creation, some pods failed to pull their images with this error:

dial tcp: lookup ngc.download.nvidia.cn: no such host

Workaround: find a machine that can resolve this hostname and write the resolved address into /etc/hosts:

124.232.178.99 ngc.download.nvidia.cn
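
If several nodes need the entry, appending it idempotently avoids duplicate lines. A minimal sketch, with `add_host_entry` as a hypothetical helper that works on any hosts-style file passed as its first argument:

```bash
# add_host_entry (hypothetical helper): append "IP NAME" to a hosts file
# unless some non-comment line already lists NAME as a hostname.
add_host_entry() {
  local hosts_file=$1 ip=$2 name=$3
  if awk -v h="$name" '
       !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
       END { exit !found }' "$hosts_file"; then
    return 0   # already present, nothing to do
  fi
  printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Usage on the node, as root (not executed here):
#   add_host_entry /etc/hosts 124.232.178.99 ngc.download.nvidia.cn
```

The address is the one resolved at the time of writing and may have changed since, so resolve it again before reusing it.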

Note: installing gpu-operator involves pulling images and installing packages. On a poor network it can take a long time (several hours) to finish.

Verify the installation

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/archive/1.8/openshift/install-gpu-ocp.html#verify-the-successful-installation-of-the-nvidia-gpu-operator
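
Before going through the linked checks, a quick look at pod status already tells a lot. The sketch below filters `oc get pods` output for pods that are neither Running nor Completed; `not_ready_pods` is a made-up helper, and the namespace nvidia-gpu-operator in the usage comment is an assumption.

```bash
# not_ready_pods (hypothetical helper): reads the table printed by
# `oc get pods` on stdin and prints the NAME of every pod whose STATUS
# column is neither Running nor Completed. The header row is skipped.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage (not executed here; the namespace is an assumption):
#   oc get pods -n nvidia-gpu-operator | not_ready_pods
```

An empty result means every pod in the namespace is Running or Completed.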

Configure MIG

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/archive/1.8/openshift/mig-ocp.html

Configure the cluster policy:

bash
STRATEGY=mixed && \
  oc patch clusterpolicy/gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": "'"$STRATEGY"'"}]'
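
The value in a JSON patch has to be a quoted JSON string, which is fiddly to get right with nested shell quotes. One way around that is to build the patch in a helper and pass the result to `oc patch`. A sketch, with `mig_strategy_patch` as a hypothetical name; it accepts none, single, and mixed, which are the strategies I am aware of:

```bash
# mig_strategy_patch (hypothetical helper): print a JSON patch that sets
# /spec/mig/strategy to the given value, rejecting unknown strategies.
mig_strategy_patch() {
  local strategy=$1
  case $strategy in
    none|single|mixed) ;;
    *) echo "unknown MIG strategy: $strategy" >&2; return 1 ;;
  esac
  printf '[{"op": "replace", "path": "/spec/mig/strategy", "value": "%s"}]\n' "$strategy"
}

# Usage (not executed here):
#   oc patch clusterpolicy/gpu-cluster-policy --type='json' -p="$(mig_strategy_patch mixed)"
```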

Assign a profile to the node:

bash
MIG_CONFIGURATION=all-2g.10gb && \
  oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
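
After labelling, mig-manager reports progress through a second node label, nvidia.com/mig.config.state (to my understanding). The sketch below pulls that value out of `--show-labels` output; `mig_config_state` is a made-up helper name.

```bash
# mig_config_state (hypothetical helper): reads `oc get node --show-labels`
# output on stdin and prints the value of nvidia.com/mig.config.state, if any.
mig_config_state() {
  # Put every label on its own line, then strip the key from the matching one.
  tr ', ' '\n\n' | sed -n 's|^nvidia\.com/mig\.config\.state=||p'
}

# Usage (not executed here):
#   oc get node "$NODE_NAME" --show-labels | mig_config_state
```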

Modify the MIG configuration

  1. Edit the mig-parted-config ConfigMap (cm) and add a custom configuration.
    For how to define a configuration, see https://docs.nvidia.com/datacenter/cloud-native/openshift/23.9.0/mig-ocp.html and https://gitlab.com/nvidia/kubernetes/gpu-operator/-/blob/master/assets/state-mig-manager/0400_configmap.yaml?ref_type=heads

  2. Change the node label so that it points to the custom configuration:
    nvidia.com/mig.config: custom-config

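  3. Wait for mig-manager to apply the configuration. If it never updates, check the mig-manager pod logs and, if necessary, delete the pod so that it is recreated.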
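
The steps above can be sketched as follows. The helper prints an example entry to paste under `mig-configs:` in the ConfigMap. The device mix shown (two 1g.5gb, one 2g.10gb, one 3g.20gb) is an illustrative assumption that fits a 40 GB A100 and is not the configuration used on this cluster; the function name and the namespace in the usage comment are assumptions as well.

```bash
# custom_mig_config_fragment (hypothetical helper): print a sample
# "custom-config" entry in mig-parted format, indented so it can be pasted
# directly under the mig-configs: key of the mig-parted-config ConfigMap.
custom_mig_config_fragment() {
  cat <<'EOF'
  custom-config:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
EOF
}

# Usage (not executed here; the namespace is an assumption):
#   custom_mig_config_fragment                      # step 1: copy the output
#   oc -n nvidia-gpu-operator edit configmap mig-parted-config
#   oc label node/$NODE_NAME nvidia.com/mig.config=custom-config --overwrite   # step 2
```

The name after `nvidia.com/mig.config=` must match the key added to the ConfigMap exactly.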

References

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/archive/1.8/openshift/mig-ocp.html
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/contents.html
https://docs.nvidia.com/datacenter/cloud-native/openshift/23.9.0/mig-ocp.html
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/blob/master/assets/state-mig-manager/0400_configmap.yaml?ref_type=heads

Author: renbear

Copyright notice: unless otherwise stated, all articles on this blog are licensed under CC BY-NC 2.0. Please credit the source when reposting!