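A brief record of configuring GPUs and MIG on the OpenShift platform.
Search OperatorHub for the Node Feature Discovery operator, choose to make it available only in a specific namespace, and click Install. If the installation stays at UpgradePending and the install plan reports the following error: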
Bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline
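it can be resolved by following: https://access.redhat.com/solutions/6459071
After the installation succeeds, open the operator details, click the NodeFeatureDiscovery tab, click Create, keep the default parameters, and confirm. The operator then creates two DaemonSets, nfd-master and nfd-worker. To verify, open the details of the GPU node; if the following label appears, the configuration succeeded: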
feature.node.kubernetes.io/pci-10de.present=true
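The same check can be done from the command line. The sketch below assumes the `oc` CLI is already logged in to the cluster; `list_gpu_nodes` is a hypothetical helper name, and the selector is the label shown above.

```shell
# Label that NFD sets on nodes exposing an NVIDIA PCI device (vendor ID 10de).
GPU_LABEL='feature.node.kubernetes.io/pci-10de.present=true'

# Print the names of all nodes that carry the NFD GPU label.
# Hypothetical helper; requires a logged-in `oc` session.
list_gpu_nodes() {
  oc get nodes -l "$GPU_LABEL" -o name
}
```

Calling `list_gpu_nodes` should print the GPU node. An empty result suggests that NFD has not labeled any node yet.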
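Search OperatorHub for NVIDIA GPU Operator and click Install. After the installation completes, open the operator details, click the ClusterPolicy tab, and click Create. Because we use OKD, the open-source distribution, two points need attention here: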
Unable to find a match: elfutils-libelf-devel.x86_64 failed to install elfutils packages. RHEL entitlement may be improperly deployed.
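However, in our tests, setting use_ocp_driver_toolkit to false still did not work. We then configured the entitlement by following the link below, and the installation still failed afterwards:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/archive/1.8/openshift/cluster-entitlement.html#obtain-entitlement-1-8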
cp /etc/os-release /opt/os-release
unlink /etc/os-release
Edit /opt/os-release:
# Change the following line:
ID=rhcos
# Add the following line:
RHEL_VERSION="8.4"
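The two edits to the copied file can also be scripted. This is a minimal sketch, assuming a POSIX shell with `sed`, `grep` and `mktemp`; `patch_os_release` is a hypothetical helper, and it is meant to be pointed at the copy in /opt rather than at the original file.

```shell
# Set ID=rhcos and make sure RHEL_VERSION="8.4" is present in the given file.
# Hypothetical helper; the values are the ones from the manual steps above.
patch_os_release() {
  file=$1
  tmp=$(mktemp) || return 1
  # Replace only the line that starts with ID= (VERSION_ID= is left alone).
  sed 's/^ID=.*/ID=rhcos/' "$file" > "$tmp" || return 1
  # Append RHEL_VERSION only if no such line exists yet, so reruns are safe.
  grep -q '^RHEL_VERSION=' "$tmp" || printf 'RHEL_VERSION="8.4"\n' >> "$tmp"
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage: patch_os_release /opt/os-release
```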
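In our tests, neither of the two approaches above worked. We finally found Fedora CoreOS driver images on GitLab: https://gitlab.com/nvidia/container-images/driver/container_registry
However, only fedora36 images are available there, while our system is fedora35. gpu-operator automatically appends the fedora35 suffix to the image, so the only option is to wait until the DaemonSet has been created and then change the image tag. The following two images were tested and work:
registry.gitlab.com/nvidia/container-images/driver:ff4d82c7-470.141.03-fedora36
registry.gitlab.com/nvidia/container-images/driver:4517fedd-510.85.02-fedora36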
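Changing the tag on the generated DaemonSet can be done with `oc set image`. The sketch below is only an illustration: `set_driver_image` is a hypothetical helper, and the namespace, DaemonSet name and container name in the usage comment are assumptions that depend on the GPU Operator version, so look them up in your own cluster first.

```shell
# Point one container of an existing DaemonSet at a different image.
# Arguments: namespace, DaemonSet name, container name, full image reference.
set_driver_image() {
  oc -n "$1" set image "daemonset/$2" "$3=$4"
}

# Example (the first three values are assumptions; check with `oc get ds -A`):
#   set_driver_image nvidia-gpu-operator nvidia-driver-daemonset nvidia-driver-ctr \
#     registry.gitlab.com/nvidia/container-images/driver:4517fedd-510.85.02-fedora36
```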
The ClusterPolicy configuration is as follows:
After creation, some pods failed to pull their images with this error:
dial tcp: lookup ngc.download.nvidia.cn: no such host
Workaround: find a machine that can resolve this hostname and write the resolved address into /etc/hosts:
124.232.178.99 ngc.download.nvidia.cn
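A small sketch for adding the entry without creating duplicates when it is run more than once. `add_hosts_entry` is a hypothetical helper; it assumes root access on the node, and the IP address is the one resolved above, which may differ or change in your environment.

```shell
# Append "IP HOST" to a hosts file unless HOST is already mapped there.
# Arguments: IP address, hostname, optional file (defaults to /etc/hosts).
add_hosts_entry() {
  ip=$1
  host=$2
  file=${3:-/etc/hosts}
  # Look for the exact hostname in any non-comment line.
  if awk -v h="$host" '
      $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
      END { exit found ? 0 : 1 }' "$file"; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$host" >> "$file"
}

# Usage: add_hosts_entry 124.232.178.99 ngc.download.nvidia.cn
```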
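Note: installing gpu-operator involves pulling images and installing packages, so on a poor network connection it may take a long time (several hours) to finish.
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/archive/1.8/openshift/mig-ocp.html
Configure the ClusterPolicy: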
STRATEGY=mixed && \
oc patch clusterpolicy/gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": "'$STRATEGY'"}]'
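As a guard against typos in the strategy value, the patch body can be built by a small function. This is a sketch under assumptions: `mig_strategy_patch` is a hypothetical helper, and the accepted values (none, single, mixed) are the MIG strategies I know the GPU Operator to support.

```shell
# Print a JSON patch that sets /spec/mig/strategy, rejecting unknown values.
mig_strategy_patch() {
  case $1 in
    none|single|mixed) ;;
    *) echo "unknown MIG strategy: $1" >&2; return 1 ;;
  esac
  # The value is quoted so that the patch is valid JSON.
  printf '[{"op": "replace", "path": "/spec/mig/strategy", "value": "%s"}]' "$1"
}

# Usage:
#   oc patch clusterpolicy/gpu-cluster-policy --type='json' -p="$(mig_strategy_patch mixed)"
```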
Set the MIG profile on the node:
MIG_CONFIGURATION=all-2g.10gb && \
oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
Edit the mig-parted-config ConfigMap to add a custom configuration.
For how to define a configuration, see https://docs.nvidia.com/datacenter/cloud-native/openshift/23.9.0/mig-ocp.html and https://gitlab.com/nvidia/kubernetes/gpu-operator/-/blob/master/assets/state-mig-manager/0400_configmap.yaml?ref_type=heads
Change the node label to point at the custom configuration:
nvidia.com/mig.config: custom-config
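For illustration, a custom entry in the mig-parted format might look like the output of the sketch below. Everything in it is an assumption: the profile names and counts suit a 40 GB A100 (2 + 2 + 3 = 7 compute slices), `print_custom_mig_config` is a hypothetical helper, and the entry is meant to be pasted under the `mig-configs:` key of the ConfigMap's config file.

```shell
# Print a mig-parted entry named custom-config, indented so that it can be
# pasted under "mig-configs:". Profiles are illustrative for a 40 GB A100.
print_custom_mig_config() {
  cat <<'EOF'
  custom-config:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
EOF
}

# After adding the entry to the ConfigMap, label the node, for example:
#   oc label node/$NODE_NAME nvidia.com/mig.config=custom-config --overwrite
```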
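Wait for mig-manager to apply the configuration. If nothing changes, check the logs of the mig-manager pod and, if necessary, delete the pod so that it is recreated.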
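Waiting can be automated by polling the node. The sketch below assumes that mig-manager reports progress through the node label `nvidia.com/mig.config.state` with the values success and failed, which is my understanding of current GPU Operator behavior; `mig_config_state` and `wait_for_mig` are hypothetical helpers.

```shell
# Read the state label that mig-manager is assumed to write on the node.
mig_config_state() {
  oc get node "$1" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
}

# Poll until the state is success (returns 0) or failed (returns 1);
# returns 2 on timeout.
# Arguments: node name, optional attempts (30), optional delay in seconds (10).
wait_for_mig() {
  node=$1
  tries=${2:-30}
  delay=${3:-10}
  state=
  i=0
  while [ "$i" -lt "$tries" ]; do
    state=$(mig_config_state "$node")
    case $state in
      success) return 0 ;;
      failed) echo "mig-manager reported failure on $node" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $node; last state: ${state:-unknown}" >&2
  return 2
}

# Usage: wait_for_mig "$NODE_NAME"
```

A non-zero return is the cue to look at the mig-manager pod logs as described above.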
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/archive/1.8/openshift/mig-ocp.html
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/contents.html
https://docs.nvidia.com/datacenter/cloud-native/openshift/23.9.0/mig-ocp.html
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/blob/master/assets/state-mig-manager/0400_configmap.yaml?ref_type=heads
Author: renbear
Copyright notice: unless otherwise stated, all articles on this blog are licensed under CC BY-NC 2.0. Please credit the source when reposting.