diff --git a/gpu-operator/gpu-driver-configuration.rst b/gpu-operator/gpu-driver-configuration.rst
index 87907776f..aa5d84077 100644
--- a/gpu-operator/gpu-driver-configuration.rst
+++ b/gpu-operator/gpu-driver-configuration.rst
@@ -34,16 +34,16 @@ You can specify labels in the node selector field to control which NVIDIA driver
 Limitations
 ===========
 
-* This feature is recommended for new cluster installations only.
-  Upgrades from ClusterPolicy managed drivers to NVIDIA driver custom resource managed drivers are not supported.
-  Switching from ClusterPolicy to NVIDIA driver, will cause all existing driver pods to be terminated immediately and redeployed using the new NVIDIADriver configuration.
-* Users are required to either use the default NVIDIA driver custom resource rendered by helm chart or create and manage their own custom NVIDIA driver.
+* This feature is recommended for new cluster installations only.
+  Upgrading from ClusterPolicy-managed drivers to drivers managed by the NVIDIA driver custom resource is not supported.
+  Switching from ClusterPolicy to the NVIDIA driver custom resource causes all existing driver pods to be terminated immediately and redeployed using the new NVIDIADriver configuration.
+* You must either use the default NVIDIA driver custom resource that the Helm chart creates or create and manage your own NVIDIA driver custom resource.
 * You can't use ClusterPolicy and the NVIDIA driver custom resource at the same time.
   You can only use one or the other in a cluster.
 
 Comparison: Managing the Driver with CRD versus the Cluster Policy
 ==================================================================
 
-Before the introduction of the NVIDIA GPU Driver custom resource definition, you manage the driver by modifying
+Before the introduction of the NVIDIA GPU Driver custom resource definition, you managed the driver by modifying
 the driver field and subfields of the cluster policy custom resource definition.
 
 The key differences between the two approaches are summarized in the following table.
@@ -86,7 +86,7 @@ then the Operator starts two daemon sets.
 
 About the Default NVIDIA Driver Custom Resource
 ===============================================
 
-By default, the helm chart configures a default NVIDIA driver custom resource during installation.
+By default, the Helm chart configures a default NVIDIA driver custom resource during installation.
 This custom resource does not include a node selector and as a result, the custom resource applies to every node in your cluster that has an NVIDIA GPU.
 The Operator starts a driver daemon set and pods for each operating system version in your cluster.
@@ -99,6 +99,15 @@ matching all nodes and your custom resources matching some of the same nodes.
 
 To prevent configuring the default custom resource, specify the ``--set driver.nvidiaDriverCRD.deployDefaultCR=false``
 argument when you install the Operator with Helm.
 
+If the Operator is already installed with the default custom resource and you want to create your own
+driver custom resources and apply them to specific nodes, delete the default custom resource.
+
+.. note::
+
+   After you delete the default custom resource, your custom resources might not reconcile
+   automatically due to a known issue. Refer to the :ref:`v26.3.0 known issues `
+   for the workaround.
+
 Feature Compatibility
 =====================
@@ -128,7 +137,7 @@ Support for X86_64 and ARM64
    web page to determine which driver version and operating system combinations support both architectures.
 
 Custom Driver Parameters
-   Each NVIDIA driver custom resource can specify custom kernel module parameters via configmap.
+   Each NVIDIA driver custom resource can specify custom kernel module parameters by using a ConfigMap.
    For more information, refer to :doc:`Customizing NVIDIA GPU Driver Parameters during Installation `.
 
 ***************************************
@@ -304,7 +313,7 @@ One Driver Type and Version on All Nodes
 
    .. literalinclude:: ./manifests/input/nvd-all.yaml
       :language: yaml
 
-#. Apply the manfiest:
+#. Apply the manifest:
 
    .. code-block:: console
@@ -339,7 +348,7 @@ Multiple Driver Versions
 
   .. literalinclude:: ./manifests/input/nvd-driver-multiple.yaml
      :language: yaml
 
-#. Apply the manfiest:
+#. Apply the manifest:
 
   .. code-block:: console
@@ -364,10 +373,10 @@ One Precompiled Driver Container on All Nodes
 
   .. tip::
 
-     Because the manfiest does not include a ``nodeSelector`` field, the driver custom
+     Because the manifest does not include a ``nodeSelector`` field, the driver custom
      resource selects all nodes in the cluster that have an NVIDIA GPU.
 
-#. Apply the manfiest:
+#. Apply the manifest:
 
   .. code-block:: console
@@ -395,7 +404,7 @@ Precompiled Driver Container on Some Nodes
 
   .. literalinclude:: ./manifests/input/nvd-precompiled-some.yaml
      :language: yaml
 
-#. Apply the manfiest:
+#. Apply the manifest:
 
   .. code-block:: console
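Editor's note: the manifests referenced in the hunks above (``nvd-all.yaml``, ``nvd-driver-multiple.yaml``, ``nvd-precompiled-some.yaml``) are not included in this diff. As a reviewing aid only, a minimal NVIDIADriver custom resource of the kind these pages describe might look like the following sketch. It assumes the ``nvidia.com/v1alpha1`` API group used by the GPU Operator's NVIDIADriver CRD; the resource name, the driver version, and the ``driver.config: "gold"`` node label are illustrative placeholders, not values taken from the real manifests.

.. code-block:: yaml

   # Illustrative sketch only; not one of the literalinclude manifests above.
   apiVersion: nvidia.com/v1alpha1
   kind: NVIDIADriver
   metadata:
     name: demo-driver              # placeholder name
   spec:
     driverType: gpu                # standard data-center driver
     repository: nvcr.io/nvidia
     image: driver
     version: "580.126.20"          # a driver branch listed elsewhere in this diff
     nodeSelector:
       driver.config: "gold"        # placeholder label; omit to select all GPU nodes

Because each custom resource carries its own ``nodeSelector``, deploying several such resources with disjoint labels is what allows different driver types and versions on different nodes, as the Limitations section above requires.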
diff --git a/gpu-operator/life-cycle-policy.rst b/gpu-operator/life-cycle-policy.rst
index 75303436a..dcc09cb7d 100644
--- a/gpu-operator/life-cycle-policy.rst
+++ b/gpu-operator/life-cycle-policy.rst
@@ -93,9 +93,10 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
    * - v26.3.0
 
    * - NVIDIA GPU Driver |ki|_
-     - | `590.48.01 `_
+     - | `595.58.03 `_
+       | `590.48.01 `_
       | `580.126.20 `_ (**D**, **R**)
-      | `570.211.01 `_
+      | `570.211.01 `_
       | `535.288.01 `_
 
    * - NVIDIA Driver Manager for Kubernetes
diff --git a/gpu-operator/platform-support.rst b/gpu-operator/platform-support.rst
index c77ce054f..20eee1610 100644
--- a/gpu-operator/platform-support.rst
+++ b/gpu-operator/platform-support.rst
@@ -426,7 +426,7 @@ The GPU Operator has been validated in the following scenarios:
 .. _rhel-9:
 
 :sup:`3`
-   Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, and 9.6 versions are available for x86 based platforms only.
+   Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, 9.6, and 9.7 are available for x86-based platforms only.
    They are not available for ARM based systems.
 
 .. note::
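Editor's note: the platform-support change above narrows non-precompiled RHEL 9.x driver containers to x86 platforms. As a quick way to confirm what each node actually reports, standard kubectl (not a command from these docs) can print the operating system image and CPU architecture from the node status:

.. code-block:: console

   $ kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage,ARCH:.status.nodeInfo.architecture

Nodes that report ``arm64`` together with a Red Hat Enterprise Linux 9.x image fall outside the support statement above.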
diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index f1247e2b2..6e47426ec 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -81,6 +81,8 @@ New Features
   You are still able to use a custom MIG configuration if you have specific requirements.
   Refer to the :doc:`MIG Manager documentation ` for more information.
 
+  There is a known issue with MIG configuration on RHEL 8 with pre-installed NVIDIA drivers. Refer to the :ref:`Known Issues ` section for more information.
+
 * Added support for the NVIDIA Driver Custom Resource Definition (CRD).
   Use this feature on new cluster installations to configure multiple driver types and versions on different nodes or multiple operating system versions on nodes.
   Refer to the :doc:`NVIDIA Driver Custom Resource Definition documentation ` for more information.
@@ -152,6 +154,59 @@ Fixed Issues
 
 * Fixed an issue where the GPU Operator was not adding a namespace to ServiceAccount objects. (`PR #2039 `_)
 
+.. _v26.3.0-known-issues:
+
+Known Issues
+------------
+
+* When GPUDirect RDMA is enabled, the ``nvidia-peermem`` container can fail to restart after the driver pod restarts without a node reboot and without any driver configuration changes.
+  In this scenario, the driver uses a fast-path optimization that skips recompilation, but the ``nvidia-peermem`` sidecar does not detect that its module is already loaded and fails to start.
+  This occurs because the kernel state is not cleared when the driver pod restarts.
+
+  To work around this issue, set the ``FORCE_REINSTALL=true`` environment variable in the ClusterPolicy:
+
+  .. code-block:: console
+
+     $ kubectl patch clusterpolicy cluster-policy --type=json \
+         -p='[{"op": "add", "path": "/spec/driver/manager/env/-", "value": {"name": "FORCE_REINSTALL", "value": "true"}}]'
+
+  Setting ``FORCE_REINSTALL=true`` forces a full driver recompilation, a node drain, and GPU workload disruption on every restart.
+  Alternatively, rebooting the node clears the kernel state and allows the ``nvidia-peermem`` module to load successfully, though this can also disrupt running workloads.
+
+* On RHEL 8 nodes with pre-installed NVIDIA drivers (``driver.enabled=false``), MIG configuration can fail when you use NVIDIA MIG Manager v0.13.1 or later.
+  NVIDIA MIG Manager copies the ``nvidia-mig-parted`` binary to the host and runs it in the host userspace by using ``chroot``.
+  Recent versions of the binary were compiled against a UBI9 base image and require GLIBC 2.32 and GLIBC 2.34, which are not available on RHEL 8, causing the following errors in the MIG Manager pod logs:
+
+  .. code-block:: console
+
+     /usr/local/nvidia/mig-manager/nvidia-mig-parted: /lib64/libc.so.6: version `GLIBC_2.32' not found
+     /usr/local/nvidia/mig-manager/nvidia-mig-parted: /lib64/libc.so.6: version `GLIBC_2.34' not found
+
+  To work around this issue, downgrade the NVIDIA MIG Manager component to v0.12.3.
+  After downgrading, automatically generated per-node MIG configuration ConfigMaps are not available.
+  MIG configuration information is available in the ``default-mig-parted-config`` ConfigMap instead.
+  Refer to the :doc:`MIG Manager documentation ` for more information about MIG configuration.
+
+  Refer to MIG Controller issue `#329 `_ for more information.
+
+* After you delete the default NVIDIADriver custom resource, any custom NVIDIADriver
+  custom resources that you created might not become active automatically.
+  The custom resources remain in a pending state because the Operator controller
+  does not re-evaluate them after the conflicting default custom resource is removed.
+
+  To work around this issue, restart the GPU Operator controller by deleting
+  the controller pod:
+
+  .. code-block:: console
+
+     $ kubectl delete pod -n gpu-operator -l app=gpu-operator
+
+  Restarting the controller pod does not disrupt running GPU workloads or
+  driver pods on nodes.
+
+  Refer to issue `#2259 `_
+  for more information.
+
 Removals and Deprecations
 -------------------------
@@ -159,8 +214,6 @@ Removals and Deprecations
 
 * Marked unused field ``defaultRuntime`` as optional in the ClusterPolicy. (`PR #2000 `_)
 
-
-
 .. _v25.10.1:
 
 25.10.1
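Editor's note: for the NVIDIADriver reconciliation known issue above, the complete workaround flow might look like the following sequence. It assumes the Helm-created default custom resource is named ``default`` and uses a hypothetical ``my-custom-driver.yaml`` manifest; check the real resource name with the first command before deleting anything.

.. code-block:: console

   $ kubectl get nvidiadrivers
   $ kubectl delete nvidiadriver default           # assumes the default CR is named "default"
   $ kubectl apply -f my-custom-driver.yaml        # hypothetical custom NVIDIADriver manifest
   $ kubectl delete pod -n gpu-operator -l app=gpu-operator

The final command is the restart documented in the known issue itself; the first three lines are added here only for context.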