I am working in a Liqid Dynamic GPU environment.
Operationally, this means that when workloads are submitted to Kubernetes, the requested number and type of GPUs, as well as any MIG profiles, are configured automatically via mig-manager.
Exactly how this works is beyond the scope of this question, but I will share some context.
The Liqid software watches pending pods and reads fields such as:
- nvidia.com/gpu.product
- nvidia.com/gpu
Liqid uses these workload attributes to determine how many GPUs, and of what type, need to be exposed from the Liqid fabric to a host for the workload to run. The Liqid software also supports defining MIG profiles like this:
nvidia.com/gpu.product: "NVIDIA-A100-PCIE-40GB-MIG-3g.20gb"
The Liqid software understands this to mean: expose an A100 from the Liqid fabric and apply the 3g.20gb MIG configuration to it via mig-manager labels. Liqid relies on mig-manager to do the heavy lifting of applying the correct MIG profiles to the GPUs and restarting any operator pods.
All of this works great when the first GPU is presented to the host:
- Workload requests: 1 x "NVIDIA-A100-PCIE-40GB-MIG-3g.20gb"
- Liqid intercepts the request and exposes a GPU.
- GPU operator pods all automatically start up and the base A100 is seen by K8s.
- Liqid then sets the label nvidia.com/mig.config=3g.20gb.
- mig-manager applies the MIG configuration.
- Once the GPU is reset and all operator pods have restarted, the MIG devices appear in nvidia.com/gpu.product.
- Kubernetes starts the workload.
When a second identical workload is submitted, it starts without any intervention because the 3g.20gb profile provides two instances per GPU.
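The arithmetic behind this: an A100 40GB supports at most two 3g.20gb instances, so the first two workloads share one GPU and the third forces a new one. A quick sketch:

```python
# Maximum MIG instances per GPU for the profile in question: an A100's
# 7 compute slices allow at most two 3g.20gb instances.
INSTANCES_PER_GPU = {"3g.20gb": 2}

def gpus_needed(profile: str, workloads: int) -> int:
    """Physical GPUs required to host `workloads` instances of `profile`."""
    per_gpu = INSTANCES_PER_GPU[profile]
    return -(-workloads // per_gpu)  # ceiling division

# Workloads 1 and 2 fit on the first A100; workload 3 requires a second.
print([gpus_needed("3g.20gb", n) for n in (1, 2, 3)])  # [1, 1, 2]
```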
The problem comes with the third workload.
There are no instances available, so Liqid must expose a new A100 to the host and configure it for MIG.
The desired outcome is for mig-manager to leave the first GPU alone and configure only the second GPU with the correct MIG profile.
The issue I am running into is that, by default, mig-manager attempts to reset all GPUs before applying the MIG configuration. The first physical GPU cannot be reset because active workloads are running on it. This causes mig-manager to fail and leaves the node in a MIG INVALID state.
The MIG configuration in this example is:

```yaml
all-3g.20gb:
  - devices: all
    mig-enabled: true
    mig-devices:
      "3g.20gb": 2
```
I have attempted to work around this by creating slot-specific MIG profiles like this:
```yaml
gpu0-3g.20gb:
  - devices: [0]
    mig-enabled: true
    mig-devices:
      "3g.20gb": 2
gpu1-3g.20gb:
  - devices: [1]
    mig-enabled: true
    mig-devices:
      "3g.20gb": 2
gpu2-3g.20gb:
  - devices: [2]
    mig-enabled: true
    mig-devices:
      "3g.20gb": 2
```
That way, I can assign a slot-specific MIG profile dynamically when assigning a GPU.
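Since the per-slot entries follow a fixed pattern, they could also be generated rather than hand-written. A hypothetical sketch (the entry structure mirrors the mig-parted config above; the helper itself is mine):

```python
def slot_profile(gpu_index: int, profile: str = "3g.20gb", count: int = 2) -> dict:
    """Build one slot-specific mig-parted config entry for a single GPU index."""
    return {
        f"gpu{gpu_index}-{profile}": [
            {
                "devices": [gpu_index],
                "mig-enabled": True,
                "mig-devices": {profile: count},
            }
        ]
    }

configs = {}
for i in range(3):  # one entry per expected GPU slot
    configs.update(slot_profile(i))
# configs now holds gpu0-3g.20gb, gpu1-3g.20gb, gpu2-3g.20gb
```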
After some testing of this slot-specific approach, it appears that mig-manager still attempts to reset all GPUs on the host, regardless of which devices are specified in the MIG profile.
Here is the log from the slot-specific profile:
```
kubectl logs nvidia-mig-manager-gx8rn -n kommander
Defaulted container "nvidia-mig-manager" out of: nvidia-mig-manager, toolkit-validation (init)
IS_HOST_DRIVER=false
NVIDIA_DRIVER_ROOT=/run/nvidia/driver
DRIVER_ROOT_CTR_PATH=/driver-root
NVIDIA_DEV_ROOT=/run/nvidia/driver
DEV_ROOT_CTR_PATH=/driver-root
WITH_SHUTDOWN_HOST_GPU_CLIENTS=false
Starting nvidia-mig-manager
W0227 16:54:19.036623 1607968 client_config.go:667] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2026-02-27T16:54:19Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
time="2026-02-27T16:54:19Z" level=info msg="Updating to MIG config: manual-gpu1-3g.20gb"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=false'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=success'
Checking if the selected MIG config is currently applied or not
time="2026-02-27T16:54:20Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
time="2026-02-27T16:54:20Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Changing the 'nvidia.com/mig.config.state' node label to 'pending'
node/blade-slot-2 labeled
Shutting down all GPU clients in Kubernetes by disabling their component-specific nodeSelector labels
node/blade-slot-2 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-rwx2d condition met
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Removing the cuda-validator pod
pod "nvidia-cuda-validator-lrttx" deleted
Removing the plugin-validator pod
No resources found
Applying the MIG mode change from the selected config to the node (and double checking it took effect)
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2026-02-27T16:54:27Z" level=debug msg="Parsing config file..."
time="2026-02-27T16:54:27Z" level=debug msg="Selecting specific MIG config..."
time="2026-02-27T16:54:27Z" level=debug msg="Running apply-start hook"
time="2026-02-27T16:54:27Z" level=debug msg="Checking current MIG mode..."
time="2026-02-27T16:54:29Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2026-02-27T16:54:29Z" level=debug msg=" GPU 1: 0x20F110DE"
time="2026-02-27T16:54:29Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2026-02-27T16:54:29Z" level=debug msg=" MIG capable: true\n"
time="2026-02-27T16:54:29Z" level=debug msg=" Current MIG mode: Disabled"
time="2026-02-27T16:54:30Z" level=debug msg="Running pre-apply-mode hook"
time="2026-02-27T16:54:30Z" level=debug msg="Applying MIG mode change..."
time="2026-02-27T16:54:32Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2026-02-27T16:54:32Z" level=debug msg=" GPU 1: 0x20F110DE"
time="2026-02-27T16:54:32Z" level=debug msg=" MIG capable: true\n"
time="2026-02-27T16:54:32Z" level=debug msg=" Current MIG mode: Disabled"
time="2026-02-27T16:54:32Z" level=debug msg=" Updating MIG mode: Enabled"
time="2026-02-27T16:54:32Z" level=debug msg=" Mode change pending: true"
time="2026-02-27T16:54:33Z" level=debug msg="At least one mode change pending"
time="2026-02-27T16:54:33Z" level=debug msg="Resetting all GPUs..."
time="2026-02-27T16:54:37Z" level=error msg="\nThe following GPUs could not be reset:\n GPU 00000000:9C:00.0: In use by another client\n GPU 00000000:9D:00.0: Unknown Error\n\n1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.\n"
time="2026-02-27T16:54:37Z" level=debug msg="Running apply-exit hook"
time="2026-02-27T16:54:37Z" level=fatal msg="Error applying MIG configuration with hooks: error resetting all GPUs: exit status 255"
time="2026-02-27T16:54:37Z" level=debug msg="Parsing config file..."
time="2026-02-27T16:54:37Z" level=debug msg="Selecting specific MIG config..."
time="2026-02-27T16:54:37Z" level=debug msg="Asserting MIG mode configuration..."
time="2026-02-27T16:54:39Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2026-02-27T16:54:39Z" level=debug msg=" GPU 1: 0x20F110DE"
time="2026-02-27T16:54:39Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2026-02-27T16:54:39Z" level=debug msg=" MIG capable: true\n"
time="2026-02-27T16:54:39Z" level=debug msg=" Current MIG mode: Disabled"
time="2026-02-27T16:54:40Z" level=debug msg="Current mode different than mode being asserted"
time="2026-02-27T16:54:40Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Applying the selected MIG config to the node
time="2026-02-27T16:54:40Z" level=debug msg="Parsing config file..."
time="2026-02-27T16:54:40Z" level=debug msg="Selecting specific MIG config..."
time="2026-02-27T16:54:40Z" level=debug msg="Running apply-start hook"
time="2026-02-27T16:54:40Z" level=debug msg="Checking current MIG mode..."
time="2026-02-27T16:54:42Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2026-02-27T16:54:42Z" level=debug msg=" GPU 1: 0x20F110DE"
time="2026-02-27T16:54:42Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2026-02-27T16:54:42Z" level=debug msg=" MIG capable: true\n"
time="2026-02-27T16:54:42Z" level=debug msg=" Current MIG mode: Disabled"
time="2026-02-27T16:54:43Z" level=debug msg="Running pre-apply-mode hook"
time="2026-02-27T16:54:43Z" level=debug msg="Applying MIG mode change..."
time="2026-02-27T16:54:45Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2026-02-27T16:54:45Z" level=debug msg=" GPU 1: 0x20F110DE"
time="2026-02-27T16:54:45Z" level=debug msg=" MIG capable: true\n"
time="2026-02-27T16:54:45Z" level=debug msg=" Current MIG mode: Disabled"
time="2026-02-27T16:54:45Z" level=debug msg=" Updating MIG mode: Enabled"
time="2026-02-27T16:54:45Z" level=debug msg=" Mode change pending: true"
time="2026-02-27T16:54:45Z" level=debug msg="At least one mode change pending"
time="2026-02-27T16:54:45Z" level=debug msg="Resetting all GPUs..."
time="2026-02-27T16:54:50Z" level=error msg="\nThe following GPUs could not be reset:\n GPU 00000000:9C:00.0: In use by another client\n GPU 00000000:9D:00.0: Unknown Error\n\n1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.\n"
time="2026-02-27T16:54:50Z" level=debug msg="Running apply-exit hook"
time="2026-02-27T16:54:50Z" level=fatal msg="Error applying MIG configuration with hooks: error resetting all GPUs: exit status 255"
Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/blade-slot-2 labeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
node/blade-slot-2 labeled
time="2026-02-27T16:54:51Z" level=error msg="Error: exit status 1"
time="2026-02-27T16:54:51Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
```
Is there any way to control this reset behavior so that mig-manager only resets the devices specified in the ConfigMap?