gpu nvidia mig time-slicing mps multi-tenancy inference cost-optimization

GPU Sharing Strategies for Multi-Tenant Kubernetes: MIG, Time-Slicing, and MPS

by KubeDojo · May 6, 2026 · 14 min read

Introduction

Figure 1: Three GPU Sharing Mechanisms Compared

Three GPU sharing mechanisms. Three isolation models. One ecosystem where cost pressure forces hard choices.

When teams run multiple inference workloads per GPU, they need to understand the trade-offs between isolation, throughput, and resource utilization. MIG partitioning provides hardware-level isolation. Time-slicing multiplexes workloads in time with no memory or fault isolation. MPS lets multiple processes run on one GPU concurrently, with a control daemon applying per-client memory and compute limits. All three are configured through the NVIDIA GPU Operator, and the NVIDIA DRA driver offers an alternative allocation path for GPUs and MIG devices.

MIG Partitioning: Hardware-Level Isolation

Figure 2: MIG Partitioning Architecture

Multi-Instance GPU (MIG) partitions a single physical GPU into smaller, isolated instances with their own memory, compute engines, and I/O paths. This is the only GPU sharing mechanism that provides true hardware-level isolation. MIG works on Ampere architecture and later: A100, H100, H200, and Blackwell GPUs.

MIG Manager, deployed by the GPU Operator, applies the partitioning at the driver level. You then request MIG-sliced resources like nvidia.com/mig-1g.10gb from your application. Kubernetes schedules the pod on a node with that MIG profile, and the device plugin hands the pod a pre-partitioned MIG instance.

Single vs Mixed MIG Strategy

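The GPU Operator supports two MIG strategies: single and mixed. Use single when every GPU on a node has MIG enabled with the same profile; the device plugin then advertises the MIG devices under the plain nvidia.com/gpu resource name. Use mixed when a node carries several profiles, or a mix of MIG and non-MIG GPUs; each profile is then advertised as its own resource, such as nvidia.com/mig-1g.10gb.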
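Because the strategy changes the resource name a pod has to request, it can help to make the mapping explicit. The following is a minimal sketch, assuming the device plugin's naming convention (single exposes MIG devices as nvidia.com/gpu, mixed exposes nvidia.com/mig-<profile>); the function name is hypothetical and not part of any NVIDIA tooling.

```python
def mig_resource_name(strategy: str, profile: str) -> str:
    """Return the extended resource name a pod would request for a MIG profile.

    Assumption: 'single' advertises MIG devices as nvidia.com/gpu, while
    'mixed' advertises one resource per profile.
    """
    if strategy == "single":
        return "nvidia.com/gpu"
    if strategy == "mixed":
        return f"nvidia.com/mig-{profile}"
    raise ValueError(f"unknown MIG strategy: {strategy!r}")


print(mig_resource_name("mixed", "1g.10gb"))   # nvidia.com/mig-1g.10gb
print(mig_resource_name("single", "1g.10gb"))  # nvidia.com/gpu
```

Check the node's Capacity section before relying on either name, since that is what the scheduler actually matches against.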

Set the strategy via clusterpolicies.nvidia.com/cluster-policy:

# Configure the MIG strategy and the ConfigMap that MIG Manager reads profiles from
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  mig:
    strategy: single  # or: mixed
  migManager:         # sibling of mig, matching the /spec/migManager path patched later
    config:
      name: default-mig-parted-config
  devicePlugin:
    enabled: true
    config:
      name: ""
      default: ""
  gfd:
    enabled: true  # Enables GPU Feature Discovery for automatic node labeling

Single MIG applies one profile, like all-1g.10gb. Label nodes with the profile name:

$ kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.10gb --overwrite

MIG Manager applies the configuration, which terminates all GPU pods and requires a node reboot in some cases (especially on cloud providers). Monitor the MIG Manager pod:

$ kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager

Applying the selected MIG config to the node
time="2024-05-14T18:31:26Z" level=debug msg="Parsing config file..."
time="2024-05-14T18:31:26Z" level=debug msg="Selecting specific MIG config..."
time="2024-05-14T18:31:26Z" level=debug msg="Running apply-start hook"
time="2024-05-14T18:31:26Z" level=debug msg="Checking current MIG mode..."
time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)"
MIG configuration applied successfully
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-kmncw" deleted
node/node-name labeled
Changing the 'nvidia.com/mig.config.state' node label to 'success'

Mixed MIG applies multiple profiles. Label nodes with all-balanced or a custom profile:

$ kubectl label nodes <node-name> nvidia.com/mig.config=all-balanced --overwrite

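This creates a mix of MIG slices such as 1g.10gb, 2g.20gb, and 3g.40gb from each GPU. GPU Feature Discovery records each profile's count and attributes as node labels, and the device plugin advertises each profile as a separate extended resource: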

$ kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | jq .

{
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3",
  "nvidia.com/gpu.sharing-strategy": "none",
  "nvidia.com/mig-1g.10gb.count": "2",
  "nvidia.com/mig-1g.10gb.engines.copy": "1",
  "nvidia.com/mig-1g.10gb.engines.decoder": "1",
  "nvidia.com/mig-1g.10gb.engines.encoder": "0",
  "nvidia.com/mig-1g.10gb.engines.jpeg": "1",
  "nvidia.com/mig-1g.10gb.engines.ofa": "0",
  "nvidia.com/mig-1g.10gb.memory": "9984",
  "nvidia.com/mig-1g.10gb.multiprocessors": "16",
  "nvidia.com/mig-1g.10gb.product": "NVIDIA-H100-80GB-HBM3-MIG-1g.10gb",
  "nvidia.com/mig-2g.20gb.count": "1",
  "nvidia.com/mig-3g.40gb.count": "1",
  "nvidia.com/mig.config.state": "success",
  "nvidia.com/mig.strategy": "mixed",
  "nvidia.com/mps.capable": "false"
}
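To turn labels like these into an inventory, a short script can extract the per-profile counts. This is a minimal sketch: the label keys come from the output above, while the function name and the trimmed sample are illustrative.

```python
import json
import re

# Trimmed sample of the node labels shown above.
LABELS_JSON = """
{
  "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3",
  "nvidia.com/mig-1g.10gb.count": "2",
  "nvidia.com/mig-1g.10gb.memory": "9984",
  "nvidia.com/mig-2g.20gb.count": "1",
  "nvidia.com/mig-3g.40gb.count": "1",
  "nvidia.com/mig.strategy": "mixed"
}
"""

# Matches only the per-profile ".count" labels, e.g. nvidia.com/mig-1g.10gb.count
COUNT_LABEL = re.compile(r"^nvidia\.com/mig-(\d+g\.\d+gb)\.count$")


def mig_profile_counts(labels: dict) -> dict:
    """Map each MIG profile name to the number of instances on the node."""
    counts = {}
    for key, value in labels.items():
        match = COUNT_LABEL.match(key)
        if match:
            counts[match.group(1)] = int(value)
    return counts


print(mig_profile_counts(json.loads(LABELS_JSON)))
# {'1g.10gb': 2, '2g.20gb': 1, '3g.40gb': 1}
```

In practice the JSON would come from the kubectl command above rather than an inline string.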

Request MIG resources in your pod spec:

# Request one 1g.10gb MIG instance (this resource name assumes the mixed strategy)
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: inference
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
  nodeSelector:
    nvidia.com/mig.config: all-1g.10gb

MIG provides memory and fault isolation. Each MIG instance has its own CUDA context, memory space, and compute engines. A crash or kernel fault in one MIG instance doesn't affect others on the same GPU.

Custom MIG Configurations

The default MIG profiles come from the default-mig-parted-config ConfigMap. Create a custom ConfigMap for fine-grained control:

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      
      five-1g-one-2g:
        - devices: all 
          mig-enabled: true
          mig-devices:
            "1g.10gb": 5
            "2g.20gb": 1

Patch the ClusterPolicy to use your custom ConfigMap:

$ kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type='json' \
  -p='[{"op":"replace", "path":"/spec/migManager/config/name", "value":"custom-mig-config"}]'

Label nodes with your custom profile:

$ kubectl label nodes <node-name> nvidia.com/mig.config=five-1g-one-2g --overwrite

MIG Manager reads the custom ConfigMap and applies the geometry. The node labels reflect the new MIG devices:

$ kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | jq .

{
  "nvidia.com/gpu.count": "6",
  "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3",
  "nvidia.com/mig-1g.10gb.count": "5",
  "nvidia.com/mig-2g.20gb.count": "1",
  "nvidia.com/mig.capable": "true",
  "nvidia.com/mig.config.state": "success",
  "nvidia.com/mig.strategy": "mixed",
  "nvidia.com/mps.capable": "false"
}
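Before labeling a node with a custom geometry such as five-1g-one-2g, a quick budget check can catch obviously impossible layouts. The sketch below assumes a GPU with 7 compute slices and 80 GB of memory, and reads the slice and memory sizes straight from the profile name. It is a necessary check only: real MIG placement rules are stricter, and profile variants with suffixes are not handled. The function name and defaults are illustrative.

```python
import re

PROFILE = re.compile(r"(\d+)g\.(\d+)gb")


def geometry_fits(mig_devices: dict, max_compute_slices: int = 7,
                  gpu_memory_gb: int = 80) -> bool:
    """Check that the requested instances fit the GPU's slice and memory budget.

    mig_devices mirrors the 'mig-devices' map in the ConfigMap, for example
    {"1g.10gb": 5, "2g.20gb": 1}.
    """
    compute = 0
    memory = 0
    for profile, count in mig_devices.items():
        match = PROFILE.fullmatch(profile)
        if match is None:
            raise ValueError(f"unrecognized MIG profile: {profile!r}")
        compute += int(match.group(1)) * count
        memory += int(match.group(2)) * count
    return compute <= max_compute_slices and memory <= gpu_memory_gb


print(geometry_fits({"1g.10gb": 5, "2g.20gb": 1}))  # True: 7 slices, 70 GB
print(geometry_fits({"1g.10gb": 8}))                # False: 8 slices exceed 7
```

A layout that passes this check can still be rejected by MIG Manager, so the nvidia.com/mig.config.state label remains the authoritative result.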

Time-Slicing: Lightweight Multiplexing

Figure 3: Time-Slicing vs MPS Flow

Time-slicing oversubscribes GPUs by interleaving workloads in time. Multiple pods share the same GPU, and the GPU driver time-slices compute between their processes. There is no memory or fault isolation: every pod allocates from the same GPU memory, so one workload can exhaust memory or trigger a fault that affects the others.

Time-slicing works on all GPU generations, not just Ampere and later. It suits inference workloads that can tolerate latency jitter and do not need strict isolation.

Configuring Time-Slicing

Create a ConfigMap with time-slicing settings:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

The any key is an arbitrary name for this configuration; it applies to every GPU node once you set it as the device plugin's default. Add further keys, such as a100-40gb, to hold node-specific configurations.

Key settings:

renameByDefault=false: Keeps the resource name nvidia.com/gpu and appends a -SHARED suffix to the product label
failRequestsGreaterThanOne=false: Allows pods to request more than one replica, which does not grant more compute (not recommended)

Apply the ConfigMap to the GPU Operator namespace:

$ kubectl create -n gpu-operator -f time-slicing-config.yaml

configmap/time-slicing-config created

Patch the device plugin to use the ConfigMap:

$ kubectl patch clusterpolicies.nvidia.com/cluster-policy -n gpu-operator --type=merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

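To override the default on specific nodes, label them with the name of another key in the ConfigMap. This example assumes the ConfigMap also defines a tesla-t4 key: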

$ kubectl label node <node-name> nvidia.com/device-plugin.config=tesla-t4

Verifying Time-Slicing

Check the node labels after applying the config:

$ kubectl describe node <node-name>

...
Labels:
          nvidia.com/gpu.count=4
          nvidia.com/gpu.product=Tesla-T4-SHARED
          nvidia.com/gpu.replicas=4
Capacity:
  nvidia.com/gpu:        16
Allocatable:
  nvidia.com/gpu:        16
...

The nvidia.com/gpu.replicas label shows the oversubscription factor. The -SHARED suffix on the product label indicates time-slicing is active.
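The advertised capacity is simply the physical GPU count multiplied by the replica factor. A minimal sketch of that arithmetic, using the label values from the output above (the function name is illustrative):

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a time-sliced node advertises."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("need a non-negative GPU count and at least one replica")
    return physical_gpus * replicas


labels = {"nvidia.com/gpu.count": "4", "nvidia.com/gpu.replicas": "4"}
capacity = advertised_gpus(
    int(labels["nvidia.com/gpu.count"]),
    int(labels["nvidia.com/gpu.replicas"]),
)
print(capacity)  # 16
```

Keep in mind that only the schedulable count is multiplied. GPU memory is not divided between replicas, so sixteen pods still share the memory of four physical GPUs.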

Deploy a workload to validate:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: time-slicing-verification
  labels:
    app: time-slicing-verification
spec:
  replicas: 5
  selector:
    matchLabels:
      app: time-slicing-verification
  template:
    metadata:
      labels:
        app: time-slicing-verification
    spec:
      restartPolicy: Always
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
      - name: cuda-sample-vector-add
        image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
        command: ["/bin/bash", "-c", "--"]
        args:
          - while true; do /cuda-samples/vectorAdd; done
        resources:
          limits:
            nvidia.com/gpu: 1

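All five replicas run concurrently. With four replicas per physical GPU, the five pods necessarily span at least two GPUs, and pods that land on the same GPU time-share its compute. The vectorAdd sample completes successfully in each pod: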

$ kubectl logs deploy/time-slicing-verification

Found 5 pods, using pod/time-slicing-verification-7cdc7f87c5-s8qwk
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Time-Slicing Limitations

warning: DCGM-Exporter doesn't support container metrics when GPU time-slicing is enabled. You lose per-container GPU metrics, which complicates observability for production deployments.

When failRequestsGreaterThanOne=true, pods requesting more than one GPU replica fail with UnexpectedAdmissionError — the failure is explicit. When it's false (the default), requests succeed but the scheduler doesn't guarantee proportional compute time across replicas.

tip: Set failRequestsGreaterThanOne=true for production deployments to catch misconfigurations early. Request exactly one replica per pod.
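The admission behavior described above can be modeled in a few lines. This is a sketch of the decision only, not the device plugin's code; the function name and the "admitted" return value are illustrative.

```python
def admit_request(requested: int, fail_requests_greater_than_one: bool) -> str:
    """Model how a time-sliced GPU request is treated at admission."""
    if requested < 1:
        raise ValueError("a pod must request at least one replica")
    if requested > 1 and fail_requests_greater_than_one:
        # Explicit failure: the pod is rejected instead of silently
        # receiving no extra compute.
        return "UnexpectedAdmissionError"
    return "admitted"


print(admit_request(1, True))   # admitted
print(admit_request(2, True))   # UnexpectedAdmissionError
print(admit_request(2, False))  # admitted
```

The last case is the one to watch for: the pod runs, but it does not get twice the compute.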

MPS Sharing: Memory-Managed Oversubscription

Multi-Process Service (MPS) lets multiple processes share a GPU concurrently rather than taking turns. The MPS control daemon routes their CUDA work through a shared server and applies memory and compute limits per client. MPS doesn't partition the GPU at the hardware level; instead, the limits are enforced in software across the CUDA clients.

warning: Treat MPS and MIG as mutually exclusive on the same GPU. The device plugin's MPS sharing does not apply to MIG-partitioned GPUs, so if you enable MIG on a GPU, plan on not using MPS there. Choose based on your isolation requirements: use MIG for hardware isolation, MPS for memory-managed sharing on non-MIG GPUs.

Enabling MPS

Enable MPS via the device plugin ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 10

The MPS control daemon runs as its own pod on each MPS-enabled node in the GPU Operator namespace, not inside your workload pods. It applies memory and compute limits per client: a process that tries to allocate beyond its memory limit gets an out-of-memory error, and its kernels are confined to its share of the GPU's compute. This prevents one workload from exhausting GPU memory and starving others.

MPS works best for inference workloads with predictable memory usage. Training workloads with volatile memory spikes are a poor fit, because allocations fail once the per-client limit is hit. MPS and time-slicing are mutually exclusive on the same GPU; configure one or the other per node.
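To reason about whether a model fits under MPS, it helps to estimate each replica's share. The sketch below assumes the GPU's memory and compute are split equally across replicas, which is how the device plugin's MPS sharing is commonly described; treat the equal split, the function name, and the 81920 MiB figure as assumptions to verify against your own nodes.

```python
def mps_share(total_memory_mib: int, replicas: int) -> dict:
    """Estimate one replica's memory and compute share under an equal split."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return {
        "memory_mib": total_memory_mib // replicas,
        "active_thread_pct": 100 // replicas,
    }


# replicas: 10 comes from the ConfigMap above; 81920 MiB is a hypothetical 80 GiB GPU.
print(mps_share(81920, 10))
# {'memory_mib': 8192, 'active_thread_pct': 10}
```

If a model's weights plus working memory exceed the per-replica figure, lower the replica count rather than relying on headroom from idle neighbors.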

MPS vs MIG

MPS and MIG target different isolation levels. MIG partitions at the hardware level — dedicated memory, compute engines, and fault domains per instance. MPS manages memory allocation in software across CUDA contexts sharing the same GPU.

The two are mutually exclusive on a given GPU. MIG partitions the hardware; MPS needs full GPU access to manage memory across processes. In a cluster, you can run MIG on some nodes and MPS on others, but never both on the same GPU.

Gotchas and Lessons Learned

MIG Requires Node Reboot

Changing MIG geometry requires a node reboot on some platforms, especially cloud providers. MIG Manager sets nvidia.com/mig.config.state: rebooting when a reboot is required.

Plan MIG configuration changes during maintenance windows. Cordon nodes before applying MIG changes to avoid scheduling surprises.
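A rollout script can key off the nvidia.com/mig.config.state label to decide when a cordoned node is safe to return to service. This is a minimal sketch: the success and rebooting values come from this article, any other value is treated conservatively, and the function name and return strings are illustrative.

```python
STATE_LABEL = "nvidia.com/mig.config.state"


def next_action(labels: dict) -> str:
    """Decide what to do with a cordoned node during a MIG geometry change."""
    state = labels.get(STATE_LABEL, "")
    if state == "success":
        return "uncordon"
    if state == "rebooting":
        return "wait-for-reboot"
    # Any other or missing state: leave the node cordoned and check
    # the MIG Manager logs before scheduling work onto it.
    return "keep-cordoned"


print(next_action({STATE_LABEL: "success"}))    # uncordon
print(next_action({STATE_LABEL: "rebooting"}))  # wait-for-reboot
print(next_action({}))                          # keep-cordoned
```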

Time-Slicing Request Behavior

When failRequestsGreaterThanOne=true, pods requesting more than one replica fail with UnexpectedAdmissionError. When false (the default), such requests succeed silently, but a pod holding several replicas does not receive a proportionally larger share of compute time, and there are no fairness guarantees between pods on the same GPU.

DRA Requires Kubernetes 1.34+

The NVIDIA DRA Driver for GPUs requires Kubernetes 1.34.2 or newer. Earlier versions don't support the DeviceClass resources that DRA uses for GPU and MIG allocation. After installing the NVIDIA DRA driver, verify the device classes exist:

$ kubectl get deviceclass

NAME               AGE
gpu.nvidia.com     5m
mig.nvidia.com     5m
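A pre-flight check can compare the cluster's reported version against that minimum before you install the driver. The sketch below parses a version string of the form kubectl reports (for example v1.34.2) and compares it with 1.34.2; the function name is illustrative, and vendor suffixes after the patch number are ignored.

```python
import re

MIN_VERSION = (1, 34, 2)  # minimum stated above for the NVIDIA DRA driver


def supports_nvidia_dra(git_version: str) -> bool:
    """Return True if a Kubernetes version string meets the minimum."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"cannot parse version: {git_version!r}")
    # Tuples compare element by element, so (1, 35, 0) >= (1, 34, 2).
    return tuple(int(part) for part in match.groups()) >= MIN_VERSION


print(supports_nvidia_dra("v1.34.2"))  # True
print(supports_nvidia_dra("v1.33.5"))  # False
print(supports_nvidia_dra("v1.35.0"))  # True
```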

Resource Request Confusion

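When requesting MIG resources under the mixed strategy, you request nvidia.com/mig-1g.10gb, not nvidia.com/gpu. In that mode the device plugin advertises each profile under its own resource name rather than as nvidia.com/gpu. Under the single strategy the reverse holds: MIG devices appear as nvidia.com/gpu.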

$ kubectl describe node <node-name>

...
Capacity:
  nvidia.com/mig-1g.10gb: 5
  nvidia.com/mig-2g.20gb: 1
...

Match your pod spec to what the node's Capacity lists. On a mixed-strategy node like this one, that means the profile-specific resource, not the base nvidia.com/gpu resource.

Choosing the Right Strategy

Figure 4: GPU Operator Configuration Stack

	MIG	Time-Slicing	MPS
Isolation	Hardware (dedicated memory, compute, I/O)	None (shared GPU context)	Software (memory limits enforced)
GPU generations	Ampere+ (A100, H100, H200, Blackwell)	All	All
Memory isolation	Yes — per-instance memory space	No — shared memory	Enforced limits per process
Reboot required	Sometimes (geometry changes)	No	No
Observability	Full DCGM metrics per instance	No per-container metrics	Limited
Best for	Multi-tenant, strict SLA	Throughput, inference batching	Latency-sensitive, predictable memory

Wrap-up

Start with time-slicing for inference workloads where throughput matters more than isolation — it works on any GPU and requires zero reboots. Move to MIG when you need hard isolation between tenants or SLA-bound workloads on Ampere+ hardware. Use MPS when you need memory-managed sharing on non-MIG GPUs with predictable allocation patterns.
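The guidance above can be condensed into a small decision helper. This is a sketch of the article's recommendations only: the parameter names are illustrative, and the whole-gpu fallback for hard isolation on pre-Ampere hardware is an added assumption, following from MIG being the only mechanism here with hardware isolation.

```python
def choose_strategy(needs_hard_isolation: bool, ampere_or_newer: bool,
                    predictable_memory: bool) -> str:
    """Pick a GPU sharing strategy from three coarse workload traits."""
    if needs_hard_isolation:
        # Only MIG offers hardware isolation; without MIG-capable hardware,
        # assume the workload gets a dedicated GPU instead of sharing.
        return "mig" if ampere_or_newer else "whole-gpu"
    if predictable_memory:
        return "mps"
    return "time-slicing"


print(choose_strategy(True, True, False))    # mig
print(choose_strategy(False, False, True))   # mps
print(choose_strategy(False, False, False))  # time-slicing
```

Real clusters usually mix strategies per node pool, so treat the result as a starting point per workload class rather than a cluster-wide setting.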

Configure each mechanism through the NVIDIA GPU Operator, validate with kubectl describe node, and request the appropriate resources in your pod specs. Benchmark with your actual workloads before production — GPU utilization improvements of 2-4x are typical when moving from whole-GPU allocation to shared strategies.
