KubeDojo

AI Gateway Working Group and Gateway API Inference Extension GA

by KubeDojo · 15 min read

Introduction

Self-hosting LLMs on Kubernetes hit a wall fast. Your traditional load balancer sees only HTTP paths and headers — it doesn't understand that a request is for llama-3.1-8b-instruct versus qwen3-32b, whether a pod has the right LoRA adapter loaded, or which replica has the warmest KV-cache for your prompt. The result? Cache evictions on every request, cold adapters cycling in and out, and tail latency that spikes as load increases.

The Gateway API Inference Extension graduated to GA in March 2026, backed by the newly formed AI Gateway Working Group. This isn't a new product category — it's a standards-based extension pattern that upgrades ext-proc-capable gateways (Istio, Envoy Gateway, kgateway, NGINX, Agentgateway) with model-aware routing, KV-cache-aware scheduling, and LoRA adapter affinity.

What is an AI Gateway?

An AI Gateway in the Kubernetes context is network gateway infrastructure — proxies, load balancers, ingress controllers — that implements the Gateway API specification with enhanced capabilities for AI workloads. The AI Gateway Working Group charter is explicit: this isn't about defining a new product category, but standardizing the capabilities that matter for inference traffic.

Core capabilities:

  • Token-based rate limiting — Enforce quotas on AI API calls, not just HTTP requests
  • Payload inspection — Read the request body to extract model names, enable semantic routing, detect prompt injection attempts
  • Model-aware routing — Route based on the model field in the JSON body, not URL paths
  • KV-cache-aware scheduling — Track prefix cache status across replicas, route to the pod with the warmest cache
  • LoRA adapter affinity — Route fine-tuned adapter requests to pods with the right adapters loaded

The extension achieves this through Envoy's External Processing (ext-proc) filter. Any gateway supporting ext-proc and Gateway API can become an inference gateway by coupling with an Endpoint Picker (EPP) — an ext-proc server that tracks metrics from model servers and makes intelligent routing decisions.

Architecture at a glance:

┌─────────────┐     ┌─────────────────────┐     ┌──────────────────┐
│   Client    │ ──→ │  Inference Gateway  │ ──→ │  Endpoint Picker │
│  (curl,     │     │ (Istio/Envoy/       │     │  (EPP ext-proc   │
│   app)      │     │  Agentgateway)      │     │   server)        │
└─────────────┘     └─────────────────────┘     └──────────────────┘
                            │                          │
                            │                          │ queries metrics
                            │                          ▼
                            │              ┌──────────────────────────┐
                            │              │  Model Server Pods       │
                            │              │  - vLLM replica 1        │
                            └─────────────→│  - vLLM replica 2        │
                               routes to   │  - vLLM replica 3        │
                                           └──────────────────────────┘

The request flow: (1) Gateway receives HTTP request, matches to HTTPRoute pointing to InferencePool; (2) Gateway forwards request metadata to EPP via ext-proc; (3) EPP queries async metrics from model servers (KV-cache utilization, queue depth, active LoRAs); (4) EPP selects optimal endpoint; (5) Gateway routes to selected pod.

Traditional load balancing stops at step 1 — it's blind to everything that matters for inference performance.
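The EPP's decision in step 4 can be pictured as a scoring function over per-replica metrics. Here is a minimal sketch, assuming hypothetical metric fields and weights (the real EPP uses a pluggable scheduling framework, not these exact numbers):

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    kv_cache_util: float   # fraction of KV-cache in use (0.0-1.0)
    queue_depth: int       # requests waiting at the model server
    prefix_hit: bool       # replica holds a warm prefix cache for this prompt

def score(r: Replica) -> float:
    """Higher is better: reward warm prefix caches, penalize load."""
    s = 0.0
    if r.prefix_hit:
        s += 10.0               # warm KV-cache avoids recomputing the prefix
    s -= 5.0 * r.kv_cache_util  # a nearly full cache risks evictions
    s -= 1.0 * r.queue_depth    # queued requests add latency
    return s

def pick(replicas: list[Replica]) -> Replica:
    return max(replicas, key=score)

replicas = [
    Replica("vllm-1", kv_cache_util=0.9, queue_depth=4, prefix_hit=False),
    Replica("vllm-2", kv_cache_util=0.4, queue_depth=1, prefix_hit=True),
    Replica("vllm-3", kv_cache_util=0.2, queue_depth=0, prefix_hit=False),
]
print(pick(replicas).name)  # vllm-2: the warm cache outweighs its moderate load
```

A round-robin balancer would send a third of the traffic to vllm-1, evicting cache entries it will immediately need again; the scoring approach routes around that.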

The Inference Extension API

The extension introduces two inference-focused API resources: InferencePool (GA) and InferenceObjective (alpha). These replace Kubernetes Service for AI workloads, providing an abstraction tailored to model serving.

InferencePool

An InferencePool defines a group of pods dedicated to serving AI models. All pods in a pool share the same compute configuration, accelerator type, base model, and model server. The pool references an EPP service that monitors metrics and makes routing decisions.

note: API versions InferencePool is GA (inference.networking.k8s.io/v1) — stable for production. InferenceObjective remains alpha (inference.networking.x-k8s.io/v1alpha2) — use with caution.

InferencePool spec:

# github.com/kubernetes-sigs/gateway-api-inference-extension/config/manifests/inferencepool.yaml (line 1-15)
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-qwen3-32b
spec:
  targetPorts:
    - number: 8000
  selector:
    app: vllm-qwen3-32b
  extensionRef:
    name: vllm-qwen3-32b-epp
    port: 9002
    failureMode: FailOpen

The spec has three critical parts:

  1. selector — Matches labels on model server pods. Only pods with app: vllm-qwen3-32b join this pool.
  2. targetPorts — The port the gateway forwards traffic to on selected pods (8000 for vLLM's HTTP endpoint).
  3. extensionRef — References the EPP service. failureMode: FailOpen means the gateway falls back to standard load balancing if the EPP is unresponsive, silently losing intelligent routing; FailClose rejects requests instead. Either way, plan for EPP redundancy.

Pods joining an InferencePool must implement the model server protocol defined by the project. This ensures the EPP receives standardized metrics on KV-cache status, queue length, and LoRA availability.
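Concretely, vLLM already exposes this data as Prometheus metrics on its HTTP port. The snippet below shows representative metric families the EPP scrapes (sample values are illustrative):

```
# e.g. curl http://<pod-ip>:8000/metrics
vllm:num_requests_running{model_name="Qwen/Qwen3-32B"} 2.0
vllm:num_requests_waiting{model_name="Qwen/Qwen3-32B"} 0.0
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen3-32B"} 0.42
vllm:lora_requests_info{max_lora="2",running_lora_adapters="food-review-1",waiting_lora_adapters=""} 1.0
```

Queue depth, KV-cache utilization, and the set of loaded LoRA adapters are exactly the inputs the EPP needs for its routing decisions.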

InferenceObjective

InferenceObjective (alpha since v1.0.0) defines serving objectives for specific requests. Currently it supports only priority, but will expand to SLO targeting and other constraints.

InferenceObjective examples:

# github.com/kubernetes-sigs/gateway-api-inference-extension/config/manifests/inferenceobjective.yaml (line 1-20)
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: small-segment-lora
spec:
  priority: 1
  poolRef:
    group: inference.networking.k8s.io
    kind: InferencePool
    name: vllm-qwen3-32b
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: base-model
spec:
  priority: 2
  poolRef:
    group: inference.networking.k8s.io
    kind: InferencePool
    name: vllm-qwen3-32b

Clients associate requests with an InferenceObjective by setting the x-gateway-inference-objective header to the objective's metadata name. Priority 1 gets preferential treatment over priority 2 — useful for separating latency-sensitive chat workloads from batch summarization tasks.
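On the wire, that is an ordinary completion request plus one extra header. A raw HTTP sketch (body fields illustrative):

```
POST /v1/completions HTTP/1.1
Host: <gateway-ip>
Content-Type: application/json
x-gateway-inference-objective: small-segment-lora

{"model": "food-review-1", "prompt": "Rate this restaurant:", "max_tokens": 50}
```

The gateway passes the header to the EPP, which uses the referenced objective's priority when scheduling the request.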

warning: Alpha API InferenceObjective is alpha (v1alpha2) — breaking changes are possible in future releases. Use for production workloads with caution.

Deploying the Inference Gateway

The getting started guide supports four gateway implementations: GKE Inference Gateway (managed), Istio 1.28+, Agentgateway 1.0+, and NGINX Gateway Fabric. The deployment flow is similar across all: install CRDs, deploy the gateway, configure InferencePool via Helm.

Prerequisites

  • Kubernetes 1.29+ (sidecar containers enabled by default)
  • Gateway API CRDs installed
  • Helm 3.x
  • LoadBalancer support (or MetalLB for Kind)

Step 1: Deploy a model server

For test/dev, use the vLLM simulator — no GPU required:

# github.com/kubernetes-sigs/gateway-api-inference-extension/config/manifests/vllm/sim-deployment.yaml (line 1-30, modified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-32b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-qwen3-32b
  template:
    metadata:
      labels:
        app: vllm-qwen3-32b
        inference.networking.k8s.io/engine-type: vllm
    spec:
      containers:
      - name: vllm-sim
        image: ghcr.io/llm-d/llm-d-inference-sim:v0.7.1
        args:
        - --model
        - Qwen/Qwen3-32B
        - --port
        - "8000"
        - --max-loras
        - "2"
        - --lora-modules
        - '{"name": "food-review-1"}'
        ports:
        - containerPort: 8000
          name: http

note: Modified example This example uses Qwen3-32B instead of the default Llama-3.1-8B to match the InferencePool examples. The official manifest uses meta-llama/Llama-3.1-8B-Instruct.

The simulator mimics vLLM's metrics and responses. For production, replace with GPU-backed vLLM deployment (requires 3 GPUs for the sample config).

warning: CPU deployments unreliable The vLLM CPU deployment option is marked unreliable — pods crash/restart under resource constraints. Tests showed 64GB+ RAM and 48+ CPUs needed for acceptable response times. GPU-backed deployments are the production path.

Step 2: Install CRDs and Gateway

# Get latest release version (github.com/kubernetes-sigs/gateway-api-inference-extension/releases)
IGW_LATEST_RELEASE=$(curl -s https://api.github.com/repos/kubernetes-sigs/gateway-api-inference-extension/releases \
  | jq -r '.[] | select(.prerelease == false) | .tag_name' \
  | sort -V \
  | tail -n1)

# Install CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${IGW_LATEST_RELEASE}/manifests.yaml

# Deploy Istio inference gateway
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/${IGW_LATEST_RELEASE}/config/manifests/gateway/istio/gateway.yaml

For Istio, install with the inference extension flag:

# istioctl install with inference extension (gateway-api-inference-extension/docs/dev.md)
ISTIO_VERSION=1.28.0
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=${ISTIO_VERSION} sh -
./istio-$ISTIO_VERSION/bin/istioctl install \
  --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true

Step 3: Deploy InferencePool via Helm

# Deploy InferencePool with Helm (gateway-api-inference-extension.sigs.k8s.io/guides/)
export INFERENCE_POOL_NAME=vllm-qwen3-32b
export GATEWAY_PROVIDER=istio

helm install ${INFERENCE_POOL_NAME} \
  --dependency-update \
  --set inferencePool.modelServers.matchLabels.app=${INFERENCE_POOL_NAME} \
  --set provider.name=$GATEWAY_PROVIDER \
  --set experimentalHttpRoute.enabled=true \
  --version $IGW_LATEST_RELEASE \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

The Helm chart installs the InferencePool, EPP deployment, and provider-specific resources (HTTPRoute for Istio).

Step 4: Verify deployment

# Check HTTPRoute status (gateway-api-inference-extension.sigs.k8s.io/guides/)
kubectl get httproute ${INFERENCE_POOL_NAME} -o yaml
# Expected: Accepted=True, ResolvedRefs=True

# Check InferencePool status
kubectl get inferencepool ${INFERENCE_POOL_NAME} -o yaml
# Expected: Accepted=True, ResolvedRefs=True

# Test inference endpoint
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
curl -i ${IP}:80/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }'
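If routing succeeds, the simulator answers with an OpenAI-compatible completion. The response body is shaped roughly like this (values illustrative):

```
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "model": "Qwen/Qwen3-32B",
  "choices": [
    {"index": 0, "text": "San Francisco is a city of contrasts", "finish_reason": "length"}
  ],
  "usage": {"prompt_tokens": 10, "completion_tokens": 100, "total_tokens": 110}
}
```

Any OpenAI-compatible client SDK pointed at the gateway IP should work unchanged.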

When Inference Extension Makes Sense

The Inference Extension isn't free — it adds an EPP sidecar, requires ext-proc support, and introduces another failure domain. Use it when:

  • You're running 3+ model server replicas (single-pod deployments don't benefit from intelligent routing)
  • You're serving multiple models or LoRA adapters from the same pool
  • Tail latency matters more than absolute simplicity

Skip it if you're running a single model with a single replica. Standard Kubernetes Service with session affinity is simpler and sufficient.

Production Patterns

The Inference Extension enables several production patterns that traditional load balancers can't support.

Model rollouts with traffic splitting

Route different model versions by name for A/B testing or canary deployments. The Body Based Router (BBR) extension extracts the model field from the request body and sets an internal header that the gateway can match:

# HTTPRoute with model-based traffic splitting
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-router
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - headers:
      - name: x-gateway-inference-model
        value: llama-3.1-8b-instruct-v1
    backendRefs:
    - name: vllm-pool-v1
      weight: 90
    - name: vllm-pool-v2
      weight: 10

Gradually shift traffic from v1 to v2 by adjusting weights. The BBR extracts model from the JSON body and sets x-gateway-inference-model for matching.
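Mechanically, weighted backendRefs amount to a weighted random pick per request. A toy sketch of the 90/10 split above (the actual selection algorithm is implementation-specific to each gateway):

```python
import random

def pick_backend(rng: random.Random, backends: list[tuple[str, int]]) -> str:
    """Weighted random choice, mirroring HTTPRoute backendRef weights."""
    names = [name for name, _ in backends]
    weights = [weight for _, weight in backends]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for reproducibility
backends = [("vllm-pool-v1", 90), ("vllm-pool-v2", 10)]
counts = {"vllm-pool-v1": 0, "vllm-pool-v2": 0}
for _ in range(10_000):
    counts[pick_backend(rng, backends)] += 1
print(counts)  # roughly 9000 / 1000
```

Shifting weights to 50/50 and then 0/100 completes the canary rollout without touching clients.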

LoRA adapter routing

Serve multiple fine-tuned adapters from the same base model pool. Configure vLLM with multiple LoRA modules and a sidecar for dynamic adapter management:

# vLLM deployment with LoRA support (simplified)
containers:
- name: vllm
  args:
  - --enable-lora
  - --max-loras
  - "2"
  - --lora-modules
  - '{"name": "food-review", "path": "/adapters/food-review"}'
  - '{"name": "code-assist", "path": "/adapters/code-assist"}'

note: Simplified example Production deployments use a lora-adapter-syncer sidecar and ConfigMap for dynamic adapter management without pod restarts.

The EPP tracks which adapters are loaded on each pod. Requests specifying model: food-review route to pods with that adapter active, avoiding cold-start latency.
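The affinity logic reduces to filtering candidate pods on their loaded adapter set before load balancing. A minimal sketch with invented pod state (the real EPP derives this from the model server's LoRA metrics):

```python
def route(model: str, pods: dict[str, set[str]], base_models: set[str]) -> list[str]:
    """Return candidate pods for a request.

    pods maps pod name -> adapters currently loaded. Base-model requests
    can go anywhere; adapter requests prefer pods with the adapter warm,
    falling back to all pods (a cold adapter load) if none has it.
    """
    if model in base_models:
        return list(pods)
    warm = [name for name, adapters in pods.items() if model in adapters]
    return warm if warm else list(pods)

pods = {
    "vllm-1": {"food-review"},
    "vllm-2": {"code-assist"},
    "vllm-3": set(),
}
print(route("food-review", pods, {"Qwen/Qwen3-32B"}))    # ['vllm-1']
print(route("Qwen/Qwen3-32B", pods, {"Qwen/Qwen3-32B"}))  # all three pods
```

The fallback branch is what costs you tail latency: routing an adapter request to a pod that must first load the adapter is exactly the cold-start the EPP works to avoid.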

Priority-based scheduling

Use InferenceObjective to separate latency-sensitive workloads from batch tasks:

# Priority 1 for interactive chat
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: interactive-chat
spec:
  priority: 1
  poolRef:
    group: inference.networking.k8s.io
    kind: InferencePool
    name: vllm-pool
---
# Priority 2 for batch summarization
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: batch-summarization
spec:
  priority: 2
  poolRef:
    group: inference.networking.k8s.io
    kind: InferencePool
    name: vllm-pool

Clients set the header: x-gateway-inference-objective: interactive-chat. The EPP prioritizes these requests during congestion.
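Under congestion, this behaves like a priority queue keyed on the objective's priority value, with lower numbers served first. A toy sketch (invented request shape; the EPP's real flow control is more involved):

```python
import heapq
import itertools

counter = itertools.count()  # tie-breaker keeps FIFO order within a priority
queue: list[tuple[int, int, str]] = []

def enqueue(priority: int, request_id: str) -> None:
    heapq.heappush(queue, (priority, next(counter), request_id))

def dequeue() -> str:
    return heapq.heappop(queue)[2]

enqueue(2, "batch-summarize-001")  # batch-summarization objective: priority 2
enqueue(1, "chat-turn-007")        # interactive-chat objective: priority 1
enqueue(2, "batch-summarize-002")

print(dequeue())  # chat-turn-007 jumps ahead of the batch work
```

When the pool has spare capacity, priority barely matters; it is during queue buildup that interactive requests overtake batch traffic.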

Integration with higher-level AI Gateways

The Inference Extension operates at the Kubernetes networking layer. Integrate with higher-level AI gateways like LiteLLM, Gloo AI Gateway, or Apigee for:

  • Multi-cloud AI service aggregation (self-hosted + OpenAI + Vertex AI + Bedrock)
  • Unified authentication and token management
  • Cost tracking and analytics across providers
  • Fallback and failover policies

The extension exposes your self-hosted models as OpenAI-compatible endpoints that these gateways can consume.

Gotchas

EPP failure mode changes behavior silently — failureMode: FailOpen means requests bypass the EPP and fall back to standard load balancing if the EPP is unresponsive, so you lose intelligent routing without any errors. Run EPP with multiple replicas and configure health checks. Consider FailClose if you prefer requests to fail fast rather than degrade quietly.

CPU deployments crash under load — The vLLM CPU deployment option is unreliable. Testing showed pods need 64GB+ RAM and 48+ CPUs for acceptable performance. GPU-backed deployments are the only production-viable option.

InferenceObjective is alpha — The priority API is alpha (v1alpha2). Breaking changes are possible. Use for internal workloads, but wait for GA before building critical SLAs around it.

Gateway implementation variance — GKE, Istio, Agentgateway, and NGINX have different setup paths and feature support. GKE is managed (lowest friction), Istio requires the ENABLE_GATEWAY_API_INFERENCE_EXTENSION flag, Agentgateway needs the inferenceExtension.enabled Helm value. Test your chosen implementation thoroughly.

Model server protocol required — Pods must implement the model server protocol for EPP to route intelligently. vLLM and Triton have integrations; custom model servers need to expose metrics in the expected format.

Wrap-up

The Gateway API Inference Extension GA gives you model-aware routing without vendor lock-in. InferencePool replaces Service for AI workloads, coupling with an EPP that tracks KV-cache state, queue depth, and LoRA availability. InferenceObjective adds priority scheduling for mixed workloads — though it's still alpha.

The real win: your gateway now understands what your workloads are doing. KV-cache-aware routing means fewer cache evictions. LoRA affinity means adapters stay warm. Both mean lower latency and better GPU utilization.

Try it: deploy the vLLM simulator on a Kind cluster, configure Istio with the inference extension, and watch the EPP metrics as you send traffic. The difference between round-robin and cache-aware routing becomes obvious within minutes.
