KubeDojo

Mastering the Kubernetes ecosystem — depth-first, no hype.

Posts by KubeDojo

Introduction to KEDA and Event-Driven Autoscaling

How KEDA extends Kubernetes HPA with 65+ scalers, scale-to-zero, and a two-phase architecture for event-driven pod autoscaling.

by KubeDojo

keda kafka sqs

Message Queue Scaling with KEDA — Kafka and SQS

How KEDA's Kafka and SQS scalers calculate lag and queue depth, with TriggerAuthentication patterns and production edge cases.

by KubeDojo

keda prometheus custom-metrics

Custom Metrics and Prometheus-Based Scaling with KEDA

Using KEDA's Prometheus scaler to drive autoscaling from any PromQL query — replacing Prometheus Adapter with a simpler, more flexible approach.

by KubeDojo

keda http autoscaling

HTTP-Based Autoscaling with the KEDA HTTP Add-on

How the KEDA HTTP Add-on intercepts traffic to scale HTTP workloads to zero, and when the Prometheus scaler is better.

by KubeDojo

keda scaledjob batch-processing

Batch Processing with KEDA ScaledJobs

ScaledJob creates one Kubernetes Job per event, scales the number of Jobs with the event backlog, and lets long-running batch workloads run to completion instead of being interrupted by a scale-down.

by KubeDojo

keda karpenter autoscaling

KEDA and Karpenter Together — Pod and Node Scaling Synergy

Combining KEDA's event-driven pod scaling with Karpenter's just-in-time node provisioning for a fully reactive, cost-efficient Kubernetes autoscaling stack.

by KubeDojo

keda observability prometheus

Observability and Troubleshooting for KEDA

Scraping KEDA operator metrics, building Grafana dashboards for scaling events, and diagnosing common ScaledObject issues in production.

by KubeDojo

kubernetes kubedojo

Welcome to KubeDojo

What KubeDojo is and what you'll find here: deep dives into real code, honest explorations of the Kubernetes ecosystem, and structured learning paths to master every certification.

by KubeDojo

certification kubernetes ckad

The Kubernetes Certification Landscape

A practical map of the five CNCF Kubernetes certifications — what each one covers, how exams work, and which path fits your career.

by KubeDojo

agentic-ai kubernetes keda

Agentic AI Workloads on Kubernetes

How kagent, Agent Sandbox, KEDA, and OPA/Kyverno form the production stack for agentic AI on Kubernetes.

by KubeDojo

gpu nvidia mig

GPU Sharing Strategies for Multi-Tenant Kubernetes: MIG, Time-Slicing, and MPS

NVIDIA's GPU sharing mechanisms — MIG, time-slicing, and MPS — are gaining traction as teams run multiple inference workloads per GPU.

by KubeDojo

nvidia gpu kubernetes

NVIDIA AI Cluster Runtime: Validated GPU Kubernetes Recipes

NVIDIA released AI Cluster Runtime, an open-source project providing validated, version-locked Kubernetes configurations for GPU infrastructure.

by KubeDojo

kueue scheduling batch

Kueue: The Community Standard for Kubernetes AI Batch Scheduling

Kueue manages GPU quotas, enforces fair sharing across teams, and dispatches jobs to remote HPC clusters — the standard for production AI batch scheduling.

by KubeDojo

kubernetes scheduling workload-api

Workload-Aware Scheduling in Kubernetes 1.36: The Decoupled PodGroup Model

Kubernetes 1.36 decouples scheduling policy from runtime instances with Workload API v1alpha2, standalone PodGroups, and a dedicated group scheduling cycle.

by KubeDojo

cncf kubernetes ai-conformance

CNCF Certified Kubernetes AI Conformance Program

CNCF launched v1.0 of the Kubernetes AI Conformance Program defining baseline capabilities for running AI workloads across conformant clusters.

by KubeDojo

vllm kserve llm

Production LLM Serving on Kubernetes: vLLM + KServe Stack

Deploy vLLM with KServe on Kubernetes: InferenceService CRD, KEDA autoscaling on queue depth, and distributed KV cache with LMCache for production inference.

by KubeDojo

nvidia kai-scheduler gpu

NVIDIA KAI Scheduler: Open-Source GPU-Aware Kubernetes Scheduling

NVIDIA open-sourced KAI Scheduler (Apache 2.0), a Kubernetes-native GPU scheduling solution originally from the Run:ai platform.

by KubeDojo

llm-d cncf sandbox

llm-d Joins CNCF Sandbox: Kubernetes-Native Distributed LLM Inference

llm-d was accepted as a CNCF Sandbox project, providing Kubernetes-native distributed inference with KV-cache-aware routing, prefill/decode disaggregation, and accelerator-agnostic serving.

by KubeDojo

kubecon cncf ai-infrastructure

KubeCon Europe 2026 Amsterdam: AI Infrastructure, Agentic Systems, and Platform Engineering

Curated KubeCon EU 2026 picks: agentic AI on Kubernetes, sovereign inference, GPU scheduling with Kueue, and platform engineering under pressure.

by KubeDojo

nvidia dynamo inference

NVIDIA Dynamo 1.0: The Inference Operating System for AI Factories

Production deployment patterns for NVIDIA Dynamo 1.0 on EKS and GKE — disaggregated serving, KV-aware routing, and gotchas from real deployments.

by KubeDojo

hami gpu virtualization

HAMi: GPU Virtualization as the Reference Pattern for AI Infrastructure

HAMi (CNCF Sandbox) emerged at KubeCon EU 2026 as the reference implementation for GPU resource management across NVIDIA, AMD, Huawei, and Cambricon accelerators.

by KubeDojo

gateway-api inference-extension ai-gateway

AI Gateway Working Group and Gateway API Inference Extension GA

Gateway API Inference Extension GA standardizes model-aware routing for self-hosted LLMs with KV-cache-aware scheduling and LoRA adapter affinity.

by KubeDojo

dra kubernetes gpu

Dynamic Resource Allocation (DRA) GA: The New GPU Interface for Kubernetes

DRA went GA in Kubernetes v1.34 and continues evolving — replacing Device Plugins with richer semantics including DeviceClass, ResourceClaim, CEL-based filtering, and topology awareness.

by KubeDojo

armada multi-cluster scheduling

Armada: Multi-Cluster GPU Scheduling as a Single Resource Pool

Armada treats multiple Kubernetes clusters as a single resource pool for GPU-intensive AI workloads, with global queue management, gang scheduling, and production-scale throughput.

by KubeDojo

ingress-nginx gateway-api migration

Ingress2Gateway 1.0: Migrating 50% of Clusters from ingress-nginx to Gateway API

Ingress2Gateway 1.0 provides stable migration from ingress-nginx to Gateway API with 30+ supported annotations and integration-tested translation.

by KubeDojo

ai-security prompt-injection cloudflare

AI Security for LLM-Powered Applications: Prompt Injection Defense

Cloudflare made AI Security for Apps GA with prompt injection protection, while RFC 9457 structured error responses cut AI agent token costs by 98%.

by KubeDojo

docker ai-agents sandboxes

Docker Agent: Multi-Agent AI Teams in Sandboxed Environments

Build teams of specialized AI agents with Docker Agent: declarative YAML config, multi-agent orchestration, and secure execution in sandboxed microVMs with Docker Desktop 4.63+.

by KubeDojo

karpenter kubernetes autoscaling

What Is Karpenter and Why It Replaced Cluster Autoscaler

How Karpenter's groupless, pod-driven provisioning model solves the scaling limitations that plagued Kubernetes Cluster Autoscaler for years.

by KubeDojo

karpenter eks production

From Zero to Production with Karpenter

A hands-on walkthrough of installing Karpenter on EKS, configuring NodePools and EC2NodeClasses, and hardening the setup for production workloads.

by KubeDojo

karpenter kubernetes disruption

Disruption, Drift, and Consolidation — Karpenter's Node Lifecycle

How Karpenter automatically replaces drifted nodes, consolidates underutilized capacity, and respects disruption budgets to keep clusters lean and current.

by KubeDojo

karpenter gpu ai-ml

GPU and AI/ML Workload Scaling with Karpenter

Dedicated GPU NodePools, cold start fixes for 10GB+ AI images, disruption protection for training jobs, and gang scheduling for distributed workloads.

by KubeDojo

karpenter cost-optimization spot-instances

Production Cost Optimization Patterns with Karpenter

Spot-to-Spot consolidation, instance diversification, right-sizing pod requests, and the real-world strategies that cut Kubernetes compute costs by 20–40%.

by KubeDojo

karpenter observability prometheus

Observability and Troubleshooting for Karpenter

Setting up Prometheus metrics scraping, building Grafana dashboards for node lifecycle events, and diagnosing common Karpenter issues in production.

by KubeDojo