Kubernetes 1.36 on AWS EKS: DRA, User Namespaces, and What to Fix Now


Amazon EKS added Kubernetes 1.36 “Haru” support this month. The release favors graduating half-finished features to production-ready status over introducing new APIs, and three of those graduations directly affect teams running AI workloads: native GPU scheduling via Dynamic Resource Allocation, rootless containers without add-ons, and vertical pod resizing without a restart. Here is what to act on before your next cluster upgrade.

Dynamic Resource Allocation Is the GPU Scheduling Standard Now

The old way to schedule GPUs in Kubernetes was to request nvidia.com/gpu: 1 in a pod spec. That tells the scheduler exactly one thing: “I want a GPU.” It says nothing about VRAM, NVLink topology, or MIG partition configuration — the attributes that determine whether a large language model will actually fit and run efficiently on that device. Device Plugins counted integers. That was always the problem.

Dynamic Resource Allocation (DRA) replaces that model with attribute-based claims. Workloads declare what hardware properties they need; DRA’s control plane locates hardware that satisfies those claims. You can now write a ResourceClaim that requests any GPU with at least 40 GB VRAM and let Kubernetes find it across the cluster, regardless of which node it sits on.

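# Sketch against the GA API group (resource.k8s.io/v1). The claim name, the gpu.nvidia.com
# class, and the "memory" capacity key are assumptions; check what your DRA driver publishes.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: large-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # "at least 40Gi": compareTo(...) >= 0, not a strict greater-than
            expression: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0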
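
A claim does nothing until a pod references it. As a minimal sketch, assuming the claim above was created under the hypothetical name large-gpu in the same namespace, a pod would consume it roughly like this (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference                # hypothetical name
spec:
  resourceClaims:
  - name: gpu                        # pod-local handle for the claim
    resourceClaimName: large-gpu     # assumes the ResourceClaim was created with this name
  containers:
  - name: server
    image: my-inference:latest       # placeholder image
    resources:
      claims:
      - name: gpu                    # must match spec.resourceClaims[].name
```

The pod stays Pending until the scheduler can allocate a device that satisfies the claim's selector. For workloads that need one claim per replica, a ResourceClaimTemplate is likely the better fit than a single named claim.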

The hardware vendor signals are unambiguous. At KubeCon Europe 2026, NVIDIA donated its DRA GPU driver to CNCF. Google open-sourced the DRA TPU driver at the same event. Both major AI hardware vendors have committed to DRA as the standard interface for hardware scheduling in Kubernetes. If you are still writing Device Plugin configurations for new AI infrastructure, stop. DRA is GA; the Device Plugin model will not receive new capabilities.

Kubernetes 1.36 also introduces Workload Aware Scheduling in Alpha — a PodGroup API that treats related pods (a training job’s head and worker pods, for example) as a single scheduling unit. The group succeeds or fails placement atomically rather than each pod competing independently. Watch this feature; it closes the gang-scheduling gap that has made distributed training on Kubernetes awkward for years.

User Namespaces: One Field That Changes Your Security Posture

Before Kubernetes 1.36, achieving genuine container isolation required running gVisor or Kata Containers — both effective, both operationally heavy. Most teams skipped them and accepted a weaker boundary: containers running as effective root on the host kernel.

User Namespaces, now stable in 1.36, maps a container’s root user (UID 0) to an unprivileged user on the host. A process that escapes the container sandbox arrives on the node as that unprivileged user, not as root. Enable it with one field:

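apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  hostUsers: false   # run the pod in its own user namespace instead of the host's
  containers:
  - name: app
    image: my-app:latest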

The caveats are real: a shared kernel is still a shared attack surface, and User Namespaces does not eliminate the risk of kernel exploits. It reduces the blast radius when a container escape does happen. For workloads currently running with default settings, enabling hostUsers: false is a meaningful, low-cost improvement that requires no additional sandbox runtime, provided the node’s kernel and container runtime support user namespaces.

Resize Pods Vertically Without a Restart

In-place pod vertical scaling is stable and ships with Kubernetes 1.36. Patch CPU or memory requests on a running pod and the container adjusts without eviction. CPU changes apply immediately. Memory changes are governed by the resizePolicy field in the container spec: each entry names a resource and sets a restartPolicy that either applies the change in place (NotRequired) or triggers a controlled container restart (RestartContainer).

The most relevant use case for AI infrastructure: LLM inference servers. Loading a large model takes time — 60 seconds or more for 70B+ parameter models. Previously, vertical resizing meant evicting the pod, losing the loaded model, and waiting for the replacement to load it again. In-place resizing eliminates that cycle. Patch the spec; the container gets the resources; the model stays loaded.
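
For an inference server like the one described above, the container spec might look like the following sketch. The pod name, image, and resource sizes are placeholders; the resizePolicy entries are the part that matters:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference                # hypothetical name
spec:
  containers:
  - name: server
    image: my-inference:latest       # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired     # apply CPU changes without restarting
    - resourceName: memory
      restartPolicy: NotRequired     # keep the loaded model; use RestartContainer
                                     # if the process cannot adapt to a new limit
    resources:
      requests:
        cpu: "4"
        memory: 32Gi
      limits:
        cpu: "4"
        memory: 32Gi
```

Resizes go through the pod's resize subresource rather than a plain spec edit, for example with kubectl patch pod llm-inference --subresource resize and a patch body that changes the container's requests and limits. Shrinking memory below what the process already uses can still fail or stall, so treat scale-down as the riskier direction.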

Webhooks: Start Your Migration to MutatingAdmissionPolicy

MutatingAdmissionPolicy is GA in 1.36. It lets you write CEL expressions directly in the API server to mutate resources — adding labels, injecting sidecar defaults, normalizing resource limits — without a separate TLS-terminated webhook server. For the majority of admission webhook use cases, CEL in-process is faster, simpler, and has no network failure mode.
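
As a rough illustration of the label-adding case, here is a sketch of a policy and its binding. The v1 API version is assumed from the GA status (earlier releases served alpha and beta versions), and the policy name, label key, and label value are all hypothetical:

```yaml
apiVersion: admissionregistration.k8s.io/v1   # assumed; verify with kubectl api-resources
kind: MutatingAdmissionPolicy
metadata:
  name: add-team-label.example.com            # hypothetical name
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE"]
      resources: ["pods"]
  failurePolicy: Fail
  reinvocationPolicy: Never
  mutations:
  - patchType: ApplyConfiguration
    applyConfiguration:
      # CEL expression that merges a label into every matching pod
      expression: >
        Object{
          metadata: Object.metadata{
            labels: {"team": "platform"}
          }
        }
---
apiVersion: admissionregistration.k8s.io/v1   # assumed, as above
kind: MutatingAdmissionPolicyBinding
metadata:
  name: add-team-label-binding.example.com    # hypothetical name
spec:
  policyName: add-team-label.example.com      # must match the policy's metadata.name
```

A policy has no effect until a binding references it, which makes it possible to roll a mutation out to a narrow scope first by adding matchResources to the binding.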

Webhooks are not gone. Complex mutations requiring external data lookups still need them. But if you are maintaining a webhook for something expressible as a CEL expression, you are carrying unnecessary infrastructure. Audit your webhooks before you upgrade to 1.36.

Pre-Upgrade Checklist: What Breaks in Kubernetes 1.36

Before upgrading an EKS cluster to 1.36, check these breaking changes:

IPVS mode removed from kube-proxy. Deprecated in 1.35, deleted in 1.36. Check your config: kubectl get cm kube-proxy-config -n kube-system -o yaml | grep mode. If it returns ipvs, migrate to iptables before upgrading.
gitRepo volume driver removed. Deprecated since Kubernetes 1.11. Manifests referencing this volume type will not schedule after upgrade.
Ingress-NGINX retired. Plan your migration. The NGINX-based ingress controller is no longer maintained upstream.
CoreDNS add-on update required. EKS 1.36 runs CoreDNS 1.11.x. Update the add-on as part of your upgrade sequence.
externalIPs deprecation warnings are live. These fields disappear in 1.43. Migrate to LoadBalancer or Ingress.
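
For the IPVS item above, the change itself is small. A minimal sketch of the relevant kube-proxy setting after migration, assuming the standard KubeProxyConfiguration format; how the setting is delivered on EKS (managed add-on configuration versus editing the ConfigMap directly) depends on your setup and is not shown here:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"   # previously "ipvs"; an empty string selects the platform default
```

kube-proxy reads this at startup, so the pods in its DaemonSet need to restart before the new mode takes effect. Make the change in a maintenance window and confirm Service traffic still flows before starting the control plane upgrade.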

Most EKS upgrade incidents are not caused by the upgrade itself — they happen because deprecated manifests slip through undetected until the new API server rejects them. Run pluto or kubent against your cluster before upgrading to catch API removals early.

The Bigger Picture

Kubernetes 1.36 is a graduation release, not an announcement release. The project is in a stability-over-novelty phase: cementing features that matter for production (AI hardware scheduling, container security, operational simplicity) instead of shipping new APIs. Of the 70 enhancements tracked in this release, 18 are stable, 25 are beta, and 25 are alpha. That ratio tells you where the project’s energy is: making existing features solid, not adding more.

For teams that have been deferring upgrades, 1.36 on EKS is a good forcing function. The features that just graduated are worth having. The breaking changes are manageable if you audit before upgrading. Start with the IPVS check — that is the one most likely to surprise you.

ByteBot
