llm-d 0.7: Kubernetes LLM Inference That Cuts GPU Waste


llm-d 0.7 — Kubernetes-native distributed LLM inference with disaggregated serving and KV-cache routing

If your LLM serving is slow and expensive, buying more GPUs is probably the wrong prescription. The real problem is that most teams are wasting the hardware they already have — running prefill and decode on the same pods, cold-routing requests to nodes with empty caches, and watching GPU utilization spike while time-to-first-token climbs. llm-d 0.7, now a CNCF Sandbox project with Google, Red Hat, IBM, NVIDIA, and AWS behind it, is the Kubernetes-native scheduling layer built to fix exactly this.

The Problem: Prefill and Decode Are Not the Same Workload

Most vLLM deployments treat all inference the same. That’s the mistake. Prefill — processing the input tokens in parallel — is compute-intensive. Decode — generating output tokens one at a time — is sequential and constrained by memory bandwidth. These two phases have completely different hardware profiles, and running them on the same GPU means each is constantly throttled by the other’s requirements.
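
A rough way to see the split is a roofline-style estimate. The sketch below is a simplification under stated assumptions: it counts about 2 FLOPs per parameter per token, assumes the weights are read once per forward pass, ignores attention and KV-cache traffic, and uses a made-up accelerator whose numbers are placeholders rather than any real GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator; the numbers are placeholders, not a real GPU."""
    peak_flops: float      # FLOP/s
    mem_bandwidth: float   # bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOPs per byte) above which work is compute-bound."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Simplified model: each token costs about 2 FLOPs per parameter, and the
    weights are read from memory once per pass regardless of token count.
    The parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bottleneck(tokens_per_pass: int, hw: Accelerator) -> str:
    """Classify a forward pass against the accelerator's ridge point."""
    ai = arithmetic_intensity(tokens_per_pass)
    return "compute-bound" if ai >= hw.ridge_point else "memory-bound"


hw = Accelerator(peak_flops=1.0e15, mem_bandwidth=4.0e12)

prefill = bottleneck(tokens_per_pass=2048, hw=hw)  # whole prompt in one pass
decode = bottleneck(tokens_per_pass=1, hw=hw)      # one new token per pass

print(f"ridge point: {hw.ridge_point:.0f} FLOPs/byte")
print(f"prefill: {arithmetic_intensity(2048):.0f} FLOPs/byte, {prefill}")
print(f"decode:  {arithmetic_intensity(1):.0f} FLOPs/byte, {decode}")
```

Batching decode requests raises the tokens per pass and narrows the gap, but under this model a single decode step stays far below the ridge point while a long prompt sits far above it.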

llm-d’s answer is disaggregated serving: prefill requests go to compute-optimized pods, and decode goes to memory-bandwidth-optimized pods. The KV cache produced during prefill is handed off to the decode pod, which is why fast inter-node transfer matters. In production tests on GPT-OSS-120B and Llama 3.3 70B on AMD MI300X, disaggregated inference delivered 57x faster time-to-first-token and 2x throughput on the same hardware. That is not a benchmark artifact. It is what happens when you stop forcing two incompatible workloads to share the same resource pool.

KV-Cache-Aware Routing: The Optimization Nobody Talks About

Every time a request lands on a vLLM pod that does not hold the relevant KV cache state, you pay for the prefill computation from scratch. In a multi-instance deployment where requests land round-robin, this happens constantly. llm-d’s Inference Gateway (IGW) tracks which pods hold which cached KV states and routes each request to the pod that already holds the KV cache for its prompt prefix. The savings compound: the longer the system runs, the warmer the caches and the cheaper each request becomes.
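
The routing idea can be sketched in a few lines. This is an illustration, not llm-d's actual scorer: the `Pod` class, `pick_pod` function, block size, and tie-breaking rule are all hypothetical, and a real scheduler would also weigh queue depth and memory pressure.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block; tiny so the example is easy to trace


def block_hashes(tokens: list[int]) -> list[str]:
    """Chained hashes of full token blocks: each hash covers the whole prefix so far."""
    hashes, prev = [], ""
    for start in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        text = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(text.encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class Pod:
    name: str
    cached: set[str] = field(default_factory=set)  # block hashes believed to be on this pod
    in_flight: int = 0                              # requests currently being served


def matched_blocks(pod: Pod, hashes: list[str]) -> int:
    """Count leading blocks already cached on the pod, stopping at the first miss."""
    count = 0
    for h in hashes:
        if h not in pod.cached:
            break
        count += 1
    return count


def pick_pod(pods: list[Pod], tokens: list[int]) -> Pod:
    """Prefer the longest cached prefix; break ties by lowest load."""
    hashes = block_hashes(tokens)
    best = max(pods, key=lambda p: (matched_blocks(p, hashes), -p.in_flight))
    best.cached.update(hashes)  # after serving, the pod holds these blocks
    best.in_flight += 1
    return best


pods = [Pod("pod-a"), Pod("pod-b")]
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # shared prefix, as token ids

first = pick_pod(pods, system_prompt + [9, 10, 11, 12])
first.in_flight -= 1  # first request has finished
second = pick_pod(pods, system_prompt + [20, 21, 22, 23])
print(first.name, second.name)
```

The second request shares the system prompt with the first, so it lands on the pod that already cached those two blocks and only the final block needs fresh prefill.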

According to llm-d’s production data, cache-aware routing reduces infrastructure costs by 30–50% while holding latency SLOs. You are not trading performance for savings — you are getting both.

What’s New in v0.7

The headline in v0.7 is predicted-latency scheduling going GA. The Endpoint Picker (EPP) now integrates with in-pod latency predictor sidecars that learn continuously from live traffic. They estimate p90 TTFT and p90 TPOT for each candidate pod, compare against the per-request SLO, and direct traffic to pods with headroom. When no pod can meet the SLO, requests are shed rather than queued until they time out. The result is a 40% reduction in TTFT and inter-token latency versus heuristic-based routing on NVIDIA GPUs.
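
The pick-or-shed decision described above might look roughly like this. It is a sketch under assumptions: the predictions are hard-coded rather than learned, and the `Prediction`, `SLO`, and `pick_or_shed` names and the headroom formula are illustrative, not the EPP's real interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    pod: str
    p90_ttft_ms: float  # predicted time-to-first-token
    p90_tpot_ms: float  # predicted time-per-output-token


@dataclass(frozen=True)
class SLO:
    ttft_ms: float
    tpot_ms: float


def headroom(pred: Prediction, slo: SLO) -> float:
    """Smallest relative margin across both targets; negative means a predicted miss."""
    return min(
        (slo.ttft_ms - pred.p90_ttft_ms) / slo.ttft_ms,
        (slo.tpot_ms - pred.p90_tpot_ms) / slo.tpot_ms,
    )


def pick_or_shed(preds: list[Prediction], slo: SLO) -> Optional[str]:
    """Return the pod with the most headroom, or None to shed the request."""
    viable = [p for p in preds if headroom(p, slo) >= 0]
    if not viable:
        return None
    return max(viable, key=lambda p: headroom(p, slo)).pod


preds = [
    Prediction("pod-a", p90_ttft_ms=450.0, p90_tpot_ms=40.0),
    Prediction("pod-b", p90_ttft_ms=200.0, p90_tpot_ms=30.0),
    Prediction("pod-c", p90_ttft_ms=900.0, p90_tpot_ms=20.0),
]

relaxed = pick_or_shed(preds, SLO(ttft_ms=500.0, tpot_ms=50.0))
strict = pick_or_shed(preds, SLO(ttft_ms=100.0, tpot_ms=50.0))
print(relaxed, strict)
```

With the relaxed SLO, pod-b wins because it has the largest margin on both metrics. With the strict one, no pod is predicted to hit a 100 ms TTFT, so the function returns `None` and the caller would reject the request immediately instead of letting it queue.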

v0.7 also ships an experimental batch gateway for async inference workloads alongside real-time serving — a capability missing from vLLM standalone. The documentation was fully rewritten with a kustomize-first approach, and nightly CI now runs against OpenShift, GKE, and CoreWeave.

CNCF + AWS: This Is Infrastructure Now

In March 2026, llm-d was accepted as a CNCF Sandbox project at KubeCon Europe. CNCF Sandbox status means governance, community CI, and a roadmap that does not depend on any single vendor. The founding contributors — Red Hat, Google Cloud, IBM Research, NVIDIA, and CoreWeave — are joined by AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI.

More telling is the AWS signal. Amazon launched disaggregated inference on AWS powered by llm-d, shipping a dedicated llm-d-aws container with Elastic Fabric Adapter support for high-speed inter-node KV transfer. It runs on both SageMaker HyperPod and Amazon EKS. When AWS builds and ships a custom container for an open-source project, that project has crossed from “interesting experiment” to “production infrastructure.”

llm-d vs. vLLM: When to Use Which

llm-d does not replace vLLM — it orchestrates a fleet of vLLM instances. For a single model on 1–4 GPUs, vLLM standalone is the right choice: less complexity, less overhead, full performance. llm-d becomes the right answer when you are running multiple vLLM instances, serving models above 70B parameters, handling multi-tenant workloads, or need to drive down GPU costs at scale.

The integration point is the Kubernetes Gateway API — no new control plane required. A minimal deployment adds an InferencePool and an Endpoint Picker in front of your existing vLLM pods:

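# InferencePool from the Gateway API Inference Extension (v1alpha2 schema).
# Field names may differ in other API versions, so match this to the CRDs in your cluster.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-70b-pool
spec:
  targetPortNumber: 8000      # port the vLLM containers listen on
  selector:
    app: vllm-llama-70b       # label on the existing vLLM pods
  extensionRef:
    name: llm-d-epp           # the Endpoint Picker that makes routing decisions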

The EPP handles everything from there: routing, cache locality decisions, latency prediction, and request shedding when SLOs cannot be met.
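
From the client side nothing changes, since vLLM already exposes an OpenAI-compatible HTTP API and requests simply go to the gateway address instead of to a pod. The sketch below only builds the request. The gateway URL and model name are hypothetical placeholders to replace with your own values.

```python
import json
import urllib.request

# Hypothetical values: replace with your gateway address and served model name.
GATEWAY_URL = "http://inference-gateway.example.internal"
MODEL_NAME = "llama-70b"


def build_completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a request to the OpenAI-compatible completions route."""
    payload = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{GATEWAY_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("Summarize disaggregated serving in one sentence.")
print(req.get_method(), req.full_url)

# To send it against a live gateway:
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(json.load(resp)["choices"][0]["text"])
```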

The Bigger Picture

GPU compute cost is the fastest-growing infrastructure line item for AI teams running production workloads in 2026. The instinct is to add hardware. The better move is to route requests more intelligently across the hardware you already have. llm-d’s GitHub repository and documentation are the right starting point before your next GPU procurement decision.

ByteBot
