Edge AI isn’t optional anymore. In 2026, running inference locally cuts latency from seconds to milliseconds, slashes cloud costs by 60-80%, and keeps sensitive data where it belongs. But standard Kubernetes is a 1GB+ behemoth that chokes on edge hardware. K3s fixes this. At less than 100MB, it brings full orchestration to NVIDIA Jetson devices and edge servers, turning them into production-ready AI clusters. Here’s the complete setup: from bare Jetson Nano to serving real-time object detection in under 30 minutes.
What You’ll Need
Hardware-wise, any NVIDIA Jetson works: Nano ($99), Xavier NX ($399), or Orin ($899+). Already running a GPU server? That works too. Software: Ubuntu 20.04+ and NVIDIA drivers (JetPack handles this on Jetson devices). You’ll need basic Kubernetes knowledge – pods, deployments, kubectl commands. If you’ve run Docker, you’re ready.
Install NVIDIA Container Toolkit First
This is critical. The NVIDIA Container Toolkit lets containers access your GPU. Install it BEFORE K3s – K3s will auto-detect it and configure GPU support automatically. Skip this order and you’ll fight manual configuration hell.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
(The old nvidia-docker apt repository and apt-key are deprecated – the libnvidia-container repo above is the current source. And there's no Docker to restart: K3s brings its own containerd.)
Verify with nvidia-smi. You should see your GPU, driver version, and CUDA version. No output? Fix your drivers before continuing.
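If you'd rather script that check than eyeball the table, nvidia-smi's query flags emit parseable CSV. The check_gpu helper below is our own sketch, not NVIDIA tooling – only the --query-gpu/--format flags are standard:

```shell
# check_gpu: read one CSV line of GPU info from stdin; fail if there is none.
check_gpu() {
  line=""
  IFS= read -r line
  if [ -z "$line" ]; then
    echo "no GPU visible" >&2
    return 1
  fi
  echo "GPU OK: $line"
}

# Real usage on the host (requires working drivers):
#   nvidia-smi --query-gpu=name,driver_version,memory.total \
#     --format=csv,noheader | check_gpu

# Offline demo with a mocked CSV line:
echo "Orin, 540.4.0, 32768 MiB" | check_gpu
```

Wire the real pipeline into provisioning scripts so a bad driver install fails loudly instead of surfacing later as unschedulable pods.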
Install K3s with GPU Auto-Detection
K3s installation is embarrassingly simple – one command. Since we installed Container Toolkit first, K3s automatically configures containerd for GPU access. No flags needed.
curl -sfL https://get.k3s.io | sh -
Wait 30 seconds. Check cluster status:
sudo kubectl get nodes
Your node should show Ready. K3s is running. But Kubernetes doesn’t know about your GPU yet – that’s next.
Configure GPU Runtime Access
Containerd needs explicit configuration to expose the NVIDIA runtime to pods. One caveat: K3s runs its own embedded containerd, and recent releases detect nvidia-container-runtime and register the runtime automatically – check /var/lib/rancher/k3s/agent/etc/containerd/config.toml for an nvidia entry first. If it's not there, create /etc/containerd/config.d/99-nvidia.toml:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
Restart K3s to apply:
sudo systemctl restart k3s
The runtime handler is now registered as nvidia. Note that the containerd config alone doesn't create the Kubernetes-side RuntimeClass object – that takes a separate one-line manifest. Pods requesting the nvidia RuntimeClass get GPU access. Clean abstraction – GPU scheduling doesn’t pollute your pod specs.
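The RuntimeClass itself is stock Kubernetes API, nothing K3s-specific – create it with a manifest like this:

```yaml
# RuntimeClass mapping the name pods reference to the containerd handler
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name registered in containerd
```

Save it as runtimeclass.yaml and apply with kubectl apply -f runtimeclass.yaml.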
Deploy NVIDIA Device Plugin
The NVIDIA device plugin advertises GPUs as schedulable Kubernetes resources. Without it, kubectl has no idea your node has a GPU.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
Verify GPU nodes:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
You should see 1 (or more) under the GPU column. If it’s blank, check device plugin logs: kubectl logs -n kube-system -l name=nvidia-device-plugin-ds.
Run Your First GPU Workload
Time to test. Create gpu-test.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never   # one-shot command; don't restart after it exits
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
Deploy and check logs:
kubectl apply -f gpu-test.yaml
sleep 5
kubectl logs gpu-test
You should see nvidia-smi output showing your GPU from inside the container. GPU model, driver version, CUDA version – all visible. That’s your edge cluster running GPU-accelerated workloads.
Deploy Real AI Inference
Hello world is done. Let’s run actual AI. Here’s a TensorFlow Serving deployment requesting GPU access:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tf-inference
  template:
    metadata:
      labels:
        app: tf-inference
    spec:
      runtimeClassName: nvidia
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["python", "-c", "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"]
        resources:
          limits:
            nvidia.com/gpu: 1
Deploy it:
kubectl apply -f tf-inference.yaml
kubectl logs -f deployment/tf-inference
TensorFlow should detect your GPU and print the physical device info. Note that this one-shot command exits immediately, so the Deployment will restart it in a loop – fine for a smoke test. For real use, swap the Python command for a long-running server (TensorFlow Serving, Triton, custom FastAPI) and you’ve got production inference at the edge.
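Once a real server is running, expose it inside the cluster with a Service. This sketch assumes TensorFlow Serving's default REST port, 8501, and reuses the app: tf-inference label from the Deployment above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: tf-inference
spec:
  selector:
    app: tf-inference    # routes to the Deployment's pods
  ports:
  - name: rest
    port: 8501           # TF Serving's default REST API port
    targetPort: 8501
```

In-cluster clients can then POST predict requests to http://tf-inference:8501 – the exact path depends on your serving stack and model name.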
When Things Break: Troubleshooting Checklist
GPU not detected by Kubernetes?
- Run nvidia-smi on the host. No output = driver issue, not Kubernetes.
- Check Container Toolkit: dpkg -l | grep nvidia-container-toolkit. Not installed? Go back to step one.
- Verify the device plugin is running: kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds.
- Check node GPU capacity: kubectl describe node | grep nvidia.com/gpu. Should show allocatable GPUs.
Pod can’t access GPU?
- Add runtimeClassName: nvidia to your pod spec. Missing this is the #1 cause.
- Specify a resource limit: nvidia.com/gpu: 1 under resources.limits.
- Check GPU availability: kubectl describe node shows allocated vs. available. Another pod hogging it?
CUDA version mismatch errors?
Your NVIDIA driver supports specific CUDA versions. Check the NVIDIA CUDA compatibility matrix. Upgrade drivers or downgrade your container image CUDA version.
Production Hardening: What’s Next
You’ve got GPU workloads running. To productionize:
Monitoring: Deploy Prometheus with DCGM Exporter to track GPU utilization, temperature, memory. Edge clusters running blind fail silently.
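As a sketch, a static Prometheus scrape job for DCGM Exporter might look like the fragment below. Port 9400 is the exporter's default; the target hostname is a placeholder for your node's address, and in a real cluster you'd likely use Kubernetes service discovery instead of static targets:

```yaml
scrape_configs:
- job_name: dcgm
  static_configs:
  - targets: ['edge-node-1:9400']   # replace with your node's address
```

Key metrics to alert on: DCGM_FI_DEV_GPU_UTIL for utilization and DCGM_FI_DEV_GPU_TEMP for temperature – passively cooled Jetsons throttle hard.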
Multi-node clusters: Add worker nodes with curl -sfL https://get.k3s.io | K3S_URL=https://server-ip:6443 K3S_TOKEN=your-token sh - (the join token lives at /var/lib/rancher/k3s/server/node-token on the server). High availability matters at the edge – hardware fails.
Model serving: Integrate Triton Inference Server for production-grade multi-model serving with dynamic batching and model versioning.
Security: Apply network policies. Edge deployments are exposed – default-allow is a liability. Use pod security standards (restricted profile minimum).
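A minimal starting point is a default-deny ingress policy, then allowlist only the traffic your inference clients need. This is stock Kubernetes NetworkPolicy, and K3s ships a network policy controller that enforces it:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}   # applies to every pod in the namespace
  policyTypes:
  - Ingress         # no ingress rules listed = all inbound traffic denied
```

Apply it per namespace, then layer on narrowly scoped allow rules for your serving ports.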
K3s isn’t a lightweight Kubernetes – it’s a better Kubernetes for edge. Full orchestration, 1% of the bloat. GPU acceleration isn’t optional in 2026, and neither is running it where your data lives. This setup moves you from cloud-dependent to edge-capable in 30 minutes. The rest is just scale.

