Microsoft Foundry Local Is GA: Run AI On-Device Free

Microsoft Foundry Local on-device AI inference runtime - developer workstation with neural network visualization

Microsoft Foundry Local hits General Availability at Build 2026

Microsoft just ended the per-token tax for developers willing to run models locally. Foundry Local hit General Availability at Build 2026 on June 2, giving developers a production-grade, cross-platform runtime for on-device AI inference — no cloud dependency, no per-token costs, no network latency. Python, JavaScript, C#, and Rust SDKs ship today. Hardware acceleration is automatic: GPU, NPU, or CPU, with zero detection code required.

This Is Not Another Ollama Clone

That distinction matters. Ollama, LM Studio, and LocalAI all require a separate server process running in the background before your app can make inference calls. Foundry Local ships as a native SDK — you import a library, not spin up a daemon. The runtime manages the full model lifecycle (download, load, inference, unload) inside your process. For production deployments, this simplifies things considerably: fewer moving parts, no daemon health checks, no port management.

It’s also OpenAI API-compatible. If you’re already using the OpenAI Python SDK, switching to local inference is one URL change:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5272/v1",
    api_key="not-needed"
)
response = client.chat.completions.create(
    model="phi-4",
    messages=[{"role": "user", "content": "Explain async/await in Python."}]
)
print(response.choices[0].message.content)

The same API call that hits GPT-4o today can target a local Phi-4 tomorrow, and the same code can point back to Azure AI Foundry cloud when you need a larger model. Write once, deploy anywhere on the Microsoft stack.

Getting Started

Installation is platform-aware:

# Windows (hardware-accelerated via WinML)
pip install foundry-local-sdk-winml

# macOS (Apple Silicon) or Linux x64
pip install foundry-local-sdk

JavaScript developers get the same split: npm install foundry-local-sdk-winml on Windows, npm install foundry-local-sdk everywhere else. C# and Rust SDKs are available too — this is not a Python-first project that bolted on other languages. The quickstart docs run a model in under five minutes on any supported platform.

What’s in the Model Catalog

At GA, Microsoft Foundry Local ships roughly two dozen models optimized for local inference:

Phi-4, Phi-3.5 Mini — Microsoft’s own; strong quality-to-size ratio for on-device AI inference
Qwen 2.5 — 0.5B through 7B variants; the 0.5B is useful for testing on constrained hardware
DeepSeek R1 — 7B and 14B reasoning models
Mistral 7B — solid general-purpose baseline
Whisper — audio transcription, fully on-device

The catalog focuses on small-to-medium open models optimized specifically for edge inference — don’t expect GPT-4-class capability. This is the right trade-off: models sized for real hardware, not cloud clusters.

When to Use Foundry Local Instead of Ollama

The honest answer: it depends on what you’re building.

Ollama wins on ecosystem breadth. Any GGUF model runs on it, the community integration list is vast (Open WebUI, Continue.dev, dozens more), and macOS Intel support exists. If you’re experimenting or building personal tooling, Ollama is still the path of least resistance.

Microsoft Foundry Local wins on three things Ollama doesn’t have:

NPU support — Copilot+ PC NPUs are now mainstream. Foundry Local uses them automatically. Ollama does not.
No server process — embed inference directly in your application without requiring a background daemon.
Enterprise governance — data sovereignty, lifecycle management, a path to Azure cloud, and enterprise support SLAs.

If you’re shipping a product to customers — especially in regulated industries — Foundry Local is worth evaluating seriously.

The Privacy Angle Is Real

Healthcare, finance, and government teams have spent years trying to use LLMs while keeping data off third-party servers. Microsoft Foundry Local is a production answer to that problem. Prompts and outputs stay on the device. The runtime has already been deployed in air-gapped sovereign cloud environments alongside Azure Local Disconnected — this isn’t a future roadmap item, it’s shipping now.

For HIPAA-covered applications, GDPR-constrained deployments, or classified-environment tooling, the combination of a mature SDK, official Microsoft support, and zero data egress is a genuine unlock.

Limitations Worth Knowing

macOS Intel is not supported — Apple Silicon only. The model catalog, while solid, is narrower than Ollama’s (no arbitrary GGUF models). And local inference performance still depends on your hardware; a machine without a GPU or NPU will use CPU, which is usable but slow for larger models.

None of these are dealbreakers for the right use case. They are, however, reasons Ollama isn’t dead.

What to Do Now

The GitHub repository has working samples for all four SDKs. The VS Code Foundry Toolkit extension adds IDE integration for model management and local inference testing. If you’ve been waiting for local AI to be production-grade, Microsoft Foundry Local is the release that changes that.

ByteBot

I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.