
Thinking Machines AI: Ex-OpenAI CTO’s Model Beats GPT-Realtime, 3x Faster

[Image: Mira Murati’s Thinking Machines real-time interaction model concept, visualizing its 0.4-second latency against OpenAI GPT-Realtime benchmarks]

On May 11, 2026, Mira Murati’s Thinking Machines Lab announced “Interaction Models” that achieved 0.4-second real-time AI responses while listening and speaking simultaneously. The TML-Interaction-Small model scored 77.8/100 on the FD-bench interaction-quality benchmark, crushing OpenAI’s GPT-Realtime-2.0 (46.8, at 1.18s latency). Eight months after leaving as OpenAI’s CTO, Murati is back with a model 66% better and roughly 3x faster than her former employer’s offering.

This isn’t an incremental improvement. It’s a direct challenge to how every major AI lab builds voice interaction.

200ms Micro-Turns: Rethinking AI Interaction From Scratch

Thinking Machines built a 200ms micro-turn architecture that processes inputs and outputs in continuous chunks, eliminating turn-taking entirely. Current voice AI feels awkward because it is: humans expect responses within roughly 500ms, and anything longer registers as lag. Traditional systems wait for you to finish speaking, detect the pause, process, then respond. That pipeline is why Siri and Alexa feel robotic.
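To make the shape of the idea concrete, here is a minimal Python sketch of a micro-turn loop, assuming a hypothetical streaming interface (the `EchoModel` class, chunk size, and timing are illustrative stand-ins, not TML’s implementation). The point is the loop itself: audio is consumed and emitted on a fixed ~200ms cadence, with no pause detection anywhere.

```python
import itertools
import time

CHUNK_MS = 200  # micro-turn length described in the announcement

class EchoModel:
    """Hypothetical stand-in for a streaming interaction model: each step
    ingests one chunk of input audio and may emit one chunk of output, so
    listening and speaking overlap instead of alternating turns."""

    def step(self, audio_chunk: bytes) -> bytes | None:
        # A real model would encode/decode audio tokens here; we just echo.
        return audio_chunk or None

def micro_turn_loop(model: EchoModel, input_chunks) -> None:
    for n, chunk in enumerate(input_chunks):
        start = time.monotonic()
        out = model.step(chunk)  # process the latest 200ms of input
        if out is not None:
            print(f"turn {n}: emit {len(out)} bytes")  # playback would go here
        # hold each micro-turn to the 200ms cadence
        time.sleep(max(0.0, CHUNK_MS / 1000 - (time.monotonic() - start)))

# five chunks of silence: 200ms of 16-bit mono audio at 8kHz = 3200 bytes
micro_turn_loop(EchoModel(), itertools.repeat(b"\x00" * 3200, 5))
```

Because the model steps on every chunk whether or not anyone is talking, an interruption is just another chunk; there is no end-of-turn signal to wait for.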

TML’s approach makes interaction native to the model itself. No external voice activity detection. No scaffolding. The AI handles interruptions, overlapping speech, and proactive interjections because it was designed for them from the ground up. On CueSpeak (responding to verbal cues at the right moment), TML scores 81.7% versus 2.9% for baseline turn-based systems. On TimeSpeak (speaking at user-specified times), TML hits 64.7% versus a 4.3% baseline.

In contrast, OpenAI bolted real-time capabilities onto existing chat models. Thinking Machines argues that’s fundamentally wrong. “Every major AI lab has built its interaction layer as an afterthought, and the resulting latency and limitation is not a tuning problem but an architectural one,” the company states in its official blog. The benchmarks suggest native interaction wins.

Ex-OpenAI CTO Proves Former Employer Built It Wrong

Mira Murati stepped down as OpenAI’s CTO in September 2024 to “do my own exploration.” At OpenAI, she led development of ChatGPT, DALL-E, Codex, and Sora. Now her Thinking Machines Lab has released a model that dramatically outperforms OpenAI’s GPT-Realtime-2.0 across every interaction metric: 66% higher interaction quality at roughly a third of the latency.

This is more than competitive positioning; it’s an architectural philosophy debate the industry hasn’t settled. OpenAI took existing chat infrastructure and added real-time features. Thinking Machines built interaction natively. Different approaches, radically different results. The question isn’t just “who’s faster?” but “which architecture is right?”

TML’s 0.4-second latency sits under the ~500ms human perception threshold. GPT-Realtime-2.0’s 1.18 seconds crosses into “this feels slow” territory. That difference isn’t tuning; it’s design. Turn-based systems will always carry that gap because they wait for end-of-turn signals that continuous processing doesn’t need.

Two-Component Architecture: Speed and Intelligence Together

TML uses a two-component system: an interaction model handling immediate real-time dialogue in 200ms chunks, and a background model processing deeper reasoning tasks asynchronously. The interaction model stays present in the conversation while the background model runs tool calls, searches, and complex analysis, streaming results back naturally as they arrive.
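One way to picture the split is as two concurrent tasks sharing a queue, where the fast path never blocks on the slow one. Below is a minimal asyncio sketch under that assumption; the function names, timings, and queue interface are mine, not TML’s code.

```python
import asyncio

async def background_reasoning(query: str, results: asyncio.Queue) -> None:
    """Hypothetical slow path: tool calls, search, deeper analysis."""
    await asyncio.sleep(2.0)  # stands in for heavy work
    await results.put(f"deep answer for {query!r}")

async def interaction_loop(results: asyncio.Queue) -> None:
    """Hypothetical fast path: keeps replying on a ~200ms cadence and
    weaves background results into the conversation when they arrive."""
    for turn in range(15):
        try:
            deep = results.get_nowait()  # non-blocking check for deep results
            print(f"[{turn * 0.2:.1f}s] streaming in background result: {deep}")
        except asyncio.QueueEmpty:
            print(f"[{turn * 0.2:.1f}s] immediate conversational reply")
        await asyncio.sleep(0.2)

async def main() -> None:
    results: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        background_reasoning("compare these two APIs", results),
        interaction_loop(results),
    )

asyncio.run(main())
```

The conversation never stalls: the fast loop answers every 200ms regardless, and the deep result surfaces about two seconds in, mid-conversation.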

This split resolves the latency-versus-intelligence trade-off. Real-time responses can’t wait for complex reasoning, so by separating immediate interaction from deep thinking, TML maintains 0.4s latency while still reaching sophisticated capabilities when needed. The 276B total parameters (a mixture-of-experts with 12B active) balance size with speed: only the routed expert networks run on each inference step.
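That 276B-total, 12B-active figure means only about 4% of the weights run per token. The mechanism is standard top-k expert routing; the sketch below uses toy expert counts and dimensions, not TML’s configuration.

```python
import numpy as np

# Illustrative top-k mixture-of-experts routing: many expert networks exist,
# but each token activates only a few, so active parameters << total.
NUM_EXPERTS, TOP_K, DIM = 64, 2, 8
rng = np.random.default_rng(0)

experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS))

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router                # router scores every expert
    top = np.argsort(logits)[-TOP_K:]  # keep only the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax
    # Only TOP_K of NUM_EXPERTS weight matrices are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(DIM)
print(moe_forward(token))  # same-sized output, a fraction of the dense compute
```

Routing two of 64 experts here touches about 3% of the expert weights, a similar active fraction to 12B of 276B.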

For developers building AI assistants, this architecture pattern matters. You don’t want fast but dumb, or smart but slow. The two-component design shows how to get both, and competitors will likely copy it.

What This Means for Developers

Eighty-four percent of developers already use AI coding tools, but current assistants are text-based with awkward turn-taking. Sub-500ms latency plus simultaneous listening and speaking enables true voice+video collaboration: the AI watches your screen, hears you thinking aloud, and gives real-time feedback without waiting for you to finish sentences.

For instance, imagine describing what you’re building while AI watches your code, notices errors mid-typing, and asks “Did you mean to use that API version?” without breaking your flow. Current tools require explicit prompts. Real-time interaction makes AI feel like a pair programming partner, not a Q&A bot. That’s the shift from “AI as tool” to “AI as collaborator.”

TML-Interaction-Small is currently limited to research partners, with a broader public rollout planned for later in 2026; contact feedback@thinkingmachines.ai for research access. The team has also upstreamed its custom kernels to the SGLang framework, so the community can use the same optimizations powering TML.

Key Takeaways

  • Thinking Machines Lab (founded by ex-OpenAI CTO Mira Murati) announced Interaction Models on May 11, 2026, achieving 0.4s latency and 77.8/100 interaction quality—66% better than OpenAI GPT-Realtime-2.0
  • 200ms micro-turn architecture eliminates turn-taking by processing inputs and outputs continuously, making interruption and overlap native to the model rather than requiring external voice activity detection
  • The architectural debate: OpenAI bolted real-time onto chat models; Thinking Machines built interaction natively. Benchmarks show native interaction wins: TML scores 81.7% on CueSpeak vs 2.9% baseline, 64.7% on TimeSpeak vs 4.3% baseline
  • Two-component design (interaction model + background model) solves latency vs intelligence trade-off, maintaining 0.4s responses while accessing complex reasoning asynchronously via 276B MoE (12B active)
  • Developer impact: 84% already use AI coding tools; sub-500ms real-time voice+video interaction enables natural pair programming that watches screens and collaborates continuously, shifting from “AI as tool” to “AI as collaborator”