Together AI raised 800 million dollars on July 1 at an 8.3 billion dollar valuation. The investors tell the story: Aramco Ventures led, NVIDIA wrote a check, General Catalyst came back. This is not venture money chasing a language model. It is infrastructure capital betting that enterprises will run the bulk of their AI workloads on open-source models and that they will need purpose-built cloud infrastructure to do it cheaply at scale. Annual bookings hit 1.15 billion dollars last quarter. Enterprises are not hedging. They are committing.
What Together AI Actually Does
Together AI has never built a foundation model. That is intentional. The company’s entire business is inference infrastructure for other companies’ open-source models — over 200 of them, including Llama, DeepSeek, Qwen, Mistral, and Nemotron. You send an API call, get a response, pay per token. No GPU provisioning, no replica sizing, no cold-start anxiety on popular models. And because the API is OpenAI-compatible, migrating existing applications is a one-line change.
The platform goes deeper than serverless inference. Dedicated H100 and H200 endpoints handle high-volume production workloads. GPU clusters and fine-tuning pipelines let teams train on proprietary data. Batch inference runs at a 50 percent discount for non-real-time workloads — relevant for teams running document pipelines, classification jobs, or overnight data processing. Together AI’s research team also developed FlashAttention-3, which delivers 1.5 to 2x more performance from H100 GPUs and reaches approximately 75 percent of the chip’s theoretical maximum FLOPS.
The Cost Case Is Not Theoretical
Together AI claims customers can cut inference costs by up to 60x compared with closed-model providers. Even the conservative figure — 6 to 20x savings — changes business models when you are running inference at enterprise scale. Decagon, which builds AI-powered voice agents for enterprise customer service, moved its production inference to Together AI and achieved a 6x cost reduction per turn. Latency dropped from seconds to under 400 milliseconds at p95, on inputs spanning tens of thousands of tokens. The workload that previously strained a voice pipeline now runs around the clock without degradation.
These are production numbers, not benchmark lab results.
Try It Today
If you are already using the OpenAI Python SDK, migration takes under a minute:
from openai import OpenAI
client = OpenAI(
api_key="your-together-api-key",
base_url="https://api.together.xyz/v1"
)
response = client.chat.completions.create(
model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-Turbo",
messages=[{"role": "user", "content": "Summarize this document"}]
)
The Together AI serverless model catalog lists all available models with current pricing. Start on the free tier, run your benchmarks, and compare total monthly cost against your current OpenAI or Anthropic bill. If you have been optimizing inference costs with vLLM for self-hosted models, Together AI’s managed option removes the infrastructure burden entirely.
What NVIDIA’s Investment Signals
NVIDIA investing in Together AI is not surprising — Together AI runs workloads on NVIDIA hardware. The more interesting signal is what it says about where NVIDIA expects GPU revenue to come from. Model training is a one-time cost. Inference is recurring and scales with usage. NVIDIA backing the open-source inference layer suggests the company believes open models will drive long-term GPU demand, not proprietary model vendors. Aramco Ventures leading reinforces this: sovereign wealth funds treat AI inference like energy infrastructure, not software startups. The 500 million dollars in secured compute commitments and the 50x capacity expansion plan over five years are consistent with that framing.
Where Closed Models Still Win
This is not a eulogy for closed models. Open-source models reach roughly 85 to 90 percent of closed-model performance on standard enterprise tasks. That remaining gap matters for complex multi-step reasoning, high-stakes decision support, and applications where quality degradation carries real risk. The hybrid architecture — open-source inference for high-volume workloads, frontier closed models for edge cases — is what most enterprise AI teams will land on. The right question is not open or closed. It is which workloads are worth closed-model prices. For most production inference at scale, the answer is fewer than you think.
The Bottom Line
Together AI’s 800 million dollar round is a credibility inflection point for open-source inference. A year ago this was a cost-optimization play for budget-constrained teams. Today it is an 8.3 billion dollar business backed by the GPU leader and sovereign infrastructure capital, with 1.15 billion dollars in annual bookings to back it up. If you are routing all inference through OpenAI or Anthropic and have not run a cost comparison against open models, this round is your signal to do it. The math is hard to ignore at scale.













