You built a Foundation Models feature. It tagged books perfectly in Xcode Previews. Then a real user wrote a three-sentence review with a typo and it generated eleven tags about nothing. This is the core problem with shipping AI features in 2026: your existing tests were never designed for this, and most developers are solving it with hope.
Apple disagrees with that approach. At WWDC26, the company introduced the Evaluations framework—a first-party Swift framework built specifically to measure the quality of generative AI features in iOS and macOS apps. It ships with Xcode 27, integrates natively with Swift Testing, and targets the gap between “it works in the demo” and “it works for real users.”
The Problem With Testing LLMs Like You Test Everything Else
Traditional unit tests assume determinism. LLMs are probabilistic: the same prompt produces different outputs on different runs, by design. XCTest's XCTAssertEqual will fail on valid but differently worded responses, and it can pass when an incorrect output happens to match the hard-coded expectation. Around 40% of organizations deploying LLM-powered apps hit significant quality regressions within 90 days of launch. Most engineering teams ship their AI features with less test coverage than their login forms.
The Evaluations framework treats AI feature quality as something you measure statistically rather than assert exactly—more like performance benchmarking than unit testing. Run your feature against a dataset of 20 to 30 samples, aggregate pass rates and scores across all of them, and track whether your metrics improve as you iterate on your prompts.
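To make the statistical framing concrete, here is a minimal sketch of aggregating a pass rate over a dataset and comparing it to a threshold. It does not use the Evaluations framework; SampleResult, passRate, and meetsBar are hypothetical names introduced only for illustration.

```swift
// Hypothetical stand-in types; none of these come from the Evaluations framework.
struct SampleResult {
    let input: String
    let passed: Bool
}

/// Fraction of samples that passed, in the range 0...1.
/// Returns 0 for an empty dataset.
func passRate(of results: [SampleResult]) -> Double {
    guard !results.isEmpty else { return 0 }
    let passes = results.filter { $0.passed }.count
    return Double(passes) / Double(results.count)
}

/// A run meets the bar when its pass rate reaches a threshold, instead of
/// requiring every output to match an exact expected string.
func meetsBar(_ results: [SampleResult], threshold: Double) -> Bool {
    passRate(of: results) >= threshold
}
```

The threshold is a product decision rather than a property of the code: a feature might ship at a 0.9 pass rate on 30 samples, where a single exact-match assertion would have blocked it on the first differently worded response.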
Five Components, One Evaluation
The core API is built around five building blocks that compose into an Evaluation struct. The subject is the code under test. The dataset is an array of ModelSample instances, each carrying an input and optionally an expected output. The Metric defines what you are measuring—tag count, word length, sentiment label. The Evaluator turns a generated output into a pass or fail verdict. And aggregateMetrics rolls everything up into summary statistics across your full dataset.
The WWDC26 demo uses a BookTracker app where BookTaggingService automatically tags books from written reviews. A TagCount metric passes if the generated tags array has between three and eight items. That first useful evaluation takes roughly fifteen lines of Swift and runs inside Xcode's standard test report UI. Pair it with the Foundation Models @Guide macro on your @Generable type—which embeds count-range instructions directly in the output schema—and you have both the enforcement mechanism and the measurement tool in one workflow.
    struct TagEvaluation: Evaluation {
        typealias Subject = BookTaggingService

        var subject: BookTaggingService { BookTaggingService() }

        var dataset: [ModelSample<String, BookTags>] {
            [
                ModelSample(input: "A gripping dystopian novel set in 2047..."),
                ModelSample(input: "A beginner's guide to sourdough baking...")
            ]
        }

        func evaluator(for sample: ModelSample<String, BookTags>) -> some Evaluator<BookTags> {
            TagCountEvaluator() // Pass if tags.count is between 3 and 8
        }
    }
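The snippet above leaves the pass condition in a comment. As a rough illustration of what a tag-count check does, the following sketch implements the 3-to-8 rule with local types. SketchBookTags, Verdict, and SketchTagCountCheck are hypothetical stand-ins, not the framework's BookTags or TagCountEvaluator, whose real definitions the article does not show.

```swift
// Hypothetical local types for illustration only; the real BookTags and
// TagCountEvaluator are not reproduced here.
struct SketchBookTags {
    let tags: [String]
}

enum Verdict: Equatable {
    case pass
    case fail(reason: String)
}

struct SketchTagCountCheck {
    // Allowed number of tags; the 3...8 default mirrors the range in the article.
    var allowed: ClosedRange<Int> = 3...8

    /// Passes when the tag count falls inside `allowed`, otherwise fails
    /// with a reason that states the expected range and the actual count.
    func evaluate(_ output: SketchBookTags) -> Verdict {
        let count = output.tags.count
        if allowed.contains(count) { return .pass }
        return .fail(
            reason: "expected \(allowed.lowerBound) to \(allowed.upperBound) tags, got \(count)"
        )
    }
}
```

Returning a reason string alongside the failure is a design choice in this sketch: when a run produces eleven tags, the report says so directly instead of showing only a red mark.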
ModelJudgeEvaluator: When Pass/Fail Is Not Enough
Counting tags tells you something. It does not tell you whether the tags are any good. For qualitative evaluation, the framework ships ModelJudgeEvaluator: a second, more capable model (one running on Apple's Private Cloud Compute, or any model exposed through the LanguageModel protocol) reads your feature's output and scores it on dimensions you define.
Split quality into distinct ScoreDimensions rather than a single axis. Apple's example uses Relevance (is this tag accurate to the content?) and Usefulness (is this tag helpful for browsing?). Use an even-numbered scale—1 to 4, not 1 to 5—to force a non-neutral judgment. The real value is not the scores but the rationales the judge generates: “tag 'dystopian' is present but 'near-future thriller' is missing” is a prompt improvement instruction, not just a number. Relevance failures and Usefulness failures have different root causes, and splitting the dimensions surfaces that directly.
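A small sketch of why separate dimensions help: once each judge score carries its dimension and rationale, you can average per dimension and pull out only the rationales behind low scores. JudgeScore, meanScores, and lowScoreRationales are hypothetical names; the article does not show the actual shape of ModelJudgeEvaluator's output.

```swift
// Hypothetical types; the real ScoreDimension and judge output API are not
// shown in the article.
struct JudgeScore {
    let dimension: String   // e.g. "Relevance" or "Usefulness"
    let score: Int          // 1...4, an even-sized scale with no neutral midpoint
    let rationale: String   // the judge's explanation for the score
}

/// Mean score per dimension, so each dimension can be tracked separately.
func meanScores(_ scores: [JudgeScore]) -> [String: Double] {
    var sums: [String: (total: Int, count: Int)] = [:]
    for s in scores {
        let current = sums[s.dimension] ?? (total: 0, count: 0)
        sums[s.dimension] = (total: current.total + s.score, count: current.count + 1)
    }
    return sums.mapValues { Double($0.total) / Double($0.count) }
}

/// Rationales attached to low scores (1 or 2 on the 4-point scale) for one dimension.
func lowScoreRationales(_ scores: [JudgeScore], dimension: String) -> [String] {
    scores
        .filter { $0.dimension == dimension && $0.score <= 2 }
        .map { $0.rationale }
}
```

Filtering rationales by dimension keeps a batch of Relevance complaints from being buried under Usefulness complaints, which is the point of splitting the axes in the first place.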
The Hill-Climbing Loop
Evaluations only matter if they drive change. WWDC26 Session 335 formalizes the iteration pattern Apple calls hill-climbing: run evals, read the rationale output, identify the dominant failure pattern, adjust the prompt or @Guide instructions, re-run, check whether the aggregate metric moved. Repeat until you hit your quality target. This replaces the “run it in Simulator, squint at the output, ship it” workflow that most AI feature development currently follows.
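The loop can be sketched as a simple accept-if-better rule. In this hypothetical helper, the evaluate closure stands in for a full evaluation run that returns an aggregate pass rate between 0 and 1, and candidates are prompt revisions a developer wrote after reading the rationales. Nothing here is framework API.

```swift
// Hypothetical sketch: `evaluate` stands in for a full evaluation run that
// returns an aggregate pass rate (0...1) for a given prompt.
func hillClimb(
    start: String,
    candidates: [String],
    target: Double,
    evaluate: (String) -> Double
) -> (prompt: String, score: Double) {
    var best = (prompt: start, score: evaluate(start))
    for candidate in candidates {
        // Stop once the quality target has been reached.
        if best.score >= target { break }
        let score = evaluate(candidate)
        // Keep only changes that move the aggregate metric up.
        if score > best.score {
            best = (prompt: candidate, score: score)
        }
    }
    return best
}
```

The code only records the decision rule. The expensive step, reading failures and deciding what to change in the prompt or @Guide instructions, stays with the developer.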
For agentic features that call tools, Session 299 extends this with ToolCallEvaluator. Define trajectory expectations (which tools should be called, and in what order, for a given prompt), and the evaluator captures the full structured transcript and checks it against those expectations. The framework can also generate synthetic evaluation data from a small seed dataset, which matters because collecting 200 diverse samples by hand is not realistic.
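One way to picture a trajectory check is as an ordered subsequence match over the recorded tool calls. The sketch below assumes that interpretation; RecordedToolCall and followsTrajectory are hypothetical, and the article does not say whether ToolCallEvaluator tolerates extra calls between the expected ones.

```swift
// Hypothetical types; the article does not show ToolCallEvaluator's real API.
struct RecordedToolCall: Equatable {
    let name: String
}

/// True when every expected tool appears in the transcript in the expected
/// order. Extra calls in between are tolerated (a subsequence match).
func followsTrajectory(expected: [String], transcript: [RecordedToolCall]) -> Bool {
    var remaining = expected[...]
    for call in transcript {
        // Consume the next expected tool when the transcript reaches it.
        if let next = remaining.first, next == call.name {
            remaining = remaining.dropFirst()
        }
    }
    return remaining.isEmpty
}
```

A stricter variant would compare the two sequences for exact equality. Which one fits depends on whether incidental calls, such as logging, count as failures for your feature.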
What Apple Just Signaled
No other platform vendor ships a first-party LLM evaluation framework baked into the IDE. Apple is treating AI feature testing as a first-class practice, not an afterthought for third-party tooling to solve. If you are building with Foundation Models today and running zero evaluations, you now have no excuse—the tooling is in Xcode 27, three WWDC26 sessions walk through the full workflow, and the dataset you need is smaller than you think. Start with the BookTracker sample project, adapt the evaluation to your own feature, and get a measurable pass rate before you merge.













