Without evals, changes to `ai_context`, field definitions, or topic structure are guesswork. With evals, you can quantify the impact of every change, catch regressions before they reach production, and build confidence that the AI is improving.
This guide covers how to design, build, and maintain an eval set. It’s aimed at model developers and admins who are tuning their semantic model to improve AI query accuracy.
Requirements
To follow this guide, you’ll need:

- An understanding of Omni’s semantic model and topics
- Access to the model IDE
- Familiarity with `ai_context` and field configuration
Understanding evals
What is an eval?
An eval case pairs a natural-language question with the query the AI should produce in response. For example, “What’s our total revenue?” paired with the expected topic, fields, and filters. An eval set is a collection of these cases that you run against the AI to measure accuracy — typically expressed as the percentage of cases where the AI produced the expected query.
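Concretely, an eval case can be represented as a small record. This structure, including all of the key names, is an illustrative assumption rather than a prescribed Omni format:

```python
# An illustrative eval case. The exact schema is up to you; these key
# names and field identifiers are assumptions, not an official format.
eval_case = {
    "prompt": "What's our total revenue?",       # the natural-language question
    "expected": {
        "topic": "order_items",                  # topic the AI should select (assumed name)
        "fields": ["order_items.total_revenue"], # fields the query should contain
        "filters": {},                           # no filters expected for this prompt
    },
    "tags": ["basic-metric"],                    # category tags for later analysis
}
```

An eval set is then just a list of these records that a harness runs against the AI, scoring each case as pass or fail.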
Why are evals important?
Without evals, you have no way to know whether a model change improved or degraded AI accuracy. Evals let you quantify accuracy before and after changes, catch regressions from field renames or topic restructuring, and track improvement over time. They turn AI optimization from trial-and-error into a measurable process.
When should you run evals?
Run evals before merging model changes — `ai_context` updates, field additions, topic restructuring — to confirm accuracy isn’t regressing. Also run them after changing AI settings like thinking levels or query scope. Periodic runs (monthly or quarterly) help you monitor drift as your data and user questions evolve.

What makes a good eval set
A well-designed eval set has the following characteristics:

- Representative — reflects questions your users actually ask
- Broad coverage — spans multiple topics, not concentrated in one area
- Tiered complexity — includes simple lookups, filtered queries, multi-measure breakdowns, and edge cases
- Unambiguous expected results — each case has a clear “right answer”; if multiple correct queries exist, pick the canonical one
- Tagged — categorize cases so you can analyze accuracy by type
Building eval cases from existing queries
The most reliable way to build eval cases is to start from queries you know are correct.

Identify common questions
Check the AI usage dashboard or interview stakeholders to find the questions users ask most frequently.
Build the correct query
In an Omni workbook, construct the query that correctly answers the question.
Export the query JSON
Export the query JSON from the workbook to capture the exact field selections, filters, and sorts.
Map workbook JSON to eval case format
Translate the exported JSON into your eval case structure:
- `table` — use as context, but compare against the top-level `topic` from the API response
- `join_paths_from_topic_name` — maps to expected `topic`
- `column_name` / `sort_descending` in sorts — keep as-is for comparison
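The mapping above can be sketched as a small translation function. The input key names follow the workbook JSON export described in this step; the output shape (and the `fields`/`filters` keys on the input) are illustrative assumptions:

```python
def workbook_json_to_eval_case(prompt, workbook_query):
    """Translate an exported workbook query into an eval case.

    `join_paths_from_topic_name` maps to the expected topic; sorts are
    kept as-is (column_name / sort_descending) for comparison. The
    output structure is an assumed, simplified eval case shape.
    """
    return {
        "prompt": prompt,
        "expected": {
            "topic": workbook_query["join_paths_from_topic_name"],
            "fields": workbook_query.get("fields", []),
            "filters": workbook_query.get("filters", {}),
            "sorts": workbook_query.get("sorts", []),  # kept verbatim
        },
    }
```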
The `table` field in workbook JSON provides context, but the field you compare against in your eval is the top-level topic from the API response.

Write the natural language prompt
Write the question a user would ask that should produce this query. Use natural phrasing — avoid overly technical language that mirrors field names.
Example eval case categories
Once you have a process for building cases, categorize them for granular analysis. Tag each case with a category so you can identify which types of questions the AI handles well and where it struggles.

| Category | Tests | Example prompt |
|---|---|---|
| `basic-metric` | Single measure retrieval | "What's our total revenue?" |
| `time-series` | Date dimension + measure | "Revenue by month this year" |
| `top-n` | Sorting + limit | "Top 10 customers by spend" |
| `filtered` | Filter application | "Orders in California last quarter" |
| `multi-measure` | Multiple measures together | "Revenue and order count by category" |
| `cross-topic` | Correct topic selection | "How many users signed up?" (not orders topic) |
| `ambiguous` | Term disambiguation | "Show me sales" (revenue? count? both?) |
| `complex` | Multiple dimensions + filters + sorts | "Top 5 categories by revenue in Q4, excluding returns" |
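With tags in place, per-category accuracy is a simple aggregation. A minimal sketch, assuming your harness produces a list of results with `tags` and a `passed` flag (an assumed shape, not an Omni API):

```python
from collections import defaultdict

def accuracy_by_tag(results):
    """Aggregate pass/fail eval results into per-tag accuracy.

    `results` is a list of dicts like {"tags": ["top-n"], "passed": True};
    adjust the shape to match your own harness.
    """
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed_count, total_count]
    for result in results:
        for tag in result["tags"]:
            totals[tag][1] += 1
            if result["passed"]:
                totals[tag][0] += 1
    return {tag: passed / total for tag, (passed, total) in totals.items()}

print(accuracy_by_tag([
    {"tags": ["top-n"], "passed": True},
    {"tags": ["top-n"], "passed": False},
    {"tags": ["filtered"], "passed": True},
]))
# {'top-n': 0.5, 'filtered': 1.0}
```

A report like this makes it obvious where to spend optimization effort: a low score on one tag points at a specific class of prompts rather than the model as a whole.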
Sizing your eval set
Your eval set should grow alongside your optimization efforts. Start small and expand as you iterate.

- Start with 20-30 cases covering your core scenarios and most-used topics
- Expand to 50-100 as you optimize `ai_context` and `sample_queries`
- Prioritize breadth over depth early on — one case per topic is more valuable than ten cases for the same topic
- Add regression cases when you find failures in production — if a user reports a wrong answer, add it to the eval set
You’ll get more value from 30 well-chosen cases across different topics than 100 cases concentrated in a single area. Breadth reveals gaps; depth confirms what you already know.
Handling ambiguity
Not every question has a single correct answer. When your eval set includes ambiguous prompts, use one of these strategies to avoid false failures:

- Pick the canonical answer and accept some “false failures” — track these separately with an `ambiguous` tag
- Use a similarity threshold — instead of requiring an exact field match, accept results where most of the expected fields are present
- Add acceptable alternatives to eval cases for known ambiguous prompts:
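For example, a case for an ambiguous prompt can carry a list of alternatives that also count as a pass. The structure and the scoring helper are illustrative assumptions:

```python
# An eval case that accepts more than one correct answer for an ambiguous
# prompt. The shape is an illustrative assumption, not an Omni format.
ambiguous_case = {
    "prompt": "Show me sales",
    "tags": ["ambiguous"],
    "expected": {"topic": "orders", "fields": ["orders.total_revenue"]},
    # Any of these results is also scored as a pass:
    "acceptable_alternatives": [
        {"topic": "orders", "fields": ["orders.order_count"]},
        {"topic": "orders", "fields": ["orders.total_revenue", "orders.order_count"]},
    ],
}

def is_pass(actual, case):
    """Pass if the actual query matches the expected one or any alternative."""
    candidates = [case["expected"]] + case.get("acceptable_alternatives", [])
    return any(actual == candidate for candidate in candidates)
```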
Golden query validation
Before trusting your expected queries, validate them to make sure they still work against your current model.

Execute expected queries
Run each expected query via the Query API to confirm it executes without errors.
Confirm model validity
Verify that the topic and fields referenced in your eval still exist in the semantic model.
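This check can be a simple comparison of each case against the current model definition. The sketch below assumes a simplified model representation (a dict mapping topic names to their available fields, e.g. built from a Models API export); the eval case shape is also an assumption:

```python
def find_invalid_cases(eval_cases, model):
    """Return (prompt, reason) pairs for cases whose expected topic or
    fields no longer exist in the semantic model.

    `model` is an assumed, simplified shape: {topic_name: [field_names]}.
    """
    invalid = []
    for case in eval_cases:
        expected = case["expected"]
        topic_fields = model.get(expected["topic"])
        if topic_fields is None:
            invalid.append((case["prompt"], "missing topic"))
            continue
        missing = [f for f in expected["fields"] if f not in topic_fields]
        if missing:
            invalid.append((case["prompt"], f"missing fields: {missing}"))
    return invalid
```

Running this after every model restructuring surfaces stale cases before they show up as false failures in an eval run.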
Maintaining eval cases
Your eval set is a living artifact. As your model evolves, your evals need to keep pace.

- Update expected results when the model changes — renamed views, new fields, restructured topics
- Version your eval set alongside model snapshots — use the Models API to capture the model definition at each point in time
- Remove cases that become invalid due to model restructuring rather than letting them produce false failures
- Add regression cases when a previously-correct query breaks — this prevents the same issue from recurring
Review your eval set quarterly. Remove cases that no longer represent real user questions and add new ones based on recent AI usage patterns.
Next steps
Optimize models for Omni AI
Add ai_context, sample queries, and field metadata to improve AI accuracy.
Learn from conversation
Let the AI capture business context from your conversations.
Model AI settings
Configure AI query scope, validation, and thinking levels.
Query API
Use the Query API to programmatically validate golden queries.