An eval (short for evaluation) measures how accurately Omni’s AI translates natural-language questions into semantic queries. Each eval case pairs a question — like “What’s our total revenue?” — with the expected query the AI should produce. Running a collection of these cases gives you an accuracy score you can track over time.

Evals are essential when you’re optimizing your models for AI. Without them, changes to ai_context, field definitions, or topic structure are guesswork. With evals, you can quantify the impact of every change, catch regressions before they reach production, and build confidence that the AI is improving.

This guide covers how to design, build, and maintain an eval set. It’s aimed at model developers and admins who are tuning their semantic model to improve AI query accuracy.

Requirements

To follow this guide, you’ll need:

Understanding evals

An eval case pairs a natural-language question with the query the AI should produce in response. For example, “What’s our total revenue?” paired with the expected topic, fields, and filters. An eval set is a collection of these cases that you run against the AI to measure accuracy — typically expressed as the percentage of cases where the AI produced the expected query.
Without evals, you have no way to know whether a model change improved or degraded AI accuracy. Evals let you quantify accuracy before and after changes, catch regressions from field renames or topic restructuring, and track improvement over time. They turn AI optimization from trial-and-error into a measurable process.
Run evals before merging model changes — ai_context updates, field additions, topic restructuring — to confirm accuracy isn’t regressing. Also run them after changing AI settings like thinking levels or query scope. Periodic runs (monthly or quarterly) help you monitor drift as your data and user questions evolve.
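The accuracy score described above can be sketched in a few lines. Everything here is illustrative: the case structure (`prompt`, `expected`) and the `ask_ai` callable are stand-ins, not Omni's actual formats.

```python
# Minimal sketch of scoring an eval run. The case shape and `ask_ai`
# are illustrative stand-ins, not Omni's actual API or format.
def score_eval_run(cases, ask_ai):
    passed = 0
    for case in cases:
        # ask_ai returns {"topic": ..., "fields": [...]} in this sketch
        actual = ask_ai(case["prompt"])
        expected = case["expected"]
        if (actual["topic"] == expected["topic"]
                and set(actual["fields"]) == set(expected["fields"])):
            passed += 1
    return passed / len(cases)  # accuracy as a fraction of cases

cases = [
    {"prompt": "What's our total revenue?",
     "expected": {"topic": "order_items",
                  "fields": ["order_items.total_revenue"]}},
]
fake_ai = lambda prompt: {"topic": "order_items",
                          "fields": ["order_items.total_revenue"]}
print(score_eval_run(cases, fake_ai))  # 1.0
```

Comparing field sets rather than ordered lists avoids failing cases where the AI selects the right fields in a different order.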

What makes a good eval set

A well-designed eval set has the following characteristics:
  • Representative — reflects questions your users actually ask
  • Broad coverage — spans multiple topics, not concentrated in one area
  • Tiered complexity — includes simple lookups, filtered queries, multi-measure breakdowns, and edge cases
  • Unambiguous expected results — each case has a clear “right answer”; if multiple correct queries exist, pick the canonical one
  • Tagged — categorize cases so you can analyze accuracy by type
Use the Analytics > AI usage dashboard in Omni to see what questions your users are actually asking. This is the best source for building representative eval cases.

Building eval cases from existing queries

The most reliable way to build eval cases is to start from queries you know are correct.
1. Identify common questions

Check the AI usage dashboard or interview stakeholders to find the questions users ask most frequently.
2. Build the correct query

In an Omni workbook, construct the query that correctly answers the question.
3. Export the query JSON

Export the query JSON from the workbook to capture the exact field selections, filters, and sorts.
4. Map workbook JSON to eval case format

Translate the exported JSON into your eval case structure:
  • table — use only as context; the field to compare against in your eval is the top-level topic from the API response
  • join_paths_from_topic_name — maps to the expected topic
  • column_name / sort_descending in sorts — keep as-is for comparison
5. Write the natural language prompt

Write the question a user would ask that should produce this query. Use natural phrasing — avoid overly technical language that mirrors field names.
6. Add to your eval file

Add the completed case to your eval set with appropriate tags and category.
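The mapping in step 4 can be sketched as a small translation function. The input keys (`table`, `join_paths_from_topic_name`, `sorts` with `column_name` / `sort_descending`) follow the mapping above; the presence of a `fields` key and the output eval-case shape are illustrative assumptions.

```python
# Sketch of step 4: translate exported workbook JSON into an eval case.
# Input keys follow the mapping described above; the `fields` key and
# the output case shape are illustrative assumptions, not Omni's spec.
def workbook_json_to_eval_case(case_id, prompt, wb):
    return {
        "id": case_id,
        "prompt": prompt,
        "expected": {
            # Compare against the topic, not wb["table"] (context only)
            "topic": wb["join_paths_from_topic_name"],
            "fields": wb["fields"],
            "filters": wb.get("filters", {}),
            "sorts": [
                {"field": s["column_name"], "desc": s["sort_descending"]}
                for s in wb.get("sorts", [])
            ],
        },
        "tags": [],
    }

wb = {
    "table": "order_items",
    "join_paths_from_topic_name": "order_items",
    "fields": ["order_items.total_revenue"],
    "sorts": [{"column_name": "order_items.total_revenue",
               "sort_descending": True}],
}
case = workbook_json_to_eval_case(
    "total-revenue", "What's our total revenue?", wb)
```

Keeping the translation in one function makes it easy to regenerate eval cases in bulk whenever you re-export queries.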

Example eval case categories

Once you have a process for building cases, categorize them for granular analysis. Tag each case with a category so you can identify which types of questions the AI handles well and where it struggles.
| Category | Tests | Example prompt |
|---|---|---|
| basic-metric | Single measure retrieval | "What's our total revenue?" |
| time-series | Date dimension + measure | "Revenue by month this year" |
| top-n | Sorting + limit | "Top 10 customers by spend" |
| filtered | Filter application | "Orders in California last quarter" |
| multi-measure | Multiple measures together | "Revenue and order count by category" |
| cross-topic | Correct topic selection | "How many users signed up?" (not the orders topic) |
| ambiguous | Term disambiguation | "Show me sales" (revenue? count? both?) |
| complex | Multiple dimensions + filters + sorts | "Top 5 categories by revenue in Q4, excluding returns" |
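Once each case carries a category, results can be broken down per category so weak areas stand out. A sketch, where each result record (category, passed) is an illustrative shape:

```python
from collections import defaultdict

# Sketch: break eval results down by category so weak areas stand out.
# Each result record is an illustrative (category, passed) pair.
def accuracy_by_category(results):
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {c: p / t for c, (p, t) in totals.items()}

results = [
    ("basic-metric", True), ("basic-metric", True),
    ("cross-topic", False), ("cross-topic", True),
]
print(accuracy_by_category(results))
# {'basic-metric': 1.0, 'cross-topic': 0.5}
```

A breakdown like this tells you whether a regression is concentrated in, say, cross-topic selection rather than spread across the whole set.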

Sizing your eval set

Your eval set should grow alongside your optimization efforts. Start small and expand as you iterate.
  • Start with 20-30 cases covering your core scenarios and most-used topics
  • Expand to 50-100 as you optimize ai_context and sample_queries
  • Prioritize breadth over depth early on — one case per topic is more valuable than ten cases for the same topic
  • Add regression cases when you find failures in production — if a user reports a wrong answer, add it to the eval set
You’ll get more value from 30 well-chosen cases across different topics than 100 cases concentrated in a single area. Breadth reveals gaps; depth confirms what you already know.

Handling ambiguity

Not every question has a single correct answer. When your eval set includes ambiguous prompts, use one of these strategies to avoid false failures:
  • Pick the canonical answer and accept some “false failures” — track these separately with an ambiguous tag
  • Use a similarity threshold — instead of requiring an exact field match, accept results where most of the expected fields are present
  • Add acceptable alternatives to eval cases for known ambiguous prompts:
{
  "id": "show-sales",
  "prompt": "Show me sales",
  "expected": {
    "topic": "order_items",
    "fields": ["order_items.total_revenue"]
  },
  "acceptable_alternatives": [
    { "fields": ["order_items.count"] },
    { "fields": ["order_items.total_revenue", "order_items.count"] }
  ],
  "tags": ["ambiguous"]
}
Tag ambiguous cases separately so they don’t skew your overall accuracy metrics. This lets you report “core accuracy” and “ambiguous accuracy” independently.
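A grader for ambiguous cases can combine the similarity-threshold and acceptable-alternatives strategies. This sketch assumes the case format shown above; the threshold value and matching logic are illustrative choices, not a prescribed rule:

```python
# Sketch combining two strategies from above: a field-overlap threshold
# against the canonical answer, then a fallback to listed alternatives.
def fields_match(expected, actual, threshold=1.0):
    # Jaccard-style overlap between expected and actual field sets
    e, a = set(expected), set(actual)
    if not e and not a:
        return True
    return len(e & a) / len(e | a) >= threshold

def grade_case(case, actual, threshold=0.8):
    # Threshold match against the canonical expected query
    if (actual["topic"] == case["expected"]["topic"]
            and fields_match(case["expected"]["fields"],
                             actual["fields"], threshold)):
        return True
    # Fall back to any listed acceptable alternative (exact field match)
    for alt in case.get("acceptable_alternatives", []):
        if set(actual["fields"]) == set(alt["fields"]):
            return True
    return False

case = {
    "expected": {"topic": "order_items",
                 "fields": ["order_items.total_revenue"]},
    "acceptable_alternatives": [{"fields": ["order_items.count"]}],
}
print(grade_case(case, {"topic": "order_items",
                        "fields": ["order_items.count"]}))  # True
```

With this grader, "Show me sales" passes whether the AI returns revenue or the listed count alternative, while unrelated answers still fail.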

Golden query validation

Before trusting your expected queries, validate them to make sure they still work against your current model.
1. Execute expected queries

Run each expected query via the Query API to confirm it executes without errors.
2. Verify returned data

Check that results are non-empty and contain reasonable values.
3. Confirm model validity

Verify that the topic and fields referenced in your eval still exist in the semantic model.
This catches stale expected results that reference renamed or removed fields. Run validation after any model changes.
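The three validation steps can be wired into one helper. Here `run_query` stands in for a call to the Query API (the call signature and response shape are assumptions), and `model_fields` is the set of fields currently in the semantic model:

```python
# Sketch of golden-query validation. `run_query` stands in for a call
# to the Query API (signature and response shape are assumptions here);
# it returns rows for an expected query and raises on error.
def validate_golden_query(case, run_query, model_fields):
    expected = case["expected"]
    # Step 3: confirm referenced fields still exist in the model
    missing = [f for f in expected["fields"] if f not in model_fields]
    if missing:
        return f"stale fields: {missing}"
    # Steps 1-2: execute the query and check for non-empty results
    try:
        rows = run_query(expected)
    except Exception as err:
        return f"query failed: {err}"
    if not rows:
        return "query returned no rows"
    return "ok"

model_fields = {"order_items.total_revenue"}
fake_run = lambda q: [{"order_items.total_revenue": 1234.5}]
case = {"expected": {"topic": "order_items",
                     "fields": ["order_items.total_revenue"]}}
print(validate_golden_query(case, fake_run, model_fields))  # ok
```

Running this across the whole eval set after a model change surfaces every stale case in one pass instead of letting it show up as a false failure later.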

Maintaining eval cases

Your eval set is a living artifact. As your model evolves, your evals need to keep pace.
  • Update expected results when the model changes — renamed views, new fields, restructured topics
  • Version your eval set alongside model snapshots — use the Models API to capture the model definition at each point in time
  • Remove cases that become invalid due to model restructuring rather than letting them produce false failures
  • Add regression cases when a previously-correct query breaks — this prevents the same issue from recurring
Review your eval set quarterly. Remove cases that no longer represent real user questions and add new ones based on recent AI usage patterns.
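The first two maintenance points can be partly automated: diff the eval set against a snapshot of the model's field list (for example, one captured with the Models API; the snapshot format here is an illustrative flat set of field names):

```python
# Sketch: flag eval cases invalidated by model changes by diffing them
# against a snapshot of the model's field list (e.g. captured via the
# Models API; the flat field-set format here is illustrative).
def stale_cases(eval_set, snapshot_fields):
    stale = []
    for case in eval_set:
        refs = set(case["expected"]["fields"])
        if not refs <= snapshot_fields:  # any referenced field missing?
            stale.append(case["id"])
    return stale

eval_set = [
    {"id": "total-revenue",
     "expected": {"fields": ["order_items.total_revenue"]}},
    {"id": "old-margin",
     "expected": {"fields": ["order_items.gross_margin"]}},  # removed field
]
snapshot = {"order_items.total_revenue", "order_items.count"}
print(stale_cases(eval_set, snapshot))  # ['old-margin']
```

Flagged cases are candidates to update (if the field was renamed) or remove (if the concept no longer exists), rather than leaving them to produce false failures.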

Next steps

Optimize models for Omni AI

Add ai_context, sample queries, and field metadata to improve AI accuracy.

Learn from conversation

Let the AI capture business context from your conversations.

Model AI settings

Configure AI query scope, validation, and thinking levels.

Query API

Use the Query API to programmatically validate golden queries.