# Eval design guide

> Best practices for building effective eval sets to measure and improve Omni AI query generation accuracy.

An eval (short for evaluation) measures how accurately Omni's AI translates natural-language questions into semantic queries. Each eval case pairs a question — like "What's our total revenue?" — with the expected query the AI should produce. Running a collection of these cases gives you an accuracy score you can track over time.

Evals are essential when you're [optimizing your models for AI](/modeling/develop/ai-optimization). Without them, changes to `ai_context`, field definitions, or topic structure are guesswork. With evals, you can quantify the impact of every change, catch regressions before they reach production, and build confidence that the AI is improving.

This guide covers how to design, build, and maintain an eval set. It's aimed at model developers and admins who are tuning their semantic model to improve AI query accuracy.

## Requirements

To follow this guide, you'll need:

* An understanding of Omni's [semantic model](/modeling) and [topics](/modeling/models/topics)
* Access to the [model IDE](/modeling/develop/model-management)
* Familiarity with [`ai_context`](/modeling/develop/ai-optimization) and field configuration

## Understanding evals

<AccordionGroup>
  <Accordion title="What is an eval?">
    An eval case pairs a natural-language question with the query the AI should produce in response. For example, "What's our total revenue?" paired with the expected topic, fields, and filters. An eval set is a collection of these cases that you run against the AI to measure accuracy — typically expressed as the percentage of cases where the AI produced the expected query.
  </Accordion>

  <Accordion title="Why are evals important?">
    Without evals, you have no way to know whether a model change improved or degraded AI accuracy. Evals let you quantify accuracy before and after changes, catch regressions from field renames or topic restructuring, and track improvement over time. They turn AI optimization from trial-and-error into a measurable process.
  </Accordion>

  <Accordion title="When should you run evals?">
    Run evals before merging model changes — `ai_context` updates, field additions, topic restructuring — to confirm accuracy isn't regressing. Also run them after changing [AI settings](/modeling/models/ai-settings) like thinking levels or query scope. Periodic runs (monthly or quarterly) help you monitor drift as your data and user questions evolve.
  </Accordion>
</AccordionGroup>
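
As a rough illustration of the accuracy score and regression checks described above, the following Python sketch computes the percentage of passing cases and lists cases that passed in an earlier run but fail in a later one. The `EvalResult` class and both function names are hypothetical and not part of Omni; how you decide whether a case passed is up to your own harness.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Outcome of one eval case (hypothetical structure for illustration)."""
    case_id: str
    passed: bool


def accuracy(results: list[EvalResult]) -> float:
    """Fraction of cases where the AI produced the expected query."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def regressions(baseline: list[EvalResult], candidate: list[EvalResult]) -> list[str]:
    """IDs of cases that passed in the baseline run but fail in the candidate run."""
    passed_before = {r.case_id for r in baseline if r.passed}
    return sorted(
        r.case_id for r in candidate
        if not r.passed and r.case_id in passed_before
    )


# Illustrative results only, not real measurements.
baseline = [
    EvalResult("total-revenue", True),
    EvalResult("top-customers", True),
    EvalResult("show-sales", False),
]
candidate = [
    EvalResult("total-revenue", True),
    EvalResult("top-customers", False),
    EvalResult("show-sales", True),
]

print(f"baseline accuracy:  {accuracy(baseline):.0%}")
print(f"candidate accuracy: {accuracy(candidate):.0%}")
print("regressions:", regressions(baseline, candidate))
```

In this sample both runs score the same overall, yet one case regressed. Checking regressions per case, not only the headline score, helps surface that kind of change before a merge.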

## What makes a good eval set

A well-designed eval set has the following characteristics:

* **Representative** — reflects questions your users actually ask
* **Broad coverage** — spans multiple topics, not concentrated in one area
* **Tiered complexity** — includes simple lookups, filtered queries, multi-measure breakdowns, and edge cases
* **Unambiguous expected results** — each case has a clear "right answer"; if multiple correct queries exist, pick the canonical one
* **Tagged** — categorize cases so you can analyze accuracy by type

<Tip>
  Use the **Analytics > AI usage** dashboard in Omni to see what questions your users are actually asking. This is the best source for building representative eval cases.
</Tip>

## Building eval cases from existing queries

The most reliable way to build eval cases is to start from queries you know are correct.

<Steps>
  <Step title="Identify common questions">
    Check the AI usage dashboard or interview stakeholders to find the questions users ask most frequently.
  </Step>

  <Step title="Build the correct query">
    In an Omni workbook, construct the query that correctly answers the question.
  </Step>

  <Step title="Export the query JSON">
    Export the query JSON from the workbook to capture the exact field selections, filters, and sorts.
  </Step>

  <Step title="Map workbook JSON to eval case format">
    Translate the exported JSON into your eval case structure:

    * `table`: context only. Don't compare against it directly
    * `join_paths_from_topic_name`: maps to the expected `topic`, which you compare against the top-level `topic` in the API response
    * `column_name` / `sort_descending` in sorts: keep as-is for comparison

    <Note>
      The `table` field in workbook JSON provides context, but the field you compare against in your eval is the top-level `topic` from the API response.
    </Note>
  </Step>

  <Step title="Write the natural language prompt">
    Write the question a user would ask that should produce this query. Use natural phrasing — avoid overly technical language that mirrors field names.
  </Step>

  <Step title="Add to your eval file">
    Add the completed case to your eval set with appropriate tags and category.
  </Step>
</Steps>
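
The mapping step above could be scripted along these lines. This is a minimal sketch under stated assumptions: `table`, `join_paths_from_topic_name`, `column_name`, and `sort_descending` are named in this guide, but the `fields`, `filters`, and `sorts` keys on the exported JSON are assumed here, and the eval case layout (`id`, `prompt`, `expected`, `tags`) follows the example in the Handling ambiguity section. Check the keys against your own export before relying on it.

```python
import json


def workbook_query_to_eval_case(case_id, prompt, workbook_query, tags):
    """Build an eval case dict from exported workbook query JSON.

    Assumes the export has `fields`, `filters`, and `sorts` keys (hypothetical).
    """
    expected = {
        # The expected topic comes from join_paths_from_topic_name, not table.
        "topic": workbook_query["join_paths_from_topic_name"],
        "fields": list(workbook_query.get("fields", [])),
        "filters": dict(workbook_query.get("filters", {})),
        # Sort keys are kept as-is so they can be compared directly.
        "sorts": [
            {
                "column_name": s["column_name"],
                "sort_descending": s["sort_descending"],
            }
            for s in workbook_query.get("sorts", [])
        ],
    }
    return {"id": case_id, "prompt": prompt, "expected": expected, "tags": list(tags)}


# Illustrative export; field names are made up for this example.
exported = {
    "table": "order_items",
    "join_paths_from_topic_name": "order_items",
    "fields": ["users.name", "order_items.total_revenue"],
    "filters": {},
    "sorts": [{"column_name": "order_items.total_revenue", "sort_descending": True}],
}

case = workbook_query_to_eval_case(
    "top-customers", "Top 10 customers by spend", exported, ["top-n"]
)
print(json.dumps(case, indent=2))
```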

## Example eval case categories

Once you have a process for building cases, categorize them for granular analysis. Tag each case with a category so you can identify which types of questions the AI handles well and where it struggles.

| Category        | Tests                                 | Example prompt                                         |
| --------------- | ------------------------------------- | ------------------------------------------------------ |
| `basic-metric`  | Single measure retrieval              | "What's our total revenue?"                            |
| `time-series`   | Date dimension + measure              | "Revenue by month this year"                           |
| `top-n`         | Sorting + limit                       | "Top 10 customers by spend"                            |
| `filtered`      | Filter application                    | "Orders in California last quarter"                    |
| `multi-measure` | Multiple measures together            | "Revenue and order count by category"                  |
| `cross-topic`   | Correct topic selection               | "How many users signed up?" (not orders topic)         |
| `ambiguous`     | Term disambiguation                   | "Show me sales" (revenue? count? both?)                |
| `complex`       | Multiple dimensions + filters + sorts | "Top 5 categories by revenue in Q4, excluding returns" |
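
To see which categories the AI handles well, you might group results by tag. The sketch below assumes each result has already been reduced to a pair of category and pass/fail flag; the function name and the sample data are illustrative, not output from a real run.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return a dict of category to accuracy from (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Illustrative results only, not real measurements.
sample_results = [
    ("basic-metric", True),
    ("basic-metric", True),
    ("top-n", True),
    ("top-n", False),
    ("ambiguous", False),
]

by_category = accuracy_by_category(sample_results)
for category, score in sorted(by_category.items()):
    print(f"{category:<14} {score:.0%}")
```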

## Sizing your eval set

Your eval set should grow alongside your optimization efforts. Start small and expand as you iterate.

* **Start with 20-30 cases** covering your core scenarios and most-used topics
* **Expand to 50-100** as you optimize [`ai_context`](/modeling/develop/ai-optimization) and `sample_queries`
* **Prioritize breadth over depth** early on — one case per topic is more valuable than ten cases for the same topic
* **Add regression cases** when you find failures in production — if a user reports a wrong answer, add it to the eval set

<Note>
  You'll get more value from 30 well-chosen cases across different topics than 100 cases concentrated in a single area. Breadth reveals gaps; depth confirms what you already know.
</Note>
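
One way to check breadth is to count cases per topic and list the topics that have none. This sketch assumes eval cases shaped like the JSON example in the Handling ambiguity section and a list of topic names that you supply yourself; the topic names below are made up.

```python
from collections import Counter


def topic_coverage(cases, model_topics):
    """Return (cases per topic, sorted list of topics with no eval case)."""
    counts = Counter(case["expected"]["topic"] for case in cases)
    uncovered = sorted(set(model_topics) - set(counts))
    return counts, uncovered


# Illustrative cases and topic names.
cases = [
    {"id": "total-revenue", "expected": {"topic": "order_items"}},
    {"id": "top-customers", "expected": {"topic": "order_items"}},
    {"id": "signups", "expected": {"topic": "users"}},
]
model_topics = ["order_items", "users", "inventory"]

counts, uncovered = topic_coverage(cases, model_topics)
print("cases per topic:", dict(counts))
print("topics with no cases:", uncovered)
```

Topics in the uncovered list are good candidates for your next cases, since one case per topic is worth more early on than extra cases for a topic you already cover.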

## Handling ambiguity

Not every question has a single correct answer. When your eval set includes ambiguous prompts, use one of these strategies to keep false failures from distorting your results:

* **Pick the canonical answer** and accept some "false failures" — track these separately with an `ambiguous` tag
* **Use a similarity threshold** — instead of requiring an exact field match, accept results where most of the expected fields are present
* **Add acceptable alternatives** to eval cases for known ambiguous prompts:

```json
{
  "id": "show-sales",
  "prompt": "Show me sales",
  "expected": {
    "topic": "order_items",
    "fields": ["order_items.total_revenue"]
  },
  "acceptable_alternatives": [
    { "fields": ["order_items.count"] },
    { "fields": ["order_items.total_revenue", "order_items.count"] }
  ],
  "tags": ["ambiguous"]
}
```

<Tip>
  Tag ambiguous cases separately so they don't skew your overall accuracy metrics. This lets you report "core accuracy" and "ambiguous accuracy" independently.
</Tip>
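
A comparison function that supports both acceptable alternatives and a similarity threshold might look like the sketch below. It is one possible interpretation, not Omni's scoring logic: `case_passes` and `field_overlap` are hypothetical names, alternatives are assumed to share the canonical answer's topic, and "most of the expected fields" is read as the fraction of expected fields found in the AI's result.

```python
def field_overlap(expected_fields, actual_fields):
    """Fraction of expected fields that appear in the actual result."""
    expected = set(expected_fields)
    if not expected:
        return 1.0
    return len(expected & set(actual_fields)) / len(expected)


def case_passes(case, actual, threshold=None):
    """Compare an AI-generated query with the canonical answer and alternatives.

    Passes on an exact field match with the expected query or any alternative.
    If a threshold is given, also passes when enough expected fields are present.
    """
    expected = case["expected"]
    # Assumption: alternatives use the same topic as the canonical answer.
    if actual.get("topic") != expected["topic"]:
        return False
    actual_fields = set(actual.get("fields", []))
    accepted = [expected["fields"]] + [
        alt["fields"] for alt in case.get("acceptable_alternatives", [])
    ]
    if any(set(fields) == actual_fields for fields in accepted):
        return True
    if threshold is None:
        return False
    return field_overlap(expected["fields"], actual_fields) >= threshold


show_sales = {
    "id": "show-sales",
    "prompt": "Show me sales",
    "expected": {"topic": "order_items", "fields": ["order_items.total_revenue"]},
    "acceptable_alternatives": [
        {"fields": ["order_items.count"]},
        {"fields": ["order_items.total_revenue", "order_items.count"]},
    ],
    "tags": ["ambiguous"],
}

ai_query = {"topic": "order_items", "fields": ["order_items.count"]}
print(case_passes(show_sales, ai_query))
```

Leaving `threshold` unset keeps matching strict. Setting it to a value such as `0.6` lets a result pass when at least 60% of the expected fields are present, which trades some precision for fewer false failures.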

## Golden query validation

Before trusting your expected queries, validate them to make sure they still work against your current model.

<Steps>
  <Step title="Execute expected queries">
    Run each expected query via the [Query API](/api/queries/run-query) to confirm it executes without errors.
  </Step>

  <Step title="Verify returned data">
    Check that results are non-empty and contain reasonable values.
  </Step>

  <Step title="Confirm model validity">
    Verify that the topic and fields referenced in your eval still exist in the [semantic model](/modeling).
  </Step>
</Steps>

<Tip>
  Validation catches stale expected results that reference renamed or removed fields. Run it after any model change.
</Tip>
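
The validation steps above can be sketched as a loop over your eval cases. The `run_query` argument is a hypothetical callable that you would implement as a wrapper around the Query API; its signature here is an assumption, not the API's actual interface. `model_fields` is a set of field names you collect from your current model.

```python
def validate_golden_queries(cases, run_query, model_fields):
    """Return a dict of case ID to a list of problems found for that case."""
    problems = {}
    for case in cases:
        issues = []
        expected = case["expected"]
        # Check that referenced fields still exist before running anything.
        missing = [f for f in expected["fields"] if f not in model_fields]
        if missing:
            issues.append(f"unknown fields: {missing}")
        else:
            try:
                rows = run_query(expected)  # hypothetical Query API wrapper
            except Exception as exc:
                issues.append(f"query failed: {exc}")
            else:
                if not rows:
                    issues.append("query returned no rows")
        if issues:
            problems[case["id"]] = issues
    return problems


def fake_run_query(expected):
    """Stand-in for a real Query API call, used only for this example."""
    if expected["topic"] == "empty_topic":
        return []
    return [{"value": 1}]


cases = [
    {"id": "ok", "expected": {"topic": "order_items", "fields": ["order_items.count"]}},
    {"id": "stale", "expected": {"topic": "order_items", "fields": ["order_items.old_name"]}},
    {"id": "empty", "expected": {"topic": "empty_topic", "fields": ["order_items.count"]}},
]

report = validate_golden_queries(cases, fake_run_query, {"order_items.count"})
for case_id, issues in report.items():
    print(case_id, issues)
```

Checking whether values are reasonable, as opposed to merely non-empty, depends on your data and is left out of this sketch.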

## Maintaining eval cases

Your eval set is a living artifact. As your model evolves, your evals need to keep pace.

* **Update expected results when the model changes** — renamed views, new fields, restructured topics
* **Version your eval set alongside model snapshots** — use the [Models API](/api/models/get-model-yaml) to capture the model definition at each point in time
* **Remove cases that become invalid** due to model restructuring rather than letting them produce false failures
* **Add regression cases** when a previously-correct query breaks — this prevents the same issue from recurring

<Note>
  Review your eval set quarterly. Remove cases that no longer represent real user questions and add new ones based on recent AI usage patterns.
</Note>
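
Two of the maintenance tasks above, updating expected results after renames and retiring cases whose topic no longer exists, lend themselves to small scripts. The sketch below assumes eval cases shaped like the JSON example in the Handling ambiguity section. The function names, the rename map, and the topic names are all hypothetical.

```python
import copy


def apply_field_renames(cases, renames):
    """Return copies of the cases with renamed fields updated in expected results."""
    updated = []
    for case in cases:
        new_case = copy.deepcopy(case)
        new_case["expected"]["fields"] = [
            renames.get(field, field) for field in case["expected"]["fields"]
        ]
        updated.append(new_case)
    return updated


def split_valid(cases, model_topics):
    """Separate cases whose topic still exists from cases to retire."""
    topics = set(model_topics)
    valid = [c for c in cases if c["expected"]["topic"] in topics]
    retired = [c for c in cases if c["expected"]["topic"] not in topics]
    return valid, retired


# Illustrative cases and names.
cases = [
    {"id": "total-revenue", "expected": {"topic": "order_items", "fields": ["order_items.revenue"]}},
    {"id": "legacy-report", "expected": {"topic": "old_orders", "fields": ["old_orders.count"]}},
]

renamed = apply_field_renames(cases, {"order_items.revenue": "order_items.total_revenue"})
valid, retired = split_valid(renamed, ["order_items", "users"])

print("valid:", [c["id"] for c in valid])
print("retired:", [c["id"] for c in retired])
```

Because `apply_field_renames` returns copies, the original cases stay unchanged, which makes it easier to keep the old version alongside the matching model snapshot.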

## Next steps

<CardGroup cols={2}>
  <Card title="Optimize models for Omni AI" icon="brain-circuit" href="/modeling/develop/ai-optimization">
    Add `ai_context`, sample queries, and field metadata to improve AI accuracy.
  </Card>

  <Card title="Learn from conversation" icon="brain" href="/ai/learn-from-conversation">
    Let the AI capture business context from your conversations.
  </Card>

  <Card title="Model AI settings" icon="sliders" href="/modeling/models/ai-settings">
    Configure AI query scope, validation, thinking levels, and context management (Claude only).
  </Card>

  <Card title="Query API" icon="code" href="/api/queries/run-query">
    Use the Query API to programmatically validate golden queries.
  </Card>
</CardGroup>
