Evals - Omni Docs

Evals measure how accurately Omni’s AI answers natural-language questions about your data. You define a prompt set (a reusable collection of questions scoped to a single Omni model), then execute an eval run that scores each AI response with a built-in accuracy judge. Because a run targets a specific model branch, evals let you measure the impact of data modeling and curation changes (new fields and joins, or ai_context and other context adjustments) before promoting them to your shared model. Use evals to catch regressions before they reach users and to compare results across branches. Evals live in the AI Hub, under the Prompt sets and Eval runs tabs.

Requirements

At least Querier access on the shared model you want to evaluate (Organization Admins and AI admins also have access)
At least one topic optimized for AI — see Optimizing models for Omni AI

Prompt sets

A prompt set is a reusable list of up to 25 natural-language prompts that you run against the AI together. You can create as many prompt sets as you need — for example, one per topic, one per release, or one for regression coverage.

Manage prompts

From the Prompt sets tab in the AI Hub, you can:

Create a prompt set and give it a descriptive name
Add or edit prompts in an existing set, up to the 25-prompt limit
Delete a prompt you no longer need
Save changes to persist edits
Archive a prompt set to remove it from the active list without losing it. Archived sets can be restored.

What makes an effective prompt set

Strong prompt sets are representative, broad, and unambiguous:

Mirror real user questions — use the AI usage analytics dashboard to find what users actually ask
Favor breadth over depth — covering several topics with one prompt each yields more signal than ten prompts on the same topic
Phrase prompts naturally — write the way a user would ask, not in language that mirrors field names
Add regression prompts when an AI answer turns out to be wrong, so the same failure is caught next time
Start small and expand toward the 25-prompt cap — begin with 10–15 prompts that cover your core scenarios, then iterate on ai_context and sample_queries and add more prompts as you go

Expected behavior

By default, the accuracy judge decides whether the AI’s response is a correct, reasonable answer to the prompt on its own terms. You can optionally add expected behavior to any prompt: a reference answer that states the result you expect. The judge then checks the AI’s analysis against it. Expected behavior is useful when a prompt has a known correct answer you want to pin down, such as a specific value or ranking (“The top product by revenue should be Aniseed Syrup.”), a direction (“Revenue should be up year over year.”), or a breakdown the answer must include. When a prompt defines expected behavior, the judge:

Treats a material divergence as a failing result — wrong numbers, the wrong direction, or a required result that’s missing. Differences in wording, formatting, or extra detail don’t fail the prompt.
Falls back to its best read and lowers confidence, rather than failing outright, when the expected behavior is vague or only partially addressed.
Restates the expected behavior in its rationale, so opening the chat shows exactly what was expected next to the verdict.

The expected behavior is shown only to the judge — the AI being evaluated never sees it, so it can’t steer toward the expected answer. Each run snapshots the expected behavior in effect when it started, so editing a prompt’s expected behavior later doesn’t change how earlier runs were scored.
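The snapshot behavior described above can be pictured with a small model. This is only an illustrative sketch, not Omni's implementation: the `Prompt` and `RunPrompt` classes and the `start_run` helper are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prompt:
    """An editable prompt in a prompt set (hypothetical shape)."""
    text: str
    expectation: Optional[str] = None


@dataclass(frozen=True)
class RunPrompt:
    """An immutable copy of a prompt, taken when a run starts."""
    text: str
    expectation: Optional[str]


def start_run(prompts):
    # Copy each prompt's current expectation into the run.
    return tuple(RunPrompt(p.text, p.expectation) for p in prompts)


prompt = Prompt("How is revenue trending?", "Revenue should be up year over year.")
run = start_run([prompt])

# Editing the prompt afterwards leaves the earlier run's snapshot untouched.
prompt.expectation = "Revenue should be flat year over year."
print(run[0].expectation)  # Revenue should be up year over year.
```

A run started after the edit would pick up the new expectation, while the earlier run keeps the one it was scored against.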

Expected behavior is set per prompt through the prompt sets API (POST and PATCH /api/v1/ai/eval/prompt-sets) — pass an optional expectation string of up to 16,000 characters on each prompt. It isn’t yet editable from the Prompt sets tab in the AI Hub.
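As a rough illustration, a request body for the prompt sets endpoint could be assembled and checked against the documented limits before sending. Only the endpoint path, the `expectation` field, the 16,000-character limit, and the 25-prompt limit come from this page. The `name`, `prompts`, and `prompt` keys are assumptions made for the sketch, and a real request likely needs further fields (such as the model the set is scoped to), so check the API reference for the actual schema.

```python
import json

MAX_PROMPTS = 25                # per-set limit stated in this guide
MAX_EXPECTATION_CHARS = 16_000  # per-prompt limit stated in this guide


def build_prompt_set_payload(name, prompts):
    """Build a candidate body for POST /api/v1/ai/eval/prompt-sets.

    `prompts` is a list of (prompt_text, expectation_or_None) pairs.
    Only the `expectation` key comes from this guide; `name`, `prompts`,
    and `prompt` are hypothetical key names used for illustration.
    """
    if len(prompts) > MAX_PROMPTS:
        raise ValueError(f"a prompt set holds at most {MAX_PROMPTS} prompts")
    items = []
    for text, expectation in prompts:
        item = {"prompt": text}
        if expectation is not None:
            if len(expectation) > MAX_EXPECTATION_CHARS:
                raise ValueError("expectation exceeds 16,000 characters")
            item["expectation"] = expectation
        items.append(item)
    return {"name": name, "prompts": items}


payload = build_prompt_set_payload(
    "Revenue regression set",
    [
        ("What was our top product by revenue?",
         "The top product by revenue should be Aniseed Syrup."),
        ("How is revenue trending year over year?", None),
    ],
)
print(json.dumps(payload, indent=2))
```

Prompts without an expectation simply omit the key, which matches the default behavior where the judge scores the response on its own terms.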

Running an eval

When you start an eval run, Omni runs each prompt as an asynchronous agentic AI job (the same engine behind the Create AI job API), captures the AI’s response, and then scores it with Omni’s built-in accuracy judge. The judge model is fixed; it isn’t configurable per run. When a prompt defines expected behavior, the judge also scores the response against it.

Select a prompt set

Open the Prompt sets tab in the AI Hub and pick the prompt set you want to evaluate.

Choose a model branch

Select the model branch the AI should query against, or pick main to evaluate the live shared model. Running on a branch lets you measure the impact of in-progress model changes before promotion.

Add a description

Add a short description for the run — for example, “Baseline before topic restructure” or “After adding ai_context to order_items.” Descriptions appear in the run list and the comparison view, so write something you’ll recognize later.

Start the run

Submit the run. Each prompt executes asynchronously as an AI job and is then scored by the judge. You can leave the page — results stream in as prompts complete.

You can have up to two eval runs in progress at once. Wait for a run to finish (or cancel it) before starting a third.
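If you script eval runs, a small guard can mirror this limit on the client side before a new run is submitted. The sketch below is hypothetical: the `"in_progress"` status label and the `can_start_run` helper are illustrative names, not documented API values.

```python
MAX_RUNS_IN_PROGRESS = 2  # limit stated in this guide


def can_start_run(run_statuses):
    """Return True if another eval run may start.

    `run_statuses` is a list of status strings; the "in_progress" label
    is a hypothetical name, not a documented API value.
    """
    in_progress = sum(1 for status in run_statuses if status == "in_progress")
    return in_progress < MAX_RUNS_IN_PROGRESS


print(can_start_run(["completed", "in_progress"]))    # True
print(can_start_run(["in_progress", "in_progress"]))  # False
```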

How the accuracy judge works

The accuracy judge is a separate AI model that reviews each completed AI response and returns a pass/fail verdict. It’s tuned to catch the analysis mistakes that most often make an answer wrong, not to grade wording, formatting, or style.

What the judge sees

The judge reads the evaluated AI’s own work for that prompt — the same internal message history the AI produced and saw while answering:

The user prompt
Every tool call the AI made and its result — including the generated queries, the query results, and any data the AI sampled or retrieved
The AI’s final written response

The judge evaluates only what the AI did and saw, including the state of the model and data as it existed when the prompt ran. It doesn’t independently re-run queries or inspect the live semantic model. A verdict therefore reflects the AI’s reasoning against the model the AI actually worked with at the time.

What the judge checks for

The judge’s instructions focus on common, high-impact analysis errors. It looks for problems like:

Hallucinations — claims in the written response that the query results don’t actually support. The judge verifies numbers and statements against the underlying query and results rather than trusting the AI’s own summary.
Date and time filtering — wrong or off-by-one date filters, missing “complete” period handling, and incorrect reasoning about fiscal periods.
Row-limit handling — treating a result that hit the row limit as if it were complete, such as summing a truncated column and reporting it as a total.
Mental math — figures the AI computed in its written response instead of in the query, which can be wrong or hallucinated.
Period-over-period errors — comparison filters that cover the wrong window, like filtering two years of data and then applying a one-year comparison on top.
Ignoring the request — not honoring an explicitly requested query structure, such as “pivot it by month.”
Query and calculation problems — using the wrong topic, degenerate results (all zero, null, or identical), or defining a calculated field the query never uses.

When a prompt defines expected behavior, the judge also checks the AI’s findings against it.

How the judge scores

The judge returns a binary verdict for each prompt:

Pass when the response clears every check above
Fail when one or more critical errors are present — a single critical error fails the prompt

Alongside the verdict, the judge reports a confidence level and a rationale anchored in evidence from the conversation. Confidence is lowered when the rubric leaves room for interpretation or when a prompt’s expected behavior is vague. Open the chat for any prompt to read the full verdict, confidence, and reasoning.

The accuracy judge is a single pass/fail correctness check, not a multi-criteria rubric. A passing verdict means the AI avoided the analysis errors above — not that the answer is the single best possible response.
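The scoring rule can be summarized in a few lines. This is a sketch of the rule as described here, not the judge itself, which is an AI model. The `JudgeResult` class, its field names, and the confidence labels are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JudgeResult:
    """Hypothetical container for one prompt's judge output."""
    critical_errors: List[str] = field(default_factory=list)
    confidence: str = "high"  # label values are illustrative
    rationale: str = ""

    @property
    def verdict(self) -> str:
        # A single critical error is enough to fail the prompt.
        return "fail" if self.critical_errors else "pass"


clean = JudgeResult(rationale="Figures match the query results.")
flawed = JudgeResult(
    critical_errors=["summed a row-limited result and reported it as a total"],
    rationale="The result hit the row limit, so the total is incomplete.",
)
print(clean.verdict)   # pass
print(flawed.verdict)  # fail
```

Because the verdict is binary, the number of critical errors does not change the outcome: one error and five errors both produce a fail.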

Reviewing results

Once a run completes, open it to see per-prompt and aggregate results. For each prompt, the run shows:

Accuracy — a pass/fail verdict from Omni’s built-in accuracy judge on whether the AI’s response correctly answered the prompt, scored against the prompt’s expected behavior when it’s set
Prompt credits — the credits consumed generating the AI response
Judge credits — the credits consumed scoring the response
Duration — how long the prompt took to complete
Open chat — a link that opens the underlying AI conversation. Beyond the generated query, fields, and answer, this is where you can see the accuracy judge’s full verdict, confidence, and reasoning for that prompt — including how the response compared against the expected behavior, when it was set

Aggregate metrics for the run include total credits, total duration, and overall accuracy, so you can summarize a run at a glance. You can archive a completed run to remove it from the active list. Archived runs are preserved and can be restored later.
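To make the relationship between per-prompt and aggregate metrics concrete, here is a minimal sketch. It assumes that overall accuracy is the fraction of prompts that passed and that total duration is the sum of per-prompt durations; this page does not spell out either formula, so treat both as assumptions. The `PromptResult` class and `summarize` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PromptResult:
    """Hypothetical per-prompt row from a completed run."""
    prompt: str
    passed: bool
    prompt_credits: float
    judge_credits: float
    duration_s: float


def summarize(results):
    total = len(results)
    return {
        # Assumption: accuracy is the share of prompts with a pass verdict.
        "accuracy": sum(r.passed for r in results) / total if total else 0.0,
        "total_credits": sum(r.prompt_credits + r.judge_credits for r in results),
        # Assumption: durations are summed rather than measured wall-clock.
        "total_duration_s": sum(r.duration_s for r in results),
    }


results = [
    PromptResult("Top product by revenue?", True, 2.0, 0.5, 12.0),
    PromptResult("Revenue trend year over year?", True, 1.5, 0.5, 8.0),
    PromptResult("Orders by month, pivoted?", False, 3.0, 0.5, 20.0),
    PromptResult("Average order value?", True, 2.0, 0.5, 10.0),
]
summary = summarize(results)
print(summary)
```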

Comparing two runs

To measure the impact of a model change, run an eval on a branch, then compare it to a baseline run on main. You can compare two runs at a time, and both must use the same prompt set, so they evaluate the same questions. For each prompt, the comparison view shows side by side:

Accuracy — the accuracy judge’s pass/fail verdict on each run, so you can see exactly which prompts regressed or improved. Open the chat for any prompt to read the judge’s reasoning and confidence behind a verdict
Credits — the combined prompt + judge credits for each run, so you can spot changes that hurt efficiency
Duration — run-to-run latency differences

Use the comparison to validate a change before promoting it: if accuracy improves and credits hold steady, the change is a clear win; if accuracy is flat but credits spike, you may want to investigate before merging.
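The same comparison can be sketched offline if you have per-prompt results from two runs. The data shape below (a dictionary mapping prompt text to `passed` and `credits`) and the `compare_runs` helper are hypothetical, chosen only to illustrate the regressed, improved, and credit-delta logic.

```python
def compare_runs(baseline, candidate):
    """Compare two runs of the same prompt set.

    Each argument maps prompt text to {"passed": bool, "credits": float}.
    This shape is an assumption for the example, not an Omni export format.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must come from the same prompt set")
    regressed = [p for p in baseline
                 if baseline[p]["passed"] and not candidate[p]["passed"]]
    improved = [p for p in baseline
                if candidate[p]["passed"] and not baseline[p]["passed"]]
    credit_delta = (sum(r["credits"] for r in candidate.values())
                    - sum(r["credits"] for r in baseline.values()))
    return {"regressed": regressed, "improved": improved,
            "credit_delta": credit_delta}


main_run = {
    "Top product by revenue?": {"passed": True, "credits": 3.0},
    "Revenue trend year over year?": {"passed": False, "credits": 2.5},
    "Orders by month, pivoted?": {"passed": True, "credits": 4.0},
}
branch_run = {
    "Top product by revenue?": {"passed": True, "credits": 3.0},
    "Revenue trend year over year?": {"passed": True, "credits": 2.5},
    "Orders by month, pivoted?": {"passed": False, "credits": 5.0},
}
diff = compare_runs(main_run, branch_run)
print(diff)
```

In this made-up example the branch fixes one prompt but breaks another and costs more credits, which is the kind of mixed result worth investigating before merging.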

Next steps

Optimize models for Omni AI

Add ai_context, sample queries, and field metadata to improve AI accuracy.

Learn from conversation

Let the AI capture business context from your conversations.

Model AI settings

Configure AI query scope, validation, thinking levels, and context management.

Create AI job API

The endpoint that powers each prompt in an eval run.
