ai_context and other context adjustments) before promoting them to your shared model. Use evals to catch regressions before they reach users and to compare results across branches.
Evals live in the AI Hub, under the Prompt sets and Eval runs tabs.
Requirements
- At least Querier access on the shared model you want to evaluate (Organization Admins and AI admins also have access)
- At least one topic optimized for AI — see Optimizing models for Omni AI
Prompt sets
A prompt set is a reusable list of up to 25 natural-language prompts that you run against the AI together. You can create as many prompt sets as you need — for example, one per topic, one per release, or one for regression coverage.
Manage prompts
From the Prompt sets tab in the AI Hub, you can:
- Create a prompt set and give it a descriptive name
- Add or edit prompts in an existing set, up to the 25-prompt limit
- Delete a prompt you no longer need
- Save changes to persist edits
- Archive a prompt set to remove it from the active list without losing it. Archived sets can be restored.
What makes an effective prompt set
Strong prompt sets are representative, broad, and unambiguous:
- Mirror real user questions — use the AI usage analytics dashboard to find what users actually ask
- Favor breadth over depth — covering several topics with one prompt each yields more signal than ten prompts on the same topic
- Phrase prompts naturally — write the way a user would ask, not in language that mirrors field names
- Add regression prompts when an AI answer turns out to be wrong, so the same failure is caught next time
- Start small and expand toward the 25-prompt cap — begin with 10–15 prompts that cover your core scenarios, then iterate on ai_context and sample_queries and add more prompts as you go
Expected behavior
By default, the accuracy judge decides whether the AI’s response is a correct, reasonable answer to the prompt on its own terms. You can optionally describe the expected behavior for any prompt — a reference answer describing the result you expect — and the judge checks the AI’s analysis against it.
Expected behavior is useful when a prompt has a known correct answer you want to pin down, such as a specific value or ranking (“The top product by revenue should be Aniseed Syrup.”), a direction (“Revenue should be up year over year.”), or a breakdown the answer must include.
When a prompt defines expected behavior, the judge:
- Treats a material divergence as a failing result — wrong numbers, the wrong direction, or a required result that’s missing. Differences in wording, formatting, or extra detail don’t fail the prompt.
- Falls back to its best read and lowers confidence, rather than failing outright, when the expected behavior is vague or only partially addressed.
- Restates the expected behavior in its rationale, so opening the chat shows exactly what was expected next to the verdict.
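As a minimal sketch, this is how a request body with per-prompt expectations might be assembled for the prompt sets API (POST and PATCH /api/v1/ai/eval/prompt-sets). Only the `expectation` field name, the 25-prompt limit, and the 16,000-character limit come from this page. The `name`, `prompts`, and `prompt` keys and the helper `build_prompt_set_payload` are hypothetical, so check the API reference for the actual schema before sending a request.

```python
import json

MAX_PROMPTS = 25                 # per-set limit stated on this page
MAX_EXPECTATION_CHARS = 16_000   # per-prompt expectation limit stated on this page


def build_prompt_set_payload(name, prompts):
    """Build a candidate request body for the prompt sets API.

    `prompts` is a list of (prompt_text, expectation_or_None) pairs.
    ASSUMPTION: the keys "name", "prompts", and "prompt" are illustrative;
    only "expectation" is named in the documentation.
    """
    if len(prompts) > MAX_PROMPTS:
        raise ValueError(f"a prompt set holds at most {MAX_PROMPTS} prompts")

    items = []
    for text, expectation in prompts:
        item = {"prompt": text}
        if expectation is not None:
            if len(expectation) > MAX_EXPECTATION_CHARS:
                raise ValueError("expectation exceeds 16,000 characters")
            item["expectation"] = expectation
        items.append(item)
    return {"name": name, "prompts": items}


payload = build_prompt_set_payload(
    "Revenue regression checks",
    [
        ("What was our top product by revenue?",
         "The top product by revenue should be Aniseed Syrup."),
        ("How is revenue trending year over year?",
         "Revenue should be up year over year."),
        # No expectation: the judge scores this prompt on its own terms.
        ("Which regions had the most orders last month?", None),
    ],
)
print(json.dumps(payload, indent=2))
```

Validating the limits client-side keeps an oversized set or expectation from being rejected only after the request is sent.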
Expected behavior is set per prompt through the prompt sets API (POST and PATCH /api/v1/ai/eval/prompt-sets) — pass an optional expectation string of up to 16,000 characters on each prompt. It isn’t yet editable from the Prompt sets tab in the AI Hub.
Running an eval
When you start an eval run, Omni runs each prompt as an asynchronous agentic AI job (the same engine behind the Create AI job API), captures the AI’s response, and then scores it with Omni’s built-in accuracy judge. The judge model is fixed; it isn’t configurable per run. When a prompt defines expected behavior, the judge also scores the response against it.
Select a prompt set
Open the Prompt sets tab in the AI Hub and pick the prompt set you want to evaluate.
Choose a model branch
Select the model branch the AI should query against, or pick main to evaluate the live shared model. Running on a branch lets you measure the impact of in-progress model changes before promotion.
Add a description
Add a short description for the run — for example, “Baseline before topic restructure” or “After adding ai_context to order_items.” Descriptions appear in the run list and the comparison view, so write something you’ll recognize later.
You can have up to two eval runs in progress at once. Wait for a run to finish (or cancel it) before starting a third.
How the accuracy judge works
The accuracy judge is a separate AI model that reviews each completed AI response and returns a pass/fail verdict. It’s tuned to catch the analysis mistakes that most often make an answer wrong, not to grade wording, formatting, or style.
What the judge sees
The judge reads the evaluated AI’s own work for that prompt — the same internal message history the AI produced and saw while answering:
- The user prompt
- Every tool call the AI made and its result — including the generated queries, the query results, and any data the AI sampled or retrieved
- The AI’s final written response
What the judge checks for
The judge’s instructions focus on common, high-impact analysis errors. It looks for problems like:
- Hallucinations — claims in the written response that the query results don’t actually support. The judge verifies numbers and statements against the underlying query and results rather than trusting the AI’s own summary.
- Date and time filtering — wrong or off-by-one date filters, missing “complete” period handling, and incorrect reasoning about fiscal periods.
- Row-limit handling — treating a result that hit the row limit as if it were complete, such as summing a truncated column and reporting it as a total.
- Mental math — figures the AI computed in its written response instead of in the query, which can be wrong or hallucinated.
- Period-over-period errors — comparison filters that cover the wrong window, like filtering two years of data and then applying a one-year comparison on top.
- Ignoring the request — not honoring an explicitly requested query structure, such as “pivot it by month.”
- Query and calculation problems — using the wrong topic, degenerate results (all zero, null, or identical), or defining a calculated field the query never uses.
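To make the row-limit item concrete, here is a small illustration of the mistake it describes: summing a column from a result that hit the row limit and presenting the sum as a complete total. This is not how the judge is implemented. The function name, the `row_limit` parameter, and the sample values are hypothetical.

```python
def summarize_total(values, row_limit):
    """Sum a result column and report whether the sum can be called a total.

    If the number of returned rows reaches the row limit, more rows may
    exist that were cut off, so the sum is only a partial figure.
    """
    possibly_truncated = len(values) >= row_limit
    return {
        "sum": sum(values),
        "is_complete_total": not possibly_truncated,
    }


# Three rows came back and the limit was three: the result may be truncated,
# so reporting 245 as "the total" would be the error the judge looks for.
partial = summarize_total([120, 80, 45], row_limit=3)

# Three rows came back with plenty of headroom: the sum is a real total.
complete = summarize_total([120, 80, 45], row_limit=1000)

print(partial)
print(complete)
```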
How the judge scores
The judge returns a binary verdict for each prompt:
- Pass when the response clears every check above
- Fail when one or more critical errors are present — a single critical error fails the prompt
The accuracy judge is a single pass/fail correctness check, not a multi-criteria rubric. A passing verdict means the AI avoided the analysis errors above — not that the answer is the single best possible response.
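The scoring rule can be stated as a one-line function: any critical error fails the prompt, and an empty list passes. The sketch below only restates that rule. The `verdict` function and the error labels are hypothetical and do not reflect the judge's internal representation.

```python
def verdict(critical_errors):
    """Return "fail" if any critical error was found, otherwise "pass".

    `critical_errors` is a list of labels for the critical errors detected
    in one response (labels here are illustrative only).
    """
    return "fail" if critical_errors else "pass"


print(verdict([]))                                  # no critical errors
print(verdict(["hallucination"]))                   # one is enough to fail
print(verdict(["row_limit", "date_filter"]))        # several still one fail
```

Because the result is binary, two critical errors score the same as one. The judge's rationale in the chat is where the individual problems are spelled out.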
Reviewing results
Once a run completes, open it to see per-prompt and aggregate results. For each prompt, the run shows:
- Accuracy — a pass/fail verdict from Omni’s built-in accuracy judge on whether the AI’s response correctly answered the prompt, scored against the prompt’s expected behavior when it’s set
- Prompt credits — the credits consumed generating the AI response
- Judge credits — the credits consumed scoring the response
- Duration — how long the prompt took to complete
- Open chat — a link that opens the underlying AI conversation. Beyond the generated query, fields, and answer, this is where you can see the accuracy judge’s full verdict, confidence, and reasoning for that prompt — including how the response compared against the expected behavior, when it was set
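If you copy per-prompt results out of a run for your own tracking, the aggregates can be recomputed along these lines. This is a sketch under assumptions: the `PromptResult` fields mirror the columns listed above, but the class, the field names, and the sample figures are placeholders, not an Omni export format or real results.

```python
from dataclasses import dataclass


@dataclass
class PromptResult:
    prompt: str
    passed: bool            # accuracy verdict: True = pass
    prompt_credits: float   # credits spent generating the response
    judge_credits: float    # credits spent scoring the response
    duration_s: float       # time the prompt took, in seconds


def summarize_run(results):
    """Aggregate per-prompt results into run-level figures."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    return {
        "prompts": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "total_credits": sum(r.prompt_credits + r.judge_credits for r in results),
        "summed_duration_s": sum(r.duration_s for r in results),
    }


# Placeholder values for illustration only.
results = [
    PromptResult("Top product by revenue?", True, 4.0, 1.0, 30.0),
    PromptResult("Revenue trend year over year?", False, 6.0, 1.0, 45.0),
    PromptResult("Orders by region last month?", True, 5.0, 1.0, 25.0),
    PromptResult("Average order value by channel?", True, 3.0, 1.0, 20.0),
]
summary = summarize_run(results)
print(summary)
```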
Comparing two runs
To measure the impact of a model change, run an eval on a branch, then compare it to a baseline run on main. Comparison is limited to two runs of the same prompt set, so both runs evaluate the same questions.
The comparison view shows side by side, per prompt:
- Accuracy — the accuracy judge’s pass/fail verdict on each run, so you can see exactly which prompts regressed or improved. Open the chat for any prompt to read the judge’s reasoning and confidence behind a verdict
- Credits — the combined prompt + judge credits for each run, so you can spot changes that hurt efficiency
- Duration — run-to-run latency differences
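The per-prompt comparison amounts to bucketing each prompt by how its verdict changed between the two runs. The sketch below shows that logic on hand-entered verdicts. The `compare_runs` function, the dictionary shape, and the sample prompts are hypothetical and are not an Omni API.

```python
def compare_runs(baseline, candidate):
    """Bucket prompts by verdict change between two runs.

    Both arguments map prompt text to a verdict (True = pass). The runs
    must cover the same prompts, matching the same-prompt-set rule.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must come from the same prompt set")

    regressed = [p for p in baseline if baseline[p] and not candidate[p]]
    improved = [p for p in baseline if not baseline[p] and candidate[p]]
    unchanged = [p for p in baseline if baseline[p] == candidate[p]]
    return {"regressed": regressed, "improved": improved, "unchanged": unchanged}


# Placeholder verdicts: a baseline run on main versus a run on a branch.
main_run = {
    "Top product by revenue?": True,
    "Revenue trend year over year?": False,
    "Orders by region last month?": True,
}
branch_run = {
    "Top product by revenue?": True,
    "Revenue trend year over year?": True,
    "Orders by region last month?": False,
}
diff = compare_runs(main_run, branch_run)
print(diff)
```

A branch that improves one prompt while regressing another, as in this example, has the same pass count as the baseline, which is why the per-prompt view matters more than the aggregate.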
Next steps
Optimize models for Omni AI
Add ai_context, sample queries, and field metadata to improve AI accuracy.
Learn from conversation
Let the AI capture business context from your conversations.
Model AI settings
Configure AI query scope, validation, thinking levels, and context management.
Create AI job API
The endpoint that powers each prompt in an eval run.

