Evals measure how accurately Omni’s AI answers natural-language questions about your data. You define a prompt set (a reusable collection of questions scoped to a single Omni model), then execute an eval run that scores each AI response with a built-in accuracy judge. Because a run targets a specific model branch, evals let you measure the impact of data modeling and curation changes (new fields and joins, or ai_context and other context adjustments) before promoting them to your shared model. Use evals to catch regressions before they reach users and to compare results across branches. Evals live in the AI Hub, under the Prompt sets and Eval runs tabs.

Requirements

  • At least Querier access on the shared model you want to evaluate (Organization Admins and AI admins also have access)
  • At least one topic optimized for AI — see Optimizing models for Omni AI

Prompt sets

A prompt set is a reusable list of up to 25 natural-language prompts that you run against the AI together. You can create as many prompt sets as you need — for example, one per topic, one per release, or one for regression coverage.

Manage prompts

From the Prompt sets tab in the AI Hub, you can:
  • Create a prompt set and give it a descriptive name
  • Add or edit prompts in an existing set, up to the 25-prompt limit
  • Delete a prompt you no longer need
  • Save changes to persist edits
  • Archive a prompt set to remove it from the active list without losing it. Archived sets can be restored.
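
If you draft prompt sets outside Omni before entering them in the AI Hub, a small local check can catch a set that exceeds the limit. The following is a minimal sketch, not Omni's schema or API: the `PromptSet` class and its fields are hypothetical, and only the 25-prompt cap comes from this guide.

```python
from dataclasses import dataclass, field

MAX_PROMPTS = 25  # per-set cap stated in this guide


@dataclass
class PromptSet:
    """Hypothetical local draft of a prompt set (not Omni's data model)."""
    name: str
    prompts: list[str] = field(default_factory=list)

    def add_prompt(self, text: str) -> None:
        """Append a prompt, rejecting empty text and anything past the cap."""
        text = text.strip()
        if not text:
            raise ValueError("prompt text is empty")
        if len(self.prompts) >= MAX_PROMPTS:
            raise ValueError(f"prompt set is at the {MAX_PROMPTS}-prompt limit")
        self.prompts.append(text)


regression = PromptSet(name="Regression coverage")
regression.add_prompt("What was total revenue last quarter?")
```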

What makes an effective prompt set

Strong prompt sets are representative, broad, and unambiguous:
  • Mirror real user questions — use the AI usage analytics dashboard to find what users actually ask
  • Favor breadth over depth — covering several topics with one prompt each yields more signal than ten prompts on the same topic
  • Phrase prompts naturally — write the way a user would ask, not in language that mirrors field names
  • Add regression prompts when an AI answer turns out to be wrong, so the same failure is caught next time
  • Start small and expand toward the 25-prompt cap — begin with 10–15 prompts that cover your core scenarios, then iterate on ai_context and sample_queries and add more prompts as you go
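
One way to check breadth is to tag each draft prompt with the topic you expect it to exercise and count prompts per topic. This is a sketch under assumptions: the topic tags, the sample prompts, and the `max_share` threshold are all hypothetical, and Omni does not require you to tag prompts this way.

```python
from collections import Counter

# Hypothetical (topic, prompt) pairs kept alongside your draft prompt set.
tagged_prompts = [
    ("orders", "How many orders shipped late last month?"),
    ("orders", "What is our average order value this year?"),
    ("customers", "Which region has the most new customers?"),
    ("inventory", "Which products are running low on stock?"),
]


def prompts_per_topic(pairs):
    """Count how many prompts target each topic."""
    return Counter(topic for topic, _ in pairs)


def overrepresented(pairs, max_share=0.5):
    """Return topics holding more than max_share of all prompts (assumed threshold)."""
    counts = prompts_per_topic(pairs)
    total = sum(counts.values())
    return sorted(t for t, n in counts.items() if total and n / total > max_share)
```

A topic that shows up in `overrepresented` is a candidate for trimming in favor of prompts on topics that have no coverage yet.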

Running an eval

When you start an eval run, Omni runs each prompt as an asynchronous agentic AI job (the same engine behind the Create AI job API), captures the AI’s response, and then scores it with Omni’s built-in accuracy judge. The judge model is fixed; it isn’t configurable per run.
1. Select a prompt set

Open the Prompt sets tab in the AI Hub and pick the prompt set you want to evaluate.
2. Choose a model branch

Select the model branch the AI should query against, or pick main to evaluate the live shared model. Running on a branch lets you measure the impact of in-progress model changes before promotion.
3. Add a description

Add a short description for the run — for example, “Baseline before topic restructure” or “After adding ai_context to order_items.” Descriptions appear in the run list and the comparison view, so write something you’ll recognize later.
4. Start the run

Submit the run. Each prompt executes asynchronously as an AI job and is then scored by the judge. You can leave the page — results stream in as prompts complete.
You can have up to two eval runs in progress at once. Wait for a run to finish (or cancel it) before starting a third.
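
If you keep your own notes on which runs are active, the two-run limit amounts to a simple gate. The sketch below is illustrative only: the status strings are hypothetical and do not come from Omni, and the limit of two is the only value taken from this guide.

```python
MAX_ACTIVE_RUNS = 2  # concurrency limit stated in this guide


def can_start_run(run_statuses):
    """Return True if another eval run can start.

    run_statuses is a list of hypothetical status strings such as
    "in_progress", "complete", or "cancelled".
    """
    active = sum(1 for status in run_statuses if status == "in_progress")
    return active < MAX_ACTIVE_RUNS
```

For example, `can_start_run(["in_progress", "complete"])` allows a new run, while two `"in_progress"` entries block it until one finishes or is cancelled.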

Reviewing results

Once a run completes, open it to see per-prompt and aggregate results. For each prompt, the run shows:
  • Accuracy — a pass/fail verdict from Omni’s built-in accuracy judge on whether the AI’s response correctly answered the prompt
  • Prompt cost — the cost of generating the AI response
  • Judge cost — the cost of scoring the response
  • Duration — how long the prompt took to complete
  • Open chat — a link that opens the underlying AI conversation. Beyond the generated query, fields, and answer, this is where you can see the accuracy judge’s full verdict, confidence, and reasoning for that prompt
Aggregate metrics for the run include total cost, total duration, and overall accuracy, so you can summarize a run at a glance. You can archive a completed run to remove it from the active list. Archived runs are preserved and can be restored later.
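
To make the relationship between per-prompt and aggregate metrics concrete, here is a minimal sketch that rolls per-prompt results up into run-level numbers. The `PromptResult` class and its field names are hypothetical, and the sketch assumes that total cost is the sum of prompt and judge costs, that total duration is the sum of per-prompt durations, and that overall accuracy is the share of prompts that passed. Omni may compute its aggregates differently.

```python
from dataclasses import dataclass


@dataclass
class PromptResult:
    """Hypothetical per-prompt record copied from a run's results."""
    prompt: str
    passed: bool        # accuracy judge verdict
    prompt_cost: float  # cost of generating the AI response
    judge_cost: float   # cost of scoring the response
    duration_s: float   # seconds the prompt took to complete


def summarize(results):
    """Aggregate per-prompt results under the assumptions stated above."""
    count = len(results)
    return {
        "total_cost": sum(r.prompt_cost + r.judge_cost for r in results),
        "total_duration_s": sum(r.duration_s for r in results),
        "accuracy": sum(r.passed for r in results) / count if count else 0.0,
    }


sample = [
    PromptResult("Total revenue last quarter?", True, 0.5, 0.25, 10.0),
    PromptResult("Top region by new customers?", False, 0.25, 0.25, 20.0),
]
summary = summarize(sample)
```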

Comparing two runs

To measure the impact of a model change, run an eval on a branch, then compare it to a baseline run on main. You can compare exactly two runs at a time, and both must use the same prompt set so that they evaluate the same questions. For each prompt, the comparison view shows side by side:
  • Accuracy — the accuracy judge’s pass/fail verdict on each run, so you can see exactly which prompts regressed or improved. Open the chat for any prompt to read the judge’s reasoning and confidence behind a verdict
  • Cost — the combined prompt + judge cost for each run, so you can spot changes that hurt efficiency
  • Duration — run-to-run latency differences
Use the comparison to validate a change before promoting it. If accuracy improves and cost holds steady, the change is a clear win. If accuracy is flat but cost spikes, investigate before promoting.
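
The per-prompt accuracy comparison can be sketched as a diff of two verdict maps. This is an illustration rather than Omni's implementation: the function name and the dictionaries mapping prompt text to a pass/fail verdict are hypothetical, and you would fill them in from the two runs you are comparing.

```python
def compare_verdicts(baseline, candidate):
    """Classify each prompt by how its pass/fail verdict changed between runs.

    baseline and candidate are hypothetical dicts of prompt text -> bool,
    where True means the accuracy judge passed the response.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same prompts")
    changes = {"regressed": [], "improved": [], "unchanged": []}
    for prompt, passed_before in baseline.items():
        passed_after = candidate[prompt]
        if passed_before and not passed_after:
            changes["regressed"].append(prompt)
        elif passed_after and not passed_before:
            changes["improved"].append(prompt)
        else:
            changes["unchanged"].append(prompt)
    return changes


main_run = {"Revenue last quarter?": True, "Late orders last month?": False}
branch_run = {"Revenue last quarter?": True, "Late orders last month?": True}
diff = compare_verdicts(main_run, branch_run)
```

Any prompt listed under `regressed` is worth opening in chat to read the judge's reasoning before you promote the branch.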

Next steps

Optimize models for Omni AI

Add ai_context, sample queries, and field metadata to improve AI accuracy.

Learn from conversation

Let the AI capture business context from your conversations.

Model AI settings

Configure AI query scope, validation, thinking levels, and context management.

Create AI job API

The endpoint that powers each prompt in an eval run.