ai_context and other context adjustments) before promoting them to your shared model. Use evals to catch regressions before they reach users and to compare results across branches.
Evals live in the AI Hub, under the Prompt sets and Eval runs tabs.
Requirements
- At least Querier access on the shared model you want to evaluate (Organization Admins and AI admins also have access)
- At least one topic optimized for AI — see Optimizing models for Omni AI
Prompt sets
A prompt set is a reusable list of up to 25 natural-language prompts that you run against the AI together. You can create as many prompt sets as you need: for example, one per topic, one per release, or one for regression coverage.
Manage prompts
From the Prompt sets tab in the AI Hub, you can:
- Create a prompt set and give it a descriptive name
- Add or edit prompts in an existing set, up to the 25-prompt limit
- Delete a prompt you no longer need
- Save changes to persist edits
- Archive a prompt set to remove it from the active list without losing it. Archived sets can be restored.
What makes an effective prompt set
Strong prompt sets are representative, broad, and unambiguous:
- Mirror real user questions: use the AI usage analytics dashboard to find what users actually ask
- Favor breadth over depth: covering several topics with one prompt each yields more signal than ten prompts on the same topic
- Phrase prompts naturally: write the way a user would ask, not in language that mirrors field names
- Add regression prompts when an AI answer turns out to be wrong, so the same failure is caught next time
- Start small and expand toward the 25-prompt cap: begin with 10–15 prompts that cover your core scenarios, then iterate on ai_context and sample_queries and add more prompts as you go
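The guidelines above can be checked mechanically before you paste prompts into the AI Hub. The following is a minimal sketch, not an Omni feature: the `Prompt` class, its `topic` tag, and the `max_per_topic` threshold are hypothetical names introduced here for illustration. Only the 25-prompt limit comes from this guide.

```python
from collections import Counter
from dataclasses import dataclass

MAX_PROMPTS = 25  # per-set limit stated in this guide


@dataclass(frozen=True)
class Prompt:
    text: str
    topic: str  # hypothetical tag for tracking coverage; not an Omni field


def check_prompt_set(prompts: list[Prompt], max_per_topic: int = 3) -> list[str]:
    """Return human-readable warnings; an empty list means no issues were found."""
    warnings = []
    if len(prompts) > MAX_PROMPTS:
        warnings.append(f"{len(prompts)} prompts exceeds the {MAX_PROMPTS}-prompt limit")

    # Flag duplicates, ignoring case and surrounding whitespace.
    normalized = Counter(p.text.strip().lower() for p in prompts)
    for text, count in normalized.items():
        if count > 1:
            warnings.append(f"duplicate prompt ({count}x): {text!r}")

    # Flag topics that dominate the set (breadth over depth).
    per_topic = Counter(p.topic for p in prompts)
    for topic, count in per_topic.items():
        if count > max_per_topic:
            warnings.append(f"topic {topic!r} has {count} prompts; consider spreading coverage")
    return warnings
```

The `max_per_topic` default of 3 is an arbitrary starting point; tune it to how many topics your model exposes.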
Running an eval
When you start an eval run, Omni runs each prompt as an asynchronous agentic AI job (the same engine behind the Create AI job API), captures the AI’s response, and then scores it with Omni’s built-in accuracy judge. The judge model is fixed; it isn’t configurable per run.
Select a prompt set
Open the Prompt sets tab in the AI Hub and pick the prompt set you want to evaluate.
Choose a model branch
Select the model branch the AI should query against, or pick main to evaluate the live shared model. Running on a branch lets you measure the impact of in-progress model changes before promotion.
Add a description
Add a short description for the run — for example, “Baseline before topic restructure” or “After adding ai_context to order_items.” Descriptions appear in the run list and the comparison view, so write something you’ll recognize later.
You can have up to two eval runs in progress at once. Wait for a run to finish (or cancel it) before starting a third.
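Conceptually, an eval run is a generate-then-judge loop over the prompt set. The sketch below illustrates that shape only and does not call Omni: `generate` and `judge` are hypothetical callables you would supply, and running the prompts concurrently is an assumption made here, not a documented detail of how Omni schedules jobs.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class PromptResult:
    prompt: str
    response: str
    passed: bool


async def run_eval(
    prompts: list[str],
    generate: Callable[[str], Awaitable[str]],     # hypothetical: runs one AI job
    judge: Callable[[str, str], Awaitable[bool]],  # hypothetical: pass/fail verdict
) -> list[PromptResult]:
    """Generate a response for each prompt, then score it, preserving prompt order."""

    async def run_one(prompt: str) -> PromptResult:
        response = await generate(prompt)       # capture the AI's response
        passed = await judge(prompt, response)  # then score that response
        return PromptResult(prompt, response, passed)

    # Assumption: prompts are independent, so they can be in flight together.
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Passing the two steps in as callables keeps the loop testable with stubs and avoids assuming anything about the real job or judge interfaces.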
Reviewing results
Once a run completes, open it to see per-prompt and aggregate results. For each prompt, the run shows:
- Accuracy: a pass/fail verdict from Omni’s built-in accuracy judge on whether the AI’s response correctly answered the prompt
- Prompt cost: the cost of generating the AI response
- Judge cost: the cost of scoring the response
- Duration: how long the prompt took to complete
- Open chat: a link that opens the underlying AI conversation. Beyond the generated query, fields, and answer, this is where you can see the accuracy judge’s full verdict, confidence, and reasoning for that prompt
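If you copy per-prompt results out of a run, the aggregate figures follow from simple arithmetic. This is a minimal sketch under stated assumptions: the `PromptOutcome` record and its field names are hypothetical, the duration unit is assumed to be seconds, and Omni's own aggregate view may define its totals differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptOutcome:
    prompt: str
    passed: bool        # accuracy verdict
    prompt_cost: float  # cost of generating the response
    judge_cost: float   # cost of scoring the response
    duration_s: float   # assumed unit: seconds


def summarize(outcomes: list[PromptOutcome]) -> dict[str, float]:
    """Roll per-prompt results up into run-level figures."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    n = len(outcomes)
    return {
        "pass_rate": sum(o.passed for o in outcomes) / n,
        "total_cost": sum(o.prompt_cost + o.judge_cost for o in outcomes),
        "mean_duration_s": sum(o.duration_s for o in outcomes) / n,
    }
```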
Comparing two runs
To measure the impact of a model change, run an eval on a branch, then compare it to a baseline run on main. Comparison is limited to two runs of the same prompt set, so both runs evaluate the same questions.
For each prompt, the comparison view shows the two runs side by side:
- Accuracy — the accuracy judge’s pass/fail verdict on each run, so you can see exactly which prompts regressed or improved. Open the chat for any prompt to read the judge’s reasoning and confidence behind a verdict
- Cost — the combined prompt + judge cost for each run, so you can spot changes that hurt efficiency
- Duration — run-to-run latency differences
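The accuracy comparison amounts to classifying each prompt by how its verdict changed between the baseline and the branch run. The sketch below assumes you have recorded each run as a mapping from prompt text to its pass/fail verdict. The `compare_runs` function and that mapping are hypothetical and are not an Omni export format.

```python
def compare_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Classify each prompt by how its pass/fail verdict changed between two runs."""
    # Both runs must come from the same prompt set, as the comparison view requires.
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same prompts")

    changes: dict[str, list[str]] = {"regressed": [], "improved": [], "unchanged": []}
    for prompt in baseline:
        before, after = baseline[prompt], candidate[prompt]
        if before and not after:
            changes["regressed"].append(prompt)  # passed on baseline, fails now
        elif after and not before:
            changes["improved"].append(prompt)   # failed on baseline, passes now
        else:
            changes["unchanged"].append(prompt)
    return changes
```

A non-empty `regressed` list is a reasonable signal to hold a branch back and read the judge's reasoning for those prompts before promoting.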
Next steps
Optimize models for Omni AI
Add ai_context, sample queries, and field metadata to improve AI accuracy.
Learn from conversation
Let the AI capture business context from your conversations.
Model AI settings
Configure AI query scope, validation, thinking levels, and context management.
Create AI job API
The endpoint that powers each prompt in an eval run.