Datasets

A dataset is a collection of examples we want our AI application to answer correctly. Datasets are stored here.

Each dataset row has three fields, each a JSON object:

  • input (required) - Used as the input to your AI application. For a RAG use case, this could be the query sent into your RAG pipeline. Example: {"question": "What is the capital of Australia?", "conversation_history": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]}
  • expected output (optional) - What we expect the AI to answer. Example: {"answer": "Canberra"}
  • metadata (optional) - Additional information for organizing examples. Example: {"category": "Geography", "difficulty": "Easy"}
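
Putting the three fields together, a single dataset row might look like the JSON below. This is only an illustrative sketch (the snake_case key names are assumptions); see the sample dataset file linked further down for the exact format.

```json
[
  {
    "input": {
      "question": "What is the capital of Australia?",
      "conversation_history": [
        { "role": "user", "content": "What is the capital of France?" },
        { "role": "assistant", "content": "Paris" }
      ]
    },
    "expected_output": { "answer": "Canberra" },
    "metadata": { "category": "Geography", "difficulty": "Easy" }
  }
]
```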

For built-in scores like Facts Compare and Accuracy, the expected output is a single string.

For some applications, there is no single ‘correct’ answer. In that case, you can leave the expected output blank and instead create a custom score that evaluates the correctness of your AI output without needing an expected output. For example, to determine whether the LLM response contains a correctly formatted number, the custom score would consume only the output and run reference-free, as sketched below.
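
Here is a minimal sketch of such a reference-free score, written in Python. The rule it checks (a correctly formatted dollar amount) and the function name are made up for illustration; the actual interface for registering a custom score is covered in the Custom Scores guide.

```python
import re

def number_format_score(output: str) -> float:
    """Reference-free check: does the response contain a well-formatted
    dollar amount such as $1,234.56? Returns 1.0 on pass, 0.0 on fail.
    Only the model output is consumed; no expected output is needed."""
    pattern = r"\$\d{1,3}(,\d{3})*(\.\d{2})?"
    return 1.0 if re.search(pattern, output) else 0.0

print(number_format_score("The total comes to $12,345.67."))   # 1.0
print(number_format_score("The total comes to 12345,67 USD."))  # 0.0
```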

For other applications, there are multiple correct answers (e.g. text-to-SQL, where several queries can be valid). You can create a custom score that checks whether the generated SQL query matches any query in a set of known-correct queries; a sketch of this idea follows.
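
The (hypothetical) helper below normalizes case and whitespace and checks membership in a set of known-correct queries; a production score would likely compare parsed query structure rather than strings.

```python
def sql_match_score(output_sql: str, correct_queries: list[str]) -> float:
    """Return 1.0 if the generated SQL matches any known-correct query."""
    def normalize(sql: str) -> str:
        # Deliberately naive: lowercase, collapse whitespace, drop trailing ';'
        return " ".join(sql.lower().split()).rstrip(";")

    candidates = {normalize(q) for q in correct_queries}
    return 1.0 if normalize(output_sql) in candidates else 0.0

correct = [
    "SELECT name FROM cities WHERE country = 'Australia';",
    "SELECT name FROM cities WHERE country = 'Australia' ORDER BY name;",
]
print(sql_match_score("select name  FROM cities WHERE country = 'Australia'", correct))  # 1.0
```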

See example dataset here: Sample dataset file.

Any changes to the datasets (additions, deletions, or updates) are versioned, so you can revert to a previous version if needed.

Experiments

An experiment measures the quality of your prompt, RAG pipeline, or AI agent on each dataset example, using a variety of scores. You can see the results of your experiments on the experiments page.

There are two ways to run experiments:

  1. From the prompt playground - see prompt playground.
  2. From a script outside of your AI application - see the evaluations guide for an end-to-end example (a conceptual sketch follows this list).
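
Conceptually, a script-driven experiment loops over the dataset, calls your application, and records a score per example. The outline below is a plain-Python sketch with placeholder names (run_my_pipeline, exact_match); the actual client and reporting calls come from the evaluations guide.

```python
# Placeholder dataset; in practice this comes from your Hamming dataset.
dataset = [
    {"input": {"question": "What is the capital of Australia?"},
     "expected_output": {"answer": "Canberra"}},
]

def run_my_pipeline(inp: dict) -> str:
    # Call your prompt, RAG pipeline, or agent here.
    return "Canberra"

def exact_match(output: str, expected: dict) -> float:
    return 1.0 if output.strip().lower() == expected["answer"].lower() else 0.0

results = []
for row in dataset:
    output = run_my_pipeline(row["input"])
    results.append({"input": row["input"], "output": output,
                    "score": exact_match(output, row["expected_output"])})

print(sum(r["score"] for r in results) / len(results))  # aggregate score
```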

Scores

A score is a function that grades the AI output, typically by comparing it with the desired output. We support several scores out of the box:

  • Accuracy - Compares the LLM output with the desired output.
      ◦ Accurate - The LLM response is similar to the desired response.
      ◦ Slightly Inaccurate - The LLM response is close but missing a critical fact captured in the desired response.
      ◦ Completely Incorrect - The LLM response is different from or contradicts the desired response.
  • Facts Compare - A more detailed version of Accuracy with more granular buckets.
      ◦ Superset - The set of facts in the LLM response is a superset of the desired response.
      ◦ Identical - The LLM response is identical to the desired response.
      ◦ Similar - There may be stylistic differences between the two, but factually they’re equivalent.
      ◦ Subset - The set of facts in the LLM response is a subset of the desired response.
      ◦ Disagreement - There are one or more factual disagreements between the LLM response and the desired response.
  • SQL Eval - Compares the AI-generated SQL query to the desired query.
      ◦ Best - The LLM-generated query is identical to the desired query.
      ◦ Acceptable - The LLM query looks similar to the desired query.
      ◦ Incorrect - The LLM query is completely wrong.
      ◦ Undetermined - Requires a human to help classify this.
  • RAG* Context Recall - 0 to 1, higher is better: % of output sentences that exist in the context docs.
  • RAG* Context Precision - 0 to 1, higher is better: % of contexts relevant to the input (how precise is my retrieval).
  • RAG* Hallucination - 0 to 1, lower is better: % of LLM output sentences that are NOT supported by the context.

*For more information on RAG scores, see our RAG guide.
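
To make the 0-to-1 ratios concrete, here is a deliberately naive sketch of a sentence-level hallucination rate. It treats "supported by the context" as plain substring containment, which is far cruder than the real score; the RAG guide describes how these scores are actually computed.

```python
def naive_hallucination_rate(output: str, context_docs: list[str]) -> float:
    """Fraction of output sentences NOT found verbatim in any context doc
    (0 to 1, lower is better). Substring matching is only an illustration."""
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    context = " ".join(context_docs)
    unsupported = [s for s in sentences if s not in context]
    return len(unsupported) / len(sentences)

docs = ["Canberra is the capital of Australia. It is located in the ACT."]
print(naive_hallucination_rate(
    "Canberra is the capital of Australia. It has 10 million people.", docs))  # 0.5
```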

User-defined scores

You can also define custom scores using Python or TypeScript code. See our Custom Scores guide.
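
For example, a reference-based custom score could award partial credit for how many expected key phrases appear in the output. The function name and shape below are illustrative; the Custom Scores guide covers how to wire a score into Hamming.

```python
def keyword_coverage(output: str, expected_keywords: list[str]) -> float:
    """Partial-credit score: fraction of expected key phrases found in the output."""
    if not expected_keywords:
        return 1.0
    found = [kw for kw in expected_keywords if kw.lower() in output.lower()]
    return len(found) / len(expected_keywords)

print(keyword_coverage("Canberra, not Sydney, is the capital.", ["Canberra", "capital"]))  # 1.0
```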

Traces

A trace is a data record describing an event in the AI pipeline. Hamming supports different types of events, such as:

  • LLM call - contains information about the LLM provider, model and parameters
  • Document retrieval - contains information about the retrieval engine, documents retrieved and query parameters
  • Custom event - free-form event used to record any other information
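
To illustrate the kind of information each event type carries, here are hypothetical event payloads; the field names are placeholders, not the actual trace schema.

```python
# Hypothetical event payloads; real field names come from the Hamming SDK.
llm_call_event = {
    "kind": "llm_call",
    "provider": "openai",            # LLM provider
    "model": "gpt-4o",               # model name
    "params": {"temperature": 0.2},  # call parameters
}
retrieval_event = {
    "kind": "document_retrieval",
    "engine": "my-vector-store",     # retrieval engine
    "query_params": {"query": "capital of Australia", "top_k": 3},
    "documents": ["Canberra is the capital of Australia."],
}
custom_event = {
    "kind": "custom",
    "name": "cache_lookup",
    "payload": {"hit": True},        # free-form information
}
```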

Traces can be collected when running an evaluation experiment, or in real time from your AI application.