
Datasets

A dataset is a collection of examples we want our AI application to answer correctly. Datasets are stored here.

Each dataset row has:

  • input
  • expected output
  • metadata

These are all JSON objects.

| Field | Description | Required | Example |
| --- | --- | --- | --- |
| input | Used as the input to your AI application. For a RAG use case, this could be the query sent to your RAG pipeline. | true | {"question": "What is the capital of Australia?", "conversation_history": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]} |
| expected output | What we expect the AI to answer. | false | {"answer": "Canberra"} |
| metadata | Additional information for organizing examples. | false | {"category": "Geography", "difficulty": "Easy"} |
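To make the row structure concrete, here is a minimal Python sketch of a single row built from the example values above. The `DatasetRow` class, its field names, and `to_json` are illustrative assumptions, not part of the Hamming SDK. The sketch only encodes what the field descriptions state: `input` is required, the other two fields are optional, and every field is a JSON object.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DatasetRow:
    """Hypothetical container for one dataset row (not an SDK class)."""

    input: Dict[str, Any]                                   # required
    expected_output: Optional[Dict[str, Any]] = None        # optional
    metadata: Dict[str, Any] = field(default_factory=dict)  # optional

    def __post_init__(self) -> None:
        # Every field is a JSON object, so it must be a dict when present.
        if not isinstance(self.input, dict):
            raise ValueError("input is required and must be a JSON object")
        if self.expected_output is not None and not isinstance(self.expected_output, dict):
            raise ValueError("expected_output must be a JSON object or None")
        if not isinstance(self.metadata, dict):
            raise ValueError("metadata must be a JSON object")

    def to_json(self) -> str:
        """Serialize the row to a JSON string."""
        return json.dumps(
            {
                "input": self.input,
                "expected_output": self.expected_output,
                "metadata": self.metadata,
            }
        )


row = DatasetRow(
    input={
        "question": "What is the capital of Australia?",
        "conversation_history": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris"},
        ],
    },
    expected_output={"answer": "Canberra"},
    metadata={"category": "Geography", "difficulty": "Easy"},
)
print(row.to_json())
```

A row with no reference answer can be written as `DatasetRow(input={...})`, which leaves `expected_output` as `None`.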

For built-in scores like Facts Compare and Accuracy, the expected output is a single string.

For some applications, there is no ‘correct’ answer. In this case, you can leave the expected output blank and create a custom score that judges the AI output without needing an expected output. For example, to check whether the LLM response uses the correct number format, the custom score would consume only the output and run reference-free.
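As a rough illustration of a reference-free score, the sketch below checks only the output text. The function name, its signature, and the target format (thousands separators with two decimals, such as `1,234.56`) are assumptions made for this example; the actual custom score interface is described in the Custom Scores guide.

```python
import re

# Assumed target format for this sketch: thousands separators and exactly
# two decimal places, e.g. "1,234.56". Adjust the pattern to your own rules.
_NUMBER_FORMAT = re.compile(r"\b\d{1,3}(?:,\d{3})*\.\d{2}\b")


def number_format_score(output: str) -> float:
    """Return 1.0 if the output contains a number in the assumed format, else 0.0.

    No expected output is consulted, so the score is reference-free.
    """
    return 1.0 if _NUMBER_FORMAT.search(output) else 0.0


print(number_format_score("The total is 1,234.56 dollars."))
print(number_format_score("The total is 1234.5 dollars."))
```

The first call scores 1.0 and the second scores 0.0, because `1234.5` has neither a thousands separator nor two decimals.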

For other applications, there are multiple correct answers (e.g. text-to-SQL, where several queries are valid). You can create a custom score to determine whether the SQL query matches one of a set of correct queries.
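One simple way to score against several valid answers is to normalize the generated query and check membership in a set of accepted queries. The sketch below assumes this purely textual approach; `normalize_sql` and `matches_any_query` are hypothetical helper names, and a production score would more likely compare parsed queries or execution results.

```python
from typing import Iterable


def normalize_sql(query: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase.

    Lowercasing also changes string literals, so this is only a rough
    textual normalization, not a semantic comparison.
    """
    return " ".join(query.strip().rstrip(";").split()).lower()


def matches_any_query(output_sql: str, accepted_queries: Iterable[str]) -> float:
    """Return 1.0 if the output matches any accepted query after normalization."""
    accepted = {normalize_sql(q) for q in accepted_queries}
    return 1.0 if normalize_sql(output_sql) in accepted else 0.0


accepted = [
    "SELECT name FROM users WHERE id = 1",
    "SELECT users.name FROM users WHERE users.id = 1",
]
print(matches_any_query("select name\nfrom users where id = 1;", accepted))
print(matches_any_query("SELECT email FROM users WHERE id = 1", accepted))
```

The first query matches despite differences in case, line breaks, and the trailing semicolon; the second selects a different column and does not match.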

For an example dataset, see: Sample dataset file.

Any changes to the datasets (additions, deletions, or updates) are versioned, so you can revert to a previous version if needed.


Experiments

An experiment measures the quality of your prompt, RAG pipeline, or AI agent on each dataset example, using a variety of scores. You can see the results of your experiments on the experiments page.

There are two ways to run experiments:

  1. From the prompt playground - see prompt playground.
  2. From a script outside of your AI application - see the evaluations guide for an end-to-end example.
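Conceptually, an experiment run from a script loops over the dataset, calls the AI application on each input, and applies every score to the output. The sketch below shows that loop in plain Python under stated assumptions: `run_experiment`, `exact_match`, and `toy_app` are hypothetical names, and the code does not use the Hamming SDK. See the evaluations guide for the real end-to-end flow.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
ScoreFn = Callable[[str, Row], float]


def run_experiment(
    dataset: List[Row],
    app: Callable[[Dict[str, Any]], str],
    scores: Dict[str, ScoreFn],
) -> List[Dict[str, Any]]:
    """Run the app on every row and apply every score to its output."""
    results = []
    for row in dataset:
        output = app(row["input"])
        results.append(
            {
                "input": row["input"],
                "output": output,
                "scores": {name: fn(output, row) for name, fn in scores.items()},
            }
        )
    return results


def exact_match(output: str, row: Row) -> float:
    """Toy score: case-insensitive equality with the expected answer."""
    expected = (row.get("expected_output") or {}).get("answer", "")
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def toy_app(inputs: Dict[str, Any]) -> str:
    """Stand-in for a real prompt, RAG pipeline, or agent."""
    answers = {"What is the capital of Australia?": "Canberra"}
    return answers.get(inputs["question"], "I don't know")


dataset = [
    {"input": {"question": "What is the capital of Australia?"},
     "expected_output": {"answer": "Canberra"}},
    {"input": {"question": "What is the capital of Canada?"},
     "expected_output": {"answer": "Ottawa"}},
]

results = run_experiment(dataset, toy_app, {"exact_match": exact_match})
average = sum(r["scores"]["exact_match"] for r in results) / len(results)
print(average)
```

Here the toy application answers one of the two questions correctly, so the average `exact_match` score is 0.5.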


Scores

A score is a function that compares the AI output with the desired output. We support several scores out of the box:

| Name | Description | Result interpretation |
| --- | --- | --- |
| Accuracy | Compares the LLM output with the desired output | Accurate - The LLM response is similar to the desired response |
| | | Slightly Inaccurate - The LLM response is close but missing a critical fact captured in the desired response |
| | | Completely Incorrect - The LLM response is different from or contradicts the desired response |
| Facts Compare | A more detailed version of Accuracy with more granular buckets | Superset - The set of facts in the LLM response is a superset of the desired response |
| | | Identical - The LLM response is identical to the desired response |
| | | Similar - There may be stylistic differences between the two, but factually they’re equivalent |
| | | Subset - The set of facts in the LLM response is a subset of the desired response |
| | | Disagreement - There are one or more factual disagreements between the LLM response and the desired response |
| SQL Eval | Compares the AI-generated SQL query with the desired query | Best - The LLM-generated query is identical to the desired query |
| | | Acceptable - The LLM query looks similar to the desired query |
| | | Incorrect - The LLM query is completely wrong |
| | | Undetermined - Requires a human to help classify this |
| RAG* Context Recall | % of output sentences that exist in the context docs | 0 to 1, higher is better |
| RAG* Context Precision | % of contexts relevant to the input (how precise the retrieval is) | 0 to 1, higher is better |
| RAG* Hallucination | % of LLM output sentences that are NOT supported by the context | 0 to 1, lower is better |

*For more information on RAG scores, see our RAG guide.
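The arithmetic behind the three RAG scores described above can be sketched as simple ratios. Everything in this example is an assumption made for illustration: the function names are hypothetical, and `naive_is_supported` uses a verbatim substring check as a stand-in for whatever judgment the built-in scores actually apply (which this page does not specify).

```python
from typing import Callable, Sequence

SupportFn = Callable[[str, Sequence[str]], bool]


def naive_is_supported(sentence: str, contexts: Sequence[str]) -> bool:
    """Toy judgment: supported if the sentence appears verbatim in a context doc."""
    needle = sentence.strip().lower()
    return any(needle in doc.lower() for doc in contexts)


def context_recall(
    output_sentences: Sequence[str],
    contexts: Sequence[str],
    is_supported: SupportFn = naive_is_supported,
) -> float:
    """Fraction of output sentences found in the context docs (higher is better)."""
    if not output_sentences:
        return 0.0
    supported = sum(1 for s in output_sentences if is_supported(s, contexts))
    return supported / len(output_sentences)


def hallucination(
    output_sentences: Sequence[str],
    contexts: Sequence[str],
    is_supported: SupportFn = naive_is_supported,
) -> float:
    """Fraction of output sentences NOT supported by the context (lower is better)."""
    if not output_sentences:
        return 0.0
    unsupported = sum(1 for s in output_sentences if not is_supported(s, contexts))
    return unsupported / len(output_sentences)


def context_precision(relevant_flags: Sequence[bool]) -> float:
    """Fraction of retrieved contexts judged relevant to the input (higher is better)."""
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


contexts = [
    "Canberra is the capital of Australia. It was founded in 1913.",
    "Sydney is the largest city in Australia.",
]
output_sentences = [
    "Canberra is the capital of Australia.",
    "It was founded in 1913.",
    "Sydney is the largest city in Australia.",
    "Canberra has a population of ten million.",  # deliberately unsupported
]

print(context_recall(output_sentences, contexts))
print(hallucination(output_sentences, contexts))
# Relevance flags are supplied by hand here; only the first context
# answers a question about the capital.
print(context_precision([True, False]))
```

With three of four output sentences found in the context, this toy run gives a context recall of 0.75 and a hallucination score of 0.25; one relevant context out of two gives a context precision of 0.5.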

User-defined scores

You can also define custom scores using Python or TypeScript code. See our Custom Scores guide.


Traces

A trace is a data record describing an event in the AI pipeline. Hamming supports different types of events, such as:

  • LLM call - contains information about the LLM provider, model and parameters
  • Document retrieval - contains information about the retrieval engine, documents retrieved and query parameters
  • Custom event - free-form event used to record any other information

Traces can be collected while running an evaluation experiment, or in real time from your AI application.
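To picture what such records might look like, here is a minimal Python sketch with one event of each type listed above. The `TraceEvent` class, the event kind strings, and the payload keys are hypothetical and chosen only to mirror the descriptions on this page; they are not the Hamming trace schema.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Event kinds mirroring the three types described above (names are assumed).
EVENT_KINDS = {"llm_call", "document_retrieval", "custom"}


@dataclass
class TraceEvent:
    """Hypothetical record for one event in the AI pipeline."""

    kind: str
    payload: Dict[str, Any]
    timestamp: float = field(default_factory=time.time)

    def __post_init__(self) -> None:
        if self.kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {self.kind!r}")


trace: List[TraceEvent] = [
    TraceEvent(
        "document_retrieval",
        {
            "engine": "example-vector-db",  # placeholder engine name
            "query": "capital of Australia",
            "documents": ["Canberra is the capital of Australia."],
        },
    ),
    TraceEvent(
        "llm_call",
        {
            "provider": "example-provider",  # placeholder provider name
            "model": "example-model",        # placeholder model name
            "parameters": {"temperature": 0.0},
        },
    ),
    TraceEvent("custom", {"note": "cache miss"}),
]

print([event.kind for event in trace])
```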