A dataset is a collection of examples we want our AI application to answer correctly. Datasets are stored here.

Each dataset example has:

  • input - used as input into your AI application
  • expected output - what we expect the AI to answer
  • metadata - additional information for organizing examples

input, expected output, and metadata are all JSON objects.
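As an illustration, a single dataset example might look like the following (the field contents and names here are hypothetical, shown as a Python dict for readability):

```python
# A hypothetical dataset example: each field is a JSON object.
example = {
    # "input" is fed to your AI application
    "input": {"question": "What is the capital of France?"},
    # "expected_output" is what we expect the AI to answer
    "expected_output": {"answer": "Paris"},
    # "metadata" carries extra information for organizing examples
    "metadata": {"topic": "geography", "difficulty": "easy"},
}

print(sorted(example.keys()))
```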

See example dataset here: Sample dataset file.

Any changes to the datasets (additions, deletions, or updates) are versioned, so you can revert to a previous version if needed.


An experiment measures the quality of your prompt, RAG pipeline, or AI agent on each dataset example, using a variety of scores. You can see the results of your experiments on the experiments page.

There are two ways to run experiments:

  1. From the prompt playground - See prompt playground.
  2. From a script outside of your AI application - see the evaluations guide for an end-to-end example.
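The script-based approach boils down to looping over the dataset, calling your application, and scoring each output. Here is a minimal sketch, where `run_app` and `exact_match` are hypothetical stand-ins (the evaluations guide covers the actual SDK calls):

```python
# Hypothetical dataset of examples (see the dataset concept above).
dataset = [
    {"input": {"question": "2 + 2?"}, "expected_output": {"answer": "4"}},
    {"input": {"question": "3 + 3?"}, "expected_output": {"answer": "6"}},
]

def run_app(inp):
    # Stand-in for your AI application; replace with a real call.
    answers = {"2 + 2?": "4", "3 + 3?": "6"}
    return {"answer": answers[inp["question"]]}

def exact_match(output, expected):
    # A trivial score: 1.0 if identical, else 0.0.
    return 1.0 if output == expected else 0.0

# An experiment scores the app's output on every dataset example.
results = [
    exact_match(run_app(ex["input"]), ex["expected_output"])
    for ex in dataset
]
print(sum(results) / len(results))
```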


A score is a function that compares the AI output with the desired output. We support several scores out of the box:

Accuracy - Compares the LLM output with the desired output. Result interpretation:

  • Accurate - The LLM response is similar to the desired response
  • Slightly Inaccurate - The LLM response is close but missing a critical fact captured in the desired response
  • Completely Incorrect - The LLM response is different from or contradicts the desired response

Facts Compare - A more detailed version of Accuracy with more granular buckets. Result interpretation:

  • Superset - The set of facts in the LLM response is a superset of the desired response
  • Identical - The LLM response is identical to the desired response
  • Similar - There may be stylistic differences between the two, but factually they're equivalent
  • Subset - The set of facts in the LLM response is a subset of the desired response
  • Disagreement - There are one or more factual disagreements between the LLM response and the desired response

SQL Eval - Compares the AI-generated SQL query to the desired query. Result interpretation:

  • Best - The LLM-generated query is identical to the desired query
  • Acceptable - The LLM query looks similar to the desired query
  • Incorrect - The LLM query is completely wrong
  • Undetermined - Requires a human to help classify this

RAG* Context Recall - % of output sentences that exist in the context docs. 0 to 1, higher is better.
RAG* Context Precision - % of contexts relevant to the input (how precise the retrieval is). 0 to 1, higher is better.
RAG* Hallucination - % of LLM output sentences that are NOT supported by the context. 0 to 1, lower is better.

*For more information on RAG scores, see our RAG guide.
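As a rough illustration of how the RAG scores behave, a naive hallucination check could count output sentences with no support in the context (a sketch only: real implementations use semantic matching, not substring checks):

```python
def naive_hallucination(output_sentences, context):
    # Fraction of output sentences NOT found in the context docs.
    # Lower is better; 0.0 means every sentence is supported.
    unsupported = [s for s in output_sentences if s not in context]
    return len(unsupported) / len(output_sentences)

context = "Paris is the capital of France. It sits on the Seine."
sentences = ["Paris is the capital of France.", "Paris has 10 million cats."]
print(naive_hallucination(sentences, context))  # 0.5: one of two sentences is unsupported
```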

User-defined scores

You can also define custom scores using Python or TypeScript code. See our Custom Scores guide.
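For instance, a custom Python score might measure what fraction of the expected answer's words appear in the LLM output (the function name and signature are illustrative, not Hamming's actual API):

```python
def keyword_score(output: dict, expected: dict) -> float:
    # Hypothetical custom score: returns the fraction of words from
    # the expected answer that appear in the LLM output (1.0 = all).
    expected_words = expected["answer"].lower().split()
    text = output["answer"].lower()
    hits = [w for w in expected_words if w in text]
    return len(hits) / len(expected_words)

print(keyword_score({"answer": "The capital is Paris."},
                    {"answer": "Paris"}))  # 1.0
```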


A trace is a data record describing an event in the AI pipeline. Hamming supports different types of events, such as:

  • LLM call - contains information about the LLM provider, model and parameters
  • Document retrieval - contains information about the retrieval engine, documents retrieved and query parameters
  • Custom event - free-form event used to record any other information

Traces can be collected when running an evaluation experiment, or in real time from your AI application.
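A trace record for an LLM call event might carry fields like the following (the field names and values are illustrative, not the exact trace schema):

```python
# A hypothetical LLM-call trace event.
trace = {
    "kind": "llm_call",
    "provider": "openai",            # LLM provider
    "model": "gpt-4o",               # model used for the call
    "params": {"temperature": 0.2},  # call parameters
    "input": "What is RAG?",
    "output": "Retrieval-augmented generation ...",
    "duration_ms": 412,
}
print(trace["kind"])
```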