Datasets

A dataset is a collection of examples that we want our AI application to answer correctly. Each dataset example has three fields, illustrated in the sketch after this list:

  • input - used as input into your AI application
  • output - what we expect the AI to answer
  • metadata - additional information for organizing examples
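
For example, a single dataset example might look like the following (a minimal sketch; the values are illustrative):

```python
# Illustrative dataset example (the values are hypothetical).
example = {
    "input": "What is our refund policy for annual plans?",
    "output": "Annual plans can be refunded within 30 days of purchase.",
    "metadata": {"topic": "billing", "difficulty": "easy"},
}
```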

Experiments

An experiment measures the quality of your LLM / RAG pipeline for each dataset example, using a variety of scores. Experiments are typically run from a script outside of your AI application.
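
In outline, an experiment script loops over the dataset, runs each input through your pipeline, and applies one or more scores to the result. The sketch below uses hypothetical run_pipeline and score functions; it is not the Hamming SDK API.

```python
# Minimal experiment loop (hypothetical helpers, not the Hamming SDK API).
def run_experiment(dataset, run_pipeline, score_fns):
    results = []
    for example in dataset:
        actual = run_pipeline(example["input"])  # call your LLM / RAG pipeline
        scores = {
            name: fn(actual, example["output"])  # compare output with the desired output
            for name, fn in score_fns.items()
        }
        results.append({"input": example["input"], "output": actual, "scores": scores})
    return results
```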

Scores

A score is a function that compares the AI output with the desired output. We support several scores out of the box:

  • Accuracy - Compares the LLM output with the desired output. Result interpretation:
      • Accurate - The LLM response is similar to the desired response
      • Slightly Inaccurate - The LLM response is close but missing a critical fact captured in the desired response
      • Completely Incorrect - The LLM response is different from or contradicts the desired response
  • Facts Compare - A more detailed version of Accuracy with more granular buckets. Result interpretation:
      • Superset - The set of facts in the LLM response is a superset of the desired response
      • Identical - The LLM response is identical to the desired response
      • Similar - There may be stylistic differences between the two, but factually they're equivalent
      • Subset - The set of facts in the LLM response is a subset of the desired response
      • Disagreement - There are one or more factual disagreements between the LLM response and the desired response
  • SQL Eval - Compares the AI-generated SQL query to the desired query. Result interpretation:
      • Best - The LLM-generated query is identical to the desired query
      • Acceptable - The LLM query looks similar to the desired query
      • Incorrect - The LLM query is completely wrong
      • Undetermined - Requires a human to help classify the result
  • RAG* Context Recall - % of output sentences that exist in the context docs. Result interpretation: 0 to 1, higher is better
  • RAG* Context Precision - % of contexts relevant to the input (how precise the retrieval is). Result interpretation: 0 to 1, higher is better
  • RAG* Hallucination - % of LLM output sentences that are NOT supported by the context. Result interpretation: 0 to 1, lower is better

*For more information on RAG scores, see our RAG guide.
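
As a rough illustration of how a RAG score is defined, the sketch below computes a naive sentence-level Context Recall. Hamming's actual scorer may work differently (for example, using an LLM or embeddings instead of substring matching); treat this only as a reading of the definition above.

```python
# Naive Context Recall: % of output sentences that appear in the context docs.
# Hamming's real scorer may use an LLM or embeddings; this substring check is illustrative only.
def context_recall(output: str, context_docs: list[str]) -> float:
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    context = " ".join(context_docs).lower()
    supported = sum(1 for s in sentences if s.lower() in context)
    return supported / len(sentences)
```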

User-defined scores

You can also define custom scores using Python or TypeScript code. See our Custom Scores guide.
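
Conceptually, a custom score is a function that takes the AI output (and usually the desired output) and returns a value. The exact signature Hamming expects is covered in the Custom Scores guide, so the Python sketch below is only a shape illustration:

```python
# Hypothetical custom score: exact match after normalizing case and whitespace.
# The signature Hamming expects is described in the Custom Scores guide.
def exact_match(output: str, expected: str) -> float:
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if normalize(output) == normalize(expected) else 0.0
```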

Traces

A trace is a data record describing an event in the AI pipeline. Hamming supports different types of events, such as:

  • LLM call - contains information about the LLM provider, model and parameters
  • Document retrieval - contains information about the retrieval engine, documents retrieved and query parameters
  • Custom event - free-form event used to record any other information

Traces can be collected while running an evaluation experiment or in real time from your AI application.
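
Conceptually, each trace is a structured record of one event. The dictionaries below sketch what an LLM call event and a document retrieval event might contain; the field names are illustrative, not the exact Hamming schema.

```python
# Illustrative trace events (field names are hypothetical, not the exact Hamming schema).
llm_call_event = {
    "kind": "llm_call",
    "provider": "openai",
    "model": "gpt-4o",
    "params": {"temperature": 0.2, "max_tokens": 512},
}

retrieval_event = {
    "kind": "document_retrieval",
    "engine": "vector_store",
    "query": {"text": "refund policy", "top_k": 5},
    "documents": ["doc_123", "doc_456"],
}
```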