# Concepts

## Datasets
A dataset is a collection of examples we want our AI application to answer correctly. Each dataset example has:
- input - fed into your AI application
- output - the answer we expect the AI to produce
- metadata - additional information for organizing examples
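The structure above can be sketched as a simple Python type. This is an illustrative shape only; the field names mirror the concepts listed here, not any specific Hamming SDK class.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetExample:
    """One example in an evaluation dataset (hypothetical shape)."""
    input: dict                                    # fed into the AI application
    output: str                                    # the answer we expect
    metadata: dict = field(default_factory=dict)   # tags for organizing examples

dataset = [
    DatasetExample(
        input={"question": "What is the capital of France?"},
        output="Paris",
        metadata={"topic": "geography"},
    ),
]
```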
## Experiments
An experiment measures the quality of your LLM or RAG pipeline on each dataset example, using one or more scores. Experiments are typically run from a script outside your AI application.
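A minimal experiment script looks something like the sketch below: run the application on every dataset example and score each result. The function names, dictionary keys, and score signature are assumptions for illustration, not the actual Hamming SDK.

```python
def run_experiment(dataset, app, score_fn):
    """Run the AI app on each example and score its output (illustrative)."""
    results = []
    for example in dataset:
        actual = app(example["input"])
        results.append({
            "input": example["input"],
            "expected": example["output"],
            "actual": actual,
            "score": score_fn(actual, example["output"]),
        })
    return results

# Toy stand-ins for an LLM pipeline and a score, just to make this runnable.
answers = {"What is the capital of France?": "Paris"}
app = lambda question: answers.get(question, "I don't know")
exact_match = lambda actual, expected: 1.0 if actual == expected else 0.0

dataset = [{"input": "What is the capital of France?", "output": "Paris"}]
results = run_experiment(dataset, app, exact_match)
```

In a real experiment, `app` would call your deployed pipeline and `score_fn` would be one of the built-in or custom scores described below.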
## Scores
A score is a function that compares the AI output with the desired output. We support several scores out of the box:
Name | Description | Result interpretation |
---|---|---|
Accuracy | Compare LLM output vs. desired output | **Accurate** - the LLM response is similar to the desired response<br>**Slightly Inaccurate** - the LLM response is close but missing a critical fact captured in the desired response<br>**Completely Incorrect** - the LLM response differs from or contradicts the desired response |
Facts Compare | A more detailed version of Accuracy with more granular buckets | **Superset** - the set of facts in the LLM response is a superset of the desired response<br>**Identical** - the LLM response is identical to the desired response<br>**Similar** - there may be stylistic differences between the two, but factually they're equivalent<br>**Subset** - the set of facts in the LLM response is a subset of the desired response<br>**Disagreement** - there are one or more factual disagreements between the LLM response and the desired response |
SQL Eval | Compare an AI-generated SQL query to the desired query | **Best** - the LLM-generated query is identical to the desired query<br>**Acceptable** - the LLM query looks similar to the desired query<br>**Incorrect** - the LLM query is completely wrong<br>**Undetermined** - requires a human to classify |
RAG* Context Recall | % of output sentences that exist in the context docs | 0 to 1, higher is better |
RAG* Context Precision | % of retrieved contexts relevant to the input (how precise the retrieval is) | 0 to 1, higher is better |
RAG* Hallucination | % of LLM output sentences NOT supported by the context | 0 to 1, lower is better |
*For more information on RAG scores, see our RAG guide.
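To make the RAG Hallucination definition concrete, here is a rough sketch of "% of output sentences not supported by the context". It uses naive sentence splitting and substring matching purely for illustration; real implementations typically use an LLM judge or embedding similarity, and this is not the actual Hamming scorer.

```python
def hallucination_rate(output: str, contexts: list[str]) -> float:
    """Fraction of output sentences with no support in the context docs.

    Naive sketch: splits on '.' and checks for literal substring support.
    """
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    joined = " ".join(contexts).lower()
    unsupported = sum(1 for s in sentences if s.lower() not in joined)
    return unsupported / len(sentences)
```

For example, an output fully contained in the context scores 0.0 (no hallucination), while an output with no overlap scores 1.0.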
### User-defined scores
You can also define custom scores using Python or TypeScript code. See our Custom Scores guide.
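As an example of the kind of function a custom score might be, here is a token-level F1 comparison between the LLM output and the desired output. The signature and return convention are illustrative assumptions, not the interface required by Hamming; see the Custom Scores guide for the actual contract.

```python
def token_f1(output: str, expected: str) -> float:
    """Token-overlap F1 between LLM output and desired output (sketch)."""
    out_tokens = set(output.lower().split())
    exp_tokens = set(expected.lower().split())
    common = out_tokens & exp_tokens
    if not common:
        return 0.0
    precision = len(common) / len(out_tokens)
    recall = len(common) / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```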
## Traces
A trace is a data record describing an event in the AI pipeline. Hamming supports different types of events, such as:
- LLM call - contains information about the LLM provider, model and parameters
- Document retrieval - contains information about the retrieval engine, documents retrieved and query parameters
- Custom event - free-form event used to record any other information
Traces can be collected when running an evaluation experiment, or in real time from your AI application.
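The event types above can be sketched as simple records, one per pipeline event. The field names here are assumptions for illustration, not the actual Hamming trace schema.

```python
import time

def llm_call_trace(provider: str, model: str, params: dict) -> dict:
    """Trace record for an LLM call (hypothetical fields)."""
    return {"kind": "llm_call", "timestamp": time.time(),
            "provider": provider, "model": model, "params": params}

def retrieval_trace(engine: str, documents: list, query_params: dict) -> dict:
    """Trace record for a document retrieval step (hypothetical fields)."""
    return {"kind": "document_retrieval", "timestamp": time.time(),
            "engine": engine, "documents": documents,
            "query_params": query_params}

traces = [
    llm_call_trace("openai", "gpt-4o", {"temperature": 0.2}),
    retrieval_trace("vector_db", ["doc-1", "doc-2"], {"top_k": 2}),
]
```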