Concepts
Datasets
A dataset is a collection of examples we want our AI application to answer correctly. Datasets are stored here.
Each dataset row has:
- input
- expected output
- metadata
These are all JSON objects.
Key | Description | Required | Examples |
---|---|---|---|
input | Used as the input into your AI application. For a RAG use case, this could be the query into your RAG pipeline. | true | {"question": "What is the capital of Australia?", "conversation_history": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]} |
expected output | What we expect the AI to answer. | false | {"answer": "Canberra"} |
metadata | Additional information for organizing examples. | false | {"category": "Geography", "difficulty": "Easy"} |
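For concreteness, a single dataset row that combines these three fields might look like the following. This is only an illustrative example built from the values in the table above; check the sample dataset file for the exact key spelling and file format.

```python
# Hypothetical dataset row following the schema above.
# "expected_output" and "metadata" are optional; key spelling here is illustrative.
dataset_row = {
    "input": {
        "question": "What is the capital of Australia?",
        "conversation_history": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris"},
        ],
    },
    "expected_output": {"answer": "Canberra"},
    "metadata": {"category": "Geography", "difficulty": "Easy"},
}
```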
For built-in scores like Facts Compare and Accuracy, the expected output is a single string.
For some applications, there is no single ‘correct’ answer. In that case, you can leave the expected output blank and instead create a custom score that evaluates the correctness of your AI output without needing an expected output. For example, to check whether the LLM response uses the correct number format, the custom score consumes only the output and runs reference-free.
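As a sketch of what such a reference-free score could look like in Python (the function name and return value here are illustrative, not the actual custom-score interface; see the Custom Scores guide for the real contract), the example below checks that dollar amounts in the output use thousands separators:

```python
import re

def number_format_score(output: str) -> float:
    """Reference-free check: do all dollar amounts in the LLM output use
    thousands separators (e.g. $1,250,000 rather than $1250000)?"""
    amounts = re.findall(r"\$\d[\d,]*", output)
    if not amounts:
        return 1.0  # nothing to check; treat as passing
    well_formatted = [a for a in amounts if re.fullmatch(r"\$\d{1,3}(,\d{3})*", a)]
    return len(well_formatted) / len(amounts)

print(number_format_score("Revenue grew to $1,250,000 from $900000."))  # 0.5
```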
For other applications, there are multiple correct answers (e.g., text-to-SQL, where several different queries can be valid). You can create a custom score that checks whether the generated SQL query matches a set of known-correct queries.
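A minimal sketch of that idea, assuming the set of accepted queries is stored in the expected output and that simple string normalization is good enough (a more robust score might execute both queries and compare their results instead):

```python
def normalize_sql(query: str) -> str:
    """Crude normalization: lowercase, collapse whitespace, drop a trailing semicolon."""
    return " ".join(query.lower().split()).rstrip(";").strip()

def sql_match_score(output_sql: str, accepted_queries: list[str]) -> bool:
    """True if the generated query matches any known-correct query after normalization."""
    normalized = normalize_sql(output_sql)
    return any(normalize_sql(q) == normalized for q in accepted_queries)

accepted = [
    "SELECT name FROM city WHERE country = 'AU';",
    "SELECT name FROM city WHERE country = 'AU' LIMIT 1;",
]
print(sql_match_score("select name\nfrom city where country = 'AU'", accepted))  # True
```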
For an example dataset, see the Sample dataset file.
Any changes to the datasets (additions, deletions, or updates) are versioned, so you can revert to a previous version if needed.
Experiments
An experiment measures the quality of your prompt, RAG pipeline, or AI agent on each dataset example, using a variety of scores. You can see the results of your experiments on the experiments page.
There are two ways to run experiments:
- From the prompt playground - See prompt playground.
- From a script outside of your AI application - see the evaluations guide for an end-to-end example.
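Conceptually, the script-based option is a loop over the dataset: call your AI application on each input, score the output, and record the results. The sketch below is a generic illustration of that loop, not the Hamming SDK API; the evaluations guide shows the real end-to-end setup.

```python
# Generic sketch of running an experiment from a script.
# `app` is your AI application (prompt, RAG pipeline, or agent) and `score_fn`
# is any score function; both are placeholders, not Hamming SDK calls.
def run_experiment(dataset_rows, app, score_fn):
    results = []
    for row in dataset_rows:
        output = app(row["input"])                             # run your AI application
        score = score_fn(output, row.get("expected_output"))   # compare with the reference
        results.append({"input": row["input"], "output": output, "score": score})
    return results
```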
Scores
A score is a function that compares the AI output with the desired output. We support several scores out of the box:
Name | Description | Result interpretation |
---|---|---|
Accuracy | Compare the LLM output with the desired output | Accurate: the LLM response is similar to the desired response. Slightly Inaccurate: the LLM response is close but is missing a critical fact captured in the desired response. Completely Incorrect: the LLM response is different from, or contradicts, the desired response. |
Facts Compare | A more detailed version of Accuracy with more granular buckets | Superset: the set of facts in the LLM response is a superset of the desired response. Identical: the LLM response is identical to the desired response. Similar: there may be stylistic differences between the two, but factually they are equivalent. Subset: the set of facts in the LLM response is a subset of the desired response. Disagreement: there are one or more factual disagreements between the LLM response and the desired response. |
SQL Eval | Compare an AI-generated SQL query to the desired query | Best: the LLM-generated query is identical to the desired query. Acceptable: the LLM query looks similar to the desired query. Incorrect: the LLM query is completely wrong. Undetermined: requires a human to classify this result. |
RAG* Context Recall | % of output sentences that exist in the context docs | 0 to 1; higher is better |
RAG* Context Precision | % of retrieved contexts relevant to the input (how precise the retrieval is) | 0 to 1; higher is better |
RAG* Hallucination | % of LLM output sentences that are NOT supported by the context | 0 to 1; lower is better |
*For more information on RAG scores, see our RAG guide.
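Concretely, each of the three RAG scores reduces to a simple ratio. A minimal sketch, assuming you have already judged which output sentences are supported by the retrieved context and which retrieved contexts are relevant to the input (in practice an LLM judge typically makes those calls):

```python
def context_recall(sentence_supported: list[bool]) -> float:
    """% of output sentences that exist in the context docs (higher is better)."""
    return sum(sentence_supported) / len(sentence_supported)

def context_precision(context_relevant: list[bool]) -> float:
    """% of retrieved contexts relevant to the input (higher is better)."""
    return sum(context_relevant) / len(context_relevant)

def hallucination(sentence_supported: list[bool]) -> float:
    """% of output sentences NOT supported by the context (lower is better)."""
    return sum(1 for s in sentence_supported if not s) / len(sentence_supported)

# Example: 4 output sentences, 3 grounded; 5 retrieved contexts, 2 relevant.
print(context_recall([True, True, True, False]))             # 0.75
print(context_precision([True, True, False, False, False]))  # 0.4
print(hallucination([True, True, True, False]))              # 0.25
```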
User-defined scores
You can also define custom scores using Python or TypeScript code. See our Custom Scores guide.
Traces
A trace is a data record describing an event in the AI pipeline. Hamming supports different types of events, such as:
- LLM call - contains information about the LLM provider, model and parameters
- Document retrieval - contains information about the retrieval engine, documents retrieved and query parameters
- Custom event - free-form event used to record any other information
Traces can be collected when running an evaluation experiment, or in real time from your AI application.
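To make the event types above concrete, here is what the traces for a single RAG request might look like. The field names and values are purely illustrative, not the exact schema Hamming records; they only show the kind of information each event type carries.

```python
# Hypothetical trace records for one RAG request (field names are illustrative).
trace = [
    {   # document retrieval event: retrieval engine, documents retrieved, query parameters
        "type": "retrieval",
        "engine": "my-vector-store",
        "query": "capital of Australia",
        "top_k": 2,
        "documents": ["doc_12", "doc_98"],
    },
    {   # LLM call event: provider, model, and parameters
        "type": "llm_call",
        "provider": "openai",
        "model": "gpt-4o",
        "params": {"temperature": 0.2},
        "output": "The capital of Australia is Canberra.",
    },
    {   # custom event: free-form record of anything else
        "type": "custom",
        "name": "cache_lookup",
        "data": {"hit": False},
    },
]
```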