Key | Description | Required | Examples |
---|---|---|---|
input | Inputs passed to your AI application. For a RAG use case, this could be the query sent to your RAG pipeline. | true | `{"question": "What is the capital of Australia?", "conversation_history": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]}` |
expected output | The answer we expect the AI to give. | false | `{"answer": "Canberra"}` |
metadata | Additional information for organizing examples. | false | `{"category": "Geography", "difficulty": "Easy"}` |
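As a rough illustration of the schema in the table above, the sketch below checks a single dataset example against the three keys. The function name `validate_example`, the exact key spelling `"expected output"`, and the choice to flag keys not listed in the table are assumptions made for this example, not part of any documented API.

```python
from typing import Any, Dict, List

# Keys taken from the table above; the exact spelling used by the
# platform (e.g. with an underscore) is an assumption.
REQUIRED_KEYS = {"input"}
OPTIONAL_KEYS = {"expected output", "metadata"}


def validate_example(example: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the example fits the table."""
    present = set(example)
    problems = []
    for key in sorted(REQUIRED_KEYS - present):
        problems.append(f"missing required key: {key!r}")
    # Assumption: keys not listed in the table are reported, not silently accepted.
    for key in sorted(present - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append(f"unexpected key: {key!r}")
    return problems


example = {
    "input": {"question": "What is the capital of Australia?"},
    "expected output": {"answer": "Canberra"},
    "metadata": {"category": "Geography", "difficulty": "Easy"},
}
print(validate_example(example))
```

Only `input` is required, so an example that carries just a question and no expected answer still passes this check.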

Name | Description | Result interpretation |
---|---|---|
Accuracy | Compares the LLM output with the desired output | Accurate - The LLM response is similar to the desired response<br>Slightly Inaccurate - The LLM response is close but missing a critical fact captured in the desired response<br>Completely Incorrect - The LLM response is different from or contradicts the desired response |
Facts Compare | A more detailed version of Accuracy with more granular buckets | Superset - The set of facts in the LLM response is a superset of the desired response<br>Identical - The LLM response is identical to the desired response<br>Similar - There may be stylistic differences between the two, but factually they’re equivalent<br>Subset - The set of facts in the LLM response is a subset of the desired response<br>Disagreement - There are one or more factual disagreements between the LLM response and the desired response |
SQL Eval | Compares the AI-generated SQL query with the desired query | Best - The LLM-generated query is identical to the desired query<br>Acceptable - The LLM query looks similar to the desired query<br>Incorrect - The LLM query is completely wrong<br>Undetermined - Requires a human to help classify this |
RAG* Context Recall | Fraction of output sentences that exist in the context docs | 0 to 1, higher is better |
RAG* Context Precision | Fraction of contexts relevant to the input (how precise the retrieval is) | 0 to 1, higher is better |
RAG* Hallucination | Fraction of LLM output sentences that are NOT supported by the context | 0 to 1, lower is better |
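To make the 0-to-1 scales concrete, here is a minimal sketch of how two of the RAG scores above could be computed as simple ratios. It is not the evaluators' actual implementation: the helper names, the naive sentence splitter, and the substring-based support check are all hypothetical stand-ins for what would normally be an LLM or human judgment.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def naive_supported(sentence: str, contexts: List[str]) -> bool:
    """Toy support check (assumption): the sentence appears verbatim in a context."""
    needle = sentence.rstrip(".!?").lower()
    return any(needle in c.lower() for c in contexts)


def hallucination_score(
    output: str,
    contexts: List[str],
    is_supported: Callable[[str, List[str]], bool] = naive_supported,
) -> float:
    """Fraction of output sentences NOT supported by the context (lower is better)."""
    sentences = split_sentences(output)
    if not sentences:
        return 0.0
    unsupported = sum(1 for s in sentences if not is_supported(s, contexts))
    return unsupported / len(sentences)


def context_precision(contexts: List[str], is_relevant: Callable[[str], bool]) -> float:
    """Fraction of retrieved contexts judged relevant to the input (higher is better)."""
    if not contexts:
        return 0.0
    return sum(1 for c in contexts if is_relevant(c)) / len(contexts)


contexts = [
    "Canberra is the capital of Australia.",
    "Sydney is the largest city in Australia.",
]
output = "Canberra is the capital of Australia. It was founded in 1913."

print(hallucination_score(output, contexts))
print(context_precision(contexts, lambda c: "capital" in c.lower()))
```

Passing the support and relevance checks in as callables keeps the ratio arithmetic separate from the judgment step, so the toy checks can be swapped for a stronger judge without changing how the scores are aggregated.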