Multi-turn evaluation

Multi-turn evaluations measure how well a chat application or conversational AI system handles an ongoing conversation: whether it maintains context and gives coherent, relevant responses across multiple turns. The exchange below is an example of such a conversation.

Role	Content
User	I’m curious about machine learning.
Assistant	Machine learning is a subset of artificial intelligence that involves training algorithms to learn from data. What specifically would you like to know?
User	How does it differ from traditional programming?
Assistant	Traditional programming involves explicitly coding instructions, while machine learning allows the system to learn patterns from data and make decisions. Do you have a specific aspect in mind?

Setting Up Multi-turn Evaluations

We’ll create a dataset of multi-turn conversations. We’ll treat each dataset row as an individual dialogue turn, with a conversation history, current query and expected output.

Here is an example of how a multi-turn dataset might look:

input

{
    "question": "Can you explain that further?",
    "conversation_history": [
        {
            "role": "user",
            "content": "I'm curious about quantum mechanics."
        },
        {
            "role": "assistant",
            "content": "Quantum mechanics is a fundamental theory in physics. What specifically would you like to know?"
        },
        {
            "role": "user",
            "content": "How does it differ from classical physics?"
        },
        {
            "role": "assistant",
            "content": "Classical physics describes macroscopic phenomena, while quantum mechanics explains the behavior of particles at the atomic and subatomic levels. Do you have a specific aspect in mind?"
        }
    ]
}

expected output

{
    "output": "Quantum mechanics differs from classical physics in several ways. For instance, it introduces the concept of wave-particle duality, where particles can exhibit both wave-like and particle-like properties. Additionally, it incorporates the principle of uncertainty, which states that certain pairs of properties, like position and momentum, cannot be simultaneously measured with arbitrary precision."
}
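
One way to produce rows in this shape is to split a recorded conversation at each assistant reply. The sketch below is a minimal illustration and not part of the Hamming SDK: `conversation_to_rows` is a hypothetical helper, and the `input`/`output` row layout is assumed from the example above.

```python
from typing import Dict, List

Message = Dict[str, str]


def conversation_to_rows(messages: List[Message]) -> List[dict]:
    """Split one conversation into one dataset row per assistant reply.

    Hypothetical helper: the row layout mirrors the example in this guide.
    """
    rows = []
    for i, message in enumerate(messages):
        # Only assistant replies that directly follow a user turn become rows.
        if i == 0 or message["role"] != "assistant":
            continue
        previous = messages[i - 1]
        if previous["role"] != "user":
            continue
        rows.append(
            {
                "input": {
                    "question": previous["content"],
                    # Everything before the current user query is history.
                    "conversation_history": messages[: i - 1],
                },
                "output": {"output": message["content"]},
            }
        )
    return rows


conversation = [
    {"role": "user", "content": "I'm curious about machine learning."},
    {
        "role": "assistant",
        "content": "Machine learning is a subset of artificial intelligence "
        "that involves training algorithms to learn from data. "
        "What specifically would you like to know?",
    },
    {"role": "user", "content": "How does it differ from traditional programming?"},
    {
        "role": "assistant",
        "content": "Traditional programming involves explicitly coding "
        "instructions, while machine learning allows the system to learn "
        "patterns from data and make decisions.",
    },
]

rows = conversation_to_rows(conversation)
```

Applied to this four-message conversation, the helper yields two rows: the first has an empty history, and the second carries the first user/assistant exchange as its history.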

Before you begin

Follow the Evaluations Guide to get familiar with running experiments on Hamming AI. You should have a dataset ID and a secret key to continue with this guide.

Setting up a multi-turn evaluation - Node.js

Learn how to run a multi-turn evaluation experiment with our Hamming TypeScript SDK.

Install dependencies:

npm install openai

Run the script by executing the following command in your terminal:

This will create an experiment in Hamming. Once the command runs, you’ll see a link to your experiment.

Setting up a multi-turn evaluation - Python

Learn how to run a multi-turn evaluation experiment with our Hamming Python SDK.

Make sure to replace the placeholders with your actual keys and dataset ID created in the previous step.

Create a file named evals.py and add the following code:

evals.py
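# GenerationParams is used by trace.log_generation below, so it must be imported too.
from hamming import ClientOptions, GenerationParams, Hamming, RunOptions, ScoreType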
from openai import OpenAI

HAMMING_API_KEY = "<your-secret-key>"
HAMMING_DATASET_ID = "<your-dataset-id>"
OPENAI_API_KEY = "<your-openai-key>"

hamming = Hamming(ClientOptions(api_key=HAMMING_API_KEY))
openai_client = OpenAI(api_key=OPENAI_API_KEY)
trace = hamming.tracing

def answer_question(input):
    question = input["question"]
    conversation_history = input["conversation_history"]
    print(f"Question: {question}")

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            *conversation_history,
            {"role": "user", "content": question},
        ],
    )
    answer = response.choices[0].message.content

    print(f"Answer: {answer}")

    # This makes it easier to view the LLM response in the experiment details page
    trace.log_generation(
        GenerationParams(
            input=question,
            output=answer,
            metadata=GenerationParams.Metadata(model="gpt-3.5-turbo"),
        )
    )

    return {"answer": answer}


def run():
    print("Running a multi-turn evaluation experiment..")

    result = hamming.experiments.run(
        RunOptions(
            dataset=HAMMING_DATASET_ID,
            name="Multi-turn evaluation from Python SDK",
            scoring=[
                ScoreType.ACCURACY_AI,
            ],
            metadata={},
        ),
        answer_question,
    )

    print("Multi-turn evaluation experiment completed.")
    print(f"See the results at: {result.url}")


if __name__ == "__main__":
    run()

Install dependencies:

pip install openai

Run the script by executing the following command in your terminal:

python evals.py

This will create an experiment in Hamming. Navigate to the Experiments page to see the results.
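
The `answer_question` function in evals.py passes `conversation_history` straight to the chat model, so a malformed dataset row only surfaces as an API error in the middle of an experiment. The sketch below shows one possible way to check a row before sending it. `build_messages` and `ALLOWED_ROLES` are hypothetical names, not part of the Hamming or OpenAI SDKs, and the set of accepted roles is an assumption you may need to adjust.

```python
from typing import Dict, List

# Assumption: the roles a chat model is expected to accept in this guide.
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_messages(row: dict) -> List[Dict[str, str]]:
    """Validate a dataset row and return the message list for the chat model.

    Hypothetical helper: the history comes first, then the current
    question is appended as the final user turn.
    """
    history = row.get("conversation_history", [])
    question = row["question"]
    for index, message in enumerate(history):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"history[{index}] has unexpected role: {role!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"history[{index}] is missing string content")
    return [*history, {"role": "user", "content": question}]


row = {
    "question": "Can you explain that further?",
    "conversation_history": [
        {"role": "user", "content": "How does it differ from classical physics?"},
        {
            "role": "assistant",
            "content": "Classical physics describes macroscopic phenomena.",
        },
    ],
}

messages = build_messages(row)
```

The returned list could be passed as the `messages` argument in `answer_question` in place of the inline list, which keeps the validation separate from the model call.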
