Custom Scores

Hamming AI comes with pre-built scorers that cover a wide range of use cases. If you need to score your experiments in a different way, you can do so by creating a custom scorer.

Definitions

A custom scorer is a function that computes a score of a given experiment result. Looking at the function output, we can define a custom scorer as:

Classification Scorer: A function that returns a categorical value. (e.g. low, medium, high)
Numerical Scorer: A function that returns a real number.

Based on the execution environment, we can define a custom scorer as:

Local Scorer: A function that runs on the machine that executes the experiment.
Remote Scorer: A function that runs on Hamming AI platform. [Coming soon]

Before you begin

Follow the Evaluations Guide to get familiar with running experiments on Hamming AI. You should have a dataset ID and a secret key to continue with this guide.

Creating a Custom Scorer (hello world) - Python

Simple custom scorer that scores if the length of the answers has the same parity.

1. Install Hamming SDK Python library

Create a dataset

2. Define your custom scorer

Create a file named scorer.py and add the following code:

scorer.py

import random

from hamming import (
    ScoringFunction,
    ClassificationScoreConfig,
    NumericScoreConfig,
    FunctionAggregateType,
    FunctionType,
    LabelColor,
    LocalScorer,
    ScoreArgs,
    Score
)

def custom_correctness_score(args: ScoreArgs) -> Score:
    output = args["output"]
    expected = args["expected"]

    # "output" and "expected" are Dict objects
    # matching the format of a dataset item Output Json.
    output_answer = output["answer"]
    expected_answer = expected["output"]

    # Define your scoring logic here.
    if (len(output_answer) % 2) == (len(expected_answer) % 2):
        return Score(value=1, reason="The length of the answers has the same parity.")
    else:
        return Score(value=0, reason="The length of the answers has different parity.")

# 1. Classification Scorer
custom_scoring_classification = ScoringFunction(
    name="TutorialScore-Classify",
    version=1,
    score_config=ClassificationScoreConfig(
        type=FunctionType.CLASSIFICATION,
        labels={
            0: "Incorrect",
            1: "Correct",
        },
        colors={
            0: LabelColor.RED,
            1: LabelColor.GREEN,
        },
    ),
    scorer=LocalScorer(
        score_fn=custom_correctness_score
    ),
)

# 2. Numerical Scorer
custom_scoring_numeric = ScoringFunction(
    name="TutorialScore-Numeric",
    version=1,
    score_config=NumericScoreConfig(
        aggregate=FunctionAggregateType.MEAN,
    ),
    scorer=LocalScorer(
        score_fn=lambda _: Score(value=random.random(), reason="Random number")
    ),
)

3. Create the evaluation script

Make sure to replace the placeholders with your actual keys and dataset ID created in the evaluations guide.

Create a file named evals.py and add the following code:

evals.py

from hamming import (
    Hamming,
    ClientOptions,
    RunOptions,
)
from scorer import (
    custom_scoring_classification, 
    custom_scoring_numeric
)

HAMMING_API_KEY = "<your-secret-key>"
HAMMING_DATASET_ID = "<your-dataset-id>"
OPENAI_API_KEY = "<your-openai-api-key>"

hamming = Hamming(ClientOptions(api_key=HAMMING_API_KEY))
openai_client = OpenAI(api_key=OPENAI_API_KEY)
trace = hamming.tracing

def answer_question(input):
    question = input["question"]

    print(f"Question: {question}")

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question},
        ],
    )

    answer = response.choices[0].message.content

    print(f"Answer: {answer}")

    # This makes it easier to view the LLM response in the experiment details page
    trace.log_generation(
        GenerationParams(
            input=question,
            output=answer,
            metadata=GenerationParams.Metadata(model="gpt-3.5-turbo"),
        )
    )

    return {"answer": answer}

def run():
    print("Running a custom scorer experiment..")

    result = hamming.experiments.run(
        RunOptions(
            dataset=HAMMING_DATASET_ID,
            name="Custom Scorer Experiment - Python",
            scoring=[
                # Use your custom scorers here
                custom_scoring_classification,
                custom_scoring_numeric
            ],
            metadata={},
        ),
        answer_question,
    )

    print("Custom scorer experiment completed.")
    print(f"See the results at: {result.url}")

if __name__ == "__main__":
    run()

4. Run the experiment

Creating a Custom Scorer (multiple outputs) - Python

Sometimes you may have multiple correct outputs to compare your model’s output to. See example dataset row:

{
    "question":
        "Can you explain the differences between quantum mechanics and classical physics?",
    "answers": [
        "Quantum mechanics differs from classical physics in several ways. For instance, it introduces the concept of wave-particle duality, where particles can exhibit both wave-like and particle-like properties. Additionally, it incorporates the principle of uncertainty, which states that certain pairs of properties, like position and momentum, cannot be simultaneously measured with arbitrary precision.",
        "Quantum mechanics introduces wave-particle duality and the uncertainty principle, which are not present in classical physics.",
        "Classical physics is deterministic, while quantum mechanics is probabilistic.",
        "Quantum mechanics requires the use of complex numbers and operators, unlike classical physics.",
    ],
}

1. Install Hamming SDK Python library

2. Download a dataset for multiple correct outputs

3. Define your custom scorer

Create a file named scorer.py and add the following code:

scorer.py

import random

from hamming import (
    ScoringFunction,
    ClassificationScoreConfig,
    NumericScoreConfig,
    FunctionAggregateType,
    FunctionType,
    LabelColor,
    LocalScorer,
    ScoreArgs,
    Score
)

def custom_correctness_score_multiple_outputs(args: ScoreArgs) -> Score:
    output = args["output"]
    expected = args["expected"]

    # "output" and "expected" are Dict objects
    # matching the format of a dataset item Output Json.
    output_answer = output["answer"]
    expected_answers = expected["answers"]

    # Define your scoring logic here.
    if output_answer in expected_answers:
        return Score(value=1, reason="The output answer is contained in the expected answers.")
    else:
        return Score(value=0, reason="The output answer is not contained in the expected answers.")


# 1. Classification Scorer that uses multiple outputs to score
custom_scoring_classification_multiple_outputs = ScoringFunction(
    name="TutorialScore-Classify-Multiple-Outputs",
    version=1,
    score_config=ClassificationScoreConfig(
        type=FunctionType.CLASSIFICATION,
        labels={
            0: "Incorrect",
            1: "Correct",
        },
        colors={
            0: LabelColor.RED,
            1: LabelColor.GREEN,
        },
    ),
    scorer=LocalScorer(
        score_fn=custom_correctness_score_multiple_outputs
    ),
)

3. Create the evaluation script

Make sure to replace the placeholders with your actual keys and dataset ID created in the evaluations guide.

Create a file named evals.py and add the following code:

evals.py

from hamming import (
    Hamming,
    ClientOptions,
    RunOptions,
)
from scorer import (
    custom_scoring_classification_multiple_outputs, 
)
from openai import OpenAI

HAMMING_API_KEY = "<your-secret-key>"
HAMMING_DATASET_ID = "<your-dataset-id>"
OPENAI_API_KEY = "<your-openai-api-key>"

hamming = Hamming(ClientOptions(api_key=HAMMING_API_KEY))
trace = hamming.tracing
openai_client = OpenAI(api_key=OPENAI_API_KEY)

def answer_question(input):
    question = input["question"]

    print(f"Question: {question}")

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question},
        ],
    )

    answer = response.choices[0].message.content

    print(f"Answer: {answer}")

    # This makes it easier to view the LLM response in the experiment details page
    trace.log_generation(
        GenerationParams(
            input=question,
            output=answer,
            metadata=GenerationParams.Metadata(model="gpt-3.5-turbo"),
        )
    )

    return {"answer": answer}

def run():
    print("Running a multiple outputs evaluation experiment..")

    result = hamming.experiments.run(
        RunOptions(
            dataset=HAMMING_DATASET_ID,
            name="Custom Scorer Experiment - Python (Multiple Outputs)",
            scoring=[
                # Use your custom scorers here
                custom_scoring_classification_multiple_outputs,
            ],
            metadata={},
        ),
        answer_question,
    )

    print("Multiple outputs evaluation experiment completed.")
    print(f"See the results at: {result.url}")

if __name__ == "__main__":
    run()

4. Run the experiment

Creating a Custom Scorer (reference-free scoring) - Python

Sometimes you may have no clear expected output to compare your model’s output to. See example dataset row:

{
    "question": "Can you explain the differences between quantum mechanics and classical physics? Make sure to include delve in your answer.",
}

1. Install Hamming SDK Python library

2. Download a dataset for reference-free scoring

3. Define your custom scorer

Create a file named scorer.py and add the following code:

scorer.py

import random

from hamming import (
    ScoringFunction,
    ClassificationScoreConfig,
    NumericScoreConfig,
    FunctionAggregateType,
    FunctionType,
    LabelColor,
    LocalScorer,
    ScoreArgs,
    Score
)

def custom_correctness_score_reference_free(args: ScoreArgs) -> Score:
    output = args["output"]

    # "output" is Dict object
    # matching the format of a dataset item Output Json.
    output_answer = output["answer"]

    # Define your scoring logic here.
    if 'delve' in output_answer.lower():
        return Score(value=1, reason="The output answer uses 'delve'.")
    else:
        return Score(value=0, reason="The output answer does not use 'delve'.")


# 1. Classification Scorer that uses reference-free scoring to score
custom_scoring_classification_reference_free = ScoringFunction(
    name="TutorialScore-Classify-Reference-Free",
    version=1,
    score_config=ClassificationScoreConfig(
        type=FunctionType.CLASSIFICATION,
        labels={
            0: "Incorrect",
            1: "Correct",
        },
        colors={
            0: LabelColor.RED,
            1: LabelColor.GREEN,
        },
    ),
    scorer=LocalScorer(
        score_fn=custom_correctness_score_reference_free
    ),
)

3. Create the evaluation script

Make sure to replace the placeholders with your actual keys and dataset ID created in the evaluations guide.

Create a file named evals.py and add the following code:

evals.py

from hamming import (
    Hamming,
    ClientOptions,
    RunOptions,
)
from scorer import (
    custom_scoring_classification_reference_free, 
)

HAMMING_API_KEY = "<your-secret-key>"
HAMMING_DATASET_ID = "<your-dataset-id>"
OPENAI_API_KEY = "<your-openai-api-key>"

hamming = Hamming(ClientOptions(api_key=HAMMING_API_KEY))
trace = hamming.tracing
openai_client = OpenAI(api_key=OPENAI_API_KEY)

def answer_question(input):
    question = input["question"]
    print(f"Question: {question}")

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question},
        ],
    )
    answer = response.choices[0].message.content

    print(f"Answer: {answer}")

    # This makes it easier to view the LLM response in the experiment details page
    trace.log_generation(
        GenerationParams(
            input=question,
            output=answer,
            metadata=GenerationParams.Metadata(model="gpt-3.5-turbo"),
        )
    )

    return {"answer": answer}


def run():
    print("Running a reference-free evaluation experiment..")

    result = hamming.experiments.run(
        RunOptions(
            dataset=HAMMING_DATASET_ID,
            name="Custom Scorer Experiment - Python (Reference-Free)",
            scoring=[
                # Use your custom scorers here
                custom_scoring_classification_reference_free,
            ],
            metadata={},
        ),
        answer_question,
    )

    print("Reference-free evaluation experiment completed.")
    print(f"See the results at: {result.url}")


if __name__ == "__main__":
    run()

4. Run the experiment

Get Started

Voice Agent Testing

Call Monitoring

Other Guides

Definitions

Before you begin

Creating a Custom Scorer (hello world) - Python

Creating a Custom Scorer (multiple outputs) - Python

Creating a Custom Scorer (reference-free scoring) - Python

Get Started

Voice Agent Testing

Call Monitoring

Other Guides

​Definitions

​Before you begin

​Creating a Custom Scorer (hello world) - Python

​Creating a Custom Scorer (multiple outputs) - Python

​Creating a Custom Scorer (reference-free scoring) - Python

Definitions

Before you begin

Creating a Custom Scorer (hello world) - Python

Creating a Custom Scorer (multiple outputs) - Python

Creating a Custom Scorer (reference-free scoring) - Python