Custom Scores
Define custom scores for your AI experiments.
Hamming AI comes with pre-built scorers that cover a wide range of use cases. If you need to score your experiments in a different way, you can do so by creating a custom scorer.
Definitions
A custom scorer is a function that computes a score for a given experiment result.
Based on the function output, a custom scorer is one of:
- Classification Scorer: A function that returns a categorical value (e.g. low, medium, high).
- Numerical Scorer: A function that returns a real number.
Based on the execution environment, a custom scorer is one of:
- Local Scorer: A function that runs on the machine that executes the experiment.
- Remote Scorer: A function that runs on the Hamming AI platform. [Coming soon]
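To make the two output shapes concrete, here is a minimal sketch of one function of each kind, using the Score and ScoreArgs types that the full examples below import from the hamming SDK. The field names (output["answer"], expected["output"]) assume the tutorial dataset used later in this guide, and the logic itself is illustrative only.

from hamming import Score, ScoreArgs

# Classification-style function: returns a categorical value, here 0 or 1,
# which a ClassificationScoreConfig later maps to labels such as "Incorrect"/"Correct".
def is_exact_match(args: ScoreArgs) -> Score:
    matched = args["output"]["answer"] == args["expected"]["output"]
    return Score(value=1 if matched else 0, reason="Exact string comparison.")

# Numerical-style function: returns a real number, here a ratio in [0, 1].
def length_ratio(args: ScoreArgs) -> Score:
    out_len = len(args["output"]["answer"])
    exp_len = len(args["expected"]["output"])
    return Score(
        value=min(out_len, exp_len) / max(out_len, exp_len, 1),
        reason="Ratio of the shorter answer length to the longer one.",
    )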
Before you begin
Follow the Evaluations Guide to get familiar with running experiments on Hamming AI. You should have a dataset ID and a secret key to continue with this guide.
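The code in this guide hardcodes the secret key and dataset ID as placeholders. In practice you may prefer to read them from environment variables; a minimal sketch (the variable names are arbitrary):

import os

# Read credentials from the environment instead of hardcoding them in the script.
# The environment variable names below are arbitrary; use whatever fits your setup.
HAMMING_API_KEY = os.environ["HAMMING_API_KEY"]
HAMMING_DATASET_ID = os.environ["HAMMING_DATASET_ID"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]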
Creating a Custom Scorer (hello world) - Python
In this example, we build a simple custom scorer that checks whether the output answer and the expected answer have lengths of the same parity.
pip install hamming-sdk
- Download our Sample dataset file.
- Navigate to Create new dataset and use the drag and drop box to upload the file.
- For Input Columns, select “question” and “conversation_history”, and for Output Columns, select “output”.
- Name it “Multi-turn Dataset” and click Create.
- Copy the dataset ID by clicking the Copy ID button.
Create a file named scorer.py and add the following code:
import random

from hamming import (
    ScoringFunction,
    ClassificationScoreConfig,
    NumericScoreConfig,
    FunctionAggregateType,
    FunctionType,
    LabelColor,
    LocalScorer,
    ScoreArgs,
    Score,
)


def custom_correctness_score(args: ScoreArgs) -> Score:
    output = args["output"]
    expected = args["expected"]
    # "output" and "expected" are Dict objects
    # matching the format of a dataset item Output Json.
    output_answer = output["answer"]
    expected_answer = expected["output"]

    # Define your scoring logic here.
    if (len(output_answer) % 2) == (len(expected_answer) % 2):
        return Score(value=1, reason="The length of the answers has the same parity.")
    else:
        return Score(value=0, reason="The length of the answers has different parity.")


# 1. Classification Scorer
custom_scoring_classification = ScoringFunction(
    name="TutorialScore-Classify",
    version=1,
    score_config=ClassificationScoreConfig(
        type=FunctionType.CLASSIFICATION,
        labels={
            0: "Incorrect",
            1: "Correct",
        },
        colors={
            0: LabelColor.RED,
            1: LabelColor.GREEN,
        },
    ),
    scorer=LocalScorer(
        score_fn=custom_correctness_score,
    ),
)

# 2. Numerical Scorer
custom_scoring_numeric = ScoringFunction(
    name="TutorialScore-Numeric",
    version=1,
    score_config=NumericScoreConfig(
        aggregate=FunctionAggregateType.MEAN,
    ),
    scorer=LocalScorer(
        score_fn=lambda _: Score(value=random.random(), reason="Random number"),
    ),
)
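Before wiring the scorer into an experiment, you can sanity-check custom_correctness_score locally, for example from a Python shell or a scratch script that imports it from scorer.py. The argument below is a plain dict shaped like the ScoreArgs the function receives during a run; the values are made up for illustration.

from scorer import custom_correctness_score

# Hand-built arguments mirroring what the scorer receives during an experiment run.
sample_args = {
    "output": {"answer": "Paris is the capital of France."},
    "expected": {"output": "The capital of France is Paris."},
}
print(custom_correctness_score(sample_args))
# Prints a Score with value 1 or 0 and the matching parity reason.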
Create a file named evals.py and add the following code:
from hamming import (
    Hamming,
    ClientOptions,
    RunOptions,
    GenerationParams,
)
from openai import OpenAI

from scorer import (
    custom_scoring_classification,
    custom_scoring_numeric,
)

HAMMING_API_KEY = "<your-secret-key>"
HAMMING_DATASET_ID = "<your-dataset-id>"
OPENAI_API_KEY = "<your-openai-api-key>"

hamming = Hamming(ClientOptions(api_key=HAMMING_API_KEY))
openai_client = OpenAI(api_key=OPENAI_API_KEY)
trace = hamming.tracing


def answer_question(input):
    question = input["question"]
    print(f"Question: {question}")

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question},
        ],
    )
    answer = response.choices[0].message.content
    print(f"Answer: {answer}")

    # This makes it easier to view the LLM response in the experiment details page
    trace.log_generation(
        GenerationParams(
            input=question,
            output=answer,
            metadata=GenerationParams.Metadata(model="gpt-3.5-turbo"),
        )
    )
    return {"answer": answer}


def run():
    print("Running a custom scorer experiment..")
    result = hamming.experiments.run(
        RunOptions(
            dataset=HAMMING_DATASET_ID,
            name="Custom Scorer Experiment - Python",
            scoring=[
                # Use your custom scorers here
                custom_scoring_classification,
                custom_scoring_numeric,
            ],
            metadata={},
        ),
        answer_question,
    )
    print("Custom scorer experiment completed.")
    print(f"See the results at: {result.url}")


if __name__ == "__main__":
    run()
Run the script by executing the following command in your terminal:
python evals.py
This will create an experiment in Hamming. Navigate to the Experiments page to see the results.
Creating a Custom Scorer (multiple outputs) - Python
Sometimes you may have multiple correct outputs to compare your model’s output against. Here is an example dataset row:
{
  "question": "Can you explain the differences between quantum mechanics and classical physics?",
  "answers": [
    "Quantum mechanics differs from classical physics in several ways. For instance, it introduces the concept of wave-particle duality, where particles can exhibit both wave-like and particle-like properties. Additionally, it incorporates the principle of uncertainty, which states that certain pairs of properties, like position and momentum, cannot be simultaneously measured with arbitrary precision.",
    "Quantum mechanics introduces wave-particle duality and the uncertainty principle, which are not present in classical physics.",
    "Classical physics is deterministic, while quantum mechanics is probabilistic.",
    "Quantum mechanics requires the use of complex numbers and operators, unlike classical physics."
  ]
}
pip install hamming-sdk
- Download our Sample dataset file.
- Navigate to Create new dataset and use the drag and drop box to upload the file.
- For Input Columns, select “question”, and for Output Columns, select “answers”.
- Name it “Multiple Outputs Dataset” and click Create.
- Copy the dataset ID by clicking the Copy ID button.
Create a file named scorer.py and add the following code:
from hamming import (
    ScoringFunction,
    ClassificationScoreConfig,
    FunctionType,
    LabelColor,
    LocalScorer,
    ScoreArgs,
    Score,
)


def custom_correctness_score_multiple_outputs(args: ScoreArgs) -> Score:
    output = args["output"]
    expected = args["expected"]
    # "output" and "expected" are Dict objects
    # matching the format of a dataset item Output Json.
    output_answer = output["answer"]
    expected_answers = expected["answers"]

    # Define your scoring logic here.
    if output_answer in expected_answers:
        return Score(value=1, reason="The output answer is contained in the expected answers.")
    else:
        return Score(value=0, reason="The output answer is not contained in the expected answers.")


# 1. Classification Scorer that uses multiple outputs to score
custom_scoring_classification_multiple_outputs = ScoringFunction(
    name="TutorialScore-Classify-Multiple-Outputs",
    version=1,
    score_config=ClassificationScoreConfig(
        type=FunctionType.CLASSIFICATION,
        labels={
            0: "Incorrect",
            1: "Correct",
        },
        colors={
            0: LabelColor.RED,
            1: LabelColor.GREEN,
        },
    ),
    scorer=LocalScorer(
        score_fn=custom_correctness_score_multiple_outputs,
    ),
)
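The membership check above (output_answer in expected_answers) only passes on a verbatim match. If that is too strict for your data, one possible variation is to normalize whitespace and case on both sides before comparing. The helper below is a sketch you could add to scorer.py; it reuses the Score and ScoreArgs imports already present there.

import re

def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences do not fail the match.
    return re.sub(r"\s+", " ", text.strip().lower())

def custom_correctness_score_normalized(args: ScoreArgs) -> Score:
    output_answer = _normalize(args["output"]["answer"])
    expected_answers = [_normalize(a) for a in args["expected"]["answers"]]
    if output_answer in expected_answers:
        return Score(value=1, reason="The output matches an expected answer after normalization.")
    return Score(value=0, reason="The output does not match any expected answer after normalization.")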
Create a file named evals.py and add the following code:
from hamming import (
    Hamming,
    ClientOptions,
    RunOptions,
    GenerationParams,
)
from openai import OpenAI

from scorer import (
    custom_scoring_classification_multiple_outputs,
)

HAMMING_API_KEY = "<your-secret-key>"
HAMMING_DATASET_ID = "<your-dataset-id>"
OPENAI_API_KEY = "<your-openai-api-key>"

hamming = Hamming(ClientOptions(api_key=HAMMING_API_KEY))
trace = hamming.tracing
openai_client = OpenAI(api_key=OPENAI_API_KEY)


def answer_question(input):
    question = input["question"]
    print(f"Question: {question}")

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question},
        ],
    )
    answer = response.choices[0].message.content
    print(f"Answer: {answer}")

    # This makes it easier to view the LLM response in the experiment details page
    trace.log_generation(
        GenerationParams(
            input=question,
            output=answer,
            metadata=GenerationParams.Metadata(model="gpt-3.5-turbo"),
        )
    )
    return {"answer": answer}


def run():
    print("Running a multiple outputs evaluation experiment..")
    result = hamming.experiments.run(
        RunOptions(
            dataset=HAMMING_DATASET_ID,
            name="Custom Scorer Experiment - Python (Multiple Outputs)",
            scoring=[
                # Use your custom scorers here
                custom_scoring_classification_multiple_outputs,
            ],
            metadata={},
        ),
        answer_question,
    )
    print("Multiple outputs evaluation experiment completed.")
    print(f"See the results at: {result.url}")


if __name__ == "__main__":
    run()
Run the script by executing the following command in your terminal:
python evals.py
This will create an experiment in Hamming. Navigate to the Experiments page to see the results.
Creating a Custom Scorer (reference-free scoring) - Python
Sometimes you may have no clear expected output to compare your model’s output against. Here is an example dataset row:
{
  "question": "Can you explain the differences between quantum mechanics and classical physics? Make sure to include delve in your answer."
}
pip install hamming-sdk
- Download our Sample dataset file.
- Navigate to Create new dataset and use the drag and drop box to upload the file.
- For Input Columns, select “question”.
- Name it “Reference-Free Scoring Dataset” and click Create.
- Copy the dataset ID by clicking the Copy ID button.
Create a file named scorer.py and add the following code:
from hamming import (
    ScoringFunction,
    ClassificationScoreConfig,
    FunctionType,
    LabelColor,
    LocalScorer,
    ScoreArgs,
    Score,
)


def custom_correctness_score_reference_free(args: ScoreArgs) -> Score:
    output = args["output"]
    # "output" is a Dict object
    # matching the format of a dataset item Output Json.
    output_answer = output["answer"]

    # Define your scoring logic here.
    if "delve" in output_answer.lower():
        return Score(value=1, reason="The output answer uses 'delve'.")
    else:
        return Score(value=0, reason="The output answer does not use 'delve'.")


# 1. Classification Scorer that uses reference-free scoring to score
custom_scoring_classification_reference_free = ScoringFunction(
    name="TutorialScore-Classify-Reference-Free",
    version=1,
    score_config=ClassificationScoreConfig(
        type=FunctionType.CLASSIFICATION,
        labels={
            0: "Incorrect",
            1: "Correct",
        },
        colors={
            0: LabelColor.RED,
            1: LabelColor.GREEN,
        },
    ),
    scorer=LocalScorer(
        score_fn=custom_correctness_score_reference_free,
    ),
)
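Reference-free scoring does not have to be binary. As a sketch, you could also return a numeric value, for example the fraction of required keywords the answer mentions, and pair it with a NumericScoreConfig and FunctionAggregateType.MEAN as in the hello-world example above (you would need to import those from hamming as well). The keyword list here is purely illustrative.

REQUIRED_KEYWORDS = ["delve", "uncertainty", "wave-particle"]  # illustrative only

def keyword_coverage_score(args: ScoreArgs) -> Score:
    # Fraction of required keywords that appear in the model's answer.
    answer = args["output"]["answer"].lower()
    hits = sum(1 for keyword in REQUIRED_KEYWORDS if keyword in answer)
    coverage = hits / len(REQUIRED_KEYWORDS)
    return Score(value=coverage, reason=f"{hits}/{len(REQUIRED_KEYWORDS)} required keywords present.")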
Create a file named evals.py and add the following code:
from hamming import (
    Hamming,
    ClientOptions,
    RunOptions,
    GenerationParams,
)
from openai import OpenAI

from scorer import (
    custom_scoring_classification_reference_free,
)

HAMMING_API_KEY = "<your-secret-key>"
HAMMING_DATASET_ID = "<your-dataset-id>"
OPENAI_API_KEY = "<your-openai-api-key>"

hamming = Hamming(ClientOptions(api_key=HAMMING_API_KEY))
trace = hamming.tracing
openai_client = OpenAI(api_key=OPENAI_API_KEY)


def answer_question(input):
    question = input["question"]
    print(f"Question: {question}")

    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": question},
        ],
    )
    answer = response.choices[0].message.content
    print(f"Answer: {answer}")

    # This makes it easier to view the LLM response in the experiment details page
    trace.log_generation(
        GenerationParams(
            input=question,
            output=answer,
            metadata=GenerationParams.Metadata(model="gpt-3.5-turbo"),
        )
    )
    return {"answer": answer}


def run():
    print("Running a reference-free evaluation experiment..")
    result = hamming.experiments.run(
        RunOptions(
            dataset=HAMMING_DATASET_ID,
            name="Custom Scorer Experiment - Python (Reference-Free)",
            scoring=[
                # Use your custom scorers here
                custom_scoring_classification_reference_free,
            ],
            metadata={},
        ),
        answer_question,
    )
    print("Reference-free evaluation experiment completed.")
    print(f"See the results at: {result.url}")


if __name__ == "__main__":
    run()
Run the script by executing the following command in your terminal:
python evals.py
This will create an experiment in Hamming. Navigate to the Experiments page to see the results.