Custom Code Evaluator

Custom code evaluators let you write your own evaluation logic in Python, JavaScript, or TypeScript. Your code has access to the application inputs, outputs, and the full execution trace (spans, latency, token usage, costs).

Function signature

Your code must define an evaluate function with the following signature:

from typing import Any, Dict, Union

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> Union[Dict[str, Any], float, bool]:
    ...

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| inputs | Dict[str, Any] | In batch evaluation: the testcase data (all columns). In online evaluation: the application's input from the trace. |
| outputs | Any | The application's output (string or dict). |
| trace | Dict[str, Any] | The full execution trace with spans, metrics (latency, token counts, costs), and child spans. |

Return value

The function can return one of:

  • dict — a dictionary of metrics, such as {"score": 0.8, "success": True} or {"relevance": 0.9, "tone": 0.4, "reason": "missed the greeting"}. Each key becomes a separate metric in the evaluation results. Values must be JSON-serializable. Nested dictionaries are flattened into dotted metric names.
  • float (0.0 to 1.0) — a single score where 0.0 is worst and 1.0 is best. Agenta normalizes it to {"score": <value>, "success": <value >= threshold>}.
  • bool — normalized to {"success": <value>}.
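The normalization rules above can be sketched in plain Python. This is an illustration of the described behavior, not Agenta's actual implementation: the names `normalize`, `flatten`, and `ASSUMED_THRESHOLD` are hypothetical, and the threshold value of 0.5 is an assumption made only for this example.

```python
from typing import Any, Dict, Union

# Hypothetical value for illustration; the real threshold is set in Agenta.
ASSUMED_THRESHOLD = 0.5


def flatten(metrics: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in metrics.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def normalize(
    result: Union[Dict[str, Any], float, bool],
    threshold: float = ASSUMED_THRESHOLD,
) -> Dict[str, Any]:
    """Turn any supported return value into a flat dict of metrics."""
    # Check bool before numbers: bool is a subclass of int in Python.
    if isinstance(result, bool):
        return {"success": result}
    if isinstance(result, (int, float)):
        return {"score": float(result), "success": result >= threshold}
    if isinstance(result, dict):
        return flatten(result)
    raise TypeError(f"unsupported return type: {type(result).__name__}")
```

Under these assumptions, `normalize(0.8)` gives a `score` and a `success` metric, while `normalize({"quality": {"tone": 0.4}})` gives a single metric named `quality.tone`.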

Examples

Exact match

from typing import Dict, Any

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> dict:
    success = outputs == inputs.get("correct_answer")
    return {
        "score": 1.0 if success else 0.0,
        "success": success,
    }

Multiple metrics

Return a dict to report several metrics from one evaluator. Each key shows up as its own column in the results:

from typing import Dict, Any

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> dict:
    text = outputs if isinstance(outputs, str) else str(outputs)
    return {
        "length_ok": len(text) <= 280,
        "mentions_brand": "Agenta" in text,
        "word_count": float(len(text.split())),
    }

Latency check

Use trace data to check whether the response was generated within a time budget:

from typing import Dict, Any

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    if not trace or not trace.get("spans"):
        return 0.0

    root = list(trace["spans"].values())[0]
    ag = root.get("attributes", {}).get("ag", {})
    duration = ag.get("metrics", {}).get("unit", {}).get("duration", {}).get("total", 0)

    # Fail if response took more than 5 seconds
    if duration > 5.0:
        return 0.0
    return 1.0

Token budget check

Verify the application stayed within a token budget:

from typing import Dict, Any

def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> float:
    if not trace or not trace.get("spans"):
        return 0.5

    root = list(trace["spans"].values())[0]
    ag = root.get("attributes", {}).get("ag", {})
    tokens = ag.get("metrics", {}).get("unit", {}).get("tokens", {})
    total_tokens = tokens.get("total", 0)

    max_tokens = 500
    if total_tokens <= max_tokens:
        return 1.0
    elif total_tokens <= max_tokens * 1.5:
        return 0.5
    return 0.0

Accessing ground truth

In batch evaluation, testcase columns are available directly in inputs. If your testset has a correct_answer column, access it as inputs["correct_answer"] or inputs.get("correct_answer").

You do not need to configure a separate "correct answer key" — just read the column name directly from inputs.
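As a minimal sketch of reading ground truth from inputs, the evaluator below compares the output to a `correct_answer` column while ignoring case and surrounding whitespace. The column name follows the example above; the `reason` metric and the choice to fail when the column is missing are assumptions made for this illustration.

```python
from typing import Any, Dict


def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> dict:
    expected = inputs.get("correct_answer")
    if expected is None:
        # Assumed policy: a testcase without ground truth counts as a failure.
        return {"score": 0.0, "success": False, "reason": "no correct_answer column"}

    # outputs may be a string or a dict, so coerce before comparing.
    actual = outputs if isinstance(outputs, str) else str(outputs)
    success = actual.strip().lower() == str(expected).strip().lower()
    return {"score": 1.0 if success else 0.0, "success": success}
```

Using `inputs.get(...)` rather than `inputs[...]` keeps the evaluator from raising a `KeyError` in online evaluation, where inputs come from the trace and may not include the column.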

Accessing trace data

The trace parameter contains the full OpenTelemetry trace serialized as a dict. The structure looks like:

{
  "spans": {
    "<span_id>": {
      "name": "my_app",
      "start_time": "2025-01-15T10:30:00Z",
      "end_time": "2025-01-15T10:30:02.5Z",
      "status_code": "OK",
      "attributes": {
        "ag": {
          "data": {
            "inputs": {"country": "France"},
            "outputs": "The capital is Paris"
          },
          "metrics": {
            "unit": {
              "costs": {"total": 0.001},
              "tokens": {"prompt": 50, "completion": 20, "total": 70},
              "duration": {"total": 2.5}
            }
          }
        }
      },
      "children": [...]
    }
  }
}

Useful paths:

| Data | Path |
| --- | --- |
| Root span | list(trace["spans"].values())[0] |
| App inputs | root["attributes"]["ag"]["data"]["inputs"] |
| App outputs | root["attributes"]["ag"]["data"]["outputs"] |
| Latency (seconds) | root["attributes"]["ag"]["metrics"]["unit"]["duration"]["total"] |
| Token counts | root["attributes"]["ag"]["metrics"]["unit"]["tokens"] |
| Costs | root["attributes"]["ag"]["metrics"]["unit"]["costs"]["total"] |
| Child spans | root["children"] |
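Because any of these keys may be missing from a given trace, it can help to wrap the lookups in a small helper. The sketch below is one possible approach: `dig` and `walk_spans` are hypothetical helper names, the 0.01 cost budget is an arbitrary example value, and the code assumes each entry in `children` is a span dict with the same shape as the root.

```python
from typing import Any, Dict, Iterator


def dig(data: Any, *keys: str, default: Any = None) -> Any:
    """Follow a chain of dict keys, returning default if any level is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


def walk_spans(span: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a span, then all of its descendants, depth-first."""
    yield span
    # Assumption: children are span dicts shaped like the root span.
    for child in span.get("children") or []:
        yield from walk_spans(child)


def evaluate(inputs: Dict[str, Any], outputs: Any, trace: Dict[str, Any]) -> dict:
    spans = (trace or {}).get("spans") or {}
    if not spans:
        return {"success": False, "reason": "no trace"}

    root = next(iter(spans.values()))
    cost = dig(root, "attributes", "ag", "metrics", "unit", "costs", "total", default=0.0)
    span_count = sum(1 for _ in walk_spans(root))

    return {
        "cost": float(cost),
        "span_count": float(span_count),
        "success": cost <= 0.01,  # hypothetical cost budget
    }
```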

JavaScript and TypeScript

The same interface is available in JavaScript and TypeScript:

JavaScript:

function evaluate(inputs, outputs, trace) {
  const success = outputs === inputs.correct_answer
  return {score: success ? 1.0 : 0.0, success: success}
}

TypeScript:

function evaluate(
  inputs: Record<string, any>,
  outputs: any,
  trace: Record<string, any>
): {score: number; success: boolean} {
  const success = outputs === inputs.correct_answer
  return {score: success ? 1.0 : 0.0, success: success}
}

Legacy interfaces

Existing evaluators keep working unchanged. There are two older interfaces:

The original 4-parameter interface:

def evaluate(app_params, inputs, output, correct_answer) -> float:

And the float-only 3-parameter interface:

def evaluate(inputs, outputs, trace) -> float:

Evaluators created with either interface continue to return a single score. Dict returns are only supported for evaluators created after this update.

To migrate an old evaluator, create a new code evaluator and copy your logic over. Use the (inputs, outputs, trace) signature, read ground truth directly from inputs, and return a dict of metrics.
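A migration might look like the following before-and-after sketch. The legacy function is named `evaluate_legacy` here only so both versions fit in one snippet; in Agenta each evaluator defines a single `evaluate` function. The exact-match logic is an assumed example, not taken from any particular evaluator.

```python
from typing import Any, Dict


# Before: legacy 4-parameter interface, ground truth passed as an argument.
def evaluate_legacy(app_params, inputs, output, correct_answer) -> float:
    return 1.0 if output == correct_answer else 0.0


# After: ground truth is read from inputs and the result is a dict of metrics.
def evaluate(
    inputs: Dict[str, Any],
    outputs: Any,
    trace: Dict[str, Any],
) -> dict:
    success = outputs == inputs.get("correct_answer")
    return {"score": 1.0 if success else 0.0, "success": success}
```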