Write evals

An Eval() statement logs results to a Braintrust project. (Note: you can have multiple eval statements for one project and/or multiple eval statements in one file.)

The first argument is the name of the project, and the second argument is an object with the following properties:

data, a function that returns an evaluation dataset: a list of inputs, expected outputs (optional), and metadata
task, a function that takes a single input and returns an output (usually an LLM completion)
scores, a set of scoring functions that take an input, output, and expected output (optional) and return a score
metadata about the experiment, like the model you're using or configuration values

The return value of Eval() includes the full results of the eval as well as a summary that you can use to see the average scores, duration, improvements, regressions, and other metrics.

basic.eval.ts

import { Eval } from "braintrust";
import { Factuality } from "autoevals";
 
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: () => {
      return [
        {
          input: "David",
          expected: "Hi David",
        },
      ]; // Replace with your eval dataset
    },
    task: (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Factuality],
  },
);

Data

An evaluation dataset is a list of test cases. Each has an input and optional expected output, metadata, and tags. The key fields in a data record are:

Input: The arguments that uniquely define a test case (an arbitrary, JSON serializable object). Braintrust uses the input to know whether two test cases are the same between evaluation runs, so the cases should not contain run-specific state. A simple rule of thumb is that if you run the same eval twice, the input should be identical.
Expected. (Optional) the ground truth value (an arbitrary, JSON serializable object) that you'd compare to output to determine if your output value is correct or not. Braintrust currently does not compare output to expected for you, since there are many different ways to do that correctly. For example, you may use a subfield in expected to compare to a subfield in output for a certain scoring function. Instead, these values are just used to help you navigate your evals while debugging and comparing results.
Metadata. (Optional) a dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log the prompt, example's id, model parameters, or anything else that would be useful to slice/dice later.
Tags. (Optional) a list of strings that you can use to filter and group records later.

To get started with evals, you need some test data. A fine starting point is to write 5-10 examples that you believe are representative. The data must have an input field (which could be complex JSON, or just a string) and should ideally have an expected output field, (although this is not required).

Once you have an evaluation set up end-to-end, you can always add more test cases. You'll know you need more data if your eval scores and outputs seem fine, but your production app doesn't look right. And once you have Braintrust's Logging set up, your real application data will provide a rich source of examples to use as test cases.

As you scale, Braintrust's Datasets are a great tool for managing your test cases.

It's a common misconception that you need a large volume of perfectly labeled evaluation data, but that's not the case. In practice, it's better to assume your data is noisy, your AI model is imperfect, and your scoring methods are a little bit wrong. The goal of evaluation is to assess each of these components and improve them over time.

Scorers

A scoring function allows you to compare the expected output of a task to the actual output and produce a score between 0 and 1. You use a scoring function by referencing it in the scores array in your eval.

We recommend starting with the scorers provided by Braintrust's autoevals library. They work out of the box and will get you up and running quickly. Just like with test cases, once you begin running evaluations, you will find areas that need improvement. This will lead you create your own scorers, customized to your usecases, to get a well rounded view of your application's performance.

Define your own scorers

You can define your own score, e.g.

import { Eval } from "braintrust";
import { Factuality } from "autoevals";
 
const exactMatch = (args: { input; output; expected? }) => {
  return {
    name: "Exact match",
    score: args.output === args.expected ? 1 : 0,
  };
};
 
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: () => {
      return [
        {
          input: "David",
          expected: "Hi David",
        },
      ]; // Replace with your eval dataset
    },
    task: (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Factuality, exactMatch],
  }
);

Score using AI

You can also define your own prompt-based scoring functions. For example,

import { Eval } from "braintrust";
import { LLMClassifierFromTemplate } from "autoevals";
 
const noApology = LLMClassifierFromTemplate({
  name: "No apology",
  promptTemplate: "Does the response contain an apology? (Y/N)\n\n{{output}}",
  choiceScores: {
    Y: 0,
    N: 1,
  },
  useCoT: true,
});
 
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: () => {
      return [
        {
          input: "David",
          expected: "Hi David",
        },
      ]; // Replace with your eval dataset
    },
    task: (input) => {
      return "Sorry " + input; // Replace with your LLM call
    },
    scores: [noApology],
  },
);

Conditional scoring

Sometimes, the scoring function(s) you want to use depend on the input data. For example, if you're evaluating a chatbot, you might want to use a scoring function that measures whether calculator-style inputs are correctly answered.

Skip scorers

Return null/None to skip a scorer for a particular test case.

calculator.eval.ts

import { NumericDiff } from "autoevals";
 
const calculatorAccuracy = ({ input, output }) => {
  if (input.type !== "calculator") {
    return null;
  }
  return NumericDiff({ output, expected: calculatorTool(input.text) });
};
 
...

Scores with null/None values will be ignored when computing the overall score, improvements/regressions, and summary metrics like standard deviation.

List of scorers

You can also return a list of scorers from a scorer function. This allows you to dynamically generate scores based on the input data, or even combine scores together into a single score. When you return a list of scores, you must return a Score object, which has a name and a score field.

calculator_accuracy.eval.ts

const calculatorAccuracy = ({ input, output }) => {
  if (input.type !== "calculator") {
    return null;
  }
  return [
    {
      name: "Numeric diff",
      score: NumericDiff({ output, expected: calculatorTool(input.text) }),
    },
    {
      name: "Exact match",
      score: output === calculatorTool(input.text) ? 1 : 0,
    },
  ];
};
 
...

Scorers with additional fields

Certain scorers, like ClosedQA, allow additional fields to be passed in. You can pass them in by initializing them with .partial(...).

closed_q_a.eval.ts

import { Eval, wrapOpenAI } from "braintrust";
import { ClosedQA } from "autoevals";
import { OpenAI } from "openai";
 
Eval("QA bot", {
  data: () => [
    {
      input: "Which insect has the highest population?",
      expected: "ant",
    },
  ],
  task: async (input) => {
    const response = await openai.chat.completions.create({
      model: "gpt-3.5-turbo",
      messages: [
        {
          role: "system",
          content:
            "Answer the following question. Specify how confident you are (or not)",
        },
        { role: "user", content: "Question: " + input },
      ],
    });
    return response.choices[0].message.content || "Unknown";
  },
  scores: [ClosedQA.partial(criteria="Does the submission specify whether or not it can confidently answer the question?")],
})

This approach works well if the criteria is static, but if the criteria is dynamic, you can pass them in via a wrapper function, e.g.

closed_q_a.eval.ts

import { Eval, wrapOpenAI } from "braintrust";
import { ClosedQA } from "autoevals";
import { OpenAI } from "openai";
 
const openai = wrapOpenAI(new OpenAI());
 
interface Metadata {
  criteria: string;
}
 
const closedQA = (args: { input: string; output: string; metadata: Metadata }) => {
  return ClosedQA({
    input: args.input,
    output: args.output,
    criteria: args.metadata.criteria,
  });
};
 
Eval("QA bot", {
  data: () => [
    {
      input: "Which insect has the highest population?",
      expected: "ant",
      metadata: {
        criteria: "Does the submission specify whether or not it can confidently answer the question?",
      },
    },
  ],
  task: async (input) => {
    const response = await openai.chat.completions.create({
      model: "gpt-3.5-turbo",
      messages: [
        {
          role: "system",
          content:
            "Answer the following question. Specify how confident you are (or not)",
        },
        { role: "user", content: "Question: " + input },
      ],
    });
    return response.choices[0].message.content || "Unknown";
  },
  scores: [closedQA],
});

Trials

It is often useful to run each input in an evaluation multiple times, to get a sense of the variance in responses and get a more robust overall score. Braintrust supports trials as a first-class concept, allowing you to run each input multiple times. Behind the scenes, Braintrust will intelligently aggregate the results by bucketing test cases with the same input value and computing summary statistics for each bucket.

To enable trials, add a trialCount/trial_count property to your evaluation:

trials.eval.ts

import { Eval } from "braintrust";
import { Factuality } from "autoevals";
 
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: () => {
      return [
        {
          input: "David",
          expected: "Hi David",
        },
      ]; // Replace with your eval dataset
    },
    task: (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Factuality],
    trialCount: 10,
  },
);

Hill climbing

Sometimes you do not have expected outputs, and instead want to use a previous experiment as a baseline. Hill climbing is inspired by, but not exactly the same as, the term used in numerical optimization. In the context of Braintrust, hill climbing is a way to iteratively improve a model's performance by comparing new experiments to previous ones. This is especially useful when you don't have a pre-existing benchmark to evaluate against.

Braintrust supports hill climbing as a first-class concept, allowing you to use a previous experiment's output field as the expected field for the current experiment. Autoevals also includes a number of scoreres, like Summary and Battle, that are designed to work well with hill climbing.

To enable hill climbing, use BaseExperiment() in the data field of an eval:

hill_climbing.eval.ts

import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";
 
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment(),
    task: (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Battle],
  },
);

That's it! Braintrust will automatically pick the best base experiment, either using git metadata if available or timestamps otherwise, and then populate the expected field by merging the expected and output field of the base experiment. This means that if you set expected, e.g. through the UI while reviewing results, it will be used as the expected field for the next experiment.

Using a specific experiment

If you want to use a specific experiment as the base experiment, you can pass the name field to BaseExperiment():

hill_climbing_specific.eval.ts

import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";
 
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment({ name: "main-123" }),
    task: (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Battle],
  },
);

Scoring considerations

Often while hill climbing, you want to use two different types of scoring functions:

Methods that do not require an expected output, e.g. ClosedQA, so that you can judge the quality of the output purely based on the input and output. This measure is useful to track across experiments, and it can be used to compare any two experiments, even if they are not sequentially related.
Comparative methods, e.g. Battle or Summary, that accept an expected output but do not treat it as a ground truth. Generally speaking, if you score > 50% on a comparative method, it means you're doing better than the base on average. To learn more about how Battle and Summary work, check out their prompts.

Custom reporters

When you run an experiment, Braintrust logs the results to your terminal, and braintrust eval returns a non-zero exit code if any eval throws an exception. However, it's often useful to customize this behavior, e.g. in your CI/CD pipeline to precisely define what constitutes a failure, or to report results to a different system.

Braintrust allows you to define custom reporters that can be used to process and log results anywhere you'd like. You can define a reporter by adding a Reporter(...) block. A Reporter has two functions:

reporter.eval.ts

import { Reporter } from "braintrust";
 
Reporter(
  "My reporter", // Replace with your reporter name
  {
    reportEval(evaluator, result, opts) {
      // Summarizes the results of a single reporter, and return whatever you
      // want (the full results, a piece of text, or both!)
    },
 
    reportRun(results) {
      // Takes all the results and summarizes them. Return a true or false
      // which tells the process to exit.
      return true;
    },
  },
);

Any Reporter included among your evaluated files will be automatically picked up by the braintrust eval command.

If no reporters are defined, the default reporter will be used which logs the results to the console.
If you define one reporter, it'll be used for all Eval blocks.
If you define multiple Reporters, you have to specify the reporter name as an optional 3rd argument to Eval().

Example: the default reporter

As an example, here's the default reporter that Braintrust uses:

reporter_default.eval.ts

import { ReporterDef } from "braintrust";
 
export const defaultReporter: ReporterDef<boolean> = {
  name: "Braintrust default reporter",
  async reportEval(
    evaluator: EvaluatorDef<any, any, any, any>,
    result: EvalResultWithSummary<any, any, any, any>,
    { verbose, jsonl }: ReporterOpts
  ) {
    const { results, summary } = result;
    const failingResults = results.filter(
      (r: { error: unknown }) => r.error !== undefined
    );
 
    if (failingResults.length > 0) {
      reportFailures(evaluator, failingResults, { verbose, jsonl });
    }
 
    console.log(jsonl ? JSON.stringify(summary) : summary);
    return failingResults.length === 0;
  },
  async reportRun(evalReports: boolean[]) {
    return evalReports.every((r) => r);
  },
};

Tracing

Braintrust allows you to trace detailed debug information and metrics about your application that you can use to measure performance and debug issues. The trace is a tree of spans, where each span represents an expensive task, e.g. an LLM call, vector database lookup, or API request.

If you are using the OpenAI API, Braintrust includes a wrapper function that automatically logs your requests. To use it, simply call wrapOpenAI/wrap_openai on your OpenAI instance. See Wrapping OpenAI for more info.

Each call to experiment.log() creates its own trace, starting at the time of the previous log statement and ending at the completion of the current. Do not mix experiment.log() with tracing. It will result in extra traces that are not correctly parented.

For more detailed tracing, you can wrap existing code with the braintrust.traced function. Inside the wrapped function, you can log incrementally to braintrust.currentSpan(). For example, you can progressively log the input, output, and expected output of a task, and then log a score at the end:

import { traced } from "braintrust";
 
async function callModel(input) {
  return traced(
    async (span) => {
      const messages = { messages: [{ role: "system", text: input }] };
      span.log({ input: messages });
 
      // Replace this with a model call
      const result = {
        content: "China",
        latency: 1,
        prompt_tokens: 10,
        completion_tokens: 2,
      };
 
      span.log({
        output: result.content,
        metrics: {
          latency: result.latency,
          prompt_tokens: result.prompt_tokens,
          completion_tokens: result.completion_tokens,
        },
      });
      return result.content;
    },
    {
      name: "My AI model",
    }
  );
}
 
const exactMatch = (args: { input; output; expected? }) => {
  return {
    name: "Exact match",
    score: args.output === args.expected ? 1 : 0,
  };
};
 
Eval("My Evaluation", {
  data: () => [
    { input: "Which country has the highest population?", expected: "China" },
  ],
  task: async (input, { span }) => {
    return await callModel(input);
  },
  scores: [exactMatch],
});

This results in a span tree you can visualize in the UI by clicking on each test case in the experiment:

Root Span Subspan

Logging SDK

The SDK allows you to report evaluation results directly from your code, without using the Eval() or .traced() functions. This is useful if you want to structure your own complex evaluation logic, or integrate Braintrust with an existing testing or evaluation framework.

import * as braintrust from "braintrust";
import { Factuality } from "autoevals";
 
async function runEvaluation() {
  const experiment = braintrust.init("Say Hi Bot"); // Replace with your project name
  const dataset = [{ input: "David", expected: "Hi David" }]; // Replace with your eval dataset
 
  const promises = [];
  for (const { input, expected } of dataset) {
    // You can await here instead to run these sequentially
    promises.push(
      experiment.traced(async (span) => {
        const output = "Hi David"; // Replace with your LLM call
 
        const { name, score } = await Factuality({ input, output, expected });
 
        span.log({
          input,
          output,
          expected,
          scores: {
            [name]: score,
          },
          metadata: { type: "Test" },
        });
      }),
    );
  }
  await Promise.all(promises);
 
  const summary = await experiment.summarize();
  console.log(summary);
  return summary;
}
 
runEvaluation();

Refer to the tracing guide for examples of how to trace evaluations using the low-level SDK. For more details on how to use the low level SDK, see the Python or Node.js documentation.

Troubleshooting

Exception when mixing `log` with `traced`

There are two ways to log to Braintrust: Experiment.log and Experiment.traced. Experiment.log is for non-traced logging, while Experiment.traced is for tracing. This exception is thrown when you mix both methods on the same object, for instance:

import { init, traced } from "braintrust";
 
function foo() {
  return traced((span) => {
    const output = 1;
    span.log({ output });
    return output;
  });
}
 
const experiment = init("my-project");
for (let i = 0; i < 10; ++i) {
  const output = foo();
  // ❌ This will throw an exception, because we have created a trace for `foo`
  // with `traced` but here we are logging to the toplevel object, NOT the
  // trace.
  experiment.log({ input: "foo", output });
}

Most of the time, you should use either Experiment.log or Experiment.traced, but not both, so the SDK throws an error to prevent accidentally mixing them together. For the above example, you most likely want to write:

import { init, traced } from "braintrust";
 
function foo() {
  return traced((span) => {
    const output = 1;
    span.log({ output });
    return output;
  });
}
 
const experiment = init("my-project");
for (let i = 0; i < 10; ++i) {
  // Create a toplevel trace with `traced`.
  experiment.traced((span) => {
    // The call to `foo` is nested as a subspan under our toplevel trace.
    const output = foo();
    // We log to the toplevel trace with `span.log`.
    span.log({ input: "foo", output: "bar" });
  });
}

In rare cases, if you are certain you want to mix traced and non-traced logging on the same object, you may pass the argument allowConcurrentWithSpans: true/allow_concurrent_with_spans=True to Experiment.log.