Interpret evaluation results

Each offline evaluation creates an experiment, a permanent record of how the evaluated task performed on a dataset.

View results

To view the results of an experiment, go to Experiments in your project and select the experiment from the list.

Traces vs. spans - By default, experiments display as a table of traces where each row represents a complete trace with its root span. To view the individual spans in traces instead, select the row type chip (Traces by default) in the toolbar, then Spans. View individual spans when you want to:
- Analyze specific operations within traces
- Find particular function calls or API requests
- Examine timing and token usage for individual operations
Spans view is optimized for analyzing individual operations. Experiment comparisons and diff mode are only available when viewing traces.
Metrics - Along with the scores you track, Braintrust tracks a number of metrics about your LLM calls that help you assess and understand performance. For example, when you switch models, it’s useful to look at duration, token metrics, and estimated cost together to understand the tradeoffs. To compute LLM metrics like token counts, make sure you trace your LLM calls.
Experiment summary - Select Details to view:
- Comparisons to other experiments
- Scorers used in the evaluation
- Datasets tested
- Saved parameters linked to the evaluation
- Metadata like model and parameters
Copy the experiment ID from the bottom of the summary pane for referencing in code or sharing with teammates.

Filter results

Each project provides default table views with common filters for experiments, including:

Default view: Shows all traces in the experiment
Non-errors: Shows only traces without errors
Errors: Shows only traces with errors
Scorer errors: Shows only traces with scorer errors
Unreviewed: Hides traces that have been human-reviewed
Assigned to me: Shows only traces assigned to the current user for human review

Use the menu to switch the table view.

Built-in views (such as “All experiments view”) cannot be modified, but you can create custom table views based on custom filters and display settings.

You can also use the Filter menu to add custom filtering. Use the Basic tab for point-and-click filtering, or switch to SQL to write precise SQL queries. To filter experiments by metadata programmatically, use the metadata query parameter on GET /v1/experiment. See Filter experiments by metadata for details.

Group results

Select Display > Group by to group the table by metadata fields or classifier results to see patterns. Classifier options appear under the Classifications heading in the group-by menu. By default, group rows show one experiment’s summary data. To view summary data for all experiments, select Include comparisons in group.
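To make the grouping idea concrete, here is a minimal offline sketch of what a group summary amounts to: rows are bucketed by one metadata field and a score is averaged per bucket. It is an illustration only, not how Braintrust computes group rows. The row shape (a dict with `metadata` and `scores` keys) and the function name `group_mean_score` are assumptions chosen for this example.

```python
from collections import defaultdict
from statistics import mean


def group_mean_score(rows: list[dict], metadata_field: str, score_name: str) -> dict:
    """Average one score per value of a metadata field.

    Assumes each row looks like {"metadata": {...}, "scores": {...}}.
    Rows that lack the score are skipped rather than counted as zero.
    """
    groups = defaultdict(list)
    for row in rows:
        key = row.get("metadata", {}).get(metadata_field)
        score = row.get("scores", {}).get(score_name)
        if score is not None:
            groups[key].append(score)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical exported rows, for illustration only.
example_rows = [
    {"metadata": {"category": "billing"}, "scores": {"factuality": 1.0}},
    {"metadata": {"category": "billing"}, "scores": {"factuality": 0.0}},
    {"metadata": {"category": "search"}, "scores": {"factuality": 0.5}},
]
summary = group_mean_score(example_rows, "category", "factuality")
```

Skipping rows without a score keeps unscored traces from dragging a group average down.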

Order by regressions

Score and metric columns show summary statistics in their headers. To order columns by regressions, select Display > Columns > Order by regressions. Within grouped tables, this sorts rows by regressions of a specific score relative to a comparison experiment.
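As a rough mental model, ordering by regressions means ranking groups by how far a score fell relative to the comparison experiment. The sketch below shows that ranking on plain dictionaries of per-group averages. It is not Braintrust's implementation, and the function name and input shape are assumptions for illustration.

```python
def order_by_regressions(
    current: dict[str, float], comparison: dict[str, float]
) -> list[tuple[str, float]]:
    """Return (group, delta) pairs, largest regression first.

    delta = current score minus comparison score, so negative values are
    regressions. Groups missing from either experiment are ignored.
    """
    deltas = {
        group: current[group] - comparison[group]
        for group in current
        if group in comparison
    }
    return sorted(deltas.items(), key=lambda item: item[1])


# Hypothetical per-group averages for one score.
ranked = order_by_regressions(
    current={"billing": 0.6, "search": 0.9, "chat": 0.7},
    comparison={"billing": 0.8, "search": 0.85, "chat": 0.7},
)
worst_group = ranked[0][0]
```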

Examine a trace

Select any row to open the trace view and see complete details:

Input, output, and expected values
Metadata and parameters
All spans in the trace hierarchy
Scores and their explanations
Timing and token usage

As you review a trace, check whether good scores correspond to good outputs. If they do not, update your scorers or test cases. You can expand the trace to fullscreen or open it in a separate page using the buttons in the trace view. For details on trace views, layouts, and actions, see Examine traces.

When comparing experiments with diff mode enabled, only the default trace view is available. Timeline, Thread, and custom views are disabled during comparison.

Assign for review

You can assign experiment rows to team members for review, analysis, or follow-up action. Assignments are particularly useful for human review workflows, where you can assign specific rows that need human evaluation and distribute review work across multiple team members. See Assign rows for review for details.

Score retrospectively

Apply scorers and classifiers to existing experiments:

Multiple cases: Select rows and use Score to apply chosen scorers and classifiers
Single case: Open a trace and use Score in the trace view

Scores and classifications appear as additional spans within the trace.

Analyze with Loop

Use Loop to analyze experiment results, identify patterns, and get improvement suggestions. Loop can help you understand why certain test cases succeeded or failed and generate actionable recommendations. Select one or more experiments and open Loop to:

Summarize results: Get high-level insights about experiment performance, score trends, and key differences between experiments.
Drill into specific rows: Ask Loop to analyze test cases that performed poorly or identify patterns across failures.
Generate improvements: Loop can suggest changes to prompts, scorers, or datasets based on experiment results.
Create datasets: Extract problematic or interesting test cases into new datasets for targeted evaluation.
Generate code: Get sample code for implementing improvements to test in your next experiment.

Example queries:

“What improved from the last experiment?”
“Categorize the errors in this experiment”
“Pick the best scorers for this task”
“Why did the factuality score drop?”
“Create a dataset from the rows where the model failed”
“What patterns do you see in the low-scoring cases?”

Use aggregate scores

Aggregate scores are formulas that combine multiple scores into a single metric. They are useful when you track many scores but need a single metric to represent overall experiment quality. See Create aggregate scores for more details.
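One common way to combine scores is a weighted average. The sketch below shows that arithmetic in plain Python. It does not use Braintrust's formula syntax, and the score names and weights are hypothetical.

```python
def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine several scores into one number using per-score weights.

    Only scores that have a weight contribute to the result.
    """
    names = [name for name in scores if name in weights]
    total_weight = sum(weights[name] for name in names)
    if total_weight == 0:
        raise ValueError("no weighted scores to aggregate")
    return sum(scores[name] * weights[name] for name in names) / total_weight


# Hypothetical scores: correctness counts three times as much as style.
overall = weighted_average(
    scores={"style": 0.5, "correctness": 1.0},
    weights={"style": 1, "correctness": 3},
)
```

Here `overall` is (0.5 * 1 + 1.0 * 3) / 4, so a weak style score lowers the result less than a weak correctness score would.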

Download results

To download an experiment’s results, select Download as CSV or Download as JSON.

Customize the experiments table

Adjust table layout

To switch between different layouts, select the layout chip (List by default) in the toolbar and choose one of the following:

List: Default table view
Grid: Compare outputs side-by-side. To choose which fields appear in each card, use Display > Fields.
Summary: Large-type summary of scores and metrics across all experiments
Summary table: Scores and metrics as rows with experiments as columns, with a PDF download option.

Layouts respect view filters and are automatically saved when you save a view.

Show and hide columns

Select Display > Columns and then:

Show or hide columns to focus on relevant data
Reorder columns by dragging them
Pin important columns to the left

All column settings are automatically saved when you save a view.

When topics are enabled, facet outputs appear as columns in the experiments table, similar to scores. You can filter and sort by facet columns to analyze patterns in your evaluation results. This helps identify which types of inputs (e.g., specific user tasks or sentiment categories) perform better or worse in your experiments.

Classifiers also appear as columns in the experiments and logs tables and in playground results, with one column per classifier under a classifications. prefix. You can sort and filter each classifier column independently. In experiments and playgrounds, you can group rows by a classifier column.

Create custom columns

Extract specific values from traces using custom columns:

Select Display > Columns > + Add custom column.
Name your column.
Choose from inferred fields or write a SQL expression.

Once created, filter and sort using your custom columns.

Create custom table views

To create or update a custom table view:

Apply the filters and display settings you want.
Open the menu and select Save view… or Save view as….

Custom table views are visible to all project members. Creating or editing a table view requires the Update project permission.

Duplicate table views across projects

If you’ve built a useful custom table view in one project, you can duplicate it to another project via the API rather than recreating it from scratch. Experiments have two customizable views:

Experiments list: The project’s Experiments tab, where each row is an experiment.
Single experiment table: The rows of data inside one experiment.

The following steps work for either. Choose the corresponding view_type in the API call.

Use the list views API endpoint to fetch the experiment views in your source project. Pass the following query parameters:

object_type=project
object_id=<source-project-id>
view_type=experiment for a single experiment table view, or view_type=experiments for the experiments list

curl --request GET \
  --url 'https://api.braintrust.dev/v1/view?object_type=project&object_id=<source-project-id>&view_type=experiment' \
  --header 'Authorization: Bearer <your-api-key>'

curl --request GET \
  --url 'https://api.braintrust.dev/v1/view?object_type=project&object_id=<source-project-id>&view_type=experiments' \
  --header 'Authorization: Bearer <your-api-key>'

In the response, find the view you want to duplicate and copy its view_data and options payloads.

Use the create view API endpoint to create the view in the destination project. Set object_id to the destination project ID.

curl --request POST \
  --url https://api.braintrust.dev/v1/view \
  --header 'Authorization: Bearer <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '
  {
    "object_type": "project",
    "object_id": "<destination-project-id>",
    "view_type": "experiment",
    "name": "<new-view-name>",
    "view_data": <view-data-payload>,
    "options": <options-payload>
  }'

curl --request POST \
  --url https://api.braintrust.dev/v1/view \
  --header 'Authorization: Bearer <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '
  {
    "object_type": "project",
    "object_id": "<destination-project-id>",
    "view_type": "experiments",
    "name": "<new-view-name>",
    "view_data": <view-data-payload>,
    "options": <options-payload>
  }'

Set default table views

You can set default views at three levels:

Organization default: Visible to all members when they open the page. This applies per page. For example, you can set separate organization defaults for Logs, Experiments, and Review. To set an organization default, you need the Manage settings organization permission (included by default in the Owner role). See Access control for details.
Project default: Overrides the organization default for everyone viewing this project. To set a project default, you need the project-level Update permission. Project admins can set project defaults even without organization-level permissions. See Access control for details.
Personal default: Overrides the project and organization defaults for you only. Personal defaults are stored in your browser, so they do not carry over across devices or browsers.

To set a default view:

Switch to the view you want by selecting it from the menu.
Open the menu again and hover over the currently selected view to reveal its submenu.
Choose Set as personal default view, Set as project default view, or Set as organization default view.

To clear a default view:

Open the menu and hover over the currently selected view to reveal its submenu.
Choose Clear personal default view, Clear project default view, or Clear organization default view.

Default view settings are mutually exclusive on a given view. Setting one type of default on a view automatically clears any other default that was previously set on the same view. When a user opens a page, Braintrust loads the first match in this order: personal default, project default, organization default, then the standard “All …” view (for example, “All logs view”).
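The load order described above is a simple first-match fallback. The sketch below restates it as a small function. It only illustrates the precedence rule and is not Braintrust code, and the view names passed in are hypothetical.

```python
from typing import Optional


def resolve_default_view(
    personal: Optional[str],
    project: Optional[str],
    organization: Optional[str],
    standard: str,
) -> str:
    """Return the view to load: the first default that is set, else the standard view."""
    for candidate in (personal, project, organization):
        if candidate is not None:
            return candidate
    return standard


# No personal default is set, so the project default wins over the organization default.
loaded = resolve_default_view(
    personal=None,
    project="Errors",
    organization="Unreviewed",
    standard="All experiments view",
)
```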

Change the table density

To change the table density to see more or less detail per row, select Display > Row height > Compact or Tall.

Export experiments

To export an experiment’s results, open the menu next to the experiment name. You can export as CSV or JSON, and choose whether to download all fields.

Access data from previous experiments by passing the open flag to init():

import { init } from "braintrust";

async function openExperiment() {
  const experiment = init("My Project", {
    experiment: "my-experiment",
    open: true,
  });

  for await (const testCase of experiment) {
    console.log(testCase);
  }
}

import braintrust

def open_experiment():
    experiment = braintrust.init(
        project="My Project",
        experiment="my-experiment",
        open=True,
    )
    for test_case in experiment:
        print(test_case)

Convert experiments to dataset format using asDataset()/as_dataset():

import { init } from "braintrust";

async function openExperiment() {
  const experiment = init("My Project", {
    experiment: "my-experiment",
    open: true,
  });

  for await (const testCase of experiment.asDataset()) {
    console.log(testCase);
  }
}

import braintrust

def open_experiment():
    experiment = braintrust.init(
        project="My Project",
        experiment="my-experiment",
        open=True,
    )
    for test_case in experiment.as_dataset():
        print(test_case)

Fetch experiment events via the API using Fetch experiment (POST form) or Fetch experiment (GET form).

You can also query experiments with SQL for custom analysis. For example, to check review status:

import os
import requests

API_URL = "https://api.braintrust.dev"  # no trailing slash; the path is appended below
headers = {"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]}

def fetch_experiment_review_status(experiment_id: str) -> dict:
    # Replace "response quality" with your review score column name
    query = f"""
    SELECT
      sum(CASE WHEN scores."response quality" IS NOT NULL THEN 1 ELSE 0 END) AS reviewed,
      sum(CASE WHEN is_root THEN 1 ELSE 0 END) AS total
    FROM experiment('{experiment_id}')
    """

    return requests.post(
        f"{API_URL}/btql",
        headers=headers,
        json={"query": query, "fmt": "json"},
    ).json()

# Usage
result = fetch_experiment_review_status("your-experiment-id")
print(f"Reviewed: {result['data'][0]['reviewed']}/{result['data'][0]['total']}")

const API_URL = "https://api.braintrust.dev"; // no trailing slash; the path is appended below
const headers = {
  Authorization: `Bearer ${process.env.BRAINTRUST_API_KEY}`,
};

async function fetchExperimentReviewStatus(experimentId: string) {
  // Replace "response quality" with your review score column name
  const query = `
    SELECT
      sum(CASE WHEN scores."response quality" IS NOT NULL THEN 1 ELSE 0 END) AS reviewed,
      sum(CASE WHEN is_root THEN 1 ELSE 0 END) AS total
    FROM experiment('${experimentId}')
  `;

  const response = await fetch(`${API_URL}/btql`, {
    method: "POST",
    headers,
    body: JSON.stringify({ query, fmt: "json" }),
  });

  return await response.json();
}

// Usage
const result = await fetchExperimentReviewStatus("your-experiment-id");
console.log(`Reviewed: ${result.data[0].reviewed}/${result.data[0].total}`);

Download experiment data to a local NDJSON file with bt sync pull:

bt sync pull experiment:my-experiment

Query experiment data with SQL using bt sql:

bt sql "SELECT id, input, output, scores FROM experiment('my-experiment')"

Most SQL data-source functions also accept an object name in place of its ID. See Querying by name.

Share an experiment

Experiment URLs are name-based, so a shared link breaks when the experiment is renamed. A permalink uses the experiment’s object ID instead, so it stays valid permanently. Use permalinks to share results, bookmark experiments, or include stable links in reports. To copy a permalink, use the permalink button in the experiment view. You can also construct one by hand from the experiment’s ID:

https://www.braintrust.dev/app/object?object_type=experiment&object_id=<experiment_id>

Visiting this URL redirects to the experiment’s canonical page, regardless of organization or project.
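If you generate reports in code, the permalink can be built from the experiment ID with a small helper. This sketch follows the URL pattern shown above. The function name and the example ID are hypothetical.

```python
from urllib.parse import urlencode

PERMALINK_BASE = "https://www.braintrust.dev/app/object"


def experiment_permalink(experiment_id: str) -> str:
    """Build an ID-based permalink that keeps working after a rename."""
    query = urlencode({"object_type": "experiment", "object_id": experiment_id})
    return f"{PERMALINK_BASE}?{query}"


# Placeholder ID for illustration; use the ID copied from the experiment summary pane.
link = experiment_permalink("your-experiment-id")
```

Using `urlencode` keeps the link valid even if an ID ever contains characters that need escaping.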

Next steps

Compare experiments systematically
Write scorers to measure what matters
Use playgrounds for rapid iteration
Run evaluations in CI/CD
