This tutorial will walk through using Braintrust to evaluate a conversational, multi-turn chat assistant.
These types of chat bots have become important parts of applications, acting as customer service agents, sales representatives, or travel agents, to name a few. As an owner of such an application, it's important to be sure the bot provides value to the user.
We will expand on this below, but the history and context of a conversation are crucial to producing a good response. If you received a request to "Make a dinner reservation at 7pm" and you knew where, on what date, and for how many people, you could provide some assistance; otherwise, you'd need to ask for more information.
Before starting, please make sure you have a Braintrust account. If you do not have one, you can sign up here.
Let's take a look at the small dataset prepared for this cookbook. You can find the full dataset in the accompanying dataset.ts file. The assistant turns were generated using claude-3-5-sonnet-20240620.
Below is an example of a data point.
chat_history contains the history of the conversation between the user and the assistant
input is the last user turn that will be sent in the messages argument to the chat completion
expected is the output expected from the chat completion given the input
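As a rough sketch, the three fields above could be typed as follows. The interface names `ChatTurn` and `DataPoint` are assumptions made for illustration, and the bracketed strings are placeholders rather than values copied from dataset.ts; only the `input` question is taken from this tutorial.

```typescript
// One turn of the conversation, as a role/content pair.
interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

// One data point with the three fields described above.
// (Interface names are hypothetical, chosen for this sketch.)
interface DataPoint {
  chat_history: ChatTurn[]; // earlier turns between user and assistant
  input: string; // the last user turn
  expected: string; // the output we expect for that last turn
}

// Placeholder values only; see dataset.ts for the real data.
const examplePoint: DataPoint = {
  chat_history: [
    { role: "user", content: "<earlier user question>" },
    { role: "assistant", content: "<earlier assistant answer>" },
  ],
  input: "Who won the men's trophy that year?",
  expected: "<expected answer>",
};

console.log(examplePoint.chat_history.length); // number of earlier turns
```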
From looking at this one example, we can see why the history is necessary to provide a helpful response.
If you were asked "Who won the men's trophy that year?" you would wonder: What trophy? Which year? But if you were also given the chat_history, you would be able to answer the question (maybe after some quick research).
To start, let's see how the prompt performs when no chat history is provided. We'll create a simple task function that returns the output from a chat completion.
We'll use the Factuality scoring function from the autoevals library to check how the output of the chat completion compares factually to the expected value.
We will also utilize trials by including the trialCount parameter in the Eval call. We expect the output of the chat completion to be non-deterministic, so running each input multiple times will give us a better sense of the "average" output.
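To make the two ideas above concrete, here is a minimal, dependency-free sketch. It shows the messages a no-history task would send and how repeated trial scores average out. The helper names `buildMessagesNoHistory` and `averageScore` and the trial scores are hypothetical; the real task would pass the messages to a chat completion client, and Braintrust handles the aggregation for you.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Without chat history, the request contains only the last user turn.
function buildMessagesNoHistory(input: string): ChatMessage[] {
  return [{ role: "user", content: input }];
}

// Average the scores from repeated trials of the same input.
function averageScore(trialScores: number[]): number {
  if (trialScores.length === 0) {
    throw new Error("at least one trial score is required");
  }
  const total = trialScores.reduce((sum, score) => sum + score, 0);
  return total / trialScores.length;
}

// Hypothetical scores from three trials of one input.
const trialScores = [1, 0.6, 1];
console.log(averageScore(trialScores).toFixed(2)); // prints "0.87"
```

Because the output is non-deterministic, a single trial could land on any of these scores; the mean gives a steadier picture per input.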
61.33% Factuality score? Given what we discussed earlier about chat history being important in producing a good response, that's surprisingly high. Let's log in to braintrust.dev and take a look at how we got that score.
If we look at the score distribution chart, we can see that ten of the fifteen examples scored at least 60%, with over half scoring 100%. Looking into one of the examples with a 100% score, we see that the chat completion output asks for more context, as we would expect:
Could you please specify which athlete or player you're referring to? There are many professional athletes, and I'll need a bit more information to provide an accurate answer.
This aligns with our expectation, so let's now look at how the score was determined.
Clicking into the scoring trace, we see the chain-of-thought reasoning used to settle on the score. The model chose (E) The answers differ, but these differences don't matter from the perspective of factuality. This is technically correct, but we want to penalize the chat completion for not being able to produce a good response.
While Factuality is a good general purpose scorer, for our use case option (E) is not well aligned with our expectations. The best way to work around this is to customize the scoring function so that it produces a lower score for asking for more context or specificity.
You can see the built-in Factuality prompt here. For our customized scorer, we've added two score choices to that prompt:
These will score (F) = 0.2 and (G) = 0 so the model gets some credit if there was any context it was able to gather from the user's input.
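The resulting choice-to-score mapping can be sketched as a plain lookup table. The (F) and (G) values come from the text above, and (B) = 0.6 and (E) = 1 match the scores seen in this tutorial. The (A), (C), and (D) values are my understanding of the built-in Factuality template and should be checked against the linked prompt. The helper `scoreForChoice` is hypothetical.

```typescript
// Choice letter -> score. (A), (C), (D) are assumed from the built-in
// Factuality template; (F) and (G) are the two choices added above.
const choiceScores: Record<string, number> = {
  A: 0.4, // submission is a consistent subset of the expert answer (assumed)
  B: 0.6, // submission is a consistent superset of the expert answer
  C: 1, // submission contains the same details (assumed)
  D: 0, // submission disagrees with the expert answer (assumed)
  E: 1, // answers differ, but not in ways that matter for factuality
  F: 0.2, // asks for more context, but gathered some from the input
  G: 0, // no credit
};

// Hypothetical helper: look up the score for the choice the model picked.
function scoreForChoice(choice: string): number {
  const score = choiceScores[choice];
  if (score === undefined) {
    throw new Error(`unknown choice: ${choice}`);
  }
  return score;
}

console.log(scoreForChoice("F")); // prints 0.2
```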
We can then use this spec and the LLMClassifierFromSpec function to create our custom scorer to use in the eval function.
In the table, we can see the output fields in which the chat completion responses are requesting more context. In one of the examples that had a non-zero score, we can see that the model asked for some clarification, but was able to understand from the question that the user was inquiring about a controversial World Series. Nice!
Looking into how the score was determined, we can see that the factual information aligned with the expert answer but the submitted answer still asks for more context, resulting in a score of 20% which is what we expect.
We need to edit the inputs to the Eval function so we can pass the chat history to the chat completion request.
We update the parameter to the task function to accept both the input string and the chat_history array and add the chat_history into the messages array in the chat completion request, done here using the spread ... syntax.
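A minimal sketch of that change, leaving out the actual chat completion call: the function name `buildMessages` and the type names are assumptions, and the returned array is what would be passed as the `messages` argument.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// The task input is now an object instead of a plain string.
interface TaskInput {
  input: string;
  chat_history: ChatMessage[];
}

// Spread the earlier turns first, then append the last user turn.
function buildMessages({ input, chat_history }: TaskInput): ChatMessage[] {
  return [...chat_history, { role: "user", content: input }];
}

// Placeholder history; the real turns live in dataset.ts.
const messages = buildMessages({
  input: "Who won the men's trophy that year?",
  chat_history: [
    { role: "user", content: "<earlier user question>" },
    { role: "assistant", content: "<earlier assistant answer>" },
  ],
});

console.log(messages.length); // prints 3
```

Order matters here: the history must come before the new user turn so the model reads the conversation chronologically.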
We also need to update the experimentData and Factual function parameters to align with these changes.
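One way to picture the data change is a small reshaping step that nests the last user turn and the history under a single `input` object. This is a sketch under assumed names (`DataPoint`, `EvalRow`, `toEvalRow`, `questionOf`), not the cookbook's exact code.

```typescript
interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

// Original flat shape of a data point.
interface DataPoint {
  chat_history: ChatTurn[];
  input: string;
  expected: string;
}

// Shape handed to the eval: the task receives the whole `input` object.
interface EvalRow {
  input: { input: string; chat_history: ChatTurn[] };
  expected: string;
}

// Nest the question and its history under `input`.
function toEvalRow(point: DataPoint): EvalRow {
  return {
    input: { input: point.input, chat_history: point.chat_history },
    expected: point.expected,
  };
}

// The scorer still needs the plain question, so it reads `input.input`.
function questionOf(row: EvalRow): string {
  return row.input.input;
}

const row = toEvalRow({
  chat_history: [],
  input: "Who won the men's trophy that year?",
  expected: "<expected answer>",
});
console.log(questionOf(row)); // prints the last user turn
```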
Use the updated variables and functions to run a new eval.
A 60% score is a definite improvement from 4%.
You'll notice that it says there were 0 improvements and 0 regressions compared to the last experiment gpt-4o assistant - no history-934e5ca2 we ran. This is because by default, Braintrust uses the input field to match rows across experiments. From the dashboard, we can customize the comparison key (see docs) by going to the project configuration page.
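The mismatch is easy to reproduce in isolation. The sketch below is only an illustration of the idea, not Braintrust's implementation: `keyByWholeInput` stands in for matching on the entire input field, and `keyByLastUserTurn` stands in for a custom comparison key that looks only at the last user turn. Both function names are hypothetical.

```typescript
type RowInput = string | { input: string; chat_history: unknown[] };

// Matching on the whole input: a string and an object never compare equal.
function keyByWholeInput(input: RowInput): string {
  return JSON.stringify(input);
}

// Matching on the last user turn only, whichever shape the input has.
function keyByLastUserTurn(input: RowInput): string {
  return typeof input === "string" ? input : input.input;
}

const before: RowInput = "Who won the men's trophy that year?";
const after: RowInput = {
  input: "Who won the men's trophy that year?",
  chat_history: [],
};

console.log(keyByWholeInput(before) === keyByWholeInput(after)); // prints false
console.log(keyByLastUserTurn(before) === keyByLastUserTurn(after)); // prints true
```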
Turn on diff mode using the toggle on the upper right of the table.
Since we updated the comparison key, we can now see the improvements in the Factuality score between the experiment run with chat history and the most recent one run without for each of the examples. If we also click into a trace, we can see the change in input parameters that we made above where it went from a string to an object with input and chat_history fields.
All of our rows scored 60% in this experiment. Looking into each trace, we see that the submitted answer includes all the details from the expert answer plus some additional information.
60% is an improvement from the previous run, but we can do better. Since it seems like the chat completion is always returning more than necessary, let's see if we can tweak our prompt to have the output be more concise.
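One possible tweak is to prepend a system message asking for brevity. The prompt wording and the names `CONCISE_SYSTEM_PROMPT` and `buildConciseMessages` are hypothetical; treat this as a starting point to experiment with rather than a known-good prompt.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical instruction; adjust the wording and re-run the eval.
const CONCISE_SYSTEM_PROMPT =
  "Answer the user's question directly and concisely. " +
  "Do not add details that were not asked for.";

// System prompt first, then the history, then the last user turn.
function buildConciseMessages(
  input: string,
  chat_history: ChatMessage[],
): ChatMessage[] {
  return [
    { role: "system", content: CONCISE_SYSTEM_PROMPT },
    ...chat_history,
    { role: "user", content: input },
  ];
}

const conciseMessages = buildConciseMessages(
  "Who won the men's trophy that year?",
  [],
);
console.log(conciseMessages[0].role); // prints "system"
```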
We've seen in this cookbook how to evaluate a chat assistant and visualized how the chat history affects the output of the chat completion. Along the way, we also used some other functionality, such as updating the comparison key in the diff view and creating a custom scoring function.
Try seeing how you can improve the outputs and scores even further!