Benchmarking inference providers

Although there are a small handful of open-source LLMs, there are a variety of inference providers that can host them for you, each with different cost, speed, and as we'll see below, accuracy trade-offs. And even if one provider excels at a certain model size, it may not be the best choice for another.

Key takeaways

It's very important to evaluate your specific use case against a variety of both models and providers to make an informed decision about which to use. What I learned is that the results are pretty unpredictable and vary across both provider and model size. Just because one provider has a good 8b model, doesn't mean that its 405b is fast or accurate.

Here are some things that surprised me:

  • 8b models are consistently fast, but have high variance in accuracy
  • One provider is fastest for 8b and 70b, yet slowest for 405b
  • The best provider is different across the two benchmarks we ran

Hopefully this analysis will help you create your own benchmarks and make an informed decision about which provider to use.

Setup

Before you get started, make sure you have a Braintrust account and API keys for all the providers you want to test. Here, we're testing Together, Fireworks, and Lepton, although Braintrust supports several others (including Azure, Bedrock, Groq, and more).

Make sure to plug each provider's API key into your Braintrust account's AI secrets configuration and acquire a BRAINTRUST_API_KEY.

Put your BRAINTRUST_API_KEY in a .env.local file next to this notebook, or just hardcode it into the code below.

import dotenv from "dotenv";
import * as fs from "fs";
 
if (fs.existsSync(".env.local")) {
  dotenv.config({ path: ".env.local", override: true });
}

Task code

We are going to reuse the task function from Tool calls in LLaMa 3.1, which is below. For a detailed explanation of the task, see that recipe.

import { OpenAI } from "openai";
import { wrapOpenAI } from "braintrust";
 
import { templates } from "autoevals";
import * as yaml from "js-yaml";
import mustache from "mustache";
 
const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.BRAINTRUST_API_KEY,
    baseURL: "https://api.braintrust.dev/v1/proxy",
    defaultHeaders: { "x-bt-use-cache": "never" },
  })
);
 
function parseToolResponse(response: string) {
  const functionRegex = /<function=(\w+)>(.*?)(?:<\/function>|$)/;
  const match = response.match(functionRegex);
 
  if (match) {
    const [, functionName, argsString] = match;
    try {
      const args = JSON.parse(argsString);
      return {
        functionName,
        args,
      };
    } catch (error) {
      console.error("Error parsing function arguments:", error);
      return null;
    }
  }
 
  return null;
}
 
const template = yaml.load(templates["factuality"]);
 
const selectTool = {
  name: "select_choice",
  description: "Call this function to select a choice.",
  parameters: {
    properties: {
      reasons: {
        description:
          "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
        title: "Reasoning",
        type: "string",
      },
      choice: {
        description: "The choice",
        title: "Choice",
        type: "string",
        enum: Object.keys(template.choice_scores),
      },
    },
    required: ["reasons", "choice"],
    title: "CoTResponse",
    type: "object",
  },
};
 
async function LLaMaFactuality({
  model,
  input,
  output,
  expected,
}: {
  model: string;
  input: string;
  output: string;
  expected: string;
}) {
  const toolPrompt = `You have access to the following functions:
 
Use the function '${selectTool.name}' to '${selectTool.description}':
${JSON.stringify(selectTool)}
 
If you choose to call a function ONLY reply in the following format with no prefix or suffix:
 
<function=example_function_name>{"example_name": "example_value"}</function>
 
Reminder:
- If looking for real time information use relevant functions before falling back to brave_search
- Function calls MUST follow the specified format, start with <function= and end with </function>
- Required parameters MUST be specified
- Only call one function at a time
- Put the entire function call reply on one line
 
Here are a few examples:
 
`;
 
  const response = await client.chat.completions.create({
    model,
    messages: [
      {
        role: "system",
        content: toolPrompt,
      },
      {
        role: "user",
        content: mustache.render(template.prompt, {
          input,
          output,
          expected,
        }),
      },
    ],
    temperature: 0,
    max_tokens: 2048,
  });
 
  try {
    const parsed = parseToolResponse(response.choices[0].message.content);
    return {
      name: "Factuality",
      score: template.choice_scores[parsed?.args.choice],
      metadata: {
        rationale: parsed?.args.reasons,
        choice: parsed?.args.choice,
      },
    };
  } catch (e) {
    return {
      name: "Factuality",
      score: -1,
      metadata: {
        error: `${e}`,
      },
    };
  }
}
 
console.log(
  await LLaMaFactuality({
    model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    input: "What is the weather in Tokyo?",
    output: "The weather in Tokyo is scorching.",
    expected: "The weather in Tokyo is extremely hot.",
  })
);
(node:12633) [DEP0040] DeprecationWarning: The \`punycode\` module is deprecated. Please use a userland alternative instead.
(Use \`node --trace-deprecation ...\` to show where the warning was created)
{
  name: 'Factuality',
  score: 0.6,
  metadata: {
    rationale: "The submitted answer 'The weather in Tokyo is scorching' is a superset of the expert answer 'The weather in Tokyo is extremely hot' because it includes the same information and adds more detail. The word 'scorching' is a synonym for 'extremely hot', so the submitted answer is fully consistent with the expert answer.",
    choice: 'B'
  }
}

Dataset

We'll use the same data as well: a subset of the CoQA dataset.

interface CoqaCase {
  input: {
    input: string;
    output: string;
    expected: string;
  };
  expected: number;
}
 
const data: CoqaCase[] = JSON.parse(
  fs.readFileSync("../LLaMa-3_1-Tools/coqa-factuality.json", "utf-8")
);
 
console.log("LLaMa-3.1-8B Factuality");
console.log(
  await LLaMaFactuality({
    model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    ...data[1].input,
  })
);
LLaMa-3.1-8B Factuality
{
  name: 'Factuality',
  score: 0,
  metadata: {
    rationale: "The submitted answer 'in a barn' does not contain the word 'white' which is present in the expert answer. Therefore, it is not a subset or superset of the expert answer. The submitted answer also does not contain all the same details as the expert answer. There is a disagreement between the submitted answer and the expert answer.",
    choice: 'D'
  }
}

Running evals

Let's create a list of the providers we want to evaluate. Each provider conveniently names its flavor of each model slightly differently, so we can use these as a unique identifier.

To facilitate this test, we also self-hosted an official Meta-LLaMa-3.1-405B-Instruct-FP8 model, which is available on Hugging Face using vLLM. You can configure this model as a custom endpoint in Braintrust to use it alongside other providers.

Provider map

const providers = [
  {
    provider: "Provider 1",
    models: [
      "accounts/fireworks/models/llama-v3p1-8b-instruct",
      "accounts/fireworks/models/llama-v3p1-70b-instruct",
      "accounts/fireworks/models/llama-v3p1-405b-instruct",
    ],
  },
  {
    provider: "Provider 2",
    models: [
      "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
      "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
      "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    ],
  },
  {
    provider: "Provider 3",
    models: ["llama3-1-8b", "llama3-1-70b", "llama3-1-405b"],
  },
  {
    provider: "Self-hosted vLLM",
    models: ["meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"],
  },
];

Eval code

We'll run each provider in parallel, and within the provider, we'll run each model in parallel. This roughly assumes that rate limits are per model, not per provider.

We're also running with a low concurrency level (3) to avoid overwhelming a provider and hitting rate limits. The Braintrust proxy handles rate limits for us, but they are reflected in the final task duration.

You'll also notice that we parse and track the provider as well as the model in each experiment's metadata. This allows us to do some rich analysis on the results.

import { Eval } from "braintrust";
import { Score, NumericDiff } from "autoevals";
 
function NonNull({ output }: { output: number | null }) {
  return output !== null && output !== undefined ? 1 : 0;
}
 
async function CorrectScore({
  output,
  expected,
}: {
  output: number | null;
  expected: number | null;
}): Promise<Score> {
  if (output === null || expected === null) {
    return {
      name: "CorrectScore",
      score: 0,
      metadata: {
        error: output === null ? "output is null" : "expected is null",
      },
    };
  }
  return {
    ...(await NumericDiff({ output, expected })),
    name: "CorrectScore",
  };
}
 
async function runProviderBenchmark(provider: (typeof providers)[number]) {
  const evals = [];
  for (const model of provider.models) {
    const size = model.toLowerCase().includes("8b")
      ? "8b"
      : model.toLowerCase().includes("70b")
        ? "70b"
        : "405b";
 
    evals.push(
      Eval("LLaMa-3.1-Multi-Provider-Benchmark", {
        data: data,
        task: async (input) =>
          (await LLaMaFactuality({ model, ...input }))?.score,
        scores: [CorrectScore, NonNull],
        metadata: {
          size,
          provider: provider.provider,
          model,
        },
        experimentName: `${provider.provider} (${size})`,
        maxConcurrency: 3,
        trialCount: 3,
      })
    );
  }
  await Promise.all(evals);
}
 
await Promise.all(providers.map(runProviderBenchmark));

Results

Let's start by looking at the project view. Braintrust makes it easy to morph this into a multi-level grouped analysis where we can see the score vs. duration in a scatter plot, and how each provider stacks up in the table.

Setting up the table

Insights

Now let's dig into this chart and see what we can learn.

  1. 70b hits a nice sweet spot

It looks like on average, each weight class costs you an extra second on average. However, the jump in average accuracy from 8b to 70b is 16%+ while 70b to 405b is only 2.87%.

Pivot table

  1. 8b models are consistently really fast, but some providers' 70b models are slower than others'

The distribution among providers for 8b latency is very tight, but that starts to change with 70b and even more so with 405b models.

Speed distribution

  1. High accuracy variance in 8b models

Within 8b models in particular, there is a pretty significant difference in accuracy

Accuracy distribution

  1. Provider 1 is the fastest except for 405b

Provider 1

Interestingly, provider 1's 8b model is both the fastest and most accurate. However, its 405b model, while accurate, is the slowest by far. This is likely due to rate limits, or perhaps they have optimized it using a different method.

  1. Self-hosting strikes a nice balance

Self-hosting strikes a nice balance between latency and quality (note: we only tested self-hosted 405b). Of course, this comes at a price -- around $27/hour using Lambda Labs

Self-hosted

Another benchmark

We also used roughly the same code on a different, more-realistic, internal benchmark which measures how well our AI search bar works. Here is the same visualization for that benchmark:

AISearch

As you can see, certain things are consistent, but others are not. Again, this highlights how important it is to run this analysis on your own use case.

  • Provider 1 is less differentiated. Although Provider 1 is still the fastest, it comes at the cost of accuracy in the 70b and 405b classes, where Provider 2 wins on accuracy. Provider 2 also wins on speed for 405b.
  • Provider 3 has a hard time in the 70b class. This workload is heavy on prompt tokens (~3500 per test case). Maybe that has something to do with it?
  • More latency variance across the board. Again, this may have to do with the significant jump in prompt tokens.
  • Self-hosted seems to be about the same. Interestingly, the self-hosted model appears at about the same spot in the graph!

Where to go from here

This is just one benchmark, but as you can see, there is a pretty significant difference in speed and accuracy between providers. I'd highly encourage testing on your own workload and using a tool like Braintrust to help you construct a good eval and understand the trade-offs across providers in depth.

Feel free to reach out if we can help, or feel free to sign up to try out Braintrust for yourself. If you enjoy performing this kind of analysis, we are hiring a Devrel (among other roles).

Happy evaluating!

Thanks to Hamel for hosting the self-hosted model and feedback on drafts.