AI & Agents · October 2025 · 8 min read

The evaluation problem

Fallbacks, structured outputs, and how to trust LLM responses.

Traditional software is predictable. You write a test, you run it a thousand times, you get the same result. LLMs are different. The same prompt can give you different outputs every time. The output also differs between models. So how do you know if your AI feature actually works?

I ran into this problem on a product that used LLMs for everything - parsing documents, generating tailored content, checking for issues. When I changed a system prompt, I had no way of knowing whether I had made things better or worse until errors started popping up. So I want to share what I learned trying to solve this.

Fallbacks are harder than they look

The first thing I realized was that I couldn't rely on a single LLM provider. APIs go down. Rate limits get hit. Sometimes a model just returns garbage. So I built a fallback system.

The basic idea is simple - try one model, if it fails, try the next:

typescript
const model_order = [
  {
    model: "gemini-2.5-flash",
    system_prompt: systemPromptGemini,
    user_prompt: userPrompt,
    provider_options: geminiConfig,
  },
  {
    model: "claude-sonnet-4-5",
    system_prompt: systemPromptClaude,
    user_prompt: userPrompt,
    provider_options: anthropicConfig,
  },
  {
    model: "gpt-4.1",
    system_prompt: systemPromptOpenAI,
    user_prompt: userPrompt,
    provider_options: openaiConfig,
  },
];

const result = await callWithFallbacks({
  model_order,
  schema: myZodSchema,
  timeout_ms: 85000,
});

The fallback function loops through each model. If one fails or times out, it moves to the next. It tracks both global and local timeouts - you don't want to wait forever, but you also don't want to give up too early on a slow response.

typescript
// Inside callWithFallbacks: model_order, schema, local_timeout_ms and
// globalSignal all come from the surrounding function's scope.
let lastError: unknown;

for (const modelConfig of model_order) {
  if (globalSignal.aborted) {
    throw new Error("Global timeout exceeded");
  }

  // Per-attempt timeout, combined with the overall deadline
  const localSignal = AbortSignal.timeout(local_timeout_ms);
  const combinedSignal = AbortSignal.any([globalSignal, localSignal]);

  try {
    const { output } = await generateText({
      model: getModel(modelConfig.model),
      output: Output.object({ schema }),
      system: modelConfig.system_prompt,
      prompt: modelConfig.user_prompt,
      abortSignal: combinedSignal,
    });

    return output;
  } catch (error) {
    console.warn(`Model ${modelConfig.model} failed, trying next...`);
    lastError = error;
  }
}

// Every model failed or timed out
throw lastError ?? new Error("All models failed");

Structured outputs are a mess

I use Zod schemas to get structured JSON from LLMs. But each provider handles this differently.

Anthropic needs specific configuration for JSON mode. Google may reject very large or deeply nested schemas (it simply times out without throwing an error 🥲). OpenAI has strict vs non-strict JSON schemas. And they're constantly changing their APIs.

typescript
// Each provider needs different config
const anthropicConfig = {
  structuredOutputMode: "jsonTool",
};

const geminiConfig = {
  structuredOutputs: true
};

const openaiConfig = {
  strictJsonSchema: false,
};

But the real problem is that different models understand the same system prompt differently. I would write a prompt that worked great with Claude, then Gemini would interpret some instruction in a different way (especially about omitting fields).

I ended up having to test the same prompt with each model separately. Sometimes I had to write slightly different prompts for different providers. Not ideal, but necessary.
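To keep that manageable, the per-provider prompts from the fallback config above can share a base and only differ where a specific model misbehaves. A minimal sketch - the prompt wording here is illustrative, not the real prompts:

typescript
// Shared instructions, with small per-provider overrides (wording is illustrative)
const basePrompt = `You are a document assistant. Return only the fields defined in the schema.`;

// The prompt that worked well with Claude is the baseline
const systemPromptClaude = basePrompt;

// Gemini interpreted the instructions about omitting fields differently,
// so its variant spells out what to do with unknown values
const systemPromptGemini = `${basePrompt}
Include every field from the schema. Use null for unknown values instead of omitting the field.`;

// A separate variant slot for OpenAI, even if it starts out identical
const systemPromptOpenAI = basePrompt;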

LLM-as-a-judge saved my sanity

Here's what finally worked: using another LLM to evaluate the outputs. Before pushing any prompt changes to production, I run a test suite that collects outputs from all my AI functions, then asks an LLM to evaluate them.

typescript
import { z } from "zod";
import { generateText, Output } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const judgeSchema = z.object({
  summary: z.string(),
  total_tests: z.number(),
  passed_tests: z.number(),
  failed_tests: z.number(),
  issues: z.array(
    z.object({
      test_label: z.string(),
      issue_description: z.string(),
      severity: z.enum(["low", "medium", "critical"]),
    })
  ),
});

async function llmAsAJudge({ allResults }) {
  const prompt = `
    You are an AI QA Lead. Evaluate these test results.

    Check for:
    1. Errors or Timeouts (CRITICAL)
    2. Empty or null data where data was expected (CRITICAL)
    3. Malformed JSON or nonsense output (MEDIUM)

    Results: ${JSON.stringify(allResults, null, 2)}
  `;

  const { output } = await generateText({
    model: anthropic("claude-sonnet-4-5"),
    output: Output.object({ schema: judgeSchema }),
    prompt,
  });

  return output;
}

The test suite runs each AI function with sample data, records the output, and sends everything to the judge. The judge looks for obvious problems - errors, timeouts, empty responses, malformed data.

typescript
// Run tests in parallel
const tests = [
  () => tryTest({
    fn: generateDocument,
    label: "GENERATE DOCUMENT",
    args: [sampleDocumentData],
  }),
  () => tryTest({
    fn: parseFromPDF,
    label: "PARSE PDF",
    args: [samplePDFBuffer],
  }),
  // ... more tests
];

const allResults = await Promise.all(tests.map(t => t()));
const judgment = await llmAsAJudge({ allResults });

if (judgment.failed_tests > 0) {
  console.log("Issues found:");
  judgment.issues.forEach(issue => {
    console.log(`[${issue.severity}] ${issue.test_label}: ${issue.issue_description}`);
  });
} else {
  console.log("No issues found");
}
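Since the whole point is to run this before pushing prompt changes, I also make the script fail loudly instead of just printing. A minimal sketch, assuming it runs as a Node script in CI (the severity threshold is my own choice):

typescript
// Fail the pre-deploy check if the judge flagged anything critical (threshold is an assumption)
const hasCritical = judgment.issues.some(issue => issue.severity === "critical");

if (hasCritical) {
  // A non-zero exit code fails the CI step, which blocks the deploy
  process.exit(1);
}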

This doesn't catch everything. The judge might miss subtle quality issues. But it catches the obvious breaks - the ones where you changed something and now the whole thing returns errors or empty data.

Human evals for everything else

LLM-as-a-judge catches technical failures, but it can't do a deep quality check. Is the generated document professional? Does it match the user's industry? Is it specific enough?

This is where human evals come in. Simple thumbs up/down buttons after every AI output do the trick.

When someone clicks thumbs down, log everything - the input, the output, the model used, any feedback they provide. Then go back to the system prompt and adjust. And keep adjusting until you're mostly getting thumbs up.
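A minimal sketch of what gets logged - the field names are illustrative, and the JSONL file stands in for whatever storage the product already uses:

typescript
import { appendFile } from "node:fs/promises";

// One record per thumbs up/down - enough context to reproduce the output later
type FeedbackRecord = {
  feature: string;          // which AI function produced the output
  model: string;            // which model actually answered, after fallbacks
  system_prompt: string;
  user_input: unknown;
  output: unknown;
  rating: "up" | "down";
  comment?: string;         // optional free-text feedback from the user
  created_at: string;
};

async function logFeedback(record: FeedbackRecord) {
  // In the real product this goes to the database; a JSONL file keeps the sketch self-contained
  await appendFile("ai_feedback.jsonl", JSON.stringify(record) + "\n");
}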

The key insight is that users will tell you when something is wrong. They won't always tell you why, but they'll tell you that. Combined with the logged data, you can usually figure out what went wrong.

Over time, the collection of bad outputs becomes your regression tests. When you change a prompt, check if it would still produce good output for the cases that previously failed.
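A sketch of how that replay can look, reusing tryTest and llmAsAJudge from the test suite above. loadThumbsDownRecords and the aiFunctions registry are illustrative - they just need to map a logged failure back to the function that produced it:

typescript
// Map each logged feature name back to its AI function
const aiFunctions: Record<string, (...args: any[]) => Promise<unknown>> = {
  "GENERATE DOCUMENT": generateDocument,
  "PARSE PDF": parseFromPDF,
};

async function runRegressionSuite() {
  // Read back the thumbs-down records logged earlier (illustrative helper)
  const failures: FeedbackRecord[] = await loadThumbsDownRecords();

  const allResults = await Promise.all(
    failures.map(record =>
      tryTest({
        fn: aiFunctions[record.feature],
        label: `REGRESSION: ${record.feature}`,
        args: [record.user_input],
      })
    )
  );

  // The same judge that gates deploys now scores the previously-bad cases
  return llmAsAJudge({ allResults });
}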

What I learned

You can't test LLM outputs the same way you test traditional software. The goal isn't to prove that the output is correct - it's to catch when something goes obviously wrong.

LLM-as-a-judge tests run before every deploy. They catch technical failures. Human evals run continuously. They catch quality issues. The combination gives enough confidence to ship changes without breaking things for users.

It's not perfect. But it's way better than deploying blind and hoping for the best.