
Build high quality AI features with simple feedback loops

Last updated
24 September 2025
Published
24 September 2025
Content
Peter Wooden
Creative
Emi Chiba

What if testing AI prompts felt as intuitive and visual as building UI components in Storybook?

Peter Wooden is Engineering Manager of the AI team at Dovetail.

At Dovetail, we face a unique challenge: we have more AI prompts than engineers. With a product that heavily relies on AI to analyze customer research data, we need to ship quickly while maintaining the high bar for quality and trust that our customers expect.

Many teams dread building evaluations (evals). Software engineers often don’t know where to start, while AI/ML engineers might believe they need to spend weeks building comprehensive datasets and frameworks before they can start delivering customer value. This creates a dangerous gap where teams either ship without proper testing or get stuck in analysis paralysis.

This reality forced us to rethink how we approach AI quality. The traditional dichotomy between model performance and shipping velocity creates a false choice.

Instead, we’ve discovered that you can achieve the best of both worlds by reframing the problem: How do you create the right feedback loop for AI feature development? What if you could create a feedback loop as powerful as Storybook for frontend development, but for AI instead?

In this blog post, I’m going to cover what didn’t work for us and five ways to build feedback loops in twenty-minute increments.

What didn’t work for us

We learned this the hard way. Here’s what didn’t work for us:

1. Eyeballing in the app

This is tempting, but it’s a terrible idea. Don’t do it, even during hackathons. You can’t track stable inputs or compare outputs over time, and you can only test one input at a time, which leads to constant quality regressions.

2. Off-the-shelf frameworks

We evaluated fifteen libraries and hosted tools last year. The hosted solutions were too expensive and poorly integrated. The libraries were too rigid, requiring domain-specific languages instead of letting us use our application code, and often relied too much on top-down metrics.

3. End-to-end evals only

When working on agentic systems or prompt chains, E2E evals have their place, but relying solely on them creates a fragile feedback loop. Instead, we learned to apply the testing pyramid: create clearly defined contracts between every agent and prompt in your system so you can build unit evals for each component. If there is a problem, unit evals can pinpoint exactly which component needs fixing, whereas E2E evals cannot.
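As a rough illustration, a contract can be as small as a typed input/output pair per prompt; the names below are hypothetical, not our actual interfaces:

```typescript
// A sketch of a "contract" between components in a prompt chain: each prompt gets
// typed inputs and outputs so it can be mocked, swapped, and unit-evaled on its own.
// These names are illustrative, not Dovetail's actual interfaces.
export interface ExtractHighlightsInput {
  transcript: string;
}

export interface ExtractHighlightsOutput {
  highlights: { quote: string; theme: string }[];
}

// Every step in the chain follows the same shape, which is what makes unit evals possible.
export type PromptStep<Input, Output> = (input: Input) => Promise<Output>;
```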

Five tools to build powerful feedback loops

The breakthrough came when we realized you can build useful evals in just twenty minutes. Here’s our toolkit:

Tool 1: Snapshot “evals”

As an example, let’s evaluate a prompt that generates a title for a document.

Think of this as snapshot testing from the software engineering world. It’s not true evaluation, but it’s incredibly powerful for rapid iteration:

  1. Isolate a single prompt (unit test)

  2. Create a handful of inputs as code

  3. Write a script to output results to a file

The feedback loop is extraordinary. You can run the “eval” in watch mode, see your prompt on the left, outputs on the right, and watch changes propagate in real-time. Read through the outputs and do error analysis to figure out how it’s failing, and hypothesize what to change in the prompt to fix it. It’s like Storybook for React, but for prompt engineering—a true WYSIWYG experience for AI development.
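Here’s a minimal sketch of the script from step 3, assuming a hypothetical `generateTitle` helper that wraps the title prompt; the inputs and file paths are just illustrative:

```typescript
// snapshot-eval.ts — a minimal snapshot "eval" sketch for a title-generation prompt.
// `generateTitle` is a hypothetical wrapper around the prompt/LLM call; swap in your own.
import { mkdir, writeFile } from "node:fs/promises";
import { generateTitle } from "./generateTitle";

// A handful of hand-picked inputs, kept in code so they stay stable between runs.
const inputs = [
  { id: "interview-pricing", document: "Customer interview transcript about pricing objections..." },
  { id: "usability-onboarding", document: "Usability session notes covering onboarding friction..." },
  { id: "survey-nps", document: "Open-ended NPS survey responses from Q3..." },
];

async function main() {
  await mkdir("snapshots", { recursive: true });
  for (const input of inputs) {
    const title = await generateTitle(input.document);
    // One file per input so diffs stay readable in watch mode and in PRs.
    await writeFile(`snapshots/${input.id}.txt`, title + "\n");
  }
}

main();
```

Run it with a file watcher (for example `tsx watch` or `node --watch`) and keep the snapshot files open next to the prompt to get the side-by-side loop described above.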

Why start with only a handful of inputs? The highest-impact quality issues can often be discovered with a very small sample size. After addressing these, start adding more data to your eval dataset.

This approach helped us win a hackathon. We had twelve prompts, applied snapshot evals to each one, and shipped a high-quality prototype that worked reliably for a live demo.

Tool 2: Check sample outputs into Git

The best PR review experience happens when engineers can see prompt changes and the before/after outputs in the same view. Everything stays in GitHub—no need to switch between platforms. The clarity this provides for code review is game-changing.

Tool 3: Simple code evals

Keep it simple:

  • Use basic string matches and operations

  • Add human-labeled ground truths

  • Gradually increase dataset size (5 → 10 → 20 → 50 → hundreds)

For example, test which tools your AI chooses to use. Create unit tests that iterate over ground truths and verify the AI selects the expected tool.
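A rough sketch of what that can look like as an ordinary unit test, assuming a hypothetical `chooseTool` function that asks the model which tool to call (Vitest shown, but any test runner works):

```typescript
// tool-choice.eval.test.ts — a sketch of a simple code eval for tool selection,
// written as an ordinary unit test. `chooseTool` is a hypothetical function
// that asks the model which tool to call for a given query.
import { describe, expect, it } from "vitest";
import { chooseTool } from "./chooseTool";

// Human-labeled ground truths: input -> the tool we expect the model to pick.
const groundTruths = [
  { query: "Summarize the key themes across these interviews", expectedTool: "summarize" },
  { query: "Find every mention of pricing in this project", expectedTool: "search" },
  { query: "Tag these highlights by sentiment", expectedTool: "classify" },
];

describe("tool selection", () => {
  for (const { query, expectedTool } of groundTruths) {
    it(`picks "${expectedTool}" for: ${query}`, async () => {
      const tool = await chooseTool(query);
      // A basic string match is enough to start; grow the dataset over time.
      expect(tool).toBe(expectedTool);
    });
  }
});
```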

The magic happens when you visualize results as diff files. You can see exactly which inputs failed and how, making it easy to build intuitions about what needs fixing.

Level up: Calculate quantitative scores and track them in Git over time. Having precision and recall scores in your Git diffs is another game-changer.
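As a sketch, the scoring can be a small function that computes precision and recall over labeled results and writes them to a file you check into Git; the shapes and paths here are illustrative:

```typescript
// scores.ts — a sketch of quantitative scoring: precision and recall over labeled
// results, written to a file that is checked into Git so the numbers show up in diffs.
import { writeFile } from "node:fs/promises";

interface LabeledResult {
  predicted: string[]; // labels the model produced
  expected: string[];  // human-labeled ground truth
}

function precisionRecall(results: LabeledResult[]) {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (const { predicted, expected } of results) {
    for (const label of predicted) {
      if (expected.includes(label)) truePositives++;
      else falsePositives++;
    }
    for (const label of expected) {
      if (!predicted.includes(label)) falseNegatives++;
    }
  }
  return {
    precision: truePositives / (truePositives + falsePositives || 1),
    recall: truePositives / (truePositives + falseNegatives || 1),
  };
}

export async function writeScores(results: LabeledResult[]) {
  const { precision, recall } = precisionRecall(results);
  await writeFile(
    "snapshots/scores.json",
    JSON.stringify({ precision: +precision.toFixed(3), recall: +recall.toFixed(3) }, null, 2) + "\n",
  );
}
```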

Tool 4: Include diagnostics

You can level up your system’s outputs by including diagnostic information. This supercharges error analysis: it makes it easy to build intuitions about why things are failing, and quickly gives you ideas about what to try next.

Visual diffing: Leverage your visual cortex to analyze problems quickly and recognize failure patterns. You’ll be able to quickly form hypotheses, such as whether the LLM is missing context or has the wrong instructions.

For multi-label classification, we visualize results in syntax-highlighted diff format where + indicates false positives, plain text shows true positives, and - shows false negatives. This makes it immediately obvious what’s breaking.
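A small helper can produce that format and write it with a `.diff` extension so editors and GitHub highlight it; this is an illustrative sketch, not our exact implementation:

```typescript
// diff-view.ts — a sketch of the diff-style view for multi-label classification:
// "+" = false positive, "-" = false negative, plain line = true positive.
import { writeFile } from "node:fs/promises";

// Build a diff-style view of one item's labels.
function toDiff(predicted: string[], expected: string[]): string {
  const lines: string[] = [];
  for (const label of expected) {
    // Expected labels: present = true positive (plain line), missing = false negative ("-").
    lines.push(predicted.includes(label) ? ` ${label}` : `- ${label}`);
  }
  for (const label of predicted) {
    if (!expected.includes(label)) lines.push(`+ ${label}`); // false positive
  }
  return lines.join("\n") + "\n";
}

// Example: one document's labels, written next to the other eval outputs.
async function main() {
  await writeFile(
    "snapshots/labels-interview-pricing.diff",
    toDiff(["pricing", "churn", "onboarding"], ["pricing", "churn", "competitors"]),
  );
}

main();
```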

Expose reasoning: Dump chain-of-thought or extended thinking into output files next to the result. Inspecting the thinking really lets you empathize with the LLM and diagnose why it is going off track.

Spot-check LLM judges: If you’re using LLM-as-a-judge, make it easy to verify decisions.

Tool 5: Bottom-up user feedback

Use real user feedback to prioritize additional evals. Don’t implement every possible metric upfront—focus on measuring failure modes you know are important from initial testing and error analysis, then let alpha and beta users guide your next priorities.

Sometimes the most important problems aren’t covered by canonical eval metrics. User feedback like “this summary’s too basic, needs more granular breakdown” points directly to specificity issues you should measure. Focus on the failure modes particular to your system.

The path forward

Don’t rely on eyeballing in the app. Invest in a feedback loop that builds quality quickly and pragmatically.

You don’t need to implement all the standard best practices upfront, but you should aim to over time. Start with these simple ways of building feedback loops, then evolve toward bigger datasets, more metric coverage, and comprehensive best practices.

The key insight: Speed and quality aren’t opposing forces in AI development—they’re complementary when you build the right feedback loops.
