Engineering · June 10, 2026 · 9 min read

30,000 Tests and Counting: How We Ship AI Software Without Breaking Things

Marcus Griffith-Boyes

Chief Technology Officer

The Testing Problem Nobody Warns You About

Testing a traditional web application is a solved problem with a well-rehearsed shape: unit tests for the business logic, integration tests for the API contracts, end-to-end tests for the critical user flows, and a green build that means the same input is still producing the same output it produced yesterday. If a test fails, something broke.

Testing an AI system is a different animal entirely. The intelligence layer produces contextual, probabilistic output. The same deal with the same data might get a different priority score depending on what's happening elsewhere in the organisation. A shared intelligence layer that spans every department creates emergent behaviours that no individual agent test can predict.

The naive approach is to test the AI like traditional software and accept the flaky tests. The sophisticated approach, which took us months to develop, is to draw a sharp line between deterministic and non-deterministic boundaries and test them completely differently.

Three Testing Layers

  • Deterministic boundaries: tenant isolation, data access controls, audit chain integrity, permission enforcement, API contracts. These are tested with conventional assertions. They must pass 100% of the time. No flakiness tolerated.
  • Signal propagation: when a support ticket is filed, does it affect the correct entities within the expected time window? These are integration tests that verify cross-agent effects without asserting specific AI output.
  • Intelligence outcomes: given a known scenario, does the system produce a reasonable recommendation? These tests validate the quality of intelligence through outcome ranges rather than exact matches.
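
The second and third layers can be sketched in a few lines. This is a minimal illustration, not RevSprint's test code: `FakeSignalBus`, `check_signal_propagation`, `check_outcome_in_range`, and the account IDs are all hypothetical names invented for the example. The propagation check asserts that the right entity was affected within a time window without looking at any AI output; the outcome check asserts that a score falls inside an acceptable range rather than matching an exact value.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeSignalBus:
    """Hypothetical in-memory stand-in for a cross-agent signal layer."""

    affected: dict = field(default_factory=dict)

    def file_support_ticket(self, account_id: str) -> None:
        # Record the moment the ticket's effect reached the account entity.
        self.affected[account_id] = time.monotonic()


def wait_until(predicate, timeout_s: float, interval_s: float = 0.01) -> bool:
    """Poll `predicate` until it returns True or the time window expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return predicate()


def check_signal_propagation(bus: FakeSignalBus, account_id: str,
                             timeout_s: float = 1.0) -> bool:
    """Signal propagation: was the correct entity affected in time?"""
    bus.file_support_ticket(account_id)
    return wait_until(lambda: account_id in bus.affected, timeout_s)


def check_outcome_in_range(score: float, low: float, high: float) -> bool:
    """Intelligence outcome: accept any score inside the expected range."""
    return low <= score <= high
```

A propagation test built this way fails only when the effect is missing, late, or lands on the wrong entity. An outcome test fails only when the recommendation leaves the range a reviewer would call reasonable.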

The first layer is the largest by some margin. More than sixteen thousand tests cover the deterministic core: the security boundaries, the data integrity guarantees, the signal routing logic, the API behaviours, and the permission system. All of them run on every commit, and any single failure blocks that commit until it is fixed. Nothing is deferred to the next sprint.
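
A deterministic boundary test looks like ordinary software testing. The sketch below assumes a hypothetical `RecordStore` with a tenant check on reads; the class names, the exception, and the tenant IDs are illustrative and not taken from any real codebase. The point is the shape of the assertion: a cross-tenant read must fail every time, with no tolerance band.

```python
from dataclasses import dataclass


class TenantAccessError(Exception):
    """Raised when a caller asks for another tenant's data."""


@dataclass(frozen=True)
class Record:
    tenant_id: str
    payload: str


class RecordStore:
    """Hypothetical store that enforces tenant isolation on every read."""

    def __init__(self) -> None:
        self._records: dict = {}

    def put(self, record_id: str, record: Record) -> None:
        self._records[record_id] = record

    def get(self, caller_tenant_id: str, record_id: str) -> Record:
        record = self._records[record_id]
        if record.tenant_id != caller_tenant_id:
            # Deterministic boundary: never return another tenant's record.
            raise TenantAccessError(record_id)
        return record


def test_tenant_isolation() -> None:
    store = RecordStore()
    store.put("r1", Record(tenant_id="tenant-a", payload="deal notes"))

    # The owning tenant can read its own record.
    assert store.get("tenant-a", "r1").payload == "deal notes"

    # Any other tenant must be rejected, with no exceptions to the rule.
    try:
        store.get("tenant-b", "r1")
    except TenantAccessError:
        pass
    else:
        raise AssertionError("cross-tenant read was allowed")
```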

Test count is a vanity metric. What matters is whether your tests catch the failures that would actually hurt users. A thousand tests on string formatting are worth less than ten tests on tenant isolation. We obsess over coverage strategy, not coverage percentage.

Marcus Griffith-Boyes, Chief Technology Officer, RevSprint

Blast Radius and Pre-Commit Enforcement

In a system this interconnected, every change has a blast radius. Modifying how one part of the system processes signals can affect intelligence output across every department. The question after every code change isn't 'does my test pass?' but 'what else did I break?'

We solve this with targeted blast radius analysis. Every change triggers identification of all test files that reference the modified code, not just the obvious file with a matching name. Changing a scoring function runs not only the scoring tests but every test that depends on scores downstream: notification tests, priority tests, action suggestion tests, surface rendering tests.
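
One common way to compute such a blast radius is to invert the dependency graph and walk it outward from the changed module. The sketch below assumes the import graph is already available as a plain dictionary and that test modules are recognisable by a `test_` prefix. Both are assumptions made for illustration, as are the module names.

```python
from collections import defaultdict, deque


def build_reverse_deps(imports: dict) -> dict:
    """Map each module to the set of modules that import it directly."""
    reverse = defaultdict(set)
    for module, deps in imports.items():
        for dep in deps:
            reverse[dep].add(module)
    return reverse


def blast_radius(changed, imports,
                 is_test=lambda name: name.startswith("test_")) -> list:
    """Return every test module that depends, directly or transitively,
    on any module in `changed`."""
    reverse = build_reverse_deps(imports)
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependant in reverse.get(current, ()):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return sorted(name for name in seen if is_test(name))


# Hypothetical import graph: module -> modules it imports.
IMPORTS = {
    "scoring": [],
    "priority": ["scoring"],
    "notifications": ["priority"],
    "billing": [],
    "test_scoring": ["scoring"],
    "test_priority": ["priority"],
    "test_notifications": ["notifications"],
    "test_billing": ["billing"],
}

print(blast_radius(["scoring"], IMPORTS))
# ['test_notifications', 'test_priority', 'test_scoring']
```

Changing `scoring` selects the notification and priority tests as well as the scoring tests, while `test_billing` is left out because nothing on its import path touches the change.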

Pre-commit hooks enforce structural rules without asking permission. They cover the security boundaries, the architectural patterns, and the code quality standards that the team has agreed are non-negotiable. These are automated gates, not guidelines left to a reviewer's discretion: a change that violates the architecture never reaches the repository, because the automation rejects it before a human reviewer has to remember to.
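
As an illustration of an architectural gate, the sketch below checks Python source for imports that a given layer is not allowed to use. The rule table, the layer names, and the module names are hypothetical. The only real mechanics relied on are Python's standard `ast` module and the fact that Git aborts a commit when its pre-commit hook exits with a non-zero status.

```python
import ast

# Hypothetical rule: code in the "api" layer must not import raw storage.
FORBIDDEN_IMPORTS = {"api": {"storage.raw"}}


def imported_modules(source: str) -> set:
    """Collect every absolute module name imported by the source text."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


def violations(layer: str, source: str, rules=FORBIDDEN_IMPORTS) -> list:
    """Return the imports in `source` that are banned for `layer`."""
    banned = rules.get(layer, set())
    return sorted(
        module
        for module in imported_modules(source)
        if any(module == b or module.startswith(b + ".") for b in banned)
    )


def hook_exit_code(staged: dict) -> int:
    """`staged` maps (layer, path) to source text; 1 means block the commit."""
    failed = False
    for (layer, path), source in staged.items():
        for module in violations(layer, source):
            print(f"{path}: layer '{layer}' may not import '{module}'")
            failed = True
    return 1 if failed else 0
```

Wired into a real hook, a script like this would read the staged files and pass the result of `hook_exit_code` to `sys.exit`, so the violating change is rejected before it is ever committed.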

What This Means for Buyers

Engineering teams evaluating AI platforms should ask one question: how do you test cross-agent behaviour? If the answer involves manual QA or 'we test each agent independently', the platform will break in production in ways that individual agent testing can't predict.

RevSprint's organisation-wide architecture means that intelligence quality is a system property, not a component property. Testing must be system-level too. Our 30,000 tests aren't a bragging number. They're the minimum viable coverage for a system where a change in one corner can ripple across every department, four surfaces, and the whole organisation. If we shipped with fewer, we'd be guessing. We don't guess. We pair this with the immutable audit chain and the tenant isolation guarantees covered elsewhere, because trustworthy AI is a whole-system property.

For the structural picture of why this matters, see Compliance as Architecture. Google's Site Reliability Engineering book on testing for reliability makes the same argument from the production-systems side: testing must measure end-to-end system behaviour, not isolated units. That is what 30,000 tests, run on every commit, are for. To see this engineering rigour on your own stack, get early access.

Tags: Testing, Quality, CI/CD, Engineering