How to Fix Flaky Tests Without Hiding Them Behind Retries
A flaky test is one that produces different results — pass or fail — against the same code. Fixing flaky tests means doing two things in order: identifying the root cause (timing, shared state, animations, network mocks, dynamic data, test order, or environment drift), then applying a deterministic fix instead of papering over the failure with a retry. This guide walks through all seven root causes with code-level fixes, then covers the detection systems — Playwright's --repeat-each, Cypress retries, Jest isolation, and quarantine queues — that high-velocity engineering teams use to keep flakiness from compounding.
Key takeaways
- Google's own research found that roughly 1.5% of test runs at the company exhibit flaky behavior, affecting about 16% of tests in their corpus, or about one in six.
- Retries hide flakiness rather than fix it. Use them as a temporary signal, never as a permanent policy.
- Seven root causes account for the overwhelming majority of real-world flakiness: async timing, test order, shared state, animations, network mock drift, dynamic data, and environment drift.
- Quarantine queues, flakiness budgets, and --repeat-each-style stress runs are the patterns Google, Spotify, Microsoft, and Atlassian have all converged on.
What makes a test flaky
A test is flaky when it produces inconsistent results against unchanged code. The test is nondeterministic. That is the line separating flakiness from a real bug: a test that fails consistently against a defect is doing its job, while a test that fails one run in fifty for reasons unrelated to the change being evaluated is noise.
In De-Flake Your Tests: Automatically Locating Root Causes of Flaky Tests in Code At Google, the research team reported that around 84% of pass-to-fail transitions in post-submit testing at Google involve a flaky test. The Crosscheck team has seen the same pattern at smaller scale — once a CI suite crosses a threshold of perceived unreliability, developers stop investigating failures and start re-running pipelines on reflex. That reflex is where real regressions slip through.
The most dangerous flaky tests are not the ones that fail every third run. Those are visible and get fixed. The dangerous ones fail once every fifty runs, or only under CI load — failures that get coded as "infrastructure noise" and then mask a real regression six months later.
The seven root causes of flaky tests
Most flakiness in real codebases traces back to one of seven causes. Each one has a distinct detection signature and a deterministic fix.
1. Async timing — the most common cause
Timing issues are the dominant source of flakiness in UI and integration tests. An action triggers async work — a fetch, a state update, an animation — and the test asserts on the resulting state before that work has settled. On a fast local machine the assertion lands after the work completes. On a loaded CI runner it lands before, and the test fails.
Wrong fix: fixed-duration waits.
// flaky — any fixed duration can be exceeded under load
await page.click('#submit');
await page.waitForTimeout(2000);
await expect(page.locator('.success')).toBeVisible();
Right fix: wait for the specific condition being tested.
// deterministic — wait for the state that matters
await page.click('#submit');
await expect(page.locator('.success')).toBeVisible({ timeout: 10_000 });
Playwright's locator.waitFor() and expect(locator).toBeVisible(), Cypress's chained cy.get().should(), and Testing Library's waitFor() are all built around this idea: poll until the condition is true, then proceed.
How to detect: run the suspect test with --repeat-each 50 --workers 1 in Playwright, then again with more workers or extra CPU load on the machine. If the flake rate moves with CPU contention, the cause is timing.
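To make the poll-until-true idea concrete, here is a minimal sketch of what a condition-based wait does. It is not how Playwright, Cypress, or Testing Library implement their waits; the helper name `waitForCondition` and its `timeoutMs` and `intervalMs` options are assumptions made for illustration.

```javascript
// Hypothetical helper: poll a condition until it holds or a deadline passes.
async function waitForCondition(condition, { timeoutMs = 5000, intervalMs = 50 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // Return as soon as the condition is truthy, however early that happens.
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    // Sleep briefly before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The timeout here is only an upper bound. A fast run proceeds the moment the condition holds, and a slow CI run gets the full budget before failing, which is the difference from a fixed sleep.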
2. Test order dependencies
Tests written assuming a specific execution order start failing the moment a CI runner parallelises them or a test framework randomises them. Test A populates a database fixture, test B reads from it, and the assertion passes only because A runs first. Reorder the suite and B fails in isolation.
How to detect: turn on test order randomisation. Jest supports --randomize, with an explicit --seed for reproducibility, and Vitest has sequence.shuffle. Mocha has no built-in shuffle; its --sort flag only orders test files alphabetically, so randomising there takes a plugin or a custom wrapper. Tests that pass under a fixed order but fail under random order are exhibiting order dependence by definition.
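Why a seed matters can be shown with a small sketch: a seeded generator driving a Fisher-Yates shuffle produces the same "random" order every time for the same seed, so a failing order can be replayed. This is an illustration only, not the shuffling code any test runner actually uses; `mulberry32`, `shuffleWithSeed`, and the test names are assumptions.

```javascript
// Small deterministic PRNG: the same seed always yields the same sequence.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG; returns a new array.
function shuffleWithSeed(items, seed) {
  const rand = mulberry32(seed);
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Hypothetical test names; the order printed is identical on every run for seed 1234.
const tests = ['creates user', 'edits post', 'deletes post', 'lists posts'];
console.log(shuffleWithSeed(tests, 1234));
```

When a shuffled run fails, recording the seed alongside the failure is what turns "it failed once in CI" into an order that can be replayed locally.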
The fix: every test creates the state it needs. No test reads state another test wrote.
// before: implicit dependency on test A having run first
describe('user permissions', () => {
  // test A creates the user somewhere upstream
  it('admin can edit posts', async () => {
    const user = await db.users.findOne({ role: 'admin' });
    expect(await canEdit(user)).toBe(true);
  });
});
// after: the test owns its setup, and cleanup runs even if the assertion fails
describe('user permissions', () => {
  let user;
  beforeEach(async () => {
    user = await db.users.create({ role: 'admin' });
  });
  afterEach(async () => {
    await db.users.delete(user.id);
  });
  it('admin can edit posts', async () => {
    expect(await canEdit(user)).toBe(true);
  });
});
3. Shared and leaking state
Closely related to order dependence, but with a different signature: tests modify shared resources — a database row, a singleton cache, a global env var, localStorage, the filesystem — and forget to clean up. Subsequent tests inherit unexpected state.
The cleanest pattern is transactional teardown for anything touching a database, plus beforeEach/afterEach hooks that run unconditionally.
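// Jest (jsdom environment): guaranteed reset between tests
// `db` stands in for your database client; beginTransaction() and rollback()
// are illustrative names, so substitute whatever your client provides
beforeEach(async () => {
  jest.resetModules();
  jest.clearAllMocks();
  localStorage.clear();
  await db.beginTransaction();
});
afterEach(async () => {
  await db.rollback(); // runs even when the test fails
});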
Teardown that only runs on success is itself a state leak waiting to happen — if a test fails partway through and skips its cleanup block, the next test inherits the junk. Use afterEach and afterAll, which run regardless of outcome, rather than appending teardown to the test body.
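The same guarantee can be sketched without any test framework: cleanup placed in a `finally` block runs whether the body passes or throws. The `runWithCleanup` helper and the in-memory `sharedRows` set below are hypothetical stand-ins, not part of Jest or any database client.

```javascript
// Hypothetical in-memory stand-in for a shared resource such as a database table.
const sharedRows = new Set();

// Run a test body and always run its cleanup, even if the body throws.
async function runWithCleanup(body, cleanup) {
  try {
    await body();
  } finally {
    await cleanup();
  }
}

async function demo() {
  try {
    await runWithCleanup(
      async () => {
        sharedRows.add('user-1');
        throw new Error('assertion failed'); // simulate a failing test
      },
      async () => {
        sharedRows.delete('user-1'); // still runs after the failure
      }
    );
  } catch (err) {
    // The failure still propagates to the caller; only the leaked state is gone.
  }
  return sharedRows.size;
}
```

Framework hooks such as afterEach give the same behaviour with less ceremony; the sketch only shows why cleanup appended to the end of a test body is not equivalent.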
4. Animations and transitions
CSS animations, transitions, and motion libraries — Framer Motion, GSAP, Anime.js — produce intermediate DOM states that visual regression tests and selector-based assertions both stumble on. A button rendered at opacity: 0.5 is technically present in the DOM and technically not visible to a user. The test resolves either way depending on the frame it samples.
Detection: record a video of the failing run. Playwright video: 'on' and Cypress video: true capture the actual frames. Flickering elements at the moment of failure are an animation tell.
Fix: disable animations globally in the test environment.
// Playwright: disable animations and transitions on the current page
// (call after page.goto(); the injected style does not survive a navigation)
await page.addStyleTag({
  content: `
    *, *::before, *::after {
      animation-duration: 0s !important;
      animation-delay: 0s !important;
      transition-duration: 0s !important;
      transition-delay: 0s !important;
    }
  `,
});
// Cypress: equivalent
beforeEach(() => {
  cy.visit('/', {
    onBeforeLoad(win) {
      win.document.head.insertAdjacentHTML(
        'beforeend',
        '<style>*, *::before, *::after { animation: none !important; transition: none !important; }</style>'
      );
    },
  });
});
For visual regression suites — Percy, Applitools, Chromatic — render with animations disabled and explicit waits for async data. See Crosscheck's breakdown of visual regression testing tools for the trade-offs between AI-diffed and pixel-diffed comparisons.
5. Network mocking gaps
A test that makes real network calls to a third-party API is flaky by design — that service's availability and latency are now part of your test's reliability surface. Mocking outbound requests is the standard fix, but the failure mode that catches teams off guard is mock drift: the real API changed, the mocked response did not, and tests pass on a fiction.
Detection: schedule a periodic contract-test run that hits the real API and compares the response shape to your mock fixtures. Pact, MSW's response validation, or a hand-rolled diff against an OpenAPI schema all work.
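A hand-rolled shape diff can be quite small. The sketch below compares key names and value types between a mock fixture and a real response and ignores the values themselves. It is a simplified illustration, not a substitute for Pact or schema validation; the function names and sample payloads are assumptions, and arrays are judged by their first element only.

```javascript
// Flatten a JSON value into a map of path -> type name, ignoring actual values.
function flattenShape(value, path = '$', out = {}) {
  if (Array.isArray(value)) {
    out[path] = 'array';
    if (value.length > 0) flattenShape(value[0], `${path}[]`, out);
  } else if (value !== null && typeof value === 'object') {
    out[path] = 'object';
    for (const key of Object.keys(value)) {
      flattenShape(value[key], `${path}.${key}`, out);
    }
  } else {
    out[path] = value === null ? 'null' : typeof value;
  }
  return out;
}

// List every path where the mock fixture and the real response disagree.
function findDrift(mockFixture, realResponse) {
  const mock = flattenShape(mockFixture);
  const real = flattenShape(realResponse);
  const paths = new Set([...Object.keys(mock), ...Object.keys(real)]);
  const drift = [];
  for (const path of [...paths].sort()) {
    if (mock[path] !== real[path]) {
      drift.push({ path, mock: mock[path] ?? 'missing', real: real[path] ?? 'missing' });
    }
  }
  return drift;
}

// Hypothetical payloads: the real API now returns a numeric id and a new field.
const mockUser = { id: '42', name: 'Test User' };
const realUser = { id: 42, name: 'Ada', email: 'ada@example.com' };
console.log(findDrift(mockUser, realUser));
```

Run on a schedule against a staging endpoint, a non-empty result is the signal that a fixture needs updating before the tests built on it start passing on a fiction.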
Fix: tie mocks to the contract, not to ad-hoc handlers.
// Mock Service Worker: handlers shared across browser and Node tests
import { http, HttpResponse } from 'msw';
import { setupServer } from 'msw/node';

const server = setupServer(
  http.get('/api/users/:id', ({ params }) => {
    return HttpResponse.json({
      id: params.id,
      name: 'Test User',
      // shape matches the real API contract, verified nightly
    });
  })
);

beforeAll(() => server.listen({ onUnhandledRequest: 'error' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());
The onUnhandledRequest: 'error' option is the important detail: any test that tries to make an unmocked network call fails loudly, so you find the gaps rather than silently shipping a flaky integration.
6. Dynamic data — dates, IDs, randomness
Tests that compare against new Date(), Math.random(), crypto.randomUUID(), or any per-run value drift eventually. The classic version is the test that hardcodes a date and breaks at midnight UTC or at the daylight-saving transition. The subtler version is the snapshot test that captures a UUID-stamped payload and fails on every run after the first.
Fix: freeze time and seed randomness inside the test.
// Vitest: freeze the clock (Jest exposes the same three calls on `jest` instead of `vi`)
beforeEach(() => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date('2026-01-15T12:00:00Z'));
});
afterEach(() => {
  vi.useRealTimers();
});
// Snapshot tests: normalise variable values before serialising
expect.addSnapshotSerializer({
  test: val => typeof val === 'string' && /^[0-9a-f-]{36}$/.test(val),
  print: () => '"<UUID>"',
});
For Playwright and Cypress end-to-end tests, override Date.now at the page level via page.addInitScript() or cy.clock(). Same idea, different surface.
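Where the code under test is yours to change, a complementary approach to freezing globals is to pass the clock and the ID generator in as parameters, so a test supplies fixed values directly. The sketch below assumes a hypothetical `createInvoice` function with hypothetical `now` and `newId` options; nothing here comes from a real library beyond Node's built-in `randomUUID`.

```javascript
const { randomUUID } = require('node:crypto');

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Hypothetical function under test: production uses the real clock and real
// UUIDs by default, while tests inject deterministic replacements.
function createInvoice(amount, { now = () => new Date(), newId = randomUUID } = {}) {
  const issuedAt = now();
  const dueAt = new Date(issuedAt.getTime() + THIRTY_DAYS_MS);
  return {
    id: newId(),
    amount,
    issuedAt: issuedAt.toISOString(),
    dueAt: dueAt.toISOString(),
  };
}

// In a test: fixed time and fixed ID, so the result is identical on every run.
const invoice = createInvoice(100, {
  now: () => new Date('2026-01-15T12:00:00Z'),
  newId: () => 'invoice-0001',
});
console.log(invoice);
```

The trade-off is a slightly wider function signature in exchange for tests that need no fake timers at all.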
7. Environment drift
Tests that pass locally but fail in CI are usually an environment problem rather than a code problem. The list of differences is long — memory pressure, CPU contention, container resource limits, timezone, locale, installed fonts, Node version, Chromium version, screen DPI. Any of these can change behaviour.
Fix: containerise the test environment so local and CI runs use the same image. When docker run my-test-image produces the same result as the CI runner, a failure in CI can be reproduced locally in minutes instead of hours.
# Single image for local dev and CI test execution
FROM mcr.microsoft.com/playwright:v1.50.0-jammy
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
CMD ["npx", "playwright", "test"]
Pinning the image tag is the small detail that catches people — latest drifts, a pinned version does not. Same logic applies to Node: lock to a specific minor in .nvmrc, not a >= range.
How to detect flaky tests systematically
Detection is the part most teams get wrong. They underestimate the scope of the problem because they have no measurement system — a test that fails once in fifty runs looks unremarkable in any single run and only becomes obvious when aggregated across hundreds.
Track pass/fail history per test
Every modern CI platform supports this natively or integrates with a service that does. Azure DevOps has built-in flaky test management with customisable detection logic: a test that passes on rerun within the same pipeline can be flagged and tracked across subsequent runs. Trunk Flaky Tests, Datadog Test Optimization, and BuildPulse offer the same primitive as a third-party service for teams on GitHub Actions or CircleCI.
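For teams without such a service, the core of the measurement is simple enough to sketch. The example below assumes a hypothetical record format (`test`, `commit`, `passed`) exported from CI, and counts a test as flaky on a commit when it both passed and failed against that same commit. Real services use richer signals; this is only the basic idea.

```javascript
// Build a per-test flake report from raw execution records.
// A test is flaky on a commit if it has at least one pass and one fail there.
function flakeReport(records) {
  const byTest = new Map();
  for (const { test, commit, passed } of records) {
    if (!byTest.has(test)) byTest.set(test, new Map());
    const commits = byTest.get(test);
    if (!commits.has(commit)) commits.set(commit, { pass: 0, fail: 0 });
    commits.get(commit)[passed ? 'pass' : 'fail'] += 1;
  }

  const report = [];
  for (const [test, commits] of byTest) {
    let flakyCommits = 0;
    for (const { pass, fail } of commits.values()) {
      if (pass > 0 && fail > 0) flakyCommits += 1;
    }
    report.push({
      test,
      commits: commits.size,
      flakyCommits,
      flakeRate: flakyCommits / commits.size,
    });
  }
  // Worst offenders first.
  return report.sort((a, b) => b.flakeRate - a.flakeRate);
}

// Hypothetical history: "checkout" flips on commit a1, "login" fails consistently.
const history = [
  { test: 'checkout', commit: 'a1', passed: false },
  { test: 'checkout', commit: 'a1', passed: true },
  { test: 'checkout', commit: 'b2', passed: true },
  { test: 'login', commit: 'a1', passed: false },
  { test: 'login', commit: 'a1', passed: false },
];
console.log(flakeReport(history));
```

Note that the consistently failing test gets a flake rate of zero: it is a real failure, not noise, which is the distinction the aggregation exists to draw.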
Use --repeat-each for targeted stress runs
Playwright's --repeat-each is the single most useful command for surfacing intermittent failures during development. It runs each test N times in one invocation, and combined with --workers to control parallelism, reproduces CI conditions locally.
# Run a single suspect test 50 times under parallel load
npx playwright test checkout.spec.ts --repeat-each 50 --workers 4
# Strict mode — fail the run if any test needs a retry to pass
npx playwright test --fail-on-flaky-tests --retries=2
# Workers=1 to rule parallelism out of the equation
npx playwright test --repeat-each 20 --workers 1
The --fail-on-flaky-tests flag is underused. It treats any test that needs a retry to pass as a failure even if it eventually went green — the right default for a pre-merge gate, because a test that retries today will eventually retry on the run that matters.
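The rule that flag enforces can be sketched as a small classification: passed on the first attempt, passed only after a retry, or never passed. The code below is an illustration of that logic under assumed names (`classifyAttempts`, `shouldBlockMerge`) and an assumed result format; it is not Playwright's implementation.

```javascript
// Classify one test from its ordered attempt results (true = pass, false = fail).
function classifyAttempts(attempts) {
  if (attempts.length === 0) return 'skipped';
  if (attempts[0]) return 'passed';
  // Failed first, then passed on some retry: green overall, but unreliable.
  return attempts.includes(true) ? 'flaky' : 'failed';
}

// Decide whether a run should block the merge.
function shouldBlockMerge(results, { failOnFlaky }) {
  return Object.values(results).some((attempts) => {
    const outcome = classifyAttempts(attempts);
    return outcome === 'failed' || (failOnFlaky && outcome === 'flaky');
  });
}

// Hypothetical run: one clean pass, one test that needed a retry.
const results = {
  'login works': [true],
  'checkout totals': [false, true],
};
console.log(shouldBlockMerge(results, { failOnFlaky: false }));
console.log(shouldBlockMerge(results, { failOnFlaky: true }));
```

With the strict setting off, the retried test disappears into a green run; with it on, the same run blocks, which is the behaviour a pre-merge gate wants.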
Use Cypress retries as a diagnostic, not a default
Cypress retries are configured per-environment, which is the right design. Run mode (CI) gets retries enabled, open mode (local development) does not — the only point of a retry is to catch flakiness without blocking the merge queue, and locally you want the failure to surface.
// cypress.config.js
const { defineConfig } = require('cypress');

module.exports = defineConfig({
  retries: {
    runMode: 2, // CI: retry up to 2 times before failing
    openMode: 0, // Local: fail fast so you can debug
  },
});
Cypress 13.4 added experimental retries with thresholds — pass below a flake rate, fail above it — closer to a true flakiness budget. The trade-off is that retries hide failures from the developer who would otherwise fix them, so any retry-dependent test should be tagged and reviewed.
Use Jest's isolation primitives
Jest sandboxes each test file by default, but tests within a file share module state unless you reset explicitly. Three settings matter:
- --runInBand runs all tests serially, removing parallelism as a variable when reproducing a failure.
- --randomize shuffles test order within each file; combined with a fixed --seed it makes random-order failures reproducible.
- resetModules: true plus clearMocks: true in config ensures no test leaks module-level or mock state.
For timing flakiness specifically, --logHeapUsage surfaces memory growth — a steadily climbing heap is often a leaked event listener or untracked subscription, both of which cause downstream timing differences.
Quarantine queues — the pattern that scales
When a test is flaky but not yet fixed, the right move is to quarantine it: keep it running in CI for visibility, but stop it from blocking the build. This is the pattern Spotify documented when they published the design of Flakybot — an internal tool engineers can invoke on a PR to confirm a test is flaky before it ever reaches master — and Odeneye, their visualisation layer for test history. Their iOS team built Master Guardian, which auto-detects the flaky state, files a ticket, and skips the test pre-merge until the fix lands.
Atlassian published a near-identical design — detect on rerun, surface in a dashboard, quarantine while open, auto-close when stable. Microsoft Engineering's internal CloudTest system does the same, language-agnostically, by operating on test result telemetry rather than the test code.
All four (Google, Spotify, Microsoft, Atlassian) converge on the same shape: detect via rerun, file a bug, quarantine, monitor, auto-restore when stable. Smaller teams do not need a CloudTest-scale platform; a Trunk integration or a hand-rolled GitHub Action that maintains a quarantined.txt file and excludes those tests from the blocking job is enough.
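The hand-rolled version can be sketched in a few lines. The file format assumed here (one test ID per line, blank lines ignored, lines starting with `#` treated as comments) and the function names are illustrative choices, not an established convention.

```javascript
// Parse quarantined.txt-style content into a set of test IDs.
// Assumed format: one ID per line; blank lines and lines starting with '#' are skipped.
function parseQuarantineList(text) {
  return new Set(
    text
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line !== '' && !line.startsWith('#'))
  );
}

// Split the suite into tests that gate the merge and tests that only report.
function partitionTests(allTests, quarantined) {
  const blocking = [];
  const nonBlocking = [];
  for (const test of allTests) {
    (quarantined.has(test) ? nonBlocking : blocking).push(test);
  }
  return { blocking, nonBlocking };
}

// Hypothetical file content and test IDs.
const quarantineFile = [
  '# flaky since the checkout refactor',
  'checkout.spec.ts > applies coupon',
  '',
  'search.spec.ts > debounces input',
].join('\n');

const allTests = [
  'checkout.spec.ts > applies coupon',
  'checkout.spec.ts > shows total',
  'search.spec.ts > debounces input',
];

const { blocking, nonBlocking } = partitionTests(allTests, parseQuarantineList(quarantineFile));
console.log({ blocking, nonBlocking });
```

The non-blocking list still runs and still reports, so a quarantined test that stabilises can be spotted and restored rather than forgotten.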
Set a flakiness budget
A flakiness budget is a team-level commitment to a maximum acceptable flake rate, expressed as a percentage of CI runs containing at least one flaky failure. Below the budget, work proceeds. Above the budget, fixing flakiness becomes the team's top priority — no new feature work until the rate drops back under threshold.
Google's research suggests that at 1.5% test-run flakiness across a large suite, the aggregate impact on CI runs is substantial — most of the cost is human time spent triaging false failures, not compute. A 1% budget on flake-impacted runs is a reasonable starting target.
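A rough calculation shows why a small per-test rate adds up, and how a budget check might look. The sketch assumes every test flakes independently at the same rate, which real suites do not satisfy exactly, and the function names and run format (`flakyFailures` per run) are hypothetical.

```javascript
// Chance that a run of `testCount` tests sees at least one flaky failure,
// assuming each test flakes independently with probability `perTestFlakeRate`.
function chanceOfFlakyRun(perTestFlakeRate, testCount) {
  return 1 - Math.pow(1 - perTestFlakeRate, testCount);
}

// Share of observed CI runs that contained at least one flaky failure.
function flakyRunRate(runs) {
  if (runs.length === 0) return 0;
  const flaky = runs.filter((run) => run.flakyFailures > 0).length;
  return flaky / runs.length;
}

// Compare the observed rate against the team's budget (1% by default).
function isOverBudget(runs, budget = 0.01) {
  return flakyRunRate(runs) > budget;
}

// Hypothetical recent runs: two of four were hit by a flaky failure.
const recentRuns = [
  { id: 101, flakyFailures: 0 },
  { id: 102, flakyFailures: 2 },
  { id: 103, flakyFailures: 0 },
  { id: 104, flakyFailures: 1 },
];
console.log(chanceOfFlakyRun(0.015, 100));
console.log(isOverBudget(recentRuns));
```

Under the independence assumption, a 1.5% per-test rate across 100 tests gives roughly a 78% chance that any given run contains a flaky failure, which is why the budget is stated per run rather than per test.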
The trust problem is the real problem
All of the above addresses technical root causes. The deeper problem is organisational — once a team stops trusting its test suite, the behaviours that erode trust are self-reinforcing. Developers who expect failures to be noise do not investigate them. They rerun the pipeline. Non-investigation means flaky tests never get fixed. More accumulate. Trust erodes further. The suite that took years to build becomes a ritual — it runs because the process requires it, not because anyone believes the results.
Rebuilding trust takes both technical remediation and visible commitment. Teams that track flakiness rates publicly, maintain a quarantine list, and ship a fix every sprint are signalling that the test suite is a maintained asset, not a legacy artifact. The metric that matters most is not coverage or suite duration. It is the percentage of CI failures that turn out to be false positives, meaning they pass on rerun against unchanged code.
For more on building a high-trust QA workflow around fast feedback, see the Crosscheck guide to setting up a QA workflow for small teams and the analysis of how Playwright overtook Selenium in modern testing teams.
FAQ
How do I tell if a failure is a real bug or a flaky test?
Re-run the test in isolation. If it passes alone but fails in the full suite, suspect order dependence or state leaking from another test. If it fails intermittently on its own, the cause is more likely timing or environment; --repeat-each 50, run once with --workers 1 and once under load, helps separate the two. A genuine bug fails deterministically; a flaky test fails on some runs and not others against unchanged code.
How many retries should I configure for a flaky test?
Two retries in CI is the common convention for both Playwright and Cypress, though neither tool retries unless configured to. Zero locally so developers see the failure. Treat any test that consistently needs retries as technical debt to fix, not as a stable state to live with.
What's the difference between flakiness detection and quarantine?
Detection identifies which tests are flaky and tracks the rate. Quarantine removes those tests from the blocking critical path while keeping them running for visibility. You need both — detection without quarantine means flaky tests block delivery, quarantine without detection means tests get hidden and forgotten.
Should I use AI-powered self-healing tools to fix flaky tests?
Self-healing locators in Mabl, Testim, and Functionize help with one specific class of flakiness — selector drift when developers refactor markup. They do not solve timing, shared state, environment drift, or network mock gaps. They are a layer on top of good test hygiene, not a replacement. See the Crosscheck round-up of AI testing tools in 2026 for where they fit.
Capture the full context of a flaky failure with Crosscheck
The hardest part of fixing a flaky test is reproducing the failure. A test that fails in CI and passes locally gives a developer almost nothing to work with unless the state at the moment of failure was captured in detail.
Crosscheck is a free Chrome extension that captures the full context of a UI failure at the moment it happens — a screenshot or session recording of exactly what the browser was doing, the complete console log, every network request with timing and response payloads, and full environment details. For flaky UI and integration tests, that means the developer sees the actual sequence of events that led to the assertion error rather than a stack trace and a guess. Console logs surface the JavaScript errors happening in the background when the test appeared to pass. Network request timings reveal the race conditions that only show up under CI load. Reports go straight to Jira, Linear, ClickUp, GitHub, or Slack with one click.