Flaky Tests Are Killing Your Team's Productivity: Here's How to Fix Them
There is a specific kind of damage that flaky tests do to an engineering team, and it is more corrosive than most engineering leaders realize.
It starts with a failed CI run. The developer looks at the failure, sees a test that has nothing to do with their change, and re-runs the pipeline. It passes. They merge. Three weeks later, a real regression ships because that same test — the one everyone knows is unreliable — was failing again, so someone re-ran the pipeline until it went green without investigating the failure.
That sequence has a name: test suite learned helplessness. Once a team decides that a test failure probably does not mean anything, the entire test suite stops functioning as a safety net. The suite still runs. The builds still pass. But nobody trusts the results, and the protection the tests were supposed to provide has quietly evaporated.
Flaky tests are not a minor annoyance. They are a systemic failure mode. This guide covers what makes tests flaky, the most common causes in real codebases, how to detect and quantify the problem, and the specific techniques that actually eliminate flakiness rather than paper over it.
What Makes a Test Flaky
A flaky test is one that produces different results — pass or fail — when run against identical code. The test itself is nondeterministic: same inputs, same application state, but the outcome varies depending on factors outside the code under test.
This definition matters because it draws a clear line between flaky tests and legitimate test failures. A test that fails consistently against a real bug is doing exactly what it should. A test that fails intermittently for reasons unrelated to the code change being evaluated is flaky, and the two need to be handled differently.
Flakiness has a compounding effect on CI pipelines. Google published research on their test infrastructure showing that even a 1% flakiness rate across a large test suite can result in a meaningful percentage of CI runs containing at least one false failure. At scale, engineering teams at Google reported that flaky tests were responsible for a significant portion of CI reruns. The cost is not just the time to rerun — it is the cognitive overhead of evaluating every failure, the delay to deployment pipelines, and the steady erosion of trust in the suite.
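The compounding effect is easy to see with a back-of-envelope calculation. The sketch below uses illustrative numbers (a 0.1% per-test flake rate and a 2,000-test suite), not Google's published figures:

```javascript
// Back-of-envelope: probability that at least one flaky test fails
// in a single CI run, given a per-test flake rate and a suite size.
// The rates below are illustrative, not measured values.
function falseFailureProbability(perTestFlakeRate, testCount) {
  // P(at least one false failure) = 1 - P(every test behaves)
  return 1 - Math.pow(1 - perTestFlakeRate, testCount);
}

// A 0.1% per-test flake rate across a 2,000-test suite:
const p = falseFailureProbability(0.001, 2000);
console.log((p * 100).toFixed(1) + "%"); // ~86.5% of runs contain a false failure
```

Even a flake rate that looks negligible per test makes a clean full-suite run the exception rather than the rule at scale.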
The most dangerous flaky tests are not the ones that fail every third run. Those are annoying but visible. The most dangerous are the ones that fail once every fifty runs, or only under load, or only in a specific CI environment but never locally. These are the tests that get classified as infrastructure noise rather than genuine signals, and they are the ones most likely to mask real failures when they matter.
The Most Common Root Causes
Timing and Asynchrony
Timing issues are the single most common cause of test flakiness, particularly in UI and integration tests. They occur when a test makes an assertion before the system has reached the expected state.
The typical pattern looks like this: an action is triggered — a form is submitted, a button is clicked, an API call is made — and the test immediately asserts on the resulting state without waiting for the asynchronous operation to complete. In a fast environment or when the test runner happens to schedule execution favorably, the test passes. Under load, on a slower CI runner, or simply due to normal variation in execution timing, the assertion runs before the state has settled and the test fails.
Fixed waits — sleep(500) or wait(2000) — are the standard wrong fix for timing issues. They make tests slower and still fail occasionally, because any fixed duration can be exceeded under the right conditions. The correct approach is waiting for the specific condition being tested: waiting for an element to appear, waiting for a network request to complete, waiting for a DOM mutation to occur. Deterministic waits are always preferable to fixed-duration waits.
In JavaScript-heavy frontends, timing issues are compounded by React's batched state updates, animation frame timing, and virtual DOM reconciliation. Tests that assert on rendered output need to account for when the rendering has actually completed, not just when the triggering action has been dispatched.
Shared and Leaking State
Shared state between tests is the second most common root cause. Tests that modify shared resources — a database, a cache, a global singleton, environment variables, the filesystem, browser local storage — without cleaning up after themselves leave side effects that affect subsequent tests.
This causes flakiness that is tightly coupled to test execution order. When tests run in a consistent order, the state dependencies may happen to be satisfied and the suite passes. When the test runner parallelizes execution or randomizes order — which is best practice for exactly this reason — the hidden dependencies surface as intermittent failures.
The correct principle is test isolation: every test should create the state it needs, execute against that state, and clean up after itself. Tests that share database connections without transaction rollbacks, tests that write to shared temporary directories without teardown, and tests that set global configuration values without restoring them are all creating state leak vulnerabilities.
In practice, complete isolation is expensive. The cost is usually worth paying for the reliability it provides, but teams taking shortcuts often introduce isolation problems at the edges — particularly in teardown logic that only runs on success and gets skipped when a test fails partway through.
Network Dependencies
Tests that make real network calls to external services are flaky by design. External services have their own availability characteristics, rate limits, and latency variability. A test suite that passes when an external API responds in 200ms and fails when the same API responds in 800ms is not testing application behavior — it is testing network conditions.
The standard remedy is network mocking: intercepting outbound requests and returning controlled responses from within the test environment. This eliminates the dependency on external service availability, makes tests deterministic, and typically makes them faster. The tradeoff is that the mocked responses need to be maintained as the real API changes — stale mocks are themselves a source of misleading test results.
For integration tests where the real network interaction is the thing being tested, the preferred approach is a dedicated test environment with stable, controlled dependencies rather than production or shared staging services.
Network flakiness also manifests inside applications themselves. Race conditions in API call management — duplicate requests, request cancellations, out-of-order response handling — can cause intermittent failures in integration tests that expose actual application bugs. These failures are worth investigating rather than suppressing, because the underlying race condition likely affects real users as well.
Test Order and Parallelism
Test suites that were developed with implicit ordering assumptions become flaky when the execution environment changes. A common scenario: tests were written and run locally in a single process with a consistent execution order for years. CI is updated to run tests in parallel to reduce pipeline duration. Failures appear that were never seen before, because the parallelism exposes state dependencies that sequential execution happened to satisfy.
Parallelism also exposes resource contention issues: two tests writing to the same file, two tests binding to the same port, two tests inserting rows with the same unique identifier into a shared test database. These failures are intermittent because they depend on which tests happen to execute concurrently in a given run.
The diagnosis involves looking for patterns in which tests fail together. Tests that fail consistently in combination with each other, but not when run in isolation, are exhibiting order or parallelism sensitivity. Running the suite with forced randomization — most modern test runners support this — accelerates discovery of these dependencies.
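A seeded shuffle makes randomization replayable: log the seed with every run, and a failing order can be reproduced exactly. The sketch below uses a mulberry32 PRNG and hypothetical test names; most runners implement an equivalent internally.

```javascript
// Sketch: run tests in a seeded random order so order-dependent
// failures are both discoverable and reproducible.
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function shuffledOrder(testNames, seed) {
  const rand = mulberry32(seed);
  const order = [...testNames];
  // Fisher-Yates shuffle driven by the seeded PRNG
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  return order;
}

// Log the seed with every run; replay a failing order with the same seed.
const seed = 1337;
console.log(shuffledOrder(["auth", "billing", "search", "profile"], seed));
```

The seed in the run log is what turns "it only fails sometimes" into "it fails every time with seed 1337", which is the precondition for actually debugging the dependency.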
Environment and Infrastructure Variability
Tests that pass locally and fail in CI are frequently an environment problem rather than a code problem. Differences in available memory, CPU speed, disk I/O, container resource limits, timezone configuration, locale settings, available system fonts, and installed software versions can all cause tests to behave differently across environments.
This category of flakiness is particularly hard to diagnose because the failure cannot be reproduced in the most convenient environment for debugging — the developer's local machine. Containerizing the test environment so that local and CI runs use identical environments is the most effective mitigation. When the environments are genuinely identical, a failure that occurs in CI can be reproduced locally by running the same container image.
Date and time-dependent tests are a specific and common subcategory. Tests that compare against hardcoded dates, tests whose expectations depend on the current time, or tests that behave differently across daylight saving time transitions are a persistent source of flakiness that often goes unrecognized until the triggering conditions occur.
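A common remediation is to inject a clock rather than reading the system time directly, so a test can pin "now" to a fixed instant. A minimal sketch, with illustrative function names:

```javascript
// Sketch: inject a clock instead of calling Date.now() directly, so
// time-dependent logic can be tested deterministically.
function makeFixedClock(isoTimestamp) {
  const fixed = new Date(isoTimestamp);
  return { now: () => new Date(fixed) }; // returns a copy each call
}

// Code under test takes the clock as a dependency, defaulting to real time...
function isTrialExpired(trialEndIso, clock = { now: () => new Date() }) {
  return clock.now() > new Date(trialEndIso);
}

// ...so the test pins "now" instead of depending on the wall clock:
const clock = makeFixedClock("2024-06-01T12:00:00Z");
console.log(isTrialExpired("2024-05-31T00:00:00Z", clock)); // true
console.log(isTrialExpired("2024-07-01T00:00:00Z", clock)); // false
```

Libraries like Jest's fake timers or Sinon's clock offer the same effect without threading a parameter through, but the principle is identical: the test, not the environment, decides what time it is.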
How to Detect and Measure Flakiness
The first step is visibility. Most teams underestimate the scope of their flakiness problem because they have no systematic way to measure it. A test that fails once in twenty runs looks unremarkable in any individual run. Aggregated across hundreds of runs, the pattern becomes obvious.
Track pass/fail history per test. Every major CI platform — GitHub Actions, GitLab CI, CircleCI, Jenkins — either has native test result tracking or integrates with tools that provide it. Test history over time makes flaky tests visible. A test with a 90% pass rate is failing one run in ten. Over the course of a sprint, that adds up to significant pipeline noise.
Run tests in isolation. When a failure occurs, re-run only the failing test in isolation, with no other tests in the same process. If it passes in isolation but fails in the full suite, the failure is caused by state leakage from another test. If it fails in isolation as well, the failure is an intrinsic timing or environment issue.
Run tests with randomized order. Most test frameworks support randomized execution order. Enabling this as a standard practice surfaces order dependencies quickly. A test that was passing reliably under consistent order may fail immediately under randomization, revealing an implicit dependency that was previously hidden.
Run tests in parallel if you do not already. Parallelism stress-tests isolation. Tests that share state without proper isolation fail under parallel execution in ways they never would sequentially.
Use repeat runs to measure flakiness rate. Running a test fifty times in succession and measuring the pass rate gives a reliable estimate of its flakiness. This is expensive to do for every test but is valuable for any test that has shown any sign of flakiness. A test with a 98% pass rate in fifty runs is not reliable enough for a production test suite.
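The repeat-run harness can be as simple as a loop that counts passes. The sketch below uses a hypothetical runTest callback standing in for the real test invocation:

```javascript
// Sketch: estimate a test's flakiness by running it repeatedly against
// unchanged code and measuring the pass rate. `runTest` is a stand-in
// for invoking the real test.
async function measureFlakiness(runTest, iterations = 50) {
  let passes = 0;
  for (let i = 0; i < iterations; i++) {
    try {
      await runTest();
      passes++;
    } catch {
      // Any failure against unchanged code counts toward the flake rate.
    }
  }
  return passes / iterations;
}

// Example: a simulated test that fails roughly 10% of the time.
measureFlakiness(async () => {
  if (Math.random() < 0.1) throw new Error("intermittent failure");
}).then((rate) => console.log(`pass rate: ${(rate * 100).toFixed(0)}%`));
```

Running the iterations in parallel, or on a loaded CI runner, often surfaces flakes that fifty sequential local runs will not.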
Fixing Approaches That Actually Work
Replace fixed waits with condition-based waits. Audit your test suite for any call to sleep, wait with a fixed duration, or equivalent. Replace each one with a wait that blocks until the specific condition being tested is satisfied: element present, element visible, network request completed, DOM attribute changed. Playwright's waitForSelector, Cypress's cy.get with timeout configuration, and testing-library's waitFor utility all support this pattern.
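The pattern behind all of these helpers is the same: poll a predicate until it holds or a timeout expires. A minimal sketch with illustrative defaults (this is the general idea, not Playwright's or Cypress's actual implementation):

```javascript
// Sketch: a condition-based wait. It resolves as soon as the predicate
// holds, instead of sleeping for a fixed duration, and fails loudly
// with a timeout rather than asserting against unsettled state.
async function waitForCondition(predicate, { timeout = 5000, interval = 50 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return; // condition met: stop waiting immediately
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
  throw new Error(`Condition not met within ${timeout}ms`);
}

// Usage: wait for the thing the assertion needs, not a fixed duration.
// await waitForCondition(() => document.querySelector("#result") !== null);
```

The fast case finishes in one poll interval instead of the full fixed sleep, and the slow case gets the entire timeout budget instead of failing at an arbitrary cutoff.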
Enforce test isolation at the framework level. Use database transactions that roll back after each test rather than manually deleting test data. Use dependency injection to provide fresh instances of stateful services per test rather than sharing singletons. Use beforeEach and afterEach hooks to reset global state rather than relying on tests to clean up after themselves. Teardown that runs unconditionally, even on test failure, is essential — teardown logic that only runs on success is a state leak waiting to happen.
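The unconditional-teardown requirement maps directly onto a try/finally. A minimal sketch with a hypothetical setup/teardown pair:

```javascript
// Sketch: teardown that runs unconditionally, even when the test body
// throws partway through. The setup/teardown resources are illustrative.
async function withTempState(setup, teardown, testBody) {
  const state = await setup();
  try {
    return await testBody(state);
  } finally {
    await teardown(state); // runs on success AND on failure
  }
}

// Usage: each test creates and destroys its own state.
// await withTempState(
//   () => db.createSchema("test_" + Date.now()),   // hypothetical db helper
//   (schema) => db.dropSchema(schema),
//   async (schema) => { /* assertions against a fresh schema */ }
// );
```

Framework beforeEach/afterEach hooks give you the same guarantee, provided the cleanup lives in afterEach rather than at the end of the test body, where a failed assertion would skip it.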
Mock external network calls by default. Establish a policy that unit and integration tests do not make real outbound network calls. Use tools like msw (Mock Service Worker) for browser tests, nock or axios-mock-adapter for Node.js HTTP, or built-in mocking capabilities in your test framework. Reserve real network calls for end-to-end tests running against a controlled environment.
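The core mechanism these tools share is request interception. The sketch below hand-rolls a minimal version by swapping globalThis.fetch (Node 18+) for a stub; the endpoint and payload are made up:

```javascript
// Sketch of the interception idea behind tools like msw or nock:
// replace fetch with a stub that returns controlled responses, and
// restore the real one afterward.
function mockFetch(routes) {
  const realFetch = globalThis.fetch;
  globalThis.fetch = async (url) => {
    const body = routes[String(url)];
    // Failing loudly on unmocked requests catches accidental real calls.
    if (body === undefined) throw new Error(`Unmocked request: ${url}`);
    return { ok: true, status: 200, json: async () => body };
  };
  return () => { globalThis.fetch = realFetch; }; // restore handle
}

// Usage in a test: deterministic response, no real network.
// const restore = mockFetch({ "https://api.example.com/user": { id: 1 } });
// const user = await (await fetch("https://api.example.com/user")).json();
// restore();
```

Production-grade tools add request matching, response delays, and handler reuse, but the "fail on anything unmocked" default is worth copying: it is what enforces the no-real-network policy rather than merely suggesting it.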
Containerize the test environment. If local and CI environments diverge, containerizing test execution — running the same Docker image locally that CI uses — eliminates the most common class of environment-specific failures. The upfront cost is setup time; the ongoing benefit is that failures are reproducible.
Quarantine rather than delete. When a flaky test is identified but not yet fixed, quarantine it: move it to a separate test suite that runs in CI but does not block the build. Track the quarantine list. Set a policy that tests cannot stay in quarantine indefinitely — every quarantined test needs a fix or a deletion decision within a defined time window. Quarantine prevents flaky tests from blocking delivery while ensuring they are not simply forgotten.
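In runner terms, quarantine is a filter between test results and the build verdict. A minimal sketch with hypothetical test names:

```javascript
// Sketch: a quarantine list that keeps known-flaky tests running and
// reported without letting them block the build. Test names are made up.
const QUARANTINE = new Set(["checkout retries on 503", "avatar upload preview"]);

function summarizeRun(results) {
  // results: [{ name, passed }]
  const blockingFailures = [];
  const quarantinedFailures = [];
  for (const { name, passed } of results) {
    if (passed) continue;
    (QUARANTINE.has(name) ? quarantinedFailures : blockingFailures).push(name);
  }
  return {
    buildPassed: blockingFailures.length === 0, // quarantined failures don't block
    blockingFailures,
    quarantinedFailures, // still tracked so they are not forgotten
  };
}

const summary = summarizeRun([
  { name: "login happy path", passed: true },
  { name: "checkout retries on 503", passed: false },
]);
console.log(summary.buildPassed); // true: only a quarantined test failed
```

The quarantined-failures list is the artifact to review each sprint; the build verdict ignoring it is deliberate and temporary, per the time-window policy above.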
Fix the root cause, not the symptom. Retry logic — automatically rerunning a failed test and marking it as passed if any retry succeeds — is the standard wrong fix for flakiness. It reduces visible build failures without addressing the underlying problem. The flaky test continues to run, continues to consume resources, and continues to occasionally mask real failures. Retry logic can be a temporary measure during a flakiness remediation effort, but it should not be a permanent fixture.
Prioritize flaky tests by blast radius. Not all flaky tests are equally worth fixing immediately. Tests that cover core user flows, tests that are frequently in the critical path, and tests that fail at high rates should be prioritized over tests that cover edge cases and fail rarely. A flakiness backlog is easier to sustain when it is ordered by actual impact.
The Trust Problem Is the Real Problem
All of the techniques above address the technical root causes of flakiness. But the deeper problem is organizational: once a team stops trusting their test suite, the behaviors that erode trust are self-reinforcing.
Developers who expect failures to be noise do not investigate failures. They rerun the pipeline. Non-investigation means flaky tests are never fixed. More flaky tests accumulate. Trust erodes further. The suite that took years to build becomes a ritual rather than a tool — it runs because the process requires it to run, not because anyone believes the results.
Rebuilding trust requires both technical remediation and visible commitment. Teams that track flakiness rates, maintain a quarantine list, and make progress on that list each sprint are sending a signal that the test suite is a maintained asset, not a legacy artifact. The metric that matters most is not test coverage or suite execution time — it is the rate of failed CI runs that turn out to be false alarms once re-run. That number should be trending toward zero.
Crosscheck Helps You Debug Flaky Tests Faster
One of the hardest parts of fixing a flaky test is reproducing the failure. A test that fails in CI but passes locally gives the developer almost nothing to work with unless the environment at the moment of failure was captured in detail.
Crosscheck is a browser extension that captures the full context of a failure at the moment it occurs: a screenshot or session replay of exactly what the browser was doing, the complete console log, every network request with timing and response data, and full environment details including browser version and viewport dimensions. For flaky UI and integration tests, that means instead of a stack trace and a log message, your developer gets a replay of the failure — the precise sequence of events that led to the assertion error, the state of the network at the time, and any console errors or warnings that were active.
The replay is the difference between a developer reproducing a flaky timing issue in thirty minutes and spending two days trying to recreate the conditions. Console logs surface the JavaScript errors that were happening in the background when the test appeared to pass. Network request timings reveal the race conditions that only occur under CI load.
Flaky tests are fundamentally a debugging problem before they are a fixing problem. The faster you can gather reliable evidence about what is happening when a test fails, the faster you can eliminate the root cause and restore trust in your suite.
Try Crosscheck free — capture your next flaky test failure with the full context your team needs to fix it for good.