The Real Cost of Skipping QA: Case Studies from CrowdStrike, Barclays, and Shopify
Every engineering team has felt the pressure. The release date is fixed, the backlog is long, and QA is the last thing standing between code and production. The conversation usually goes something like: "We'll catch anything major in staging. Ship it and we'll monitor."
Sometimes that works. Often it does not. And when it doesn't work at scale — when the code runs on millions of endpoints, processes billions in transactions, or serves hundreds of thousands of merchants — the consequences are measured in nine-figure losses, regulatory investigations, and reputational damage that takes years to repair.
This article examines three real-world software failures that have been extensively documented in public post-mortems, regulatory filings, and press coverage. Each one traces back, at least in part, to gaps in quality assurance processes. The goal is not to assign blame — software is hard, and large-scale failures are almost always systemic. The goal is to understand what inadequate QA actually costs, so that teams can make better-informed decisions about where testing belongs in their workflow.
Case Study 1: CrowdStrike — The $10 Billion Outage
What Happened
On July 19, 2024, CrowdStrike deployed a content configuration update to its Falcon sensor — the endpoint detection software running on roughly 8.5 million Windows machines worldwide. The update contained a logic error in a Channel File that caused the Falcon sensor's content interpreter to perform an out-of-bounds memory read on affected systems. Windows does not handle kernel-level faults gracefully. The result was immediate: millions of machines entered a boot loop, displaying the Blue Screen of Death and rendering themselves unrecoverable without manual intervention.
Airlines grounded flights. Hospitals reverted to paper records. Broadcasters went dark. Banks, retailers, emergency services, and government agencies were all affected. Delta Air Lines alone reported losses exceeding $500 million from the disruption. Total global economic damage estimates from various insurance and financial analysts ranged from $5 billion to over $10 billion.
The QA Gap
CrowdStrike's own post-incident review, published in the weeks following the outage, identified several contributing factors. The content configuration update — a Channel File update rather than a full software release — was deployed using a mechanism that bypassed the staged rollout process normally applied to full sensor updates. Content updates were not subject to the same level of automated testing and canary deployment scrutiny as software releases.
The Channel File's contents had not been checked by a content validator capable of catching the erroneous field count. A template type that was new to the system introduced 21 input fields where the content interpreter expected 20. The mismatch was not caught in pre-production because test coverage for this specific configuration path was insufficient.
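A pre-deployment check of this kind is simple to sketch. The following is a minimal illustration, not CrowdStrike's actual pipeline: a validator that rejects any content record whose field count does not match what the interpreter expects. The field count and record shapes here are hypothetical.

```python
# Minimal sketch: reject content records whose field count does not
# match what the interpreter expects. Illustrative only; the schema
# and field count are hypothetical.

EXPECTED_FIELD_COUNT = 20  # what the content interpreter can handle

def validate_channel_file(records: list[list[str]]) -> list[str]:
    """Return a list of validation errors; empty means safe to deploy."""
    errors = []
    for i, record in enumerate(records):
        if len(record) != EXPECTED_FIELD_COUNT:
            errors.append(
                f"record {i}: expected {EXPECTED_FIELD_COUNT} fields, "
                f"got {len(record)}"
            )
    return errors

# A record produced by a new template type with one extra field
assert validate_channel_file([["f"] * 20]) == []
assert validate_channel_file([["f"] * 21]) != []
```

A check like this takes minutes to write; the gap was not technical difficulty but the assumption that this deployment path did not need it.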
The update was deployed globally and simultaneously, with no staged rollout, no geographic canary, and no automated rollback triggered before the damage had spread to millions of endpoints.
The Financial and Reputational Cost
CrowdStrike's stock fell approximately 32% in the weeks following the incident, erasing roughly $25 billion in market capitalization. The company faced class-action lawsuits from shareholders and contract disputes from customers seeking compensation. Beyond the direct financial impact, CrowdStrike's reputation as a security software vendor — an industry where trust is the core product — was severely damaged. The incident was cited in competitive sales cycles against CrowdStrike for months afterward.
The Lesson
Content configuration updates are code. Any artifact that changes the behavior of running software — whether it is a feature flag, a content file, a model weight, or a configuration parameter — must go through QA processes proportional to its blast radius. Treating "not a software release" as equivalent to "lower risk" is a category error that the CrowdStrike incident made visible at a global scale.
Staged rollouts, automated validation of configuration schemas, and canary deployments are not bureaucratic overhead. They are the difference between a bug that affects a test environment and a bug that grounds airlines.
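As an illustration of the staged-rollout principle, here is a minimal gate that deploys to progressively larger cohorts and halts automatically when a health check degrades. The cohort fractions, threshold, and callback signatures are hypothetical; real deployment systems are far more involved.

```python
# Sketch of a staged rollout gate: deploy to progressively larger
# cohorts and stop automatically if the observed error rate in any
# cohort exceeds a threshold. Cohort sizes and thresholds are
# illustrative, not from any real system.

def staged_rollout(deploy, health_check, cohorts=(0.01, 0.10, 0.50, 1.0),
                   max_error_rate=0.001):
    """Deploy to each cohort in turn; return the fleet fraction reached.

    `deploy(fraction)` pushes the update to that fraction of the fleet.
    `health_check(fraction)` returns the observed error rate.
    """
    reached = 0.0
    for fraction in cohorts:
        deploy(fraction)
        if health_check(fraction) > max_error_rate:
            deploy(reached)  # roll back to the last healthy cohort
            return reached
        reached = fraction
    return reached
```

With a gate like this, a defective update that crashes its first 1% cohort never reaches the remaining 99% of the fleet.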
Case Study 2: Barclays — The January 2025 Payment System Failure
What Happened
On January 31, 2025, Barclays experienced a major outage affecting its retail banking platform. The timing was particularly damaging: January 31 is the UK's self-assessment tax return deadline, one of the highest-traffic days of the year for UK banking systems. Hundreds of thousands of customers were unable to access their accounts, make payments, or file their tax returns on time. Some customers reported being unable to pay salaries, rent, and invoices.
The outage lasted the better part of two days for many customers. Some transactions that were initiated during the outage disappeared entirely, neither processing nor returning funds — they were simply lost in transit, requiring manual reconciliation that took weeks for some account holders.
The QA Gap
The specific technical root cause of the Barclays outage has not been fully disclosed publicly, as the incident is subject to ongoing regulatory review by the Financial Conduct Authority. However, public statements from Barclays and analysis from financial technology observers pointed to a failure in a core banking processing system that was triggered under the load conditions of year-end tax deadline traffic.
What is known from the pattern of the incident — and from Barclays' history of outages, which the FCA had previously flagged — is that the system had not been adequately tested under peak load conditions matching a foreseeable high-traffic event. The January 31 deadline is not a surprise. It happens every year, at the same time, with predictable volume patterns. A system that fails under predictable peak load has not been tested against realistic conditions.
Further, the manual reconciliation process that followed — weeks of effort to resolve "ghost transactions" — suggests that the error handling and recovery paths in the payment processing system were also undertested. Well-tested systems fail gracefully. They either succeed or they produce a clean failure with a clear audit trail. Systems that produce partial state — transactions that neither complete nor cleanly reverse — have almost certainly not been put through failure scenario testing.
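The commit-or-revert discipline described above can be sketched in a few lines. This is an illustrative pattern, not Barclays' architecture; the ledger and audit structures are hypothetical stand-ins.

```python
# Sketch of a transfer that either commits fully or fails cleanly,
# restoring prior state and writing an audit record. Illustrative only:
# the ledger and audit structures are hypothetical stand-ins.
import uuid

def transfer(ledger: dict, audit: list, src: str, dst: str, amount: int) -> None:
    txn_id = str(uuid.uuid4())
    snapshot = dict(ledger)  # capture state so a failure can restore it
    try:
        if ledger.get(src, 0) < amount:
            raise ValueError("insufficient funds")
        ledger[src] -= amount
        ledger[dst] = ledger.get(dst, 0) + amount
        audit.append({"txn": txn_id, "status": "committed"})
    except Exception as exc:
        ledger.clear()
        ledger.update(snapshot)  # no partial state survives a failure
        audit.append({"txn": txn_id, "status": "failed", "reason": str(exc)})
        raise
```

Either outcome leaves a clean audit trail; there is no third state in which money has left one account without arriving in another.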
The Financial and Reputational Cost
Barclays faced immediate public backlash, a formal investigation by the FCA, and compensation claims from customers who incurred late penalties on their tax returns as a result of being unable to access their accounts. The FCA investigation added regulatory scrutiny to an already damaged public image. The bank's net promoter score — already under pressure from previous outage incidents — took a measurable hit in the weeks following the January incident.
The cost of customer compensation, regulatory compliance activities, emergency engineering response, and the manual reconciliation operation ran into tens of millions of pounds by most estimates. The reputational cost, in a market where retail banking customers are increasingly willing to switch providers, is harder to quantify but arguably larger.
The Lesson
Load testing is not optional for systems that have known peak traffic patterns. Tax deadlines, Black Friday, end-of-month payroll cycles, and product launch days are foreseeable. Testing against predicted peak load — including failure mode behavior, graceful degradation, and clean error recovery — is the difference between an incident that is contained and one that produces weeks of manual remediation.
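A peak-load smoke test for a foreseeable traffic event can be as simple as replaying a scaled-down request volume concurrently and asserting that every request is accounted for. The handler and request counts below are illustrative placeholders, not a real banking workload.

```python
# Sketch of a peak-load smoke test: fire a scaled-down version of a
# known peak volume concurrently and assert every request is accounted
# for. The handler and volumes are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor

def handler(request_id: int) -> str:
    # Stand-in for the system under test.
    return "ok"

def run_peak_load(num_requests: int = 1000, workers: int = 50) -> float:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(handler, range(num_requests)))
    # Every request must either succeed or fail cleanly; none may simply
    # vanish, so the result count must equal the request count.
    assert len(results) == num_requests
    failures = [r for r in results if r != "ok"]
    return len(failures) / num_requests

assert run_peak_load() == 0.0
```

The accounting assertion is the important part: a load test that only measures latency will not catch transactions that silently disappear under pressure.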
Failure path testing is also testing. A function that handles errors poorly is a defect, even if the happy path works perfectly. QA processes that only validate success scenarios leave an entire class of production bugs uncaught.
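Concretely, failure-path tests assert on rejections as precisely as on successes. The `process_payment` function below is a hypothetical stand-in, used only to show the shape of such tests.

```python
# Sketch of failure-path testing: assertions about what happens when an
# operation is *rejected*, not just when it succeeds. `process_payment`
# is a hypothetical stand-in for a payment entry point.

def process_payment(balance: int, amount: int) -> dict:
    if amount <= 0:
        return {"ok": False, "error": "invalid amount", "balance": balance}
    if amount > balance:
        return {"ok": False, "error": "insufficient funds", "balance": balance}
    return {"ok": True, "balance": balance - amount}

# Happy path
assert process_payment(100, 30) == {"ok": True, "balance": 70}

# Failure paths: every rejection leaves the balance untouched and
# reports a specific, auditable reason.
assert process_payment(100, 500)["error"] == "insufficient funds"
assert process_payment(100, 500)["balance"] == 100
assert process_payment(100, -5)["error"] == "invalid amount"
```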
Case Study 3: Shopify — The Partner and Merchant Data Exposure
What Happened
In September 2020, Shopify disclosed that two rogue members of its support team had improperly accessed transaction records for fewer than 200 merchants. While this incident had a human element — insider access abuse rather than a pure software defect — the investigation that followed revealed that the access control architecture of Shopify's internal tooling had allowed support staff to query merchant transaction data far beyond what their role required. The principle of least privilege had not been adequately implemented or tested in the tooling the support teams used.
Separately, Shopify has faced multiple incidents over the years related to partner API access and merchant data scoping. In various disclosed incidents, third-party apps operating through Shopify's Partner API were able to access data outside their granted scopes due to authorization logic defects in the API layer. These incidents resulted in merchant data being exposed to apps the merchants had not explicitly granted that level of access.
The QA Gap
Authorization logic is among the most consequential and most frequently undertested areas of software systems. The happy path — user has the right permission, request succeeds — is straightforward to test. The failure paths — user lacks permission, request is denied; user has permission for resource A but not resource B; scoped token cannot escalate beyond its granted scope — require deliberate, systematic adversarial testing.
In the support tooling case, access control boundaries had apparently not been tested against the realistic range of queries that support staff could execute. The tooling worked as designed in the sense that support staff could resolve customer issues — the intended function. The unintended function — that the same tooling allowed access to data outside any reasonable scope of a support interaction — had not been surfaced by QA.
In the partner API cases, scope enforcement in API authorization middleware is exactly the kind of boundary condition that unit tests and integration tests need to cover explicitly. An API that correctly handles requests from a token with the right scope, but fails to correctly reject requests from a token with an adjacent but different scope, has a defect that only adversarial test cases will catch.
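Adversarial scope tests of this kind are short to write. The `authorize` function and scope names below are hypothetical; the point is the negative cases, which verify denial as explicitly as the positive case verifies access.

```python
# Sketch of adversarial scope-enforcement tests for an API authorization
# check. The `authorize` function and scope names are hypothetical.

def authorize(token_scopes: set[str], required_scope: str) -> bool:
    # Correct enforcement: exact scope membership, no prefix matching,
    # no implicit escalation from adjacent scopes.
    return required_scope in token_scopes

# Happy path: a token holding the required scope is allowed.
assert authorize({"read_orders"}, "read_orders")

# Adversarial cases: adjacent or near-miss scopes must be denied.
assert not authorize({"read_orders"}, "write_orders")
assert not authorize({"read_order"}, "read_orders")   # near-miss name
assert not authorize(set(), "read_orders")            # no scopes at all
```

The denial assertions are the ones that catch the defect class described above: an API that passes only the first assertion has no evidence that its boundaries hold.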
The Financial and Reputational Cost
The merchant data incidents resulted in regulatory notifications in multiple jurisdictions, compensation arrangements with affected merchants, and a sustained period of public scrutiny over Shopify's data handling practices. For a platform whose value proposition to merchants is trust — trust that their transaction data and customer information is safe — incidents that undermine that trust have an outsized commercial impact relative to their technical scope.
Merchants who lose confidence in a platform's data security practices do not usually announce it loudly. They migrate quietly. The churn that follows a data incident is often attributable to the incident only in retrospect, when the correlation between the disclosure and the merchant retention curve becomes visible.
The Lesson
Authorization logic requires adversarial testing. Testing that a feature works for authorized users is necessary but not sufficient. Teams need test cases that verify the negative: that unauthorized users are denied, that scoped tokens cannot escalate, that adjacent permissions do not bleed into each other. This is especially true for internal tooling, which frequently has weaker access control scrutiny than customer-facing APIs because it is perceived as lower risk.
The Common Thread
Three different companies, three different types of failures, but the same underlying dynamic: QA processes that were not proportionate to the risk.
In the CrowdStrike case, a content update mechanism was treated as lower-risk than a software release and received correspondingly less testing scrutiny. The assumption that content updates are safe was never validated.
In the Barclays case, a system that operated under predictable peak load conditions had not been tested at those conditions. The failure paths — partial transaction states, failed recovery — had not been put through adversarial testing. Known risk was not tested.
In the Shopify cases, authorization boundaries — one of the most consequential correctness properties of any multi-tenant system — had not been systematically tested from an adversarial perspective. The happy path worked. The failure paths were not covered.
The pattern is consistent: QA scope is narrowed, usually under time and resource pressure, and the scope that gets cut is almost always the adversarial, edge-case, failure-mode testing that is hardest to write and easiest to deprioritize. That testing is also the testing most likely to catch the bugs that end up in post-mortems.
What Better QA Actually Looks Like
None of these incidents required exotic testing techniques to prevent. They required consistent application of practices that the industry already knows work.
Risk-proportionate testing coverage. The scope of QA should match the blast radius of what is being released. A configuration update that runs on 8.5 million endpoints deserves the same staged rollout and validation scrutiny as a major software release, regardless of how the deployment mechanism is categorized internally.
Load and failure testing against realistic conditions. If your system has known peak events — tax deadlines, payroll cycles, product launches — test against those conditions in a pre-production environment. Test the failure paths, not just the success paths. Verify that failures produce clean, recoverable state.
Adversarial authorization testing. For every authorization boundary in your system, write tests that verify denial as rigorously as you verify access. Scope enforcement, permission boundaries, and least-privilege constraints are correctness requirements, not operational concerns.
Staged rollouts with automated rollback. Canary deployments, feature flags, and geographic staged rollouts exist precisely to limit the blast radius of defects that escape pre-production testing. They are not a substitute for QA, but they are a critical backstop.
Capturing bugs when they surface. Even with rigorous QA, bugs reach production. The difference between a bug that is fixed in hours and one that festers for weeks is the quality of the information available to the team investigating it. Console logs, network requests, reproduction steps, and session state at the moment of failure are not nice-to-haves — they are the data that determines how fast a bug gets fixed.
Where Crosscheck Fits
The post-mortems from CrowdStrike, Barclays, and Shopify all share a common subtext: by the time the failure was understood well enough to fix, significant damage had already been done. The investigation lag — the time between incident detection and root cause identification — is where most of the secondary damage accumulates.
Crosscheck is a browser extension built to close that lag. When a QA engineer, developer, or user encounters a bug, Crosscheck captures the complete context in one click: a session recording, full console logs, all network requests with headers and payloads, and a screenshot. Every bug report automatically includes the technical evidence that would otherwise require painstaking manual reconstruction.
For teams practicing exploratory testing, Crosscheck means that any bug found during a session is immediately documented with enough context for a developer to reproduce and fix it — no back-and-forth, no missing steps, no vague descriptions. For teams running formal QA cycles, Crosscheck captures edge cases and failure modes with the same fidelity as a structured test case.
The failures described in this article were expensive not only because they happened, but because they were hard to diagnose quickly. Better QA processes reduce the likelihood of failures reaching production. Crosscheck reduces the cost of the ones that do.
If your team ships software — and especially if you ship software that touches financial data, user authentication, or high-traffic production systems — try Crosscheck free and see what a complete bug report looks like.