The Cost of Skipping QA: 6 Famous Software Bugs and What They Teach

Written By  Crosscheck Team

Content Team

May 29, 2025 13 minutes

The Cost of Skipping QA, Told Through 6 Famous Software Bugs

The cost of skipping QA is not theoretical — it is documented in court filings, NASA inquiry boards, SEC enforcement actions, and FDA reports. Six software failures across four decades each share the same shape: a release that bypassed a layer of testing, a defect that survived to production, and a bill measured in lives, dollars, or both. This article walks through CrowdStrike, Knight Capital, Boeing 737 MAX, Ariane 5, Mars Climate Orbiter, and Therac-25 — what each failure cost, what QA missed, and the practice that would have caught it.

Key takeaways

  • CrowdStrike's July 2024 update crashed roughly 8.5 million Windows machines worldwide; Delta Air Lines alone is suing for over $500 million in damages.
  • Knight Capital lost approximately $460 million in 45 minutes on August 1, 2012 when a deployment script left dormant code active on one of eight production servers.
  • Boeing's 737 MAX MCAS failures killed 346 people and cost the company $2.5 billion in a January 2021 DOJ settlement, with total grounding costs estimated near $20 billion.
  • Ariane 5's maiden flight exploded about 40 seconds after launch in 1996 because converting a 64-bit floating-point value to a 16-bit signed integer overflowed, a $370 million loss.
  • Mars Climate Orbiter burned up in 1999 because one team used pound-force and another used newtons; the mission cost $327.6 million.
  • Therac-25 delivered massive radiation overdoses to six patients between 1985 and 1987, killing at least three — a race condition that unit testing alone could not surface.

The common pattern is not exotic. In every case, the failure mode lived in the gaps between components or between processes — the seams where testing tends to be thin.


Case Study 1: CrowdStrike, July 2024 — The $500M+ Endpoint Outage

What happened

On July 19, 2024, CrowdStrike pushed a content configuration update — internally called a Channel File — to its Falcon endpoint sensor. The file, intended to refine threat detection logic, contained a malformed template that caused the kernel-mode sensor to dereference an invalid pointer on Windows machines. Roughly 8.5 million devices entered a boot loop, displaying the Blue Screen of Death and requiring manual recovery from a non-network boot environment.

Airlines cancelled flights, hospitals reverted to paper records, broadcasters went dark, and emergency services in multiple jurisdictions reported degraded capability. Delta Air Lines was the worst-hit major carrier — it cancelled approximately 7,000 flights affecting 1.3 million passengers, and in its October 2024 Georgia state court filing it is seeking over $500 million in out-of-pocket losses plus unspecified consequential damages.

What QA missed

CrowdStrike's own post-incident review identified two specific gaps. First, the Content Validator — the tool that screens configuration updates before release — contained a logic bug that allowed a malformed template instance through. Second, content updates were not subject to the same staged-rollout discipline as full sensor releases. There was no canary, no geographic phasing, and no automated rollback once endpoints began to fault. The defect was deployed to every Falcon-protected Windows machine on the planet within a tight time window.

The categorisation error is the interesting one. CrowdStrike's engineering practice treated software releases as high-risk (with staged rollouts) and content updates as low-risk (with direct global push). The blast radius was identical. The risk classification was not.

The lesson

Any artifact that changes the behaviour of running software is code, regardless of file extension or release channel. Feature flags, model weights, content files, and configuration parameters all need the same risk-proportionate QA as compiled releases. Staged rollouts and automated rollback are not bureaucratic — they are the difference between a contained incident and a global one.
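A staged rollout with automated rollback can be sketched in a few lines. This is a minimal illustration, not CrowdStrike's pipeline: the `deploy`, `is_healthy`, and `rollback` callables and the stage fractions are hypothetical placeholders for whatever a real release system provides.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # hypothetical stages
) -> bool:
    """Deploy in widening stages; roll everything back on the first unhealthy host."""
    deployed: list[str] = []
    for fraction in stage_fractions:
        # Number of hosts that should be running the update after this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(deployed):target]:
            deploy(host)
            deployed.append(host)
        # Gate the next stage on the health of everything deployed so far.
        if not all(is_healthy(host) for host in deployed):
            for host in reversed(deployed):
                rollback(host)
            return False
    return True
```

With 200 hosts and these fractions, a faulty update reaches two machines before the health gate stops it and rolls it back. The same gate applies whether the artifact is a binary, a content file, or a flag change.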


Case Study 2: Knight Capital, August 2012 — $460M Lost in 45 Minutes

What happened

On the morning of August 1, 2012, Knight Capital Group was the largest equities market maker in the United States, with roughly 17% of NYSE volume. By 10:15 a.m. — about 45 minutes after the opening bell — it had lost approximately $460 million and was insolvent. The cause was a software deployment that left a piece of dormant code active on one of eight production servers.

The defunct code was called Power Peg, a routing function deprecated in 2003 and meant to test a parent-order accumulator. A new SMARS (Smart Market Access Routing System) release reused the flag that had previously activated Power Peg. On seven of eight servers, the new release shipped cleanly. On the eighth, the deployment engineer missed the file copy, so the flag now invoked the old Power Peg code, which had no working completion-tracking logic. For 212 incoming parent orders, that one server sent millions of child orders — over 397 million shares traded across 154 stocks in under an hour, per the SEC's October 2013 enforcement order.

What QA missed

Three layers of testing should have caught this and did not:

  • Deployment verification. No automated check confirmed that the new binary was present on all eight servers. A simple checksum sweep would have flagged the eighth host.
  • Production canary. Knight could have routed a small share of traffic through the new deployment before opening it to the full order book. It did not.
  • Risk controls. The SEC found that Knight had no pre-trade risk controls capable of halting the runaway. A position-limit circuit breaker — standard practice elsewhere — would have stopped the bleeding inside minutes rather than 45.
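The checksum sweep mentioned in the first bullet might look like the sketch below. It assumes the deployed binary's bytes have already been fetched from each host; how they are fetched (SSH, an agent, an artifact API) is left out, and the host names and build contents are invented for illustration.

```python
import hashlib
from typing import Mapping


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a binary's contents."""
    return hashlib.sha256(data).hexdigest()


def find_stale_hosts(expected_digest: str, binaries_by_host: Mapping[str, bytes]) -> list[str]:
    """Return the hosts whose deployed binary does not match the release digest."""
    return sorted(
        host
        for host, binary in binaries_by_host.items()
        if sha256_hex(binary) != expected_digest
    )


# Hypothetical fleet: seven servers got the new build, one was missed.
release = b"new-build"
fleet = {f"server-{i}": release for i in range(1, 8)}
fleet["server-8"] = b"old-build"

stale = find_stale_hosts(sha256_hex(release), fleet)
if stale:
    print("Refusing to enable the release; stale hosts:", stale)
```

The useful property is that the release is blocked until the list is empty, so a missed copy step fails the deployment instead of shipping silently.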

The bug was not in the new code. It was in the combination of new code and stale code running side by side because a single server was missed. Integration testing in a staging environment that mirrors production scale is designed to catch exactly this failure class.

The lesson

Deployment is part of the software, and deployment scripts deserve the same test rigour as the binaries they ship. Removing dead code is testing work too: unused code that can still wake up on a flag is a latent defect waiting for a flag collision.


Case Study 3: Boeing 737 MAX MCAS — 346 Deaths, $20B in Costs

What happened

Lion Air Flight 610 crashed into the Java Sea on October 29, 2018, killing 189 people. Ethiopian Airlines Flight 302 crashed near Bishoftu on March 10, 2019, killing 157. Both involved the Boeing 737 MAX 8 and the same underlying defect: a flight-control system called MCAS (Maneuvering Characteristics Augmentation System) that received input from a single angle-of-attack sensor, and when that sensor returned bad data, pushed the nose of the aircraft down repeatedly until the pilots ran out of altitude.

The financial reckoning followed. In January 2021, Boeing agreed to a $2.5 billion DOJ settlement, comprising a $243.6 million criminal fine, $1.77 billion in airline compensation, and a $500 million crash-victim beneficiaries fund. The 20-month grounding of the entire MAX fleet, the longest ever for a US airliner, generated total estimated losses around $20 billion before family civil settlements — and as of late 2024, Boeing's counsel acknowledged billions more had been paid to families through civil litigation. A Chicago jury awarded a single Ethiopian Airlines family $49.5 million in May 2026.

What QA missed

The MCAS failure is not a single missed test. It is a cascade in which several testing and review practices were short-circuited:

  • Single point of failure. MCAS relied on input from one of two angle-of-attack sensors. Any aerospace fault-tree analysis would identify a single sensor driving a control authority that can override the pilot as a critical risk. The Joint Authorities Technical Review later found the system safety analysis had not adequately considered the consequences of erroneous sensor input.
  • Pilot workload assumptions. The certification analysis assumed pilots would recognise and respond to a runaway stabiliser within four seconds. Simulator testing across a representative pilot population was insufficient to validate that assumption.
  • Documentation and training. MCAS was omitted from flight crew manuals and differences training. From a testing standpoint, the user-facing behaviour was not part of the validation scope.

The deeper QA gap was process-level: the FAA delegated certification activities to Boeing employees under the Organization Designation Authorization programme, and the structural pressure to keep the MAX cost-competitive with the Airbus A320neo discouraged surfacing tests that would have triggered expensive design changes.

The lesson

When the failure mode of a software system is loss of life, the QA boundary cannot stop at the unit level. Fault-tree analysis, redundancy review, and adversarial scenario simulation are part of the test plan. Outsourcing those analyses to the team incentivised to ship is not a process — it is an accident waiting to be scheduled.
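The redundancy point can be made concrete with a cross-check between two sensors. The sketch below is purely illustrative and is not Boeing's design: the function names, the disagreement threshold, and the activation angle are all hypothetical.

```python
from typing import Optional

MAX_DISAGREEMENT_DEG = 5.5  # hypothetical threshold chosen for this sketch


def validated_aoa(left_deg: float, right_deg: float) -> Optional[float]:
    """Return the averaged angle of attack, or None if the two sensors disagree."""
    if abs(left_deg - right_deg) > MAX_DISAGREEMENT_DEG:
        return None
    return (left_deg + right_deg) / 2.0


def automatic_trim_allowed(left_deg: float, right_deg: float, activation_deg: float) -> bool:
    """Allow automatic nose-down trim only when both sensors agree on a high reading."""
    aoa = validated_aoa(left_deg, right_deg)
    return aoa is not None and aoa > activation_deg
```

A fault-tree review asks what happens on each branch: here, a single bad sensor produces a disagreement and the automatic function stands down, rather than acting on one unverified input.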


Case Study 4: Ariane 5 Flight 501 — $370M Lost to a 16-Bit Integer

What happened

On June 4, 1996, the maiden flight of the European Space Agency's Ariane 5 rocket lasted 39 seconds. At T+36.7 seconds, the inertial reference system failed and sent diagnostic output that the on-board computer interpreted as flight data; the computer then commanded a full nozzle deflection that placed the rocket at a 20-degree angle of attack. Aerodynamic forces tore it apart, and the self-destruct system finished the job. The ESA inquiry board put the direct loss at over $370 million, including four Cluster science satellites that had taken a decade to build.

What QA missed

The fault was the conversion of a 64-bit floating-point value to a 16-bit signed integer in the Inertial Reference System (SRI, from its French name). Ariane 5 reused the SRI code from Ariane 4, where horizontal velocity values were small enough to fit comfortably in 16 bits. Ariane 5 had a more aggressive flight profile and reached horizontal velocities several times higher. The variable BH (Horizontal Bias) exceeded 32,767, the maximum positive value for a 16-bit signed integer, and the conversion raised an unhandled exception.

Three specific QA failures stand out:

  • No reuse validation. The SRI code had been validated against the Ariane 4 trajectory. It was never re-validated against the Ariane 5 flight envelope, because management treated the SRI module as "already qualified."
  • Selective overflow protection. Engineers had explicitly chosen to protect only four of seven variables against overflow, to keep the on-board computer below an 80% workload target. The unprotected variables included BH.
  • Redundancy was not actually redundant. Both the primary and backup SRIs ran identical code on identical data. When the primary faulted, the backup faulted in the same way within milliseconds. The system was duplicated, not diversified.

The lesson

Reused code is not tested code. Any assumption that held in the previous environment — operating range, input distribution, downstream consumer — has to be re-verified against the new one. And redundancy that runs the same logic against the same inputs is theatre.
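Re-validating a reused conversion against a new operating range is a small test to write. The sketch below is in Python rather than the flight software's language, and the sample values are invented; it shows a range-checked conversion and a helper that replays a simulated profile through it.

```python
import math
from typing import Iterable

INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    if not math.isfinite(value):
        raise OverflowError(f"{value!r} is not a finite number")
    truncated = int(value)  # truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return truncated


def out_of_range_samples(samples: Iterable[float]) -> list[float]:
    """Return the samples from a simulated profile that would overflow the conversion."""
    rejected = []
    for sample in samples:
        try:
            to_int16_checked(sample)
        except OverflowError:
            rejected.append(sample)
    return rejected
```

Running `out_of_range_samples` over the new vehicle's simulated trajectory before flight turns "already qualified" into a claim that is either confirmed or refuted by data.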


Case Study 5: Mars Climate Orbiter — $327M Lost to a Unit Mismatch

What happened

On September 23, 1999, after a 286-day cruise, the Mars Climate Orbiter entered the Martian atmosphere at an altitude of around 57 kilometres rather than the planned 226 kilometres. The probe disintegrated. The mission cost was $327.6 million, including the $125 million orbiter itself plus operations and the linked Mars Polar Lander mission.

What QA missed

The cause is famous: Lockheed Martin's ground software produced thruster impulse values in pound-force-seconds, while the NASA JPL navigation software ingested them as newton-seconds. The Software Interface Specification document required SI units. The actual interface delivered imperial units. No one caught it across nine months of trajectory correction manoeuvres because the discrepancy was within the noise of other navigation uncertainties — until it wasn't.

What is less famous, but more instructive, is what happened in the final 24 hours. The JPL navigation team noticed the trajectory was wrong. They ran the numbers and saw that the orbiter would enter the atmosphere well below its 80 km survivable threshold. A correction burn was proposed and discussed, then declined. The investigation report later noted the team was "unprepared for this off-nominal scenario."

The QA failures were:

  • No end-to-end unit verification. No integration test in the loop validated that the units flowing across the contractor boundary were what the interface specification said they should be.
  • No rehearsed recovery procedure. Even when the symptom was visible 24 hours out, there was no procedure to action it. Off-nominal scenarios had not been wargamed.

The lesson

Interface contracts between teams or vendors are the highest-leverage place to put automated tests. A single property-based check that runs the contractor's output through a units-aware schema would have caught this on day one of cruise. And operational readiness is testing — the team's ability to execute a recovery plan under time pressure is itself a quality attribute that has to be exercised before it matters.
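One way to picture a units-aware check at the boundary is a parser that refuses any impulse value that does not declare its unit. This is a minimal sketch: the record layout, the unit strings, and the `Impulse` type are hypothetical and do not describe the mission's actual file format.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # one pound-force second expressed in newton-seconds


@dataclass(frozen=True)
class Impulse:
    """Thruster impulse, always stored in SI units."""
    newton_seconds: float


def parse_impulse(record: dict) -> Impulse:
    """Build an Impulse from a record, rejecting missing or unknown units."""
    unit = record.get("unit")
    if unit == "N*s":
        return Impulse(float(record["value"]))
    if unit == "lbf*s":
        # Convert explicitly instead of letting imperial values pass as SI.
        return Impulse(float(record["value"]) * LBF_S_TO_N_S)
    raise ValueError(f"unsupported or missing impulse unit: {unit!r}")
```

A stricter contract could reject `lbf*s` outright, since the interface specification required SI. Either way, a bare number with no unit never crosses the boundary.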


Case Study 6: Therac-25 — Race Conditions That Killed

What happened

Between June 1985 and January 1987, the Therac-25 medical linear accelerator delivered massive radiation overdoses to six known patients across hospitals in the United States and Canada. Three of those patients died as a direct consequence; others suffered severe injury. Nancy Leveson and Clark Turner's 1993 case study remains the canonical reference and is required reading in most safety-critical software curricula.

What QA missed

The Therac-25 was built by Atomic Energy of Canada Limited as a successor to the Therac-6 and Therac-20. The earlier machines had hardware interlocks that physically prevented certain dangerous configurations — including the one that proved fatal in the Therac-25, where the electron beam could fire at full therapeutic intensity without the X-ray target in place. AECL removed the hardware interlocks for the Therac-25 and relied on software interlocks instead.

The fatal defect was a race condition. If an operator entered the treatment parameters on the console and then edited them quickly (within roughly eight seconds), a flag controlling beam mode was not updated before the beam was energised. The patient would receive the high beam current intended for X-ray mode without the X-ray target in the way, delivering radiation doses estimated at 100 times the prescribed level.

The QA gaps were severe and stacked:

  • No formal hazard analysis of the consequences of removing the hardware interlocks. AECL's safety analysis assumed the software was reliable; it did not work through what would happen if the software was wrong.
  • No independent code review of the safety-critical control software, which had been written by a single programmer and never seriously inspected by anyone else.
  • Race conditions are invisible to single-threaded unit tests. The defect only surfaced under specific operator-input timing patterns that none of the test cases exercised.
  • Error messages were unhelpful. When the software detected an inconsistency it displayed "MALFUNCTION 54" — a code so cryptic that operators learned to dismiss it and press the proceed key, which on the Therac-25 fired the beam again.

The lesson

Concurrency bugs survive unit tests because unit tests run one thing at a time. Catching them requires explicit adversarial testing: rapid-fire operator-simulation harnesses, model checking, or formal verification. Equally important: removing physical safety mechanisms in favour of software ones moves the entire safety case onto the test suite. That trade is sometimes justified, but only when the test suite is rigorous enough to bear the weight.
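An adversarial test for this class of bug forces the bad interleaving instead of waiting for it. The sketch below is a toy model, not the Therac-25 software: the `TreatmentConsole` class, its version counter, and the mode names are hypothetical. It shows a guard that refuses to fire when the hardware was set up from a stale snapshot, and a harness that exercises both orderings deterministically.

```python
from dataclasses import dataclass


@dataclass
class TreatmentConsole:
    mode: str = "xray"
    edit_version: int = 0          # bumped on every operator edit
    configured_mode: str = ""      # mode the hardware was last set up for
    configured_version: int = -1   # edit version the hardware was last set up for

    def edit_mode(self, mode: str) -> None:
        self.mode = mode
        self.edit_version += 1

    def configure_hardware(self, observed_mode: str, observed_version: int) -> None:
        """Complete a slow hardware setup that began from an earlier snapshot."""
        self.configured_mode = observed_mode
        self.configured_version = observed_version

    def can_fire(self) -> bool:
        """Fire only if the hardware setup matches the latest operator input."""
        return (
            self.configured_version == self.edit_version
            and self.configured_mode == self.mode
        )


def run_interleaving(edit_during_setup: bool) -> bool:
    """Replay one ordering of operator edits and hardware setup; return can_fire()."""
    console = TreatmentConsole()
    console.edit_mode("xray")
    snapshot = (console.mode, console.edit_version)  # setup task reads the parameters
    if edit_during_setup:
        console.edit_mode("electron")  # operator edits while setup is still running
    console.configure_hardware(*snapshot)  # setup finishes using its snapshot
    return console.can_fire()
```

The harness makes the timing explicit, so the test fails or passes on the ordering itself and not on how fast a machine happens to run.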


What the six failures have in common

Failure | Year | Direct cost | What QA missed
CrowdStrike | 2024 | $500M+ (Delta alone) | Content-update path treated as lower-risk than code
Knight Capital | 2012 | ~$460M in 45 min | Deployment verification across server fleet
Boeing 737 MAX | 2018–19 | ~$20B + 346 lives | Fault-tree on single-sensor authority
Ariane 5 | 1996 | $370M | Reused code re-validated against new envelope
Mars Climate Orbiter | 1999 | $327.6M | Cross-vendor units contract testing
Therac-25 | 1985–87 | At least 3 deaths | Concurrency / race-condition testing

Six different domains, six different test gaps, one consistent pattern: the defect lived in the seam between components, teams, code and configuration, or software and the physical world it controlled. Adversarial testing — failure-injection drills, chaos engineering, fault-tree review, contract tests across vendor boundaries — is what catches the bugs that end up in post-mortems. It is also the QA discipline that gets cut first under time pressure, because the failure paths take longer to write than the happy paths and the value of preventing a defect is invisible until the defect arrives. The cost of skipping adversarial QA is not zero — it is deferred, it compounds, and the bill arrives in a press release written by someone's general counsel.

For a fuller view of how QA practice is organised around these failure classes, the Crosscheck team's piece on 10 SQA methodologies with real-world case studies walks through the static-analysis, formal-methods, and V-model traditions that grew up specifically because of failures like Therac-25 and Ariane 5.


How modern teams catch what these teams missed

Three practices have moved from optional to expected in the last few years, partly in response to incidents like CrowdStrike's:

  • Risk-proportionate release pipelines put any artifact that changes runtime behaviour (feature flags, configuration, content files) through the same staged-rollout and automated-rollback path as compiled code.
  • Contract testing at boundaries, using tools like Pact, enforces the schema, units, and semantics of cross-service or cross-vendor interfaces; a single property-based test would have caught the Mars Climate Orbiter units mismatch on day one.
  • Adversarial and chaos testing in CI runs failure-injection as part of every build, and AI-assisted test generation is starting to be genuinely useful for the negative-path cases humans skip.

For where the tooling stands today, the team's best AI testing tools 2026 round-up is the current reference.

The pattern is consistent: move the adversarial work earlier, automate it, and treat failure-path tests as first-class deliverables. For teams thinking about how the bug-reporting side of the loop fits in once a defect does escape, the perfect bug report template covers the evidence a developer actually needs to reproduce and close out a production issue quickly.


FAQ

What is the most expensive software bug in history?

By direct cost, the Boeing 737 MAX MCAS failures are the most expensive single software-related incident on record — roughly $20 billion in grounding costs plus the $2.5 billion DOJ settlement and billions more in family civil litigation. The CrowdStrike July 2024 outage produced the largest non-aerospace impact, with damages estimated in the tens of billions across affected industries. Knight Capital's $460 million in 45 minutes remains the record for fastest large-scale loss from a single deployment.

Was the Mars Climate Orbiter failure really just a units mistake?

The proximate cause was a pound-force versus newton mismatch at the contractor-NASA interface. The deeper cause — documented in the JPL investigation report — was that the navigation team noticed the trajectory error 24 hours before orbital insertion and chose not to execute a correction burn because no procedure existed for the off-nominal scenario. The unit error created the problem; the lack of rehearsed recovery sealed it.

How does the Therac-25 case apply to modern software?

Race conditions, undocumented assumptions, and removal of hardware-level safeguards in favour of untested software ones are not 1980s problems. Any modern system using optimistic UI updates, eventual consistency, or asynchronous event handling has the same failure surface that killed Therac-25 patients. The remediation — adversarial concurrency testing, model checking, defence in depth — is also unchanged.

What is the single most cost-effective QA practice from these case studies?

Contract testing at boundaries. Five of the six incidents — every one except Therac-25 — involved a defect at an interface between teams, vendors, or release channels. A discipline that automatically verifies the shape, units, and semantics of every cross-boundary interaction would have prevented or contained each of them. Modern contract-testing tooling makes this cheap; the cultural shift to actually do it is the harder part.


Catch bugs before they cost millions

The case studies above all share an after-the-fact characteristic: by the time the team understood the failure well enough to fix it, the damage was already done. Most of the cost lived in the investigation lag — the hours or weeks between symptom and root cause — and most of that lag traces back to incomplete information about what was happening when the failure occurred.

That is the part Crosscheck addresses directly. The free Chrome extension captures a complete bug report in one click: screenshot or screen recording, full console logs, every network request with headers and payloads, browser and environment metadata, and the user's exact reproduction path. Reports route straight to Jira, Linear, ClickUp, GitHub, or Slack, so the engineer fixing the bug has the evidence in front of them on the first read.

It does not replace adversarial QA, contract testing, or staged rollouts. It does close the loop on the bugs that escape anyway — which, as the six case studies above demonstrate, will always be some of them.
