The Feature Flag Nightmare: A QA Perspective

Modified June 19, 2026

Mike Sell

crashnaut.com

My former boss, Jerome Dane, wrote a piece I keep coming back to: Feature Flags are Dangerous. His argument is from the engineering and architecture seat — flags blur what's actually running in production, leave landmines of dead code, create multiple code paths, and quietly drift between environments and caches.

I want to add the view from the chair next to his: the QA chair. Because everything he says is true on the way in, and it gets worse on the way out. If feature flags are dangerous to write, they are pure hell to test.

Here's the part that doesn't fit in a standup update: every independent flag doubles the test surface. Not adds to it — doubles it.

The maths nobody wants to hear

A single on/off flag means a feature has two states. Two flags, four. Three, eight. For n independent on/off flags, the number of distinct configurations is 2ⁿ, and exponential growth gets out of hand faster than intuition expects.

Configurations to test grow as 2ⁿ:

3 flags: 8

5 flags: 32

8 flags: 256

10 flags: 1,024

Ten flags — a number any real product blows past — is 1,024 distinct configurations of the same codebase. No QA team on earth is running 1,024 regression passes. So they don't. They test a couple of configurations and ship the rest blind.

Here's what that looks like for just three flags:

Three on/off flags create 2³ = 8 distinct paths through one feature. Teams usually test only two of them, all-off and all-on, and ship the other six without ever running them.
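To make the count concrete, here is a minimal Python sketch that enumerates every configuration of three on/off flags and separates the two that usually get tested from the rest. The flag names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical flag names, for illustration only.
FLAGS = ["new_checkout", "dark_mode", "bulk_export"]

def all_configurations(flags):
    """Return every on/off assignment as a dict of flag name -> bool."""
    return [dict(zip(flags, values))
            for values in product([False, True], repeat=len(flags))]

configs = all_configurations(FLAGS)

# The two configurations teams usually test: everything off, everything on.
commonly_tested = [c for c in configs if len(set(c.values())) == 1]
untested = [c for c in configs if c not in commonly_tested]

print(len(configs), len(commonly_tested), len(untested))  # 8 2 6
```

Swapping in a list of ten flag names makes `all_configurations` return 1,024 entries, while `commonly_tested` stays at two.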

It's worse than 2ⁿ in the real world

The clean 2ⁿ number assumes flags are independent and the only thing that varies is the flag itself. Neither is true.

Flags interact — flag B behaves differently depending on whether flag A is on, which is exactly the kind of bug that hides in the combinations nobody ran. And flag state is only one axis. The same matrix has to be re-run across the things Jerome warned about: behaviour differs across environments, and again across caches (a flag value cached locally can disagree with the distributed source of truth). Add the user roles or segments most B2B products carry, and the real number is a product of all of them.

The real matrix: 3 flags, multiplied out

Flags alone: 8

× 3 environments: 24

× 4 user roles: 96

× 2 cache states: 192

Three flags — three! — and you're already looking at nearly 200 meaningfully different configurations. This is why "it worked in staging" is not a sentence QA can trust when flags are involved: staging was one cell in a very large grid.
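The multiplication above can be reproduced by building the grid explicitly. This is only a sketch: the environment, role, and cache names are assumptions made up for the example, and only the counts (3 flags, 3 environments, 4 roles, 2 cache states) come from the figures above.

```python
from itertools import product

# Hypothetical axis values; only the counts come from the article.
FLAG_STATES = list(product([False, True], repeat=3))   # 8 flag combinations
ENVIRONMENTS = ["dev", "staging", "production"]
ROLES = ["admin", "manager", "member", "guest"]
CACHE_STATES = ["fresh", "stale"]

# Every cell of the grid is one (flags, environment, role, cache) tuple.
grid = list(product(FLAG_STATES, ENVIRONMENTS, ROLES, CACHE_STATES))
print(len(grid))  # 192

# "It worked in staging", checked as an admin with a fresh cache and all
# flags on, covers exactly one cell of that grid.
verified = ((True, True, True), "staging", "admin", "fresh")
print(grid.count(verified), "of", len(grid))  # 1 of 192
```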

What gets tested vs. what ships

The gap between what a developer verified and what actually reaches users is where flag bugs live.

What the developer tested:

- The flag on, happy path
- On their machine, one environment
- As an admin / their own account
- With a fresh cache

What actually ships to users:

- Every reachable combination of flags
- Across every environment
- Across every role and segment
- With stale, warm, and cold caches

The developer flipped one flag on and saw their feature work. Then a customer hits it with a different flag also on, in production, as a non-admin, behind a stale cache — a cell of the grid no one ever opened. To them it's just a bug. To Jerome's point, this is "testing in production" whether you meant to or not.

Why automation doesn't rescue you

The instinct is "we'll just parametrize the tests across flag states." You can — but the matrix is still exponential, so you get to choose between two bad outcomes: run every combination and watch CI time explode past anything usable, or pick a subset and accept the gaps. There's no free lunch hiding in the test runner.

The pragmatic middle ground is combinatorial (pairwise) testing: instead of all 2ⁿ combinations, generate the much smaller set that covers every pair of flag values at least once. It catches the large class of bugs that come from two flags interacting, at a fraction of the cost. It is a mitigation, not a cure — it will not catch a bug that only appears when three specific flags line up — but it's the honest tool for the job.
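As a rough illustration of the idea, the sketch below builds a pairwise suite with a simple greedy strategy: keep picking the configuration that covers the most not-yet-covered pairs of flag values. This is one assumed approach, not the algorithm any particular tool uses. It enumerates all 2ⁿ candidates, so it only suits small flag counts, and the suite it returns is valid but not guaranteed to be the smallest possible. The flag count of 5 is arbitrary.

```python
from itertools import combinations, product

def pairs_of(config):
    """All pairs of (flag index, value) that one configuration exercises."""
    return {((i, config[i]), (j, config[j]))
            for i, j in combinations(range(len(config)), 2)}

def pairwise_suite(n_flags):
    """Greedily pick configurations until every pair of flag values is covered."""
    candidates = list(product([False, True], repeat=n_flags))
    # Every pair that appears in at least one full configuration.
    uncovered = set().union(*(pairs_of(c) for c in candidates))
    suite = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best)
    return suite

suite = pairwise_suite(5)
print(len(suite), "configurations instead of", 2 ** 5)
```

Each tuple in `suite` is one flag assignment to run. A suite built this way still says nothing about failures that need three specific flags to line up.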

What actually helps, from the QA seat

Jerome's prescription — fewer flags, branches over toggles, small reversible releases, a fast pipeline — is also the best testing strategy, because every one of those choices shrinks the grid. A few things I'd add from the testing side:

The QA-friendly decision: most toggles want to be a small release or a branch. If it genuinely must be a flag, treat it as a first-class test parameter and give it an expiry.

A few principles behind that flow:

Treat every live flag as a test parameter, not an afterthought. If a flag exists, its states belong in the test plan: tested by default, with pairwise coverage across the rest. A flag that isn't in the test matrix is a flag you're shipping blind.

Kill flags aggressively. Every flag you remove halves the surface you just saw explode. A sunset date on each flag is the single highest-leverage testing decision a team can make, and it costs nothing.

Make flag state observable. When a bug report comes in, the first question is "which flags were on?" If your logs answer that, a failure is reproducible. If they don't, you're debugging one unknown cell in a 200-cell grid.

Pin flag state in test environments. Tests that run against whatever the flag service happens to return today aren't tests — they're weather reports. Fix the flag values per test run so a green result means something.
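The last three principles (sunset dates, observable state, pinned values) can be sketched together. `FlagRegistry` below is a hypothetical in-process store invented for this example, not the client of any real flag service, and the flag names and dates are made up.

```python
import json
from contextlib import contextmanager
from dataclasses import dataclass
from datetime import date

@dataclass
class Flag:
    enabled: bool
    sunset: date  # date by which the flag should be deleted

class FlagRegistry:
    """Hypothetical in-process flag store, not a real flag-service client."""

    def __init__(self, flags):
        self._flags = dict(flags)
        self._pinned = {}

    def is_on(self, name):
        # Pinned values win, so a test run never depends on live flag state.
        return self._pinned.get(name, self._flags[name].enabled)

    def expired(self, today):
        """Names of flags past their sunset date: candidates for removal."""
        return sorted(n for n, f in self._flags.items() if f.sunset < today)

    def snapshot(self):
        """Flag state as JSON, suitable for attaching to a log line."""
        return json.dumps({n: self.is_on(n) for n in sorted(self._flags)})

    @contextmanager
    def pinned(self, **values):
        """Fix flag values for the duration of one test."""
        previous = self._pinned
        self._pinned = {**previous, **values}
        try:
            yield self
        finally:
            self._pinned = previous

registry = FlagRegistry({
    "new_checkout": Flag(enabled=False, sunset=date(2026, 3, 1)),
    "bulk_export": Flag(enabled=True, sunset=date(2026, 12, 1)),
})

with registry.pinned(new_checkout=True):
    print(registry.snapshot())  # {"bulk_export": true, "new_checkout": true}

print(registry.expired(today=date(2026, 6, 19)))  # ['new_checkout']
```

Writing the `snapshot()` string into every error log answers "which flags were on?" without guesswork, and a CI check that fails when `expired()` is non-empty is one possible way to give sunset dates teeth.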

The bottom line

Feature flags don't remove risk; they move it — off the developer's machine and into the tester's matrix, which grows exponentially with every toggle added. Jerome is right that they're dangerous to write. From where I sit, the kindest thing any team can do for the people testing their software is to have fewer flags, register the ones they keep as real test parameters, and kill them fast.

The fewer cells in the grid, the more of them you can actually open.
