Feature-flagging, A/B-testing & Canary-releases explained

Feature-flagging, A/B-testing and Canary-releases sometimes get mixed up, not least because they overlap heavily in use and features, even though they serve two different user groups: software engineers and product managers. But what do they do, when and why should you consider using them, and how? Let's discuss it in this post.

What are some of the commonalities?

The most basic commonality between the three concepts is that they all allow you to show new or different features to different sets of users at the same time.

Typically, a good tool for feature-flagging, A/B-testing or Canary-releases will support the following common features:

Audience selection: This involves fine-grained control over who gets to see a specific feature or release. It could mean targeting users based on geography (country), device or browser type, whether they are internal to your company, whether they are logged in, whether they are repeat users, or any other characteristic of a session.
Incremental roll-outs: Incremental roll-outs allow you to define a certain percentage of users, either of all users or of a selected audience, who will be opted into a feature. You might for instance want to initially roll out a new feature to 10% of mobile users and check some metrics or other feedback before rolling it out to more users.
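To make these two features concrete, here is a minimal sketch of how a tool might combine audience selection with a percentage roll-out. It is not how any particular product works: the `bucket` and `is_enabled` functions, the flag dictionary layout and the `new-checkout` flag name are all assumptions made for illustration. The idea is that hashing the user ID gives every user a stable position between 0 and 100, so a user who is opted in at 10% stays opted in when the roll-out grows to 50%.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> float:
    """Map a user to a stable number in [0, 100) for a given flag (hypothetical helper)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash and scale to two decimals of a percent.
    return int(digest[:8], 16) % 10000 / 100.0


def is_enabled(user: dict, flag: dict) -> bool:
    """Apply audience selection first, then the incremental roll-out percentage."""
    for attribute, allowed_values in flag.get("audience", {}).items():
        if user.get(attribute) not in allowed_values:
            return False  # user is outside the selected audience
    return bucket(user["id"], flag["name"]) < flag["percentage"]


# Assumed flag definition: 10% of mobile users get the new feature.
new_checkout = {
    "name": "new-checkout",
    "audience": {"device": {"mobile"}},
    "percentage": 10,
}

if __name__ == "__main__":
    users = [{"id": f"user-{i}", "device": "mobile"} for i in range(1000)]
    opted_in = sum(is_enabled(u, new_checkout) for u in users)
    print(f"{opted_in} of {len(users)} mobile users see the new checkout")
```

Including the flag name in the hash means that different flags select different 10% slices of users, so the same people are not always the first to get every new feature.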

The tools: features

Below is what we would consider the bare-minimum feature set for each:

Feature-flagging

Basic feature-flagging might be as simple as a boolean switch to turn a feature on or off. But most likely, a good tool will also support audience selection and incremental roll-outs, as described earlier.

A/B-testing

We would consider a good A/B-testing suite something akin to "a feature-flagging tool with extra sugar for product managers & data scientists". A good A/B-testing tool will not only help you feature-flag, but also potentially allow you to:

Deliver more than one variation (version) of a feature.
Integrate with other analytics tools to measure the relative performance between variations.
Provide some in-tool analytics that measures performance between variations.

Since a good A/B-testing tool will allow you to test more than two variations, the term "A/B" might be a little misleading. There are also other variations (sorry for the pun), such as Multi-Armed Bandit tests, which, instead of seeking to prove a statistically significant performance difference, continuously shift traffic towards the better-performing variations, at the cost of some statistical significance.
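As a rough illustration of the bandit idea, here is a sketch of epsilon-greedy, one of the simplest Multi-Armed Bandit strategies. We are not claiming that any specific A/B-testing product uses it; the class name, the `epsilon` value of 0.1 and the simulated conversion rates are assumptions for the example. Most of the time the best-performing variation so far is shown, and a small share of traffic keeps exploring the others.

```python
import random


class EpsilonGreedyBandit:
    """Illustrative epsilon-greedy bandit over a set of variations."""

    def __init__(self, variations, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.shown = {v: 0 for v in variations}
        self.converted = {v: 0 for v in variations}

    def rate(self, variation):
        """Observed conversion rate so far (0.0 if never shown)."""
        shown = self.shown[variation]
        return self.converted[variation] / shown if shown else 0.0

    def choose(self):
        """Explore with probability epsilon, otherwise exploit the best variation."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shown))
        return max(self.shown, key=self.rate)

    def record(self, variation, converted):
        """Record one impression and whether it converted."""
        self.shown[variation] += 1
        if converted:
            self.converted[variation] += 1


if __name__ == "__main__":
    # Simulated "true" conversion rates, unknown to the bandit (made-up numbers).
    true_rates = {"A": 0.05, "B": 0.08, "C": 0.12}
    rng = random.Random(42)
    bandit = EpsilonGreedyBandit(true_rates, epsilon=0.1, rng=rng)
    for _ in range(5000):
        variation = bandit.choose()
        bandit.record(variation, rng.random() < true_rates[variation])
    print(bandit.shown)  # impressions per variation after the simulation
```

The trade-off mentioned above shows up directly: the weaker variations receive few impressions, so the estimates of their conversion rates stay noisy.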

A final observation: if you do A/B-testing (and you should!), it is quite likely that your A/B-testing tool will also be a more than capable feature-flagging tool. Hence, no need to pay for two different tools, when one can be used for both.

Canary-releases

Canary-releases are perhaps the most different from the aforementioned tools. In essence, Canary-releases have a few key characteristics:

Two or more releases of the same base-software are run in the same environment at the same time.
Traffic shaping is used to send either a percentage of all users, or a specific audience to one or the other release.
The Canary-software monitors some pre-defined metrics (for instance from Prometheus), and as success-thresholds are met, traffic gets increasingly shifted towards the newer release version.

Examples of metrics in a web-context could be "latency does not rise above some threshold" or "memory usage is at most 10% higher than the previous version". There are numerous metrics that might make sense in a given scenario. If any metric threshold is breached, traffic to the new version is turned off, and the old version remains in place.
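The loop described above can be sketched in a few lines. This is a simplified illustration and not the implementation of any real canary tool: the `read_metrics` callback, the 300 ms latency limit and the traffic steps are assumptions. In practice the metrics would come from a monitoring system such as Prometheus and the traffic weights would be applied by a load balancer or service mesh.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Metrics = Dict[str, float]


def passes_thresholds(canary: Metrics, baseline: Metrics) -> bool:
    """Check the two example thresholds: absolute latency and relative memory."""
    latency_ok = canary["latency_ms"] <= 300.0  # assumed latency threshold
    memory_ok = canary["memory_mb"] <= baseline["memory_mb"] * 1.10  # at most 10% higher
    return latency_ok and memory_ok


def run_canary(
    read_metrics: Callable[[int], Tuple[Metrics, Metrics]],
    steps: Sequence[int] = (5, 25, 50, 100),
) -> List[int]:
    """Shift traffic step by step; return the history of canary traffic weights.

    read_metrics(weight) is a hypothetical callback returning (canary, baseline)
    metrics observed while the canary receives `weight` percent of traffic.
    """
    history: List[int] = []
    for weight in steps:
        history.append(weight)  # shift this share of traffic to the canary
        canary, baseline = read_metrics(weight)
        if not passes_thresholds(canary, baseline):
            history.append(0)  # threshold breached: send all traffic back
            return history
    return history


if __name__ == "__main__":
    def leaky_release(weight: int) -> Tuple[Metrics, Metrics]:
        # Simulated release whose memory usage grows once it gets real load.
        memory = 520.0 if weight < 50 else 700.0
        return (
            {"latency_ms": 120.0, "memory_mb": memory},
            {"latency_ms": 110.0, "memory_mb": 500.0},
        )

    print(run_canary(leaky_release))
```

A history ending in 100 means the new release took over completely. A history ending in 0 means the canary was rolled back and the old version kept all traffic.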

Mapping use-cases to the right tool

Now that we understand the basic capabilities of each type of tool, let's see what the potential use-cases are, and where each one fits. To make the use-cases concrete, let's use a format that we are all likely to be familiar with: the user story.

"As a software engineer, I want to ensure the release does not adversely impact non-functional characteristics, like latency, or CPU- or memory-usage"

Solution: Canary-releases.

As you might have guessed from the functional description of canary-releases, canaries are used specifically to mitigate the risk of unexpected non-functional behavioural changes in software.

For instance, you might have introduced an in-memory cache into your application, which blows up memory usage and makes the app crash with an Out Of Memory error. Canaries could save you from this scenario. Or maybe a change that you didn't realise would impact latency suddenly makes your app slow. Canaries to the rescue.

Canary-releases save your bacon from non-functional known-unknowns and unknown-unknowns that could affect your uptime, availability or quality of service.

You could also use Canaries to test in production, by allowing internal users to test the canary before switching traffic. A conceivable reason to do so might be that a system does not behave representatively in a non-production environment without production data, while data replication from production to testing is either infeasibly complex, too time-consuming or simply not allowed for regulatory reasons.

"As a software engineer, I want to validate new features for correctness in production on a subset of users"

Solution: Feature-flagging.

In most organisations, engineers write tests to validate their features. They might test them in a staging/testing environment. But we all know that until features hit production data or load, it's difficult to know for certain if the tests covered all the possible edge-cases. This is where feature-flagging is useful: maybe we'll turn on the feature and see if it has an adverse effect on any meaningful metrics for the application. Maybe we are so unsure that we want to do an incremental roll-out before turning it all the way up to 100%.

Feature-flagging gives us an extra level of safety in delivering new features: we can turn them on, and perhaps roll them back, without having to do a new production deployment. The ability to do a rollback without doing a deployment is a very underrated feature of flagging, and it gives extra confidence in new deployments.
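Here is a small sketch of why a flag makes rollback a configuration change rather than a deployment. It assumes the flags live in a JSON file outside the application code. The `FileFlagStore` class, the `flags.json` layout and the `checkout` function are hypothetical, and a real tool would more likely read flags from a service or database. Because the flag is looked up at request time, editing the file switches behaviour in the running application.

```python
import json
from pathlib import Path


class FileFlagStore:
    """Reads flags from a JSON file on every lookup (hypothetical, for illustration)."""

    def __init__(self, path):
        self.path = Path(path)

    def is_on(self, name: str, default: bool = False) -> bool:
        try:
            flags = json.loads(self.path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            return default  # missing or broken file: fall back to the safe default
        if not isinstance(flags, dict):
            return default
        return bool(flags.get(name, default))


def checkout(store: FileFlagStore) -> str:
    """Both code paths are deployed; the flag decides which one runs."""
    if store.is_on("new-checkout"):
        return "new checkout flow"
    return "old checkout flow"


if __name__ == "__main__":
    store = FileFlagStore("flags.json")
    # Prints "old checkout flow" unless flags.json contains {"new-checkout": true}.
    print(checkout(store))
```

Defaulting to "off" when the flag source is unavailable is a deliberate choice in this sketch: if the flag system itself fails, users get the old, known-good behaviour.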

"As a product manager, I want to control how and when features get released, without having to control what goes into production releases"

Solution: Feature-flagging.

Have you ever been in the situation where engineers "surprised" a product manager by releasing a feature into production that the product manager wanted to announce with fanfare to users at a later date?

In certain industries, there may also be regulatory reasons to time the release of certain features to a pre-defined point in time.

Feature-flagging can allow engineers to decouple software deployments from feature releases, and allow product managers to control what gets released when and to whom, without worrying about whether it was already released unbeknownst to them. The only requirement is that engineers actually implement the feature flags.

"As a product manager, I want to validate the business-performance impact of product changes"

Solution: A/B-testing.

This use-case almost requires its own book (in fact, several good ones have been written). But in short: regardless of what people think they know about users, without evidence, it's just their opinion.

Whether a feature is worth having, or was worth building, can only be proven if we have:

A hypothesis of why it should exist
A measure/metric that proves or disproves the hypothesis
The ability to measure before and after, or alternatives against each other.

How do you do this? The answer is: in small increments. Create a measurable hypothesis, build the minimum that can measure it, compare results. Rinse and repeat.
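For the "compare results" step, a common textbook approach for conversion-style metrics is a two-proportion z-test. The sketch below is one way to do it by hand, under the assumption that the metric is a simple converted/not-converted count per variation. The function name and the sample numbers are made up for illustration, and a real A/B-testing tool may well use different statistics.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rate B vs A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally well.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variance in the data; cannot compute a z-score")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up numbers: 100 of 1000 users converted on A, 150 of 1000 on B.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (a conventional cut-off is 0.05) suggests that the difference between the variations is unlikely to be noise. It is worth agreeing on that cut-off as part of the hypothesis, before the test starts.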

In a particularly high-performing team I was part of, we used to settle any conversation that threatened to become a discussion about "what was the best solution", with the simple sentence: "What's the hypothesis and how do we measure it?".

More often than not, the argument was settled by agreeing on parameters of a test.

The key to truly high-performing software delivery is to work on the shortest possible cycle of hypothesis -> build -> measure -> decide -> repeat. A/B-testing allows you to do exactly this.

Conclusion: should you do feature-flagging, A/B-testing or canary releases?

The answer is: yes. Do all of them if you can.

The headline of this section lists them in the likely order of complexity of adoption. You probably get 90% of the benefit of all three practices from feature-flagging and A/B-testing combined, as long as you have good observability practices and fast deployment pipelines for rollbacks. But if you want to enable exceptionally reliable delivery performance, you should strive to do all three.