Reliable Continuous Deployment in 7 steps

Many companies aspire to do Continuous Deployment, but few ever get there. How difficult is it really to achieve? Should it really be the white rhino of our industry? We would like to put forward that it doesn't need to be hard, but requires organizations to stop some bad habits and acquire a few good ones. This post will show how.

First, let’s define what we mean by Continuous Deployment: the process whereby every commit to a main branch gets deployed straight to production, provided the CI pipeline passes. Another prerequisite is doing this frequently, with many small changes.

Now that we have defined what we mean, let’s get on with the how.

How feasible is Continuous Deployment for me?

This is a good news/bad news situation: if you are starting a new project, it is relatively easy. But if you have an existing codebase with poor test coverage, unstable pipelines and deployment processes requiring many manual steps, you are in for some hard work. The reason is simple: reliability is not something that can easily be retrofitted. The reliability of the existing software and the ability to change it reliably are the cornerstones of Continuous Deployment.

Required habits & practices

Our view is that there are seven foundational habits/practices for Continuous Deployment. Everything else is secondary, but these are non-negotiable. How you achieve them is, as we see it, an "implementation detail", though a later section offers some guidelines on how you might go about it.

Prefer many small changes over large ones

This should be obvious, but it is not always reflected in what is practised in industry. Small changes limit the scope of what can go wrong and the potential blast radius. They are easy to diagnose and easy to revert if something goes wrong. Making frequent small changes also trains your “delivery & deployment muscles”: because you deploy so often, teams have an incentive to remove any friction or error-proneness from the process. Any larger changes, or changes that are operationally risky, should always be hidden behind feature flags (easy to implement) or, if they cannot easily be feature-flagged, released as canary releases (harder).

Fast test-suite with high test-coverage

High test coverage should be a given. It is the primary way by which you can avoid regressions in existing code and have confidence in new code. An often missed point is that a test-suite should also be fast. A slow test-suite that fails from time to time will often have failures ignored, whereas a fast test-suite will have any failures quickly diagnosed.

Observability, monitoring & alerting

Observability is a must. Log aggregation, metrics (both technical & business metrics) & traces are the fundamental building blocks of observability. Use something like OpenTelemetry and set it up immediately. Then:

Ensure only genuine bugs & outages of dependent services log at an Error level, and that all these logs can be connected to a trace.
Ensure the code is instrumented for tracing.
Ensure error rates are tracked, such as 500 responses for a REST API or failures to process/send messages in messaging middleware. These should always be at 0% during normal operations, and be monitored as such.
Have meaningful business & technical metrics in dashboards and keep an eye on their baselines. These will be highly contextual, but may be things like current active users, payments made or any number of other metrics. Some simple, universal metrics may be things like latency monitoring on REST endpoints.
Have proper tools, such as Elasticsearch, Grafana, Prometheus, Loki, Datadog or your cloud provider's built-in observability tools, to aggregate all of the above.
Have all of the above connected to monitoring and alerting.
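
To make the "0% error rate" point concrete, here is a minimal sketch of a rolling error-rate check. It is not tied to any of the tools listed above: the class name, the window size and the threshold are assumptions made for illustration, and in practice this logic would normally live in your monitoring stack rather than in application code.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of 5xx responses among the most recent requests.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, window_size=100, threshold=0.0):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self._statuses = deque(maxlen=window_size)
        self._threshold = threshold

    def record(self, status_code):
        self._statuses.append(status_code)

    def error_rate(self):
        if not self._statuses:
            return 0.0
        errors = sum(1 for status in self._statuses if status >= 500)
        return errors / len(self._statuses)

    def should_alert(self):
        # With the default threshold of 0.0, a single 5xx in the window alerts.
        return self.error_rate() > self._threshold
```

Feeding it the status code of every response and paging someone when `should_alert()` returns true is the simplest form of the rule; a real alerting setup would usually add a minimum duration before firing.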

Any outages or issues reported by users, rather than by monitoring & alerting, are a red flag that you have gaps in your observability. Fix them immediately!

Zero tolerance for bugs

Any and all bugs found in production, however rare or unlikely to occur, get investigated and fixed immediately. No exceptions. No tickets, no later prioritisation. If you want to run a reliable system, zero tolerance for bugs is crucial: you cannot run a reliable system if you allow bugs to build up over time.

Zero tolerance for flakey pipelines, builds or deployments

Everyone has come across the pipeline or test that randomly fails 1 in 3, 5 or 10 runs. A common instinct is to ignore it and just re-run it. Don’t. A pipeline that fails 1 in 10 times will eventually start failing 1 in 5, then 1 in 3. A flakey build is almost always a symptom of a bug in your code, tests or pipeline, and should be addressed immediately. Deal with it as seriously as you would a production issue, because chances are it will eventually become one, if it isn’t one already without you knowing it.
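
Flakiness is easier to act on when it is detected rather than remembered. As a rough sketch, assuming you can export your CI history as (commit, test name, passed) records, a test that both passed and failed on the same commit is a strong flakiness signal, since the code did not change between runs. The function name and record shape below are hypothetical.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return the names of tests that both passed and failed on one commit.

    `runs` is an iterable of (commit_sha, test_name, passed) tuples; this
    record shape is an assumption, not the export format of any CI system.
    """
    outcomes = defaultdict(set)
    for commit_sha, test_name, passed in runs:
        outcomes[(commit_sha, test_name)].add(bool(passed))
    flaky = {
        test_name
        for (_commit, test_name), seen in outcomes.items()
        if seen == {True, False}
    }
    return sorted(flaky)
```

A test that fails consistently on one commit and passes on another is not reported, because that pattern looks like a genuine regression or fix rather than a flake.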

Committed code arrives in production in <15 minutes

Slow pipelines are bad for a number of reasons:

A pipeline that is slow and occasionally fails is more likely to be ignored than a fast pipeline that fails.
Context switching is bad - either people start doing something else, and may forget about keeping an eye on the deployment, or they waste time just supervising a pipeline to see if everything is ok. Either alternative is sub-optimal.
In the rare case a pipeline has to be diagnosed or debugged, the faster the better.

What to do if there simply is no way to get your build under 15 minutes? It depends, but maybe it is an indication your deployment artefact has grown too bloated, and it is time to split it up into more distinct services that can be built in parallel.
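
The 15-minute target only helps if it is measured. A minimal sketch, assuming you can obtain the commit timestamp and the time the change went live for each deployment (the tuple shape and function name are illustrative assumptions):

```python
from datetime import datetime, timedelta

# The budget from the heading above: committed code in production in <15 minutes.
LEAD_TIME_BUDGET = timedelta(minutes=15)


def over_budget(deployments, budget=LEAD_TIME_BUDGET):
    """Return the commits whose commit-to-production time missed the budget.

    `deployments` is an iterable of (commit_sha, committed_at, deployed_at)
    tuples; this shape is a hypothetical export, not a real tool's format.
    """
    return [
        commit_sha
        for commit_sha, committed_at, deployed_at in deployments
        if deployed_at - committed_at >= budget
    ]
```

Tracking the list this returns over time shows whether the pipeline is drifting towards the slow end before it becomes a habit to ignore it.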

No manual steps anywhere, ever

Again, something most will buy into, but not everyone practices. No step in your CI or deployment should require manual intervention. Ever. Even in the case of a catastrophic failure, you should be able to rebuild all your infrastructure, services and restore data by running a simple command. You do test your disaster recovery processes from time to time, don’t you?

Pipeline & deployment patterns

Everything in this section is secondary to the practices and habits. These are things that we have found to work well, but they are contextual, and in some cases they may not be suitable. Let the habits and practices be your guiding stars, and let these recommendations be just that: recommendations to consider.

End-to-end tests are an anti-pattern

Possibly an unpopular opinion, but time and time again, we have seen how end-to-end tests end up being slow, unreliable and fragile. They should be avoided as much as possible. If you have decoupled components, such as frontends and backends with well-defined APIs, these surface areas are best mocked out, rather than tested together. The ideal situation is to define these APIs “API first”: have a language-neutral definition from which code on both sides can be generated, then validate that inputs and outputs conform to the specification. Integration tests with databases and messaging systems, on the other hand, are crucial. These can easily be made fast with modern tools such as Testcontainers.
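
To illustrate validating payloads against a shared specification instead of testing both sides together, here is a deliberately simplified sketch. The "specification" is just a dictionary of field names and Python types standing in for a real language-neutral definition, and `PAYMENT_RESPONSE_SPEC` and its fields are invented for the example.

```python
# Hypothetical contract: a stand-in for a real, language-neutral API definition.
PAYMENT_RESPONSE_SPEC = {"id": str, "amount_cents": int, "currency": str}


def contract_violations(payload, spec):
    """Return human-readable differences between a payload and a spec."""
    problems = []
    for name, expected_type in spec.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            actual = type(payload[name]).__name__
            problems.append(
                f"wrong type for {name}: "
                f"expected {expected_type.__name__}, got {actual}"
            )
    for name in payload:
        if name not in spec:
            problems.append(f"unexpected field: {name}")
    return problems
```

The backend's tests can assert that its responses produce no violations, and the frontend's mocks can be checked against the same spec, so neither side needs the other running.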

Feature-flag & A/B test liberally, consider canary releases

Risky or unavoidably bigger changes should always be feature-flagged. It is an additional bonus if they can be rolled out incrementally to some percentage of users, rather than being binary on/off switches. Being able to do this will significantly reduce your deployment risk, and also give you a path to recovery from failed deployments without having to redeploy: simply roll back the flag! If you have the ability to do incremental roll-outs, you have the basics available to A/B test new features, which is a powerful tool to help you and your business stakeholders decide which features to abandon and which ones to double down on. An additional step might be to consider canary releases, but we have also found that feature-flagging with incremental roll-outs can often work as a sort of “poor man's canary release”, as long as the operational characteristics (such as memory/CPU usage) are not changed by a deployment.
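
A percentage roll-out can be sketched in a few lines. This is one common approach, not a description of any particular feature-flag product: hash the flag name together with a stable user identifier into a bucket from 0 to 99, and enable the flag for users whose bucket falls below the roll-out percentage. The function names are our own for the example.

```python
import hashlib


def rollout_bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name, user_id, rollout_percent):
    """True if the flag is on for this user at the given roll-out percentage."""
    return rollout_bucket(flag_name, user_id) < rollout_percent
```

Because the bucket depends only on the flag and the user, a user who sees the feature at 10% still sees it at 50%, and setting the percentage back to 0 turns it off for everyone without a redeploy. Including the flag name in the hash means different flags select different subsets of users.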

Split CI from deploy

CI should be concerned with testing, verifying and building the deployment artefact. The deployment itself should be managed separately. This can be a separate pipeline triggered by a webhook, a deployment tool such as ArgoCD, Atlantis, our own xlrte, or something home-built. There are a couple of key benefits to separating CI from deployment:

Developers can use the same tooling to deploy from their local machines to a test environment, thus improving developer productivity.
On-demand feature environments become easier to build, should you need them.
Rollbacks become trivial (see next point).

Have an automated rollback built into your deployment tool or pipeline

If you end up having an issue with a release, rolling back should be as simple as pushing a button. If you followed our previous recommendation, this should be simple to implement: re-use FTW!
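
The reason a separate deployment step makes rollbacks cheap is that a rollback is just another deployment of an artefact that already exists. A minimal sketch, assuming a hypothetical `deploy_fn` callable that deploys a given artefact version (it stands in for whatever your deployment tool exposes):

```python
class DeploymentHistory:
    """Remembers deployed versions so the previous one can be redeployed."""

    def __init__(self, deploy_fn):
        # deploy_fn is a placeholder for the real deployment mechanism.
        self._deploy_fn = deploy_fn
        self._versions = []

    @property
    def current(self):
        return self._versions[-1] if self._versions else None

    def deploy(self, version):
        self._deploy_fn(version)
        self._versions.append(version)

    def rollback(self):
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        previous = self._versions[-2]
        # Redeploy first; only forget the bad release once that succeeded.
        self._deploy_fn(previous)
        self._versions.pop()
        return previous
```

Nothing is rebuilt during a rollback: the same code path that shipped the previous version ships it again.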

Conclusion

Continuous Deployment doesn’t need to be hard or stressful. It does require discipline and good habits, but the good news is that, like any habits, they become second nature once you have adopted and internalised them for a few months. Habits are hard to break or change, but easy to maintain.

Good luck on your journey!