Microservices: principles & pitfalls to avoid pain

Microservices have been popular for many years now. Nevertheless, distributed systems are hard to get right. In this post, we will go beyond theory and discuss common mistakes to avoid and rules of thumb for getting it right.

These principles are born of practical experience, and we have the battle scars to prove it. Many will be familiar, but others may be new and yet others contentious. Let's get started!

Autonomous teams that can be fed by two pizzas

The two-pizza-team rule was coined at Amazon, but the idea is in reality much older than that. Repeated experience shows that teams of 4-8 people are the optimal size. Any fewer, and you are too dependent on individual contributors and can be hurt by sickness or staff turnover. Any more, and the communication overhead starts to grow, miscommunications increase and productivity suffers.
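One common way to picture the growing communication overhead is to count the possible person-to-person channels in a team. The sketch below assumes every pair of team members is a potential channel, which is a simplification of ours rather than something measured; the function name is hypothetical.

```python
def communication_channels(team_size: int) -> int:
    """Count distinct pairs of people in a team (assumption: one channel per pair)."""
    if team_size < 0:
        raise ValueError("team_size must not be negative")
    return team_size * (team_size - 1) // 2


# Doubling the team size roughly quadruples the number of possible channels.
for size in (4, 8, 16):
    print(f"{size} people -> {communication_channels(size)} channels")
```

Under this assumption a team of 4 has 6 channels, a team of 8 has 28 and a team of 16 has 120.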

The second part is autonomy: teams must be able to work independently with a clear sense of purpose, without depending on other teams. Some dependencies are unavoidable, but all others, when found, must be aggressively removed.

API first, everywhere

In 2002, Jeff Bezos famously mandated that all services and teams within Amazon communicate via APIs. We agree with his sentiment. For us, this means a few concrete actions must be taken:

Define APIs in language-independent formats. Protobuf and Avro are good choices. For browser- and HTTP-based clients, JSON Schema, OpenAPI and GraphQL are good options.
Prefer code generation from API definitions for server and client bindings. This avoids drift between the API definition and the actual implementation and reduces communication overhead, as the API definition becomes the canonical source.
It deserves restating: the above also applies to HTTP APIs consumed by browsers and mobile clients.

Why should we do this? There are several reasons. Technically, we reduce the opportunities for API drift, and thus the opportunities for bugs. From a communication perspective, we reduce and automate the communication overhead between teams that manage different services. API first/contract first aids the goal of autonomous teams.
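To make the idea of a contract as the canonical source a little more concrete, here is a minimal sketch in Python. The contract, its field names and the `contract_violations` helper are all hypothetical, and the checker only understands a tiny JSON-Schema-like subset (`required`, `properties`, `type`). In practice you would generate bindings or use a real validator for your chosen format rather than hand-roll one.

```python
import json

# Hypothetical contract for an "order created" message. In a real setup this
# would live in a shared, language-independent file, not inside service code.
ORDER_CREATED_V1 = json.loads("""
{
  "required": ["order_id", "amount_cents"],
  "properties": {
    "order_id": {"type": "string"},
    "amount_cents": {"type": "integer"},
    "note": {"type": "string"}
  }
}
""")

_PYTHON_TYPES = {"string": str, "integer": int, "boolean": bool}


def contract_violations(payload: dict, contract: dict) -> list:
    """Return a list of human-readable problems; an empty list means it conforms."""
    problems = []
    for name in contract.get("required", []):
        if name not in payload:
            problems.append(f"missing required field: {name}")
    for name, value in payload.items():
        spec = contract["properties"].get(name)
        if spec is None:
            problems.append(f"unknown field: {name}")
            continue
        expected = _PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so treat it as a mismatch for integers.
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


print(contract_violations({"order_id": "o-1", "amount_cents": 1200}, ORDER_CREATED_V1))
print(contract_violations({"order_id": 1}, ORDER_CREATED_V1))
```

Because both producer and consumer check against the same definition, a drifting payload shows up as an explicit violation instead of a bug found later in another team's service.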

Prefer async pub/sub-based messaging between services

Distributed systems should be able to continue functioning even when other services are down or are having issues. If services are coupled to other services by point-to-point communication with synchronous RPC, REST or GraphQL requests, the failure of a single service can quickly cascade to numerous other services. Another risk is that by doing synchronous communication everywhere, you limit the scalability of a system to its least scalable service. This in turn has one of two consequences: limited scalability, or excessive cost in scaling the least scalable components with more resources.
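A rough back-of-the-envelope calculation illustrates both risks. The sketch below assumes that every request must pass through every service in a synchronous chain and that services fail independently; both are simplifying assumptions of ours, and the numbers are made up for illustration.

```python
import math


def chain_availability(availabilities):
    """Availability of a synchronous chain: every service must be up at once."""
    return math.prod(availabilities)


def chain_throughput(max_requests_per_second):
    """A synchronous chain can go no faster than its slowest service."""
    return min(max_requests_per_second)


# Five services, each available 99.9% of the time (illustrative numbers).
print(f"{chain_availability([0.999] * 5):.4f}")

# Three services with different capacities; the slowest one sets the limit.
print(chain_throughput([500, 120, 900]))
```

Under these assumptions, each extra synchronous hop lowers the availability of the whole chain, and the 120 requests-per-second service caps the other two.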

One of the easiest ways to counteract these risks is to prefer asynchronous pub/sub-style messaging wherever possible. The exact type of pub/sub system depends on a number of trade-offs:

How many technologies do we want to use? We likely want to keep the number of technologies in a stack as low as possible.
What is the trade-off between message durability vs cost/risk of losing a message? This likely drives the choice of messaging technology.

A system where individual services can crash or be turned off, with no impact on user experience and full recovery upon restart, is a more resilient system.
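The decoupling described above can be sketched with a toy in-memory broker. `InMemoryBroker`, the topic name and the subscriber names are all hypothetical; a real broker would add durability, acknowledgements and delivery guarantees, which this sketch deliberately ignores.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a message broker: one queue per topic and subscriber."""

    def __init__(self):
        self._queues = defaultdict(dict)  # topic -> {subscriber name: deque}

    def subscribe(self, topic, subscriber):
        self._queues[topic].setdefault(subscriber, deque())

    def publish(self, topic, message):
        # The publisher never calls a consumer directly; it only enqueues.
        for queue in self._queues[topic].values():
            queue.append(message)

    def poll(self, topic, subscriber):
        """Drain and return everything queued for this subscriber."""
        queue = self._queues[topic][subscriber]
        messages = list(queue)
        queue.clear()
        return messages


broker = InMemoryBroker()
broker.subscribe("orders.created", "billing")
broker.subscribe("orders.created", "email")

# The "email" service is down here: it simply is not polling.
broker.publish("orders.created", {"order_id": "o-1"})
broker.publish("orders.created", {"order_id": "o-2"})

print(broker.poll("orders.created", "billing"))  # billing keeps working
print(broker.poll("orders.created", "email"))    # email catches up after restart
```

The publisher and the billing consumer are unaffected by the email outage, and the email consumer recovers the backlog once it starts polling again.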

If event-sourcing, make services responsible for storing their own events

This point will be contentious and I know many people I respect will disagree.

Many vendors of messaging systems and their partners recommend using messaging systems such as Kafka as event stores that can be replayed. From experience, we are more skeptical of this approach. Messaging systems and databases serve different purposes, and when used at scale, these different purposes become evident.

Firstly, the responsibility for events becomes vague: no single service clearly owns specific events. Secondly, the complexity of changing or updating message and API formats increases if migrating from one message channel or topic to another is not easily achievable, because the channel simultaneously acts as a database or event store. Instead of being able to maintain versioned APIs that become deprecated and phased out, we must instead maintain backwards compatibility forever - technical debt that never dies.

Finally, the complicated issue of backwards compatibility, the difficulty of migrating channels and the need for expensive, cascading "replays" of messages make zero-downtime deployments hard to achieve. More sophisticated deployment methods such as canary releases become all but impossible. If we are instead able to migrate between API-versioned "channels", we make our lives much easier.

From our perspective, this is as simple as a separation of concerns between an event channel and an event store.
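A minimal sketch of that separation, assuming a service that keeps its own append-only event table and treats channels purely as transport. `OrderService`, the channel names `orders.v1` and `orders.v2`, and the two formatter functions are hypothetical, and SQLite stands in for whatever database the service would really use.

```python
import json
import sqlite3


def to_v1(event_type, body):
    return {"type": event_type, **body}


def to_v2(event_type, body):
    return {"event_type": event_type, "data": body}


class OrderService:
    """Owns its events in its own store; channels only carry formatted copies."""

    def __init__(self, publish, channel_formats):
        self._publish = publish                  # callable(channel, message)
        self._channel_formats = channel_formats  # channel name -> formatter
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE events ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, type TEXT, body TEXT)"
        )

    def record(self, event_type, body):
        # 1. Append to the service-owned event store (the source of truth).
        with self._db:
            self._db.execute(
                "INSERT INTO events (type, body) VALUES (?, ?)",
                (event_type, json.dumps(body)),
            )
        # 2. Publish a copy to each live channel version. A real system would
        #    need to make steps 1 and 2 reliable together; this sketch does not.
        for channel, fmt in self._channel_formats.items():
            self._publish(channel, fmt(event_type, body))

    def stored_events(self):
        rows = self._db.execute("SELECT type, body FROM events ORDER BY seq")
        return [(event_type, json.loads(body)) for event_type, body in rows]


published = []
service = OrderService(
    publish=lambda channel, message: published.append((channel, message)),
    channel_formats={"orders.v1": to_v1, "orders.v2": to_v2},
)
service.record("OrderPlaced", {"order_id": "o-1"})
print(service.stored_events())
print(published)
```

Retiring `orders.v1` then means removing one entry from `channel_formats` once consumers have moved; the stored events are untouched, so no channel has to be kept alive as a database.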

Your infrastructure code is more important than your service implementation code

The one area where we have seen most teams cut corners early is automation and infrastructure code. This always backfires, with severe consequences for long-term productivity. The reality of microservices is that infrastructure code and automation almost always outlive the code of individual services.

In a distributed system, the architecture is effectively the infrastructure & communication patterns, not the implementation code of individual services. I have sometimes made the analogy that a set of Kubernetes Helm charts often tells a lot more about the architecture of a distributed system than the code of any individual service.

Aggressively automate everything, be anal about the quality of infrastructure code and simplify at every opportunity. Accidental complexity and "hacks" will accrue faster than you think.

Invest in developer experience & productivity

For developers to be productive, they must be able to test and run their applications during development with a minimum of friction or time wasted. For distributed systems, there are generally two schools of thought: local development, or development in remote "feature environments".

For a larger system, running everything on a laptop may not be feasible. This in turn can be a driver towards using ephemeral "feature environments" for true end-to-end testing. Whichever way you go, local or remote, optimise aggressively for the developer experience: reduce build-deploy-run time and remove manual steps.

There are various open source solutions that try to address this problem, such as HashiCorp Waypoint, Skaffold, Garden and others. They all have slightly different focuses and address different needs, so whether to adopt one or build your own dev infrastructure depends on your needs.

Builds should take no more than 5 minutes, deployments no more than 10 minutes

One of the biggest factors in development and ultimately delivery velocity will be the speed of your build & deployment pipelines. A 10 minute turnaround allows up to 48 builds to an environment in an 8 hour day. A 30 minute turnaround allows 16. This difference will show in all sorts of ways: the ability to quickly collaborate with business stakeholders, to quickly resolve production issues, to quickly debug issues in specific environments, and to avoid productivity-sapping context-switching.
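The arithmetic behind those numbers is simple enough to write down. The sketch assumes builds run strictly back to back with no idle time, which is an upper bound rather than what a real day looks like; the function name is ours.

```python
def builds_per_day(turnaround_minutes: int, working_hours: int = 8) -> int:
    """Upper bound on sequential build-and-deploy cycles in one working day."""
    if turnaround_minutes <= 0:
        raise ValueError("turnaround_minutes must be positive")
    return (working_hours * 60) // turnaround_minutes


for minutes in (10, 15, 30):
    print(f"{minutes} min turnaround -> {builds_per_day(minutes)} builds")
```

With a 5 minute build plus a 10 minute deployment, the 15 minute turnaround allows at most 32 cycles a day under the same assumption.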

Optimising for time and reliability between committed code and that code landing in a deployed environment is one of the most important productivity and value investments you can make.

Invest in observability, or avoid microservices altogether

The formal definition of observability is "a measure of how well internal states of a system can be inferred from knowledge of its external outputs".

For software systems, the three pillars of observability are:

Logging - this should be familiar to everyone
Metrics - things like the number of successful requests vs errors, and key performance numbers of a system, such as active users or sales, compared to the numbers usually observed for similar time periods.
Tracing - being able to follow the lifecycle of a user request or external action on the system through the entire system.
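As a rough illustration of how the three pillars fit together in a single request handler, here is a sketch using only the Python standard library. The `checkout` logger name, the metric names and the `handle_request` function are hypothetical, and a `Counter` stands in for a real metrics client; production systems would normally use dedicated logging, metrics and tracing tooling.

```python
import contextvars
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("checkout")   # hypothetical service name
metrics = Counter()                   # stand-in for a metrics client
trace_id = contextvars.ContextVar("trace_id", default="-")


def handle_request(payload, incoming_trace_id=None):
    # Tracing: reuse the caller's id if there is one, otherwise start a new trace.
    trace_id.set(incoming_trace_id or uuid.uuid4().hex)
    started = time.perf_counter()
    try:
        if "order_id" not in payload:
            raise ValueError("order_id is required")
        metrics["requests.success"] += 1          # Metrics: count outcomes
        return {"status": "ok", "trace_id": trace_id.get()}
    except ValueError as exc:
        metrics["requests.error"] += 1
        # Logging: every line carries the trace id so it can be correlated.
        log.error("trace=%s rejected request: %s", trace_id.get(), exc)
        return {"status": "error", "trace_id": trace_id.get()}
    finally:
        elapsed_ms = (time.perf_counter() - started) * 1000
        log.info("trace=%s handled in %.2f ms", trace_id.get(), elapsed_ms)
```

Passing the returned `trace_id` along with any outgoing message or request is what lets the next service continue the same trace.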

The reason we want to invest in observability is quite simple:

We capture bugs, performance issues and other issues faster.
It helps in debugging: when we have issues, we can find the root cause, or the problematic areas, more quickly.
It helps confirm or falsify business hypotheses about functionality, improving product-market fit and customer conversion towards desirable business goals.
We can use observability to help us increase the velocity of reliable delivery to production and introduce new concepts such as canary releases aided by health metrics.

Early problem detection, fast problem resolution, testing of business hypotheses and faster, more reliable delivery are the core drivers for investing in observability.
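To show what a canary release aided by health metrics could look like, here is a small decision function. The thresholds (`min_requests`, `tolerance`) and the three outcomes are assumptions chosen for illustration, not recommended values; a real rollout would usually look at more signals than error rate alone.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline, canary, min_requests=500, tolerance=0.005):
    """Compare the canary's error rate with the stable version's."""
    if canary.requests < min_requests:
        return "wait"        # not enough traffic to judge yet
    if canary.error_rate > baseline.error_rate + tolerance:
        return "rollback"    # clearly worse than the stable version
    return "promote"


stable = HealthSample(requests=10_000, errors=50)
print(canary_decision(stable, HealthSample(requests=100, errors=0)))
print(canary_decision(stable, HealthSample(requests=1_000, errors=6)))
print(canary_decision(stable, HealthSample(requests=1_000, errors=30)))
```

Without trustworthy metrics for both versions there is nothing for such a function to compare, which is why this kind of delivery depends on observability being in place first.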

Domain Driven Design: platform concerns are also bounded contexts

Business needs should always drive technology. But this singular focus on "business features" can sometimes lead to a tangled mess. In the drive to deliver features, we bake in orthogonal concerns that should be separated out.

Typical examples are business services that include and are aware of permissions, authentication and/or authorization concerns, or services that know about the different channels through which they send messages to users. There will be numerous other examples, but the main point is that when we have a repetitive technical concern cropping up across services, it is time to reflect on whether this is in fact a platform concern deserving of its own bounded context.
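One way to picture the separation is a business service that depends on narrow interfaces owned by platform contexts instead of embedding permission rules or channel knowledge itself. All names below (`Authorizer`, `Notifier`, `RefundService`, the `orders:refund` action) are hypothetical, and the in-memory implementations exist only to make the sketch runnable.

```python
from typing import Protocol


class Authorizer(Protocol):
    """Owned by a platform context; knows roles, policies and permissions."""
    def is_allowed(self, user_id: str, action: str) -> bool: ...


class Notifier(Protocol):
    """Owned by a platform context; knows about email, SMS, push and so on."""
    def notify(self, user_id: str, message: str) -> None: ...


class RefundService:
    """Business logic only: no permission rules, no delivery channels."""

    def __init__(self, authorizer: Authorizer, notifier: Notifier):
        self._authorizer = authorizer
        self._notifier = notifier

    def refund(self, user_id: str, order_id: str) -> bool:
        if not self._authorizer.is_allowed(user_id, "orders:refund"):
            return False
        self._notifier.notify(user_id, f"Refund issued for order {order_id}")
        return True


# In-memory stand-ins for the platform services, for illustration only.
class StaticAuthorizer:
    def __init__(self, grants):
        self._grants = set(grants)  # set of (user_id, action) pairs

    def is_allowed(self, user_id, action):
        return (user_id, action) in self._grants


class RecordingNotifier:
    def __init__(self):
        self.sent = []

    def notify(self, user_id, message):
        self.sent.append((user_id, message))


notifier = RecordingNotifier()
service = RefundService(StaticAuthorizer({("alice", "orders:refund")}), notifier)
print(service.refund("alice", "o-1"), service.refund("bob", "o-2"))
print(notifier.sent)
```

The refund logic stays the same whether notifications go out by email or push, and permission rules can change in one place instead of in every business service.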

Microservice size and scope: can be rewritten by a handful of developers in less than a month

Any individual microservice should be optimised for disposability. Ideally, a few developers familiar with the domain should be able to rewrite it in a month or less.

Why? Firstly, if a microservice grows beyond this, it is an indication that it might have too many responsibilities. Secondly, we are guaranteed to make mistakes in our implementation, in our scope and in our analysis of where context boundaries should be drawn. Rather than sink cost into an ever-growing microlith, we should be able to be quick and agile about correcting course.

The best code is the code you don't have to write. The second best code is the one that can be replaced easily. All technical debt begins with the first line of code written on a system.

Summary

We intentionally did not try to enumerate every technical microservices pattern there is. These have been covered in detail in other source material. Rather, we tried to focus on the concrete points where we have repeatedly seen organisations struggle. For organisations new to adopting microservices, many of the above pitfalls will not be obvious until much later, usually at a point when correcting them will be orders of magnitude more expensive than getting it right from the outset.

Our firm opinion is that continuous improvement means improving at the first signs of friction, because friction does not tend to go away. Other priorities may take precedence in the short term, but friction in the areas we have mentioned above should never be ignored, or it will only grow over time.

Image credit

Chris Richardson consulting and microservices.io