Habits of highly effective engineering teams

Ends vs Means quadrant

What is the difference between high-performing engineering teams and unremarkable teams? How come some teams seem to effortlessly over-deliver, and do so while producing reliable software that rarely has production issues? Over the course of 20 years, I have observed some common patterns in how these teams operate. Allow me to share some conclusions!

When it comes to engineering performance, the rule that trumps absolutely all other rules is:

The cycle-time between hypothesis-idea generated and hypothesis tested in production must be as short as possible.

Now that we have that out of the way, it is worth mentioning first that the practices we will be going through are all complementary. Some are more important than others, but remove any one of them and you immediately weaken the integrity of the entire system.

Secondly, these are practical things any team or engineer can adopt with some support from the rest of the organisation, but some require a higher degree of self-discipline and judgment honed by years of experience and pattern recognition. However, there is no magic to it. In fact, much is remarkably simple. That being said, many of these subjects are worthy of their own posts, and perhaps, their own books, so view this post as an overview of the key practices that make up a high-performing team.

Interacting with "the Business"

Building things right is only half the challenge in software. If the team does not have the ability to listen to feedback and figure out how to build the right thing, all the technical ability in the world is wasted. In practice, a business-oriented mindset and good communication skills are just as important as technical ability.

Take ownership of problems, not JIRA-tickets

The image accompanying this blog post, a re-interpretation of Richard Hackman in his book Leading Teams (Harvard Business Press, 2002), explains this point.

Allowing people to own problems, rather than be handed solutions, is not always easy, though:

In top-down driven, low-trust organisations, the natural instinct is often to define tasks for engineers to be assigned and complete. This, however, is supremely disempowering and results in passive, unmotivated teams that do the bare minimum required. If you want to see a self-driven team, self-organised and full of energy, give them problems to solve that are concrete enough to know where to start, but open-ended enough to leave a fair amount of figuring out to do. OKR Key Results or Key Performance Indicators are frequently a good way to drive this, which gives a natural segue to the next part…

Data decides what we build: hypothesise, test, prove/disprove

As we alluded to in our first fundamental rule, being data-driven is key. It is quite common for a Product Manager or Product Owner to think they know best, and that they do not need data because they know what the customer wants or needs. This is particularly common if the Product Manager has previously been in the role of the intended customer. What this line of thinking fails to recognise is that no two people are the same. One person's experience is only anecdotal. It is only in large numbers that we can see patterns.

This has a few implications:

For a real feedback-cycle, we must measure.
For the feedback-cycle to be a feedback-cycle, we must listen to it and adapt to what we find.
To know what to measure, we must have an idea why we should measure it, and what sort of results we want to see.

In practice, this means our decision process in what we build should be:

Define and understand the problem.
Define the desirable outcome.
Define the metrics that indicate the desirable outcome.
Collect hypotheses on how to solve the problem/achieve the outcome.
Implement hypotheses, measure their outcomes against each other in a series of A/B-tests.
Pick the winners, and iterate until the metrics indicate you have achieved, or come close enough to, the desirable outcome to move on to other priorities.
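To make the "measure against each other" step concrete, here is a minimal sketch of comparing two variants on a conversion metric with a two-proportion z-test. The function name and the sample figures are illustrative assumptions, not data from a real experiment, and a real experimentation platform would also deal with sample-size planning and repeated peeking at results.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: control converts 200 of 5000, variant 260 of 5000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
variant_wins = z > 0 and p_value < 0.05
```

The point is less the statistics than the discipline: the metric and the decision threshold are agreed before the experiment runs, so the outcome decides, not the loudest opinion.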

The cycle of experimentation should be as fast as possible. The organisations that are best able to iterate towards optimal outcomes are the ones that are most likely to win in the marketplace.

Written down like this, it might seem common sense to approach building the right thing this way. But unfortunately, common sense is not that common. Very few organisations reach the enlightenment of data-driven, dispassionate decision-making, and instead tend towards massaging egos by following the Highest-Paid Person's Opinion ("HiPPO").

Interacting with the rest of the engineering organisation

We have previously written at length on how to scale engineering organisations. These rules also apply for how individual teams should function within a wider organisation. I therefore leave this section somewhat blank, and simply refer to our previous writing on "People APIs", which has some rules of thumb on team autonomy (lots!), team size (small), team boundaries (DDD) and how to align with other teams.

Fundamentals of Software Engineering (and beyond)

We have covered communication with other parts of the organisation, and decision-making. Not until now do we actually reach the technical part:

Restraint: "YAGNI" & "KISS"

This is perhaps the practice which is the least concrete and the hardest to master. It requires good judgement, which often only comes with experience. You Aren't Gonna Need It and Keep It Simple Stupid as principles are easy enough to grasp, but hard in practice.

The leverage you gain, by eventually mastering them, is however almost infinite: every line of code is eventually technical debt in a legacy system. Hence, the code that is never written is the best code. The effort you do not expend on superfluous detail can be redirected towards areas with higher pay-off for the effort.

In software, what you choose not to do is almost as important as what you choose to do. So select wisely. The high-performing engineering teams know this, and are aggressive about cutting scope and prioritising high-value, low-effort work above all.

Test-coverage & Property-based testing

Good test-coverage is table-stakes in modern software development. But teams frequently make two mistakes when testing:

They invest in slow, unreliable end-to-end (e2e) tests. This is almost always wasted effort, as the test-suites tend to break down quickly.
They only write example-based tests, which limits the number of input variations the tests are run against.

If you want a truly robust test-suite, we suggest you try adopting Fuzz-testing & Property-based testing. This approach has a slight learning curve, but it pays off quickly. It also becomes a productivity tool in its own right, as generating test-data and test-examples becomes more automated.
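As a rough illustration of the idea, the sketch below hand-rolls a property-based test using only the standard library. The run-length encoder is a made-up example subject, not code from any particular project. In practice a dedicated library (Hypothesis, in Python's case) would generate the inputs and shrink failing cases for you; this only shows the shape of a property.

```python
import random


def run_length_encode(text):
    """Compress 'aaab' into [('a', 3), ('b', 1)]."""
    encoded = []
    for char in text:
        if encoded and encoded[-1][0] == char:
            encoded[-1] = (char, encoded[-1][1] + 1)
        else:
            encoded.append((char, 1))
    return encoded


def run_length_decode(pairs):
    """Expand [('a', 3), ('b', 1)] back into 'aaab'."""
    return "".join(char * count for char, count in pairs)


def check_properties(trials=500, seed=0):
    """Check two properties against many randomly generated strings."""
    rng = random.Random(seed)  # seeded so any failure is reproducible
    for _ in range(trials):
        length = rng.randint(0, 30)
        # A tiny alphabet makes repeated runs, the interesting case, likely.
        text = "".join(rng.choice("ab ") for _ in range(length))
        encoded = run_length_encode(text)
        # Property 1: decoding an encoded string returns the original.
        assert run_length_decode(encoded) == text, repr(text)
        # Property 2: adjacent runs never share the same character.
        assert all(a[0] != b[0] for a, b in zip(encoded, encoded[1:]))
    return trials


check_properties()
```

Instead of a handful of hand-picked examples, the properties are exercised against hundreds of inputs, including the empty string and long runs that an example-based test might never think to include.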

Zero-tolerance for defects

Finding a defect in a mountain of defects and spaghetti-code is challenging, to say the least. So, why should we tolerate it? Successful teams have a near zero-tolerance for defects. When a genuine defect is found, it is a down-tools moment, where key team-members stop what they are doing and focus on fixing the issue immediately.

The only way to avoid having to deal with a mess is to not make one in the first place. Defects are unavoidable, but they shouldn’t be allowed to pile up over time.

Talking & pairing

Pair-programming is not widely adopted in industry, yet there are people who swear by it. One potential issue with pair-programming is that it is, quite frankly, mentally draining for all but the most extroverted of us.

But on the opposite end, we have Pull Request-driven development. This is suboptimal for other reasons: PRs are frequently delayed in review, or reviewed without sufficient context. Even if they are reviewed with full awareness of the available trade-offs, the review still happens at a point in development that is too late. If someone has already invested 1-2 days of effort into a PR, and peers find the design suboptimal, the cost of change is high, potentially as high as the initial investment.

We would suggest an alternative approach for non-trivial improvements that take more than 1-2 hours to complete, the “sandwich pair”: spend a bit of time up-front with a peer going through the problem and how you could solve it. Perhaps outline some code. Then, if you want to pair, pair! But if full-time pairing is not preferred (and that’s fine!), then instead of doing a PR-review, spend a bit of time 1-on-1 again on or just before completion, going through the solution together.

This frequently results in no time wasted on waiting for PR reviews, no time wasted on design or implementation that is suboptimal. And it solves the issue of context-less PR reviews; instead the peer-review is proper, thorough, with full context and can, due to the earlier conversation, focus on catching potential issues with design and implementation.

Greedily protect focus-time

Software engineering is a focus-demanding job that requires long stretches of concentration. Interruptions are poison to focus. So successful teams protect their focus time greedily.

How can you do this? Here are a few ideas:

Schedule at least 1 (maybe even 2-3) days a week with no meetings other than standup.
Schedule meetings around the beginning or end of the day, never across the day.
Demand every meeting invite has an agenda and clearly defined desired outcome, so you may judge whether your attendance is necessary.
Let people take ownership of certain “themes”, so that not all engineers are required in every meeting. Share the ownership load.

Operational practices

"You build it, you run it"

Amazon CTO Werner Vogels popularised the term “You build it, you run it”. We wholeheartedly agree:

No one will be able to diagnose issues and defects in a system faster than the engineers who built it.

Nothing focuses the mind and incentivises the discipline to build reliable software quite like being on the hook to be woken up at 2am to fix it when it isn’t.

Yet another benefit is that being on the hook for running the system means engineers will be less likely to cut corners on operability and observability. No one wants to leave something requiring 10 manual steps in the right order if they have to do them every day, or every other day. Google say they invented Site Reliability Engineering (“SRE”) to apply software engineering principles to operations. “You build it, you run it” takes this to its ultimate conclusion.

However, “you build it, you run it” does not imply that a developer team needs to know all about ops. This is what self-serve and paved, golden paths are for, in larger organisations usually provided by a Platform team. Platform or DevOps teams should help by providing templates for common problems, so teams don’t have to reinvent the wheel. But they should never be a blocker. While common solutions should be provided, when teams need to “paint outside the lines”, they should be empowered to go it alone. Team autonomy, ownership, and empowerment rule all.

Ship small, ship frequently

The ideal batch-size of changes is 1.

Shipping is like a muscle: the more you exercise it, the stronger it gets. The best way to deliver software is to do it frequently, in small increments. This keeps change risk low, and when defects are introduced, they are easy to track down.

Shipping big batches of changes, infrequently, is about the most dangerous practice that persists in industry. It is the path that leads to multi-day outages, defects that cost millions, and in some cases, catastrophic failures that can take down an entire company.

Even if you are not good at it yet, just try shipping as small and as frequently as possible. You will get good at it. The mistakes will make you discover the flaws in your ways of working. The friction around releasing will pinpoint which parts need to be automated further and better. Just ship it!

Observability & SRE practices

Observability & SRE are big enough to warrant a book. In fact, go read this on observability. Or this on SRE.

But, without deep-diving into the subjects worthy of their own several books, here is what we’d suggest:

Make sure to instrument your services properly with the three building blocks (logs, traces, metrics), and connect them appropriately. Create templates for dashboards, so that anomalies can easily be identified without relying on user-reported bugs.
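One way to read "connect them" is that every signal a request produces should carry identifiers that let you move from one signal to another. The standard-library sketch below is only an illustration under that assumption: the `handle_request` wrapper, the in-memory `REQUEST_COUNT` counter and the field names are all hypothetical, and a real service would more likely use an instrumentation library and a metrics backend.

```python
import json
import logging
import time
import uuid
from collections import Counter

logger = logging.getLogger("service")

# Stand-in for a real metrics client: counts requests per (route, status).
REQUEST_COUNT = Counter()


def handle_request(route, handler, trace_id=None):
    """Run handler() and record a metric plus a structured log line for it."""
    # Reuse the caller's trace id if there is one, so logs join up with traces.
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    status = "ok"
    try:
        return handler()
    except Exception:
        status = "error"
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        # The metric and the log share the same route and status labels.
        REQUEST_COUNT[(route, status)] += 1
        logger.info(json.dumps({
            "event": "request_finished",
            "route": route,
            "status": status,
            "trace_id": trace_id,
            "duration_ms": round(duration_ms, 2),
        }))


result = handle_request("/checkout", lambda: "order-created", trace_id="abc123")
```

A spike in the error count for a route on a dashboard can then be narrowed down to the matching log lines, and from each log line's trace id to the full trace of that request.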

Connect observability to alerting. Create Service Level Objectives (“SLOs”) with Error Budgets, such as “99% availability with no errors”.
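The arithmetic behind an Error Budget is simple enough to sketch. Assuming a request-based SLO (the share of requests that must succeed over a period), the budget is whatever the target leaves over. The function name and the traffic figures below are illustrative assumptions.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise how much of a period's error budget has been consumed."""
    # A 99% target leaves 1% of requests as the allowed failures.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed_fraction = failed_requests / allowed_failures
    else:
        consumed_fraction = float("inf")
    return {
        "allowed_failures": allowed_failures,
        "consumed_fraction": consumed_fraction,
        "exhausted": failed_requests >= allowed_failures,
    }


# Hypothetical month: one million requests against a 99% SLO.
healthy = error_budget(0.99, 1_000_000, 4_000)
overspent = error_budget(0.99, 1_000_000, 12_500)
```

In the first case roughly 40% of the budget is spent and feature work can continue. In the second the budget is exhausted, which is the kind of signal an alert or a release gate can be wired to.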

Apart from the mechanics, this also implies adopting parts of the SRE toolkit, such as:

At the latest, when an Error Budget is used up for a time-period, it is an automatic signal to switch focus to reliability work, so the Error Budget is not used up for the same root causes in the future.
Not until the causes of the Error Budget are believed to be addressed do you release new features.
Likewise, if reactive operational matters, such as releases, eat up a significant percentage of someone's time (50% or more), it is a signal to automate the causes of the toil. Ideally, we’d like to see the toil percentage well, well below 50%. Any organisation where ops/DevOps/SRE staff spend more than 50% of their time reactively fixing things, rather than engineering solutions, is showing a massive red flag.

While we recommend organisations adopt SRE practices and tools, we do not recommend adopting SRE teams. We believe a Platform approach combined with “You build it, you run it” drives better outcomes.

Feature-flag all the things

Apart from small-batches, one way we can further reduce the risk of releasing is through feature-flagging.

We can use this to turn off features which are not yet complete (ship unfinished code for significant changes!). We can use it to turn on new features only for a subset of users, so that we can verify that there are no adverse effects by observing telemetry and business metrics. We can even use feature-flagging for release management, so engineers can release at will, but if Product Managers want to create some fanfare and timing for a given feature, they still have control to do just that.
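A minimal sketch of how such a flag might work is shown below, assuming a flag is just a small record with a kill switch and a rollout percentage. The flag layout and function names are hypothetical, and most teams would use an existing feature-flag service rather than roll their own.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map a user to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag, user_id):
    """Decide whether a user sees the feature behind this flag."""
    if not flag["enabled"]:  # kill switch: off for everyone
        return False
    return bucket(flag["name"], user_id) < flag["rollout_percent"]


# Hypothetical flag: code is shipped, but only 10% of users see it.
new_checkout = {"name": "new-checkout", "enabled": True, "rollout_percent": 10}

if is_enabled(new_checkout, user_id="user-42"):
    pass  # render the new checkout flow
else:
    pass  # keep the existing flow
```

Hashing the flag name together with the user id keeps each user's experience stable between requests, and means different flags do not all pick the same 10% of users. Raising `rollout_percent` widens the audience without a deploy, and setting `enabled` to false turns the feature off for everyone.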

Next steps

This is a lot to take in, but how do we know we are making progress? What indicators should we be looking for?

Metrics & Indicators will be the subject of an upcoming post. If you want to know when it is out, I suggest you take the call-to-action below.