Software engineering metrics that matter

Measuring software engineers and teams by metrics is fraught with controversy, and rightly so. There are many horror stories of individual developer productivity being measured by Lines of Code, number of commits and other activity-based metrics. This is usually ill-advised, but it doesn’t mean that metrics don’t have a place in software engineering.

Let’s look at a small set of metrics we’ve found useful!

Why you should measure, and why you shouldn’t

First, let’s address the elephant in the room: never try to measure individual developer productivity based on metrics. Outside of outliers, like someone never committing anything, such metrics miss a lot of nuance. If, for instance, you measure the number of commits or the number of PRs per individual, you create an individualistic metric that disincentivizes teamwork. Developers who pair with others, mentor, or unblock their co-workers get penalized.

Secondly, activity-based metrics do not capture other qualities, such as communication skills, judgement and decision-making. One of the greatest sources of leverage in software engineering is deciding what NOT to do. De-scoping functionality and focusing on the bare essentials has an outsized impact on outcomes. Yet this is not something that you can easily attach a metric to.

Furthermore, you should be careful about comparing teams. With the metrics we propose, there is some ability to compare, but this should be done with careful consideration: team size, domain complexity, and several other factors will impact even the most objective metrics. No two teams or domains are alike.

Above all, metrics should be used to compare a team to itself: to track the progress it makes and to spot warning signs that it may be slipping in some area. Metrics are a tool for self-improvement, not a tool for stack-ranking individuals and teams.

With that said, what metrics should we care about?

DORA metrics/“Four Key metrics”

“DORA” metrics come from DevOps Research and Assessment, and were popularised through the research conducted by Nicole Forsgren, Jez Humble and Gene Kim, which became the highly recommended book “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”. For a true in-depth look at these metrics, and many other drivers of engineering performance, we highly recommend you pick up and read the book!

The Four Key metrics are:

Deployment Frequency: How often an organization releases to production.
Lead-Time For Change: The amount of time it takes a commit to get into production.
Change Failure Rate: The percentage of deployments causing a failure or incident in production.
Time To Restore Service: How long it takes an organization to recover from a failure in production.
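
As a rough illustration of how these four definitions translate into numbers, here is a minimal sketch in Python. The `Deployment` record and its field names are our own hypothetical shape, not something prescribed by DORA, and we assume for simplicity that a failure starts at the moment of the deployment that caused it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    caused_failure: bool = False            # did it cause a failure or incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the Four Key metrics over one reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.caused_failure]
    # Assumption: the failure begins when the faulty deployment goes live.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

We use the median rather than the mean here so that a single very slow change or very long outage does not dominate the figure; that is a design choice for this sketch, not a requirement.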

Why do we think these metrics matter? Quite simply: measured correctly, they are difficult to game, and they capture both the reliability of a service and the level of friction in deploying it remarkably well. Finally, Deployment Frequency is a reasonably accurate proxy for team velocity.

This is really what we should look for in metrics: How robust are they against perverse incentives to game them? How objective/subjective are they? Are they a reasonable proxy for an outcome we are looking for? Remember, where metrics are concerned, activity is not achievement.

SRE metrics

The SRE metrics are detailed in depth in Google’s freely available SRE books. Again, for an in-depth guide, read them.

The reason we care about SRE metrics is that they, too, capture the reliability of a system, as well as the amount of reactive “fixing” work (“toil”) required to keep systems running. Furthermore, the metrics provide a mechanism for resolving the conflict between feature work and reliability work, and for overcoming the release hesitance that is common in organisations with a history of failed releases.

Service Level Indicator, Service Level Objectives, Service Level Agreements

Three initial concepts underpin SRE metrics. They are not metrics themselves, but they are valuable to understand:

Service Level Indicators (“SLI”) are the top-level concept: they are simply measurements that may indicate service quality, for instance latency or error rate/availability.
Service Level Objectives (“SLO”) are, for our purposes, the key concept: an SLO is effectively the level of an SLI that we are aiming for. For instance, 99% availability if we are looking at an availability SLI.
Service Level Agreements (“SLA”) are a contractual commitment to an SLO, which typically carries technical, legal and commercial implications.

As we proceed, the most important thing to understand is the relationship between SLI and SLO.

Error budgets

Error budgets are the measure that we monitor and alert on in our service observability. If we have an SLO of 99% availability, meaning for example that 99% of requests to a web server should return non-5xx response codes, we have a 1% error budget. If we subsequently make a bad deployment that pushes the error rate over a measuring period (say 30 days) to 0.5%, we have burned through 50% of our error budget.
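
The arithmetic above can be sketched in a few lines of Python. The function name and parameters are our own, chosen for illustration, and the request counts in the usage note are made up.

```python
def error_budget_consumed(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget used up in the measuring period.

    slo is the target success ratio, e.g. 0.99 for 99% availability.
    A result of 1.0 means the whole budget is gone.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        return 0.0
    # The error budget expressed as a number of requests allowed to fail.
    allowed_failures = (1 - slo) * total_requests
    return failed_requests / allowed_failures
```

With a 99% SLO, 1,000,000 requests in the period and 5,000 failures (a 0.5% error rate), `error_budget_consumed(0.99, 1_000_000, 5_000)` comes out at roughly 0.5, i.e. about half of the budget, matching the example in the text.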

To make sure we catch these issues, our alerts should be tuned to fire quickly when the error burn rate within a short window (say 5 and/or 15 minutes) goes above a meaningful threshold. This way, we connect our error budgets to our observability and ultimately to our alerting.
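
One way to express such an alert is as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is an illustration only. The function names are ours, the threshold of 10 is an arbitrary example value rather than a recommendation, and we assume each window is summarised as a `(total_requests, failed_requests)` pair.

```python
def burn_rate(slo: float, total_requests: int, failed_requests: int) -> float:
    """Observed error rate relative to the allowed error rate.

    1.0 means the budget would last exactly the measuring period;
    higher values mean it is being spent faster than that.
    """
    if total_requests == 0:
        return 0.0
    error_rate = failed_requests / total_requests
    return error_rate / (1 - slo)


def should_alert(slo: float, short_window: tuple, long_window: tuple,
                 threshold: float = 10.0) -> bool:
    """Fire only when both windows burn faster than the threshold.

    Each window is a (total_requests, failed_requests) pair, e.g. for the
    last 5 and the last 15 minutes. Requiring both reduces noise from
    very brief spikes. The default threshold is an example value.
    """
    return (burn_rate(slo, *short_window) >= threshold
            and burn_rate(slo, *long_window) >= threshold)
```

Requiring both windows to agree is one possible design; alerting on either window alone fires sooner but is noisier.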

If we burn through our entire error budget for a measuring period (usually 30 days), we make no more feature releases until the underlying causes of the burned error budget have been corrected with reliability work and a reliability release. This is effectively our mechanical rule for balancing priorities between reliability work and new feature work.

Percentage toil-work

This is less a metric and more a rule of thumb: if people with ops-centric roles spend 50% or more of their time working reactively (fixing environments, fixing production, responding to requests, etc.), this is a red flag. If the amount of “toil” is more than 50%, efforts should be made to automate the toil away until it is lower. A higher rate will lead to burnout, and it also shows serious operational deficiencies in the organisation.

Other metrics

Apart from DORA and SRE metrics, we have a few other metrics that are worth keeping an eye on:

Flow time

Flow time is similar to DORA’s “Lead-Time For Change” (LTFC), with the exception that LTFC is counted from committed change to production, whereas flow time is counted from accepted work item (planned, soon to be worked on, but not started) to production. Flow time can therefore give a little more nuance, because it also covers the time from a new feature being accepted and refined until work on it starts.
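
To make the difference between the two clocks concrete, here is a small sketch. The `WorkItem` record and its field names are hypothetical; in practice the timestamps would come from your issue tracker and deployment pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkItem:
    """A work item with the three timestamps we care about (hypothetical shape)."""
    accepted_at: datetime      # item accepted and planned, work not yet started
    first_commit_at: datetime  # first commit for the item
    deployed_at: datetime      # change live in production


def flow_time(item: WorkItem) -> timedelta:
    """Accepted work item to production."""
    return item.deployed_at - item.accepted_at


def lead_time_for_change(item: WorkItem) -> timedelta:
    """Committed change to production."""
    return item.deployed_at - item.first_commit_at
```

The gap between the two values for the same item is the time it spent waiting and being refined before any code was committed.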

Flow distribution

Flow distribution, again, gives more nuance to DORA’s Change Failure Rate: flow distribution, simply put, counts the distribution of work between bugs, features, operability improvements and any other categories of work you may have. It is a useful indicator of whether enough effort is going into testing and reliability, and of whether the system is operable enough.
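
Computing a flow distribution is little more than counting. A minimal sketch, assuming each completed work item carries a single category label (the labels in the usage note are examples, not a fixed taxonomy):

```python
from collections import Counter


def flow_distribution(categories: list[str]) -> dict[str, float]:
    """Return the share of completed work items per category."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}
```

For example, a period with five `"feature"` items, three `"bug"` items and two `"operability"` items gives shares of 0.5, 0.3 and 0.2 respectively. Tracking how these shares move from period to period is more informative than any single snapshot.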

PR lead-time & rework

We would include two metrics in this:

PR lead-time (time from open until merged/closed)
PR rework (number of PRs with changes made after review)

We generally advise against a pure Pull Request-based workflow unless necessary. Why?

Because pull requests come too late: the cost of change, cost of rework, cost of interruption, and context switches for author and reviewer alike make pull-request-based workflows expensive.

A little discussion and design upfront, followed by a live, 1-on-1 review at the end can take mere minutes, sometimes saving hours and days of back-and-forth around a pull request.

That being said, it is quite likely that, even when adopting this approach, you will have notional PRs before merging to main. In either case, measuring Pull Request lead-time (time from PR opened until merged) is a start, but does not reveal the entire picture of health around reviews.

We would use PR lead-time as a top-level indicator, but also check how many rounds of comments and commits are required after raising a PR and before merging it, to drill further into the health of change-flow.
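
Both PR metrics can be derived from a handful of fields per pull request. The sketch below uses a hypothetical `PullRequest` record; how you populate it depends on your code host, and we deliberately do not assume any particular API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class PullRequest:
    """Minimal PR record (hypothetical shape)."""
    opened_at: datetime
    closed_at: datetime        # merged or closed
    commits_after_review: int  # commits pushed after the first review


def pr_metrics(prs: list[PullRequest]) -> dict:
    """Median PR lead-time plus the count and share of PRs needing rework."""
    if not prs:
        raise ValueError("need at least one pull request")
    lead_times = [pr.closed_at - pr.opened_at for pr in prs]
    reworked = sum(1 for pr in prs if pr.commits_after_review > 0)
    return {
        "median_lead_time": median(lead_times),
        "reworked_prs": reworked,
        "rework_rate": reworked / len(prs),
    }
```

A rising rework rate alongside a rising lead-time suggests that feedback is arriving too late in the process, which is the problem described above.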

Other potential indicators

Error log volume

We are big proponents of observability, but with great observability come great costs. At least if you log promiscuously. We would suggest log volume per service as an additional health check on services.

We would further recommend that all logging except error logging should only be temporary for debugging purposes (ideal state is 0 INFO or DEBUG logs). And error-logging should be reserved for truly exceptional circumstances out of the control of the system, or indicating defects in the system, rather than something that is generated as part of just running a service normally.

This being said, the ideal volume of logs in general, and error logs in particular, should be exceedingly low.
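
A simple way to keep an eye on this is to count log records per service and level. The sketch below assumes structured logs in JSON-lines form with `service` and `level` fields; those field names are an assumption for illustration, so adjust them to whatever your logging setup actually emits.

```python
import json
from collections import Counter
from typing import Iterable


def log_volume(lines: Iterable[str]) -> Counter:
    """Count log records per (service, level) from JSON-lines log output."""
    volume: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed field names; not a standard.
        volume[(record["service"], record["level"])] += 1
    return volume
```

Any service whose ERROR count is persistently above zero, or whose INFO and DEBUG counts keep growing, is a candidate for a closer look.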

Team health/psychological safety

A team that feels psychologically safe, and feels they have control over their work and ways of working, is a team that will raise genuine concerns, give each other honest feedback, and help each other continuously grow and improve.

Doing an occasional anonymous Team Health-check, as well as a psychological safety survey, will help catch any warning signs. Please consider doing these on a monthly or quarterly basis. The Spotify team health check is a good starting point, as is the Fearless Organization’s psychological safety survey.

Conclusion

To summarize what we wrote in the introduction to this post:

Do:

Use metrics to identify areas of improvement in a team.
Use metrics as an early warning system to potential issues.
Compare a team to itself.

Don’t:

Try to break down metrics to the individual developer level!
Compare teams to each other without careful consideration; this is frequently apples and oranges.
Rank teams or individuals based on metrics.