If you've ever managed any software project, you've probably asked yourself: how could our teams move faster? How fast are we moving today?
For these kinds of questions, it's tempting to turn to metrics. After all, we use metrics routinely and successfully when we develop software: performance, production load, and uptime metrics, as well as metrics based on customer behavior, like conversion and retention. These metrics don't just provide visibility. More importantly, they create a feedback loop: we can make a change intended to improve something, then use metrics to see how much improvement there really was. Developer wisdom says that every software performance optimization must start with a measurement, which makes perfect sense.
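To show how natural that loop feels in practice, here's a minimal sketch of the measure-first habit in Python. The helper and the workload are made up for illustration; the point is only the measure, change, measure-again cycle:

```python
import time

def measure_ms(fn, *args, repeats=5):
    """Run fn several times and return the best wall-clock time in ms."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best * 1000

# Measure, make a change, measure again: the metric closes the feedback loop.
data = list(range(100_000, 0, -1))
print(f"baseline: {measure_ms(sorted, data):.1f} ms")
```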
Since metrics are so helpful, can't we apply them to software development speed? Better development processes should improve development output, so output metrics might tell us whether a process change really helped. So then, what metrics could we use?
Development speed is the amount of work produced per unit of time, so we need to measure both output and time. Measuring time is simple, no problem there. What about work output? Attempts to measure it are as old as software itself, and through the years one thing has held true: whenever we decide a metric measures work output, something unintended soon follows. Reward lines of code, and the code bloats. Reward the number of commits, and the commits get smaller. Reward the number of completed tasks, and the tasks get sliced ever thinner.
Developers are smart. They specialize in cracking complex problems. Give them any metric, and they'll find the easiest way to improve it, and that easiest way most likely won't correlate with work quality or the desired project outcome. This doesn't necessarily mean developers will game the metric; that depends on the context and how strong the incentives are. But one thing is certain: developers will realize that the measure of their productivity is disconnected from what matters. That is not only frustrating, it also distracts them from doing the real work.
Why do metrics work so well for software products, and not for measuring developer output? Is it some kind of developer conspiracy? Actually, if you look outside of software development, you'll find more examples where metrics work well, and where they don't.
Where metrics work well: manufacturing and sales. Take the manufacture and sale of cups. You can measure production output (the number of cups produced per unit of time) and production quality (the percentage of cups that fail quality control). On the sales side, you can measure sales volume and profit margin. These metrics are genuinely helpful for management. For example, the goal for the manufacturing department can be to improve the percentage of cups that pass quality control while keeping unit costs low, and the head of sales can aim to increase sales volume or improve profit margins. Improvements in these metrics are good for business, so we can also treat them as a measure of the efficiency of the corresponding departments.
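To make this concrete, here's a back-of-the-envelope calculation with invented numbers (nothing here describes a real factory):

```python
# Manufacturing: output and quality.
produced, passed = 10_000, 9_600
defect_rate = 1 - passed / produced        # 4.0% failed quality control

# Sales: volume and margin.
units_sold, price, unit_cost = 9_000, 3.50, 2.10
revenue = units_sold * price
margin = (price - unit_cost) / price       # 40% profit margin per cup

print(f"defect rate: {defect_rate:.1%}, "
      f"revenue: ${revenue:,.0f}, margin: {margin:.0%}")
```

Every quantity above is built from interchangeable units (identical cups, identical dollars), which is exactly why these metrics behave so well.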
Where metrics don't work: measuring scientific output. Scientists publish their research in articles, and science has its own metrics for work output and quality: the number of articles published, the number of citations, the statistical significance of results. Can we say that a scientist who published 10 articles produced twice as much value as one who published 5? Unlikely. Works differ too much in their value; even without numbers, it's often hard to say which of two results is more valuable. And because gaming one's publication and citation counts is a well-known problem in the scientific community, those numbers aren't considered reliable indicators of productivity. Statistical significance has its own issues: p-hacking is a widespread problem.
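To see why significance alone is a weak signal, here's a small self-contained simulation of the multiple-testing flavor of p-hacking. The setup is invented for illustration: both groups are drawn from the same distribution, so every "significant" result is pure noise, yet running enough tests reliably produces some:

```python
import random
import statistics
from math import erf, sqrt

def p_value(a, b):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(42)
experiments, false_positives = 100, 0
for _ in range(experiments):
    # Both samples come from the SAME distribution: there is nothing to find.
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if p_value(a, b) < 0.05:
        false_positives += 1

# Expect roughly 5 "discoveries" out of 100 null experiments. Run many
# tests and report only the hits, and significance is easy to manufacture.
print(f"{false_positives} of {experiments} null experiments look 'significant'")
```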
In any context, metrics that work well share two important criteria:

- They relate directly to value: improving the metric means producing more of what the business actually wants.
- They measure consistent values: one unit of the metric is interchangeable with any other.
Let's take a look at the examples above:
Metrics in manufacturing and sales satisfy both criteria. In cup manufacturing, value is embodied in the cups themselves, and since they're mass-produced, the cups are identical. In sales, value is measured in dollars; the business goal is to make money, so the metric's relationship with business value is as direct as possible. And since one dollar is equal to any other, money-based metrics measure consistent values.
In science, these criteria aren't satisfied. There is no metric that measures the value of scientific results directly; we have only indirect metrics, like article and citation counts, which can be gamed. And these metrics aren't consistent either, because publications are not interchangeable units.
What do we have to measure developer output? Lines of code, number of commits, number of tasks completed, man-hours, story points… Check these against the two key criteria above, and you'll find that:

- none of them relates directly to value: more code, more commits, or more hours don't mean more of what users or the business need;
- none of them is consistent: one line of code, one task, or one story point can differ from another by orders of magnitude in difficulty and importance.
It's no surprise that none of these metrics work well.
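A toy illustration of that inconsistency: the two functions below do exactly the same work, yet one is several times longer in lines of code (and, depending on who writes them, commits and story points). Which developer was more productive?

```python
def total_even_verbose(numbers):
    """Sum the even numbers: six lines of code."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n
    return total

def total_even_terse(numbers):
    """Sum the even numbers: two lines of code, same value delivered."""
    return sum(n for n in numbers if n % 2 == 0)

assert total_even_verbose(range(10)) == total_even_terse(range(10)) == 20
```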
Why don't we have metrics for developers that relate directly to value? For the same reason we don't have any for scientists. Developers, just like scientists, always create something new. They don't write the same code again and again; that wouldn't make sense, since code can be re-used in a variety of ways: extracted into a module or a library, or, as a last resort, simply copied and pasted. Every developer's workday is unique. Even when developers solve similar problems, they solve them in a different context, or in a new way, each time.
Of course, no one today seriously talks about measuring developers' output in lines of code. There should be something more modern, right?
The book Accelerate, published in 2018, presents research on some 2,000 organizations of different sizes. The goal of the research was to identify which metrics differentiate high performers from low performers. Here's what they found: four key metrics of software delivery performance.

- Lead Time: how long it takes to go from code committed to code successfully running in production.
- Deployment Frequency: how often the organization deploys to production.
- Mean Time to Restore (MTTR): how quickly service is restored after an incident.
- Change Fail Percentage: what share of changes to production result in failures.
Source: Nicole Forsgren, Jez Humble, and Gene Kim, "Measuring Performance," in Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations
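For concreteness, here's a sketch of how these four metrics might be computed from a team's deployment records. The record shape is invented for illustration, not taken from the book:

```python
from datetime import datetime
from statistics import mean

# Hypothetical deployment log: when code was committed, when it was deployed,
# whether the change failed in production, and when service was restored.
deployments = [
    {"committed": datetime(2024, 1, 1, 9),  "deployed": datetime(2024, 1, 2, 15),
     "failed": False, "restored": None},
    {"committed": datetime(2024, 1, 3, 10), "deployed": datetime(2024, 1, 3, 18),
     "failed": True,  "restored": datetime(2024, 1, 3, 19)},
]
period_days = 30

# Deployment Frequency: deployments per day over the period.
frequency = len(deployments) / period_days

# Lead Time: commit-to-production delay, averaged. Note the inconsistency the
# article points out: a one-line fix and a month-long feature each count once.
lead_time_h = mean((d["deployed"] - d["committed"]).total_seconds() / 3600
                   for d in deployments)

# Change Fail Percentage: share of deployments that caused a failure.
fail_rate = sum(d["failed"] for d in deployments) / len(deployments)

# Mean Time to Restore: how quickly failed deployments were fixed.
restore_h = [(d["restored"] - d["deployed"]).total_seconds() / 3600
             for d in deployments if d["failed"]]
mttr_h = mean(restore_h) if restore_h else 0.0

print(f"freq={frequency:.2f}/day  lead={lead_time_h:.1f}h  "
      f"fail={fail_rate:.0%}  mttr={mttr_h:.1f}h")
```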
We can see four metrics here. Let's see how each of them relates to value, and whether it's consistent:

- Lead Time relates to value only indirectly (delivering the wrong thing faster isn't value), and it's inconsistent: tasks differ wildly in size, so one lead time isn't comparable to another.
- Deployment Frequency counts deployments, not value, and deployments aren't interchangeable: one ships a critical feature, another a one-line tweak.
- Mean Time to Restore measures recovery from failures, not development output, and incidents differ too much for the average to be consistent.
- Change Fail Percentage measures a kind of quality rather than value produced, and changes aren't uniform units either.
Bottom line: all four metrics are inconsistent, and they don't always relate directly to value. Are they prone to gaming? Sure. Just ship trivial changes as frequently as possible, and every metric except Lead Time will look great.
As for Lead Time, even if we ignore the (important) fact that it's inconsistent, setting it as a goal would push teams to prioritize the simplest customer requests and ignore everything customers didn't ask for: refactoring, tests, and all the improvements customers hadn't thought of.
That's why I wouldn't recommend using these metrics as development goals.
You might say: wait, just because good metrics haven't been found yet doesn't mean they can't be found at all! People are smart; they'll come up with something new that works better. Well, I'm afraid they won't. There's a fundamental reason why we don't have good metrics for developer performance. Good metrics would have to satisfy the two key criteria:

- relate directly to value;
- measure consistent, interchangeable values.
We can't measure developers' output directly, because their results are always different: each task and project has unique requirements, so there are no repeating results, and without repeating results we simply have no reliable foundation for measurement. All we have are indirect metrics, which don't always correlate with value and are prone to gaming. Using them as goals ends up causing more harm than good.
Metrics are convenient because they provide a feedback loop: you learn whether your changes improved anything. Without metrics, the feedback loop is less straightforward, and sometimes you may even feel you're flying blind. There's a famous saying attributed to Peter Drucker:
If you can't measure it, you can't manage it.
This isn't true, though. According to the Drucker Institute, Drucker never actually said it, and he was under no illusion that a metric can be found for everything. Not everything that matters can be measured, and not everything that can be measured matters.
Not having good metrics doesn't mean we can't improve development speed. Some companies clearly build software faster than others without sacrificing quality, so improvement must be possible.
You can and should improve your software product with metrics. Performance metrics, like latency or CPU load, reliability metrics, like uptime, and user behavioral metrics, like conversion or retention, are your friends.
However, you shouldn't rely on metrics when trying to boost development speed, because there are no good metrics for that. We can measure lots of things, but everything we can measure either doesn't relate directly to value, or doesn't yield consistent values, or both. Set goals based on such metrics, and nothing good happens.
But don't worry, there's hope! The lack of good metrics for development speed doesn't mean we can't go faster. We definitely can. One of the most important things that can help us develop faster is improving communication between developers and managers. In the article linked above, we talk about why this is important and give concrete examples of what can be improved and how.