The end state of traditional engineering metrics and the messy transition getting us there
I’ve spent the past few years watching enterprise eng leaders try, with increasing urgency, to figure out how to measure their teams’ productivity. A few patterns (and mistakes) keep showing up.
First, let’s acknowledge the obvious: we’re in a strange period for software engineering measurement. The metrics we leaned on for the last decade, including PR throughput, time to first review, and deployment frequency, still show up on every dashboard. But the assumptions underneath them are fracturing. AI is changing how code is written, reviewed, tested, and shipped, and the metrics that survive this transition will look very different from what most engineering orgs measure today.
This is my read on where AI engineering metrics are heading, what’s changing in the meantime, and what engineering leaders should pay attention to right now.
TL;DR:
- The end state for measuring engineering productivity will be a single north star metric: time from insight to product in the customer’s hands. Everything else is a stepping stone.
- AI adoption metrics are transitional. Once everyone uses AI, they’ll stop being a meaningful signal.
- Bottlenecks won’t sit still. As AI absorbs or augments each phase of the SDLC, the constraint relocates fast, sometimes within a single quarter. Our research is already showing this.
- Optimizing the inner loop with AI can shift issues to the outer loop. That’s a critical failure mode to watch out for.
- Team-specific metrics in isolation are done. Handoffs between teams are where the next wave of pain appears.
What will engineering metrics look like in a fully AI-automated SDLC?
If you fast-forward past the current AI transition, the only metric that will matter is the time it takes from identifying a customer pain point to having a high-quality, productized solution in that customer’s hands. Everything in between—whether it’s code reviews, CI, QA, design, PM scoping, or deployments—is assumed to be automated underneath. Each person in the SDLC will effectively manage a team of AI agents that handles their slice of the pipeline. This means the AI transition isn’t just affecting developers; DevOps, PMs, and design will all get pulled into the same end-to-end measurement.
That’s where software engineering is headed. The question is how we get there without breaking things along the way.
Proxy metrics vs. outcome metrics in AI engineering
Right now, the industry leans heavily on proxy metrics, such as AI adoption rates, AI usage per developer, and acceptance rates. The implicit logic: more AI use means greater productivity. That logic is fine as a transitional signal, but it has an obvious expiration date. Once everyone is using AI (or, more likely, once AI is authoring all new code), tracking AI adoption will be like tracking IDE usage; it will yield zero actionable insight.
PR count per developer is in a similar spot. As a standalone number, it never meant much. And as teams begin to shrink because AI absorbs work that used to need multiple engineers, it’ll mean even less. The question worth asking: is output per team up 2x, 5x, 10x relative to headcount? If not, the AI investment isn't landing where it needs to.
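To make the "2x, 5x, 10x relative to headcount" question concrete, here is a minimal sketch of the arithmetic. The function name and the sample numbers are my own illustration, not data from any real team, and "output" stands in for whatever outcome unit you already count (features shipped, customer bugs fixed).

```python
def output_per_head_multiplier(baseline_output, baseline_headcount,
                               current_output, current_headcount):
    """Return how many times output per engineer has grown versus a baseline period."""
    if baseline_headcount <= 0 or current_headcount <= 0:
        raise ValueError("headcount must be positive")
    if baseline_output <= 0:
        raise ValueError("baseline output must be positive")
    baseline_rate = baseline_output / baseline_headcount
    current_rate = current_output / current_headcount
    return current_rate / baseline_rate


# Hypothetical numbers: 12 features from 10 engineers before,
# 18 features from 5 engineers now.
multiplier = output_per_head_multiplier(12, 10, 18, 5)
print(f"Output per engineer: {multiplier:.1f}x baseline")
```

The point of normalizing by headcount is that a shrinking team can ship the same absolute output and still represent a large gain, which a raw PR or feature count would hide.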
The one exception is traditional outcome metrics. New features shipped, customer bugs fixed, and products launched are metrics that I expect to stay highly relevant, because they measure what the business actually cares about. The proxy layer above them is what’s getting rapidly compressed.
Engineering bottlenecks will keep moving in AI-driven workflows
Next, there’s a new dynamic engineering leaders need to plan for: as AI modifies each phase of the SDLC, the bottleneck moves. If AI reliably handles code reviews and they take 2 minutes each, the bottleneck suddenly becomes CI test execution—especially in large enterprises with thousands of tests. Solve CI latency, and the bottleneck inevitably migrates elsewhere: release approval, environment provisioning, or customer rollout.
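A rough sketch of what "the bottleneck moves" looks like in data: if you track a typical duration per stage, the bottleneck is simply the largest one, and it changes the moment one stage collapses. The stage names and every number except the 2-minute review are hypothetical.

```python
def bottleneck(stage_minutes):
    """Return the (stage, minutes) pair with the largest typical duration."""
    if not stage_minutes:
        raise ValueError("need at least one stage")
    return max(stage_minutes.items(), key=lambda item: item[1])


# Hypothetical typical durations per stage, in minutes.
before = {"code_review": 240, "ci_tests": 45, "release_approval": 30}

# Same pipeline once AI handles reviews in about 2 minutes each.
after = {**before, "code_review": 2}

print(bottleneck(before))  # ('code_review', 240)
print(bottleneck(after))   # ('ci_tests', 45)
```

Nothing about CI changed between the two snapshots. It became the constraint only because review stopped being one, which is why I would re-run this kind of check every time a stage gets automated.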
That has real implications for measurement:
- Time to first review: Stops being measured in hours or anchored to business hours. Starts being measured in minutes, 24/7, and may change to “time to final approval”.
- CI cycle time: Becomes the new gating metric the moment review latency disappears (assuming the processes run concurrently).
- Percent of decisions escalated to a human in the loop: This is worth tracking on its own, as it tells you exactly where AI is hitting its confidence ceiling and where humans are still the critical path.
Note: In regulated industries with contractual human-review requirements, some of this won’t apply, because code can’t ship to production without a human in the loop; so the metric shapes there will lag. But for most teams, the metric shapes are changing alongside the targets themselves.
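For the escalation metric, a minimal sketch might look like the following. It assumes you can log each AI decision as a (stage, escalated) pair; that log format, the stage names, and the sample records are all assumptions for illustration.

```python
from collections import Counter


def escalation_rate_by_stage(decisions):
    """Given (stage, escalated_to_human) pairs, return {stage: percent escalated}."""
    totals, escalated = Counter(), Counter()
    for stage, was_escalated in decisions:
        totals[stage] += 1
        if was_escalated:
            escalated[stage] += 1
    return {stage: 100.0 * escalated[stage] / totals[stage] for stage in totals}


# Hypothetical decision log.
log = [
    ("code_review", False),
    ("code_review", False),
    ("code_review", True),
    ("code_review", False),
    ("release_approval", True),
    ("release_approval", True),
]

rates = escalation_rate_by_stage(log)
print(rates)  # {'code_review': 25.0, 'release_approval': 100.0}
```

Breaking the rate out per stage matters more than the overall number: a stage sitting near 100% is one where humans are still the critical path.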
The shift-right trap: When AI test selection moves bugs downstream
Quality still matters, and escape defect rate is still the metric. The tension here is well known to anyone who has ever thought seriously about software testing (I’m no expert, but I was lucky to work alongside some of the best in the business). On the one hand, running the entire test suite, with every feature flag permutation and every config variant, in all environments, on every single change gives extremely high confidence that we’re not introducing new product defects or regressions. On the other hand, that’s way too slow and way too expensive. AI promises a lot in this category, namely the ability to select the right subset of tests to run without compromising on coverage or quality.
In principle, this is a quality win. But in practice, we’re still far away from AI being able to select the perfect test combo (just think: if it makes the wrong call on a test just 0.1% of the time, but there are 50k tests, we’re likely to run into big trouble). This means that while companies that use AI-based test selection may achieve faster inner-loop times, every miss is likely not just to shift bug discovery to the right but also to cause a real break in trust between developers and the AI-powered system. Many of you can imagine how frustrating it would be to get paged on an outage or assigned a critical bug just because “the system” didn’t do its job correctly.
The result: more on-call pages, more hotfixes, more customer-visible incidents. The inner loop got faster, but the system got worse.
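Here is the back-of-the-envelope math behind the 0.1% and 50k figures. It rests on one reading of those numbers that I am assuming for illustration: the selector makes an independent include-or-skip decision per test and gets each one wrong with probability 0.1%. Real selectors won't behave this cleanly.

```python
miss_rate = 0.001    # 0.1% chance of a wrong decision on any single test
num_tests = 50_000   # size of the test suite

# Expected number of wrong selection decisions for a single change.
expected_misses = miss_rate * num_tests

# Probability that at least one decision is wrong, assuming independence.
p_at_least_one = 1 - (1 - miss_rate) ** num_tests

print(f"Expected wrong decisions per change: {expected_misses:.0f}")
print(f"Chance of at least one wrong decision: {p_at_least_one:.4f}")
```

Under these assumptions, roughly 50 decisions per change are wrong and at least one miss is a near certainty. Not every wrong decision hides a bug, but it only takes a few that do to produce the pages and hotfixes described above.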
The right way to measure success here is, therefore, paired:
- Inner loop speed: Faster, with smarter test selection.
- Outer loop defects: Flat or trending down.
If only the first number is moving in the right direction, you’re just shifting risk, not removing it.
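The paired reading can be sketched as a small check. The function name, the labels, and the sample numbers are hypothetical; plug in whatever inner-loop duration and escaped-defect count you already track.

```python
def classify_change(inner_before, inner_after, escaped_before, escaped_after):
    """Compare inner-loop time and escaped defects across two periods."""
    inner_faster = inner_after < inner_before
    defects_not_worse = escaped_after <= escaped_before
    if inner_faster and defects_not_worse:
        return "improving"
    if inner_faster:
        return "shifting risk"
    return "no inner-loop gain"


# Hypothetical quarter-over-quarter numbers:
# inner loop in minutes, escaped defects as a count per quarter.
print(classify_change(40, 12, 9, 8))   # improving
print(classify_change(40, 12, 9, 15))  # shifting risk
```

The second case is the trap: the inner loop looks like a clear win, and only the paired number shows that the cost moved downstream.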
From team-level metrics to joint metrics across the SDLC
The old model for engineering managers was simple: you should mostly care about hitting your team’s numbers and having your team’s metrics look good compared to your peers’. Do well, get rewarded (or at least don’t show up on your head of engineering’s radar). If your shortcuts made the next team’s life worse, well, that was somebody else’s problem (the world’s best cloaking device according to The Hitchhiker’s Guide to the Galaxy, btw). That worked when handoffs were slow enough to absorb the friction. But the model breaks in an AI-optimized pipeline because every team’s output becomes another team’s input at a higher velocity. If one team’s optimization tanks the downstream team’s metrics, the system as a whole moves more slowly.
Moving forward, engineering leaders will increasingly be evaluated on a dual mandate: hitting their localized targets while ensuring their output doesn’t degrade performance for adjacent teams. This is uncomfortable. It means a team can hit every one of its targets and still be the reason the org misses its targets. But this shared responsibility is the only viable way to measure and optimize a system in which bottlenecks constantly relocate.
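One simple way to start seeing this cross-team effect is to compare a downstream team’s lead time before and after an upstream team’s process change. This is a sketch with made-up data and a hypothetical function name; a before/after comparison does not prove causation, it just tells you where to look.

```python
from datetime import date
from statistics import median


def downstream_lead_time_shift(lead_times, change_date):
    """Return (median_before, median_after) for a downstream team's lead times.

    lead_times: list of (completed_on, lead_time_days) pairs.
    change_date: the day the upstream team changed its process.
    """
    before = [days for day, days in lead_times if day < change_date]
    after = [days for day, days in lead_times if day >= change_date]
    if not before or not after:
        raise ValueError("need data on both sides of the change date")
    return median(before), median(after)


# Hypothetical Team B work items; Team A changed its process on March 1.
team_b = [
    (date(2025, 2, 3), 3), (date(2025, 2, 12), 4), (date(2025, 2, 24), 5),
    (date(2025, 3, 4), 6), (date(2025, 3, 15), 7), (date(2025, 3, 27), 8),
]

shift = downstream_lead_time_shift(team_b, date(2025, 3, 1))
print(shift)  # (4, 7)
```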
What engineering leaders should track during this transition
With all that said, here are a few concrete moves for engineering leaders navigating this transition:
- Audit which metrics are proxies vs. outcomes. AI adoption rate, PR count, and acceptance rate are proxies. Time to customer value, escape defect rate, and feature throughput are outcomes. Make sure proxies aren’t being treated as end goals.
- Track inner loop and outer loop together. Any inner-loop optimization should report a paired outer-loop number. If you can’t see both, you can’t tell whether you’re improving or just relocating the problem.
- Add “percent escalated to human” to your AI workflow metrics. It’s the cleanest read on where AI is and isn’t ready to be in control in your pipeline.
- Start measuring cross-team metric impact. When Team A changes a process, what happens to Team B’s lead time? If you can’t answer that, you don’t know what your system is actually doing.
- Start instrumenting end-to-end time from insight to customer value now. Yes, even if the data is messy today. You’ll need the history when this becomes the only number that matters.
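For the last item, even a crude version is worth having. The sketch below assumes you can record two timestamps per work item, when the customer pain point was identified and when the solution reached the customer. The record shape, the function name, and the dates are hypothetical.

```python
from datetime import datetime
from statistics import median


def insight_to_customer_days(items):
    """Median days from insight to customer delivery, ignoring unshipped items.

    items: list of (identified_at, shipped_at) pairs; shipped_at is None
    while the work is still in flight.
    """
    durations = [
        (shipped - identified).total_seconds() / 86400
        for identified, shipped in items
        if shipped is not None
    ]
    return median(durations) if durations else None


# Hypothetical work items.
items = [
    (datetime(2025, 1, 1), datetime(2025, 1, 11)),  # 10 days
    (datetime(2025, 1, 5), datetime(2025, 1, 25)),  # 20 days
    (datetime(2025, 1, 10), None),                  # still in flight
]

print(insight_to_customer_days(items))  # 15.0
```

I chose the median rather than the mean here because messy early data tends to include a few extreme outliers. Note that skipping in-flight items makes the number look better than reality when old work is stuck, so it is worth tracking the count of open items alongside it.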
Final thoughts
The transition will be noisy. For the next year or two, your dashboards will get more crowded, not less, because you’re instrumenting a moving target. Treat your metrics stack as something you revisit every quarter—not something you set and forget. And talk your teams through this being the new norm. The teams that get this right won’t have the prettiest dashboards. They’ll be the ones who keep up with the speed of change.






