What your CFO actually wants to know
"We had surveys. We had dashboards. But when my CIO asks for an economic case for our AI tools, none of that helps. You need a fundamentally different class of data to answer that question."
That's a VP of Engineering at a top-tier industrial manufacturing company, speaking in April 2026. His team had used one of the leading developer experience platforms. It had the best surveys. It had some tool telemetry. And when AI coding tools started consuming a meaningful line item in the engineering budget, none of it was sufficient to answer the questions that actually mattered.
He's not alone. Across enterprise engineering organizations, the same pattern is playing out: developer sentiment surveys built for a pre-AI world are being asked to do a job they were never designed to do. And the gap between what they can tell you and what you actually need to know is getting wider every quarter.
Why engineering teams built their measurement programs around developer surveys
Developer sentiment surveys made a lot of sense for a long time. Tools like DX (now part of Atlassian) and DORA-aligned pulse checks gave engineering leaders something genuinely valuable: a scalable way to understand how developers experienced their work. Where were the friction points? Were teams burning out? Was the toolchain getting in the way? These are real questions, and surveys answered them well.
There was also a practical reason surveys became the default. Connecting engineering data across a heterogeneous toolchain (Jira here, GitHub there, ADO somewhere else, plus CI/CD pipelines and incident management systems) is genuinely hard. Surveys sidestepped that complexity entirely. You didn't need to instrument anything or build a unified data model. You just asked your developers how they felt. For a long time, that felt like enough.
The developer experience discipline that emerged from this era was legitimate. Capturing developer sentiment helped organizations identify systemic problems: manual and slow pipelines; process overhead; too many meetings, interviews, and interruptions. Survey instruments like those in the DX platform gave engineering leaders a credible, structured way to bring those signals to leadership.
When the primary question was "how do we remove friction from our existing engineering process," surveys were the right instrument. They told you where developers were struggling. They gave you a feedback loop on changes you'd made. They were a meaningful part of how engineering leaders justified investments in tooling and process improvement.
That era is not entirely over. Sentiment still matters. But it's no longer sufficient on its own, and in some cases it's actively pointing leaders in the wrong direction.
What changed: AI made the questions harder
AI coding tools changed what engineering leaders are accountable for explaining. It's not just "are my developers happy and productive?" It's "are we getting economic value from this AI investment, and how do I prove it?"
A VP of Engineering at a large enterprise put it plainly: "Every AI tool conversation with my CIO comes down to the same question: what's the economic case? Sentiment data doesn't answer that."
This is the core problem. Developer sentiment surveys were built to measure developer experience, not to produce the economic analysis that CFOs and CIOs now expect. When GitHub Copilot, Cursor, Windsurf, or Claude Code costs real money at scale, the question changes from "do developers like this tool?" to "what is this tool actually delivering, and is it worth what we're paying?"
The "appearance of productivity" problem makes this worse. Developers overwhelmingly report that AI tools make them feel more productive. The 2025 Stack Overflow Developer Survey found that roughly 70% of developers using AI agents agreed they had increased productivity. DORA's 2025 State of AI-Assisted Software Development report, based on nearly 5,000 survey responses, found that over 80% of respondents said AI had enhanced their productivity.
These are not small numbers. And they are almost certainly telling you something true at the individual level: developers working with AI assistance do complete discrete tasks faster. The problem is that organizational outcomes don't follow automatically from individual sentiment, and survey instruments aren't designed to catch the gap between the two.
{{cta}}
When the survey says one thing and the data says something else
This is where developer sentiment surveys stop being a useful signal and start becoming a liability.
The AI Engineering Report 2026 analyzed two years of telemetry data from 22,000 developers across more than 4,000 teams. It did not rely on self-reported estimates. It measured what actually happened in the systems where software gets built, reviewed, and shipped. The findings are not what the surveys would have predicted.
Task throughput per developer is up 34% under high AI adoption. Epics completed per developer are up 66%. Those gains are real. But at every downstream stage, the quality signal tells a different story. Bugs per developer are up 54%. Incidents per pull request have increased 242%. Code churn, the rate at which recently written code is deleted and replaced, has risen 861% under high AI adoption. Median PR review time is up 441%, and 31% more pull requests are merging with no review at all.
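To make the churn figure concrete, here is a minimal sketch of how a churn rate of this kind could be computed. The `LineChange` record, the 21-day window, and the function names are all assumptions for illustration; they are not the report's definition or data model.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class LineChange:
    """One added line of code and, if it was later removed, when (hypothetical record)."""
    added_on: date
    removed_on: Optional[date] = None  # None means the line is still in the codebase


def churn_rate(changes: list[LineChange], window_days: int = 21) -> float:
    """Share of added lines deleted or replaced within `window_days` of being written.

    The 21-day default is an assumption, not the report's definition.
    """
    if not changes:
        return 0.0
    window = timedelta(days=window_days)
    churned = sum(
        1
        for c in changes
        if c.removed_on is not None and c.removed_on - c.added_on <= window
    )
    return churned / len(changes)


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before * 100.0


# Illustrative data only: two of four added lines are rewritten inside the window.
start = date(2026, 1, 1)
sample = [
    LineChange(start, start + timedelta(days=3)),
    LineChange(start, start + timedelta(days=40)),  # removed, but outside the window
    LineChange(start),                              # still present
    LineChange(start, start + timedelta(days=10)),
]
print(churn_rate(sample))
```

Read this way, an 861% rise means the churn rate under high AI adoption is roughly 9.6 times its baseline (1 + 8.61), which is easy to lose when the number is quoted on its own.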

Developers are reporting productivity gains. The telemetry is showing compounding quality costs. Both things are happening simultaneously. A survey cannot see that. It can only record what developers believe about their own experience.
The same gap showed up in independent research. A METR study found that experienced developers working on real tasks from their own codebases took 19% longer when they were allowed to use AI tools than when they were not. Before the study, those same developers had predicted AI would make them 24% faster. After experiencing the slowdown, they still believed AI had sped them up, by about 20%. The gap between that perception and the measured result was nearly 40 percentage points.
This is not a bug in how surveys are designed. It is a feature. Surveys are designed to capture perception. In a world where perception tracks reality reasonably well, that's useful. In an AI-accelerated engineering environment where individual experience and organizational outcomes are diverging, it's a problem.
One engineering leader described the experience precisely: "We'd see the survey say one thing and the telemetry say something completely different. That gap is where the real questions live."
"We'd see the survey say one thing and the telemetry say something completely different. That gap is where the real questions live."
Should we start surveying coding agents instead?
If developer self-reporting is an unreliable input to productivity measurement, it's worth paying attention to where the survey-first approach goes next. DX recently announced Agent Experience, a framework that asks AI coding agents like Claude Code, Copilot, Gemini, and Windsurf to self-report on their own sessions, scoring the clarity of documentation, the quality of context, and the difficulty of the task they encountered.
The logic is appealing on the surface: agents have full visibility into their context, so why not ask them? But it imports the same fundamental problem that makes developer sentiment surveys insufficient, and compounds it. Human developers who reported productivity gains in METR's controlled study were measurably wrong by nearly 40 percentage points. Agents face a steeper credibility problem.
Research on frontier model self-reporting finds that LLM self-assessments fail basic reliability tests:
- Change the framing of the question slightly and the agent reports a different experience of the same task.
- Models consistently rate their own outputs as more accurate than human experts do, with the gap largest where accuracy matters most.
- An agent's narrative account of a session may not accurately reflect the decisions it actually made.
There is also a deeper issue.
DX's framework borrows the vocabulary of human experience (friction, clarity, difficulty) and applies it to an entity with no persistent state, no memory of previous sessions, and no stakes in the outcome. When an agent reports that documentation was unclear, it is producing a plausible output given its prompt. It is not filing a complaint. Treating that output as an experience signal, and aggregating it into a score that shapes how your engineering organization operates, is a category error dressed up as a measurement framework.
The answer to the survey problem is not to survey agents. It's telemetry: what the agent actually produced, how it compared to human-set standards, and whether the code it wrote held up downstream. That's measurable. That's what the Acceleration Whiplash report is built on.
What developer surveys can and can't tell you: a practical distinction
Surveys remain a valid instrument for specific questions. They're not going away, and they shouldn't. The issue is one of scope, not validity.
The VP who opened this article wasn't dismissing surveys. He was using them the way they're meant to be used: as one dimension. "We still want to capture sentiment and feedback," he said. "But you need to look at them together, because the survey sometimes says one thing and the data says something else. That gap is where the real conversations are."
That's the right framing. Surveys capture one dimension. The mistake is treating them as the primary instrument when you're trying to answer questions about measuring engineering productivity, AI tool ROI, or how to structure your engineering organization going forward.
{{cta}}
What you actually need to justify AI budgets today
The questions engineering leaders face in 2026 require a different class of data. Here's what that actually looks like in practice.
- Tool-level attribution. If your organization runs Cursor, Copilot, and Claude Code simultaneously across different teams, you need to know which tool is producing what outcomes. Not which tool developers prefer. Which one is actually reducing cycle time, reducing incident rates, and improving delivery performance? That requires telemetry connected to delivery metrics, not a satisfaction survey.
- Economic analysis, not sentiment scores. When you walk into a budget conversation, your CFO doesn't want to know that developers are happier with AI tools. They want to know what the output delta is, what it costs to produce it, and how to get the best gains at the lowest cost. That last question matters more every quarter. A mid-tier model from one vendor can outperform a premium model on the metrics that actually move your delivery performance. You can't find that out from a satisfaction survey. You find it out by connecting AI adoption data to lead time, throughput, change failure rate, and cost per delivery unit across tools and teams simultaneously.
- Adoption visibility across tools and teams. In large organizations, shadow AI adoption is real. Developers sign up for new tools independently, vibe-code applications that carry security risks, and experiment outside any governed process. You can't survey your way to that visibility. You need telemetry that spans the full tool landscape, regardless of what people choose to self-report.
- Cross-system measurement. Most enterprise engineering organizations don't run a uniform toolchain. Teams use Jira, ADO, GitHub, and various combinations of them. A measurement program that only captures sentiment can paper over that heterogeneity. A telemetry-based approach has to actually connect those systems and normalize the data, which is what gives you a real view of developer productivity across teams that work differently.
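As a rough illustration of the attribution and cross-system points above, the sketch below normalizes records from two sources into one schema and then summarizes delivery outcomes per AI tool. Everything in it is hypothetical: the `WorkItem` schema, the raw field names (`resolution_hours`, `merge_minutes`, and so on), and the tool labels are invented for the example and do not reflect the real Jira or GitHub APIs or any vendor's data model.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class WorkItem:
    """Unified record for one delivered unit of work (hypothetical schema)."""
    source: str              # "jira", "github", ...
    team: str
    ai_tool: str             # tool attributed to the work, or "none"
    cycle_time_hours: float
    caused_incident: bool


def from_jira(raw: dict) -> WorkItem:
    # Field names are invented for illustration, not Jira's real API fields.
    return WorkItem("jira", raw["team"], raw.get("ai_tool", "none"),
                    raw["resolution_hours"], raw.get("incident", False))


def from_github(raw: dict) -> WorkItem:
    # Also hypothetical: assumes merge latency is exported in minutes.
    return WorkItem("github", raw["team"], raw.get("ai_tool", "none"),
                    raw["merge_minutes"] / 60.0, raw.get("incident", False))


def summarize_by_tool(items: list[WorkItem]) -> dict[str, dict[str, float]]:
    """Median cycle time and incident rate per AI tool, across all sources."""
    grouped: dict[str, list[WorkItem]] = {}
    for item in items:
        grouped.setdefault(item.ai_tool, []).append(item)
    return {
        tool: {
            "items": float(len(group)),
            "median_cycle_time_hours": median(i.cycle_time_hours for i in group),
            "incident_rate": sum(i.caused_incident for i in group) / len(group),
        }
        for tool, group in grouped.items()
    }


# Illustrative records only.
items = [
    from_jira({"team": "payments", "ai_tool": "tool_a", "resolution_hours": 10.0}),
    from_jira({"team": "payments", "ai_tool": "tool_a", "resolution_hours": 20.0,
               "incident": True}),
    from_github({"team": "search", "ai_tool": "tool_b", "merge_minutes": 240}),
    from_github({"team": "search", "merge_minutes": 600}),
]
for tool, stats in summarize_by_tool(items).items():
    print(tool, stats)
```

A grouping like this only shows which outcomes travel with which tool. It does not control for differences in teams or task types, so a real analysis would need to account for those before drawing conclusions about any one tool.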
Does this mean engineering organizations should stop using surveys?
No. The answer is not to abandon developer sentiment surveys. It's to understand what they're for.
Surveys are a valid instrument for understanding developer experience, identifying friction, and capturing the qualitative signals that telemetry can't surface. If your teams are struggling with unclear requirements, or if a tooling change is creating frustration that hasn't yet shown up in delivery metrics, a well-designed survey can catch that early.
The problem is treating surveys as a proxy for the questions that only telemetry can answer. When you ask "is our AI investment working?" a survey will tell you how developers feel about their AI tools. It will not tell you whether those tools are generating a return on investment. When you ask "how do we justify the role of engineering in an AI-first organization?" a survey will tell you how engineers perceive their own value. It will not give you the economic case that survives a CFO conversation.
Sensors and surveys serve different purposes. The organizations getting this right are using both, in the right proportions, for the right questions. They're not substituting one for the other.
The same caution extends to composite indexes built on top of self-reported data. DX's Developer Experience Index (DXI), for example, aggregates survey responses into a single score meant to represent the health of your engineering organization. The appeal is obvious: one number, easy to track, easy to present. But an index is only as reliable as its inputs. If the underlying data is heavily reliant on self-reported perception, the index inherits all of the same limitations. A rising DXI score tells you developers feel better about their work. It does not tell you whether your AI investment is producing a return, whether code quality is holding up, or whether your delivery performance is improving. In an AI-first engineering environment, those are the questions that determine your budget, your headcount, and your organizational structure. An index that can't answer them isn't a measurement program. It's a mood tracker with a dashboard.
What "different dimensions" means: building a measurement program that holds up
For engineering leaders who want to move beyond surveys as their primary instrument, here is what a complete measurement program looks like.
- Start with telemetry that spans the full delivery workflow. That means connecting your source code management system, your issue tracker, your CI/CD pipeline, your incident management systems, and your AI coding tools into a unified data model. The Faros platform does this across heterogeneous toolchains, normalizing data from Jira, ADO, GitHub, and every major AI coding tool into a single schema.
- Add AI adoption tracking that goes beyond "how many licenses are active." You want to know who is using which tools, how intensively, and what the downstream effect is on their team's delivery performance. That means tracking acceptance rates, code churn by tool, and connecting usage data to cycle time, bug and incident rates, and PR review patterns.
- Continuously evaluate AI coding tools and models against your actual codebase. Vendor benchmarks and analyst reports tell you how models perform in general. They don't tell you how they perform on your repositories, your task types, your engineering context. The gap between the two can be significant. In a head-to-head evaluation on real tasks from internal codebases, a mid-tier model from one vendor outperformed a code-specialized model from another by more than 3x on successful task completion, at a comparable cost per outcome. That finding only surfaces when you test against your own code, not when you ask developers which tool they prefer. Model performance changes as vendors ship updates. The evaluation has to be continuous, not a one-time procurement decision.
- Layer in the economic analysis. When you can connect AI tool usage to delivery throughput metrics, and you know the per-seat cost of each tool, you can build a defensible ROI case. Not "our developers said they were 30% more productive." But "teams at high AI adoption completed 34% more tasks per developer, their rework rates are healthy, and here is what that means in terms of engineering capacity and cost per unit of output." And when AI tool pricing increases, as it has been doing, you can simulate the net ROI impact before the renewal conversation happens, not after.
- Keep surveys in the mix, but scope them correctly. Use them to understand developer experience signals that don't show up in telemetry. Use them to cross-check: when sentiment and data diverge, that's a signal worth investigating, not a number to average away.
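The economic step in the list above can be sketched as a small what-if model. This is one possible framing, not a prescribed method: it values gained capacity at loaded developer cost, discounts it by a rework share, and compares the result with license spend. The `ToolScenario` fields and every number in the example are placeholders, not figures from the report.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToolScenario:
    """Inputs for one AI tool; every figure is a placeholder to replace with your own."""
    seats: int
    seat_cost_per_month: float        # license price per developer
    loaded_cost_per_dev_month: float  # salary plus overhead (assumed valuation basis)
    capacity_gain: float              # measured throughput gain, e.g. 0.10 for 10%
    rework_share: float               # fraction of that gain lost to rework


def monthly_net_value(s: ToolScenario) -> float:
    """Value of net capacity gained minus license spend, per month."""
    net_gain = s.capacity_gain * (1.0 - s.rework_share)
    capacity_value = s.seats * s.loaded_cost_per_dev_month * net_gain
    license_cost = s.seats * s.seat_cost_per_month
    return capacity_value - license_cost


def roi(s: ToolScenario) -> float:
    """Net value divided by license spend."""
    license_cost = s.seats * s.seat_cost_per_month
    if license_cost == 0:
        raise ValueError("license cost must be non-zero")
    return monthly_net_value(s) / license_cost


def with_price_increase(s: ToolScenario, pct: float) -> ToolScenario:
    """Same scenario with the seat price raised by `pct` percent."""
    return replace(s, seat_cost_per_month=s.seat_cost_per_month * (1 + pct / 100))


# Placeholder inputs: compare today's pricing with a hypothetical 50% increase.
current = ToolScenario(seats=100, seat_cost_per_month=40.0,
                       loaded_cost_per_dev_month=15000.0,
                       capacity_gain=0.10, rework_share=0.5)
renewal = with_price_increase(current, 50.0)
print(f"ROI now: {roi(current):.2f}, after increase: {roi(renewal):.2f}")
```

The useful part is less the formula than the inputs: `capacity_gain` and `rework_share` have to come from telemetry, because a self-reported productivity number would make the whole calculation only as reliable as the survey behind it.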
The leaders who will make the right calls on AI investment, team structure, and engineering governance over the next two years are the ones who have both dimensions, who know which instrument answers which question, and who can walk into a budget conversation with data that survives scrutiny.
Developer sentiment surveys gave engineering organizations a foundation. For a long time, they also gave engineering leaders a way to avoid the harder data problem. Integrating your toolchain, normalizing data across systems, and connecting AI adoption to delivery outcomes is real work. Surveys were easier. And in an era where the primary question was developer experience, easier was defensible.
That's no longer the case. The questions engineering leaders face today (what is our AI investment actually producing, which tools are worth the cost, how do we justify the role of engineering in an organization where AI writes the code) cannot be answered with a survey. They require reliable, accurate, and granular telemetry. They require a unified data model. They require a way to query that data as new questions emerge, because the questions will keep emerging. Two years ago, no one was tracking the relationship between AI adoption, code churn, and incidents. Now those are among the most important signals in engineering.
The good news is that this work is now more achievable than it has ever been. The organizations that do it will be able to walk into any budget conversation with data that survives scrutiny. The ones that don't will keep presenting sentiment scores to CFOs who are asking economic questions, and wondering why the answers don't land.
Faros's AI Engineering Report 2026: The Acceleration Whiplash covers telemetry data from 22,000 developers across more than 4,000 teams, tracking two years of before-and-after AI adoption data across the full software delivery lifecycle.
{{whiplash}}







