There's a better way to measure AI productivity than counting lines of code. Focus on outcome metrics that prove business value: cycle times, quality, and delivery velocity. Learn why lines of code fails as an AI productivity metric, what outcome-based alternatives actually work, and when tracking AI code volume matters for governance and risk management.

Every few weeks, another headline lands: Google reports over 30% of new code is AI-generated, up from 25% just six months ago. Microsoft claims 20–30%. Meta's CEO predicts half of their development will be AI-driven within a year. And suddenly, every executive wants to know the same thing: "What percentage of our code is AI-generated?"
It's the wrong question.
Lines of code generated by AI is not just a vanity metric. It's a misleading vanity metric that creates a false sense of progress while obscuring what actually matters. The irony is hard to miss: lines of code was already widely dismissed as a flawed measure of developer productivity long before AI entered the picture. Why would it suddenly become the right metric for AI productivity?
There is one scenario where tracking AI-generated code volume makes sense: as a governance metric for repository risk and maintainability. But that's fundamentally different from using it as an outcome metric to prove ROI. The most valuable metrics for quantifying AI impact are outcome-based measures that directly tie to business value: cycle times, quality improvements, and delivery velocity.
Engineering leaders facing pressure to demonstrate AI ROI have options beyond what the headlines suggest. The path forward isn't counting lines of code because that's what Google reports. It's measuring outcomes that actually prove business value. Here's why the lines-of-code approach is failing organizations and what works better.
{{cta}}
The pressure is real. When Alphabet's earnings call reveals that AI code generation jumped from 25% to over 30% in six months, boards and CFOs start asking questions. When Microsoft's CEO discusses AI-generated code percentages at industry conferences, engineering leaders feel compelled to produce similar numbers.
But here's the fundamental flaw: an engineer might accept an AI suggestion, then delete it, refactor it, or rewrite it entirely before the code ever reaches a merge. The number that shows up in your dashboard has almost no relationship to the code that ships to production.
A social media platform we spoke with faced exactly this pressure. Leadership mandated measurement of "AI-generated lines of code" despite internal team skepticism about the metric's reliability. Similarly, a global professional services firm that invests $150,000 annually in GitHub Copilot wondered whether AI lines of code was the best way to demonstrate ROI to executives. Both organizations have sophisticated engineering teams struggling with the same problem: proving AI is worth the investment using metrics that can't actually prove it.
Before AI coding assistants existed, the software industry had largely abandoned lines of code as a meaningful productivity measure. The problems were well documented: it incentivizes verbosity over elegance, penalizes developers who delete unnecessary code, varies wildly across programming languages, and tells you nothing about whether the code actually works or delivers value.
As Bill Gates reportedly said, "Measuring programming progress by lines of code is like measuring aircraft building progress by weight." The best developers often ship features by removing code, not adding it. A clever refactoring that eliminates 500 lines while improving performance is more valuable than adding 1,000 lines of redundant logic.
Yet somehow, when AI entered the picture, lines of code became the headline metric again. The same measurement that failed to capture human developer productivity is now being used to justify AI investments. It doesn't make sense.
The vendor ecosystem alone creates chaos. GitHub Copilot, Claude Code, Cursor, Windsurf, Augment, and other AI tools provide different data formats for this information with no standardization. Furthermore, developers increasingly use multiple AI tools simultaneously. One tool generates code, another refactors it, and a third helps debug it. Attributing specific lines to specific tools becomes an exercise in inference, not measurement.
Even within a single tool, the data is unreliable given the engineers’ tendency to "accept everything then modify" rather than selectively accepting suggestions. This creates false positives that inflate acceptance rates while telling you nothing about production impact. Comparing accepted lines to merged lines provides inference, not deterministic truth. You're making educated guesses based on indirect indicators rather than direct measurement.
Survey alternatives fail too. Asking engineers "what percentage of that PR was written by AI?" produces unreliable, non-deterministic results. Different work patterns compound the problem. Infrastructure engineers and application engineers use AI completely differently, making organization-wide comparisons meaningless.
The research on AI-generated code paints a concerning picture that lines-of-code metrics conveniently ignore.
GitClear's analysis of 211 million changed lines of code across 2020-2024 found multiple signatures of declining code quality. They tracked an 8-fold increase in code blocks with five or more duplicated lines, showing duplication ten times higher than two years prior. Code churn, the percentage of code reverted or updated within two weeks, is projected to double.
The Harness State of Software Delivery 2025 report found that developers now spend more time debugging AI-generated code and more time resolving security vulnerabilities than before AI adoption.
Faros AI's own research shows that AI adoption is consistently associated with a 154% increase in average PR size. More code per pull request means more to review, more to test, and more potential for defects to slip through. This isn't a productivity gain. It's a quality burden.
None of these quality signals show up when you're counting lines of code. You could report impressive AI code generation numbers while your delivery stability craters and your technical debt compounds.
{{cta}}
If lines of code is a misleading vanity metric, what actually tells you whether AI is delivering value? The answer is outcome-based metrics organized into three tiers based on their impact on business decisions.
These metrics answer the fundamental question executives care about: "Are we delivering more value, faster?"
PR cycle time measures the duration from pull request creation to merge. It directly reflects delivery velocity and code review efficiency. When a global professional services firm asked whether they could reduce consulting service pricing because of Copilot, PR cycle time was the metric that could actually answer that question.
Lead time tracks the journey from first commit to production deployment. This end-to-end measure of software delivery performance directly correlates with feature velocity. DORA research consistently shows that lead time predicts organizational performance.
Task cycle time measures how long it takes to close Jira or ADO tickets or complete work items. This measures productivity at the unit of work level and is easier for stakeholders to understand than code-level metrics.
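As a rough sketch of how these duration metrics are computed, here is a minimal Python example that derives median PR cycle time from creation and merge timestamps. The timestamps are hypothetical; in practice they would come from your Git host's API or an engineering analytics platform.

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records: (created_at, merged_at) in ISO 8601.
# Real data would come from your Git host's API.
prs = [
    ("2025-01-06T09:00:00", "2025-01-07T15:30:00"),
    ("2025-01-06T11:00:00", "2025-01-06T16:45:00"),
    ("2025-01-08T10:15:00", "2025-01-10T09:00:00"),
]

def cycle_time_hours(created: str, merged: str) -> float:
    """Hours elapsed from PR creation to merge."""
    delta = datetime.fromisoformat(merged) - datetime.fromisoformat(created)
    return delta.total_seconds() / 3600

hours = [cycle_time_hours(c, m) for c, m in prs]
print(f"median PR cycle time: {median(hours):.1f}h")  # → 30.5h
```

Lead time and task cycle time follow the same pattern with different endpoints: first commit to production deploy, or ticket opened to ticket closed. Using the median rather than the mean keeps one long-running PR from distorting the trend.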
Quality metrics include bugs escaping to production, incident rates, change failure rates, and rework rates. These answer the critical question: "Is AI-generated code actually good?" Without quality metrics, you have no idea whether your AI-assisted velocity is creating technical debt that will slow you down later.
Developer satisfaction and experience provides essential qualitative input. Regular pulse surveys on AI tool satisfaction help identify friction points and inform tool selection decisions. Developer happiness predicts long-term adoption success. If engineers don't like a tool, they won't use it, regardless of what the acceptance rates say.
AI tool usage frequency measures how often AI coding assistants are actually used. This is a leading indicator of impact and identifies adoption hurdles. It's more reliable than acceptance rates because it shows actual engagement patterns.
Percentage of the organization using AI tracks adoption trends over time. This metric becomes meaningful when you overlay it with outcome metrics to identify patterns. For organizations tracking hundreds or thousands of engineers across many projects, understanding adoption breadth is essential context for interpreting outcome changes. But be careful: seeing adoption rise alongside improved outcomes doesn't automatically prove AI caused the improvement. That requires more rigorous analysis, which we'll address later.
When adoption metrics stagnate or decline, the root cause often lies beyond the tool itself. Common blockers include inadequate training programs, lack of manager buy-in, unclear guidelines on when and how to use AI tools, and insufficient communication about the "why" behind AI adoption.
Lines accepted from AI divided by lines in PRs can be useful for debugging specific usage patterns, but it's not suitable as a primary KPI. High variability based on engineering discipline makes comparisons problematic.
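To see why this ratio resists cross-team comparison, consider a toy calculation. The numbers below are purely illustrative, not real benchmarks:

```python
# Hypothetical per-team tallies: AI-accepted lines vs. total lines merged.
teams = {
    "infrastructure": {"accepted": 1200, "merged": 9000},  # config-heavy work, low ratio
    "application":    {"accepted": 5400, "merged": 9000},  # boilerplate-heavy work, high ratio
}

for name, t in teams.items():
    ratio = t["accepted"] / t["merged"]
    print(f"{name}: {ratio:.0%} of merged lines accepted from AI")
```

Both teams merged the same volume of code, yet their acceptance ratios differ several-fold simply because their work lends itself to AI suggestion differently. Ranking teams on this number would punish the infrastructure team for the nature of its work, not its use of AI.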
AI-generated versus handwritten lines is useful for identifying repos highly augmented by AI to manage risk and maintainability. This is the one context where tracking AI code volume matters, and we'll cover it in the next section.
Agent-generated pull requests applies only for autonomous agent tools like Claude Code. It's more deterministic than line-level tracking but still doesn't answer the quality question.
{{cta}}
There's one scenario where knowing how much of your code is AI-generated becomes genuinely important: repository risk management.
If a significant portion of a codebase was generated by AI, that's information worth knowing for maintainability and quality planning. Repos highly augmented by AI may need extra review attention, more robust testing, and closer monitoring for the quality issues that research shows AI code tends to introduce.
This is fundamentally different from using lines of code as a productivity metric. You're not asking "how productive are we?" You're asking "where might we have elevated risk?" The goal isn't to maximize AI code generation. It's to understand where AI-generated code exists so you can manage it appropriately.
A data protection company took this approach when evaluating AI coding assistants. Rather than tracking lines of code as a KPI, they measured adoption and usage patterns while correlating them with downstream impacts. They compared test groups using different tools and tracked actual productivity outcomes. The result was data-validated confidence in their chosen AI coding assistant, with 2x higher adoption, 3 additional hours saved per week per developer, and 40% higher ROI, all without misleading lines-of-code metrics.
The right question isn't "How many lines of AI code did we generate?" It's "Are engineers delivering value faster and with higher quality when they use AI?"
Organizations fall into three buckets when adopting AI:
The "me too" bucket adopts AI to follow industry trends without clear objectives. Measurement here is nearly impossible because there's no definition of success.
The "top-down mandate" bucket sees executives mandate AI adoption without attaching it to an underlying goal or defining success criteria. These organizations struggle to prove ROI because they never specified what ROI would look like.
The "problem-first" bucket identifies a clear goal or challenge and evaluates AI as one lever for the solution. This is where measurement succeeds because success criteria exist before implementation begins.
You cannot measure improvement without knowing where you started. Ingest historical data to show pre-AI performance across your key metrics. Note that the AI tools themselves often limit data to 30-100 days of usage history. An engineering productivity platform can remove this barrier to create a longer view.
What does your CTO or CEO actually care about? It's rarely "lines of code generated." Common answers sound like "Engineers should do more in less time, and it should be better." That translates to higher throughput of PRs, faster PR completion, and improved quality metrics.
Set specific targets. "If 100% of developers use AI, we expect PR cycle time to drop by 50%" gives you something concrete to measure against.
Overlay AI adoption trends with outcome metrics. What percentage of engineers are using tools? How is that correlating with changes in your key engineering productivity metrics?

Those metrics might be DORA metrics like lead time, deployment frequency, change failure rate, and failed deployment recovery time. They might follow the SPACE framework covering satisfaction, performance, activity, communication, and efficiency. They might be something bespoke to your organization based on what your leadership actually cares about.
The point is measuring outcomes that matter to your business, not inputs that are easy to count.
Here's the complication: most organizations have "tons of other things" happening beyond AI adoption. Quality initiatives, process changes, team reorganizations, new tooling, all of these confound your ability to attribute outcome changes to AI specifically.
When metrics move week-over-week, inference-based measurements make it "really challenging" to explain why. Engineering teams easily dismiss proxy metrics with "that won't work for us."
Good causal analysis requires access to comprehensive engineering data, the ability to control for confounding variables, and statistical rigor beyond dashboard visualizations. Charts showing "adoption went up and code smells went down" are compelling, but without proper statistical controls, you can't confidently claim AI caused the improvement.

This is where Faros AI differentiates. While most platforms stop at correlation dashboards, Faros AI provides causal analysis of the impact of AI on key quality metrics. That means isolating AI's effect from the noise of other initiatives happening simultaneously, giving you defensible ROI claims rather than speculative correlations.
{{cta}}
Perhaps the most important insight from recent research is what Faros AI calls the AI productivity paradox: individual developers report significant productivity gains, but organizations see no measurable improvement in delivery outcomes.
The data is striking. Developers on teams with high AI adoption complete 21% more tasks and merge 98% more pull requests. But PR review time increases 91%, revealing a critical bottleneck: human approval.
This pattern reflects Amdahl's Law: a system moves only as fast as its slowest link. AI accelerates code generation, but if your code review process, testing infrastructure, and release pipelines can't match the new velocity, the gains evaporate.
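Amdahl's Law makes this concrete. The sketch below assumes, purely for illustration, that coding accounts for 30% of end-to-end lead time and that AI doubles coding speed:

```python
def overall_speedup(fraction_accelerated: float, speedup: float) -> float:
    """Amdahl's Law: end-to-end speedup when only part of the pipeline gets faster."""
    return 1 / ((1 - fraction_accelerated) + fraction_accelerated / speedup)

# Illustrative assumption: coding is 30% of lead time and AI makes it 2x faster,
# while review, testing, and release pipelines stay unchanged.
s = overall_speedup(0.30, 2.0)
print(f"end-to-end speedup: {s:.2f}x")  # → 1.18x, far below the 2x coding gain
```

Even an infinitely fast coding step would cap the end-to-end gain at about 1.4x under these assumptions, because the other 70% of the pipeline is untouched. That is why review and release bottlenecks swallow individual gains.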
The METR research nonprofit found that experienced developers took 19% longer to complete tasks when using AI coding assistants, despite believing they were 20% faster. The 39-percentage-point gap between perceived and actual productivity represents what researchers call the "perception tax."
Asana's research identified the same phenomenon among knowledge workers broadly. Super productive employees report saving 20+ hours per week with AI, but 90% say AI creates more coordination work between team members. Individual gains are consumed by coordination costs, quality taxes, and rework loops before they reach the bottom line.
Without lifecycle-wide modernization, AI's benefits are quickly neutralized. You can't just measure code generation speed and declare victory.
{{ai-paradox}}
What leadership actually cares about became clear when the sales team of the professional services firm asked: "Can we now say that it's 25% cheaper to deliver our software development services because we use GitHub Copilot?"
That's the right question. The answer requires outcome metrics, not vanity metrics.
What to avoid: Leading with "X% of code generated by AI" claims, vendor-provided acceptance rates as KPIs, survey-based self-reporting, and any metric that can't be explained when it fluctuates.
Accept that measurement will be imperfect. Focus instead on metrics that are outcome-based, tied to business value, and explainable when they fluctuate.
The industry hasn't solved deterministic AI attribution, automated causal inference at scale, or cross-tool normalization. These remain hard problems. But that doesn't mean you're stuck with misleading vanity metrics.
Outcome-based measurement works. It requires more thought than counting lines of code, but it tells you something that actually matters: whether AI is helping your organization deliver better software faster. And that's the only question worth answering.
Ready to measure what matters? Explore the AI Productivity Paradox research to understand why individual gains don't translate to organizational impact, then see how Faros AI's AI transformation measurement helps engineering leaders prove real ROI.




