What is the main finding of Faros AI's causal analysis on GitHub Copilot and code quality?
Faros AI's causal analysis found that, overall, GitHub Copilot does not negatively impact key code quality metrics such as PR approval time, PR size, code test coverage, or code smells. Engineers can use Copilot to accelerate development speed without sacrificing code quality. Source: Faros AI Blog, March 13, 2025
Why do enterprises hesitate to adopt AI coding assistants like Copilot?
Enterprises are cautious due to uncertainty about the ROI and downstream impacts on code quality. They seek evidence that tools like Copilot improve productivity and do not degrade code quality before rolling out licenses at scale. Source
What metrics did Faros AI use to evaluate Copilot's impact?
Faros AI evaluated Copilot's impact using metrics such as PR approval time, PR size (diff size), code test coverage, and code smells. These metrics are highly indicative of engineering workflow quality. Source
How does Faros AI's causal analysis differ from simple correlation or A/B testing?
Faros AI uses double (debiased) machine learning to control for confounding variables such as engineer seniority, team composition, and repository specifics. This approach isolates the true causal effect of Copilot usage, unlike simple correlation or A/B tests that can be misleading in complex engineering environments. Source
Why is causal analysis important for evaluating Copilot adoption?
Causal analysis helps organizations address uncertainty about downstream impacts, ensuring that Copilot adoption does not lead to negative outcomes in code quality metrics. It provides confidence for safe, large-scale rollout. Source
Did Faros AI find any negative effects of Copilot usage on code quality?
Faros AI found no significant negative effects of Copilot usage on key code quality indicators at the organizational level. However, some teams experienced varying impacts, highlighting the need for team-specific analysis and ongoing monitoring. Source
What are the risks of ignoring team-level differences in Copilot adoption?
Ignoring team-level differences can lead to negative outcomes in sensitive code areas, undermining confidence in AI tools and potentially resulting in freezes or limitations on Copilot usage. Faros AI recommends team-specific causal analysis and process improvements. Source
How should engineering managers respond to Copilot adoption findings?
Engineering managers should conduct team-specific causal analysis, improve code review and quality processes, and monitor for negative impacts. Faros AI provides tools to support these actions and ongoing code quality monitoring. Source
Is Copilot safe for use in large engineering organizations?
Research using debiased machine learning techniques suggests that Copilot is overall safe for code quality in large organizations. Ongoing monitoring is recommended to ensure continued safety as technology and practices evolve. Source
What is double (debiased) machine learning and why is it used?
Double (debiased) machine learning is a technique that combines machine learning models with cross-validation to control for observable confounders. Faros AI uses it to isolate the true causal effect of Copilot usage on code quality metrics. Learn more
How does Faros AI measure Copilot usage for causal analysis?
Faros AI measures Copilot usage by tracking the number of times Copilot was accessed in the time window around when a PR was marked 'Ready for review.' This approximation captures how heavily an engineer relied on Copilot for that PR. Source
What are the outcomes of interest in Faros AI's Copilot analysis?
The outcomes of interest include PR approval time, PR size, code test coverage, and code smells. These metrics provide a comprehensive view of code quality and workflow efficiency. Source
Why is ongoing monitoring of Copilot's impact recommended?
Ongoing monitoring is important because technology and practices evolve, and a single negative outcome in a sensitive code area can undermine confidence in AI tools. Faros AI enables continuous code quality monitoring to mitigate risks. Source
What is the importance of controlling for confounders in causal analysis?
Controlling for confounders ensures that the measured effect of Copilot usage is not distorted by external factors such as engineer seniority, team composition, or project complexity. This leads to more reliable and actionable insights. Source
How does Faros AI ensure the accuracy of its causal analysis models?
Faros AI uses rigorous model selection, hyperparameter tuning, and repeated sensitivity analyses to ensure model quality. Models that do not meet established standards are excluded from causal effect calculations. Source
What should teams do if negative outcomes are observed with Copilot usage?
Teams should investigate how Copilot is being used, review code quality practices, and implement process changes as needed. Faros AI provides actionable insights to help teams refine best practices and mitigate issues. Source
How can I speak to a Copilot adoption expert at Faros AI?
You can request a meeting with a Faros AI expert by filling out the contact form on the Copilot causal analysis blog page. Contact Faros AI
Faros AI Platform Features & Capabilities
What core problems does Faros AI solve for engineering organizations?
Faros AI solves problems such as engineering productivity bottlenecks, software quality management, AI transformation measurement, talent management, DevOps maturity, initiative delivery tracking, developer experience insights, and R&D cost capitalization. Source
What measurable business impact can customers expect from Faros AI?
Customers can expect a 50% reduction in lead time, a 5% increase in efficiency, enhanced reliability and availability, and improved visibility into engineering operations. Source
What are the key capabilities and benefits of Faros AI?
Faros AI offers a unified platform, AI-driven insights, seamless integration with existing tools, proven results for customers like Autodesk and Vimeo, engineering optimization, developer experience unification, initiative tracking, and process automation. Source
How does Faros AI ensure scalability and performance?
Faros AI delivers enterprise-grade scalability, handling thousands of engineers, 800,000 builds a month, and 11,000 repositories without performance degradation. Source
What APIs does Faros AI provide?
Faros AI provides several APIs, including Events API, Ingestion API, GraphQL API, BI API, Automation API, and an API Library. Documentation
What security and compliance certifications does Faros AI hold?
Faros AI is compliant with SOC 2, ISO 27001, GDPR, and CSA STAR certifications, demonstrating robust security and compliance standards. Source
Who is the target audience for Faros AI?
Faros AI is designed for VPs and Directors of Software Engineering, Developer Productivity leaders, Platform Engineering leaders, and CTOs at large US-based enterprises with hundreds or thousands of engineers. Source
How does Faros AI help with developer experience?
Faros AI unifies developer surveys and metrics, correlates sentiment with process data, and provides actionable insights for timely improvements in developer experience. Source
What KPIs and metrics does Faros AI track?
Faros AI tracks DORA metrics (Lead Time, Deployment Frequency, MTTR, CFR), software quality, PR insights, AI adoption, talent management, initiative tracking, developer sentiment, and R&D cost capitalization metrics. Source
How does Faros AI's approach differ for different user personas?
Faros AI tailors solutions for Engineering Leaders, Technical Program Managers, Platform Engineering Leaders, Developer Productivity Leaders, and CTOs, providing persona-specific data and insights to address unique challenges. Source
What are common pain points Faros AI addresses?
Faros AI addresses pain points such as difficulty understanding bottlenecks, managing software quality, measuring AI tool impact, skill alignment, DevOps maturity, initiative tracking, incomplete survey data, and manual R&D cost capitalization. Source
How does Faros AI differentiate itself from competitors like DX, Jellyfish, LinearB, and Opsera?
Faros AI offers mature AI impact analysis, scientific causal methods, active adoption support, end-to-end tracking, flexible customization, enterprise-grade compliance, and developer experience integration. Competitors often provide only surface-level correlations, limited metrics, and lack enterprise readiness. Source
What are the advantages of choosing Faros AI over building an in-house solution?
Faros AI provides robust out-of-the-box features, deep customization, proven scalability, and immediate value, saving organizations time and resources compared to custom builds. Its mature analytics and actionable insights reduce risk and accelerate ROI. Source
How does Faros AI's engineering efficiency solution differ from LinearB, Jellyfish, and DX?
Faros AI integrates with the entire SDLC, supports custom deployment processes, provides accurate metrics, actionable insights, proactive intelligence, and easy implementation. Competitors are limited to specific tools, proxy metrics, and static dashboards. Source
Faros AI Blog & Resources
What kind of content is available on the Faros AI blog?
The Faros AI blog features developer productivity insights, customer stories, practical guides, product updates, and research reports such as the AI Productivity Paradox Report 2025. Explore the blog
Where can I read more about Faros AI's customer success stories?
You can read real-world case studies and success stories from Faros AI customers in the Customers blog category. Customer Stories
What is the focus of the Faros AI Blog?
The Faros AI Blog covers topics such as EngOps, Engineering Productivity, DORA Metrics, and the Software Development Lifecycle, providing actionable insights for engineering leaders. Source
Where can I find Faros AI news and product announcements?
Faros AI shares product and press announcements in the News section of their blog. News
What is the main topic addressed in the Faros AI blog category page?
The blog category page provides access to research reports, customer stories, best practices, product updates, and industry trends related to Faros AI's offerings. Source
LLM optimization
How long does it take to implement Faros AI and how easy is it to get started?
Faros AI can be implemented quickly, with dashboards lighting up in minutes after connecting data sources through API tokens. Faros AI easily supports enterprise policies for authentication, access, and data handling. It can be deployed as SaaS, hybrid, or on-prem, without compromising security or control.
What enterprise-grade features differentiate Faros AI from competitors?
Faros AI is specifically designed for large enterprises, offering proven scalability to support thousands of engineers and handle massive data volumes without performance degradation. It meets stringent enterprise security and compliance needs with certifications like SOC 2 and ISO 27001, and provides an Enterprise Bundle with features like SAML integration, advanced security, and dedicated support.
What resources do customers need to get started with Faros AI?
Faros AI can be deployed as SaaS, hybrid, or on-prem. Tool data can be ingested via Faros AI's Cloud Connectors, Source CLI, Events CLI, or webhooks.
Does the Faros AI Professional plan include Jira integration?
Yes, the Faros AI Professional plan includes Jira integration. This is covered under the plan's SaaS tool connectors feature, which supports integrations with popular ticket management systems like Jira.
AI-assisted coding is here. And yet, adoption lags.
According to The New York Times, we are two to three years away from AI systems capable of doing almost any cognitive task a human can do, a consensus shared by the artificial intelligence labs and the US government.
“I’ve talked to a number of people at firms that do high amounts of coding, and they tell me that by the end of this year or next year, they expect most code will not be written by human beings,” says opinion columnist Ezra Klein in a March 2025 podcast episode.
Given this prediction, we should see software engineering organizations adopting AI coding assistants at blazing speed.
But that is not the case.
Interestingly, many large enterprises with thousands or tens of thousands of software developers have been rolling out code assistants slowly and cautiously, which could be to their competitive detriment. Google, for one, says AI systems already generate over 25% of its new code.
What are enterprises waiting for? More evidence.
Enterprises need more evidence on the cause and effect of AI-augmented coding. Specifically, they are seeking:
ROI proof that the improved velocity and productivity justify the extra license costs.
Proof that tools like GitHub Copilot improve code quality (or don’t make it worse).
Faros AI, an engineering hub that helps enterprises navigate their AI transformation, conducted causal analysis research to definitively determine Copilot’s impact on code quality—research that can inform the strategy for integrating AI into developer workflows safely and confidently.
Where correlation and A/B testing can be misleading
While there have been some studies of the effects of Copilot using A/B testing and controlled experiments, the inherent variability in engineering teams, processes, and goals makes these studies very challenging.
In practice, most companies lack the sample size or the experimental discipline required to run a controlled study that accounts for the inherent complexity of engineering organizations and the biases at play, so simple correlations paint an incomplete picture.
Without a rigorous approach, the impact figures often cited can be inaccurate or downright misleading. However, by applying causal analysis, as we did at Faros AI, you can overcome the complexity to isolate the true downstream effects of Copilot on engineering workflows.
Indeed, any seasoned software engineering manager or developer will have strong beliefs, but causal analysis aims to back or challenge those beliefs with solid evidence and answer the question: "What is the measurable impact of Copilot on Pull Requests (PRs) and code quality?"
Causal analysis prevents three mistakes
Misunderstanding the cause and effect of Copilot can lead to three types of mistakes when adopting AI:
Moving too slow. Example: You run a small pilot but cannot prove Copilot is having a positive effect on productivity or quality. Mistake: You don’t distribute more licenses for months (or years?). Consequence: Most of your organization misses out on the benefits of AI-augmented coding, and you’re outpaced by competition.
Moving too fast. Example: You provide Copilot licenses to everyone. Mistake: You don’t address the changes Copilot is causing downstream in reviews, testing, or deployments. Consequence: Your coding velocity gains are erased, and the business sees no positive outcomes. You decide to divest from the tool.
Moving in the wrong direction. Example: You misattribute velocity improvements to Copilot when they are actually caused by something else. Mistake: You make incorrect assumptions in planning about your staffing, skills, and capacity. Consequence: You don’t allocate enough resources and miss critical business commitments.
Causal analysis helps prevent three mistakes in AI adoption
What is causal analysis?
Causal analysis, broadly speaking, aims to identify the direction and effect of one or more variables (or “treatments”) on an outcome (or “measure” or “metric of interest”).
Unlike traditional analyses that only look at correlations (e.g., “X and Y move together” or “X and Y don’t move together”), causality is concerned with cause and effect.
To illustrate:
Causality is about establishing “smoking causes lung cancer,” not the other way around.
Causality wants to ensure that no hidden external factors (confounders) contribute to the measured effect of the treatment or variable of interest.
Accounting for confounders is critical in developing reliable causal estimates of effect.
In the “smoking causes lung cancer” example, age can be a significant confounder in determining the effect of smoking on lung cancer. Older individuals have a higher baseline risk of developing various cancers, including lung cancer.
If older individuals also happen to smoke more frequently than younger individuals, a simple correlation analysis could mistakenly attribute the entire increase in lung cancer incidents to smoking alone. In reality, part of the observed “smoking effect” might be driven by age, an independent factor that increases lung cancer risk all by itself.
Isolating the causal impact of smoking must account for age
To properly isolate the causal impact of smoking, you have to account for age—meaning you statistically control for it or otherwise remove its influence, so you do not overestimate or underestimate smoking’s effect on lung cancer risk.
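To make the confounding problem concrete, here is a minimal, synthetic simulation; the numbers are invented purely for illustration. Age drives both smoking frequency and baseline risk, so a naive comparison of smokers and non-smokers overstates the smoking effect, while adjusting for age recovers an estimate close to the assumed true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic population: older people smoke more AND have a higher baseline risk.
age = rng.uniform(20, 80, n)
smoker = rng.random(n) < (0.1 + 0.004 * age)          # smoking prevalence rises with age
true_smoking_effect = 0.05                             # assumed causal effect on risk
risk = 0.001 * age + true_smoking_effect * smoker + rng.normal(0, 0.01, n)

# Naive comparison: difference in mean risk, smokers vs. non-smokers.
naive = risk[smoker].mean() - risk[~smoker].mean()

# Age-adjusted comparison: regress risk on smoking AND age together (with an intercept).
X = np.column_stack([smoker.astype(float), age, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, risk, rcond=None)
adjusted = coef[0]

print(f"naive estimate:    {naive:.3f}")     # inflated by the age confounder
print(f"adjusted estimate: {adjusted:.3f}")  # close to the assumed 0.05
```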
In standard correlation-based analytics, you might see a strong relationship between two variables, but you don’t necessarily know which (if either) drives the other. There might be a third variable lurking in the background that influences both.
Sussing out the nature and direction of the relationship between all the potential variables is both the power and the challenge in causal analysis.
Distinguishing correlation from causation
To define relationships between variables for causal analysis, you start by examining the variables and data. A necessary first step is making the data observable. If you cannot see and understand your data, you cannot use it.
Faros AI is a powerful tool for this task because it centralizes data from across the software delivery life cycle, standardizes it into a unified data model, and translates raw data into useful measurements with powerful visualizations.
Observing the data alone isn’t enough, however. Drawing conclusions from observation alone can lead to misinterpretation, as the following pitfalls illustrate.
Consider observing that "ice cream sales correlate with drowning incidents." This doesn't mean that purchasing ice cream directly causes drownings. Instead, both variables are influenced by a third factor: warm weather. During warmer months, people are more likely to buy ice cream and swim, which can lead to increased drowning incidents.
Similarly, correlations in data may be linked by external factors without a direct causal relationship between them.
Pitfall 3: Simpson’s Effect
Simpson’s Effect (or Simpson’s Paradox) is frequently discussed in data science interviews. It describes a situation where a trend or effect seen in several groups of data reverses when these groups are combined.
Consider the example of baseball players’ batting averages: A player might have a higher batting average than another player in two separate seasons; however, when you combine statistics across both seasons, the second player might end up with a higher overall batting average. The paradox occurs because the number of at-bats each player had in each season weights the combined averages differently.
Example of Simpson's Effect
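A quick calculation makes the reversal concrete. The snippet below uses the often-cited Jeter vs. Justice batting lines from 1995 and 1996 purely as an illustration of the arithmetic:

```python
# Illustrative numbers (the often-cited 1995 and 1996 Jeter vs. Justice case):
# player -> {season: (hits, at_bats)}
stats = {
    "Player A": {"season 1": (12, 48),   "season 2": (183, 582)},
    "Player B": {"season 1": (104, 411), "season 2": (45, 140)},
}

for player, seasons in stats.items():
    per_season = {s: round(h / ab, 3) for s, (h, ab) in seasons.items()}
    total_hits = sum(h for h, _ in seasons.values())
    total_abs = sum(ab for _, ab in seasons.values())
    print(player, per_season, "combined:", round(total_hits / total_abs, 3))

# Player B wins both seasons (.253 vs .250 and .321 vs .314),
# yet Player A wins overall (.310 vs .270) because of how at-bats are distributed.
```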
Determining causality in a dizzying engineering ecosystem
True—humans can sometimes rule out spurious, non-causal, and nonsensical explanations using common sense. But it’s also easy to be misled, like in the baseball example.
The classic way of avoiding these issues is to conduct an A/B test, where you only change the variable you care about and then directly measure what happens.
But how do you run an A/B test on an engineering organization where there is so much going on at any given moment that could be impacting performance?
Many simultaneous factors may impact engineering performance during Copilot A/B tests
In real-world engineering organizations, any analysis is complicated by the fact that engineers, teams, and projects are not all the same. They differ by:
Engineer-specific factors:
Seniority and experience level: Junior, mid-level, or senior, which could affect coding proficiency and tool utilization.
Tenure with the company: Experience specific to the company, which might impact familiarity with internal systems or codebases.
Technical skillset: Specific language proficiency and technical competencies, which can influence the ease of tool adoption.
Past performance metrics: Historical productivity and quality metrics.
Preference for tools: Personal affinity or bias towards using certain tools, including AI-assisted ones like Copilot.
Human factors: Individual variance in adaptability and learning new technologies. Current workloads, motivation, and stress levels.
Team-specific factors:
Team size: Smaller versus larger teams might have different dynamics and integration strategies.
Team composition: Diversity in roles and responsibilities, including the distribution of expertise.
Team communication and collaboration: Established protocols and efficiency in communication.
Cohesiveness and culture: How well team members work together and the team's openness to technology adoption.
Project-specific factors:
Project complexity: Varying levels of complexity can affect productivity metrics and tool effectiveness.
Project timeline: The urgency and deadlines can pressure tool usage and productivity outcomes.
Customer demands: Pressure from clients or stakeholders affecting team dynamics.
Organizational factors:
Resources and infrastructure: Availability of hardware, software, and network resources.
Company culture and policies: Attitudes towards innovation and tool adoption policies.
Support and training: Access to training programs and support for new tools.
Change management practices: How the organization manages transitions to new tools.
Repository and codebase factors:
Repository size and structure: Complexity and organization of the code repositories.
Historical code quality: Prior metrics on code quality within the repository.
Version control practices: How versioning and code changes are managed.
Process and workflow factors:
Development practices: Use of agile, waterfall, DevOps, or other methodologies.
Code review processes: Protocols for reviewing and approving code changes.
Testing and CI/CD pipelines: How testing and continuous integration are managed.
Industry trends: How industry-wide shifts influence tool adoption.
All of these differences can wreak havoc on attempts to compare one group of engineers (or teams) to another, especially if you ignore these differences when trying to measure a cause-and-effect relationship like “Does Copilot improve code quality?”
If you give Copilot to one team and not another and do a naïve comparison, you’re liable to see the effect of countless confounders. Perhaps Team A is more senior, or perhaps Team B is dealing with more urgent incidents. Without careful methods, you risk attributing differences to Copilot when something else is at play.
Causal analysis for Copilot’s impact on code quality
Given the complexity of engineering organizations—and the prevalence of confounders like seniority, repository specifics, or team composition—it becomes very important to use methods designed to tease out the real cause-and-effect relationship.
Specifically, we wanted to see:
Does GitHub Copilot improve code quality?
Does GitHub Copilot improve code coverage?
Does GitHub Copilot reduce code smells?
Does GitHub Copilot speed up PR reviews?
… Or does it have no effect at all?
Our data scientists are domain experts who are deeply knowledgeable about what engineering organizations measure and the factors likely to affect those measurements.
We chose to focus our causal analysis on quality metrics, not on engineering throughput or velocity. Why? In all the surveys our clients conduct and analyze with Faros AI, developers are bullish on Copilot, reporting significant time savings and high satisfaction.
However, the jury is still out on Copilot's impact on the quality of engineering work, which could be a blind spot for many organizations. If quality is left unchecked, there are many risks related to long-term maintainability, readability, and security.
When the data is messy (we do not have a perfect A/B test or do not have perfectly matched teams), standard correlation measures or simple machine learning models may produce biased or incorrect estimates.
Instead, we used a particular technique within causal analysis that avoids the need to map out every link (like seniority → code coverage → time-to-approve) but still helps us isolate the effect of Copilot usage. The technique is known as double (or “debiased”) machine learning.
Note: The sections below describe double machine learning and its application in gory detail.
If you don't care about the details and just want to know what we found, skip ahead to our results.
Applying double (debiased) machine learning to Copilot impact
Double (debiased) machine learning is a relatively new method in the fields of data science and causal inference (originally published around 2016). Its core concept is to combine machine learning models with a cross-validated approach to control for observable confounders. By “observable confounders,” we mean any measurable variables that might impact both the “treatment” (here, Copilot usage) and the “outcome” (like PR approval time).
Let’s define a few terms in this specific scenario:
Confounding Variables (Confounders): An engineer’s seniority, tenure, assigned repositories, programming languages, their team’s workload and composition specifics, the incident load and meeting load on author and reviewer, etc.
Treatment: The number of times Copilot was used during the creation of a PR.
Outcome (or Metric of Interest): PR Review Time, Code Coverage, PR Diff Size, Code Smells, etc.
Despite the name, double machine learning actually uses three ML models to arrive at a final estimate of the true causal effect (the “double” refers to partialing out both the treatment and the outcome via their residuals, the differences between predicted and actual values, which is typically accomplished with two main models). Here is the general structure:
Model One: Predict the outcome (i.e., PR review time, code coverage, etc.) from all the confounders.
Model Two: Predict the treatment (the Copilot usage frequency) from all the confounders.
Model Three: Take the residuals (the differences between the actual observed values and the model predictions) from Model One and Model Two. Whatever variation in Copilot usage the confounders cannot explain, and whatever variation in the outcome the confounders cannot explain, remains in those residuals; Model Three models how the two sets of residuals relate.
Double machine learning in the partially linear model (Source: Fuhr, Berens, and Papies)
Said differently:
If an engineer has a high predicted PR Review Time (based on their seniority, team info, etc.) but, in reality, their PR Review Time is shorter, that difference is a residual from Model One.
If an engineer has a high (or low) predicted Copilot usage (based on their seniority, team info, etc.) but, in reality, they used Copilot more or less than predicted, that difference is a residual from Model Two.
Model Three sees how the leftover “Copilot usage difference” explains the leftover “Review Time difference,” net of all the other variables. This is repeated across many cross-validation folds to reduce overfitting and produce an unbiased, causal estimate.
This procedure allows for separating out causal effects without needing to define the functional form of every relationship. This is important because, while it is clear that seniority influences review time, the relationship is unlikely to be linear or consistent across all companies. Defining the precise mathematical relationship between seniority and review time is complex, and accounting for every potentially significant variable is even more challenging. The ability to analyze these relationships without requiring exact predefined formulas is a significant advantage.
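For readers who want to see the mechanics, here is a minimal sketch of the residual-on-residual idea using scikit-learn and synthetic data. The arrays, the assumed effect size, and the hand-rolled cross-fitting are illustrative stand-ins; the analysis described in this post uses EconML's implementation rather than this simplified version.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Hypothetical inputs:
#   X: confounders per PR (seniority, tenure, repo, team workload, ...)
#   t: treatment per PR (number of Copilot accesses in the window)
#   y: outcome per PR (e.g., PR approval time in hours)
rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 10))
t = 2.0 * X[:, 0] + rng.normal(size=5_000)              # usage driven partly by confounders
y = 5.0 * X[:, 0] - 0.5 * t + rng.normal(size=5_000)    # outcome driven by confounders + treatment

# Model One / Model Two: cross-fitted predictions of outcome and treatment from confounders.
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, y, cv=5)
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, t, cv=5)

# Model Three: relate the residuals to each other.
y_res, t_res = y - y_hat, t - t_hat
effect = LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
print(f"estimated causal effect of one extra Copilot access: {effect:.2f}")  # roughly -0.5 here
```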
For the technique to work well, there are still some very important assumptions that must be met:
You have a good measure of your treatment.
Your data on outcomes of interest is good.
You are measuring and inputting all the potential confounding variables.
You have not included any bad controls (measures that are themselves results of your treatment) among your confounding variables.
Below, we break down how our analysis met these conditions.
Measuring treatment: How we measured Copilot usage
Challenge: The GitHub Copilot API does not provide fine-grained data on exactly when, within a PR’s code changes, Copilot was used.
Instead, we developed an approximate measure: the number of times Copilot was accessed in the time window around the moment the PR was marked “Ready for review” (7 days before and 3 days after). We chose this window by examining the median lead time for tasks across customers and selecting a window that covered most tasks’ coding time.
Challenge: The GitHub Copilot API provides information on when an engineer accessed Copilot; however, it does not provide fine-grained information about the code generated with Copilot.
For the purposes of this study, we assumed that the times Copilot was accessed correlated with how often the engineer was using it and made the decision to treat “the number of Copilot accesses” as our “treatment variable.” While this is an approximation, it is sufficient to capture how heavily an engineer was relying on Copilot for that PR.
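As a rough illustration of this window-based counting, the sketch below uses pandas with hypothetical table and column names (they are not the Faros AI schema):

```python
import pandas as pd

# Hypothetical inputs (not the Faros AI schema):
#   prs:   one row per PR with its author and "Ready for review" timestamp
#   usage: one row per Copilot access event, with the engineer and timestamp
prs = pd.DataFrame({
    "pr_id": [101, 102],
    "author": ["alice", "bob"],
    "ready_for_review_at": pd.to_datetime(["2025-02-10", "2025-02-12"]),
})
usage = pd.DataFrame({
    "engineer": ["alice", "alice", "bob", "bob"],
    "accessed_at": pd.to_datetime(["2025-02-05", "2025-02-11", "2025-01-20", "2025-02-13"]),
})

WINDOW_BEFORE = pd.Timedelta(days=7)   # window used in the article: 7 days before ...
WINDOW_AFTER = pd.Timedelta(days=3)    # ... and 3 days after "Ready for review"

def copilot_accesses(pr_row: pd.Series) -> int:
    start = pr_row["ready_for_review_at"] - WINDOW_BEFORE
    end = pr_row["ready_for_review_at"] + WINDOW_AFTER
    mask = (
        (usage["engineer"] == pr_row["author"])
        & usage["accessed_at"].between(start, end)
    )
    return int(mask.sum())

prs["copilot_accesses"] = prs.apply(copilot_accesses, axis=1)
print(prs)  # PR 101 -> 2 accesses in window; PR 102 -> 1 (the January access falls outside)
```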
In the future, we’ll enhance our calculations with detailed measurements extracted from the IDE itself, provided by Faros AI IDE extensions like this one, including how much code was created directly using Copilot as well as the pull requests, languages, and repos where the code was used.
Outcomes of interest: PR review time, code coverage, diff size, and code smells
As explained above, the causal analysis focused on Copilot’s impact on code quality because uncertainty about the downstream impacts is one reason organizations are rolling out the tool slowly.
In particular, we examined the effects of Copilot usage on:
PR Approval Time
PR Size (diff size)
PR Code Coverage
PR Code Smells
We chose these specific metrics mainly because PR data is typically very complete, high-quality, and highly indicative of how engineering organizations operate. If there’s no negative outcome in code quality metrics, that is highly reassuring for widespread usage.
Feature engineering and capturing confounders
To maintain a robust and ongoing double machine learning causal analysis, it is essential to continuously capture all relevant inputs that might affect the outcome or treatment. Excluding critical confounders (or including variables that are effects of your treatment) can lead to overestimating or underestimating the real effect of Copilot.
Our automated machine learning workflow leverages a library called Featuretools to periodically generate features (variables) for data within the Faros AI standard schema. As Faros AI continually ingests data from a range of engineering tools and normalizes it, we've established a general approach to Copilot analysis that 1) applies universally across all our customers without custom feature engineering and 2) provides a comprehensive set of features for our analysis.
Recognizing that not all Faros AI customers immediately integrate the complete range of engineering data sources (such as calendar information or deployment data) alongside common sources like version control (e.g., GitHub), ticket tracking (e.g., Jira), or incidents (e.g., PagerDuty), our feature definitions are made robust against missing information. This ensures that analyses remain insightful even when some confounding variables are absent. However, it may occasionally lead to overestimations of the effects of Copilot (these estimations will still be more accurate than looking at the completely uncorrected data).
The features provide an ongoing comprehensive view of activities surrounding the creation and review of each PR, accounting for everything from authors and reviewers to repository and team information.
Simplified entity relationship diagram (ERD) for ML features generated using the FarosAI standard schema
These features were meticulously curated over repeated analyses to ensure none are downstream effects of Copilot usage itself. When you include downstream effects (often called “bad controls”), you can distort the results and underestimate Copilot’s true effect.
We ran several sensitivity analyses across all customers, examining feature importance and effect size to remove any features that, upon reflection, were likely downstream effects.
For example, we removed incident assignments to pull request authors in the post "Ready for review" period from the feature set because they tended to be suspiciously predictive of code coverage, likely indicating that if you don't test your code properly and it causes an incident, you will be the one assigned to fix it.
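To give a feel for the automated feature generation step described above, here is a minimal Featuretools sketch over two hypothetical tables. The dataframes, columns, and primitives are illustrative assumptions and far simpler than the Faros AI standard schema; the call pattern follows the Featuretools 1.x API.

```python
import pandas as pd
import featuretools as ft

# Hypothetical, simplified tables (not the Faros AI standard schema).
authors = pd.DataFrame({
    "author_id": [1, 2],
    "seniority_years": [7, 1],
    "team": ["payments", "platform"],
})
pull_requests = pd.DataFrame({
    "pr_id": [101, 102, 103],
    "author_id": [1, 1, 2],
    "created_at": pd.to_datetime(["2025-02-01", "2025-02-08", "2025-02-09"]),
    "diff_size": [120, 480, 60],
})

es = ft.EntitySet(id="engineering")
es = es.add_dataframe(dataframe_name="authors", dataframe=authors, index="author_id")
es = es.add_dataframe(
    dataframe_name="pull_requests", dataframe=pull_requests,
    index="pr_id", time_index="created_at",
)
es = es.add_relationship("authors", "author_id", "pull_requests", "author_id")

# Deep feature synthesis: automatically derives per-PR features such as the
# author's seniority and aggregates over the author's other PRs.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="pull_requests",
    agg_primitives=["count", "mean"],
)
print(feature_matrix.head())
```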
Model selection and hyperparameter tuning
With a validated workflow for feature creation in place, our model selection process rigorously identifies the best architecture for each model (outcome, treatment, and final residual correlation) across our customers' data.
Using scikit-optimize for Bayesian optimization, we periodically (every few months) retune hyperparameters for, and select among, a variety of scikit-learn tree-based models (e.g., RandomForest, HistGradientBoosting, and ExtraTrees). This ensures that model selection stays optimized for each customer's evolving dataset within the double machine learning empirical assessments.
All of the models we evaluate in this model selection step are tree-based. Tree-based models are particularly well suited to double machine learning applications because they naturally capture nonlinear relationships and interactions among variables (such as “Engineer with 3 years of experience + Java usage + a busy schedule may behave differently than a brand-new engineer working in Python,” etc.). This allows the models to capture the complexity of the interactions among confounding variables without needing to explicitly define the relationships between them.
After models and hyperparameters are established, the EconML non-parametric double machine learning model is applied to each organization's data. Predictions, feature importance, and final outcomes undergo rigorous verification, including repeated sensitivity analyses and model quality ratings. Models that do not meet established quality standards for a customer (determined by the R-score metric) are systematically excluded from causal effect calculations.
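As a rough sketch of how these pieces can fit together, the snippet below pairs scikit-optimize's BayesSearchCV with EconML's NonParamDML on synthetic data. The search space, data, and model choices are illustrative assumptions, not Faros AI's production configuration.

```python
import numpy as np
from econml.dml import NonParamDML
from sklearn.ensemble import RandomForestRegressor
from skopt import BayesSearchCV

# Hypothetical arrays (same roles as in the text):
#   W: confounders, T: Copilot accesses per PR, Y: outcome (e.g., code coverage)
rng = np.random.default_rng(2)
W = rng.normal(size=(3_000, 8))
T = 1.5 * W[:, 0] + rng.normal(size=3_000)
Y = 3.0 * W[:, 0] + 0.2 * T + rng.normal(size=3_000)

# Step 1 (sketch of the tuning step): Bayesian hyperparameter search for a
# tree-based nuisance model, in the spirit of the scikit-optimize setup described above.
search = BayesSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": (100, 400), "max_depth": (3, 12)},
    n_iter=10, cv=3, random_state=0,
)
search.fit(W, Y)
nuisance_model = search.best_estimator_

# Step 2: non-parametric double machine learning with cross-fitting.
est = NonParamDML(
    model_y=nuisance_model,
    model_t=RandomForestRegressor(n_estimators=200, random_state=0),
    model_final=RandomForestRegressor(n_estimators=200, random_state=0),
    cv=5,
)
est.fit(Y, T, X=W[:, :2], W=W)      # X: features the effect may vary by; W: controls only
print("average treatment effect:", est.ate(X=W[:, :2]))  # roughly 0.2 in this toy setup
```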
Causal analysis results - Does GitHub Copilot improve code quality in PRs
At the conclusion of this analysis, after examining results across ten companies using Faros AI to navigate GitHub Copilot adoption and optimization, we found that, overall, there are no significant negative effects of Copilot usage on code quality at the organizational level.
In other words:
PR Approval Time was not adversely affected by Copilot usage.
PR Size did not experience detrimental changes.
Code Test Coverage was not decreased.
Code Smells did not increase.
These important findings indicate that, overall, engineers can continue using Copilot without worrying about a decline in key code quality indicators. Engineers themselves widely report that Copilot helps them move faster, and from a broader organizational perspective, this doesn’t appear to happen at the expense of code quality.
That’s it? Are we done? No. Because that result was only the overall average.
Some teams saw a positive impact on quality, and others saw a negative impact.
Team coding practices are a major factor in AI-generated code’s quality
While overall there was no significant negative impact on PR quality metrics (a very encouraging finding), individual teams within organizations did show varying levels of effects of Copilot usage on these metrics.
For example, some teams' PR Code Coverage decreased, while for others, PR Diff Sizes increased.
That is where a team-by-team analysis is paramount. We all know that one bad outcome in a highly sensitive area of the code base can reverberate across the entire organization, undermining confidence in AI even when that loss of confidence isn’t fully justified. Freezes, moratoriums, and limitations on Copilot usage may follow in an effort to prevent similar incidents.
What can you do?
As soon as adoption begins, every engineering manager should review their team’s own causal analysis results and improve key processes around code review and code quality to mitigate any negative impact.
It is always prudent to consider the team’s context and adapt accordingly:
Are teams tackling exceptionally difficult or domain-specific issues?
Are new developers relying too heavily on AI-suggested code and missing some deeper design best practices?
Does the team have good testing and review practices in place? How does introducing Copilot affect these practices?
Conclusion: Reassuring news for Copilot adoption
Overall, it is reassuring to discover that engineers can use Copilot without creating a nightmare for the next person to maintain the code. In large engineering organizations, this is a major concern when adopting new tools: “Will it scale? Will it degrade code quality?”
This causal analysis research, powered by double (debiased) machine learning techniques, strongly suggests that, at this point in time, Copilot is overall safe from a code-quality perspective.
If your organization has flexible budgets, investing in Copilot licenses can accelerate development speed while maintaining quality metrics like code coverage and PR review time. Combined with ongoing qualitative feedback from engineers, this is a promising result for broad adoption.
However, it's important to recognize that with evolving technology and practices, the impact on code quality could change. Thus, using a tool like Faros AI to monitor code quality remains essential. Currently, Faros AI's analysis indicates that using coding assistants does not detrimentally impact code quality, but this could shift with wider adoption, increased reliance on the technology, or technological evolution.
Furthermore, any teams with evident negative outcomes could greatly benefit from investigating how Copilot is being used. Are individuals ignoring lint or code smell warnings? Are they relying heavily on generic snippets that clash with architectural constraints? Analyzing these outliers can help refine best practices, allowing teams to implement necessary process changes and share successful strategies to mitigate issues.
For a conversation with a Copilot adoption expert, request a meeting with Faros AI.
Leah McGuire
Leah McGuire has spent the last two decades working on information representation, processing, and modeling. She started her career as a computational neuroscientist studying sensory integration and then transitioned into data science and engineering. Leah worked on developing AutoML for Salesforce Einstein and contributed to open-sourcing some of the foundational pieces of the Einstein modeling products. Throughout her career, she has focused on making it easier to learn from datasets that are expensive to generate and collect. This focus has influenced her work across many fields, including professional networking, sales and service, biotech, and engineering observability. Leah currently works at Faros AI where she develops the platform’s native AI capabilities.