How GitHub Copilot Fixes Flaky Tests in CI

A step-by-step example of GitHub Copilot fixing a flaky test: analyze logs, propose a PR, validate the solution.


I recently hit one of the most frustrating problems in software development: a flaky test. Flaky tests break trust in continuous integration (CI) pipelines and slow down developers. Instead of debugging it myself, I asked GitHub Copilot to fix it. 

How can GitHub Copilot fix a flaky test?

GitHub Copilot can fix flaky tests because it has access to the codebase, CI logs, and failed runs. All you need to do is direct it to the failure.

Steps Copilot took:

  1. Analyzed the CI logs → identified the race condition causing the flakiness
  2. Proposed a pull request with the fix
  3. Validated the fix → I ran the test 100 times with Copilot’s fix (100/100 passed) vs. without it (~23/100 passed)
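The validation step above can be reproduced as a simple stress loop. The sketch below is illustrative only: `check` is a stand-in for the real test body, which the post doesn't show.

```javascript
// Stress-test a possibly flaky check by running it many times and
// reporting the pass rate, mirroring the 100-run validation above.
// `check` is a placeholder for the real test body (not shown in the post).
function stressTest(check, runs = 100) {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    try {
      check();
      passes += 1;
    } catch (err) {
      // A throw counts as one failed run.
    }
  }
  return `${passes}/${runs} passed`;
}

// With a deterministic (fixed) test body, every run passes:
console.log(stressTest(() => {})); // → "100/100 passed"
```

Running the same loop before and after a fix gives the kind of before/after pass rates (≈23/100 vs. 100/100) described above.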

The flaky test hasn’t reappeared since merging the fix.

Why use Copilot for flaky tests?

  • Saves developers time by skipping manual debugging
  • Provides reproducible validation (stress-testing the fix)
  • Improves CI reliability and developer confidence

This example shows how GitHub Copilot can diagnose and repair flaky tests automatically, turning a frustrating CI failure into a quick success. Watch the video below for a full walkthrough:

<iframe width="445" height="791" src="https://www.youtube.com/embed/inYn4Os9zMU" title="How GitHub Copilot (Agent) Helped Me Fix Flaky Tests &amp; Unreliable CI - Experience Report | Faros AI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Full Transcript: Using GitHub Copilot to fix flaky tests

“Today I want to tell you about a pretty nice success story that I had with GitHub Copilot. 

I merged some code the other day, and after a while, I got an email from the continuous integration saying that one of the tests had failed. 

When I looked into that test failure, I realized that the test that was failing was completely unrelated to the change that I had made. So this seemed to indicate that this test was flaky.

So I just figured, hey, since GitHub Copilot should have access to the logs in this continuous integration run and the code itself, maybe I just put the link to the failed action here and I just simply said, hey, investigate this possibly flaky test. And I just went on to do whatever I was doing that day.

I came back and, to my very positive surprise, GitHub Copilot had identified the root cause of the flakiness and had proposed a fix. So I told it to run the flaky test 100 times. It set up three validation scenarios and ran each 100 times, getting a 100% success rate. That was very promising.

Just to be super sure, I then told GitHub Copilot to run the flaky test without the fix to get the success rate before the fix. So it did the same thing, it ran the test 100 times and it got a success rate of 23%. As you know, this is very bad for developer happiness—when you're trying to merge your code and have to retry and retry and retry.

I took a look at the fix and indeed it had to do with how to handle the fake timers and the real timers in the unit test framework that we use, which is kind of not trivial to fix. 

So I was very pleased that Copilot, without any back and forth, was able to fix my problem, and we haven't heard from this flaky test since.”
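The fake-vs-real-timer hazard mentioned in the transcript can be sketched with a toy fake clock. Everything here is an illustrative assumption: the post doesn't name the actual test framework, and `FakeClock` is a hypothetical stand-in for the fake timers that frameworks like Jest or Sinon provide.

```javascript
// Minimal fake-timer sketch (hypothetical; real frameworks provide a far
// more complete version). It illustrates the hazard from the transcript:
// while fake timers are installed, a scheduled callback never fires on
// its own, so a test that waits for it without advancing the clock can
// hang or fail intermittently.
class FakeClock {
  constructor() {
    this.pending = [];
  }
  setTimeout(fn, ms) {
    this.pending.push({ fn, ms });
  }
  // Fire every callback due within `ms` of fake time (simplified).
  advance(ms) {
    const due = this.pending.filter((t) => t.ms <= ms);
    this.pending = this.pending.filter((t) => t.ms > ms);
    due.forEach((t) => t.fn());
  }
}

const clock = new FakeClock();
let done = false;
clock.setTimeout(() => { done = true; }, 50);

// Flaky version: asserting `done` here finds it still false, because
// nothing has advanced the fake clock.
// Fixed version: advance the clock explicitly before asserting.
clock.advance(50);
console.log(done); // → true
```

The general fix pattern is to make the test deterministic: either drive the fake clock explicitly, or restore real timers before awaiting genuinely asynchronous work, so the test never races the scheduler.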

Ending flaky test frustration with GitHub Copilot

Flaky tests used to mean lost hours, broken momentum, and eroded trust in your CI pipeline. With GitHub Copilot and similar AI coding tools, they become just another problem AI can tackle quickly and reliably, keeping developers moving forward.

For a deeper dive into the hidden costs of flaky tests and why it’s worth investing in fixing them, my colleague at Faros AI, Ron Meldiner, wrote a must-read article on the topic.  

If you’re interested in broader perspectives on AI in software development, I frequently publish my thoughts on AI and share hands-on experiences with AI coding tools. Follow me on LinkedIn for more tips on using AI coding agents.

Yandry Perez Clemente


Yandry Perez is a senior software engineer at Faros.
