What is the Change Failure Rate and How do I measure it?
A comprehensive guide on "Change Failure Rate", one of the 4 key DORA Metrics. Read on to learn all about it and how to measure Change Failure Rate.
May 7, 2022
DevOps adoption is growing rapidly, partly because of the increasing demand for lightning-fast business services. In 2019, a Harvard Business Review Analytics Services survey showed that 77% of its 654 respondents had implemented or planned to adopt DevOps.
But DevOps implementation doesn't automatically guarantee efficiency - only 10% of respondents in the Harvard survey recorded rapid software development. This is why you must track the performance of the software you release using the Change Failure Rate (CFR).
CFR is a DevOps Research and Assessment (DORA) metric that measures the share of changes that fail after they reach production. In this article, you’ll learn how to evaluate the change failure rate.
What is the change failure rate?
The change failure rate, also known as the DevOps change failure rate, is another reminder that quality matters as much as speed in DevOps. It measures the quality and stability of your software updates.
Technically, CFR measures how often changes lead to failures in production. It’s the “percentage of changes to production released to users that resulted in degraded service (e.g., led to service impairment or service outage) and subsequently require remediation (e.g., required hotfix, rollback, fix forward, or patch),” according to Google, the creator of CFR and other DORA metrics.
There are many errors engineers catch before deploying code. But CFR is strictly limited to the failures you have to fix after a change reaches production. Pre-deployment errors don't count.
Why and how to measure the change failure rate
Imagine your users always experience downtime while using your service. That's bad for your business. Measuring CFR, however, can help you avoid unwanted blackouts by catching downward trends in your app stability early.
Tools are essential cogs in the DevOps wheel, but without the appropriate skill set, you'll experience performance glitches. The CFR metric helps you evaluate the technical capabilities of your software development team and the overall stability of what it ships. For instance, a higher failure rate (16% and above) suggests you have an error-prone deployment process or an inefficient testing phase. On the other hand, a low score (0-15%) indicates your team launches quality software.
Launching error-free code is good software practice. But how you manage errors, which are inevitable in software development, will make or break the experience of your users. Rod Powell, Senior Manager at CircleCI, corroborates this stance. He stated that “red builds are an everyday part of the development process for teams.” Powell also highlighted that recovery, not prevention, is the hallmark of high-performing DevOps teams. “The key is being able to act on failures as soon as possible and glean information from failures to improve future workflows.”
The DevOps CFR metric answers Powell’s suggestion about acting on failures. It turns failure into success for improved business outcomes. This is why the DevOps change failure rate is among the most tracked DORA metrics, alongside deployment frequency, according to the LeanIX State of Developer Experience Survey 2022.
But how do you evaluate the DevOps change failure rate? Start by defining the parameters below:
- the number of deployments or releases you made.
- the number of fixes you made after deployment.
- the number of failed changes that caused an incident or a failure.
CFR is the ratio of the number of failed changes to the total number of deployments, expressed as a percentage.
CFR (%) = (# of change failures / total # of deployments) × 100
For example, if you have 33 failures from 100 deployments during 3 months, your CFR score is 33/100 = 33%.
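The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `change_failure_rate` and its parameter names are assumptions made for this example.

```python
def change_failure_rate(failed_changes: int, total_deployments: int) -> float:
    """Return the CFR as a percentage of deployments that caused a failure."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_changes <= total_deployments:
        raise ValueError("failed_changes must be between 0 and total_deployments")
    # Multiply first so that whole-number percentages stay exact.
    return 100 * failed_changes / total_deployments


# The example from the text: 33 failures out of 100 deployments in 3 months.
print(change_failure_rate(33, 100))
```

The input checks are there because a CFR is only meaningful when at least one deployment happened and when failures are counted per deployment, not per alert.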
What is a good failure rate?
State of DevOps Report 2022 change failure rate. Source: Google
According to the 2022 State of DevOps report, high-performing teams typically have a low CFR score (0%-15%), average teams achieve medium scores (16%-30%), and low-performing teams have high scores (46%-60%).
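As a rough sketch, a CFR score can be mapped to a performance tier in code. The bands used here are 0-15% for high performers, 16-30% for medium, and 46-60% for low. The names `BANDS` and `performance_tier` are hypothetical, and because the bands are not contiguous, scores that fall between them are deliberately left unclassified instead of being guessed.

```python
from typing import Optional

# Inclusive CFR bands in percent, keyed by team performance tier.
BANDS = [
    ("high", 0.0, 15.0),
    ("medium", 16.0, 30.0),
    ("low", 46.0, 60.0),
]


def performance_tier(cfr_percent: float) -> Optional[str]:
    """Return the tier whose band contains the score, or None if no band does."""
    for name, lower, upper in BANDS:
        if lower <= cfr_percent <= upper:
            return name
    return None  # e.g. 31%-45% or above 60%: not covered by the bands


for score in (10, 20, 33, 50):
    print(score, performance_tier(score))
```

Returning `None` for uncovered scores keeps the sketch honest about what the bands say; how to treat those scores is a choice each organization has to make.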
The lower the score, the better the software delivery performance. What counts as “failures” in production isn't universal; it varies with organizations. Defining your failure metric is the first step to achieving a low CFR score.
Generally, a failure is a change you had to roll back or otherwise remediate after deployment. Conversely, not all post-deployment incidents are change failures. Only changes you make that cause downtime or impact application availability count toward the CFR. Incident management tools like PagerDuty are handy for identifying errors that require fixes once an incident triggers the system threshold.
Common mistakes when measuring change failure rate
Zero failure is the ideal target for high-performing DevOps teams. However, a zero change failure score is impractical. To have a low CFR score, avoid these common errors:
- Classifying every failure as a change failure
Not every incident that caused an error is due to the changes you made. Failures or incidents from cloud providers or end-users don’t count toward the CFR. So, always investigate the source of incidents before you attribute them to a change.
- Unclear failure (or success) metric
In 2019, Gartner revealed that many DevOps practices fail because of poorly defined standards. Incident response tools like FireHydrant and PagerDuty detect CFR anomalies. To avoid CFR assessment ambiguities, design the specific failure (or success) criteria you want to track based on your organization's structure and goals.
- Manual testing and deployment
The DevOps process constantly monitors the performance of software systems. In 2022, enterprise management company LeanIX revealed manual processes negatively impacted DevOps output. Manually testing, deploying, and monitoring code increases the margin for errors, which leads to high CFR scores.
- Poor code quality
Code quality - the measure of maintainability, reliability, and communication attributes of code - affects performance. Poorly written code is less reliable and more prone to bugs. It’s also difficult to read, understand, and modify. A lack of standard documentation practice causes poor code quality. Similarly, poor organizational architecture contributes to poor code quality.
- Measurement errors
DevOps needs automation as much as humans need air. But DevOps tools also require hands-on monitoring to flag errors. For instance, some tools mistake a failure in the Build phase of the CI/CD pipeline for a change failure. You'll have incorrect CFR scores without a human-in-the-loop for incident assessments.
- Not considering the time interval
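The DevOps CFR metric is a function of time: the same team can score differently over a week, a month, or a quarter. Omitting the measurement window, or changing it between evaluations, will give inaccurate results.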
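Two of the mistakes above, counting every incident and ignoring the time interval, can be avoided together in code. The sketch below is only an illustration under stated assumptions: the `Deployment` and `Incident` records, their fields, and the `windowed_cfr` function are hypothetical and are not the data model of PagerDuty, FireHydrant, or any other tool. It assumes each incident has already been investigated and linked to the deployment that caused it, or to none.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deployment:
    deploy_id: str
    deployed_on: date


@dataclass(frozen=True)
class Incident:
    summary: str
    # ID of the deployment found to have caused the incident, or None when
    # the investigation pointed elsewhere (cloud provider, end-user error).
    caused_by_deploy: Optional[str]


def windowed_cfr(
    deployments: Iterable[Deployment],
    incidents: Iterable[Incident],
    start: date,
    end: date,
) -> Optional[float]:
    """CFR in percent for deployments made between start and end, inclusive."""
    in_window = {d.deploy_id for d in deployments if start <= d.deployed_on <= end}
    if not in_window:
        return None  # no deployments in the window, so the rate is undefined
    # A set counts each failed deployment once, however many incidents it raised.
    failed = {i.caused_by_deploy for i in incidents if i.caused_by_deploy in in_window}
    return 100 * len(failed) / len(in_window)


deployments = [
    Deployment("d1", date(2022, 1, 5)),
    Deployment("d2", date(2022, 2, 10)),
    Deployment("d3", date(2022, 3, 20)),
    Deployment("d4", date(2022, 4, 2)),  # outside the Q1 window
]
incidents = [
    Incident("checkout errors after release", "d2"),
    Incident("cloud provider outage", None),  # not a change failure
    Incident("second alert for the same release", "d2"),  # counted once
    Incident("login regression", "d4"),  # deployment outside the window
]
q1_cfr = windowed_cfr(deployments, incidents, date(2022, 1, 1), date(2022, 3, 31))
print(f"Q1 CFR: {q1_cfr:.1f}%")
```

In this sample data, one of the three first-quarter deployments caused a failure, so the script reports a CFR of about 33.3% for the quarter. The provider outage and the duplicate alert do not inflate the score.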
To avoid mistakes, implement the practices listed below.
- Quality Assurance (QA) is your friend: Code quality plays a positive role in achieving a low CFR metric. The better the code quality, the lower the chances of recording errors during production. To produce quality code, QA must be your constant ally. You must constantly, and comprehensively, test your code before sending it out.
- Measure other DORA metrics: DORA metrics aren't just about frequency and speed; they're about creating a disciplined process for quality output. Bryan Finster, VP at Rw Baird - in an article he wrote for the Faros AI blog - believes the CFR and the other three DORA metrics (deployment frequency, lead time for changes, and time to restore service) are interconnected. Measuring all the metrics gives a comprehensive overview of the changes you need to make.
- Apply context to CFR metric analysis: CFR scores may be misleading in some situations. For instance, your CFR metric will be inaccurate if you have incomplete data about the errors and the changes you implemented. Furthermore, skewed sample analysis, such as measuring only high-risk changes, affects CFR scores. It's best not to draw too many conclusions from standalone CFR scores.
How to reduce the change failure rate
Tools are a mainstay of DevOps practices. But using too many tools affects incident management, leading to communication dilemmas among employees. Transposit's 2022 State of DevOps survey supports this position: 45.2% of the respondents highlighted disparate tools as a stumbling block toward swift incident management.
But Faros AI can solve the multiple tool dilemma. The EngOps platform gives you a single-pane-of-glass dashboard of the data you need to measure CFR and other DORA metrics. Other ways you can improve your CFR are highlighted below:
- Remove structural barriers that impede communication and collaboration
In 2019, George Spafford—Senior Director Analyst at Gartner—said in a blog that “people-related [and process] factors tend to be the greatest challenges—not technology.” Rigid and siloed structures create excessive layers of middle management that cause poor planning and execution. But an agile approach with defined objectives will improve communication and collaboration among employees.
- Implement Pull Request (PR) review
“Prevention is better than cure” is a cliche that applies to CFR assessment. You can start error prevention by reviewing code before production. Also known as merge requests, PRs let reviewers assess written code before it is sent to production. The review process removes defective code. PR reviews don’t reveal the impact of code in production, but they're useful for risk assessment.
Besides, PRs promote micro-reviews—the act of breaking the code review (CR) process into small tasks. It helps developers work on small and self-contained changes. Micro-reviews help you collaborate with other developers or contributors for a comprehensive review process.
So, what's the best size for micro-reviews? US-based big data analytics company Palantir summarized the best approach: “If a CR makes substantive changes to more than ~ 5 files, takes longer than 1-2 days to write, or would take more than 20 minutes to review, consider splitting it into multiple self-contained CRs.”
- To automation, add human evaluation
Your chances of identifying and modifying errors without automated tools are low. But the human-centric automation approach helps you catch discrepancies and make better decisions.
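The review-size heuristic quoted above (about 5 files, 1-2 days to write, 20 minutes to review) could be turned into a simple automated check that a human reviewer then confirms. This is a sketch under assumptions: the `ChangeRequest` record, the threshold constants, and `should_split` are names invented for this example, and the upper end of the "1-2 days" range is used as the writing-time limit.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    files_changed: int               # files with substantive changes
    days_to_write: float             # time the author spent on the change
    estimated_review_minutes: float  # reviewer's estimate

# Thresholds taken from the quoted guideline; 2 days is an assumed reading
# of "longer than 1-2 days".
MAX_FILES = 5
MAX_DAYS_TO_WRITE = 2
MAX_REVIEW_MINUTES = 20


def should_split(cr: ChangeRequest) -> bool:
    """Return True if any one of the size limits is exceeded."""
    return (
        cr.files_changed > MAX_FILES
        or cr.days_to_write > MAX_DAYS_TO_WRITE
        or cr.estimated_review_minutes > MAX_REVIEW_MINUTES
    )


small = ChangeRequest(files_changed=3, days_to_write=1, estimated_review_minutes=15)
large = ChangeRequest(files_changed=8, days_to_write=1, estimated_review_minutes=10)
print(should_split(small), should_split(large))
```

A check like this only flags candidates for splitting. Whether a flagged change really should be split remains a judgment call for the author and reviewer.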
Final thoughts on the change failure rate
“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.”
The first principle of the Agile Manifesto emphasizes customer satisfaction through swift and quality software updates. The change failure metric brings you closer to achieving the goal. Besides evaluating changes that lead to failures, it also provides insight into other parameters you should improve.
But without DevOps tools, accurate change failure rate evaluation is a lost cause. Faros AI provides automatic connections to 70+ data sources like PagerDuty, GitHub, Jira, etc., for comprehensive analysis. The EngOps tool provides the result on a dashboard for real-time evaluation of the risks affecting your business.