How Coursera scales world-class engineering operations to unlock developer productivity
We sat down with Mustafa Furniturewala, VP of Engineering at Coursera, to talk about all things developer productivity. Today, Coursera is known not only for democratizing access to a world-class education, but also for its elite software engineering brand. So we were very excited to discuss how this elite organization manages its software engineering operations. Mustafa leads the Core Product, Enterprise and Degrees team at Coursera, and has seen the company grow from 40 engineers to over 300 engineers in the last 8 years. With this growth has come the usual challenges.
Q. Tell us more about your role at Coursera.
A. I lead the Core Product, Enterprise and Degrees team at Coursera. This includes the in-course learner experience as well as the Partner side responsible for creation of content on the platform. The team is responsible for driving learner engagement on the platform, and driving revenue for Coursera.
Q. You’ve seen the company grow from 40 engineers to over 300 engineers in the past 8 years. What are some of the challenges you’ve faced with scaling your engineering operations at different stages of growth?
A. In the early stages of Coursera, we wanted to iterate as fast as we could to get to product-market fit. Fortunately for us, we had a few bets that paid off. This led to the next growth challenge which was rapidly hiring to scale the team, and hardening the platform to be enterprise-grade. We expanded to Toronto during this phase. The next challenge we faced was scaling our communication and information-flow practices as we grew to over 200 in Engineering. We are now in the phase where we want to make sure we are able to gain as much leverage as we can in the organization, so our learners and partners can see the maximum benefit.
Q. And what are some of the changes you instituted to scale the information flow?
A. We invested heavily in onboarding and documentation, including service and product documentation. We also quantified ownership and built a metadata service that became a source of truth for information about teams and services. This allows us to scale ownership and collaboration. We invested in a lot of tools to enable retrospectives and Q&A in a remote world. We are currently piloting Stack Overflow for our teams so there’s a knowledge base for all those questions that repeatedly get asked and answered on Slack. We invested in our OKR process, using BetterWorks to bring transparency to organizational and individual OKRs. We also built out product operations and engineering operations teams. The product operations team figures out how we collaborate on OKRs, the cadence of OKRs, what items are at risk and so forth. The engineering operations team helps coordinate major cross-team engineering projects.
Q. Were there any unique challenges that stemmed from the acceleration of remote work due to the pandemic?
A. One of the unique challenges has been enabling the team to continue to have the collective serendipity that leads to creativity and innovation. This is because of the lack of effective whiteboarding tools and reduced opportunities for cross-team interactions and knowledge sharing. We’ve tried a couple of different things to overcome this. Every month, we have an Engineering townhall, where we dedicate 45 minutes to just Q&A. We’ve also been intentional about organizing cross-team Zoom events, happy hours, and “make-athons” to create opportunities for those serendipitous moments. We did try some things that didn’t quite work. An example was a virtual office tool called Gather, but that was just yet another thing that people had to log onto.
Q. Do you have a central developer productivity team? At what stage did you decide that such a team was necessary? And what was its scope?
A. Yes, we’ve always invested in developer productivity. We had a dedicated team once we grew to about 100 people in Engineering. At the time, we were moving from a monolith to microservices with a decentralized deploy culture. We didn’t want every team to build and maintain their own CI/CD pipelines. So this team was responsible for setting up CI/CD processes with the goal to empower developers to be able to ship to production at any point. The “main” branch is always considered something that is ready for deployment by the team and we avoid having any other long-lived branches. This team is also responsible for front-end infrastructure, including Puppeteer – our end-to-end testing framework.
Q. What were some big wins for the developer productivity team?
A. A big win has been keeping time-to-deploy at under 30 minutes, while keeping our change failure rate low. At some point we were seeing a lot of critical bugs. The team put automated pre-deploy checks in place: end-to-end tests, unit tests, linters to catch non-browser-compatible APIs, etc. This brought down P0/P1s by 70% and enabled us to meet our availability goals.
Q. So how do you measure developer productivity? What metrics have you found to be the most meaningful measures? What are some bad measures?
A. For measuring developer productivity, it’s important to not look at just one signal but rather have a holistic view that looks at developer activity but also other important metrics like developer satisfaction and the efficiency of flow of information in the organization. The DORA and SPACE frameworks are good starting points. At first, we started by measuring completion of our OKR commitments. The challenge with that was that every project was unique and had different characteristics as it pertains to ambiguity, complexity etc. We then shifted to using DORA metrics so that we could measure units of work that lead to larger projects. We would also like to start tracking the ratio of microservices to engineers, alerts to engineers, distribution of seniority across teams, and so forth to get a sense of how overwhelmed some teams might be. We already measure engagement and other metrics within the organization with an Employee Pulse Survey.
Q. What are some of the challenges in gathering all these metrics? How have you overcome them?
A. For DORA metrics, the challenge was that instrumenting and querying our CI/CD data with our existing tools (log analytics or monitoring) was difficult and time-consuming. We built out dashboards on Sumo Logic that were error-prone and slow. This is where we decided to pilot Faros AI for an out-of-the-box solution that also provided the flexibility and customizability that we need, and we are now rolling it out to the organization.
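To make the DORA measurements mentioned here concrete, the following is a minimal Python sketch of how three of them (deployment frequency, lead time for changes, and change failure rate) could be derived from deployment records. It is an illustration only, not Coursera's or Faros AI's implementation: the `Deployment` record, its fields, and the `dora_summary` function are hypothetical names, and it assumes each deployment has already been joined with its commit timestamp and a flag for whether it caused a failure in production.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Deployment:
    """Hypothetical record for one production deployment."""
    commit_time: datetime   # when the change was committed
    deploy_time: datetime   # when the change reached production
    caused_failure: bool    # whether it led to an incident or rollback


def dora_summary(deployments, window_days):
    """Summarize three DORA metrics over a reporting window.

    `window_days` is the length of the reporting window in days
    (an assumption of this sketch; real tools infer it from the data).
    """
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    # Lead time for changes: commit to production, in seconds.
    lead_times = [
        (d.deploy_time - d.commit_time).total_seconds() for d in deployments
    ]
    failures = sum(1 for d in deployments if d.caused_failure)

    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times) / 3600,
        "change_failure_rate": failures / len(deployments),
    }
```

In practice the hard part is the one described in the answer above: collecting and joining commit, build, deploy, and incident data reliably. Once the records exist in one place, the arithmetic itself is small.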
Q. What are some interventions that have really moved the needle on developer productivity at Coursera?
A. We derived a lot of leverage from moving to a more open source tech stack, and moving from Scala to Java/Spring Boot — for hiring, onboarding, and community. Our infrastructure team also enabled some improvements to our CI/CD process like automated canary analysis, and invested in reducing build times, and incorporating a component design system.
Q. What were some interventions that failed, and why?
A. At some point, we tried to add a sign-off process before any feature was released, especially for our enterprise customers. This wasn’t very successful since we truly are shipping in small increments, which makes it challenging to put process gates in place. So we stopped doing sign-offs, but this in turn makes communicating changelogs harder.
Q. And finally, how do you see your engineering operations evolving over the next 5 years?
A. We want to move towards greater and greater automation. We are already moving towards automatic deployments, so that merges to master will automatically get deployed to production. We also want to invest in right-sizing some of our services so that we can better control the dependencies between different parts of our architecture. And finally we want data about our systems and processes to be easily available, queryable, and preferably all in one place, so that data can be a bigger part of our decision-making processes.