Webhooks vs. APIs: Data ingestion options for software engineering intelligence platforms
What’s the difference between these pull and push options, and which approach may work best for your data source?
October 23, 2023
Business intelligence platforms, particularly those targeting the software engineering space, play a crucial role in centralizing data from many sources to support business operations. These platforms provide teams and leaders with a holistic view of their software development processes, enabling them to make data-driven decisions, identify bottlenecks, and optimize workflows.
To achieve this, these platforms combine data from multiple types of software development systems, including source code management, project management, release management, incident management, and more. SaaS software engineering intelligence platforms like Faros AI must also support the ingestion of data from multiple flavors of those sources, whether they be cloud-based or self-hosted.
The process for getting data from a source to a BI platform often depends on the source, but it can largely be summarized into two options: a data connector that pulls the data from the source into the platform, or a webhook built into the source that pushes data to the platform.
Push or pull?
To choose which approach works best for your source, let's first compare these two options.
Comparing pull and push methods for populating a BI platform from a data source
What are APIs or connectors?
Software development systems typically expose APIs that enable interested parties to request and retrieve data. These APIs are often protected by some form of credential system, such as a token. A connector is a piece of software that uses this credential to authenticate to the API and retrieve (“pull”) the data from the source system (“data source”) into the BI platform. The connector runs on a schedule, so the data in the platform is never older than roughly one sync interval.
This pull approach is the most common approach to ingesting data. Here are a few reasons why:
- Easy to get started: Most companies rely on third-party software development systems such as Jira and GitHub to facilitate and organize their software development. Fortunately, most of these third-party systems already have the APIs required for retrieving data.
- Flexibility: Since the connector is its own piece of software, it can choose which data to pull from the data source. BI platforms usually require only certain types of data from the source.
- Robustness: If the data source is temporarily offline or inaccessible, the connector can just try pulling again at the next scheduled interval.
- Scalability: The connector controls how much and how often the data is pulled, which reduces pressure on both the data source and the BI platform. The connector itself can be run on the same infrastructure as the platform, or on a separate stack.
- Historical data: The connector can pull data as far back as is supported by the data source.
- Data transformation: The connector can aggregate and transform the data in transit, which can reduce the burden on the platform.
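The pull loop described above can be sketched roughly as follows. This is an illustration only, not Faros AI's connector code: `fetch_page`, the timestamp cursor, and the record fields `type` and `updated_at` are hypothetical stand-ins for whatever a given source API provides, and timestamps are assumed to be ISO 8601 strings in UTC so that they sort as text.

```python
from typing import Callable, Optional

# Hypothetical page fetcher: given a cursor (or None) and a page number,
# it returns a list of records. A real connector would call the source's
# API here, authenticating with the credential supplied by the client.
FetchPage = Callable[[Optional[str], int], list]


def pull_once(fetch_page: FetchPage, cursor: Optional[str], wanted_types: set):
    """Run one scheduled sync and return (records, new_cursor).

    If the source is unreachable, return no records and the old cursor,
    so the next scheduled run simply tries again from the same point.
    """
    records = []
    new_cursor = cursor
    page = 1
    try:
        while True:
            batch = fetch_page(cursor, page)
            if not batch:
                break
            for rec in batch:
                # Flexibility: keep only the record types the platform needs.
                if rec["type"] in wanted_types:
                    records.append(rec)
                if new_cursor is None or rec["updated_at"] > new_cursor:
                    new_cursor = rec["updated_at"]
            page += 1
    except ConnectionError:
        return [], cursor
    return records, new_cursor


def fake_source(cursor, page):
    """In-memory stand-in for a source API, used only for this example."""
    data = [
        {"type": "issue", "updated_at": "2023-10-01T00:00:00Z"},
        {"type": "comment", "updated_at": "2023-10-02T00:00:00Z"},
    ]
    fresh = [r for r in data if cursor is None or r["updated_at"] > cursor]
    return fresh if page == 1 else []


records, cursor = pull_once(fake_source, None, {"issue"})
```

Passing `None` as the cursor pulls as far back as the source allows, which is how the same loop covers historical data. Later runs pass the returned cursor and receive only what changed since.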
What are webhooks?
Some software development systems come with webhooks, which are internal components that can send data events to another party in real time, or at least very close to it.
In this situation, the roles are reversed: The other party, such as a BI platform, exposes an API endpoint to receive data events. When an action takes place in the software development system, e.g. a new work task is created, the system "pushes" the event to the platform by making a request to the platform's API endpoint. This endpoint may also require a credential, which is supplied to the software development system when setting up the webhook.
Webhooks are an extremely useful tool and are commonly found in systems that are inherently event-driven, such as notification systems, automation tools, and e-commerce systems.
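To make the reversed roles concrete, here is a minimal sketch of the receiving side using only the Python standard library. The bearer-token check, the `EXPECTED_TOKEN` value, and the `received` list are assumptions made for illustration. They do not describe how Faros AI or any particular source system authenticates webhook deliveries.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credential, given to the source system when the webhook
# is configured. A real platform would issue one per account.
EXPECTED_TOKEN = "example-token"

received = []  # stand-in for the platform's ingestion pipeline


def handle_delivery(auth_header, body: bytes) -> int:
    """Return the HTTP status code for one pushed event."""
    expected = f"Bearer {EXPECTED_TOKEN}".encode()
    # compare_digest avoids leaking the token through timing differences.
    if not hmac.compare_digest((auth_header or "").encode(), expected):
        return 401
    try:
        event = json.loads(body)
    except ValueError:
        return 400
    received.append(event)
    return 202  # accepted; the real processing happens later


class WebhookHandler(BaseHTTPRequestHandler):
    """Wires handle_delivery into an HTTP endpoint."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        status = handle_delivery(self.headers.get("Authorization"), body)
        self.send_response(status)
        self.end_headers()


def serve(port: int = 8080):
    """Start the endpoint. Call this explicitly; it blocks until stopped."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Keeping the decision logic in `handle_delivery` and the HTTP wiring in `WebhookHandler` makes the logic easy to exercise without opening a socket.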
When are webhooks the preferred option?
As a SaaS platform, Faros AI defaults to the pull approach for ingesting data. This means we develop, maintain, and run all the data connectors needed to generate the insights for our clients. But for us to run the connectors, our clients must supply us with the necessary credentials so that our infrastructure can authenticate to their software development systems. For some companies, providing system credentials to a third party is a non-starter. Perhaps they have compliance regulations that don't allow this behavior, or maybe the credentials cannot be scoped down enough to only allow the minimum set of permissions, or maybe they just don't want to do it.
For these situations, Faros offers a middle-ground option, which we call the "hybrid" approach. Our data connectors are open-source and available for anyone to download and run themselves. We can provide our clients with tailored instructions for running the connectors on their own infrastructure. This means they have full control over the operation and scheduling of the data connectors. However, full control also means full responsibility. The clients now have the added overhead of integrating the connectors into their automation stack along with the other engineering burdens of managing repeated jobs, and the time spent doing that can negatively impact other business operations.
Yet, for some clients, neither of these approaches may be ideal. But if their data sources include webhooks, they can now configure those webhooks to push their data events to Faros. This approach provides several advantages to the client:
- Easy and fast setup: Webhooks are usually quite fast to set up and can sometimes be completely configured through the data source UI. At a minimum, the client only needs to provide the Faros API link for their account.
- Secure: System credentials never leave the client's infrastructure.
- Real-time updates: Webhooks are inherently event-driven, which means data is pushed to the Faros AI platform in real time, or at least very close to it. This enables any number of event-driven automation workflows. For example, you can create an automation in Faros to add incident details to related work tasks right as incidents are generated.
- Increased control and transparency: Depending on the data source, the client can choose which types of events to send to Faros, as well as which business units to send events for. This process is often much easier than configuring a dedicated system credential that only has access to certain business units.
- Performance: Since the webhook is run by the data source itself, it should not be subject to the rate limiting or throttling rules that normally protect APIs. The client's infrastructure team also won't have to worry about their self-hosted data source getting overwhelmed by API requests from a connector.
The main drawback of webhooks is that, because they are event-driven, they cannot push historical data to another party, and platforms like Faros AI preferably ingest months of historical data to quickly generate actionable insights for our clients. To resolve this, Faros enables its clients to run the data connectors manually on their own infrastructure (the "hybrid" approach from above) just once to pull all the historical data into the platform, and then use webhooks to push new events into the platform as they are generated. Since clients run the data connectors only once, they don't have to deal with the added responsibilities of automation and management that running the connectors continuously would require.
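One practical detail of combining a one-time backfill with live webhooks is that both paths may deliver the same record. A simple way to make that harmless is to key every write on the record's ID and keep the newest version. The sketch below assumes hypothetical `id` and `updated_at` fields and an in-memory `store`. It is not a description of how the Faros AI platform stores data.

```python
store = {}  # record id -> latest known version (stand-in for a database)


def upsert(record: dict) -> None:
    """Keep the newest version of a record, whichever path delivered it."""
    current = store.get(record["id"])
    if current is None or record["updated_at"] >= current["updated_at"]:
        store[record["id"]] = record


def backfill(historical_records) -> None:
    """One-time pull of historical data by a connector."""
    for record in historical_records:
        upsert(record)


def on_webhook_event(record: dict) -> None:
    """Called for each event pushed by the source after the webhook is set up."""
    upsert(record)
```

With writes shaped this way, the order in which the backfill and the webhook are switched on matters less, because a stale copy from the backfill never overwrites a fresher webhook event.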
Examples of systems that support webhooks
Several popular software development tools support webhooks, such as GitHub, GitLab, and Bitbucket for source code management, and Jira, Airtable, and Asana for task management. Popular incident management systems like PagerDuty and Opsgenie, which are already event-driven, support webhooks as well.
Since the Faros AI engineering team uses GitHub for both source code management and a portion of our CI/CD pipeline, we've set up our own GitHub organization to send events to our platform.
As our engineers push commits to their development branches, the GitHub webhook pushes corresponding commit events to the Faros platform. It also pushes events when:
- A pull request is created from a development branch
- Someone reviews the pull request
- The pull request is merged into the main branch
- A GitHub Action workflow updates the Faros platform with the newly merged code
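On the receiving side, the steps above arrive as different GitHub event types, identified by the `X-GitHub-Event` request header and, for some events, an `action` field in the JSON payload. The mapping below is a sketch of how a receiver might recognize them. The returned labels are made up for this example, and matching the last bullet to the `workflow_run` event is an assumption about which event a platform would listen for.

```python
from typing import Optional


def describe_github_event(event_name: str, payload: dict) -> Optional[str]:
    """Map one GitHub delivery to a step in the list above, or None.

    `event_name` is the value of the X-GitHub-Event request header.
    """
    action = payload.get("action")
    if event_name == "push":
        return "commits pushed"
    if event_name == "pull_request":
        if action == "opened":
            return "pull request created"
        # GitHub reports a merge as a "closed" action with merged set to true.
        if action == "closed" and payload.get("pull_request", {}).get("merged"):
            return "pull request merged"
    if event_name == "pull_request_review" and action == "submitted":
        return "pull request reviewed"
    if event_name == "workflow_run" and action == "completed":
        return "workflow run finished"
    return None  # not one of the steps we track
```

A pull request that is closed without being merged returns `None` here, which keeps abandoned work out of the delivery timeline.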
Combined with the ingestion of our task management data, the platform now has a complete view of a feature's lifecycle, from being added to our task list to being deployed onto our platform.
Are webhooks hard to set up and maintain?
In general, it is very easy to get started with webhooks on a system that supports them, like GitHub, because the system itself does all the heavy lifting. There is no need for the user to manage any GitHub tokens, schedule any job automations, or worry about performance-related details like rate limiting or throttling. The entire setup process for GitHub webhooks fits on a single web page, shown in the screenshot below.
Screenshot of the GitHub Webhooks configuration page
Tips for supporting webhooks
If you're thinking about enhancing your own BI platform to support incoming webhook events, here are a few tips to ensure the best experience for your customers.
Tip #1: Service availability
We mentioned earlier that the main drawback of webhooks is that they can't push historical data. This means that your platform must minimize the chance of missing any incoming events, because if you miss events, then someone needs to run a data connector to pull the missed data. Therefore, your event-handling service must be highly available and reliable. Some ways to achieve this include (but are not limited to) load balancing across multiple instances, deploying instances across multiple data centers or cloud regions, and configuring auto-scaling policies to add more instances during peak traffic times.
Tip #2: Event validation
You may have noticed in the GitHub screenshot that we configured our own webhook to send all events to our platform, using the "Send me everything" option. Choosing that option is much faster than selecting individual event types, and if your customer just wants to get something working quickly, it is probably the option they'll choose as well. Your customer's software tool also may not let them choose which event types to send. Either way, your platform must be prepared to receive events that have no relevance to your product. To keep these extra events from affecting the performance of your platform, your event-handling service should identify and discard them as early as possible, ideally before they enter any sort of processing queue.
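A minimal sketch of that early discard follows, assuming the event type is available from a request header (as it is with GitHub's `X-GitHub-Event`) so that irrelevant deliveries can be dropped without even parsing the body. The `RELEVANT_EVENTS` allow-list and the in-process `work_queue` are hypothetical; a production service would more likely hand off to a durable message queue.

```python
import json
import queue

# Hypothetical allow-list: the event types this platform has a use for.
RELEVANT_EVENTS = {"push", "pull_request", "pull_request_review"}

work_queue = queue.Queue()  # stand-in for the real processing queue


def accept(event_type: str, body: bytes) -> bool:
    """Enqueue relevant events and drop the rest before any parsing.

    Returns True if the event was queued. The caller would typically
    still acknowledge a dropped event with a success status, since the
    sender did nothing wrong.
    """
    if event_type not in RELEVANT_EVENTS:
        return False
    work_queue.put({"type": event_type, "payload": json.loads(body)})
    return True
```

Checking the type before calling `json.loads` means an irrelevant event costs one set lookup rather than a parse and a queue write.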
Tip #3: Error handling
Even if your event-handling service has 100% uptime, there's still a possibility that some other component of your platform may have an outage that prevents an event from being fully processed. In these situations, your event-handling service should treat the errors as recoverable and keep attempting to process the event until it succeeds. If you cannot retry indefinitely, have a backup storage system in place to store the failed events, so that once your platform issues are resolved you can replay them and get them into your platform.
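The retry-then-park pattern described here might look like the sketch below. Exponential backoff is one common retry policy and is an assumption on our part, as are the `RecoverableError` type and the in-memory `dead_letters` list, which stands in for durable backup storage.

```python
import time


class RecoverableError(Exception):
    """Raised when a downstream component is temporarily unavailable."""


dead_letters = []  # stand-in for durable backup storage


def process_with_retry(event, process, max_attempts=5, base_delay=0.5,
                       sleep=time.sleep) -> bool:
    """Try to process an event, backing off between attempts.

    If every attempt fails, park the event for later replay.
    """
    for attempt in range(max_attempts):
        try:
            process(event)
            return True
        except RecoverableError:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    dead_letters.append(event)
    return False


def replay(process) -> int:
    """Retry every parked event and return how many now succeeded."""
    pending = list(dead_letters)
    dead_letters.clear()
    for event in pending:
        try:
            process(event)
        except RecoverableError:
            dead_letters.append(event)  # still failing, keep it parked
    return len(pending) - len(dead_letters)
```

Only errors raised as `RecoverableError` are retried. Anything else propagates immediately, on the assumption that retrying a malformed event would never succeed.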
In summary, while APIs and data connectors are the standard way of ingesting data into BI platforms, webhooks can provide immense value in the right circumstances. For companies that can't share credentials or want real-time data flows, webhooks are an elegant solution that puts control firmly in their hands. With high availability, validation, and error handling, BI platforms can fully leverage webhooks to deliver responsive insights.
If you're currently evaluating strategies to centralize data into a BI platform for software engineering, read more about Faros AI here.