Measuring the performance of DevOps teams at ENGIE Digital or how we automated DORA Metrics for ENGIE’s digital factory

ENGIE Digital
May 31, 2022

Feedback, by David Henry

At ENGIE Digital, the ENGIE group’s software company, the CTO’s team (Chief Technology Officer) automated a system to measure delivery performance across the entire organization. Putting the concept into practice was a chance to ask ourselves a long list of questions and advance our thought process on the topic in general. We share with you here our assumptions, decisions and lessons learned:

  • The measurement choices — DORA metrics and others — and how we automated them (DORA stands for DevOps Research and Assessment, now part of Alphabet/Google),
  • The advantages for management: views of anomalies and trends, situation analysis, and observations on the impact of improvement actions.
  • The unintended outcomes: the project introduced more control and rigor in our practices.

Introduction

Among their many missions, ENGIE Digital’s CTO Vincent Derenty and his team are tasked with supporting product teams to boost operational efficiency. The product teams design and develop strategic digital solutions for the ENGIE Group. The CTO’s team acts as an expert advisor to them on topics like cybersecurity, agile methodologies, software architecture and development practices.

One of the first issues that comes up about improvement is how to measure it. It’s impossible to manage and coordinate an improvement action without knowing where you’re coming from, determining where you’re going and being able to see the results of your initiatives.

“If you can’t measure it, you can’t manage it,” as Peter Drucker would say.

When it comes to measuring the performance of DevOps teams, the gold standard is the State of DevOps Reports and the resulting analysis in the Accelerate research, which draws a correlation between a company’s operational performance and its business performance. It defines operational performance as a set of 24 capabilities classified into five categories: Continuous Delivery, Architecture, Product and Process, Lean Management and Monitoring, and Culture. These capabilities are a list of drivers for improving efficiency in an organization’s software delivery.

Believing it was a solid approach, we decided to set up an automated data collection system that now helps us compute and track the four recommended basic measurements commonly known as DORA metrics:

  • Two speed indicators:
    - Deployment frequency
    - Lead time for changes
  • Two stability indicators:
    - Mean time to restore
    - Change failure rate

It wasn’t the decision itself that took the most effort so much as the implementation. The concept behind what each indicator needs to measure is clear enough, but how it works on the ground is not as obvious. Questions arose very quickly: What operational data do we measure? How do we get these data? How do we process them? And, as is often the case, we realized that the devil is in the details.

Our goal here is to tell you how we took these measurements from broad concept to real-life implementation by creating a link to the operational practices that teams use in the field.

We also want to take this opportunity to introduce four more indicators that we added to the DORA metrics. These new ones are more focused on workflow within teams:

  • Deployment chronology
  • Activity split
  • Feature flow
  • Defect flow

By “flow” we mean a quest for fluidity, which Don Reinertsen explains and illustrates well in his book The Principles of Product Development Flow.

All of these indicators were designed to give teams a clear bird’s-eye view of their actual operations, to map out their situations, and pick out anomalies or trends in their own delivery.

The teams access these metrics through a Confluence plugin, and we will show you screenshots of the interface. Confluence is one of the most commonly used knowledge management tools in this context.

Background

To gain a better understanding of the issues we were tackling, you need a brief overview of the background.

As we mentioned, the CTO’s team at ENGIE Digital supports a number of teams making digital products that give the ENGIE Group a competitive edge in its markets. Most of these products, and the teams that build them, come from various Group entities that predate ENGIE Digital, and they were brought together within it after its creation.

  • ENGIE Digital has about a dozen of these products.
  • They may be produced by teams ranging from 10 to over 60 people that include product management and tech.
  • The teams are all separate and accountable. They choose their own development and collaboration practices and workflows, and to some degree their tools, technology and architecture. While this is hardly unique to us, we should point out that the history of these teams means there are real disparities in their practices.

We collected raw data from tools that the teams used the most:

  • GitHub (source code manager): The best way to see the codebase activity.
  • Jira (organizational tool): A closeup view of team workflows.

High-level implementation

Below is the process we chose.

The program relies on a REST API that collects the data:

  • Pushed by the projects themselves, via scripts they already have in their continuous integration (CI) pipelines, when the data come from GitHub.
  • Pulled by a cron job when the data come from Jira.

The API then exposes the data for display in the Confluence plugin.

Diagram of the high-level architecture for the delivery performance metrics reporting system.
High-level architecture
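To make this more concrete, here is a minimal sketch of what the “push” side could look like: a small script that a CI job runs right after a production deployment to report the event to the collection API. The endpoint URL, environment variables and payload fields are placeholders for illustration, not ENGIE Digital’s actual API.

```python
# Minimal sketch of the "push" side: a script a CI job could run right after a
# production deployment to report the event to the metrics collection service.
# The endpoint URL and payload fields below are hypothetical, not the real API.
import os
from datetime import datetime, timezone

import requests

METRICS_API = os.environ.get("METRICS_API", "https://metrics.example.internal/api/deployments")

def report_deployment(product: str, version: str, repositories: list[str]) -> None:
    """Send one deploy-to-production (DTP) event to the collection service."""
    payload = {
        "product": product,
        "version": version,                    # e.g. "2.7.0" (SemVer, see below)
        "repositories": repositories,          # repos deployed together in this DTP
        "deployed_at": datetime.now(timezone.utc).isoformat(),
    }
    response = requests.post(METRICS_API, json=payload, timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    report_deployment("my-product", os.environ.get("PRODUCT_VERSION", "0.0.0"), ["api", "front"])
```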

Position

As we will see, a lot of the thought process revolved around choosing the data that would be processed to build each indicator.

These considerations also served a related objective: making it easier for the teams to adopt our approach. In other words, the system had to be as unintrusive and unprescriptive as possible with regard to existing practices, as well as adaptable to the wide range of routines in place. With varying degrees of success, we tried to ask as little effort of the teams as possible when collecting the data.

Many of our decisions came down to walking this tightrope between data availability, ease of acquisition and accuracy.

Definitions, calculations and views

The same process was used for each metric: move from intent to technical implementation by defining the underlying concepts and calculations, while making sure they remained suited to team practices and routines.

Let’s see how this worked for each indicator.

1. Deployment frequency

Concept

The goal was to calculate the deployment frequency over a given time period, which meant counting the number of deploys to production of new versions of the product under assessment during that period.

Beyond deployment frequency itself, two concepts structured what came next: the deploy to production and the product version.

Deploy to production (DTP)

What is a DTP? Seemingly simple at first, the matter became more complex when we factored in, for example, the technical reality of micro-service architectures and multi-repository organizations.

We decided on the following definition for deploy to production: Providing a new version of a product to users by simultaneously deploying one or more repositories.

So a DTP must be differentiated from:

  • A repository deployment: When several repositories are deployed in the same DTP, we consider it to be a single DTP event.
    For example: If I push new code to production every two weeks by updating five repositories each time, the frequency would be two DTPs per month, not 10 (2 x 5 repositories).
  • Commissioning (COM): Even when new code pushed to production doesn’t alter how the application behaves for the end user (e.g., feature flipping/flags/toggles), we count one DTP event for each update.
    For example: If I push 10 deployments in a row into production for a new functionality’s various components over a period of one month and ultimately activate the finished functionality on the 10th deployment, I count a frequency of 10 DTPs per month, not one (which would be the single COM).

So the idea is to count the deliveries of new product versions. We also had to work on defining what a version is and putting it into practice to be able to automate it.

The product version

Counting DTPs therefore means counting the number of product versions deployed within a given timeframe. But we still had to effectively version the product, which may seem easy but is not all that clear-cut.

In a rare prescriptive move, we asked product teams to systematically name the versions, typically by using the standard Semantic Versioning (SemVer): X.Y.Z where X is the major version number, Y is the minor version number and Z is the fix number for a patch version. Having everyone use this convention made it easier for us to tally all the X.Y.Z combinations that were brought to production, and thus to count the DTPs.

Again, we had to be mindful of the multi-repository organizations, because some of them didn’t even use this product version concept. They only had repository versions; the product version implicitly remained the combination of repository versions and had no naming convention in the source code. This was worth improving anyway, given the need to be able to restore systems to a previous state.

In this case, we offered these teams a procedure for tagging the application repositories at each DTP with the same annotated tag, containing an incremental version number and the deployment date. On the ground, this was a Jenkins job run after the deployments to production.
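As an illustration, such a post-deployment tagging step could boil down to something like the following sketch. The tag format, repository paths and the choice of a Python wrapper around git are assumptions, not the teams’ actual Jenkins job:

```python
# A rough sketch of a post-deployment tagging step, as a small script a Jenkins
# job might call. Tag name format and repository paths are assumptions.
import subprocess
from datetime import date

def tag_repositories(repo_paths: list[str], version: str) -> None:
    """Apply the same annotated product-version tag to every repository deployed in this DTP."""
    tag = f"v{version}"                               # e.g. "v2.7.0"
    message = f"Product version {version} deployed to production on {date.today().isoformat()}"
    for path in repo_paths:
        subprocess.run(["git", "-C", path, "tag", "-a", tag, "-m", message], check=True)
        subprocess.run(["git", "-C", path, "push", "origin", tag], check=True)

if __name__ == "__main__":
    tag_repositories(["./repo-api", "./repo-front"], "2.7.0")
```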

Calculating deployment frequency

GitHub was our go-to source. Once the version tagging job was set up, we could count all the product version tags (distinct X.Y.Z combinations qualified by a date) within a given time period to obtain the deployment frequency for that period.
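Here is a minimal sketch of that counting step, assuming annotated SemVer tags (vX.Y.Z) whose tag date is the deployment date; the tag prefix and date range are examples:

```python
# Minimal sketch: count product version tags (DTPs) whose tag date falls in a period.
# Assumes annotated SemVer tags whose creator date is the deployment date.
import re
import subprocess
from datetime import datetime, timezone

SEMVER_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")

def deployment_frequency(repo_path: str, start: datetime, end: datetime) -> int:
    """Count product version tags whose tag date falls within [start, end]."""
    out = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref",
         "--format=%(refname:short) %(creatordate:iso-strict)", "refs/tags"],
        check=True, capture_output=True, text=True,
    ).stdout
    count = 0
    for line in out.splitlines():
        name, _, tag_date = line.partition(" ")
        if SEMVER_TAG.match(name) and start <= datetime.fromisoformat(tag_date) <= end:
            count += 1
    return count

if __name__ == "__main__":
    start = datetime(2022, 4, 1, tzinfo=timezone.utc)
    end = datetime(2022, 4, 30, tzinfo=timezone.utc)
    print(deployment_frequency(".", start, end))
```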

2. Lead time for changes

Concept

The idea was to calculate how much time passed between when a development is considered finished and when it is actually deployed to production. This indicator is meant to measure a time period, i.e., how much time does a code change take to be deployed in production?

We had the most trouble figuring out this indicator.

A composite indicator

First, bear in mind that there are two major types of factors impacting the time a development takes to go to production:

  • Technical efficiency: The capacity to quickly deploy code in the production environment by, for example, automating the deployment pipeline or the regression tests.
  • Organizational efficiency: Collaboration best practices that minimize time spent on verification or waiting for it (code review, functional review, business-side approval) and reduce the amount of rework needed, by opening an effective communication channel between the functional and technical sides upstream of development.

Seen through this lens, “lead time for changes” is a composite indicator that, depending on the case, reflects both of these aspects to varying degrees. This is why what is actually measured changes from one team to another.

A sensitive indicator in a team dynamic

The relationships between the workflow, the state of the source code (mostly merges onto main branches) and the deployment environments vary widely depending on the organizational and technical choices teams make — largely branching strategies (Git Flow, GitHub Flow, trunk-based, etc.), whether an app review process is in place, whether regression testing is automated, and whether the related fixes are immediate or deferred.

Therefore, depending on a team’s organization and choices, the indicator’s various components come into play in different ways and at different times. As a result, they are captured more or less faithfully by the measurement we use as the starting point for calculating “lead time for changes.”

This shows how important it is to be mindful when using this indicator to compare teams with disparate organizational situations and technical options.

Determining version age

As we saw earlier, a product version is what goes to production. Yet a version can contain several increments, such as user stories and fixes, whose size often varies in terms of complexity, effort and business value.

As recommended in the literature, we decided to use commits as the standard increment. The advantage was that we could assess consistent unitary components from a technical implementation perspective, which saved us from having to match up the more abstract concepts of tasks and user stories.

That then brings up the question of version “age.” Using a version’s commits as a basis, how could we figure out the time between when the code is considered finished and when the version goes to production?

Once again, we turned to the literature and determined that a version’s age is the average age of its commits.

Caveat: There are a few limitations to an approach based on the average commit age. The most noteworthy is that disruptive workflow practices, such as late corrections, create commits on main/master branches that make versions look younger than they are.

Consider a development that has been sitting on a branch for a long time, in a context where I am forced to merge it to move on to delivery and acceptance. If, after the merge, eight anomalies are found during delivery and acceptance and 12 commits are made to correct them, those correction commits wipe out the effect of the branch’s age. In this example, the more fixes are done downstream in the workflow, the better the indicator looks. But we are after the opposite.

To get around this, we are considering no longer equating the version age with the average commit age, but rather with the time that passes between the version’s oldest commit and its DTP. That way, the indicator would reward good practices (stop starting, start finishing: limit work in progress and the stock of finished-but-undeployed code).

Calculating lead time for changes

Since we were dealing with commits, we naturally turned to GitHub as the data source.

To calculate an age, you have to measure the time between a start date and an end date.

  • The start date that we chose for each commit was the committer date (GIT_COMMITTER_DATE), i.e., the date the commit was last modified. The thinking was that this data point gets us as close as possible to the “considered finished” moment.
  • The end date is the deployment date shown on the tag for the product version where the commit is embedded.

Then you just have to calculate the version’s age by figuring out the average age of its commits.
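Here is a minimal sketch of that calculation, assuming annotated product version tags whose tag date is the deployment date, and taking each commit’s committer date as the “considered finished” moment (the tag names are examples):

```python
# Minimal sketch: lead time for changes as the average age of a version's commits,
# measured at deployment time. Assumes annotated tags (e.g. v2.6.0 then v2.7.0)
# whose creator date is the deployment date.
import subprocess
from datetime import datetime
from statistics import mean

def _git(repo: str, *args: str) -> str:
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True, text=True).stdout.strip()

def lead_time_for_changes(repo: str, previous_tag: str, tag: str) -> float:
    """Average age (in days) of the commits shipped in `tag`, at deployment time."""
    deployed_at = datetime.fromisoformat(
        _git(repo, "for-each-ref", "--format=%(creatordate:iso-strict)", f"refs/tags/{tag}"))
    committer_dates = _git(repo, "log", "--format=%cI", f"{previous_tag}..{tag}").splitlines()
    ages_in_days = [(deployed_at - datetime.fromisoformat(d)).total_seconds() / 86400
                    for d in committer_dates]
    return mean(ages_in_days) if ages_in_days else 0.0

if __name__ == "__main__":
    print(lead_time_for_changes(".", "v2.6.0", "v2.7.0"))
```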

3. Mean time to restore

Concept

The idea was to determine the mean time needed to restore a service when there’s an incident. The impacted service could be all or part of the product in question.

Incident

Insofar as an incident is an event whose resolution time has to be measured, our definition of an incident is tied to its operational impact: if a product malfunction disrupts users so much that one or more team members must drop whatever they’re doing to take care of it, then it is an incident.

This definition allows us to remain contextual, and to accommodate each team’s uses and everyday realities when it comes to incidents.

Calculating mean time to restore

We used Jira for this type of event: it helped us find tickets in the workflow that could be described as incidents and monitor their resolution time.

Since our approach used workflow instead of technical monitoring, we were able to stick to our definition of incident based on user disruption, and not solely on availability or performance criteria.

For each team, then, the task was to see which tickets fit our definition of an incident. In Jira, we used the issueType/priority pairs specific to each team: they are what make tickets identifiable as incidents for that team.

And for each of these tickets, we calculate the time between the date the ticket was opened (creationDate) and the date it was resolved (resolutionDate).

Note that you have to make sure the resolution date (resolutionDate) is set in the Jira configuration when the ticket enters a state that means the code has been deployed to production.

We then take the average resolution time, over a given period, of all the incidents resolved during that period.
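As an illustration, a minimal sketch of this calculation against the Jira REST API could look like the following. The JQL clause, credentials handling and field names are placeholders to adapt to each team’s own definition of an incident:

```python
# Minimal sketch of the MTTR calculation against the Jira REST API.
# The JQL clause (issue types, priorities, date range) and auth handling are placeholders.
import os
from datetime import datetime

import requests

JIRA_URL = os.environ.get("JIRA_URL", "https://jira.example.internal")
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"])

def mean_time_to_restore(jql: str) -> float:
    """Average time (in hours) between creation and resolution of matching incident tickets."""
    response = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={"jql": jql, "fields": "created,resolutiondate", "maxResults": 1000},
        auth=AUTH, timeout=30,
    )
    response.raise_for_status()
    durations = []
    for issue in response.json()["issues"]:
        created = datetime.strptime(issue["fields"]["created"], "%Y-%m-%dT%H:%M:%S.%f%z")
        resolved = datetime.strptime(issue["fields"]["resolutiondate"], "%Y-%m-%dT%H:%M:%S.%f%z")
        durations.append((resolved - created).total_seconds() / 3600)
    return sum(durations) / len(durations) if durations else 0.0

if __name__ == "__main__":
    print(mean_time_to_restore(
        'issuetype = Bug AND priority = Blocker AND resolved >= "2022-04-01" AND resolved <= "2022-04-30"'))
```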

Caveat: Since a ticket’s lifecycle in Jira may not reflect what actually happens at the operational level, the calculated times are only approximations. A ticket’s creation date is not necessarily the moment the detected issue turned into an incident, let alone the moment it first occurred. Similarly, the resolution date won’t necessarily be the exact time the fix went to production.

4. Change failure rate

Concept

This was about determining the proportion of failed changes within a given time period.

Failed changes

As for deployment frequency, we leveraged the product version tagging convention in place: each DTP counts as a change, and a failed change is a DTP that is immediately followed by the deployment of a corrected version.

This is very easy to figure out with the standard X.Y.Z SemVer scheme: a version with Z ≠ 0 is a corrected version, from which we deduce that the previous version was a failure.

Calculating the change failure rate

We used the same GitHub deployment frequency data as a basis.

We calculate the proportion of X.Y.Z corrected versions where Z ≠ 0 out of all the X.Y.Z versions that had a DTP in that period.

We should note that a corrected version could itself be a failure, e.g., X.Y.1 followed by X.Y.2, in which case there would be two failed versions (X.Y.0 and X.Y.1) and one successful version (X.Y.2). So the change failure rate for this period would be 2/3 (two failed DTPs out of a total of three DTPs).

Caveat: This last example clearly shows that the indicator can be quite volatile over short periods or when the deployment frequency is low.
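For illustration, the calculation itself is tiny once the list of versions deployed during the period is known (here fed with the example above):

```python
# Minimal sketch of the change failure rate, given the SemVer version names
# deployed (DTPs) during the period, e.g. as collected by the tag-counting job above.
def change_failure_rate(versions_deployed: list[str]) -> float:
    """Share of DTPs in the period that are corrected versions (patch number Z != 0)."""
    if not versions_deployed:
        return 0.0
    corrected = sum(1 for v in versions_deployed if int(v.lstrip("v").split(".")[2]) != 0)
    return corrected / len(versions_deployed)

# The example from the text: X.Y.0 then X.Y.1 then X.Y.2 -> 2 corrected versions out of 3 DTPs.
print(change_failure_rate(["2.7.0", "2.7.1", "2.7.2"]))  # 0.666...
```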

5. Visualizing delivery performance

A team’s delivery performance is shown on a graph with four curves that each represent a DORA metric. This is an on-demand graph built for any date range the user chooses. It can calculate an average in increments of a week, a month or a quarter.

This visual reveals how the indicators correlate with one another and provides an overview, for a given period, of all the operational performance components.

Image of the four delivery performance components (DORA metrics) plotted as a graph.
Graph showing DORA metrics
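This is not how the Confluence plugin is built, but to illustrate the weekly/monthly/quarterly averaging idea, here is a minimal sketch using a small synthetic table of daily observations:

```python
# Minimal sketch of the on-demand aggregation behind such a graph, assuming the collected
# events have already been consolidated into a per-day table (synthetic data here).
import pandas as pd

daily = pd.DataFrame(
    {"lead_time_days": [3.2, 4.1, 2.8, 5.0], "deployments": [1, 0, 2, 1]},
    index=pd.to_datetime(["2022-04-04", "2022-04-11", "2022-04-18", "2022-04-25"]),
)

# Weekly, monthly or quarterly view: switch the resampling rule ("W", "M", "Q").
weekly = daily.resample("W").agg({"lead_time_days": "mean", "deployments": "sum"})
monthly = daily.resample("M").agg({"lead_time_days": "mean", "deployments": "sum"})
print(monthly)
```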

6. Flow monitoring

We used the data available to us to build a set of four indicators that are more directly linked to what the teams do and to their workflow, and that are comparable with one another.

We tried to make it easy to discern:

  • The proportion of the various development tasks the teams do (Activity split)
  • The chronology of deployments over time (Deployment chronology)
  • The flow of functional deliveries (Feature flow)
  • The flow of corrections (Defect flow)

Deployment chronology uses GitHub data that identifies the deployments, whereas the other three indicators (activity split, feature flow and defect flow) are taken from Jira. Each activity type represents one or more ticket types (issueType). These are the tickets counted to calculate the indicators.

Caveat: Because the unit of measure is the ticket, the approximation can become rough in teams where tickets vary widely in granularity, i.e., in complexity and time spent.

Activity split

Activity split tracks the distribution of development activities divided into three main categories:

  • Features for functional increments (mostly user stories)
  • Bugs and problems for things like corrections, support and production incidents
  • Technical activities for technical tasks like research and design, technical hygiene, refactoring and dependency management

The idea is to give the team a view of how its work breaks down, and to let team members see whether this breakdown matches their strategy or whether the team is merely enduring it, i.e., spending too much or not enough time on fixes or technical work, etc.

Graphic representation as a bar graph of the number of tickets delivered by period, broken down into three main activity categories: Features, Bugs and problems, Technical activities.
Graphic representation of activity split
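As an illustration, the activity split boils down to a mapping from each team’s Jira issue types to the three categories, plus a count per period. The issue type names below are examples, not a prescribed configuration:

```python
# Minimal sketch of the activity split: map each team's Jira issue types to the
# three activity categories and count delivered tickets per category for one period.
from collections import Counter

CATEGORY_BY_ISSUE_TYPE = {       # example mapping, to adapt per team
    "Story": "Features",
    "Bug": "Bugs and problems",
    "Incident": "Bugs and problems",
    "Task": "Technical activities",
    "Spike": "Technical activities",
}

def activity_split(tickets_delivered: list[dict]) -> Counter:
    """Count delivered tickets per activity category."""
    return Counter(
        CATEGORY_BY_ISSUE_TYPE.get(t["issueType"], "Technical activities")
        for t in tickets_delivered
    )

print(activity_split([{"issueType": "Story"}, {"issueType": "Bug"}, {"issueType": "Story"}]))
# Counter({'Features': 2, 'Bugs and problems': 1})
```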

Deployment chronology

By showing deployments over time and their lead times for changes, deployment chronology provides a granular view of events in each deployment. This graphic representation makes it possible to distinguish variations and accidents in the pace of deployments by correlating them with the age of their developments.

Each deployment is plotted by date against its lead time for changes. A visual distinction is made between successful deployments (green dots) and failed deployments (red triangles).
Graphic representation of deployment chronology

Feature flow

This entailed monitoring the flow of features (functional increments) and the relation between the system’s inflows and outflows. It’s a view into the balance between the capacity to deliver and what remains to be done (not started and in progress).

This provides a clear picture of issues related to the pull-flow concept: Is our backlog too bloated, so that it pushes our flow? Is our work in progress bloated? Or is our backlog drying up? And so on.

Graph of changes in the number of features to do or in progress versus finished features.
Graphic representation of feature flow

Defect flow

Here we track corrective activity and the effects of reliability work by plotting changes in the stock of anomalies against the fixes completed. Practically speaking, we compare the number of anomaly tickets (bugs, incidents, etc.) still to do or in progress at the end of a period with the number of fixes completed during that period, i.e., the number of tickets of the same types that were closed.

Once again, it’s a matter of flow management since the change in inventory is a reflection of the effects of inflow and the number of completed fixes represents the outflow.

This view helps teams keep their priorities straight between fixing defects and building reliability.

Bar graph of changes in the anomalies stock compared to the number of fixed anomalies represented as a curve.
Graphic representation of defect flow
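As an illustration, here is a minimal sketch of the two figures compared for one period: the stock of anomalies still open at the end of the period and the number of fixes completed during it. Field names are illustrative, and feature flow can be computed the same way with feature tickets:

```python
# Minimal sketch of the defect flow figures for one period, assuming each anomaly
# ticket record carries a creation date and an optional resolution date (illustrative fields).
from datetime import date

def defect_flow(anomaly_tickets: list[dict], period_start: date, period_end: date) -> dict:
    """Stock of anomalies still open at the end of the period vs. fixes completed during it."""
    stock_at_end = sum(
        1 for t in anomaly_tickets
        if t["created"] <= period_end and (t["resolved"] is None or t["resolved"] > period_end)
    )
    fixed_in_period = sum(
        1 for t in anomaly_tickets
        if t["resolved"] is not None and period_start <= t["resolved"] <= period_end
    )
    return {"stock_at_end": stock_at_end, "fixed_in_period": fixed_in_period}

tickets = [
    {"created": date(2022, 3, 10), "resolved": date(2022, 4, 5)},
    {"created": date(2022, 4, 2), "resolved": None},
    {"created": date(2022, 4, 20), "resolved": date(2022, 5, 3)},
]
print(defect_flow(tickets, date(2022, 4, 1), date(2022, 4, 30)))
# {'stock_at_end': 2, 'fixed_in_period': 1}
```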

Conclusion

First, it is important to reiterate that these indicators act as a system, especially the DORA metrics. They are tightly interwoven: several indicators will probably move in response to a single team action or change.

Next, reading the metrics only scratches the surface of the search for meaning. Metrics provide clues; they help you assess the landscape, the events and the constraints that will yield explanations and, ultimately, an improvement roadmap.

With these disclaimers out of the way, we suggest reviewing the advantages that we observed after setting up these indicators. We detected two main types:

  • Expected advantages, which we had anticipated.
  • Unintended outcomes, which took us more by surprise but are noteworthy nonetheless.

Expected advantages

As we have just explained, the top expected advantages come from the systematic view of team situations that the indicators give us. They serve to simplify:

  • The determination of anomalies and trends that need to be explored.
    Example: The chart below clearly demonstrates the need to review how much effort is allocated for fixes after a product is initially put into service:
Right after the product is put into service, you can see a rapid increase in detected anomalies, driven by the influx of users and the resulting spike in use cases.
Example of the visible effect of putting a product into service on its defect flow
  • Factualizing situations and structuring discussions on areas of improvement.
    Example: Many teams opted to always open their Retrospectives with a group review of the indicators to detect anything outstanding that needs action, discussion or celebration.
  • Noting the impact of improvement actions.
    Example: The chart below shows the impact of delivering a deployment automation project around 8/20/2021.
Notice that the delivery of a deployment automation project significantly reduces lead time for changes while increasing deployment frequency.
Graphic representation of the impact of delivering a deployment automation project on the deployment frequency-lead time for changes pair.
Notice that the delivery of a deployment automation project significantly increases the number and frequency of deployments.
Graphic representation of the impact of delivering a deployment automation project on the number and frequency of deployments.
Notice that the delivery of a deployment automation project significantly brings the features to do and finished features curves closer together.
Graphic representation of the impact of delivering a deployment automation project on feature flow.

Unintended outcomes

The mere process of setting up the reporting system brought situations to the surface and led to changes that trended towards increasing control and rigor in practices:

  • As we have mentioned, installing SemVer for all products was an improvement in itself that helped better control how the codebase in production was arranged.
  • In terms of mean time to restore and flow monitoring, we were able to advance some workflows, because we base the notion of resolution (a given piece of code released to production) on Jira ticket lifecycle data. Teams that were not tracking this enriched their visual management system and added the missing steps all the way through to production. This made it possible to complete, clarify and truly see the actual state of flows downstream of development.
  • Similarly, in terms of organizing work and divvying up tasks, setting up a system that counts tickets to monitor flows incentivizes teams to reduce ticket size and converge towards a finer, more consistent granularity. This in itself is a best practice that brings many benefits.
  • Lastly, the simple act of shedding light on things like the defect stock, the number of items in backlogs and the age of some commits (and thereby of some branches) led to healthy sorting and cleaning actions and enhanced everyday diligence in this area.

Takeaways

To conclude, here are two takeaways that we think are important to remember if you want to automate DORA metrics or other indicators in a mixed organization like ours:

  • Takeaway 1: Do not skimp on the discovery phase and take the time to really understand all the different uses in your organization. That way you can strike a good balance between adaptability (hence complexity) and the prescriptive nature of the anticipated solutions. And keep acceptability and team adoption in your line of sight.
  • Takeaway 2: Tap into real-life data as soon as possible to check whether you’ve made the right choices and whether your theories match realities on the ground. It goes without saying that you should also get feedback from your first users. Keep in your line of sight the need to improve your solution and adapt it to real uses.


ENGIE Digital

ENGIE Digital is ENGIE’s software company. We create unique software solutions to accelerate the transition to a carbon-neutral future.