Does size matter in Pull Requests: Analysis on 30k Developers

At one point or another you might have found yourself putting up a Pull Request for review that was significantly bigger than you expected it to be. And you found yourself wondering:

“How big should it really be? Is there a sweet spot for the size of a review? If we could theoretically always fully control it, how big should we make it?”

You googled around, and you found a lot of resources, sites, and articles like this one, analysing the subject and ending up with something along the lines of:

“Too few lines might not offer a comprehensive view of the changes, while an excessively large PR can overwhelm reviewers and make it challenging to identify issues or provide meaningful feedback”

And although you understood the sentiment of the writer, you also understood that the theoretical answer could only be vague, as there is no silver bullet. As always, life is more complicated than that.

What we are going to do in this article, however, is something different:

“We will analyze the PRs of ~30k developers to see how the size of PRs correlates with lead time, comments received and change failure, to try and find what statistically is the best size, as well as examine what affects it.”

Disclaimer: For anyone who has played around with data, and especially anyone who has done courses or training in data, the above might bring back memories of the phrase “Correlation does not mean causation”. First of all, hello to you my fellow scholar, and secondly, you are absolutely right. We will try to look at it from various angles, examining how this correlation varies by company, by developer, and by amount of code committed, as well as any other angle that might help us understand which other values, for whatever reason, follow relevant patterns. However, these are “only” numbers and correlations; they do not explain the reason behind them, so any assumptions about causes that we make are more anecdotal than scientifically backed.

Methodology

Lead Time

In this case we use as lead time the time between the earliest PR event (either the first commit or the PR being opened) and when the PR gets merged in.
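As a minimal sketch of that definition (the function and timestamp arguments are illustrative, not the article's actual code or schema):

```python
from datetime import datetime

def lead_time_hours(first_commit_at, opened_at, merged_at):
    """Lead time: merge time minus the earliest PR event
    (first commit or PR open), in hours."""
    start = min(t for t in (first_commit_at, opened_at) if t is not None)
    return (merged_at - start).total_seconds() / 3600

lt = lead_time_hours(
    datetime(2024, 1, 1, 9, 0),   # first commit
    datetime(2024, 1, 1, 12, 0),  # PR opened
    datetime(2024, 1, 3, 9, 0),   # merged
)
print(lt)  # 48.0
```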

Data Preparation

Data that are removed as outliers:

  1. PRs that had a lead time of more than 6 months
  2. PRs that had a lead time of less than 5 minutes
  3. File changes of more than 20k lines
  4. PRs with more than 20k line changes

After removing these, we are left with a few hundred thousand merged Pull Requests, which are used to produce the analysis below.
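The outlier filtering above could look roughly like this in pandas (a sketch on toy data; column names such as `lead_time_days` and `total_line_changes` are assumptions, not the real schema):

```python
import pandas as pd

# Toy frame of merged PRs (illustrative columns).
prs = pd.DataFrame({
    "lead_time_days": [0.001, 2, 400, 10],
    "total_line_changes": [50, 300, 120, 50_000],
})

SIX_MONTHS_DAYS = 183
FIVE_MINUTES_DAYS = 5 / (24 * 60)

# Drop lead-time outliers and PRs over 20k changed lines.
clean = prs[
    (prs["lead_time_days"] <= SIX_MONTHS_DAYS)
    & (prs["lead_time_days"] >= FIVE_MINUTES_DAYS)
    & (prs["total_line_changes"] <= 20_000)
]
print(len(clean))  # only the 2-day, 300-line PR survives
```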

Algorithm

All correlations have been computed using the Kendall tau method, which better estimates correlation for non-linear (but monotonic) relationships.
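As a small sketch of such a computation, using `scipy.stats.kendalltau` on synthetic data (not the article's dataset) with a deliberately non-linear relationship:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
pr_size = rng.integers(10, 2000, size=500)
# Lead time grows non-linearly with size, plus noise.
lead_time = np.sqrt(pr_size) + rng.normal(0, 5, size=500)

# Kendall tau captures the monotonic association despite the sqrt shape.
tau, p_value = kendalltau(pr_size, lead_time)
print(round(tau, 2), p_value < 0.05)
```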

How does Lead Time relate to PR size

Before we dig deeper: intuitively we expect the size of a PR to correlate, one way or another, with the lead time, but is that actually the case? Running the correlation between the two variables for the whole dataset gives us the correlation matrix below.

PR size to Lead Time correlation

From these numbers we could say that there is some correlation between the two variables, but it sits only a bit above the threshold of statistical significance, meaning that:

Their correlation is there, but is not very strong, maybe less than one would have expected.

Seems like we’ll have to dig deeper to see why this correlation is so weak. Unfortunately, plotting total line changes against lead time makes things, if anything, less clear: although the trend suggests that PRs with higher lead times were slightly bigger on average, any link between the two is hard to see.

Total PR size to Lead Time

Now, if we change this chart a bit by grouping the data points by day and taking the median of the total changes per day, we start to see a bit more clearly how they relate, and potentially an explanation for why their correlation is not that high.
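The daily grouping described above can be sketched in pandas like this (toy data, illustrative column names):

```python
import pandas as pd

# Toy PRs: bucket lead time by whole days, then take the
# median PR size within each bucket.
prs = pd.DataFrame({
    "lead_time_days": [0.5, 0.7, 1.2, 1.4, 2.1],
    "total_line_changes": [40, 60, 200, 400, 800],
})

prs["lead_time_day"] = prs["lead_time_days"].astype(int)
daily_median = prs.groupby("lead_time_day")["total_line_changes"].median()
print(daily_median.to_dict())  # {0: 50.0, 1: 300.0, 2: 800.0}
```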

Mean total PR size to daily Lead Time

So this suggests that at fast lead times PR sizes are consistently low, and as PRs get bigger there is a linear increase in lead time. However, high lead times can be produced by PRs of any size, which keeps the overall correlation between the two very low.

What is the best size

To try and answer this question, we first have to ask ourselves what it is that matters to us, i.e. what we are trying to optimize for. That is a question with endless possibilities. For our purposes, however, we will look for the largest PR size that statistically works best given these 3 wants:

  1. Low lead time (aka be done fast)
  2. High number of comments (small enough to be reviewed properly)
  3. Low defects/reverts (aka we are not breaking things)

If we plot in a heatmap the probability of a PR of a given size getting done within a number of weeks, we get the below.

Heatmap of probability of PR of a size (x axis) getting done in a number of weeks (y axis)

Meaning that a PR of less than 100 lines of code has an ~80% chance of getting done within the first week.
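The probabilities behind such a heatmap can be derived roughly as follows (a sketch on toy data; the bin edges and column names are assumptions, not the article's):

```python
import pandas as pd

# Toy data: PR size and the week in which it was merged
# (1 = done within the first week).
prs = pd.DataFrame({
    "total_line_changes": [30, 80, 150, 500, 3000, 40, 90, 4000],
    "weeks_to_merge":     [1,  1,  1,   2,   3,    1,  2,  4],
})

# Bin PRs by size, then compute P(done within week 1) per bin.
bins = [0, 100, 1000, 10_000]
prs["size_bin"] = pd.cut(prs["total_line_changes"], bins=bins)
p_first_week = (
    prs.groupby("size_bin", observed=True)["weeks_to_merge"]
       .apply(lambda w: (w == 1).mean())
)
print(p_first_week)
```

Each row of the heatmap is one such per-bin probability series, for a given number of weeks.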

A similar heatmap for the amount of comments gives us the below.

Heatmap of probability of a PR of a size (x axis) getting an amount of comments (y axis)

Which means that a PR of 6,000 lines of code has the same probability of getting zero review comments as a PR of less than 50 lines of code.

And finally, doing the same for reverts, depicting the probability of no commits from a PR getting reverted, gives us the heatmap below.

Probability of a PR of a size (x axis) not having to be reverted (0 reverts)

Which means that, generally, larger PRs have a higher probability of having some of their code reverted (i.e. being faulty).

From the above, if we plot on the same graph the probability of completing a PR within the first week, the probability of getting at least one comment, and the probability of not having to revert a commit from that PR, we get the below.

Probability (y axes) of a PR getting done in a week (blue), to have comments (green) and to be reverted (red) over lines of code

Therefore, statistically, staying below ~400 lines of code per PR gives a good probability of getting some review comments, completing the PR within the first week, and not having issues with the code.

Of course that is only “statistically” the case. It surely depends on a lot of things. Let’s examine some potential ones.

Does it depend on the user

We would expect it to vary per user, but how much it varies per user, whether that user is the author or the reviewer, is the more interesting question. After removing all authors that have one or more of the below:

  • Less than 10 merged PRs
  • Less than 10 commits
  • Less than 100 lines of code changed

And all reviewers that have:

  • Less than 10 approved and merged PRs

We perform a correlation analysis between Lead Time and PR size per user. If we then put the results of the analysis on a histogram showing how many users had each correlation value, we get the charts below:

Histograms of Lead Time to PR size correlation per amount of unique PR authors (left) and PR reviewers (right)

The correlation between Lead Time and PR size heavily depends on the PR author as well as the PR reviewer
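A sketch of how such a per-user correlation could be computed (toy data and an illustrative `author` column; the eligibility threshold mirrors the article's 10-merged-PRs filter):

```python
import pandas as pd
from scipy.stats import kendalltau

# Toy PRs: one author whose lead time tracks size perfectly,
# one whose lead time moves the opposite way.
prs = pd.DataFrame({
    "author": ["alice"] * 12 + ["bob"] * 12,
    "total_line_changes": list(range(10, 130, 10)) * 2,
    "lead_time_days": list(range(1, 13)) + list(range(12, 0, -1)),
})

MIN_MERGED_PRS = 10  # eligibility threshold, as in the article

# Kendall tau between PR size and lead time, per eligible author.
per_author = {
    author: kendalltau(g["total_line_changes"], g["lead_time_days"])[0]
    for author, g in prs.groupby("author")
    if len(g) >= MIN_MERGED_PRS
}
print(per_author)  # alice strongly positive, bob strongly negative
```

The histogram in the article is then simply the distribution of these per-user tau values.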

There are a wide range of reasons why that could potentially happen, like level of seniority, company/team process, coding language, review tool, etc. 

Below we plot the correlation against the amount of code a developer has written over the last 6 months. Although one could instinctively read more code as meaning a more “experienced” developer, that is not necessarily true, as code volume is also affected by factors such as the amount of meetings, mentoring, and daily collaboration (which can vary with seniority), the tasks each developer took up, and so on.

Nonetheless, we depict it here for anyone who might find it interesting. Also keep in mind that the difference in correlation between a user with many merged PRs and one with few is not very large.

Lead Time to PR size correlation value per lines of code committed per developer

The more lines one has written the more correlated the PR size is with the lead time. This could also mean that lead time becomes more predictable in this case, and it depends more heavily on the size of the PR and not other parameters (e.g. complexity). However, more analysis would be required to establish that.

Does it depend on the company

We mentioned earlier that there are various potential reasons for a correlation between Lead Time and PR size, and that the strength of that correlation would likewise be multivariate, with company/team process being one potential cause. If that were the case, we would expect the correlation to vary by company.

Taking a small sample of companies and examining the strength of that correlation suggests this is a valid assumption: as we can see, it varies from 0.1 (the two metrics are essentially unrelated for that company) to almost 0.7 (a relatively strong correlation between the two).

Lead Time to PR size mean correlation per company (sample)

How much PR size relates to Lead Time seems to depend heavily on the specific company 

Does it change over time

It absolutely does, and massively so! Unfortunately, it’s rather hard to depict that for everyone in a single chart. However, I’m putting here my own correlation over time, taken from our free analytics platform, so you can get a picture of how much it can vary.

Lead Time to PR Size Correlation over time chart for me

Conclusion

We examined the correlation between Lead Time and PR size to see whether we can draw some conclusions about the size we should be aiming for. We found that, statistically, there are some generalisations we can make, and we can estimate an optimal size. However, we also came to the conclusion that the link between them depends heavily on the company, the team, and even the individual developer. Which, in the end, seems to suggest that:

Each developer works in unique ways, and only you, if anyone, knows what is the optimal for you, and your team.

Now, if you would like to check where you or your team/company stands with respect to this correlation between Lead Time and PR size, we created a simple way for developers and teams to get insight into how this correlation changes over time and see where they stand, whether as an individual, a team, or a whole company. If you are curious about it, feel free to check it out.

Example correlation analysis page


6 replies

  1. When quantifying the lines of code from the PRs, are these counting the changes to tests?

    Would be great to be able to see if having a distinction between “production code” and “test code” would cause any variability in the metrics.
    Great work!


    • I’m very glad you found it interesting!

      Excellent point you are making there. Right now I haven’t done any separation of the code, therefore it takes everything into account, tests included. Indeed, it would be interesting to see whether that changes the metrics and the charts.

      I’ll try to have a look when I get some time again to check it out. I could definitely let you know what sort of differences there were, if interested.


    • I might be missing something, apart from the few pages below the image with the correlation and other types of analyses on the data is there something else that I missed and didn’t include?


  2. Thanks for sharing this article, very insightful.

    How did you get this data? I ask this because you mention “PBI” which stands for Product Backlog Item which indicates to me that you might use something like Azure Boards (which I do too) and I would love to repeat your analysis but with my own data 🙂


    • I’m very glad to hear you found it useful!

      This data comes from the GitHub and GitLab (cloud and on-prem) integrations in our platform, which users integrate with to get visibility into and an understanding of their ways of working. We are currently working on the Azure DevOps integration, which should be out in a couple of weeks.

      Some of these analyses are available in the platform at the click of a button for yourself, your team and your company as I mention towards the end of the article.

      If you want to do the analysis on your own, I’m glad to guide you through how we will do it in Azure and which APIs you need to use and how to clean, sanitize, and transform the data, and to provide you with the code for this analysis here.

      If you want I could let you know once we have solidified the Azure DevOps integration on our side, so I don’t point you towards anything that we haven’t yet validated to be the right way, and we pick it up from there 😉

