The Trouble With Software Estimation

In an effort to become more consistent with the way our company delivers estimates to clients and to each other, we decided to try an experiment aimed at discovering estimation variability and biases. The results were pretty surprising!

The experiment was divided into two parts:

  • Quantitative: team members were asked to estimate the number of hours that would be required to develop a set of features, and report on how these estimates might change depending on who was asking.
  • Qualitative: A guided discussion in which a number of aspects of the estimation were discussed in relation to the quantitative experiment.

This article presents the results of these experiments. A later article will go on to explore possible solutions to the software estimation problem.

Domain and Technology Knowledge Effects

For these experiments, participants were asked to estimate the build and unit-test portion of the effort only. Jonah’s standard estimation model focuses on the build effort because this is how delivery teams (whom we use to do our estimations) actually think: build/unit test is the activity that delivery teams can most accurately estimate (as opposed to other activities like analysis, design, testing, deployment, management, and support).

Participants were presented with three successive estimation scenarios:

  1. You have a domain with two entities in a one-to-many relationship. How long do you think it will take you to build the software that implements the UI, middle tier, and database persistence to create, read, update, and delete instances of these entities, as well as manage the associations between them?
  2. It is now revealed that you are free to use whatever tools you like, and your team and technology stack (one with which our company is intimately familiar) looks like the following: How long do you think it will take you to build the software?
  3. It is now revealed that your domain is a portfolio of stock positions, and looks something like the mindmap below. How long will it take you to build the software?

It should be noted that estimators were explicitly prohibited from considering scaffolding frameworks such as Rails, Grails, or Spring Roo, which would tend to skew results dramatically depending on assumptions. Instead, the primary goal here is to generate a baseline estimate for a simple software problem, and secondarily to determine what factors affect both internal estimates and reported estimates.

Results

The following is a summary of the results from the first three scenarios:

  • Technology and Domain Obscured (scenario 1): 84 hrs, sd: 46 hrs
  • Domain Obscured (scenario 2): 77 hrs, sd: 47 hrs
  • Nothing Obscured (scenario 3): 57 hrs, sd: 33 hrs

(Chart: mean estimate in hours by scenario, axis 0 to 140.)

As expected, implementation technology and domain knowledge both help to reduce build estimates, though technology familiarity has a lesser effect: on average, technology familiarity reduces estimates by about 8% relative to the baseline, and domain familiarity reduces them by a further 27%.
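The arithmetic behind these reductions can be sketched as follows. This is a minimal illustration that uses only the rounded mean estimates reported above, and it assumes the "further" domain reduction is measured relative to scenario 2; because the inputs are rounded, the output may differ by a point or so from the percentages quoted in the text.

```python
# Mean build estimates (hours) as reported in the results above.
means = {
    "technology_and_domain_obscured": 84,  # scenario 1 (baseline)
    "domain_obscured": 77,                 # scenario 2
    "nothing_obscured": 57,                # scenario 3
}


def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


baseline = means["technology_and_domain_obscured"]

# Effect of revealing the technology stack (scenario 1 -> scenario 2).
tech_effect = pct_reduction(baseline, means["domain_obscured"])

# Effect of revealing the domain (scenario 2 -> scenario 3).
# Assumption: measured relative to scenario 2, not the baseline.
domain_effect = pct_reduction(means["domain_obscured"], means["nothing_obscured"])

# Combined effect of both, relative to the baseline.
total_effect = pct_reduction(baseline, means["nothing_obscured"])

print(f"technology familiarity: {tech_effect:.1f}%")
print(f"domain familiarity (relative to scenario 2): {domain_effect:.1f}%")
print(f"combined, relative to baseline: {total_effect:.1f}%")
```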

Anecdotally, non-technical estimators do not show as much of a disparity between technology and domain knowledge effects – a more linear pattern was typical of these estimators, who showed similar reductions for both technology and domain knowledge, on average.

It should also be noted that some estimators increased their estimates for scenario 3 relative to baseline (scenario 1), indicating that they felt they had underestimated the complexity of the domain before they were given information about it. Though this wasn’t the case for the majority of people, it does indicate that estimators make very different assumptions about the complexity of a problem when they have incomplete information.

It’s also clear that estimates vary widely across individuals. Standard deviations range from 55% to 58% of the mean across the three scenarios. In a normal distribution, a range of +/- 2 standard deviations accounts for about 95% of the observations. Assuming a normal distribution of results would imply that, with 95% confidence, an estimate delivered at this level of accuracy could vary from the mean by as much as 116% (twice 58%). This is problematic for both estimators and clients, especially where the cost of the average estimate is significant.
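As a rough sketch of this calculation, the snippet below expresses each standard deviation as a fraction of its mean and derives the +/- 2 standard deviation band. It uses the rounded figures reported above, so the ratios may not match the quoted percentages exactly. The function names are illustrative, and clipping the lower bound at zero is an assumption added here because an estimate cannot be negative.

```python
# (mean hours, standard deviation in hours) as reported above.
results = {
    "Technology and Domain Obscured": (84, 46),
    "Domain Obscured": (77, 47),
    "Nothing Obscured": (57, 33),
}


def relative_sd(mean: float, sd: float) -> float:
    """Standard deviation expressed as a fraction of the mean."""
    return sd / mean


def two_sd_band(mean: float, sd: float) -> tuple[float, float]:
    """Range covering about 95% of a normal distribution (mean +/- 2 sd).

    The lower bound is clipped at zero (an assumption of this sketch).
    """
    return max(0.0, mean - 2 * sd), mean + 2 * sd


for label, (mean, sd) in results.items():
    rel = relative_sd(mean, sd)
    low, high = two_sd_band(mean, sd)
    print(f"{label}: sd is {rel:.0%} of mean, "
          f"+/- 2 sd is {2 * rel:.0%} of mean, band {low:.0f} to {high:.0f} hrs")
```

With these inputs the unclipped lower bound falls below zero in every scenario, which suggests the normal approximation is only a rough guide for data like this.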

Effect of "Who's Asking"

For scenario 4, estimators were asked to consider the parameters of scenario 3 again, which we assumed would generate the most accurate estimate of the first three. In addition, they were asked: “Given your estimate for the work, how long do you report it will take to each of the various people who might request the estimate?” In other words, how do you adjust your reported estimate based on who’s asking?

The requesters were revealed to be a Team Lead, a project manager, your sales guy, your boss, and the client. How long do you report it will take you to build the software?

Results

The following is a summary of the results from this experiment:

  • Self: 57 hrs, sd: 33 hrs
  • Team Lead: 61 hrs, sd: 42 hrs
  • Project Manager: 64 hrs, sd: 43 hrs
  • Boss: 64 hrs, sd: 43 hrs
  • Client: 72 hrs, sd: 52 hrs
  • Sales Guy: 73 hrs, sd: 50 hrs

(Chart: mean reported estimate in hours by requester, axis 0 to 140.)

These results show that reported estimates do in fact vary depending on who’s asking; up to 28% relative to baseline. The effect is not as smooth as one might interpret from the graph above: many estimators did not change their estimates at all based on who was asking. Others increased their reported estimates by up to 100% relative to baseline, depending on how they thought the estimate would be used or interpreted. This is why the standard deviations are so high.
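The "up to 28%" figure can be reproduced from the reported means. The sketch below is illustrative only: it works from the rounded averages listed above, not from the individual responses, so it says nothing about the spread within each group.

```python
# Mean reported estimate (hours) by requester, as listed above.
reported = {
    "Self": 57,
    "Team Lead": 61,
    "Project Manager": 64,
    "Boss": 64,
    "Client": 72,
    "Sales Guy": 73,
}

baseline = reported["Self"]

# Percentage increase of each reported estimate over the unpadded "self" estimate.
increase = {
    who: (hours - baseline) / baseline * 100
    for who, hours in reported.items()
}

for who, pct in increase.items():
    print(f"{who}: +{pct:.0f}%")
```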

Interestingly, no one reported an estimate lower than what they thought the work would actually take, regardless of the requester; it turns out that machismo is not a part of Jonah’s estimation culture!

Anecdotal Interpretation

Anecdotal discussion suggests that those who padded their estimates did so to give themselves a little margin for error, and that the amount of padding depended on one or more of the following:

  • Perceived technical ability of the requester (the higher the technical ability, the lower the padding); in this case, “technical ability” might also be re-cast as the “requester’s ability to interpret the estimate.”
  • Perceived consequences of getting the estimate wrong (the greater the potential consequences, the greater the padding)
  • Perceived effect on chances of winning the work (the higher the chance of losing the work, the lower the padding)
  • Perception of what the requester would do with the estimate (e.g. some developers assumed salespeople would reduce the estimate to win the work, so they increased their padding to account for this)
  • Perception of what the estimator is reporting. Some sales people tended to reduce estimates reported by technical estimators before reporting them upward, assuming that the technical estimates would have been inflated for contingency.

We noticed that padding for team leads, project managers, and bosses was similar, and again for sales people and clients. Based on this, we (somewhat arbitrarily) segmented the above results into 3 bands, within each of which “similar” levels of padding were reported:

  • baseline (self)
  • baseline + contingency padding (team lead, PM, boss)
  • baseline + contingency and perception management padding (client, sales person)

Recasting these and taking “average” padding values for each group gives us the following:

  • Self: 0% padding
  • Contingency: 11% padding
  • Contingency & Perception Management: 27% padding

(Chart: average padding by band, axis 0 to 140.)

This segmentation attempts to quantify the effect of “contingency” and “perception management” on padding. “Contingency” padding was about 11% on average, and “perception management” padding was an additional 16%, for a total of 27%.
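One way to arrive at these band averages is sketched below. It assumes that each band's padding is the average of its members' mean reported estimates compared against the "self" baseline; the article does not spell out the exact method, so treat this as one plausible reconstruction from the rounded figures.

```python
from statistics import mean

# Mean reported estimate (hours) by requester, from the results above.
reported = {
    "Self": 57,
    "Team Lead": 61,
    "Project Manager": 64,
    "Boss": 64,
    "Client": 72,
    "Sales Guy": 73,
}

# The three bands described above.
bands = {
    "Self": ["Self"],
    "Contingency": ["Team Lead", "Project Manager", "Boss"],
    "Contingency & Perception Management": ["Client", "Sales Guy"],
}

baseline = reported["Self"]

# Average padding per band, as a percentage over the baseline.
padding = {
    band: (mean(reported[who] for who in members) - baseline) / baseline * 100
    for band, members in bands.items()
}

for band, pct in padding.items():
    print(f"{band}: {pct:.1f}% padding")
```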

Estimator Effect

The high standard deviations from the four experiment scenarios presented indicate that estimates are highly variable depending on who is doing the estimating. A guided discussion was conducted to help uncover the reasons for this variability. The following is a summary of this discussion.

Estimate Influencers

Estimators were asked what the biggest influencers on their estimates were. We’ve already seen that knowledge of the technology stack, the domain, and who’s asking for the estimate affect reported estimates. There was also a general consensus that estimators wanted to be as honest as possible with their estimates – transparency of estimation was a stated goal of many involved in the experiment.

Among the other answers were the following:

  • their own historical accuracy of estimation on similar projects
  • the strategic nature of the client (estimate decreases as client becomes more strategic)
  • the “fine print” of the contract, if for a client (estimates increase the more constrained by the contract the estimator feels)
  • how flexible the client is to scope alterations
  • how busy the estimator currently is
  • how productive the estimator feels that they are
  • whether or not you want to work with the technology (tends to lower reported estimate)
  • what the team makeup is (better teams lower estimates)
  • whether they’ve worked with the client before:
      • estimates are adjusted based on past experience with the client
      • padding is reduced the more you work with a client
      • new clients are sometimes given lower estimates to encourage them to sign on
  • financial status of the company (tend to lower estimates if work is needed)
  • quality expectations (tend to increase estimates if the client is seen to be quality-sensitive)

In other words, each estimator carries a complex mental model with respect to estimates, especially with respect to the differences between the estimate itself and the reported estimate. It’s not surprising that people are somewhat reluctant to deliver estimates, even in the face of healthy specifications.

While the differences between the estimate and the “reported” estimate are interesting, it seems that the estimate we are really after is the one with no padding (“self” in the experiments above). This matters especially because we want to make sure that the same padding isn’t added more than once to the same estimate by different individuals making different assumptions as it’s passed up the chain. We’ve found the best way to uncover this estimate is to repeatedly ask the estimator for an unpadded estimate. They’ll usually protest, but give it up in the end (amidst a sea of caveats).
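To illustrate why repeated padding is a concern, the sketch below compounds a contingency margin over several hand-offs. The three-person chain and the idea that each person applies the same 11% margin are hypothetical assumptions made for this example; they were not measured in the experiment.

```python
def pad_up_the_chain(unpadded_hours: float, paddings: list[float]) -> float:
    """Apply each person's padding (as a fraction) in turn to the estimate."""
    hours = unpadded_hours
    for padding in paddings:
        hours *= 1 + padding
    return hours


unpadded = 57  # mean "self" estimate in hours, from the results above

# Hypothetical chain: three people each add 11% contingency independently.
chain = [0.11, 0.11, 0.11]

padded = pad_up_the_chain(unpadded, chain)
print(f"unpadded: {unpadded} hrs, after {len(chain)} hand-offs: {padded:.0f} hrs "
      f"(+{(padded / unpadded - 1):.0%})")
```

Under these assumptions a margin that each person considers modest grows to roughly 37% by the time the number reaches the client, which is why starting from the unpadded figure is useful.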

Who is Responsible?

Without an established estimation process, we’ve seen that an estimate can be more than double another estimate, depending on the estimator alone! Who is responsible for this process, for reporting its outcome, and for its accuracy and fidelity?

From the perspective of our development staff, estimates generally originate from tech leads, client suggestions re: budget, sales engineers, or project templates. Many were concerned about the level of input they had into the estimates before projects were sold. To the question “Who is responsible for the estimate?” replies included the following:

  • sales guy
  • team lead
  • project manager
  • team as a whole

Curiously, no one reported that they felt individually responsible for estimates others made, or even for their own estimates when these were rolled into a larger package of estimates.

When asked “what would make you feel more responsible for your and your team’s estimates”, the following answers were reported:

  • knowing that you will do the work
  • macro-level feedback; over the course of a number of iterations or projects, how is the team doing relative to overall estimates?
  • micro-level feedback; within a single iteration, how does the actual time spent day-to-day compare to the originally estimated time on a per-task basis?
  • “say” in the technology being used; familiarity breeds comfort

Clearly, estimators feel more comfortable estimating for themselves, about domains they understand, using technologies with which they are familiar. They also want tools to help measure how well they are doing so that they can improve on their estimations over time. While these results are not surprising, the conditions that estimators need to improve on estimations over time are often either not possible or not enacted.
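The micro-level and macro-level feedback described above could be tracked with something as simple as the following sketch. The task names and hours are entirely hypothetical and serve only to show the shape of the comparison.

```python
# Hypothetical per-task records: (task, estimated hours, actual hours).
tasks = [
    ("entity CRUD screens", 16, 20),
    ("association management", 8, 8),
    ("persistence layer", 12, 9),
]

# Micro-level feedback: actual vs. estimated time for each task.
# A ratio above 1.0 means the task took longer than estimated.
ratios = {name: actual / estimated for name, estimated, actual in tasks}

# Macro-level feedback: how the iteration as a whole compares to its estimate.
total_estimated = sum(estimated for _, estimated, _ in tasks)
total_actual = sum(actual for _, _, actual in tasks)
overall_ratio = total_actual / total_estimated

for name, ratio in ratios.items():
    print(f"{name}: actual/estimated = {ratio:.2f}")
print(f"overall: {total_actual} of {total_estimated} estimated hrs "
      f"({overall_ratio:.2f})")
```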

What about the rest of the work?

You might recall that the estimates from the quantitative experiment were for the build and unit test effort only. On any consulting project, there is also time set aside for discovery, analysis, design, integration testing, UAT, documentation, deployment, project management, and warranty support.

This raises the question “what percentage of the total effort is the build?”, as this has a major impact on the final estimate for all of the services to be delivered to the client.

Results were all over the map here: 25%, 30%, 40%, 60%, 70%, 80%, “not more than 50%”, “60% regardless.” I didn’t bother to graph these.
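To see how much this answer matters, the sketch below scales the 57-hour average build estimate from scenario 3 by each of the numeric build percentages reported above. It assumes the total effort is simply the build estimate divided by the build's share of the work, which is a simplification made for illustration and not necessarily the model Jonah uses.

```python
def total_effort(build_hours: float, build_fraction: float) -> float:
    """Total project hours implied by a build estimate and the build's share of the work."""
    if not 0 < build_fraction <= 1:
        raise ValueError("build_fraction must be in (0, 1]")
    return build_hours / build_fraction


build_hours = 57  # mean unpadded build estimate from scenario 3

# Numeric answers given to "what percentage of the total effort is the build?"
reported_fractions = [0.25, 0.30, 0.40, 0.60, 0.70, 0.80]

for fraction in reported_fractions:
    total = total_effort(build_hours, fraction)
    print(f"build is {fraction:.0%} of effort -> total of about {total:.0f} hrs")
```

Under this assumption, the same build estimate implies a total that differs by more than a factor of three between the lowest and highest answers.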

The group began to better understand the point of the question when a follow-up was asked: “If the federal government asked you how much the total effort would be, how would you mark up your build estimate relative to a scenario in which an individual asked you what the total effort was?”, the assumption being that the federal government would be a much more formal, slow-moving, and difficult client than an individual might be. This question garnered a lot of contemplative ceiling-staring.

Obviously, the answer to this question has a huge effect on estimates that are delivered to clients, and will be the subject of a follow-up article on the software estimation methods we use at Jonah.

Wrap-up

With respect to software build estimates, asking different people the same question leads to wildly different results. Even asking the same question to the same person multiple times leads to different results, as the estimator is cajoled into reconsidering their assumptions and thinking more deeply about the problem. It is clear that there is no single “estimate”. The answer really is “it depends.”

Technology, domain knowledge, and who’s asking all have measurable effects on an estimate, but none of these compares to the effect of the estimator themselves, including not only how much time the estimator thinks it will take, but also the vast array of assumptions that the estimator has considered in support of the estimate.

So where does this leave the estimator in a sales context, when the prospective client asks for the estimate for all of the work? How much should one squeeze the estimate in a competitive situation? Where does the impulse for a developer to disregard the estimate begin (“I never said I could get it done in that amount of time”)? What effect does the type of contract have on the estimate? How does one integrate the notion of the “type of client” into an estimate?

Certainly, software estimation should not be done haphazardly, and one’s approach should be defensible to both the client and to the delivery team. We’ve found that more accurate estimates are delivered if people discuss their mental models and negotiate with one another, rather than just asking for a number, marking it up and reporting it. Negotiating increases the accuracy of the estimate by helping to drive out assumptions. It also helps the team to collectively feel that they are stakeholders, and that they implicitly share a common purpose. The effect of this negotiation on estimation accuracy is unmistakable.

In a follow-up article, we'll discuss two models we use to estimate software projects, examine how these deal with the effect that the client has on a software estimate, and try to answer some of the more sticky questions above.

  1. Interesting article, a bit long, but engaging. An important contributing factor to estimation is trust. Inherently, we trust more people that understand what we do and less those we pay to do something for us. We get more cautious and doubtful with the latter. The article however, never talks about the actual process and methodology of estimation. Is it just part of the overall development methodology? Is it just (gu)estimation? Half ‘n’ half? Is it agile? User stories? Tasking? Surely not use cases involved? Or do we simply look to rely on folks who can cut the mustard, that we know from past experience?
    Oh, and I liked the mind maps too. Very descriptive.

    Nenad
    Jan 24th, 2011
  2. Yes – I agree – it was getting long. That’s why I omitted the estimation process Jonah uses. The follow-up article will talk about that a bit, as well as the client effect.

    Jeremy Chan
    Jan 24th, 2011
