Data Collection

The data collection plan should emphasize the collection of current and relevant technical, programmatic, cost, and risk data. Data collection is a lengthy process and continues throughout the development of a cost estimate and through program execution. Many types of data need to be collected: technical, schedule, program, and cost data. Data can be collected in a variety of ways, such as from databases of past projects, interviews, surveys, data collection instruments, focus groups, and market assessment studies. After the estimate is complete, the data need to be well documented, protected, and stored in databases for future use. The cost data should be managed by estimating professionals who understand what the historical data are based on, can determine whether the data have value in future projections, and can make the data part of the organization’s history.

Cost estimates require a continual influx of current and relevant cost data to remain credible. Cost data should be continually supplemented with written vendor quotes, contract data, and actual cost data for each new program. Moreover, cost estimators should know the program acquisition plans, contracting processes, and marketplace conditions, all of which can affect the data. This knowledge provides the basis for credibly using, modifying, or rejecting the data in future cost estimates.

Knowing the factors that influence a program’s cost is essential for capturing the right data. Examples are equivalent source lines of code or the number of interfaces for software development, the number of square feet for construction, and the quantity of aircraft to be produced. To properly identify cost drivers, cost estimators must consult with the engineers and other technical experts. In addition, by studying historical data, cost estimators can determine through statistical analysis the factors that tend to influence overall cost. Case study 12 below highlights the importance of having historical data.
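As a minimal sketch of the statistical screening described above, the following Python example ranks candidate cost drivers by the strength of their linear correlation with historical cost. All program data, the driver names (`ksloc`, `interfaces`), and the helper functions `pearson` and `rank_drivers` are hypothetical illustrations rather than values or methods taken from this guide. A real analysis would use validated, normalized historical data and would be reviewed with engineers and other technical experts.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rank_drivers(candidates, costs):
    """Return (name, r) pairs sorted by |r|, strongest relationship first."""
    scored = [(name, pearson(values, costs)) for name, values in candidates.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical historical data for five completed software programs.
historical_costs = [1.2, 2.1, 3.3, 3.9, 5.0]  # cost in $ millions (made up)
candidate_drivers = {
    "ksloc": [10, 20, 30, 40, 50],   # thousands of equivalent SLOC (made up)
    "interfaces": [3, 8, 2, 9, 5],   # number of interfaces (made up)
}

ranking = rank_drivers(candidate_drivers, historical_costs)
for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

A strong correlation in a small sample only flags a candidate driver. It does not establish that the factor causes cost, which is one reason the technical consultation described above remains necessary.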

Case Study 12: Addressing Risks, from F-35 Sustainment, GAO-16-439

Central to F-35 sustainment is the Autonomic Logistics Information System (ALIS)—a complex system supporting operations, mission planning, supply-chain management, maintenance, and other processes. ALIS had experienced developmental issues and schedule delays that had put aircraft availability and flying missions at risk. The National Defense Authorization Act for Fiscal Year 2016 included a provision that GAO review the F-35’s ALIS. GAO assessed, among other things, the extent to which DOD had credibly and accurately estimated ALIS costs.

DOD had estimated total ALIS costs to be about $16.7 billion over the F-35’s 56-year life cycle, but performing additional analyses and including historical cost data would have increased the credibility and accuracy of DOD’s estimate.

For example, GAO found that DOD’s ALIS estimate substantially met some best practices for an accurate cost estimate: it was properly adjusted for inflation and contained no mathematical errors. However, the estimate used contractor-provided data for material costs instead of actual ALIS costs or historical cost data from analogous programs, either of which would have made the estimate more accurate. Cost estimating officials said that they did not base their ALIS estimates on historical cost data because they believed that there were no programs analogous to ALIS. For example, there is a logistics system for the Air Force’s F-22 program (also a fifth-generation aircraft), but officials stated that it was far less complex than ALIS and did not include all of ALIS’s applications and intended functions. However, multiple versions of ALIS have been fielded since 2010, and using historical data on known ALIS costs, as well as analogous data from the F-22 or other programs, would make the estimate more representative of likely sustainment costs.

GAO’s Cost Estimating and Assessment Guide states that a cost estimate should be based on historical data—both actual costs of the program and those of comparable programs—which can be used to challenge optimistic assumptions and bring more realism to a cost estimate.

Cost estimates must be based on realistic schedule information.[21] Some costs, such as labor, quality, supervision, rented space and equipment, and other time-related overheads, depend on the duration of the activities they support. Early cost estimates are often aligned with the baseline schedule. However, estimators should track changes in the schedule, because schedule changes are likely to lead to cost changes. Furthermore, seeking input from schedule analysts can provide valuable knowledge about how aggressive a program’s schedule may be.
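The dependence of time-related costs on duration can be illustrated with a small calculation. The sketch below assumes a deliberately simple model in which total cost is a duration-independent amount plus a constant monthly time-related cost. The function name `time_phased_estimate` and all dollar figures are hypothetical, and real programs rarely have a constant monthly rate.

```python
def time_phased_estimate(fixed_cost, monthly_time_related_cost, duration_months):
    """Total cost = duration-independent cost + (time-related cost per month * months)."""
    if duration_months < 0:
        raise ValueError("duration_months must be non-negative")
    return fixed_cost + monthly_time_related_cost * duration_months


# Hypothetical figures, in thousands of dollars.
FIXED_COST = 1_000                # e.g., material purchases that do not vary with duration
MONTHLY_TIME_RELATED_COST = 50    # e.g., supervision, rented space and equipment

baseline = time_phased_estimate(FIXED_COST, MONTHLY_TIME_RELATED_COST, 24)
slipped = time_phased_estimate(FIXED_COST, MONTHLY_TIME_RELATED_COST, 30)
schedule_driven_growth = slipped - baseline

print(f"Baseline (24 months): {baseline}")
print(f"After 6-month slip:   {slipped}")
print(f"Cost growth from schedule change: {schedule_driven_growth}")
```

Under these assumed figures, a six-month slip adds 300 (thousand dollars) even though the scope of work is unchanged, which is why an estimate aligned only with the baseline schedule can become stale.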

Additionally, backup data should be collected for performing cross-checks, and risk data should be collected to support sensitivity analysis and risk and uncertainty analysis.[22] Collecting these data takes time and usually requires travel to meet with technical experts. It is important to plan ahead and schedule adequate time for these activities. Insufficient time may limit the estimator’s ability to collect and understand the data, which can result in a lower-quality cost estimate.

A common issue in data collection is inconsistent data definitions between historical programs and the new program. Understanding what the historical data include is vital to data reliability. For example, are the data skewed because they are for a program that followed an aggressive schedule and therefore instituted second and third shifts to complete the work faster? Or was a new manufacturing process implemented that was supposed to generate savings but instead resulted in more costs because of initial learning curve problems? Knowing the history behind the data allows for their proper use in future estimates.

Data may not always be available, accessible, or complete. For example, some agencies may not have cost databases. Data may be accessible only at the summary level, and information may not be sufficient to break them down to the lower levels needed to estimate various WBS elements. Data may also be incomplete. For instance, data may be available for the cost to build a component, but the cost to integrate the component may be missing. Similarly, if data are in the wrong format, they may be difficult to use. For example, if the data are only in dollars and not hours, they may not be as useful if the labor and overhead rates are not available.
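The format problem in the last example can be made concrete. The sketch below assumes that historical labor cost can be converted to hours by dividing by a single fully burdened hourly rate (labor plus overhead). The function `dollars_to_hours` and the figures are hypothetical. When the rate is unavailable, the function returns `None` rather than guessing, mirroring the situation in which dollar-only data cannot be translated into hours.

```python
from typing import Optional


def dollars_to_hours(cost_dollars: float,
                     burdened_rate_per_hour: Optional[float]) -> Optional[float]:
    """Convert a labor cost to labor hours, or return None if the rate is unknown."""
    if burdened_rate_per_hour is None:
        # Without labor and overhead rates, dollar data cannot be converted to hours.
        return None
    if burdened_rate_per_hour <= 0:
        raise ValueError("burdened_rate_per_hour must be positive")
    return cost_dollars / burdened_rate_per_hour


# Hypothetical historical record: $120,000 of labor at an assumed $100/hour burdened rate.
hours_known_rate = dollars_to_hours(120_000, 100)
hours_unknown_rate = dollars_to_hours(120_000, None)

print(hours_known_rate)    # hours recovered because the rate is available
print(hours_unknown_rate)  # None: the rate is missing, so hours stay unknown
```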

Sometimes data are available, but the cost estimator cannot gain access to them. This can happen when the data are classified or considered competition sensitive. In these cases, the cost estimator may have to change the estimating approach to fit the data that are available.


  21. GAO’s Schedule Assessment Guide, GAO-16-89G (Washington, D.C.: December 2015), provides information on how to create a reliable schedule.

  22. For additional discussion of risk data, see chapter 12.