Validating and Analyzing the Data

The cost analyst must consider the limitations of cost data before using them in an estimate. Historical cost data have two predominant limitations:

the data reflect the contractor marketplace circumstances in which they were generated, and these circumstances must be known if the data are to have future value, and
current cost data eventually become dated.

The first limitation is routinely handled by recording these circumstances as part of the data collection task. For example, the contract type to be used in a future procurement—such as firm fixed-price, fixed-price incentive, or cost plus award fee—may differ from that of the historical cost data. Although this does not preclude using the data, the analyst must be aware of such conditions so that an informed data selection decision can be made. To accommodate the second limitation, an experienced cost estimator can either adjust the data (if applicable) or collect new data.

A cost analyst should attempt to address data limitations by

ensuring that the most recent data are collected,
evaluating cost and performance data together to identify correlation,
ensuring a thorough understanding of the data’s background, and
holding discussions with the data provider.

Thus, it is best practice to continuously collect new data so they can be used for making comparisons and determining and quantifying trends. This cannot be done without background knowledge of the data. This knowledge allows the estimator to confidently use the data directly, modify them to be more useful, or simply reject them.

Once the data have been collected, the next step is to create a scatterplot of the data. A scatterplot provides a wealth of visual information about the data, allowing the analyst to identify outliers, relationships, and trends. In a scatterplot, cost is typically treated as the dependent variable (the y-axis). The independent variables depend on the data collected but are typically technical parameters (such as weight, lines of code, and speed) or operational parameters (such as crew size and flying hours).

Scatterplots also show the dispersion in the data set, which is important for determining risk. In addition, the extent to which the points are scattered indicates how likely it is that each independent variable is a cost driver: the less scattered the points are around a trend, the more likely it is that the variable drives cost. Eventually, the analyst will use statistical techniques to confirm cost drivers, but scatterplots are an excellent way to identify potential drivers.
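As a numeric companion to the visual check, the degree of scatter can be summarized with a correlation coefficient. The following is a minimal sketch, assuming a Pearson (linear) correlation is an acceptable first screen; the function name pearson_r and all data values are hypothetical illustrations, not figures from any real program. A nonlinear relationship can score low here and still be visible on the plot, so this does not replace the scatterplot.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (assumed values, not real program data).
cost = [120.0, 150.0, 210.0, 260.0, 330.0]   # dependent variable (y-axis)
weight = [10.0, 14.0, 19.0, 25.0, 31.0]      # candidate driver with a tight trend
crew_size = [4.0, 9.0, 3.0, 8.0, 5.0]        # candidate driver with wide scatter

r_weight = pearson_r(weight, cost)
r_crew = pearson_r(crew_size, cost)

print(f"weight vs. cost:    r = {r_weight:.3f}")
print(f"crew size vs. cost: r = {r_crew:.3f}")
```

In this made-up sample, weight tracks cost closely while crew size does not, so weight would be the stronger candidate to carry forward for statistical confirmation.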

The cost estimator should also calculate descriptive statistics to characterize and describe the data. Important measures and statistics include sample size, mean, standard deviation, and coefficient of variation. The coefficient of variation is calculated by dividing the standard deviation by the mean. The resulting ratio, usually expressed as a percentage, can be used to compare the extent of variation across data sets, even when their magnitudes differ.
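These measures can be computed with a few lines of code. The sketch below uses Python's standard statistics module and assumes the sample standard deviation (n - 1 denominator) is the one wanted, which the text does not specify; the function name describe and the two data sets are hypothetical.

```python
import statistics

def describe(values):
    """Return sample size, mean, sample standard deviation, and CV (percent)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 data points for a standard deviation")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # assumption: sample (n - 1) standard deviation
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "cv_percent": 100.0 * stdev / mean,
    }

# Hypothetical data sets with very different magnitudes.
costs_a = [90.0, 100.0, 110.0]
costs_b = [800.0, 1000.0, 1200.0]

summary_a = describe(costs_a)
summary_b = describe(costs_b)

for name, s in (("A", summary_a), ("B", summary_b)):
    print(f"set {name}: n={s['n']} mean={s['mean']:.1f} "
          f"stdev={s['stdev']:.1f} CV={s['cv_percent']:.1f}%")
```

The standard deviations of the two sets cannot be compared directly because the sets differ in scale by a factor of ten, but the coefficients of variation can: set B varies twice as much relative to its mean as set A does.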

Visual displays of the descriptive statistics help discern differences among groups. Bar charts, for example, are useful for comparing means. Histograms can be used to examine the distribution of the data, show the frequency of values, and identify potential outliers.
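A histogram is, at bottom, a count of values per bin. The following sketch tallies fixed-width bins and prints a simple text histogram; the function name histogram_counts, the bin width, the starting edge, and the cost values are all hypothetical choices made for illustration.

```python
def histogram_counts(values, bin_width, start):
    """Count values per fixed-width bin; returns {bin_lower_edge: count}."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)   # which bin the value falls in
        lower = start + index * bin_width       # lower edge of that bin
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical cost observations (assumed values).
costs = [102.0, 98.0, 115.0, 121.0, 109.0, 104.0, 97.0, 188.0]
bin_width = 10.0
counts = histogram_counts(costs, bin_width=bin_width, start=90.0)

# Print one row per non-empty bin, one '#' per observation.
for lower, count in counts.items():
    bar = "#" * count
    print(f"{lower:.0f} to {lower + bin_width:.0f}: {bar}")
```

In this made-up sample, most values cluster between 90 and 130, and the single observation in the 180 to 190 bin stands apart, which would make it a candidate for closer review.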

Many times, estimates are not based on actual data but are derived by subjective engineering judgment. All engineering judgments should be validated before being used in a cost estimate. Validation involves cross-checking the results, in addition to analyzing the data and examining the documentation for the judgment. Graphs and scatterplots can help validate an engineering judgment because they can quickly point out any outliers.

An outlier is typically defined as a data point that falls more than three standard deviations from the mean. Statistically speaking, outliers are rare: for normally distributed data, only about 0.3 percent of values fall outside that range. If a data point is truly an outlier, it should be removed from the data set, because it can skew the results. However, an outlier should not be removed simply because it appears too high or too low compared to the rest of the data set. A cost estimator who removes an outlier should provide adequate documentation as to why it was removed. The documentation should include comparisons to historical data that show the outlier is in fact an anomaly. If possible, the documentation should describe why the outlier exists. For example, there might have been a strike, a program restructuring, or a natural disaster that skewed the data. If the historical data show the outlier is simply an extreme case, the cost estimator should retain the data point; otherwise, it will appear that the estimator was trying to manipulate the data. Removing an extreme case should rarely be done because historical data are necessary for capturing the natural variation within programs.
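The three-standard-deviation rule can be applied mechanically as a first screen. The sketch below flags points for review only; it does not remove anything, since removal still requires the documentation described above. The function name flag_three_sigma and the data are hypothetical, and the sample standard deviation is an assumption. Note that with a sample standard deviation computed from the same data, no point can exceed three standard deviations unless the set has at least 11 observations, so the rule is uninformative for very small data sets.

```python
import statistics

def flag_three_sigma(values, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations from the mean. Flagged points are candidates
    for review, not automatic removals."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # assumption: sample (n - 1) standard deviation
    if stdev == 0:
        return []  # all values identical, nothing can be flagged
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) > threshold * stdev
    ]

# Hypothetical cost observations: 14 values near 100 and one extreme value.
costs = [98.0, 102.0, 95.0, 105.0, 100.0, 97.0, 103.0, 99.0,
         101.0, 96.0, 104.0, 100.0, 98.0, 102.0, 250.0]

flagged = flag_three_sigma(costs)
for index, value in flagged:
    print(f"review observation {index}: {value}")
```

Here only the final observation is flagged. Whether it is an anomaly to remove or an extreme case to retain is a judgment that depends on the historical comparison and the documented cause, not on the screen itself.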