The Use of Econometric and Statistical Analysis in Damages

Introduction

Empirical analysis – using data to test hypotheses and understand relationships – is a critical part of economic research and has become increasingly important in disputes. There are numerous tools available to help to understand data, and yet many lawyers are reluctant to have their experts use all the powerful tools in their toolbox out of a concern that the analysis will be too complex, too confusing to present, or that it will appear to have been created out of thin air. Although those concerns may be reasonable in principle, in practice they are readily addressed.

To see why that is so, and before addressing the details of any data-driven analysis – particularly a statistical or econometric methodology such as regression analysis – it is useful to take a step back and think about the underpinning of any empirical study: data. A useful way to think about data is as information: ‘data’ does not mean numbers alone but refers to information – whether numerical or textual – together with the context and source of that information. Thus, when thinking about statistical analysis, a fundamental question is whether a data-driven analysis – or, more specifically, any statistical analysis – is appropriate for the matter at hand. To that end, there are at least four questions to be addressed.

  • What do I want to prove and can I use data to prove it?
  • How do I use data properly to prove what I want to establish?
  • How can I be sure of what the data is telling me?
  • How do I choose the right methodology to analyse the data I have?

If it is decided that data is necessary (or at least useful) in proving what needs to be established, then the information requirements for that analysis must be considered. In assessing whether a set of numbers is the ‘right’ one for a statistical study, one also needs to consider the numbers’ ‘context’ and ‘source’.

As to context, although it is clear what a number is within a data set (e.g., sales volume, price, a person’s age or income), the meaning of those numbers requires context. To make clear what context is, imagine being confronted with a data set showing sales.

Table 1

14,361,211    12,017,002    -
15,074,191    12,316,616    -
15,166,207    13,040,210    10,611
16,891,916    13,896,607    91,431
17,110,001    14,771,432    283,771
17,916,142    15,210,312    356,291
17,432,116    16,007,519    517,991
15,817,210    16,743,910    843,062
15,612,901    17,212,310    1,120,312
15,156,210    17,943,519    1,753,077
15,001,710    18,337,263    2,910,816

Standing alone, those numbers, which apparently reflect sales, are meaningless. Their meaning starts to emerge with an understanding of the context of the data. That is: sales of what product? What period is covered by the sales? What geographical area is covered? Sales to whom? How are they measured (i.e., in units or in a currency (and, if so, which currency))? Are the sales numbers final or merely preliminary? Net or gross? Whether this data is useful, or even appropriate, for a statistical study cannot be answered without a sense of the data’s context.

Equally important before venturing into the world of statistical study is a grasp of the source of the data: the entity, place or publication in which the data is found; who provided the data; and whether there is a reliable basis for trusting it.

For any data-driven analysis, whether it is based on a simple analysis of means and standard deviations or on a more sophisticated and complex econometric study, having good answers to these questions is fundamental. If nothing else, for an expert presenting the data or statistical analysis, surviving an admissibility challenge may often depend on establishing the reliability and appropriateness of the data, and these questions are central to that process.

Platitudes aside, what does context add?

Table 2: Best Widgets Company – annual widget sales by region (units)

Year    North America    European Union    Rest of the world
2011    14,361,211       12,017,002       -
2012    15,074,191       12,316,616       -
2013    15,166,207       13,040,210       10,611
2014    16,891,916       13,896,607       91,431
2015    17,110,001       14,771,432       283,771
2016    17,916,142       15,210,312       356,291
2017    17,432,116       16,007,519       517,991
2018    15,817,210       16,743,910       843,062
2019    15,612,901       17,212,310       1,120,312
2020    15,156,210       17,943,519       1,753,077
2021    15,001,710       18,337,263       2,910,816

Simply by putting a time and geographical dimension on the data, and indicating the unit of measurement, the stories the data can tell become clearer. Whether the data alone is sufficient to answer the relevant questions is unclear, but simply by adding context to the numbers, the way in which the data might be useful is much more obvious. It may also be clear that to analyse the question at hand fully, more data is needed. Either way, context is critical.

The same is true for source. Not all data is created equal. One critical task for an analyst using statistical methods to address a complicated problem is to understand where the data comes from and to take reasonable steps to be sure the data is reliable. It is necessary to understand the source of the data, how it is created and collected, and to ensure that the data is capable of generating and supporting robust results.

As stated above, data is information but it is distinct from statistics. Statistics is a method for analysing that information. At a high level, there are descriptive statistics; for example, measures of central tendency (mean, median, mode) or measures of dispersion (variance and standard deviation). There are also more analytical statistics, such as correlation, measures of statistical significance, goodness of fit and regression analysis. The type of statistic needed determines what to do with the available data and how to analyse it. It should be noted that the choice of, or need for, descriptive or analytical statistics is not an either/or choice. The question being asked and the available data determine what statistical analyses are supportable. It will often be the case that a question can best be answered with a wide range of statistical analyses. What must be emphasised, however, is that even simple descriptive statistics can be powerful tools for analysis and can support robust conclusions, although analytical statistical methods may be needed to answer more complex questions.
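
To make the distinction concrete, the descriptive statistics mentioned above can be computed in a few lines of code. The following is a minimal sketch in Python; the monthly sales figures are invented for illustration only:

```python
# Hypothetical monthly sales figures (units) - invented for illustration
import statistics

sales = [1200, 1350, 1280, 1410, 1390, 1500, 1475, 1620]

mean = statistics.mean(sales)      # measure of central tendency
median = statistics.median(sales)  # central tendency, robust to outliers
stdev = statistics.stdev(sales)    # dispersion (sample standard deviation)
```

The analytical statistics discussed in the remainder of the chapter – correlation, significance tests and regression – build on these same basic quantities.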

Overview of regression analysis

Regression analysis is a statistical technique that can be used to estimate the relationship between a dependent variable and one or more independent variables. The term ‘dependent variable’ simply means that the variable may depend on, respond to or be associated with the values of one or more other variables.[2] Similarly, an ‘independent variable’ is one that may partially determine, influence or otherwise help to predict the value of the dependent variable.[3]

Economists use regression analysis to evaluate relationships between variables of interest in economic data or to predict future values of a variable of interest.[4] It is in this context where regression analysis is most frequently used for damages purposes. For example, consider a multi-feature product such as a smartphone or microprocessor. A damages expert may be interested in the relationship between a patented feature and the price of an accused product. Such a relationship can be estimated using a type of regression called hedonic regression, discussed below. As another example, a damages expert may be interested in estimating the effects of a product’s sales on the sales of a competitive product. The damages expert could use time series regression to estimate the product’s sales for the periods of interest given historical data on the product’s sales prior to the sales of the competitive product.[5] Both examples are discussed in more detail below.

There are many different types of regression analyses and estimation techniques that can be used, depending importantly on the question of interest (e.g., how an improved camera zoom capability affects the price of a smartphone). The most common type of regression used in econometric analysis is linear regression, which assumes there is a linear relationship between the dependent variable and the independent variable, or variables. Additionally, the most commonly used estimation technique for linear regression is referred to as ordinary least squares. With this estimation technique, the coefficients on the independent variables are set such that the resulting linear function provides the best ‘fit’ with the values of the dependent variables in the sample.[6]
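
As a sketch of what ordinary least squares does mechanically, the following Python fragment fits a one-variable linear regression in closed form to invented data; the slope and intercept computed below are precisely the values that minimise the sum of squared residuals:

```python
# Invented data: the dependent variable is roughly twice the independent one
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form OLS for a single independent variable
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
```

With more than one independent variable the same minimisation is carried out with matrix algebra, but the principle is identical.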

It is important to note that the results obtained from a regression estimation are dependent on the data sample used. In other words, with a different sample of data, the analyst would expect to obtain different results. Thus, the analyst should not view the results of a regression estimation as providing an irrefutable answer to a question of interest: such is the nature of statistical inference. Fortunately, a number of tools are readily available that can help the analyst evaluate the regression results while accounting for the probabilistic nature of statistical analysis. Although a complete overview is beyond the scope of this chapter, a few important concepts that frequently arise when performing regression analysis are described below.

  • R-squared: This metric is the proportion of variation in the dependent variable that is explained by the independent variables. The closer the observed values are to the predicted values, the better the regression fits the actual data. R-squared values range from zero to one, with higher values indicating a better fit.[7] An alternative metric, known as adjusted R-squared, measures the same fraction of explained variation in the dependent variable but adjusts for the number of independent variables used in the regression.[8]
  • Multicollinearity: This issue arises when an independent variable can be predicted with a high degree of accuracy from one or more other independent variables. When a data sample suffers from multicollinearity, the resulting estimated coefficients on the affected independent variables may be unreliable: in effect, those variables contain redundant information about their respective effects on the dependent variable. As an example, consider height measured in inches and height measured in centimetres. Attempting to include both as independent variables in a regression analysis would be problematic as they contain the same information, albeit in different units of measurement. In fact, it would not be possible to estimate such a regression, since height in inches perfectly predicts height in centimetres and it would therefore be impossible to identify the effect of either variable on the dependent variable. Such a case is known as perfect multicollinearity.
  • P-value: This statistic indicates how consistent the observed data is with a hypothesis about a coefficient of interest. Given a hypothesis (e.g., an improved camera zoom capability has no effect on the price of a smartphone), the p-value is equal to the probability of obtaining a test statistic at least as extreme as the one observed if the hypothesis were true. Thus, a low p-value provides evidence against the hypothesis, as it indicates that the observed result would be unlikely if the hypothesis were true. In statistical parlance, an analyst may ‘reject the null hypothesis’ if the p-value is less than some pre-specified threshold, with 0.10, 0.05 and 0.01 being the most commonly used.
  • Standard error: The standard error of a statistic is equal to the standard deviation of the statistic’s distribution (or an estimate of the standard deviation).[9] For a regression analysis, the analyst will often be interested in the standard error of an estimated coefficient on a variable of interest, as it can provide insight into the precision of the estimated coefficient. Generally, the smaller the standard error, the more precise the estimate, in the sense that repeated estimation with different samples would be expected to generate lower variation than if the standard error were large. Dividing the estimated coefficient by the standard error results in the t-value or t-statistic for the estimated coefficient. A t-value provides another way by which hypothesis testing can be conducted, as a t-value has a corresponding p-value.
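
The concepts above can be illustrated with a short Python sketch that fits a one-variable regression to invented data and computes the R-squared, the standard error of the slope and the corresponding t-value by hand:

```python
import math

# Invented sample in which y depends roughly linearly on x
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.2, 5.9, 8.1, 10.2, 11.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
ss_res = sum(e ** 2 for e in residuals)            # unexplained variation
ss_tot = sum((y - y_bar) ** 2 for y in ys)         # total variation

r_squared = 1 - ss_res / ss_tot                    # proportion explained
se_slope = math.sqrt(ss_res / (n - 2)) / math.sqrt(sxx)  # precision of slope
t_value = slope / se_slope                         # basis for the p-value
```

A large t-value (here well above conventional critical values) corresponds to a very small p-value, so the slope would be judged statistically significant.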

Examples of regressions

Predict a trend

A predictive model is a useful and reliable statistical tool because, unlike a model that estimates an outcome based on all the factors that cause that outcome (i.e., a causal model), it does not require specifying each and every factor that causes the outcome. All that is needed is a factor that can reliably predict the outcome.

By way of analogy, say you want to estimate the average temperature in a certain city in the month of July. There will be variation in the temperature during the days of the month but the best predictive factor will be a data set of past temperatures for that month. You do not need to know the barometric pressure, humidity, precipitation, cloud cover, and so on. You can predict, with reasonable certainty, the average temperature for a given month by knowing what the average temperature was for that month in past years.

This type of regression analysis is referred to as a time series regression, in which a growth trend is estimated and data is projected forward based on that estimation. Time is the independent variable (x-variable) and this type of model assumes that the dependent variable (the y-variable) will vary based on the historical trend.

It is possible to account for discrete anomalies that affect the data through the use of indicator, or ‘dummy’, variables. These are categorical variables that take on a value of one for a single period, or for certain periods in which the same effect is expected to have occurred, and zero for all other periods. For instance, if one were analysing sales data and a company had a 10-year anniversary sale that led to higher than usual sales revenue, this effect could be accounted for with the use of a dummy variable, or variables. The dummy variable would take on a value of one for the duration of the sale and zero for all other times. Any revenue forecast would probably have to account for the increase in revenue during the sale that would not be repeated in the future, absent a comparable sale. In addition, dummy variables are often employed if data is quarterly or monthly, or if it varies by season to account for specific differences during those periods. A chocolatier, for instance, might use a dummy variable that is one for certain days around Valentine’s Day when attempting to predict its sales trend.

A simple example can be given by using a version of the information from the sales of widgets by Best Widgets Company shown above in Table 2. Since 2010, Best Widgets Company has been selling a widget under its trademark WFA. In 2012, the company experienced a warehouse fire that reduced sales for that year, but it rebounded and had been experiencing steady growth since that time. In 2017, a competitor, Great Widgets Company, started selling similar widgets by infringing the WFA trademark. The sales trend for Best Widgets Company is shown in Figure 1, below.

Figure 1: Best Widgets – Widget sales in North America

[chart not reproduced]

The loss in sales experienced by Best Widgets can be estimated using a time series regression through 2017 with a dummy variable for 2012, since the fire was a one-off occurrence that is not expected to be repeated. Then sales between 2018 and 2020 can be predicted. There are many computer programs that can run a simple regression such as this, although more complicated analyses will require more sophisticated tools. The results of the regression are shown below in Table 3.
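
As an illustration of the mechanics only (the figures below are invented, not the actual Best Widgets data), a time series regression with a one-period dummy can be estimated and projected forward as follows:

```python
# Invented figures: a linear sales trend with a one-off 3-million-unit
# drop in year 2 (the 'fire' year)
import numpy as np

periods = np.arange(1, 8)                    # seven historical years
dummy = (periods == 2).astype(float)         # 1 in the fire year, 0 otherwise
sales = 14_000_000 + 500_000 * periods - 3_000_000 * dummy

# Design matrix: intercept, time trend and the fire dummy
X = np.column_stack([np.ones_like(periods, dtype=float), periods, dummy])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, trend, fire_effect = coef

# Project the but-for trend three years ahead (dummy = 0 going forward)
future = np.arange(8, 11)
predicted = intercept + trend * future
```

Because the invented data follows the model exactly, the regression recovers the trend (500,000 units per year) and the fire effect (a drop of 3 million units) precisely; with real data the estimates would carry standard errors, as discussed above.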

Table 3

[regression output not reproduced; the key statistics are discussed in the text]

Some of the statistics in Table 3 that an expert would want to examine are the coefficients on the independent variables – in this case Period and Dummy. The coefficient on the Period variable is a positive 731,573.4, with a t-value of 16.92. The p-value (shown in the column with the header ‘P>|t|’) is the probability of obtaining a t-value as large as that calculated in the analysis if the variable of interest (e.g., Period) in fact had no effect. With a p-value of 0.000, there is almost zero probability that these results arose by chance. Typically, an analyst will be looking for p-values of less than 0.05 or 0.10, indicating that the coefficients are significant at the 5 per cent or 10 per cent level. With p-values of 0.000, the coefficients on Period and Dummy are both statistically significant at even the 1 per cent level.

These coefficients are measures of units, and the coefficient of 731,573 on Period means that the regression predicts an increase of that many units per year. The coefficient on Dummy is a negative 4,665,780, indicating a decrease of approximately 4.7 million units from what would be expected from the regression results. This is the impact that is attributable to the warehouse fire. A dummy variable of this type will shift the results by a constant amount – if the dummy variable had been applied to more than one period it would measure one impact that would apply to all periods for which the dummy variable applied.

Other regression statistics of interest address the overall significance and the goodness of fit of the model. The F-value or F-statistic (shown in the top right of the regression results) tests the joint statistical significance of the coefficients in the regression; it is used to determine the likelihood that a combination of coefficients is non-zero. Here the F-value is 351.10 and Prob>F = 0.000, which is less than 0.05 and even less than 0.01, indicating that there is a statistically significant relationship between the independent and dependent variables. The F-test evaluates the hypothesis that a model with no independent variables (i.e., an intercept only) fits the data as well as the regression model with the independent variables. It is possible that none of the independent variables is statistically significant individually but that together they are jointly significant; the more independent variables included in a model, the more likely this is to occur. The adjusted R-squared value, which is 0.9901 in the regression analysis shown in Table 3, indicates that the model is likely to be a good fit.[10] R-squared is always between 0 and 1, and the closer it is to 1, the more of the variability of the data around its mean has been explained by the model – in other words, the better the overall fit of the model.

This model predicts that, for the years from 2018 to 2020, unit sales by the Best Widgets Company would have been 20,221,427, 20,953,001 and 21,684,574, respectively. Actual and predicted sales are shown in Figure 2, below. Because actual sales in 2018, 2019 and 2020 were 12,612,901, 11,156,210 and 11,001,710, this indicates a difference of 7,608,526 units, 9,796,791 units and 10,682,864 units, respectively. In total, this means that the regression analysis predicts that Best Widgets Company would have sold an additional 28,088,181 units but for the actions of Great Widgets Company.

Figure 2: Actual and predicted widget sales

[chart not reproduced]

This is likely to be a less complex set of data than typically will be found in the real world. The growth trend in the regression model has been assumed to be linear – that is, the increase is assumed to be a fixed number of widgets per year. It is often the case that percentage growth is more realistic; instead of assuming an increase of a fixed number of units per year, one would assume a percentage increase per year. One way this can be accomplished is by using a log transformation. There are various types of log transformations, but one common application involves ‘transforming’ the dependent variable by taking the natural log (log base e, often written ln) of those amounts. When the regression is run with the dependent variable in ‘logs’ and the independent variables in ‘levels’, the resulting coefficients have a different interpretation. Instead of interpreting the coefficient as a fixed number of additional units per one-unit change in the independent variable (such as the 731,573 units per year shown in the regression analysis in Table 3), the result is now an approximate percentage change per one-unit increase in the independent variable.[11]
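
A short sketch of the log transformation, using invented data that grows by exactly 5 per cent per year, shows how the slope from a log-level regression translates into a growth rate:

```python
import math

# Invented data that grows by exactly 5 per cent per year
periods = list(range(1, 11))
sales = [100_000 * 1.05 ** t for t in periods]
log_sales = [math.log(s) for s in sales]     # natural log (base e)

n = len(periods)
t_bar = sum(periods) / n
y_bar = sum(log_sales) / n
slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(periods, log_sales))
         / sum((t - t_bar) ** 2 for t in periods))

# In a log-level regression the slope approximates the per-period growth
# rate; exponentiating recovers the 5 per cent growth rate exactly here
growth = math.exp(slope) - 1
```

With noisy real-world data the recovered growth rate would only approximate the true rate, and the significance tests described above would apply.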

This regression analysis looked at the number of units that Best Widgets Company lost as a result of the infringing activity of Great Widgets Company. A regression analysis can be a helpful tool in determining damages but typically it will not be the entirety of a damages analysis. For instance, to calculate lost profits, which is typically the relevant measure of damages when looking at the plaintiff’s or claimant’s loss, the profit per widget would have to be determined. It is also possible that the infringing actions of Great Widgets Company caused there to be price erosion (another measure that may be informed by regression analysis). Also, if Best Widgets Company lost sales for reasons other than the infringing activity of Great Widgets Company, that would have to be taken into account in determining damages. A regression, then, can inform a damages analysis but must be accompanied by a reasoned economic model and other considerations may have to be taken into account to arrive at a final calculation of damages.

Explain a relationship

Regression analysis is also frequently used to explain a relationship between two or more variables. In this context, the dependent variable is the variable to be explained. The independent variables are those believed to produce changes in the dependent variable – in other words, they are the variables that explain the dependent variable (and are commonly referred to as ‘explanatory variables’ for that reason).

One useful type of explanatory regression analysis is hedonic regression, which is the use of a regression model to estimate the influence that various factors have on the price of a good. For example, the price of a house may be influenced by a number of factors, such as the age of the house, whether it has state-of-the-art features, the level of crime in the neighbourhood, whether it is located near a good school and the level of water and air pollution, among other things. By way of hedonic regression, the influence of each of these factors can be estimated in isolation. Thus, hedonic regression can reveal the value of the factors or features that influence the price of a good even though those factors or features are not priced individually.

There are several statistical issues that frequently arise in hedonic regression analysis. One is that there may be an important explanatory variable that has been left out of the regression (referred to as an omitted variable). An omitted variable that affects the dependent variable and is correlated with the included variables causes bias in the estimated coefficients. In the above example relating to residential housing, leaving out house size as one of the independent variables would bias the coefficients of the other variables in the regression. This is because the regression model is effectively attributing the effect of house size to the variables included in the model.

Another statistical issue is that there may be multicollinearity in the independent variables. This means that two or more of the independent variables are correlated (i.e., there is a statistical relationship between them). Although multicollinearity does not reduce the predictive power of the model as a whole, it leads to results in which individual variable coefficients are not reliable. For instance, including two measures of house size as independent variables in the residential housing regression – number of bedrooms and square metres – may introduce multicollinearity because the two variables are correlated: as the number of bedrooms increases, so too does the number of square metres. If number of bedrooms and square metres are highly correlated, the regression model cannot reliably determine the individual effect of either variable independent of the other.
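
The perfect multicollinearity example given earlier in the chapter – height in inches and height in centimetres – can be demonstrated directly: a design matrix containing an intercept and both height measures has rank two rather than three, so no regression including both variables can be estimated:

```python
# Height in inches and in centimetres carry identical information, so a
# design matrix containing an intercept and both measures is rank-deficient
import numpy as np

inches = np.array([60.0, 65.0, 70.0, 72.0, 75.0])
cm = inches * 2.54                 # exact linear function of inches

X = np.column_stack([np.ones(5), inches, cm])
rank = np.linalg.matrix_rank(X)    # 2, not 3: one column is redundant
```

With merely high (rather than perfect) correlation the regression can be estimated, but the standard errors on the collinear coefficients become large, which is why the individual estimates are unreliable.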

Hedonic regression analysis and explanatory regression analysis more generally can be useful tools for evaluating damages.[12] Disputes often involve questions of whether X caused or resulted in Y and, if so, how much Y changed because of X. These are the types of questions that explanatory regression analysis can answer. For example, in a dispute involving a particular characteristic of a product or service, the value associated with that particular characteristic could be determined using hedonic regression analysis.

Consider the following simple illustration. Company A sues Company B for using Company A’s proprietary feature in its widget product and hires a damages expert to determine damages. There are three features of Company B’s widget product: Feature1, Feature2 and ImportantFeature, which is Company A’s proprietary feature. A hedonic regression analysis in which price is the dependent variable and the three features of Company B’s widget product are the independent variables generates the results shown in Table 4, below.

Table 4

[regression output not reproduced; the key statistics are discussed in the text]

The coefficient on ImportantFeature is a positive 0.16, with a standard error of 0.02 and a t-value of 6.59. The coefficient signifies how much the dependent variable – price – changes given a one-unit change in ImportantFeature, holding the other variables constant. The results show that a one-unit increase in ImportantFeature is associated with a $0.16 increase in price. The p-value (0.001) demonstrates that the coefficient is statistically significant at even the 1 per cent level (i.e., there is a less than 1 per cent probability that a result this large would arise by chance if the feature had no effect on price).
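
As a sketch of the mechanics only (the feature data and coefficients below are invented, not those underlying Table 4), a hedonic regression can be illustrated by constructing prices from known per-unit feature values and confirming that ordinary least squares recovers them:

```python
# Invented data: prices built from known per-unit feature values, which a
# hedonic (OLS) regression then recovers
import numpy as np

features = np.array([              # Feature1, Feature2, ImportantFeature
    [1.0, 2.0, 1.0],
    [2.0, 1.0, 3.0],
    [3.0, 3.0, 2.0],
    [4.0, 2.0, 5.0],
    [5.0, 4.0, 4.0],
    [6.0, 5.0, 6.0],
])
# Price = 2.00 base + 0.05*Feature1 + 0.03*Feature2 + 0.16*ImportantFeature
prices = 2.0 + features @ np.array([0.05, 0.03, 0.16])

X = np.column_stack([np.ones(len(prices)), features])
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
base, f1_value, f2_value, important_value = coef
```

In a real dispute the prices would not follow the model exactly, so the estimated value of the proprietary feature would be accompanied by a standard error and p-value, as in Table 4.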

Although this is a highly simplified illustration, it shows how regression techniques such as hedonic regression can be used to answer complex damages questions.

Conclusion

This chapter introduces econometric and statistical principles and methods and discusses how they can be used in the context of disputes. Whether and which type of method to use depends on the question being asked and the data available. Regression methods can be particularly powerful tools for determining damages in a wide variety of disputes.


Notes

[1] Jennifer Vanderhart and Steven Schwartz are managing directors, and Richard Brady and Aminta Raffalovich are vice presidents at Intensity, LLC.

[2] Other commonly used names for dependent variable include response variable and outcome variable.

[3] Other commonly used names for independent variable include predictor and explanatory variable.

[4] The use of regression analysis, and statistical analysis more generally, for these purposes is referred to as econometric analysis.

[5] In time series regression, historical values of the variable of interest (e.g., sales from the previous period) are often used as the independent variable, or variables.

[6] When using the ordinary least squares (OLS) technique, the ‘fit’ is measured as the sum of squared residuals, where each residual value is defined by the difference between the value of the dependent variable for that observation and the regression model’s predicted value for the dependent variable. OLS sets the coefficients in the regression function such that the sum of squared residuals is minimised.

[7] James H Stock and Mark W Watson (2011), Introduction to Econometrics (3rd ed., Boston, Massachusetts: Addison-Wesley), at 193–95.

[8] William H Greene (2012), Econometric Analysis (7th ed., Boston, MA: Pearson Education, Inc.), at 139–40.

[9] Note that the terminology is occasionally used in reference to the ‘standard error of the regression’, which, like R-squared, provides a measure of how well the model fits the data.

[10] Adjusted R-squared accounts for the degrees of freedom. Including additional independent variables cannot reduce R-squared, but if their inclusion does not reduce the residual variance, then it can reduce the adjusted R-squared statistic. Adjusted R-squared is typically a better measure of fit than R-squared.

[11] There are other statistical concepts that will typically have to be addressed when performing a regression analysis, but these are beyond the scope of this chapter.

[12] They are also useful tools for evaluating liability in some circumstances. For example, in a dispute alleging sex discrimination in salaries, an explanatory regression analysis could be used to evaluate whether sex explains differences in salaries, all else being equal.
