Start with a simple model

As a field, we seem to be gravitating toward a modus operandi in which model complexity equates to theoretical contribution. The notion seems to be that the more variables we have, the more hypotheses we test in a single paper, and the more mediators and moderators we include, the greater the ‘contribution’ of the paper. Models end up with causal paths pointing in every cardinal direction, all under the assumption that including ‘more stuff’ equates to a richer understanding.

I think this is a bad trend, for three reasons.

1). The likelihood of model misspecification—The more variables, hypotheses, and complexity in the model, the more likely it is that the entire model is not just wrong, but really, really wrong;

2). Perverse incentives—Tying theoretical contribution to model complexity invites HARKing and p-hacking; and

3). Barriers to replication—The more a model depends on a particular set of variables from a particular dataset, each with its own complex construction, the harder it is for another researcher to replicate the study.

So what’s the answer? Start with a simple model…

  1. A single x predicting a single y;
  2. Minimal measurement error for both variables;
  3. A large, appropriately powered sample;
  4. Appropriate steps to eliminate alternative explanations for the relationship between x and y (ideally by manipulating x); and
  5. A reproducible codebook to ensure others can follow along with what you did.

Seriously, that’s it. Now, it’s actually really hard to do steps 2, 3, and 4. These steps are, however, critical to yield an unbiased estimate of the effect of x on y. Noisy measures in noisy data with small true effect sizes are far more likely to yield unpredictable (and usually inflated) results. A well-developed measure, with measurement error kept to a minimum, needs a large dataset to tease out meaningful insights. Too often we see large datasets paired with measures of such convoluted construction that it is difficult to understand just what the researcher did to build the measure—let alone to have confidence that the observed effect is not simply an artifact of the measurement model—which makes the contribution trivial at best. By the same token, well-done measurement models tested in small, noisy samples create a similar interpretational problem; it’s too difficult to separate the signal from the noise.
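To make the recipe concrete, here is a minimal sketch in Python (numpy and statsmodels) of the ideal case using simulated data; the sample size, effect size, and noise level are made up for illustration.

```python
# Minimal sketch: one randomly assigned x, one y, low noise, a large sample.
# All numbers are illustrative, not from any real study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2_000                                   # large, appropriately powered sample

x = rng.integers(0, 2, size=n)              # x is randomly assigned (manipulated)
true_effect = 0.30
y = 1.0 + true_effect * x + rng.normal(scale=1.0, size=n)   # modest noise in y

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params[1])      # estimate of the effect of x on y
print(model.conf_int()[1])  # its confidence interval
```

Because x is randomly assigned and measured without error, the estimate is unbiased and the confidence interval should cover the true effect at the nominal rate.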

Step 4—dealing with endogeneity—is a topic near and dear to my heart. Here’s my specific problem…it’s challenging enough to isolate a consistent effect size estimate for ONE focal relationship. With every hypothesis and variable added to the model, assuming the researcher tests them simultaneously, the difficulty of recovering consistent effect sizes increases exponentially; you are far more likely to screw up the entire model.
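As one concrete illustration of how piling variables into a model can distort the focal estimate, the toy simulation below (all names and numbers invented) controls for a mediator of x’s effect; the ‘richer’ model no longer recovers the total effect of x on y, even though x is randomly assigned.

```python
# Toy example: controlling for a mediator (m) changes what the coefficient on x
# means, and can badly mislead if it is still read as the total effect of x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 10_000
x = rng.integers(0, 2, size=n)                    # randomly assigned treatment
m = 0.8 * x + rng.normal(size=n)                  # mediator caused by x
y = 0.5 * x + 0.7 * m + rng.normal(size=n)        # total effect of x = 0.5 + 0.8*0.7 = 1.06

simple = sm.OLS(y, sm.add_constant(x)).fit()
busy = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit()
print(f"simple model:   {simple.params[1]:.2f}   (total effect, about 1.06)")
print(f"'richer' model: {busy.params[1]:.2f}   (direct effect only, about 0.50)")
```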

Of course, sharing your code, and ideally your data, is pretty easy. But it’s just not something commonly done in management and entrepreneurship research. I hope that is changing; for my part, all of my papers now include posted codebooks and data. There is just no good reason not to.

I think one solution is for journals to encourage more single hypothesis papers. Take an interesting question—say estimating the probability that a failed entrepreneur will start another new venture—and evaluate that question with 2-3 independent studies, with consistent measures, in large representative samples, and ideally with the same instruments used to address the endogeneity problem. As an incentive, journals could offer expedited review of these studies, assuming that the researcher shared his or her data and code.

The bottom line is that headline-grabbing effect sizes with sexy variables in complicated models are, over the long run, far more likely to be found wanting than vindicated as possibly right. Science progresses with small, incremental contributions to our knowledge base. Start with a simple model, test it rigorously, and better our management science.

Selection models and weak instruments

As an editor and reviewer, I’m seeing more selection models (e.g., Heckman) these days that suffer from weak exclusion restrictions (i.e., weak instruments). Weak instruments are a problem in any method dealing with endogeneity in which an instrumental variable stands in as a proxy for random assignment. Heckman selection models share this weak-instrument problem, and it has to do with the exclusion restriction (Bushway et al., 2007). Researchers employ a Heckman selection model to address omitted variable bias stemming from a specific sample selection problem. In the classic example, a model predicting the relationship between wages and education would only include in the sample those individuals who chose to work. This self-selection into the sample, for reasons unknown to the researcher, creates a type of omitted variable problem manifesting as endogeneity (Certo et al., 2016).

In the Heckman correction, the researcher estimates a first-stage probit model predicting the likelihood of the entity selecting into the sample. For example, in a study of the effect of corporate venturing on firm performance, the first stage equation would be a probit model^1 predicting the probability of the firm engaging in corporate venturing activity, in the following form:

Pr(y = 1 | Z) = Φ(Zβ)

Here we estimate the probability of y occurring given a set of observed predictors, Z, with coefficients β, where Φ is the cumulative distribution function of the standard normal distribution. Heckman’s insight was to recognize that a transformation of the predicted values from the first stage represents the selection hazard of appearing in the sample (Clougherty et al., 2016). Including this transformation of the first-stage predicted values, often called the inverse Mills ratio, as a regressor in a second-stage, ordinary least squares estimate of the focal model of interest yields an estimate of the selection hazard, typically denoted by lambda. In our example, lambda would represent the selection hazard of the firm engaging in corporate venturing activity. Evaluating the statistical significance of lambda serves as a test for the presence of a meaningful selection effect in the second stage model.
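For readers who want to see the mechanics, here is a minimal sketch of the two-step logic in Python (statsmodels and scipy) with simulated data; the variable names, sample size, and error correlation are all assumptions made for illustration, and the second-stage standard errors printed here are not corrected (more on that below).

```python
# Minimal Heckman two-step sketch on simulated data (illustrative values only).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 5_000

x = rng.normal(size=n)                                   # predictor in both stages
z_excl = rng.normal(size=n)                              # exclusion variable (first stage only)
errs = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)
u_sel, u_out = errs[:, 0], errs[:, 1]                    # correlated errors create selection bias

select = (0.5 * x + 1.0 * z_excl + u_sel > 0).astype(int)  # e.g., firm engages in corporate venturing
y = 1.0 + 2.0 * x + u_out                                  # outcome used only where select == 1

# Stage 1: probit for selection
Z = sm.add_constant(np.column_stack([x, z_excl]))
probit = sm.Probit(select, Z).fit(disp=False)
xb = Z @ probit.params
imr = norm.pdf(xb) / norm.cdf(xb)                        # inverse Mills ratio (selection hazard)

# Stage 2: OLS on the selected subsample, adding the inverse Mills ratio;
# its coefficient is lambda. These standard errors are NOT corrected.
mask = select == 1
X2 = sm.add_constant(np.column_stack([x[mask], imr[mask]]))
print(sm.OLS(y[mask], X2).fit().params)                  # [intercept, effect of x, lambda]
```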

Drawing the correct inference about selection effects in the second stage model, though, depends on two critical factors. The first is that while including the inverse Mills ratio in the second stage equation yields a consistent estimate of the effect of x—assuming all other assumptions are met—it also yields inconsistent standard errors for every estimated parameter (Clougherty et al., 2016). There are several methods to correct the standard errors in the second stage, including manual matrix manipulation, but most selection estimators (e.g., sampleSelection in R and heckman in Stata) make this correction automatically. The concern is whether the researcher used one of these estimators, or simply calculated the inverse Mills ratio by hand, included the value as another regressor in a second model, and never corrected the standard errors.

The second factor is high collinearity between the inverse Mills ratio and the other predictors in the second stage equation. When the first and second stage equations share the same vector of predictors, the transformed predicted value from the first stage correlates strongly with the predictors in the second stage. As in any multiple regression model, high collinearity inflates standard errors and destabilizes the estimates. The solution is generally to include one or more additional predictors in the first stage that are then excluded from the second stage. Akin to instrumental variables, these predictors should influence selection into the sample (the first stage) but have no relationship to the disturbance term in the second stage (Certo et al., 2016). Failing to include these exclusion restriction variables, using weak exclusion variables, or using exclusion variables that are themselves endogenous will yield inconsistent estimates in the second stage equation.
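As a rough, self-contained diagnostic of this point, the sketch below (simulated data, hypothetical names) compares how collinear the inverse Mills ratio is with the second-stage predictor when the first stage does and does not include an exclusion variable.

```python
# Collinearity check: variance inflation factor (VIF) of the inverse Mills ratio
# in the second stage, with and without an exclusion variable in the first stage.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
z_excl = rng.normal(size=n)                      # affects selection only
select = (0.5 * x + 1.0 * z_excl + rng.normal(size=n) > 0).astype(int)

def inverse_mills(outcome, design):
    """First-stage probit; return the inverse Mills ratio."""
    fit = sm.Probit(outcome, design).fit(disp=False)
    xb = design @ fit.params
    return norm.pdf(xb) / norm.cdf(xb)

imr_with = inverse_mills(select, sm.add_constant(np.column_stack([x, z_excl])))
imr_without = inverse_mills(select, sm.add_constant(x))   # no exclusion variable

mask = select == 1
for label, imr in [("with exclusion variable", imr_with),
                   ("without exclusion variable", imr_without)]:
    X2 = sm.add_constant(np.column_stack([x[mask], imr[mask]]))
    print(label, "VIF of the inverse Mills ratio:",
          round(variance_inflation_factor(X2, 2), 1))
```

In simulations like this one, the ratio computed without an exclusion variable is a deterministic function of x alone, so its VIF is typically far larger than when the first stage contains its own exclusion variable.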

Given the difficulties inherent in properly specifying selection models, and given that the selection hazard parameter (lambda) deals only with endogeneity arising from sample selection, many scholars—myself included—recommend endogeneity correction approaches that deal with selection and other omitted variable concerns simultaneously (e.g., 2SLS, regression discontinuity, and so forth). The bottom line, though, is that just like any instrumental variable method, the quality of the second stage model is predicated on the quality of the first stage model.
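To illustrate this family of approaches, here is a minimal, manual two-stage least squares sketch on simulated data (all names and effect sizes are invented); in practice you would use a dedicated IV estimator (e.g., ivregress in Stata), not least because the naive second-stage standard errors below are, as with the manual Heckman two-step, not correct.

```python
# Manual 2SLS sketch: z instruments for an endogenous x (simulated, illustrative).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 5_000
z = rng.normal(size=n)                         # instrument: affects x, not y directly
u = rng.normal(size=n)                         # unobserved confounder
x = 0.8 * z + 0.5 * u + rng.normal(size=n)     # x is endogenous (depends on u)
y = 2.0 * x + 1.0 * u + rng.normal(size=n)     # u also drives y, so OLS is biased

# Stage 1: regress x on the instrument; Stage 2: regress y on the fitted values
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
beta_2sls = sm.OLS(y, sm.add_constant(x_hat)).fit().params[1]
beta_ols = sm.OLS(y, sm.add_constant(x)).fit().params[1]
print(f"naive OLS: {beta_ols:.2f}   2SLS: {beta_2sls:.2f}   (true effect = 2.0)")
```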


^1: While researchers often use logit and probit interchangeably, the Heckman method is a case where the researcher must use a probit model in the first stage equation. The reason lies in the distributional assumptions of the two models—the Heckman correction is derived under bivariate normality of the first- and second-stage error terms, an assumption consistent with the normally distributed errors of the probit model but not with the logistic errors of the logit model.


References:

Bushway S, Johnson BD, Slocum LA. 2007. Is the magic still there? The use of the Heckman two-step correction for selection bias in criminology. Journal of Quantitative Criminology 23(2): 151–178.

Certo ST, Busenbark JR, Woo H-S, Semadeni M. 2016. Sample selection bias and Heckman models in strategic management research. Strategic Management Journal 37(13): 2639–2657.

Clougherty JA, Duso T, Muck J. 2016. Correcting for self-selection based endogeneity in management research: Review, recommendations and simulations. Organizational Research Methods 19(2): 286–347.

Credibility in strategic management research

Don Bergh^1 and colleagues published a great note in Strategic Organization recently on the question of reproducibility of results in strategy research. I agree with virtually everything in the paper, but this passage on page 8 caught my attention…

Overall, based on our sample of 88 SMJ articles, the strategic management literature appears vulnerable to credibility problems for two main reasons. One, the majority of the articles did not report their data sufficiently to permit reproduction, leaving us in the dark with regards to the accuracy of their reported results. Two, among those articles where reproduction analyses were possible, a significant number of discrepancies existed between reported and reproduced significance levels.

I’ve written about this before—what limits our impact on management practice is a lack of rigor, and not an excess of it. Here is another example of the problem. When a second scholar is not able to reproduce the results of a study, using the same data (correlation matrix) and same estimator, that’s a significant concern. We simply cannot say with confidence, especially given threats to causal inference, that a single reported study has the strength of effect reported if data, code, and other related disclosures about research design and methodology are absent. Rigor and transparency, to me, will be the keys to unlocking the potential impact on management practice from strategy and entrepreneurship research.

On a related note, it’s nice that the authors drew the distinction between reproducibility and replication, two terms that are sometimes confused. A reproduction of a study is the ability to generate, in a secondary analysis of the same data, the same results reported in the original study. A replication is the ability to draw similar nomological conclusions—generally with overlapping confidence intervals of the estimates—from a study using the same research design and methodology on a different random sample.

Both reproducibility and replication are critical to building confidence and credibility in scientific findings. To me, though, reproducibility is a necessary but not sufficient condition for credibility. The easiest way to ensure reproducibility is to share data and to share code, and to do this early in the review process. For example, the Open Science Framework allows authors to make use of an anonymized data and file repository, allowing reviewers to check data and code without violating blind review.

While yes, many estimators (OLS, ML, covariance-based SEM) allow you to reproduce results from the correlation/covariance matrix reported in the paper, this can be a tall order, what with the garden of forking paths problem. More problematic for strategy research is the use of panel/multilevel data, an area the authors didn’t touch on. A multilevel study’s reported correlation matrix pools the lower- and higher-order variance together, effectively eliminating the panel structure. You could reproduce a naive, pooled model from the published correlation matrix, but not the multilevel model, which demonstrably limits its usefulness. This is a major reason why I’m in favor of dropping the standard convention of reporting a correlation matrix and instead requiring data and code.
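As a small illustration of what matrix-based reproduction can and cannot do, the sketch below recovers standardized OLS coefficients from a made-up correlation matrix; unstandardized estimates would also require the reported means and standard deviations, and a pooled matrix from panel data cannot recover the multilevel model.

```python
# Reproducing standardized OLS coefficients from a published correlation matrix.
# The matrix below is invented purely for illustration.
import numpy as np

# Rows/columns ordered as [x1, x2, y]
R = np.array([[1.00, 0.30, 0.40],
              [0.30, 1.00, 0.25],
              [0.40, 0.25, 1.00]])

R_xx = R[:2, :2]          # predictor intercorrelations
r_xy = R[:2, 2]           # predictor-outcome correlations

beta_std = np.linalg.solve(R_xx, r_xy)   # standardized regression coefficients
print(beta_std)
```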

Regardless though, lack of reproducibility is a significant problem in strategy, as in other disciplines. We’ve got a lot more work to do to build confidence in our results, and to have the impact on management practice that we could.

^1: In the interest of full disclosure, Dr. Bergh was a mentor of mine at the University of Denver—I was a big fan of him then, and I still am 🙂

Another take on p-values

Here’s an interesting take from a few days ago on the American Statistical Association’s statement, published last year, on the use—and misuse—of p-values. I’m certainly in the camp that p-values are more often than not misunderstood and misapplied in published studies, but the challenge I’ve found has been communicating the myriad assumptions made when employing the p < .05 standard and how shaky research that deviates from those assumptions can be.

The p-value is the probability of observing an effect as large as, or larger than, the one observed, assuming that the null hypothesis is true. Generally, in the null hypothesis testing framework, the assumption is that the true effect is zero. Not statistically different from zero, but actually zero. Therein lies one of the many problems with p-values—very rarely, if ever, would we expect absolutely zero effect in social science research. Our constructs are too noisy, and our theoretical explanations too loose, to reasonably expect an effect of zero.

So given that the null itself isn’t likely to be true, how do we reconcile the p < .05 standard? Well, the best option is to be a Bayesian, but if you have to retain a frequentist perspective, here is one explanation I use with my doctoral students.

The irony of p-values is that the more likely the null hypothesis is to be true, the less valuable the p-value becomes. You can think of it conceptually as a classic conditional probability, Pr(y|x): what’s the probability that a study reporting a statistically significant rejection of the null hypothesis is accurate, given the probability that the null hypothesis is true?

Now, to be clear, the p-value doesn’t—and can’t—say anything about the probability that the null hypothesis is true. What I’m talking about is the researcher using his or her own judgement, based on prior work, theory, and deductive reasoning, about just how likely it is that the null hypothesis is true in real life. For example, in EO research, the null hypothesis would be that entrepreneurial firms enjoy no performance advantage over conservatively managed firms. We could put the probability of the null hypothesis being true at about 10%—it’s difficult to imagine a meaningful context in which it doesn’t pay to be entrepreneurial, but it’s possible.

So how does that inform evaluating EO research reporting results at the p < .05 standard? Let’s imagine a study reports an effect of EO on firm growth at p = .01. Under a conventional interpretation, we would say that, assuming the null hypothesis is true, there was a 1 in 100 chance of observing an effect that large or larger. But real life and our judgement say that the null hypothesis has little chance of being true (say 10% in our example). In this case, the p-value actually works pretty well, although it’s not very valuable. The 1 in 100 chance of the results under the null is not that far off from our prior 1 in 10 chance that being entrepreneurial doesn’t help the firm grow. It’s not valuable in the sense that it’s told us something we already knew (or guessed) to be true in the real world.

But what about the case where a study reports p = .01, so still a 1 in 100 probability, but the likelihood of the null hypothesis itself being true is high, say 95%? In other words, the likelihood of the effect being real is very small. In this case the p-value can be downright dangerous. The best discussion of the exact probability breakdown is Sellke et al. (2001), and there is also a non-paywalled discussion of the same concept. Their bound implies that even when the null has only a 50–50 prior chance of being true, the probability that a rejection is mistaken is over 10% at p = .01 and almost 30% at p = .05; when the null is as probable as 95%, it is far higher still. In short, the more probable the null, the more likely it is that ‘statistically significant evidence’ in favor of its rejection is flawed.
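For readers who want to reproduce that arithmetic, here is a small sketch using the Sellke et al. (2001) lower bound on the Bayes factor in favor of the null (valid for p < 1/e); the prior probabilities are illustrative, matching the 10% and 95% scenarios above.

```python
# Lower bound on Pr(null | p) using the -e*p*ln(p) bound of Sellke et al. (2001).
import numpy as np

def prob_null_given_p(p, prior_null):
    """Lower bound on the posterior probability that the null is true."""
    bf_null = -np.e * p * np.log(p)          # minimum Bayes factor favoring the null
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bf_null
    return posterior_odds / (1 + posterior_odds)

for prior in (0.10, 0.50, 0.95):
    for p in (0.05, 0.01):
        print(f"prior Pr(null) = {prior:.2f}, p = {p:.2f}: "
              f"Pr(null | p) >= {prob_null_given_p(p, prior):.2f}")
```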

The bottom line is that there is no substitute for using your own judgement when evaluating a study. Ask yourself just how likely the null hypothesis is to be true, particularly when evaluating research purporting to offer ‘surprising’, ‘novel’, and ‘counterintuitive’ findings. You might find that the author’s statistically significant novel finding is itself likely to be random variation, or, as Andrew Gelman might say, the difference between significant and not significant is not itself statistically significant.

Rigor and relevance

This post challenges the assumption that for an academic paper to be relevant it must be interesting, and that for the paper to be interesting it needs only adequate empirics, as opposed to a rigorous research design and empirical treatment.

An easy critique of this assumption is to say that I’ve got a straw-man argument; to publish you need rigorous empirics AND a compelling story that makes a contribution. I don’t think that’s the case. I think as a field (management and entrepreneurship specifically), we are too willing to favor studies that are interesting over those that are less interesting, even when the less interesting paper has a stronger design and stronger empirics. The term interesting is, without question, subjectively determined by journal editors and reviewers—what is interesting to one scholar may or may not be interesting to another.

Generally we think of interesting in terms of making a theoretical contribution; the standard for publication at most of our top empirical journals is that a paper must offer a novel insight—or insights—to be publishable. The problem with this standard, as has been amply covered by others, is that it encourages, or forgives, researcher degrees of freedom that weaken statistical and causal inference in order to maximize the ‘interesting-ness’ factor. The ongoing debate over the replicability of power posing is a notable case in point.

My hypothesis is that the willingness to trade rigorous research design for ‘novel’ insights is the root cause of the very real gap between academic management research and management practice. The requirement to offer a novel insight encourages poor research behavior while minimizing the critical role that replicability plays in the trustworthiness of scientific research. In entrepreneurship research, we have also been late to embrace counterfactual reasoning and appropriate techniques for dealing with endogeneity, which weakens the causal inference of our research and hence its usefulness.

In short, managers are less likely to adopt practices born of academic research not because such findings are unapproachable—although, true, studies aren’t easy reads—but because most academic research simply isn’t trustworthy. I’m not suggesting most research is the result of academic misconduct, far from it. But I am suggesting that weak designs and poorly done analyses lower the trustworthiness of study results and their usefulness to practice.

To be clear, a well done study that maximizes causal inference AND is theoretically novel is, certainly, ideal. But the next most important consideration should be a rigorously designed and executed study on a simple main effect relationship that maximizes causal inference and understanding. It may not be interesting, but at least it will be accurate.

The best way to be relevant is to be trustworthy, and the best way to be trustworthy is to be rigorous. You can’t have external validity without first maximizing internal validity.