Start with a simple model

As a field, we seem to be gravitating toward a modus operandi where model complexity equates to making a theoretical contribution. The notion seems to be that the more variables we have, the more hypotheses we test in a single paper, and the more mediators and moderators we include, the greater the ‘contribution’ of the paper. Models end up with causal paths pointing in each cardinal direction, all under the assumption that including ‘more stuff’ equates to richer understanding.

I think this is a bad trend, for three reasons.

1) The likelihood of model misspecification—The more variables, hypotheses, and complexity in the model, the more likely it is that the entire model will be not just wrong, but really, really bad;

2) Perverse incentives—Tying a theoretical contribution to model complexity invites HARKing and p-hacking; and

3) Barriers to replication—The more a model depends on a particular set of variables from a particular dataset, each with its own complex construction, the harder it is for another researcher to replicate the study.

So what’s the answer? Start with a simple model…

  • A single x, predicting a single y;
  • Minimal measurement error for both variables;
  • A large, appropriately powered sample;
  • Appropriate steps to eliminate alternate explanations for the relationship between x and y (ideally by manipulating x); and
  • A reproducible codebook to ensure others can follow along with what you did.

Seriously, that’s it. Now, it’s actually really hard to do steps 2, 3, and 4. These are, however, critical to yielding an unbiased estimate of the effect of x on y. Noisy measures in noisy data with small true effect sizes are far more likely to yield unpredictable (and usually inflated) results. A well-developed measure, with measurement error kept to a minimum, needs a large dataset to tease out meaningful insights. Too often we see large datasets paired with measures of such convoluted construction that it is hard to understand just what the researcher did to build the measure—let alone have confidence that the observed effect is not simply an artifact of the measurement model—which makes the contribution trivial at best. By the same token, well-done measurement models tested in small, noisy samples create a similar interpretational problem; it’s too difficult to separate the signal from the noise.
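To put a rough number on ‘large, appropriately powered’, here is a minimal sketch in Python; the assumed standardized effect (d = 0.2), alpha, and power target are illustrative choices, not claims about any particular literature.

```python
# Back-of-the-envelope sample size for a simple x -> y comparison.
# Assumptions (illustrative only): small standardized effect d = 0.2,
# two-tailed alpha = .05, target power = .80.
from scipy.stats import norm

d = 0.20                            # assumed true standardized mean difference
alpha = 0.05                        # two-tailed significance level
power = 0.80                        # desired probability of detecting the effect

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~0.84

# Normal-approximation formula for a two-group comparison
n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2
print(f"~{n_per_group:.0f} observations per group")   # roughly 393 per group
```

The exact figure shifts with the design and the estimator, but the order of magnitude is the point: small effects measured with any noise at all need samples in the hundreds per group, not the dozens.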

Step 4—dealing with endogeneity—is a topic near and dear to my heart. Here’s my specific problem…it’s so challenging to isolate a consistent effect size estimate for ONE focal relationship. The more hypotheses and variables you add to the model, assuming the researcher tests it simultaneously, the harder it becomes to recover consistent effect sizes; you are far more likely to screw up the entire model.

Of course, sharing your code, and ideally your data, is pretty easy. But it’s just not something commonly done in management and entrepreneurship research. I hope that is changing, and for me, all of my papers now include posted codebooks and data. There is just no good reason not to.

I think one solution is for journals to encourage more single hypothesis papers. Take an interesting question—say estimating the probability that a failed entrepreneur will start another new venture—and evaluate that question with 2-3 independent studies, with consistent measures, in large representative samples, and ideally with the same instruments used to address the endogeneity problem. As an incentive, journals could offer expedited review of these studies, assuming that the researcher shared his or her data and code.

The bottom line is that headline-grabbing effect sizes with sexy variables in complicated models are, over the long run, far more likely to be found wanting than vindicated as possibly right. Science progresses with small, incremental contributions to our knowledge base. Start with a simple model, test it rigorously, and better our management science.

Credibility in strategic management research

Don Bergh1 and colleagues published a great note in Strategic Organization recently on the question of reproducibility of results in strategy research. I agree with virtually everything in the paper, but this passage on page 8 caught my attention…

Overall, based on our sample of 88 SMJ articles, the strategic management literature appears vulnerable to credibility problems for two main reasons. One, the majority of the articles did not report their data sufficiently to permit reproduction, leaving us in the dark with regards to the accuracy of their reported results. Two, among those articles where reproduction analyses were possible, a significant number of discrepancies existed between reported and reproduced significance levels.

I’ve written about this before—what limits our impact on management practice is a lack of rigor, and not an excess of it. Here is another example of the problem. When a second scholar is not able to reproduce the results of a study, using the same data (correlation matrix) and same estimator, that’s a significant concern. We simply cannot say with confidence, especially given threats to causal inference, that a single reported study has the strength of effect reported if data, code, and other related disclosures about research design and methodology are absent. Rigor and transparency, to me, will be the keys to unlocking the potential impact on management practice from strategy and entrepreneurship research.

On a related note, it’s nice in this paper that the authors drew the distinction between reproducibility and replication, which sometimes gets confused. A reproduction of a study is the ability to generate the same results from a secondary analysis as reported in the original study, using the same data. A replication is the ability to draw similar nomological conclusions—generally with overlapping confidence intervals of the estimates—from a study using the same research design and methodology but a different random sample.

Both reproducibility and replication are critical to building confidence and credibility in scientific findings. To me though, reproducibility is a necessary, but not sufficient condition for credibility. The easiest way to ensure reproducibility is to share data and to share code, and to do this early in the review process. For example, the Open Science Framework allows authors to make use of an anonymized data and file repository, allowing reviewers to check data and code without violating blind review.

While yes, many estimators (OLS, ML, covariance-based SEM) allow you to reproduce results based on a correlation/covariance matrix, as reported in the paper, this can be a tall order, what with the garden of forking paths problem. More problematic for strategy research is the use of panel/multilevel data, which was an area the authors didn’t touch on. In this case, a multilevel study’s reported correlation matrix would pool the lower- and higher-order variance together, effectively eliminating the panel structure. You could reproduce a naive, pooled model from the published correlation matrix, but not the multilevel model, which demonstrably limits its usefulness. This is a major reason why I’m in favor of dropping the standard convention of reporting a correlation matrix and instead requiring data and code.
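For a single-level OLS model, the reproduction really is mechanical—the standardized coefficients fall straight out of the published correlation matrix. Here is a minimal sketch in Python; the three-variable matrix is made up and simply stands in for a paper’s reported correlation table.

```python
# Reproducing standardized OLS estimates from a published correlation matrix.
# The matrix below is hypothetical; in practice you would key in the paper's table.
import numpy as np

# Correlations among [x1, x2, y]
R = np.array([
    [1.00, 0.30, 0.40],
    [0.30, 1.00, 0.25],
    [0.40, 0.25, 1.00],
])

R_xx = R[:2, :2]   # predictor intercorrelations
r_xy = R[:2, 2]    # predictor-outcome correlations

# Standardized betas solve R_xx @ beta = r_xy
beta_std = np.linalg.solve(R_xx, r_xy)
r_squared = r_xy @ beta_std

print("standardized betas:", beta_std.round(3))
print("R-squared:", round(float(r_squared), 3))
```

This is exactly the trick that breaks down for panel and multilevel models: once the within- and between-unit variance has been pooled into a single correlation matrix, there is nothing left from which to rebuild the multilevel estimates.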

Regardless though, lack of reproducibility is a significant problem in strategy, as in other disciplines. We’ve got a lot more work to do to build confidence in our results, and to have the impact on management practice that we could.

1. In the interest of full disclosure, Dr. Bergh was a mentor of mine at the University of Denver—I was a big fan of him then, and I still am 🙂

Bad statistics and theoretical looseness

I’m actually a big fan of theory—I’m just not wild about the ways in which we (management and entrepreneurship scholars) test it. The driving reason is theoretical looseness: the ability to offer any number of theoretical explanations for a phenomenon of interest.

What concerns me most with theoretical looseness is that researchers often become blind to questioning results that don’t align with the preponderance of evidence in the literature. The race for publication, combined with the ability to offer a logically consistent—even if contradictory to most published research—explanation, makes it all too easy to slip studies with flimsy results into the conversation.

In EO research, we see this often with studies purporting to find a null, or even a negative, effect of entrepreneurial behavior on firm growth. Is it possible? Sure. A good Bayesian will always allow for a non-zero prior, however small it might be. But is it logical? Well, therein lies the problem. Because our theories are generally broad, or because we can pull from a plethora of possible theoretical explanations that rarely provide specific estimates of causal effects and magnitudes, it is easy to take a contradictory result and offer an argument for why being entrepreneurial would decrease a firm’s growth.

The problem is, researchers often don’t take the extra steps to evaluate the efficacy of the models they estimated. Even basic checks of distributional assumptions and outliers are forgone in the race to write up the results and send them out for review. As estimators have become easier to use thanks to point-and-click software and macros, it’s even easier for researchers to throw data into the black box, get three asterisks, and then find some theoretical rationale to explain seemingly inconsistent results. It’s just too easy for bad statistics but easy theorizing to get published.
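None of these checks is exotic. As a sketch in Python with statsmodels—the data here are simulated and stand in for whatever the study actually used—a handful of lines after the headline regression surfaces the basics before the write-up starts:

```python
# Minimal post-estimation checks that often get skipped in the race to write up results.
# Data are simulated purely for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# 1) Influential observations: Cook's distance flags points that dominate the fit
cooks_d, _ = model.get_influence().cooks_distance
print(f"max Cook's distance: {cooks_d.max():.3f}")

# 2) Residual normality: Jarque-Bera test on the residuals
jb_stat, jb_p, skew, kurt = jarque_bera(model.resid)
print(f"Jarque-Bera p-value: {jb_p:.3f}")

# 3) Heteroskedasticity: Breusch-Pagan test against the model's regressors
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p-value: {lm_p:.3f}")
```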

The answer, as others have noted, is to slow the process down. Here I think pre-prints are particularly valuable, and one reason why I’ll be starting to use them myself. Ideas and results need time to percolate—to be looked at and to be challenged by the community. Once a paper is published it is simply too hard to ‘correct’ the record from one-off studies that, tragically, can become influential simply because they are ‘interesting’. In short, take the time to get it right, and avoid the temptation to pull a theoretical rabbit out of the hat when the results don’t align with the majority of the conversation.

My five guidelines for evaluating a study

I didn’t fully understand the added workload that comes with being a field (associate) editor. Don’t get me wrong, I’m loving the position, but in writing decision letters, I find that I’m often making similar observations regarding a study’s reported empirics, so I thought I’d crystallize my five primary guidelines for evaluating a study. These are in no particular order, they are equally weighted, and this list isn’t exhaustive of other important methodological considerations.

Oh, one important point. Nowhere on this list does ‘make a theoretical contribution’ appear. That’s on purpose, and the logic is simple. You can’t make a contribution with faulty science. Get the empirics tight, and then we can move on to framing and argumentation. John Antonakis summed it up best—“Research that is not rigorous simply cannot be relevant.”

1) How big was the sample?

Statistical power is certainly an issue, and small samples may simply be underpowered to detect small effects. That’s not really my concern though. The vast majority of published studies have statistically significant results, so failing to detect an effect isn’t the problem. The problem, as described much better by others, is that small effects in small samples are often more likely to appear as a function of noise than to reflect a true effect. What’s worse, when such a ‘false’ signal does clear the significance threshold, the estimated effect is likely to be badly inflated. It’s always problematic to have studies with a low probability of replication, but in this case, it’s particularly bad because such studies can make it seem that the estimated effect is quite large when, in reality, if the effect exists at all it’s likely to be small. These studies just add noise to an already noisy literature.
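A quick simulation makes the point; the true correlation and the sample sizes below are assumptions chosen only for illustration. In small samples, the estimates that happen to clear p < .05 exaggerate a small true effect several-fold; in large samples, they do not.

```python
# Statistically significant results from underpowered studies overstate small true effects.
# The true correlation and sample sizes are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_r = 0.10                      # assumed small true effect
n_sims = 5000

for n in (30, 1000):
    sig_estimates = []
    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
        r, p = stats.pearsonr(x, y)
        if p < 0.05:               # keep only the 'publishable' results
            sig_estimates.append(abs(r))
    print(f"n={n}: mean |r| among significant results = {np.mean(sig_estimates):.2f} "
          f"(true r = {true_r})")
```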

2) How noisy are the measures?

Speaking of noise, measurement error is a particularly pernicious beast. In entrepreneurship and management research in general, we use a lot of proxy variables and latent constructs to capture phenomena of interest. There is nothing wrong with this, so long as there is an adequate discussion of construct (proxy) validity. What I’m concerned about specifically is measurement error and its impact on structural parameters. Measurement error is effectively omitted variable bias (endogeneity), which renders parameter estimates inconsistent—no matter how big you make the sample, the estimates will always be wrong. This is particularly concerning in mediation and moderation models. Mediation generally assumes the mediator is error free, and in moderation, the reliability of the interaction term (xm) is roughly the product of the lower-order reliabilities—it’s always going to be lower than the constituent terms. So if the measures were noisy to begin with, the moderator will be even worse.
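A small simulation shows why more data cannot fix this; the reliability and effect values are assumed for illustration. With a reliability of .70 on x, the slope estimate converges to about 70% of the true value no matter how large the sample gets—and a product term built from two such measures only compounds the problem.

```python
# Measurement error biases slope estimates toward zero, and sample size does not help.
# Reliability and the true slope are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000                      # deliberately huge: the issue is bias, not noise
true_beta = 0.50
reliability_x = 0.70               # share of observed variance that is true score

x_true = rng.normal(size=n)
error_sd = np.sqrt((1 - reliability_x) / reliability_x)
x_obs = x_true + rng.normal(scale=error_sd, size=n)   # what we actually measure

y = true_beta * x_true + rng.normal(size=n)

# OLS slope of y on the *observed* (noisy) x
slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
print(f"true beta = {true_beta}, estimated beta = {slope:.3f}")  # ~0.35 = 0.50 * 0.70
```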

3) Was there a manipulation of x?

As Paul Holland noted, there is no causation without manipulation. Experimental designs, natural or otherwise, are not as common in entrepreneurship and strategy research as they need to be. Given that we deal largely with observational data, we can never make the claim that selection effects and other omitted variables are not materially influencing a given model. That means in any paper without a manipulation, the bar is high for the author(s) to demonstrate that endogeneity has been thoroughly addressed. 2SLS, regression discontinuity, blocking, and other related designs are all fine from my perspective assuming they are well done, but something must be there to show that the author(s) are recovering consistent parameter estimates.
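As a sketch of the logic in Python—the data are simulated, and the instrument z is assumed to be valid, which is of course the hard part in practice—naive OLS absorbs the confounding, while a hand-rolled two-stage least squares recovers the true effect:

```python
# Two-stage least squares by hand: x is endogenous, z is an (assumed valid) instrument.
# Simulated data for illustration only; the naive second-stage standard errors are wrong
# and a real analysis would use the proper 2SLS variance formula or a dedicated IV routine.
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
true_beta = 0.50

u = rng.normal(size=n)                  # unobserved confounder
z = rng.normal(size=n)                  # instrument: shifts x, has no direct path to y
x = 0.8 * z + u + rng.normal(size=n)    # x is endogenous (driven partly by u)
y = true_beta * x + u + rng.normal(size=n)

def ols_slope(pred, outcome):
    return np.cov(pred, outcome)[0, 1] / np.var(pred, ddof=1)

# Naive OLS: biased upward because u drives both x and y
print(f"OLS estimate:  {ols_slope(x, y):.3f}")

# Stage 1: project x onto z; Stage 2: regress y on the fitted values
x_hat = ols_slope(z, x) * z
print(f"2SLS estimate: {ols_slope(x_hat, y):.3f}  (true beta = {true_beta})")
```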

4) Does the researcher have skin in the game (confirmation bias)?

To be clear, I’m not talking necessarily about the influence of grant/funding providers, etc. I’m talking about confirmation bias—the extent to which a researcher seeks out information that conforms to his/her world view. In my own area of strategic entrepreneurship, there is a strong consensus (world view) that firms that are entrepreneurial outperform conservatively managed firms. There’s quite a bit of evidence to support that claim, but it also makes it less likely that someone with that world view is willing to accept evidence that entrepreneurial firms don’t outperform non-entrepreneurial firms. I’ve got skin in the pro-entrepreneurship game, so I’m subject to confirmation bias. The bigger problem as I see it is that researchers with a strong normative bias towards a given conclusion are less likely to critically analyze their results, and, more concerning, may be more likely to utilize researcher degrees of freedom to ensure a p < .05 result. To be clear, it’s not about accusing an author of having confirmation bias; rather, it’s a conditional probability—the probability of engaging in researcher degrees of freedom is higher for a given researcher with skin in the game on the study’s topic.

5) What is the margin of error for the prediction?

As a field, we don’t pay close enough attention to standard errors. I’m not in the camp that we need to show that a given effect is practically significant; I think that standard actually encourages researcher degrees of freedom. The better standard is to just be honest about a consistently estimated effect size, which in entrepreneurship and management is likely to be small. So, better to be accurate than to be practically important. That said, particularly with small samples and noisy measures, the resulting confidence intervals become even more important. We generally just dichotomize hypotheses in entrepreneurship—there is a positive effect of x on y—but effect sizes and standard errors matter a lot for replications and meta-analyses. So the margin of error around the estimate is particularly important for me—the bigger the range, the less useful the prediction for science.
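A concrete illustration—the observed correlation and the sample sizes are made up for the example: the same r = .10 carries a very different margin of error at n = 100 than at n = 2,000.

```python
# 95% confidence interval for a correlation via the Fisher z transformation.
# The observed r and the sample sizes are hypothetical.
import numpy as np
from scipy.stats import norm

r = 0.10
z_crit = norm.ppf(0.975)

for n in (100, 2000):
    z = np.arctanh(r)                  # Fisher z transform of the correlation
    se = 1 / np.sqrt(n - 3)            # approximate standard error in z units
    lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
    print(f"n={n:>4}: r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

At n = 100 the interval runs from roughly −.10 to .29—consistent with a negative effect, no effect, or a moderate positive one—while at n = 2,000 it tightens to roughly .06 to .14.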

One of the things I’ve found from my own methodological awakening is how helpful these criteria are for evaluating my own research. It’s a check, if you will, on my own biases and excitement over promising early stage results, and I’m going to be using these more in my decision letters and reviews.

Reviewer certification

I’m a fan of John Ioannidis and his work. He’s done a lot to raise attention to the use and abuse of frequentist statistics in, well, lots of the sciences.

In a recent article, he made the argument that “Teams which publish scientific literature need a ‘licence to analyse’ and this licence should be kept active through continuing methodological education.” I threw the question out to my PhD students, and there was a mixed reaction. One argument was the pace at which statistical theory advances and the difficulty applied researchers have in keeping up. I’m sympathetic to that perspective, because I still feel like I’m playing catch up with my own methods knowledge (still have a long way to go!).

John also made this observation, though: “Journals also lack the expertise required to understand and properly review statistics. Until a couple of years ago, even Nature and Science did not have trained statisticians routinely reviewing papers, which continues to be true of most biomedical and biology journals.” That got me thinking, in my other role as a Field Editor for JBV, about the importance of reviewer education in statistical inference.

I’m thinking that flipping the requirement around might be a better way to go. What about a required online training class for members of a journal editorial board? For ad hoc reviewers, this course might be optional, but highly encouraged. The course wouldn’t take long, but would highlight the correct interpretation of the p-value, the importance of standard errors for sound inference, and so on. It would also dispel some popular myths that still circulate in the management literature, with the intention of improving the data science capabilities of the field; to publish, you need to ensure you are meeting the standard that reviewers and editors are trained to look for.
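As one example of the kind of myth such a course could target with a ten-line demonstration—this is a sketch in Python on simulated null data—a p-value is not the probability that the null is true; under a true null, p-values are uniformly distributed, and about 5% of tests come up ‘significant’ by construction.

```python
# Under a true null hypothesis, p-values are uniformly distributed:
# about 5% fall below .05 regardless of sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_tests, n = 10_000, 200
p_values = []

for _ in range(n_tests):
    x = rng.normal(size=n)
    y = rng.normal(size=n)             # y is unrelated to x: the null is true
    _, p = stats.pearsonr(x, y)
    p_values.append(p)

p_values = np.array(p_values)
print(f"share of p < .05 under the null: {np.mean(p_values < 0.05):.3f}")  # ~0.05
```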

I’m thinking this would be a net value add, and not difficult to implement…