My five guidelines for evaluating a study

I didn’t fully understand the added workload that comes with being a field (associate) editor. Don’t get me wrong, I’m loving the position, but in writing decision letters I find that I’m often making similar observations about a study’s reported empirics, so I thought I’d crystallize my five primary guidelines for evaluating a study. These are in no particular order and equally weighted, and the list doesn’t exhaust the other methodological considerations that matter.

Oh, one important point. Nowhere on this list does ‘make a theoretical contribution’ appear. That’s on purpose, and the logic is simple. You can’t make a contribution with faulty science. Get the empirics tight, and then we can move on to framing and argumentation. John Antonakis summed it up best—“Research that is not rigorous simply cannot be relevant.”

1) How big was the sample?

Statistical power is certainly an issue, and small samples may simply be underpowered to detect small effects. That’s not really my concern, though. The vast majority of published studies report statistically significant results, so failing to detect an effect isn’t the problem. The problem, as others have described much better than I can, is that in small samples a statistically significant effect is often more likely to reflect noise than a true effect. What’s worse, when such a false signal does appear, its estimated magnitude tends to be inflated, so the study overstates how big the effect is. It’s always problematic to publish studies with a low probability of replication, but this case is particularly bad because such studies can make the effect look quite large when, if it exists at all, it’s likely to be small. These studies just add noise to an already noisy literature.
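To make that concrete, here is a minimal simulation sketch in R (purely hypothetical numbers, not drawn from any real study): a small true effect, a small sample, and a look at only the estimates that happen to clear p < .05.

set.seed(123)
true_b <- 0.1   # small true effect
n      <- 30    # small sample
sims   <- 5000

res <- replicate(sims, {
  x   <- rnorm(n)
  y   <- true_b * x + rnorm(n)
  fit <- summary(lm(y ~ x))$coefficients
  c(est = fit["x", "Estimate"], p = fit["x", "Pr(>|t|)"])
})

sig_est <- res["est", res["p", ] < .05]
length(sig_est) / sims   # few 'hits' (low power)
mean(abs(sig_est))       # but the hits are badly inflated relative to 0.1

In a setup like this, any estimate that reaches significance has to be several times larger than the true effect just to clear the bar, which is exactly the inflation problem above.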

2) How noisy are the measures?

Speaking of noise, measurement error is a particularly pernicious beast. In entrepreneurship, and in management research generally, we use a lot of proxy variables and latent constructs to capture the phenomena of interest. There is nothing wrong with this, so long as there is an adequate discussion of construct (proxy) validity. What concerns me specifically is measurement error and its impact on structural parameters. Measurement error is effectively omitted variable bias (endogeneity), which renders parameter estimates inconsistent—no matter how big you make the sample, the estimates will always be wrong. This is particularly concerning in mediation and moderation models. Mediation generally assumes the mediator is error free, and in moderation the reliability of the interaction term (xm) is roughly the product of the lower-order reliabilities—it’s always going to be lower than the constituent terms. So if the measures were noisy to begin with, the moderator will be even worse.
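A quick back-of-the-envelope sketch of that last point (the reliabilities here are hypothetical, and the simple product rule assumes x and m are independent and measured with independent errors):

rel_x  <- 0.80                # reliability of x
rel_m  <- 0.70                # reliability of the moderator m
rel_xm <- rel_x * rel_m       # approximate reliability of the product term xm
rel_xm                        # 0.56: two acceptable measures, one noisy interaction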

3) Was there a manipulation of x?

As Paul Holland noted, there is no causation without manipulation. Experimental designs, natural or otherwise, are not as common in entrepreneurship and strategy research as they need to be. Given that we deal largely with observational data, we can never make the claim that selection effects and other omitted variables are not materially influencing a given model. That means in any paper without a manipulation, the bar is high for the author(s) to demonstrate that endogeneity has been thoroughly addressed. 2SLS, regression discontinuity, blocking, and other related designs are all fine from my perspective assuming they are well done, but something must be there to show that the author(s) are recovering consistent parameter estimates.
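For what it’s worth, here is a minimal 2SLS sketch in R with simulated data (the variables and the instrument are entirely hypothetical, and it assumes the AER package is installed; ivreg() does the estimation):

library(AER)  # provides ivreg()

set.seed(42)
n <- 500
u <- rnorm(n)                  # unobserved confounder (the endogeneity problem)
z <- rnorm(n)                  # instrument: shifts x but is unrelated to u
x <- 0.7 * z + u + rnorm(n)    # x is endogenous because it depends on u
y <- 0.3 * x + u + rnorm(n)    # true effect of x on y is 0.3

coef(lm(y ~ x))["x"]           # OLS: biased upward by the shared u
coef(ivreg(y ~ x | z))["x"]    # 2SLS: close to the true 0.3

The point isn’t the particular estimator; it’s that the design has to give the reader a reason to believe the reported parameter is consistent.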

4) Does the researcher have skin in the game (confirmation bias)?

To be clear, I’m not necessarily talking about the influence of grant/funding providers, etc. I’m talking about confirmation bias—the extent to which a researcher seeks out information that conforms to his/her world view. In my own area of strategic entrepreneurship, there is a strong consensus (world view) that entrepreneurial firms outperform conservatively managed firms. There’s quite a bit of evidence to support that claim, but it also makes it less likely that someone with that world view is willing to accept evidence that entrepreneurial firms don’t outperform non-entrepreneurial firms. I’ve got skin in the pro-entrepreneurship game, so I’m subject to confirmation bias. The bigger problem, as I see it, is that researchers with a strong normative bias toward a given conclusion are less likely to critically analyze their results and, more concerning, may be more likely to exploit researcher degrees of freedom to ensure a p < .05 result. To be clear, it’s not about accusing an author of confirmation bias; rather, it’s a conditional probability—the probability of engaging in researcher degrees of freedom is higher for a researcher with skin in the game on the study’s topic.

5) What is the margin of error for the prediction?

As a field, we don’t pay close enough attention to standard errors. I’m not in the camp that says we need to show that a given effect is practically significant; I think that standard actually encourages researcher degrees of freedom. The better standard is to be honest about a consistently estimated effect size, which in entrepreneurship and management is likely to be small. Better to be accurate than to be practically important. That said, particularly with small samples and noisy measures, the resulting confidence intervals become even more important. We generally just dichotomize hypotheses in entrepreneurship—there is a positive effect of x on y—but effect sizes and standard errors matter a lot for replications and meta-analyses. So the margin of error around the estimate is particularly important for me—the bigger the range, the less useful the prediction for science.
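A tiny illustration of why the interval matters (simulated data, hypothetical effect size): the same small true effect estimated from a small and a large sample can give similar point estimates but wildly different margins of error.

set.seed(7)
ci_for_n <- function(n, b = 0.1) {
  x <- rnorm(n)
  y <- b * x + rnorm(n)
  round(confint(lm(y ~ x))["x", ], 2)  # 95% CI for the effect of x
}
ci_for_n(50)     # wide interval: the estimate tells us very little
ci_for_n(2000)   # tight interval around the small true effect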

One of the things I’ve found from my own methodological awakening is how helpful these criteria are for evaluating my own research. It’s a check, if you will, on my own biases and excitement over promising early stage results, and I’m going to be using these more in my decision letters and reviews.

Reviewer certification

I’m a fan of John Ioannidis and his work. He’s done a lot to draw attention to the use and abuse of frequentist statistics in, well, lots of the sciences.

In a recent article, he argued that “Teams which publish scientific literature need a ‘licence to analyse’ and this licence should be kept active through continuing methodological education.” I threw the question out to my PhD students, and the reaction was mixed. One argument against was the pace at which statistical theory advances and the difficulty applied researchers have keeping up. I’m sympathetic to that perspective, because I still feel like I’m playing catch-up with my own methods knowledge (still have a long way to go!).

John also made this observation, though: “Journals also lack the expertise required to understand and properly review statistics. Until a couple of years ago, even Nature and Science did not have trained statisticians routinely reviewing papers, which continues to be true of most biomedical and biology journals.” That got me thinking, in my other role as a Field Editor for JBV, about the importance of reviewer education in statistical inference.

I’m thinking that flipping the requirement around might be a better way to go. What about a required online training class for members of a journal editorial board? For ad hoc reviewers, the course might be optional, but highly encouraged. It wouldn’t take long, but it would cover the correct interpretation of the p-value, the importance of standard errors for drawing sound inferences, and so on. It would also dispel some popular myths that still circulate in the management literature, with the intention of improving the data science capabilities of the field: to publish, you need to meet the standard that reviewers and editors are trained to look for.

I’m thinking this would be a net value add, and not difficult to implement…

Making a theoretical contribution

I'm not a fan of requiring a novel theoretical contribution to publish in top management journals; arguably, this standard has contributed to the replication crisis in the social sciences. Nonetheless, that is the standard, so it's helpful to think through just what a theoretical contribution means in the era of the replication crisis.

Making a theoretical contribution is absolutely in the eye of the beholder. My working hypothesis is that whether an editor believes you have met this standard has a lot to do with the clarity of the writing. The better the writing and argumentation, the better the probability that the editor and reviewers will see the 'theoretical contribution' in the paper. This makes writing quality, and not the 'That's Interesting!' standard (which also likely contributed to the replication crisis in the first place), the single biggest predictor of paper acceptance.

So given that writing quality is the key to clearing the theoretical contribution bar, I would argue that the single best way to enhance writing quality in a management study is clarity of purpose. This is, quite simply, being clear about what you are trying to accomplish. If the purpose of the study is to offer a grand theory of the firm, great. Just. Say. So.

If you have more modest aims, just say so. To me, a theoretical contribution is something that improves our understanding of the nomological relationship between two or more phenomena. If that means your study is a replication, GREAT! Just say that's what you are doing. If your study asks an existing question but does it in a more rigorous way, even better!!! We need to revisit a number of past findings with new data and new—yes, that means better—methods. My point is that the key to making a theoretical contribution is to just be clear; to be intellectually honest about the purpose behind a study.

As a spillover benefit, I think clarity of purpose will also help address the HARKing problem. If a study is using data from an already published paper, if the code isn't made available, and if the original study wasn't pre-registered, well, the paper is probably a fishing expedition masquerading as a new source of broad managerial insight. If it's fishing, just call it that. But you'd better have a self-replication included in the new submission!

Going all in with R

As a doctoral student, I took a multivariate statistics class with SAS. I took a time series class that used R. I had a SEM course using LISREL. My methods guru used Stata, and my adviser SPSS (yes, there is a reason I didn’t link to SPSS). Needless to say, I had a diverse background in statistics software.

Over the years I used, and taught with, mostly Stata, while trying to keep up with R.

This year, I’m all in with R.

Why? Because point-and-click statistics only makes the researcher degrees of freedom problem worse. Others much more qualified than I am have spoken on this, but I think the general trend toward context-menu-based software has had a ‘dumbing-down’ effect on the quality of empirical analyses. It’s not that researchers are dumber, to be clear; it’s that context-menu software makes it far too easy to run analyses without understanding what’s going on under the hood.

Case in point that I use in class a lot? Cronbach’s alpha.

Here’s the formula for Alpha…

\(\alpha=\frac{N*\bar{c}}{\bar{v}+(N-1)*\bar{c}}\)

Without going through the details, \(N\) is the number of indicators being evaluated, \(\bar{v}\) is the average item variance, and \(\bar{c}\) is the average inter-item covariance. The kicker with alpha is that a large number of indicators will inflate it, even when the correlations among the indicators themselves are low.
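You can see the inflation directly from the formula. Here is a small R sketch (standardized items, so \(\bar{v}=1\) and \(\bar{c}\) is just the average inter-item correlation; the numbers are purely illustrative):

# Cronbach's alpha from the number of items and the average inter-item correlation
alpha_from_r <- function(N, r_bar) (N * r_bar) / (1 + (N - 1) * r_bar)

alpha_from_r(N = 3,  r_bar = 0.3)   # ~0.56: three weakly related items
alpha_from_r(N = 20, r_bar = 0.3)   # ~0.90: twenty equally weak items, 'great' alpha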

So far so good, but let’s look at the output from a common point-and-click software package—I don’t actually have a license for it, so I’m borrowing this image from the very excellent UCLA stats page.

Notice there is nothing in the output about the average inter-item covariance or correlation. My argument is that SPSS is one reason why we have so many 20+ item psychometric scales floating around with questionable reliability.

Now let’s take a look at R, with the popular psych package…

library(psych)
alpha(my.df)   # Cronbach's alpha for a three-item scale (RISK1-RISK3)
## 
## Reliability analysis   
## Call: alpha(x = my.df)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
##       0.83      0.83    0.76      0.61 4.7 0.028    4 1.2
## 
##  lower alpha upper     95% confidence boundaries
## 0.77 0.83 0.88 
## 
##  Reliability if an item is dropped:
##       raw_alpha std.alpha G6(smc) average_r S/N alpha se
## RISK1      0.78      0.78    0.64      0.64 3.6    0.040
## RISK2      0.71      0.71    0.55      0.55 2.4    0.054
## RISK3      0.78      0.78    0.64      0.64 3.6    0.040
## 
##  Item statistics 
##         n raw.r std.r r.cor r.drop mean  sd
## RISK1 116  0.84  0.85  0.72   0.66  4.0 1.4
## RISK2 116  0.89  0.89  0.81   0.73  3.9 1.4
## RISK3 116  0.85  0.85  0.73   0.66  4.1 1.4
## 
## Non missing response frequency for each item
##          1    2    3    4    5    6    7 miss
## RISK1 0.03 0.13 0.15 0.35 0.20 0.11 0.03    0
## RISK2 0.05 0.16 0.19 0.22 0.28 0.09 0.02    0
## RISK3 0.06 0.09 0.15 0.28 0.28 0.13 0.02    0

The data source isn’t important, but beyond the terrific output we see that the average_r (the average inter-item correlation) is pretty good at .61. The alpha itself is .83, well within the recommended range.

Now let’s take a look at a nine-item scale…

library(psych)
alpha(my.df2)  # same analysis for a nine-item scale (INN1-3, PRO1-3, RISK1-3)
## 
## Reliability analysis   
## Call: alpha(x = my.df2)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
##       0.85      0.85    0.88      0.39 5.9 0.021  4.1 1.1
## 
##  lower alpha upper     95% confidence boundaries
## 0.81 0.85 0.89 
## 
##  Reliability if an item is dropped:
##       raw_alpha std.alpha G6(smc) average_r S/N alpha se
## INN1       0.83      0.83    0.86      0.38 4.9    0.025
## INN2       0.83      0.84    0.85      0.39 5.1    0.024
## INN3       0.84      0.85    0.86      0.41 5.5    0.022
## PRO1       0.83      0.84    0.85      0.39 5.1    0.023
## PRO2       0.84      0.84    0.86      0.40 5.4    0.022
## PRO3       0.86      0.86    0.89      0.44 6.3    0.020
## RISK1      0.83      0.83    0.86      0.38 5.0    0.024
## RISK2      0.83      0.83    0.85      0.37 4.8    0.024
## RISK3      0.83      0.83    0.86      0.38 4.9    0.024
## 
##  Item statistics 
##         n raw.r std.r r.cor r.drop mean  sd
## INN1  113  0.76  0.74  0.70   0.65  3.9 1.9
## INN2  113  0.72  0.70  0.68   0.61  4.2 1.7
## INN3  113  0.64  0.62  0.58   0.52  3.8 1.7
## PRO1  113  0.70  0.70  0.67   0.60  4.1 1.6
## PRO2  113  0.65  0.64  0.59   0.53  4.3 1.7
## PRO3  113  0.46  0.48  0.36   0.33  4.5 1.4
## RISK1 113  0.71  0.72  0.68   0.62  4.0 1.4
## RISK2 113  0.75  0.77  0.75   0.68  3.8 1.4
## RISK3 113  0.73  0.75  0.72   0.65  4.1 1.4
## 
## Non missing response frequency for each item
##          1    2    3    4    5    6    7 miss
## INN1  0.12 0.23 0.06 0.17 0.18 0.16 0.09    0
## INN2  0.06 0.18 0.11 0.19 0.20 0.18 0.08    0
## INN3  0.10 0.13 0.20 0.20 0.18 0.13 0.05    0
## PRO1  0.08 0.11 0.12 0.26 0.25 0.14 0.05    0
## PRO2  0.09 0.08 0.12 0.21 0.25 0.18 0.08    0
## PRO3  0.04 0.09 0.09 0.20 0.36 0.18 0.04    0
## RISK1 0.04 0.13 0.14 0.35 0.20 0.12 0.02    0
## RISK2 0.05 0.15 0.19 0.22 0.28 0.09 0.01    0
## RISK3 0.06 0.10 0.15 0.27 0.28 0.12 0.02    0

Our alpha went up to .85—even better than before! But the average_r dropped to .39. The higher alpha comes from the larger number of indicators, not from the scale actually having higher internal consistency.
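As a rough sanity check, plugging the reported numbers back into the alpha formula (the same little helper from the sketch above, with average_r standing in for \(\bar{c}\)) reproduces both alphas:

alpha_from_r <- function(N, r_bar) (N * r_bar) / (1 + (N - 1) * r_bar)
alpha_from_r(N = 3, r_bar = 0.61)   # ~0.82: the three-item scale
alpha_from_r(N = 9, r_bar = 0.39)   # ~0.85: same alpha, much weaker items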

OK, I’m not being entirely fair to SPSS, and it’s always the responsibility of the researcher to understand the nuts and bolts of any analysis he or she runs.

Nonetheless, what I love about R is that it facilitates—empowers—a deeper understanding of the statistical tools that most applied researchers use on a daily basis. Rather than moving more and more of the stats to the background, R brings it to the foreground, giving the analyst the information to draw better conclusions. The added bonus? The R community is so large now that if you are stuck or having trouble, a quick search is usually all you need to find the answer.

The bottom line? R is a tool for data science, and given the replication crisis, we could all use a little more science, and the tools that support it.

Science and journalism in academic publishing


[Graphic: publication probability as a function of a paper’s science (causal inference) and its journalism (compelling storytelling)]

I’ve drawn a version of that graphic dozens of times now when talking to PhD students about publishing in management/entrepreneurship. The purpose of the graphic is to talk about publishing probabilities based on a given paper’s strengths—its ability to draw causal inference (good science), or its ability to tell an interesting and compelling story (good journalism).

As a field, we have a silly devotion to ‘making a theoretical contribution’ as a standard for publication. The necessity for each study to bring something new to the table is the exact opposite of what we should want as scientists: trustworthy, replicable results that inspire confidence in a model’s predictions.

Now, the happiest face, and hence the highest publication probability, is absolutely a paper that addresses an important topic, is well written and argued, AND has a high quality design with supporting empirics. This should be, of course, the goal. Producing such work consistently, however, is not easy. In our publish or perish world, promotion and tenure standards call for a portfolio of work, at least some of which is not likely to fall in our ideal box. So the question becomes, as a field, should we favor high quality science that may address less interesting topics or simple main effects? Or, should we favor papers that speak to an interesting topic but with research designs that represent a garden of forking paths and have less trustworthy results?

To put it another way, what matters more to our field, internal validity or external validity?

Again, the ideal is that both matter, although I’m in the camp that internal validity is a necessary condition for external validity—what’s the point of a generalizable finding that is wrong in the first place? But when it comes to editorial decisions—and I’ve certainly seen this in my own work—I would argue that, as a field, good journalism improves the odds of publication even with questionable empirics. I don’t have any data to support my hypothesis, although I typically don’t get much resistance when I draw my picture during a seminar.

Fortunately though, I think we’re slowly changing as a field. The increasing recognition of the replication crisis in science broadly and in associated fields like psychology and organizational behavior will, over time I believe, change the incentive structure to favor scientific rigor over journalistic novelty. Closer to my field, the encouraging changes in the editorial policies of Strategic Management Journal may help tilt the balance in favor of rigor.

In the spirit, then, of Joe Simmons’ recent post on prioritizing replication, I’d like our field to demonstrably lower the bar for novel theoretical insights in each new published study. That insistence on novelty is, to me, what is holding our field back from bridging the gap between academia and practice—why should we expect a manager to use our work if we can’t show that our results replicate?