Reviewer certification

I’m a fan of John Ioannidis and his work. He’s done a lot to draw attention to the use and abuse of frequentist statistics in, well, lots of the sciences.

In a recent article, he made the argument that “Teams which publish scientific literature need a ‘licence to analyse’ and this licence should be kept active through continuing methodological education.” I threw the question out to my PhD students, and there was a mixed reaction. One argument was the pace at which statistical theory advances and the difficulty for applied researchers to keep up. I’m sympathetic with that perspective, because I still feel like I’m playing catch up with my own methods knowledge (still have a long way to go!).

John also made this observation, though: “Journals also lack the expertise required to understand and properly review statistics. Until a couple of years ago, even Nature and Science did not have trained statisticians routinely reviewing papers, which continues to be true of most biomedical and biology journals.” That got me thinking, in my other role as a Field Editor for JBV, about the importance of reviewer education in statistical inference.

I’m thinking that flipping the requirement around might be a better way to go. What about a required online training class for members of a journal editorial board? For ad hoc reviewers, this course might be optional, but highly encouraged. The course wouldn’t take long, but it would cover the correct interpretation of the p-value, the role of standard errors in drawing consistent inferences, and so on. It would also dispel some popular myths that persist in the management literature, with the intention of improving the data science capabilities of the field: to publish, you would need to meet the standard that reviewers and editors are trained to look for.

I’m thinking this would be a net value add, and not difficult to implement…

Making a theoretical contribution

I'm not a fan of the necessity to have a novel theoretical contribution to publish in top management journals. Arguably, this standard has contributed to the replication crisis in the social sciences. Nonetheless, that is the standard, so it's helpful to think through just what a theoretical contribution means in the era of the replication crisis.

Making a theoretical contribution is absolutely in the eye of the beholder. My working hypothesis is that whether an editor believes you have met this standard has a lot to do with the clarity of the writing. The better the writing/argumentation, the better the probability that the editor and reviewers will see the 'theoretical contribution' in the paper. This makes writing quality the single biggest predictor of paper acceptance, not the 'That's Interesting!' standard that itself likely contributed to the replication crisis in the first place.

So given that writing quality is the key to crossing the theoretical contribution bar, I would argue that the single best way to enhance writing quality in a management study is clarity of purpose. This is, quite simply, being clear about what you are trying to accomplish. If the purpose of the study is to offer a grand theory of the firm, great. Just. Say. So.

If you have more modest aims, just say so. To me, a theoretical contribution is something that improves our understanding of the nomological relationship between two or more phenomena. If that means your study is a replication, GREAT! Just say that's what you are doing. If your study asks an existing question but does it in a more rigorous way, even better!!! We need to revisit a number of past findings with new data and new—yes, that means better—methods. My point is that the key to making a theoretical contribution is to just be clear; to be intellectually honest about the purpose behind a study.

As a spillover benefit, I think clarity of purpose will also help address the HARKing problem. If a study uses data from an already published paper, the code isn't made available, and the original study wasn't pre-registered, the paper is probably a fishing expedition masquerading as a new source of broad managerial insight. If it's fishing, just call it that. But you had better include a self-replication in the new submission!

Going all in with R

As a doctoral student, I took a multivariate statistics class with SAS. I took a time series class that used R. I had an SEM course using LISREL. My methods guru used Stata, and my adviser SPSS (yes, there is a reason I didn’t link to SPSS). Needless to say, I had a diverse background in statistics software.

Over the years I used, and taught with, mostly Stata, while trying to keep up with R.

This year, I’m all in with R.

Why? Because point-and-click statistics only makes the researcher degrees of freedom problem worse. Others much more qualified than I am have spoken on this, but I think the general trend toward context-menu-based software has had a ‘dumbing-down’ effect on the quality of empirical analyses. It’s not that researchers are getting dumber, to be clear; it’s that context-menu software makes it far too easy to run analyses without understanding what’s going on under the hood.

Case in point that I use in class a lot? Cronbach’s alpha.

Here’s the formula for Alpha…

\(\alpha=\frac{N\cdot\bar{c}}{\bar{v}+(N-1)\cdot\bar{c}}\)

Without going through the details, \(N\) is the number of indicators being evaluated, \(\bar{c}\) is the average inter-item covariance, and \(\bar{v}\) is the average item variance. The kicker with alpha is that a large number of indicators will inflate alpha, even when the correlations among the indicators themselves are low.
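
To make that concrete, here’s a minimal sketch in R using the standardized version of the formula (all item variances set to 1, so the average covariance becomes the average inter-item correlation); the helper name alpha_from_r and the .30 correlation are just illustrative:

# Standardized alpha from the number of items and the average
# inter-item correlation (all item variances assumed equal to 1).
alpha_from_r <- function(N, r_bar) (N * r_bar) / (1 + (N - 1) * r_bar)

# Hold the average correlation at a modest .30 and keep adding items:
round(sapply(c(3, 6, 9, 20), alpha_from_r, r_bar = 0.30), 2)
# roughly 0.56, 0.72, 0.79, 0.90: alpha clears the conventional .70
# cutoff simply because N grows, not because the items hang together better.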

So far so good, but let’s look at the output from a common point-and-click software package—I don’t actually have a license for it, so I’m borrowing this image from the very excellent UCLA stats page.

Notice there is nothing in the output about the average inter-item covariance or correlation. My argument is that SPSS is one reason why we have so many 20+ item psychometric scales floating around with questionable reliability.

Now let’s take a look at R, with the popular psych package…

library(psych)
alpha(my.df)  # my.df holds the three RISK items shown in the output below
## 
## Reliability analysis   
## Call: alpha(x = my.df)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
##       0.83      0.83    0.76      0.61 4.7 0.028    4 1.2
## 
##  lower alpha upper     95% confidence boundaries
## 0.77 0.83 0.88 
## 
##  Reliability if an item is dropped:
##       raw_alpha std.alpha G6(smc) average_r S/N alpha se
## RISK1      0.78      0.78    0.64      0.64 3.6    0.040
## RISK2      0.71      0.71    0.55      0.55 2.4    0.054
## RISK3      0.78      0.78    0.64      0.64 3.6    0.040
## 
##  Item statistics 
##         n raw.r std.r r.cor r.drop mean  sd
## RISK1 116  0.84  0.85  0.72   0.66  4.0 1.4
## RISK2 116  0.89  0.89  0.81   0.73  3.9 1.4
## RISK3 116  0.85  0.85  0.73   0.66  4.1 1.4
## 
## Non missing response frequency for each item
##          1    2    3    4    5    6    7 miss
## RISK1 0.03 0.13 0.15 0.35 0.20 0.11 0.03    0
## RISK2 0.05 0.16 0.19 0.22 0.28 0.09 0.02    0
## RISK3 0.06 0.09 0.15 0.28 0.28 0.13 0.02    0

The data source isn’t important, but beyond the terrific output we see that the average_r—the average inter-item correlation—is a healthy .61. The alpha itself is .83, well within the recommended range.

Now let’s take a look at a nine-item scale…

library(psych)
alpha(my.df2)  # my.df2 holds nine items: INN, PRO, and RISK
## 
## Reliability analysis   
## Call: alpha(x = my.df2)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
##       0.85      0.85    0.88      0.39 5.9 0.021  4.1 1.1
## 
##  lower alpha upper     95% confidence boundaries
## 0.81 0.85 0.89 
## 
##  Reliability if an item is dropped:
##       raw_alpha std.alpha G6(smc) average_r S/N alpha se
## INN1       0.83      0.83    0.86      0.38 4.9    0.025
## INN2       0.83      0.84    0.85      0.39 5.1    0.024
## INN3       0.84      0.85    0.86      0.41 5.5    0.022
## PRO1       0.83      0.84    0.85      0.39 5.1    0.023
## PRO2       0.84      0.84    0.86      0.40 5.4    0.022
## PRO3       0.86      0.86    0.89      0.44 6.3    0.020
## RISK1      0.83      0.83    0.86      0.38 5.0    0.024
## RISK2      0.83      0.83    0.85      0.37 4.8    0.024
## RISK3      0.83      0.83    0.86      0.38 4.9    0.024
## 
##  Item statistics 
##         n raw.r std.r r.cor r.drop mean  sd
## INN1  113  0.76  0.74  0.70   0.65  3.9 1.9
## INN2  113  0.72  0.70  0.68   0.61  4.2 1.7
## INN3  113  0.64  0.62  0.58   0.52  3.8 1.7
## PRO1  113  0.70  0.70  0.67   0.60  4.1 1.6
## PRO2  113  0.65  0.64  0.59   0.53  4.3 1.7
## PRO3  113  0.46  0.48  0.36   0.33  4.5 1.4
## RISK1 113  0.71  0.72  0.68   0.62  4.0 1.4
## RISK2 113  0.75  0.77  0.75   0.68  3.8 1.4
## RISK3 113  0.73  0.75  0.72   0.65  4.1 1.4
## 
## Non missing response frequency for each item
##          1    2    3    4    5    6    7 miss
## INN1  0.12 0.23 0.06 0.17 0.18 0.16 0.09    0
## INN2  0.06 0.18 0.11 0.19 0.20 0.18 0.08    0
## INN3  0.10 0.13 0.20 0.20 0.18 0.13 0.05    0
## PRO1  0.08 0.11 0.12 0.26 0.25 0.14 0.05    0
## PRO2  0.09 0.08 0.12 0.21 0.25 0.18 0.08    0
## PRO3  0.04 0.09 0.09 0.20 0.36 0.18 0.04    0
## RISK1 0.04 0.13 0.14 0.35 0.20 0.12 0.02    0
## RISK2 0.05 0.15 0.19 0.22 0.28 0.09 0.01    0
## RISK3 0.06 0.10 0.15 0.27 0.28 0.12 0.02    0

Our alpha went up to .85—even better than before! But the average_r dropped to .39. The higher alpha reflects the larger number of indicators, not a scale with genuinely higher internal consistency.
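
As a quick sanity check, plugging the reported average_r values back into the standardized formula comes close to reproducing both std.alpha figures above (the small gap for the three-item scale is just rounding in the reported average_r); a minimal sketch, reusing the same illustrative helper:

# Reconstruct std.alpha from the number of items and the reported average_r.
alpha_from_r <- function(N, r_bar) (N * r_bar) / (1 + (N - 1) * r_bar)
round(alpha_from_r(3, 0.61), 2)  # 0.82, close to the three-item scale's 0.83
round(alpha_from_r(9, 0.39), 2)  # 0.85, matches the nine-item scale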

OK, I’m not being entirely fair to SPSS, and it’s always the responsibility of the researcher to understand the nuts and bolts of any analysis he or she runs.

Nonetheless, what I love about R is that it facilitates—empowers, even—a deeper understanding of the statistical tools that most applied researchers use on a daily basis. Rather than pushing more and more of the statistics into the background, R brings them to the foreground, giving the analyst the information needed to draw better conclusions. The added bonus? The R community is so large now that if you are stuck or having trouble, a quick search is usually all you need to find an answer.

The bottom line? R is a tool for data science, and given the replication crisis, we could all use a little more science, and the tools that support it.

Science and journalism in academic publishing


[Figure: publication probability sketched along two dimensions, good science and good journalism]

I’ve drawn a version of that graphic dozens of times now when talking to PhD students about publishing in management/entrepreneurship. The purpose of the graphic is to talk about publishing probabilities based on a given paper’s strengths—its ability to draw causal inference (good science), or its ability to tell an interesting and compelling story (good journalism).

As a field, we have a silly devotion to ‘making a theoretical contribution’ as a standard for publication. The necessity for each study to bring something new to the table is the exact opposite of what we should want as scientists: trustworthy, replicable results that inspire confidence in a model’s predictions.

Now, the happiest face, and hence the highest publication probability, is absolutely a paper that addresses an important topic, is well written and argued, AND has a high quality design with supporting empirics. This should be, of course, the goal. Producing such work consistently, however, is not easy. In our publish or perish world, promotion and tenure standards call for a portfolio of work, at least some of which is not likely to fall in our ideal box. So the question becomes, as a field, should we favor high quality science that may address less interesting topics or simple main effects? Or, should we favor papers that speak to an interesting topic but with research designs that represent a garden of forking paths and have less trustworthy results?

To put it another way, what matters more to our field, internal validity or external validity?

Again, the ideal is that both matter, although I’m in the camp that internal validity is a necessary condition for external validity—what’s the point of a generalizable finding that is wrong in the first place? But when it comes to editorial decisions—and I’ve certainly seen this in my own work—I would argue that, as a field, good journalism improves the odds of publication even with questionable empirics. I don’t have any data to support my hypothesis, although I typically don’t get much resistance when I draw my picture during a seminar.

Fortunately though, I think we’re slowly changing as a field. The increasing recognition of the replication crisis in science broadly, and in associated fields like psychology and organizational behavior, will, I believe, change the incentive structure over time to favor scientific rigor over journalistic novelty. Closer to my field, the encouraging changes in the editorial policies of Strategic Management Journal may help tilt the balance in favor of rigor.

In the spirit, then, of Joe Simmons’ recent post on prioritizing replication, I’d like our field to demonstrably lower the bar for novel theoretical insights in each new published study. It is, to me, what is holding our field back from bridging the gap between academia and practice—why should we expect a manager to use our work if we can’t show that our results replicate?

Rigor and relevance

This post challenges the assumption that for an academic paper to be relevant it must be interesting, and that for a paper to be interesting it needs only adequate empirics rather than a rigorous research design and empirical treatment.

An easy critique of this assumption is to say that I’ve set up a straw man; to publish you need rigorous empirics AND a compelling story that makes a contribution. I don’t think that’s the case. I think as a field (management and entrepreneurship specifically), we are too willing to favor studies that are interesting over those that are less interesting, even when the less interesting paper has a stronger design and stronger empirics. The term interesting is, without question, subjectively determined by journal editors and reviewers—what is interesting to one scholar may or may not be interesting to another.

Generally we think of interesting in terms of making a theoretical contribution; the standard for publication at most of our top empirical journals is that a paper must offer a novel insight—or insights—to be publishable. The problem with this standard, as has been amply covered by others, is that it encourages, or forgives, researcher degrees of freedom that weaken statistical and causal inference in order to maximize the ‘interesting-ness’ factor. The ongoing debate over the replicability of power-posing is a notable case in point.

My hypothesis is that the willingness to trade rigorous research design for ‘novel’ insights is the root cause of the very real gap between academic management research and management practice. The requirement to make a novel insight encourages poor research behavior while minimizing the critical role that replicability plays in the trustworthiness of scientific research. In entrepreneurship research, we have also been late to embrace concepts like counterfactual reasoning and appropriate techniques for dealing with endogeneity, which weakens the causal inferences we can draw and hence the usefulness of our research.

In short, managers are less likely to adopt practices borne out of academic research not because such findings are unapproachable—although, true, studies aren’t easy reads—but because most academic research simply isn’t trustworthy. I’m not suggesting most research is the result of academic misconduct, far from it. But I am suggesting that weak designs and poorly done analyses lower the trustworthiness of study results and their usefulness to practice.

To be clear, a well done study that maximizes causal inference AND is theoretically novel is, certainly, ideal. But the next most important consideration should be a rigorously designed and executed study on a simple main effect relationship that maximizes causal inference and understanding. It may not be interesting, but at least it will be accurate.

The best way to be relevant is to be trustworthy, and the best way to be trustworthy is to be rigorous. You can’t have external validity without first maximizing internal validity.