Start with a simple model

As a field, we seem to be gravitating to a modus operandi where model complexity equates to making a theoretical contribution. The notion seems to be that the more variables we have, the more hypotheses we test in a single paper, and the more mediators and moderators we include, the greater the ‘contribution’ of the paper. Models end up with causal paths pointing in each cardinal direction, all under the assumption that the ‘more stuff’ I include equates to richness of understanding.

I think this is a bad trend, for three reasons.

1) The likelihood of model misspecification: the more variables, hypotheses, and complexity in the model, the more likely that the entire model will not just be wrong, but really, really bad;

2) Perverse incentives: tying a theoretical contribution to model complexity invites HARKing and p-hacking; and

3) Barriers to replication: the more a model depends on a particular set of variables from a particular dataset, each with its own complex construction, the harder it is for another researcher to replicate the study.

So what’s the answer? Start with a simple model…

  • A single x, predicting a single y;
  • Minimal measurement error for both variables;
  • A large, appropriately powered sample;
  • Appropriate steps to eliminate alternate explanations for the relationship between x and y (ideally by manipulating x); and
  • A reproducible codebook to ensure others can follow along with what you did.

Seriously, that’s it. Now, it’s actually really hard to do steps 2, 3, and 4. These are, however, critical to yielding an unbiased estimate of the effect of x on y. Noisy measures in noisy data with small true effect sizes are far more likely to yield unpredictable (and usually inflated) results. A well-developed measure, with measurement error kept to a minimum, still needs a large dataset to tease out meaningful insights. Too often we see large datasets paired with measures so convoluted that it is hard to understand just what the researcher did to build them, let alone have confidence that the observed effect is not simply an artifact of the measurement model. That makes the contribution trivial at best. By the same token, well-done measurement models tested in small, noisy samples result in a similar interpretational problem; it’s too difficult to separate the signal from the noise.
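
The point about inflated estimates can be illustrated with a small simulation. What follows is only a sketch in Python under assumptions of my choosing, not an analysis from any study: the true effect (0.1), the sample sizes, and the function name are all hypothetical, and significance is judged with a rough |t| > 1.96 cutoff. It regresses y on x many times and keeps only the "significant" slopes.

```python
import random
import statistics


def significant_slopes(true_effect, n, reps, seed=1):
    """Simulate y = true_effect * x + noise, fit OLS, keep slopes with |t| > 1.96."""
    rng = random.Random(seed)
    kept = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_effect * xi + rng.gauss(0, 1) for xi in x]
        mx, my = statistics.fmean(x), statistics.fmean(y)
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = my - slope * mx
        rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
        se = (rss / (n - 2) / sxx) ** 0.5  # standard error of the slope
        if abs(slope / se) > 1.96:  # normal approximation to the t cutoff
            kept.append(slope)
    return kept


TRUE_EFFECT = 0.1  # hypothetical small true effect
small = significant_slopes(TRUE_EFFECT, n=30, reps=2000)
large = significant_slopes(TRUE_EFFECT, n=1000, reps=200)

print("n=30:   mean |significant slope| =",
      round(statistics.fmean(abs(s) for s in small), 3))
print("n=1000: mean |significant slope| =",
      round(statistics.fmean(abs(s) for s in large), 3))
```

With the small sample, the slopes that clear the significance bar have to be several times larger than the true effect to get there, so the "published" estimate overstates it; with the large sample, the significant slopes sit close to the true value.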

Step 4, dealing with endogeneity, is a topic near and dear to my heart. Here’s my specific problem: it’s so challenging to isolate a consistent effect size estimate for ONE focal relationship. The more hypotheses and variables added to the model, assuming the researcher tests them simultaneously, the harder it becomes to recover consistent effect sizes; you are far more likely to screw the entire model up.

Of course, sharing your code, and ideally your data, is pretty easy. But it’s just not something commonly done in management and entrepreneurship research. I hope that is changing, and for me, all of my papers now include posted codebooks and data. There is just no good reason not to.

I think one solution is for journals to encourage more single hypothesis papers. Take an interesting question—say estimating the probability that a failed entrepreneur will start another new venture—and evaluate that question with 2-3 independent studies, with consistent measures, in large representative samples, and ideally with the same instruments used to address the endogeneity problem. As an incentive, journals could offer expedited review of these studies, assuming that the researcher shared his or her data and code.

The bottom line is that headline grabbing effect sizes with sexy variables in complicated models are, over the long run, far more likely to be found wanting than vindicated as possibly right. Science progresses with small, incremental contributions to our knowledge base. Start with a simple model, test it rigorously, and better our management science.

The Grand Theory of Entrepreneurship Fallacy

Periodically I have a conversation where the topic turns to entrepreneurship researchers’ inability to answer, with precision, why some ventures succeed, some fail, some become zombies, and some become unicorns. Similar conversations surround the topic of startup communities and clusters, and the role of research universities in supporting entrepreneurial ecosystems. Often someone bemoans that we have study after study that addresses only one small piece of the puzzle, or that one study may contradict another, or that a study is simply too esoteric to be useful.

My response is, well, that’s social science.

I am a social scientist, and proud to be one. I think across the social science domain, including management and entrepreneurship research, we have much to offer the students, businesses, governments, and other stakeholders we serve. But the one thing we aren’t particularly good at is humility. Humility in the sense that when we talk about our research and what we can offer, we aren’t always very good at acknowledging the limitations of our work.

Think about predicting the weather. The cool thing about the weather is that it’s governed by the laws of physics, and we know a lot about physics. But even with our knowledge, computational power, and millions of data points, there remains considerable uncertainty about predicting the weather over the next 24, 48, and 72 hours. Part of the reason is that interactions between variables in the environment are difficult to account for, difficult to model, and especially difficult to predict. Meteorologists are exceptionally good forecasters, but are far from perfect. And this is a field where the fundamental relationships are law-like.

The hard reality is that establishing unequivocal causal relationships in the social sciences is extremely hard, let alone forecasting specific effect sizes. We don’t deal with law-like relationships, measuring latent phenomena means error is always present, eliminating alternate explanations is maddeningly complex, and, well, we’re humans (that not-being-perfect-thing). Interactions among social forces and social phenomena are not only difficult to model, but in many ways are simply incomprehensible.

One technique we use as social scientists is to hold many factors that we cannot control and cannot observe as constant, and to build a much simpler model of a phenomenon than exists in reality. It helps us make sense of the world, but it comes at the cost of ignoring other factors that may be important, or even more important, than what we are trying to understand. It also means that our models are subjective—the answer provided by one model may not be the answer provided by another. In a sense, models are equally right and equally wrong.

Stakeholders who are not social scientists get frustrated with us because they want simple, unequivocal answers. What is also troublesome is that some social scientists, despite knowing better, are more than happy to tell the stakeholder that “yes, I’ve got the answer, and this is it.” When that answer turns out not to work as advertised, the search begins again, although this time with the stakeholder even more frustrated than before.

Making the matter even more complicated are statistical tools and methodologies that seem to provide that unequivocal answer; the effect of x on y is z—when x changes by a given amount, expect y to change by z amount. It seems so simple, so believable, that it’s easy to be fooled into thinking that the numbers produced by a statistics package represent truth, when the reality of that number is, well, far from ‘truth’.

In conversations which turn to wanting simple, unequivocal answers about entrepreneurship—what I call the grand theory of entrepreneurship fallacy—telling the weather analogy helps. But it’s also easy to say that there simply aren’t simple answers. I can’t answer the question because there isn’t an answer; you are trying to solve an unsolvable problem. The best that I can provide, and the best that entrepreneurship data science can provide, is an educated guess. That guess will have a credibility interval around it, and will be narrowly applicable, and be subject to update as new data comes in and new relationships between variables emerge. That’s the best we can do, and be extremely wary of the researcher who says he or she can do better!
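
To make "an educated guess with an interval that updates as new data comes in" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the counts are invented, the flat Beta(1, 1) prior is just one possible choice, and the function name is hypothetical. It estimates a proportion (say, the share of failed founders who start another venture) with a beta-binomial model and reads an approximate 95% interval off simulated posterior draws.

```python
import random


def credible_interval(successes, failures, prior_a=1.0, prior_b=1.0,
                      level=0.95, draws=20000, seed=7):
    """Approximate equal-tailed credible interval for a proportion.

    A Beta(prior_a, prior_b) prior combined with binomial data gives a
    Beta(prior_a + successes, prior_b + failures) posterior; the interval
    is taken from sorted Monte Carlo draws of that posterior.
    """
    rng = random.Random(seed)
    a, b = prior_a + successes, prior_b + failures
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper


# Hypothetical data: 6 of 20 founders restart, then 60 of 200 after more data.
first = credible_interval(successes=6, failures=14)
updated = credible_interval(successes=60, failures=140)

print("after  20 observations:", tuple(round(v, 2) for v in first))
print("after 200 observations:", tuple(round(v, 2) for v in updated))
```

The interval narrows as observations accumulate, but it never collapses to a single number, and it only speaks to the population the data came from.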

We characterize our human experience with uncertainty and with variance. Don’t expect anything better from data science on that human experience.

Selection models and weak instruments

As an editor and reviewer, I’m seeing more selection models (e.g., Heckman) these days that suffer from weak exclusion restrictions (i.e., weak instruments). Weak instruments are a problem in any method dealing with endogeneity where an instrument variable, i, is a proxy for random selection. Heckman selection models share a similar problem of weak instruments, and it has to do with the exclusion restriction (Bushway et al., 2007). Researchers employ a Heckman selection model to address omitted variable bias stemming from a specific sample selection problem. In the classic example, a model predicting the relationship between wages and education would include in the sample only those educated individuals who chose to work. The self-selection into the sample, for reasons unknown to the researcher, creates a type of omitted variable problem manifesting as endogeneity (Certo et al., 2016).

In the Heckman correction, the researcher estimates a first-stage probit model predicting the likelihood of the entity selecting into the sampling condition. For example, in a study of the effect of corporate venturing on firm performance, the first stage equation would be a probit model^1 predicting the probability of the firm engaging in corporate venturing activity, in the following form:

$$\Pr(y = 1 \mid Z) = \mathrm{cdf}(Z\beta)$$

Here we estimate the probability of y occurring given a set of observed predictors, Z, with effects β, and cdf as the cumulative distribution function of the standard normal distribution. Heckman’s insight was to recognize that a transformation of the predicted values in the first stage represents the selection hazard of appearing in the sample (Clougherty et al., 2016). Including this transformation, often called the inverse Mills ratio, as a regressor in a second-stage ordinary least squares estimate of the focal model yields an estimate of the selection hazard, typically denoted by lambda. In our example, lambda would represent the selection hazard of the firm engaging in corporate venturing activity. The statistical significance of lambda serves as a proxy for the presence of a meaningful selection effect in the second stage model.
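
For readers who have not seen the transformation written out, here is a minimal Python sketch of the inverse Mills ratio for an observation that selected into the sample: the standard normal density of the first-stage linear predictor divided by its cumulative distribution function. The index values are hypothetical, and this is only the transformation itself, not a full two-step estimator.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def inverse_mills_ratio(index):
    """Return pdf(index) / cdf(index) for a first-stage probit index Z * beta."""
    return STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)


# Hypothetical first-stage linear predictors for three firms.
for index in (-1.0, 0.0, 1.0):
    prob = STD_NORMAL.cdf(index)  # predicted probability of selection
    print(f"index={index:+.1f}  Pr(select)={prob:.3f}  "
          f"IMR={inverse_mills_ratio(index):.3f}")
```

Firms with a low predicted probability of selecting in get a large ratio, and firms that were near-certain to select in get a ratio close to zero.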

Drawing the correct inference about selection effects in the second stage model depends, though, on two critical factors. The first factor is that while including the inverse Mills ratio in the second stage equation yields a consistent estimate of x (assuming all other assumptions are met), it also yields inconsistent standard errors for every estimated parameter (Clougherty et al., 2016). There are several methods to correct the standard errors in the second stage, including manual matrix manipulation, but most selection estimators (e.g., sampleSelection in R and heckman in Stata) make this correction automatically. The concern is whether the researcher used these estimators, or simply calculated the inverse Mills ratio by hand, included the value as another regressor in a second model, and then didn’t correct the standard errors.

The second factor is high collinearity between the inverse Mills ratio and the other predictors in the second stage equation. Because the first and second stage equations share the same vector of predictors, the transformed predicted value in the first stage correlates strongly with the predictors in the second stage. As in any multiple regression model, high collinearity yields unstable estimates with inflated standard errors. The solution is generally to include one or more additional predictors in the first stage that are then excluded in the second stage. Akin to instrument variables, these predictors should influence selection into the sample (the first stage), but then have no relationship to the ultimate disturbance term in the second stage (Certo et al., 2016). Failure to include these exclusion restriction variables, using weak exclusion variables, or using exclusion variables that are themselves endogenous, will yield inconsistent estimates in the second stage equation.
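
A small simulation can make the collinearity point concrete. The Python sketch below is an illustration under invented assumptions, not something taken from the cited papers: x is a predictor shared by both stages, w is a hypothetical variable that appears only in the first stage, and the first-stage coefficients (0.5 and 1.0) are made up. It compares how strongly the inverse Mills ratio correlates with x with and without the excluded variable.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def inverse_mills_ratio(index):
    """pdf / cdf of the standard normal at the first-stage index."""
    return STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(42)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]  # predictor in both stages
w = [rng.gauss(0, 1) for _ in range(n)]  # hypothetical first-stage-only predictor

# No exclusion restriction: the first-stage index depends on x alone.
imr_shared = [inverse_mills_ratio(0.5 * xi) for xi in x]
# With an exclusion restriction: w also drives selection.
imr_excluded = [inverse_mills_ratio(0.5 * xi + 1.0 * wi) for xi, wi in zip(x, w)]

print("corr(x, IMR) without excluded variable:", round(pearson(x, imr_shared), 3))
print("corr(x, IMR) with excluded variable:   ", round(pearson(x, imr_excluded), 3))
```

Without w, the ratio is almost a linear function of x, so the second stage has very little independent variation to work with. A strong excluded variable breaks that near-perfect correlation; a weak one (try shrinking the 1.0 coefficient toward zero) does not.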

Given the difficulties inherent to properly specifying selection models, and that the selection hazard parameter (lambda) only deals with endogeneity specifically from sample selection, many scholars, myself included, recommend using endogeneity correction approaches that deal with selection along with other omitted variable concerns simultaneously (e.g., 2SLS, regression discontinuity, and so forth). The bottom line, though, is that, just as with any instrument variable method, the quality of the second stage model is predicated on the quality of the first stage model.

^1: While researchers often use logit and probit interchangeably, the Heckman method is a case where the researcher must use a probit model in the first stage equation. The reason is the distributional assumption differences between the two models—the Heckman method depends on the assumption of bivariate normality, which is an outcome only of probit limited dependent variable models.


Bushway S, Johnson BD, Slocum LA. 2007. Is the magic still there? The use of the Heckman two-step correction for selection bias in criminology. Journal of Quantitative Criminology 23(2): 151–178.

Certo ST, Busenbark JR, Woo H-S, Semadeni M. 2016. Sample selection bias and Heckman models in strategic management research. Strategic Management Journal 37(13): 2639–2657.

Clougherty JA, Duso T, Muck J. 2016. Correcting for self-selection based endogeneity in management research: Review, recommendations and simulations. Organizational Research Methods 19(2): 286–347.

Interpreting logistic regression – Part II

Part I dealt with logit models with dichotomous (0/1) predictors. These are handy models, but usually we deal with predictors that are continuous. That’s the point of Part II, where I walk through making sense of these models, focusing especially on calculating and plotting marginal effects and predicted probabilities.
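
As a quick taste of what that involves, here is a minimal Python sketch of the two quantities for a logit model with one continuous predictor. It is not the code from Part II, and the intercept and slope are hypothetical values chosen for illustration. The predicted probability is the logistic function of the linear predictor, and the marginal effect at a given x is the slope multiplied by p(1 - p).

```python
import math


def predicted_probability(intercept, slope, x):
    """Logistic transform of the linear predictor intercept + slope * x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def marginal_effect(intercept, slope, x):
    """Change in predicted probability per unit change in x, evaluated at x."""
    p = predicted_probability(intercept, slope, x)
    return slope * p * (1.0 - p)


# Hypothetical coefficients, as if taken from a fitted logit model.
INTERCEPT, SLOPE = -1.0, 0.8

for x in (-2.0, 0.0, 1.25, 4.0):
    p = predicted_probability(INTERCEPT, SLOPE, x)
    me = marginal_effect(INTERCEPT, SLOPE, x)
    print(f"x={x:+.2f}  Pr(y=1)={p:.3f}  marginal effect={me:.3f}")
```

The marginal effect is not constant: it peaks where the predicted probability is 0.5 and shrinks toward zero in the tails, which is why a single logit coefficient cannot be read as "the" effect of x.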

You can find a link to Part II on my resources page.

Happy Memorial Day weekend, and thanks to all who serve and have served!

Logistic regression – Pesky equations!

So the original post was a good lesson in verifying that MathJax works on my WordPress site before hitting the ‘Publish’ button. Sorry for my workflow hiccup.

To get the equations to render correctly, I’ve moved the post over to my Resources page, which you can access here.

I’ve been kicking around the idea of moving over to blogdown—might have to move that timeframe up a bit!