Selection models and weak instruments

As an editor and reviewer, I'm seeing more selection models (e.g., Heckman) these days that suffer from weak exclusion restrictions (i.e., weak instruments). Weak instruments are a problem in any method dealing with endogeneity in which an instrumental variable serves as a proxy for random selection, and Heckman selection models share this problem through their exclusion restriction (Bushway et al., 2007). Researchers employ a Heckman selection model to address omitted variable bias stemming from a specific sample selection problem. In the classic example, a model predicting the relationship between wages and education would only include in the sample those individuals who chose to work. This self-selection into the sample, for reasons unknown to the researcher, creates a type of omitted variable problem that manifests as endogeneity (Certo et al., 2016).

In the Heckman correction, the researcher estimates a first-stage probit model predicting the likelihood of the entity selecting into the sampling condition. For example, in a study of the effect of corporate venturing on firm performance, the first-stage equation would be a probit model^1 predicting the probability of the firm engaging in corporate venturing activity, taking the following form:

\(\Pr(y=1 \mid Z)=\Phi(Z\beta)\)

Here we estimate the probability of \(y\) occurring given a set of observed predictors, \(Z\), with effects \(\beta\), where \(\Phi\) is the cumulative distribution function of the standard normal distribution. Heckman's insight was that a transformation of the first-stage predicted values represents the hazard of selection into the sample (Clougherty et al., 2016). Including this transformation of the predicted values, often called the inverse Mills ratio, as a regressor in a second-stage ordinary least squares model of the focal relationship yields an estimate of the selection hazard, typically denoted lambda. In our example, lambda would represent the selection hazard of the firm engaging in corporate venturing activity, and evaluating its statistical significance serves as a test for a meaningful selection effect in the second-stage model.
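To make the two-step logic concrete, here is a minimal sketch in R using hypothetical data: a dataframe df with an outcome y, predictors x1 and x2, an exclusion variable z, and a 0/1 indicator selected for whether the observation appears in the outcome sample. These names are illustrative, not taken from any of the cited studies.

# Step 1: probit model of selection into the sample
first.stage <- glm(selected ~ x1 + x2 + z,
                   family = binomial(link = "probit"), data = df)

# Inverse Mills ratio built from the probit linear predictor (Z*beta)
zb <- predict(first.stage, type = "link")
df$imr <- dnorm(zb) / pnorm(zb)

# Step 2: OLS on the selected observations, adding the inverse Mills ratio;
# its coefficient is lambda, the selection hazard. Note that the standard
# errors reported here are NOT corrected (more on that below).
second.stage <- lm(y ~ x1 + x2 + imr, data = df[df$selected == 1, ])
summary(second.stage)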

Drawing the correct inference about selection effects in the second-stage model, though, depends on two critical factors. The first is that while including the inverse Mills ratio in the second-stage equation yields consistent parameter estimates (assuming all other assumptions are met), it also yields inconsistent standard errors for every estimated parameter (Clougherty et al., 2016). There are several ways to correct the standard errors in the second stage, including manual matrix manipulation, but most selection estimators (e.g., sampleSelection in R and heckman in Stata) make this correction automatically. The concern is whether the researcher used one of these estimators, or simply calculated the inverse Mills ratio by hand, included it as another regressor in a second model, and never corrected the standard errors.
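For reference, here is what the packaged version looks like with the same hypothetical data as in the sketch above; sampleSelection's heckit() runs both stages and applies the two-step standard-error correction automatically.

library(sampleSelection)

heck.2step <- heckit(selection = selected ~ x1 + x2 + z,
                     outcome   = y ~ x1 + x2,
                     data      = df)
summary(heck.2step)  # reports the inverse Mills ratio term (lambda) with corrected standard errors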

The second factor is high collinearity between the inverse Mills ratio and the other predictors in the second-stage equation. Because the first- and second-stage equations share the same vector of predictors, the transformed predicted value from the first stage correlates strongly with the predictors in the second stage. As in any multiple regression model, high collinearity inflates standard errors and produces unstable estimates. The solution is generally to include one or more additional predictors in the first stage that are then excluded from the second stage. Akin to instrumental variables, these predictors should influence selection into the sample (the first stage) but have no relationship to the disturbance term in the second stage (Certo et al., 2016). Failing to include these exclusion restriction variables, using weak exclusion variables, or using exclusion variables that are themselves endogenous will yield inconsistent estimates in the second-stage equation.
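There is no single test for a "strong enough" exclusion restriction, but one informal check (my own sketch, again using the hypothetical df, selected, and z from above) is whether the exclusion variable adds meaningful explanatory power to the first-stage probit:

first.with.z    <- glm(selected ~ x1 + x2 + z, family = binomial(link = "probit"), data = df)
first.without.z <- glm(selected ~ x1 + x2,     family = binomial(link = "probit"), data = df)

# Likelihood ratio test of the exclusion variable's contribution to selection
anova(first.without.z, first.with.z, test = "LRT")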

Given the difficulties inherent in properly specifying selection models, and given that the selection hazard parameter (lambda) deals only with endogeneity arising from sample selection, many scholars (myself included) recommend endogeneity correction approaches that address selection and other omitted variable concerns simultaneously (e.g., 2SLS, regression discontinuity, and so forth). The bottom line, though, is that just as with any instrumental variable method, the quality of the second-stage model is predicated on the quality of the first-stage model.


^1: While researchers often use logit and probit interchangeably, the Heckman method is a case where the researcher must use a probit model in the first-stage equation. The reason is the difference in distributional assumptions between the two models: the Heckman method depends on the assumption of bivariate normality of the error terms, which holds only with a probit first stage.


References:

Bushway S, Johnson BD, Slocum LA. 2007. Is the magic still there? The use of the Heckman two-step correction for selection bias in criminology. Journal of Quantitative Criminology 23(2): 151–178.

Certo ST, Busenbark JR, Woo H-S, Semadeni M. 2016. Sample selection bias and Heckman models in strategic management research. Strategic Management Journal 37(13): 2639–2657.

Clougherty JA, Duso T, Muck J. 2016. Correcting for self-selection based endogeneity in management research: Review, recommendations and simulations. Organizational Research Methods 19(2): 286–347.

Interpreting logistic regression – Part II

Part I dealt with logit models with dichotomous (0/1) predictors. These are handy models, but usually we deal with predictors that are continuous. That’s the point of Part II, where I walk through making sense of these models, focusing especially on calculating and plotting marginal effects and predicted probabilities.

You can find a link to Part II on my resources page, or here: http://bit.ly/2rZ0dqD

Happy Memorial Day weekend, and thanks to all who serve and have served!

Logistic regression – Pesky equations!

So the original post was a good lesson in verifying that MathJax works on my WordPress site before hitting the ‘Publish’ button. Sorry for my workflow hiccup.

To get the equations to render correctly, I’ve moved the post over to my Resources page, which you can access here.

I’ve been kicking around the idea of moving over to blogdown—might have to move that timeframe up a bit!

Interpreting logistic regression — Part I


I’m a big fan of logit models. Mostly because, in my opinion, they offer the clearest way to answer questions that are usually the most interesting to academics and entrepreneurs. Because we can turn the results of a logit model into a set of predicted probabilities, they let us answer questions like, “what’s the probability of a new business making a first sale given these conditions?” and “what’s the probability of an entrepreneur starting a new business given that he/she started a business before?”

The kicker, as I’ve found teaching logit models, is that while the final outcome is very helpful, getting there isn’t so easy. The path to probability is also a reason why logit and other limited dependent variable models are often misinterpreted in strategy and entrepreneurship research.

So this post walks through a logit model with two predictor variables, \(x1\) and \(x2\), that are both dichotomous (0/1). In a future Part II, I’ll do a logit model where \(x1\) and \(x2\) are both continuous variables.

Generating some data

We’ll start by generating the data for Model 1, with two dichotomous predictors:

library(tidyverse)  # Love the tidyverse!

# Set the random seed generator to ensure reproducible results
set.seed(080203)

# Create a new dataframe with 1,000 observations and our two predictors 
model1.df <- data_frame(x1 = rbinom(1000, 1, .5),
                        x2 = rbinom(1000, 1, .5))

# Next we'll create the linear predictor, convert it to a probability, and draw binary y
model1.df <- model1.df %>%
  mutate(y = (.75*x1 + 1.5*x2)) %>%
  mutate(pr_y = exp(y)/(1 + exp(y))) %>%
  mutate(y = rbinom(1000, 1, pr_y)) %>%
  select(-pr_y)

# Export the data file to a csv for data sharing
write.csv(model1.df, file = "logit_sim_data.csv", row.names = FALSE)

# Now let's take a look
head(model1.df, 10)

Estimate our model

A logistic regression model is simply a logistic transformation of the linear model, so we can write our model in a familiar form…

\(ln\left[\frac{p}{1-p}\right]=\alpha+{\beta_1}{x_1}+{\beta_2}{x_2}\)

The left-hand side is the log odds of \(y\) occurring (y = 1): the natural log of \(\left[\frac{p}{1-p}\right]\), the probability of \(y\) occurring divided by the probability of \(y\) not occurring.

We’ll use the glm function to estimate our model in R and specify the binomial family, whose default logit link gives us our logistic transformation.

model1 <- glm(y ~ x1 + x2, data = model1.df, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = model1.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1750  -1.2270   0.6248   0.8443   1.1286  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.1160     0.1197   0.969    0.333    
## x1            0.7321     0.1566   4.675 2.94e-06 ***
## x2            1.4185     0.1604   8.846  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1137.7  on 999  degrees of freedom
## Residual deviance: 1034.1  on 997  degrees of freedom
## AIC: 1040.1
## 
## Number of Fisher Scoring iterations: 4

Making sense of the model

Here is where things get complicated.

Let's take a look at the estimated coefficients…

log.odds <- summary(model1)$coefficients[1:3,1]
log.odds
## (Intercept)          x1          x2 
##   0.1159635   0.7320719   1.4185449

The model is estimated on the logit scale. That means each coefficient is the change in the log odds of \(y\) occurring for a one-unit change in that predictor, and the intercept is the log odds of \(y\) occurring when both \(x1\) and \(x2\) are zero.
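To see what that means in the raw data (my addition), we can compare the intercept to the empirical log odds in the \(x1\) = 0, \(x2\) = 0 cell; because the model has no interaction term, the two will agree only approximately:

# Empirical log odds of y when x1 = 0 and x2 = 0; roughly the fitted intercept
with(subset(model1.df, x1 == 0 & x2 == 0), log(mean(y) / (1 - mean(y))))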

Log odds are not particularly helpful, so we can convert them from the logit scale by exponentiating them:

exp.odds <- exp(log.odds)
exp.odds
## (Intercept)          x1          x2 
##    1.122955    2.079384    4.131105

The exponentiated coefficients are not the "effect" of \(x\) on \(y\) as in the common interpretation of a linear regression. Nor are they the likelihood of \(y\) occurring, which is really a statement of probability (we'll get to that). They are odds ratios: because \(x1\) is dichotomous, we interpret its exponentiated coefficient as the multiplicative change, on average, in the odds of \(y\) occurring as \(x1\) shifts from 0 to 1, holding \(x2\) constant.

Also, because \(x1\) is dichotomous, its coefficient captures the entire discrete change from 0 to 1, so on the log-odds scale it doubles as the marginal effect of \(x1\), holding \(x2\) constant (more on that later).

So what we can say is that the odds of \(y\) occurring are just over two times higher in the presence of \(x1\) (\(x1\) = 1) than when \(x1\) is not present (\(x1\) = 0), holding \(x2\) constant. To put it another way, for a given value of \(x2\), the odds of \(y\) occurring when \(x1\) = 1 are roughly 2.08 times the odds when \(x1\) = 0.
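As a quick check (my own addition, reusing the log.odds object from above), we can recover that 2.08 figure directly from the coefficients by comparing the implied odds at \(x1\) = 1 versus \(x1\) = 0, holding \(x2\) at 0:

# Odds of y at x1 = 1 vs. x1 = 0, with x2 held at 0
odds.x1.1 <- exp(log.odds["(Intercept)"] + log.odds["x1"])
odds.x1.0 <- exp(log.odds["(Intercept)"])
odds.x1.1 / odds.x1.0  # equals exp(log.odds["x1"]), roughly 2.08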

Predicted probabilities

Odds are helpful, but they become really useful when we use them to predict the probability that \(y\) occurs at discrete values of the two predictors.

In our model, given that both \(x1\) and \(x2\) are dichotomous, we have four different possible probabilities of \(y\) predicted by our model, and I’ve shown each of them with the corresponding equation:

Probability   x1   x2   Equation
Pr_y          0    0    \(ln\left[\frac{p}{1-p}\right]=\alpha\)
Pr_y          1    0    \(ln\left[\frac{p}{1-p}\right]=\alpha+{\beta_1}{x_1}\)
Pr_y          0    1    \(ln\left[\frac{p}{1-p}\right]=\alpha+{\beta_2}{x_2}\)
Pr_y          1    1    \(ln\left[\frac{p}{1-p}\right]=\alpha+{\beta_1}{x_1}+{\beta_2}{x_2}\)

Yes, there is a shortcut for doing this in R, but it's helpful to walk through the calculation by hand. Let's calculate the second row in the table: the probability that \(y\) occurs when \(x1\) occurs (\(x1\) = 1) and \(x2\) does not (\(x2\) = 0). Recall that the equation for this case is \(ln\left[\frac{p}{1-p}\right]=\alpha+{\beta_1}{x_1}\), or equivalently: \(Pr\_ y=\frac{1}{1+{e}^{-(\alpha+{\beta_1}{x_{1}})}}\) (see Wooldridge, 2012).

model1.intercept <- (summary(model1)$coefficients[1,1])
model1.betax1 <- (summary(model1)$coefficients[2,1])
pr_y = 1 / (1 + (exp(-1 * (model1.intercept + model1.betax1))))
pr_y
## [1] 0.7001548

On average and for this sample, there is a 70% probability of \(y\) occurring given that \(x1\) occurred and \(x2\) did not.
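As an aside (my addition), base R's plogis() is the logistic CDF, so it performs the same transformation in one step:

# plogis(q) is 1 / (1 + exp(-q)), so this reproduces the 0.7001548 above
plogis(model1.intercept + model1.betax1)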

Ok, now for the shortcut. We can use R's predict function to pass all of the possible combinations of \(x1\) and \(x2\) from our table and return the predicted probability for each. Note that we must use the type = "response" option to get predicted probabilities, as opposed to log odds.

# We start with a new dataframe that contains our possible...
# ...values of x1 and x2, following our table above
model1.values <- data_frame(x1 = c(0, 1, 0, 1),
                            x2 = c(0, 0, 1, 1))
model1.values

We then pass the new dataframe to the predict function, and we’ll set the se.fit = TRUE option to return the standard errors for the prediction, so that we can construct confidence intervals.

model1.predict <- predict(model1, newdata = model1.values, type = "response", se.fit = TRUE)

# Create a new dataframe to store predictions and calculated confidence...
# ...intervals using +/- 1.96 for a 95% interval
model1.predict.df <- data_frame(pr_y = model1.predict$fit,
                                lower.ci = model1.predict$fit - (1.96 * model1.predict$se.fit),
                                upper.ci = model1.predict$fit + (1.96 * model1.predict$se.fit))

# Let's also convert these values to percentages to make them...
# ...easier to manage and interpret
model1.predict.df <- model1.predict.df %>%
  mutate(pr_y = 100 * round(pr_y, 4),
         lower.ci = 100 * round(lower.ci, 4),
         upper.ci = 100 * round(upper.ci, 4))

Now we can bind our predicted values dataframe to our model values dataframe (the one with the x1 and x2 values) to make a nice output table:

library(knitr)
model1.outcome <- bind_cols(model1.values, model1.predict.df)
kable(model1.outcome, caption = "Predicted probabilities of y at possible combinations of x1 and x2")
x1 x2 pr_y lower.ci upper.ci
0 0 52.90 47.05 58.74
1 0 70.02 64.75 75.28
0 1 82.27 78.34 86.19
1 1 90.61 87.86 93.35

While I’m a big fan of visualizations, there’s not much to plot with these two variables—all we get are point estimates. We’ll tackle plotting predicted probabilities in Logit Part II with continuous variables.

Another take on p-values

Here’s an interesting take from a few days ago on the American Statistical Association’s statement on the use (and misuse) of p-values, published last year. I’m certainly in the camp that p-values are more often than not misunderstood and misapplied in published studies, but the challenge I’ve found has been communicating the myriad assumptions behind the p < .05 standard and how shaky research that deviates from those assumptions can be.

The p-value is the probability of observing an effect at least as large as the one observed, assuming that the null hypothesis is true. Generally, in the null hypothesis testing framework, the assumption made is that the effect is zero. Not statistically different from zero, but actually zero. Therein lies one of the many problems with p-values: very rarely, if ever, would we expect an effect of exactly zero in social science research. Our constructs are too noisy, and our theoretical explanations too loose, to reasonably expect an effect of zero.

So given that the null itself isn’t likely to be true, how do we reconcile the p < .05 standard? Well, the best option is to be a Bayesian, but if you have to retain a frequentist perspective, here is one explanation I use with my doctoral students.

The irony of p-values is that the more likely the null hypothesis is to be true, the less valuable the p-value becomes. You can think of it conceptually like a classic conditional probability, Pr(y|x): what's the probability that a study reporting a statistically significant rejection of the null hypothesis is accurate, given the probability that the null hypothesis is true?

Now, to be clear, the p-value doesn't (and can't) say anything about the probability that the null hypothesis is true. What I'm talking about is the researcher using his/her own judgement, based on prior work, theory, and deductive reasoning, about just how likely it is that the null hypothesis is true in real life. For example, in EO research, the null hypothesis would be that entrepreneurial firms enjoy no performance advantage over conservatively managed firms. We could put the probability of the null hypothesis being true at about 10%; it's difficult to imagine a meaningful context in which it doesn't pay to be entrepreneurial, but it's possible.

So how does that inform evaluating EO research that meets the p < .05 standard? Let's imagine a study reports an effect of EO on firm growth at p = .01. Under a conventional interpretation, we would say there was a 1 in 100 probability of observing this effect, or a larger one, due to random chance, assuming the null hypothesis is true. But real life and our judgement say the null hypothesis has little chance of being true (say 10% in our example). In this case the p-value actually works pretty well, although it's not very valuable. The 1 in 100 chance that the results are due to random chance is not that far off from our 1 in 10 chance that being entrepreneurial doesn't help the firm grow. It's not valuable in the sense that it has told us something we already knew (or guessed) to be true in the real world.

But what about the case where a study reports p = .01, still a 1 in 100 probability, but the likelihood of the null hypothesis itself being true is high, say 95%? In other words, the likelihood of the effect being real is very small. The best discussion of the exact probability breakdown is Sellke et al. (2001), and here's a non-paywalled discussion of the same concept. In this case, the p-value can be downright dangerous: the probability of incorrectly rejecting a highly probable null hypothesis is over 10% at p = .01, and almost 30% at p = .05. In short, the more probable the null, the more likely it is that ‘statistically significant evidence’ in favor of its rejection is flawed.
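For readers who want to see where numbers like these come from, here is a small sketch (my own, not from the ASA statement) of the Sellke et al. (2001) calibration, which converts a p-value into a lower bound on the posterior probability that the null is true. The 10%/30% figures above correspond to 50/50 prior odds on the null; the prior_h0 argument lets you plug in other priors, such as the 95% case.

# Sellke et al. (2001): for p < 1/e, the Bayes factor in favor of H0 is
# bounded below by -e * p * log(p)
sellke.bound <- function(p, prior_h0 = 0.5) {
  bf <- -exp(1) * p * log(p)                    # lower bound on the Bayes factor for H0
  post.odds <- (prior_h0 / (1 - prior_h0)) * bf # posterior odds of H0
  post.odds / (1 + post.odds)                   # lower bound on Pr(H0 | p)
}

sellke.bound(c(0.01, 0.05))        # ~0.11 and ~0.29 with 50/50 prior odds
sellke.bound(c(0.01, 0.05), 0.95)  # much higher when the null is highly probable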

The bottom line is that there is no substitute for using your own judgement when evaluating a study. Ask yourself just how likely it is that the null hypothesis is true, particularly when evaluating research purporting to offer ‘surprising’, ‘novel’, and ‘counterintuitive’ findings. You might find that the author's statistically significant novel finding is itself likely to be random variation, or as Andrew Gelman might say, the difference between significant and not-significant is not itself statistically significant.