Logistic regression – Pesky equations!

So the original post was a good lesson in verifying that MathJax works on my WordPress site before hitting the ‘Publish’ button. Sorry for my workflow hiccup.

To get the equations to render correctly, I’ve moved the post over to my Resources page, which you can access here.

I’ve been kicking around the idea of moving over to blogdown—might have to move that timeframe up a bit!

Interpreting logistic regression — Part I

 

I’m a big fan of logit models. Mostly because, in my opinion, they offer the clearest way to answer questions that are usually the most interesting to academics and entrepreneurs. Because we can turn the results of a logit model into a set of predicted probabilities, they let us answer questions like, “what’s the probability of a new business making a first sale given these conditions?” and “what’s the probability of an entrepreneur starting a new business given that he/she started a business before?”

The kicker, as I’ve found teaching logit models, is that while the final outcome is very helpful, getting there isn’t so easy. The path to probability is also a reason why logit and other limited dependent variable models are often misinterpreted in strategy and entrepreneurship research.

So this post walks through a logit model with two predictor variables, \(x1\) and \(x2\), that are both dichotomous (0/1). In a future Part II, I’ll do a logit model where \(x1\) and \(x2\) are both continuous variables.

Generating some data

We’ll start by generating the data for Model 1, with two dichotomous predictors:

library(tidyverse)  # Love the tidyverse!

# Set the random number generator seed to ensure reproducible results
set.seed(080203)

# Create a new dataframe with 1,000 observations and our two predictors 
model1.df <- data_frame(x1 = rbinom(1000, 1, .5),
                        x2 = rbinom(1000, 1, .5))

# Next we'll build the linear predictor, convert it to a probability, and draw y
model1.df <- model1.df %>%
  mutate(y = (.75*x1 + 1.5*x2)) %>%
  mutate(pr_y = exp(y)/(1 + exp(y))) %>%
  mutate(y = rbinom(1000, 1, pr_y)) %>%
  select(-pr_y)

# Export the data file to a csv for data sharing
write.csv(model1.df, file = "logit_sim_data.csv", row.names = FALSE)

# Now let's take a look
head(model1.df, 10)

Estimate our model

A logistic regression model is simply a logistic transformation of the linear model, so we can write our model in a familiar form…

\(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1+\beta_2 x_2\)

Our outcome variable is the log odds of \(y\) occurring (y = 1). The quantity \(\left[\frac{p}{1-p}\right]\) is the odds: the probability of \(y\) occurring divided by the probability of \(y\) not occurring.
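
To see how this maps back to a probability (a small derivation I’m adding here, using the same symbols as above), solve the log-odds equation for \(p\), which gives the inverse-logit form: \(p=\frac{e^{\alpha+\beta_1 x_1+\beta_2 x_2}}{1+e^{\alpha+\beta_1 x_1+\beta_2 x_2}}=\frac{1}{1+e^{-(\alpha+\beta_1 x_1+\beta_2 x_2)}}\). This is the same transformation we used to simulate the data above (the pr_y step), and it’s the form we’ll use later to calculate predicted probabilities by hand.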

We’ll use the glm function to estimate our model in R, specifying the binomial family, whose default logit link gives us our logistic transformation.

model1 <- glm(y ~ x1 + x2, data = model1.df, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = model1.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1750  -1.2270   0.6248   0.8443   1.1286  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.1160     0.1197   0.969    0.333    
## x1            0.7321     0.1566   4.675 2.94e-06 ***
## x2            1.4185     0.1604   8.846  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1137.7  on 999  degrees of freedom
## Residual deviance: 1034.1  on 997  degrees of freedom
## AIC: 1040.1
## 
## Number of Fisher Scoring iterations: 4

Making sense of the model

Here is where things get complicated.

Let’s take a look at the estimated coefficients…

log.odds <- summary(model1)$coefficients[1:3,1]
log.odds
## (Intercept)          x1          x2 
##   0.1159635   0.7320719   1.4185449

Our \(y\) variable is on the logit scale. That means the coefficient estimates are the change in the log odds of \(y\) occurring for a one-unit change in each predictor. The intercept in this case is the log odds of \(y\) occurring when both \(x1\) and \(x2\) are zero.
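
As a quick sanity check (my own aside, not part of the original walkthrough), the fitted intercept should land close to the raw log odds of \(y\) among the observations where both predictors are zero:

# A rough check (my addition): compare the fitted intercept to the empirical
# log odds of y in the x1 = 0, x2 = 0 cell of the simulated data
model1.df %>%
  filter(x1 == 0, x2 == 0) %>%
  summarise(empirical.log.odds = log(mean(y) / (1 - mean(y))))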

Log odds are not particularly helpful, so we can convert them from the logit scale by exponentiating them:

exp.odds <- exp(log.odds)
exp.odds
## (Intercept)          x1          x2 
##    1.122955    2.079384    4.131105

The exponentiated coefficients are not the “effect” of \(x\) on \(y\) as would be the common interpretation of a regular linear regression. They also aren’t the likelihood of \(y\) occurring, which is really a statement of probability (we’ll get to that). Because \(x1\) is dichotomous, we interpret the exponentiated coefficient on \(x1\) as the multiplicative change, on average, in the odds of \(y\) occurring as \(x1\) shifts from 0 to 1, holding \(x2\) constant.

Also, because \(x1\) is dichotomous, the shift from 0 to 1 covers its whole range, so this discrete change also serves as the marginal effect of \(x1\) on \(y\), holding \(x2\) constant (more on that later).

So what we can say is that the odds of \(y\) occurring are just over two times higher in the presence of \(x1\) (\(x1\) = 1) than when \(x1\) is not present (\(x1\) = 0), holding \(x2\) constant. To put it another way, the odds of \(y\) occurring when \(x1\) is present are roughly 2.08 times the odds when it is absent, holding \(x2\) constant (an odds ratio of 2.08:1).
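
As another check (again my own aside), we can recover that figure directly from the fitted model: the model-implied odds at \(x1\) = 1 divided by the model-implied odds at \(x1\) = 0, holding \(x2\) at zero, is just the exponentiated coefficient.

# My own check: the ratio of model-implied odds at x1 = 1 versus x1 = 0
# (holding x2 = 0) recovers the exponentiated coefficient for x1
b <- coef(model1)
odds.x1.present <- exp(b["(Intercept)"] + b["x1"])  # odds when x1 = 1, x2 = 0
odds.x1.absent  <- exp(b["(Intercept)"])            # odds when x1 = 0, x2 = 0
odds.x1.present / odds.x1.absent                    # roughly 2.08, matching exp.odds["x1"]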

Predicted probabilities

Odds are helpful on their own, but they become really useful when it comes to predicting the probability that \(y\) occurs at discrete values of the two predictors.

In our model, given that both \(x1\) and \(x2\) are dichotomous, we have four different possible probabilities of \(y\) predicted by our model, and I’ve shown each of them with the corresponding equation:

Probability x1 x2 Equation
Pr_y 0 0 \(\ln\left[\frac{p}{1-p}\right]=\alpha\)
Pr_y 1 0 \(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1\)
Pr_y 0 1 \(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_2 x_2\)
Pr_y 1 1 \(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1+\beta_2 x_2\)

Yes, there is a shortcut to doing this in R, but it’s helpful to walk through how to do this by hand. Let’s calculate the second row in the table—the probability that \(y\) occurs when \(x1\) occurs (\(x1\) = 1) and \(x2\) does not occur (\(x2\) = 0). Recall that the equation for this case is \(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1\), or equivalently: \(Pr\_y=\frac{1}{1+e^{-(\alpha+\beta_1 x_1)}}\) (see Wooldridge, 2012).

model1.intercept <- (summary(model1)$coefficients[1,1])
model1.betax1 <- (summary(model1)$coefficients[2,1])
pr_y = 1 / (1 + (exp(-1 * (model1.intercept + model1.betax1))))
pr_y
## [1] 0.7001548

On average and for this sample, there is a 70% probability of \(y\) occurring given that \(x1\) occurred and \(x2\) did not.
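
As an aside (not part of the original walkthrough), R’s built-in plogis() function is exactly this inverse-logit transformation, so it gives the same answer without writing out the formula:

# plogis() is the logistic CDF, i.e. the inverse logit; this should match pr_y above
plogis(model1.intercept + model1.betax1)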

Ok, now for the shortcut. We can use R’s predict function to pass in all of the possible combinations of \(x1\) and \(x2\) from our table and return the predicted probability of each. Note that we must use the ‘response’ option to return predicted probabilities, as opposed to log odds.

# We start with a new dataframe that contains our possible...
# ...values of x1 and x2, following our table above
model1.values <- data_frame(x1 = c(0, 1, 0, 1),
                            x2 = c(0, 0, 1, 1))
model1.values

We then pass the new dataframe to the predict function, and we’ll set the se.fit = TRUE option to return the standard errors for the prediction, so that we can construct confidence intervals.

model1.predict <- predict(model1, newdata = model1.values, type = "response", se.fit = TRUE)

# Create a new dataframe to store predictions and calculated confidence...
# ...intervals using +/- 1.96 for a 95% interval
model1.predict.df <- data_frame(pr_y = model1.predict$fit,
                                lower.ci = model1.predict$fit - (1.96 * model1.predict$se.fit),
                                upper.ci = model1.predict$fit + (1.96 * model1.predict$se.fit))

# Lets also convert these values to percentages to make them...
# ...easier to manage and interpret
model1.predict.df <- model1.predict.df %>%
  mutate(pr_y = 100 * round(pr_y, 4),
         lower.ci = 100 * round(lower.ci, 4),
         upper.ci = 100 * round(upper.ci, 4))

Now we can bind our predicted values dataframe to our model values dataframe (the one with the x1 and x2 values) to make a nice output table:

library(knitr)
model1.outcome <- bind_cols(model1.values, model1.predict.df)
kable(model1.outcome, caption = "Predicted probabilities of y at possible combinations of x1 and x2")
x1 x2 pr_y lower.ci upper.ci
0 0 52.90 47.05 58.74
1 0 70.02 64.75 75.28
0 1 82.27 78.34 86.19
1 1 90.61 87.86 93.35
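
One variation worth flagging (my own aside, not how the calculation above was done): because those intervals were built on the probability scale, they can in principle cross 0% or 100%. A common alternative is to build the interval on the link (log-odds) scale and then transform with plogis(), along these lines:

# A sketch of the link-scale alternative (my addition): predict on the
# log-odds scale, build the interval there, then transform to percentages
model1.link <- predict(model1, newdata = model1.values, type = "link", se.fit = TRUE)
data_frame(pr_y     = 100 * plogis(model1.link$fit),
           lower.ci = 100 * plogis(model1.link$fit - 1.96 * model1.link$se.fit),
           upper.ci = 100 * plogis(model1.link$fit + 1.96 * model1.link$se.fit))

For these data the two approaches should give very similar numbers, since none of the predicted probabilities sit near 0 or 1.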

While I’m a big fan of visualizations, there’s not much to plot with these two variables—all we get are point estimates. We’ll tackle plotting predicted probabilities in Logit Part II with continuous variables.

Credibility in strategic management research

Don Bergh¹ and colleagues published a great note in Strategic Organization recently on the question of reproducibility of results in strategy research. I agree with virtually everything in the paper, but this passage on page 8 caught my attention…

Overall, based on our sample of 88 SMJ articles, the strategic management literature appears vulnerable to credibility problems for two main reasons. One, the majority of the articles did not report their data sufficiently to permit reproduction, leaving us in the dark with regards to the accuracy of their reported results. Two, among those articles where reproduction analyses were possible, a significant number of discrepancies existed between reported and reproduced significance levels.

I’ve written about this before—what limits our impact on management practice is a lack of rigor, and not an excess of it. Here is another example of the problem. When a second scholar is not able to reproduce the results of a study, using the same data (correlation matrix) and same estimator, that’s a significant concern. We simply cannot say with confidence, especially given threats to causal inference, that a single reported study has the strength of effect reported if data, code, and other related disclosures about research design and methodology are absent. Rigor and transparency, to me, will be the keys to unlocking the potential impact on management practice from strategy and entrepreneurship research.

On a related note, it’s nice in this paper that the authors drew the distinction between reproducibility and replication, which sometimes gets confused. A reproduction of a study is the ability to generate the same results from a secondary analysis as reported in the original study, using the same data. A replication is the ability to draw similar nomological conclusions—generally with overlapping confidence intervals of the estimates—of a study using the same research design and methodology but on a different random sample.

Both reproducibility and replication are critical to building confidence and credibility in scientific findings. To me though, reproducibility is a necessary, but not sufficient condition for credibility. The easiest way to ensure reproducibility is to share data and to share code, and to do this early in the review process. For example, the Open Science Framework allows authors to make use of an anonymized data and file repository, allowing reviewers to check data and code without violating blind review.

While yes, many estimators (OLS, ML, covariance-based SEM) allow you to reproduce results based on a correlation/covariance matrix, as reported in the paper, this can be a tall order, what with the garden of forking paths problem. More problematic for strategy research is the use of panel/multilevel data, which was an area the authors didn’t touch on. In this case, a multilevel study’s reported correlation matrix would pool the lower- and higher-order variance together, effectively eliminating the panel structure. You could reproduce a naive, pooled model from the published correlation matrix, but not the multilevel model, which demonstrably limits its usefulness. This is a major reason why I’m in favor of dropping the standard convention of reporting a correlation matrix and instead requiring data and code.

Regardless though, lack of reproducibility is a significant problem in strategy, as in other disciplines. We’ve got a lot more work to do to build confidence in our results, and to have the impact on management practice that we could.

  1. In the interest of full disclosure, Dr. Bergh was a mentor of mine at the University of Denver—I was a big fan of him then, and I still am 🙂

Another take on p-values

Here’s an interesting take from a few days ago on the American Statistical Association’s statement on the use—and misuse—of p-values that was published last year. I’m certainly in the camp that p-values are more often than not misunderstood and misapplied in published studies, but the challenge I’ve found has been to communicate the myriad assumptions made when employing the p < .05 standard, and how shaky research that deviates from those assumptions can be.

The p-value is the probability of observing an effect as large as, or larger than, the one found, assuming that the null hypothesis is true. Generally in the null hypothesis testing framework, the assumption made is that the effect is zero. Not statistically different from zero, but, actually, zero. Therein lies one of the many problems with p-values—very rarely, if ever, would we expect absolutely zero effect in social science research. Our constructs are too noisy, and our theoretical explanations too loose, to reasonably expect an effect of zero.

So given that the null itself isn’t likely to be true, how do we reconcile the p < .05 standard? Well, the best option is to be a Bayesian, but if you have to retain a frequentist perspective, here is one explanation I use with my doctoral students.

The irony of p-values is that the more likely the null hypothesis is to be true, the less valuable the p-value becomes. You can think of it conceptually like a classic conditional probability, Pr(y|x): what’s the probability that a study reporting a statistically significant rejection of the null hypothesis is accurate, given the probability that the null hypothesis is true?

Now, to be clear, the p-value doesn’t—and can’t—say anything about the probability that the null hypothesis is true. What I’m talking about is the researcher using his/her own judgement based on prior work, theory, and deductive reasoning about just how likely it is that the null hypothesis is true in real life. For example, in EO research, the null hypothesis would be that entrepreneurial firms enjoy no performance advantage over conservatively managed firms. We could put the probability of the null hypothesis being true at about 10%—it’s difficult to imagine a meaningful context in which it doesn’t pay to be entrepreneurial, but it’s possible.

So how does that inform evaluating EO research reporting a p < .05 standard? Let’s imagine a study reports an effect of EO on firm growth at p = .01. Under a conventional interpretation, we would say that, assuming the null hypothesis is true, there was a 1 in 100 probability of observing an effect this large or larger. But real life and our judgement say that the null hypothesis has little chance of being true (say 10% in our example). In this case, the p-value actually works pretty well, although it’s not very valuable. The 1 in 100 chance under the null is not that far off from our 1 in 10 chance that being entrepreneurial doesn’t help the firm grow. It’s not valuable in the sense that it’s told us something that we already knew (or guessed) to be true in the real world.

But what about the case where a study reports p = .01, so still a 1 in 100 probability, but the likelihood of the null hypothesis itself being true is high, say 95%? In other words, the likelihood of the effect being real is very small. In this case the p-value can be downright dangerous. The best discussion of the exact probability breakdown is by Sellke et al. (2001), and here’s a non-paywalled discussion of the same concept. Starting from even odds on the null, the probability that it is still true after a ‘significant’ result is over 10% at p = .01 and almost 30% at p = .05 (see the sketch below); start from a highly probable null and those numbers get considerably worse. In short, the more probable the null, the more likely it is that ‘statistically significant evidence’ in favor of its rejection is flawed.
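
For the curious, here’s a rough sketch (my own, built on the Sellke et al. calibration rather than anything in the linked piece) of where numbers in that ballpark come from. The quantity \(-e \, p \ln(p)\) is the smallest Bayes factor in favor of the null consistent with a given p-value; starting from even prior odds on the null, it translates into a minimum probability that the null is still true after seeing a ‘significant’ result.

# A sketch of the Sellke, Bayarri, & Berger (2001) lower bound (my addition):
# the minimum Bayes factor in favor of H0 implied by a p-value is -e * p * log(p),
# valid for p < 1/e; convert the posterior odds back to a probability that H0
# is still true despite the 'significant' result
min.null.prob <- function(p, prior.odds = 1) {
  bf <- -exp(1) * p * log(p)      # minimum Bayes factor for H0
  post.odds <- prior.odds * bf    # posterior odds of H0
  post.odds / (1 + post.odds)     # posterior probability of H0
}
min.null.prob(0.01)  # about 0.11 -- "over 10%"
min.null.prob(0.05)  # about 0.29 -- "almost 30%"

And if the prior odds favor the null more heavily, as in the 95% scenario above (prior.odds = 19), the posterior probability that the null is still true climbs well past those figures.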

The bottom line is that there is no substitute for using your own judgement when evaluating a study. Ask yourself just how likely it is that the null hypothesis is true, particularly when evaluating research purporting to offer ‘surprising’, ‘novel’, and ‘counterintuitive’ findings. You might find that the author’s statistically significant novel finding is itself likely to be random variation, or as Andrew Gelman might say, the difference between significant and not-significant is not itself statistically significant.

Bad statistics and theoretical looseness

I’m actually a big fan of theory—I’m just not wild about the ways in which we (management and entrepreneurship scholars) test it. The driving reason is theoretical looseness: the ability to offer any number of theoretical explanations for a phenomenon of interest.

What concerns me most with theoretical looseness is that researchers often become blind to questioning results that don’t align with the preponderance of evidence in the literature. The race for publication, combined with the ability to offer what is a logically consistent—even if contradictory to most published research—explanation, makes it all too easy to slip studies with flimsy results into the conversation.

In EO research, we see this often with studies purporting to find a nil, or even a negative, effect of entrepreneurial behavior on firm growth. Is it possible? Sure. A good Bayesian will always allow for a non-zero prior, however small it might be. But is it logical? Well, therein lies the problem. Because our theories are generally broad, or because we can pull from a plethora of possible theoretical explanations that rarely provide specific estimates of causal effects and magnitudes, it is easy to take a contradictory result and offer an argument about why being entrepreneurial results in a firm’s growth decreasing.

The problem is, researchers often don’t take the extra steps to evaluate the efficacy of the models they estimated. Even basic checks like distributional assumptions and outliers are forgone in the race to write up the results and send them out for review. As estimators have become easier to use thanks to point-and-click software and macros, it’s even easier for researchers to throw data into the black box, get three asterisks, and then find some theoretical rationale to explain seemingly inconsistent results. It’s just too easy for bad statistics paired with easy theorizing to get published.

The answer, as others have noted, is to slow the process down. Here I think pre-prints are particularly valuable, which is one reason why I’ll be starting to use them myself. Ideas and results need time to percolate—to be looked at and to be challenged by the community. Once a paper is published it is simply too hard to ‘correct’ the record from one-off studies that, tragically, can become influential simply because they are ‘interesting’. In short, take the time to get it right, and avoid the temptation to pull a theoretical rabbit out of the hat when the results don’t align with the majority of the conversation.