Interpreting logistic regression – Part II

Part I dealt with logit models with dichotomous (0/1) predictors. These are handy models, but usually we deal with predictors that are continuous. That’s the point of Part II, where I walk through making sense of these models, focusing especially on calculating and plotting marginal effects and predicted probabilities.

You can find a link to Part II on my resources page, or here: http://bit.ly/2rZ0dqD

Happy Memorial Day weekend, and thanks to all who serve and have served!

Logistic regression – Pesky equations!

So the original post was a good lesson in verifying that MathJax works on my WordPress site before hitting the ‘Publish’ button. Sorry for my workflow hiccup.

To get the equations to render correctly, I’ve moved the post over to my Resources page, which you can access here.

I’ve been kicking around the idea of moving over to blogdown—might have to move that timeframe up a bit!

Interpreting logistic regression — Part I


I’m a big fan of logit models. Mostly because, in my opinion, they offer the clearest way to answer questions that are usually the most interesting to academics and entrepreneurs. Because we can turn the results of a logit model into a set of predicted probabilities, they let us answer questions like, “what’s the probability of a new business making a first sale given these conditions?” and “what’s the probability of an entrepreneur starting a new business given that he/she started a business before?”

The kicker, as I’ve found teaching logit models, is that while the final outcome is very helpful, getting there isn’t so easy. The path to probability is also a reason why logit and other limited dependent variable models are often misinterpreted in strategy and entrepreneurship research.

So this post walks through a logit model with two predictor variables, \(x1\) and \(x2\), that are both dichotomous (0/1). In a future Part II, I’ll do a logit model where \(x1\) and \(x2\) are both continuous variables.

Generating some data

We’ll start by generating the data for Model 1, with two dichotomous predictors.

library(tidyverse)  # Love the tidyverse!

# Set the random seed generator to ensure reproducible results
set.seed(080203)

# Create a new dataframe with 1,000 observations and our two predictors 
model1.df <- data_frame(x1 = rbinom(1000, 1, .5),
                        x2 = rbinom(1000, 1, .5))

# Next we'll build the linear predictor (no separate constant here),
# convert it to a probability, and then draw our observed y
model1.df <- model1.df %>%
  mutate(y = (.75*x1 + 1.5*x2)) %>%        # linear predictor
  mutate(pr_y = exp(y)/(1 + exp(y))) %>%   # inverse logit -> Pr(y = 1)
  mutate(y = rbinom(1000, 1, pr_y)) %>%    # draw the 0/1 outcome
  select(-pr_y)                            # drop the helper column

# Export the data file to a csv for data sharing
write.csv(model1.df, file = "logit_sim_data.csv", row.names = FALSE)

# Now let's take a look
head(model1.df, 10)
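
Before estimating anything, it helps to note what probabilities those simulation values imply for each combination of \(x1\) and \(x2\). A quick sketch using plogis(), base R's inverse logit; the estimates below should land in the same neighborhood.

# Implied Pr(y = 1) in each cell: linear predictors of 0, .75, 1.5, and 2.25
plogis(c("x1=0,x2=0" = 0,   "x1=1,x2=0" = .75,
         "x1=0,x2=1" = 1.5, "x1=1,x2=1" = 2.25))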

Estimate our model

A logistic regression is just a linear model for the log odds of the outcome, so we can write our model in a familiar form…

\(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1+\beta_2 x_2\)

Our outcome variable is the log odds of \(y\) occurring (\(y\) = 1). The odds, \(\frac{p}{1-p}\), are the probability of \(y\) occurring divided by the probability of \(y\) not occurring, and the model is linear in the natural log of those odds.
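
Solving that equation for \(p\) gives the predicted probability directly, which is the form we'll use later when calculating probabilities by hand:

\(p=\frac{e^{\alpha+\beta_1 x_1+\beta_2 x_2}}{1+e^{\alpha+\beta_1 x_1+\beta_2 x_2}}=\frac{1}{1+e^{-(\alpha+\beta_1 x_1+\beta_2 x_2)}}\)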

We’ll use the glm function to estimate our model in R and specify the binomial family, whose default logit link gives us our logistic transformation.

model1 <- glm(y ~ x1 + x2, data = model1.df, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = model1.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1750  -1.2270   0.6248   0.8443   1.1286  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.1160     0.1197   0.969    0.333    
## x1            0.7321     0.1566   4.675 2.94e-06 ***
## x2            1.4185     0.1604   8.846  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1137.7  on 999  degrees of freedom
## Residual deviance: 1034.1  on 997  degrees of freedom
## AIC: 1040.1
## 
## Number of Fisher Scoring iterations: 4

Making sense of the model

Here is where things get complicated.

Let's take a look at the estimated coefficients…

log.odds <- summary(model1)$coefficients[1:3,1]
log.odds
## (Intercept)          x1          x2 
##   0.1159635   0.7320719   1.4185449

Our \(y\) variable is on the logit scale. That means each coefficient estimate is the change in the log odds of \(y\) occurring for a one-unit change in that predictor, holding the other constant. The intercept in this case is the log odds of \(y\) occurring when both \(x1\) and \(x2\) are zero.
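
A quick sanity check of that intercept interpretation: converting it back to a probability with plogis() should land close to the raw share of \(y\) = 1 among observations where both predictors are zero.

# The model's implied Pr(y = 1) when x1 = 0 and x2 = 0...
plogis(log.odds["(Intercept)"])

# ...versus the raw proportion of y = 1 in that cell (close, but not identical,
# since the model has no x1:x2 interaction)
mean(model1.df$y[model1.df$x1 == 0 & model1.df$x2 == 0])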

Log odds are not particularly intuitive, so we can convert them off the logit scale by exponentiating:

exp.odds <- exp(log.odds)
exp.odds
## (Intercept)          x1          x2 
##    1.122955    2.079384    4.131105
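
We'll interpret these in a moment. If you also want a sense of the uncertainty around the exponentiated coefficients, one common approach is to exponentiate profile-likelihood confidence intervals; a quick sketch:

# Odds ratios with 95% profile-likelihood confidence intervals
exp(cbind(odds.ratio = coef(model1), confint(model1)))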

The exponentiated coefficients are not the “effect” of \(x\) on \(y\) as would be the common interpretation in ordinary linear regression. They also aren’t the likelihood of \(y\) occurring, which is really a statement of probability (we’ll get to that). Because \(x1\) is dichotomous, we interpret its exponentiated coefficient as the multiplicative change, on average, in the odds of \(y\) occurring as \(x1\) shifts from 0 to 1, holding constant \(x2\).

Also, because \(x1\) is dichotomous, its marginal effect is simply the discrete change in the outcome as \(x1\) moves from 0 to 1, holding \(x2\) constant, rather than a derivative evaluated at a particular point (more on that later).

So what we can say is that the odds of \(y\) occurring are just over two times higher when \(x1\) is present (\(x1\) = 1) than when it is not (\(x1\) = 0), holding \(x2\) constant. To put it another way, shifting \(x1\) from 0 to 1 multiplies the odds of \(y\) occurring by roughly 2.08, holding \(x2\) constant.
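
If you want to see roughly where that multiplier comes from in the raw data, a quick (and informal) check is to compute the odds of \(y\) = 1 in each cell and compare the \(x1\) = 1 odds to the \(x1\) = 0 odds within each level of \(x2\); sampling noise means the ratios won't match 2.08 exactly.

# Raw odds of y = 1 in each (x2, x1) cell
model1.df %>%
  group_by(x2, x1) %>%
  summarise(odds = mean(y) / (1 - mean(y)))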

Predicted probabilities

Odds are helpful, but where this really becomes useful is in predicting the probability that \(y\) occurs at specific values of the two predictors.

In our model, given that both \(x1\) and \(x2\) are dichotomous, we have four different possible probabilities of \(y\) predicted by our model, and I’ve shown each of them with the corresponding equation:

| Probability | x1 | x2 | Equation |
|-------------|----|----|----------|
| Pr_y | 0 | 0 | \(\ln\left[\frac{p}{1-p}\right]=\alpha\) |
| Pr_y | 1 | 0 | \(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1\) |
| Pr_y | 0 | 1 | \(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_2 x_2\) |
| Pr_y | 1 | 1 | \(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1+\beta_2 x_2\) |

Yes, there is a shortcut to doing this in R, but it's helpful to walk through how to do it by hand. Let's calculate the second row in the table: the probability that \(y\) occurs when \(x1\) occurs (\(x1\) = 1) and \(x2\) does not occur (\(x2\) = 0). Recall that the equation for this case is \(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1\), or equivalently: \(\Pr(y=1)=\frac{1}{1+e^{-(\alpha+\beta_1 x_1)}}\) (see Wooldridge, 2012).

model1.intercept <- (summary(model1)$coefficients[1,1])
model1.betax1 <- (summary(model1)$coefficients[2,1])
pr_y = 1 / (1 + (exp(-1 * (model1.intercept + model1.betax1))))
pr_y
## [1] 0.7001548

On average and for this sample, there is a 70% probability of \(y\) occurring given that \(x1\) occurred and \(x2\) did not.
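
As an aside, plogis() wraps up the same inverse-logit arithmetic, so the calculation above can be written more compactly:

# Same calculation using base R's inverse logit
plogis(model1.intercept + model1.betax1)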

Ok, now for the shortcut. We can pass all of the possible combinations of \(x1\) and \(x2\) from our table to R's predict function and get the predicted probability of each. Note that we must use type = "response" to return the predicted probabilities, as opposed to the log odds.

# We start with a new dataframe that contains our possible...
# ...values of x1 and x2, following our table above
model1.values <- data_frame(x1 = c(0, 1, 0, 1),
                            x2 = c(0, 0, 1, 1))
model1.values
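
If you'd rather not type the combinations out, base R's expand.grid() builds the same four rows (it returns a plain data frame rather than a tibble, which predict() handles just as well):

# The same four combinations, generated rather than typed
model1.values <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))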

We then pass the new dataframe to the predict function, and we’ll set the se.fit = TRUE option to return the standard errors for the prediction, so that we can construct confidence intervals.

model1.predict <- predict(model1, newdata = model1.values, type = "response", se.fit = TRUE)

# Create a new dataframe to store predictions and calculated confidence...
# ...intervals using +/- 1.96 for a 95% interval
model1.predict.df <- data_frame(pr_y = model1.predict$fit,
                                lower.ci = model1.predict$fit - (1.96 * model1.predict$se.fit),
                                upper.ci = model1.predict$fit + (1.96 * model1.predict$se.fit))

# Lets also convert these values to percentages to make them...
# ...easier to manage and interpret
model1.predict.df <- model1.predict.df %>%
  mutate(pr_y = 100 * round(pr_y, 4),
         lower.ci = 100 * round(lower.ci, 4),
         upper.ci = 100 * round(upper.ci, 4))
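
One side note before we build the table: the normal approximation on the response scale can push interval endpoints below 0 or above 1 near the boundaries. An alternative sketch (object names here are just for illustration) is to build the interval on the link, i.e. log odds, scale and then transform it with plogis():

# Predictions and standard errors on the link (log odds) scale
model1.link <- predict(model1, newdata = model1.values, type = "link", se.fit = TRUE)

# Build the interval on the link scale, then transform to probabilities,
# which keeps both bounds between 0 and 1
data_frame(pr_y     = plogis(model1.link$fit),
           lower.ci = plogis(model1.link$fit - 1.96 * model1.link$se.fit),
           upper.ci = plogis(model1.link$fit + 1.96 * model1.link$se.fit))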

Now we can bind our predicted values dataframe to our model values dataframe (the one with the x1 and x2 values) to make a nice output table:

library(knitr)
model1.outcome <- bind_cols(model1.values, model1.predict.df)
kable(model1.outcome, caption = "Predicted probabilities of y at possible combinations of x1 and x2")
| x1 | x2 | pr_y  | lower.ci | upper.ci |
|----|----|-------|----------|----------|
| 0  | 0  | 52.90 | 47.05    | 58.74    |
| 1  | 0  | 70.02 | 64.75    | 75.28    |
| 0  | 1  | 82.27 | 78.34    | 86.19    |
| 1  | 1  | 90.61 | 87.86    | 93.35    |

While I’m a big fan of visualizations, there’s not much to plot with these two variables—all we get are point estimates. We’ll tackle plotting predicted probabilities in Logit Part II with continuous variables.

Going all in with R

As a doctoral student, I took a multivariate statistics class with SAS. I took a time series class that used R. I had an SEM course using LISREL. My methods guru used Stata, and my adviser SPSS (yes, there is a reason I didn't link to SPSS). Needless to say, I had a diverse background in statistics software.

Over the years I used, and taught with, mostly Stata, while trying to keep up with R.

This year, I’m all in with R.

Why? Because point-and-click statistics only makes the researcher-degrees-of-freedom problem worse. Others much more qualified than I have spoken on this, but I think the general trend toward context-menu-based software has had a 'dumbing-down' effect on the quality of empirical analyses. To be clear, it's not that researchers are dumber; it's that context-menu software makes it far too easy to run analyses without understanding what's going on under the hood.

Case in point that I use in class a lot? Cronbach’s alpha.

Here’s the formula for Alpha…

\(\alpha=\frac{N\,\bar{c}}{\bar{v}+(N-1)\,\bar{c}}\)

Without going through the details, \(N\) is the number of indicators being evaluated, \(\bar{c}\) is the average inter-item covariance, and \(\bar{v}\) is the average item variance. The kicker with alpha is that a large number of indicators will inflate alpha even when the correlations among the indicators themselves are low.

So far so good, but let's look at the output from a common point-and-click software package. I don't actually have a license for it, so I'm borrowing this image from the very excellent UCLA stats page.

Notice there is nothing in the output about the average inter-item covariance or correlation. My argument is that SPSS is one reason why we have so many 20+ item psychometric scales floating around with questionable reliability.

Now let's take a look at R, with the popular psych package…

library(psych)
alpha(my.df)
## 
## Reliability analysis   
## Call: alpha(x = my.df)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
##       0.83      0.83    0.76      0.61 4.7 0.028    4 1.2
## 
##  lower alpha upper     95% confidence boundaries
## 0.77 0.83 0.88 
## 
##  Reliability if an item is dropped:
##       raw_alpha std.alpha G6(smc) average_r S/N alpha se
## RISK1      0.78      0.78    0.64      0.64 3.6    0.040
## RISK2      0.71      0.71    0.55      0.55 2.4    0.054
## RISK3      0.78      0.78    0.64      0.64 3.6    0.040
## 
##  Item statistics 
##         n raw.r std.r r.cor r.drop mean  sd
## RISK1 116  0.84  0.85  0.72   0.66  4.0 1.4
## RISK2 116  0.89  0.89  0.81   0.73  3.9 1.4
## RISK3 116  0.85  0.85  0.73   0.66  4.1 1.4
## 
## Non missing response frequency for each item
##          1    2    3    4    5    6    7 miss
## RISK1 0.03 0.13 0.15 0.35 0.20 0.11 0.03    0
## RISK2 0.05 0.16 0.19 0.22 0.28 0.09 0.02    0
## RISK3 0.06 0.09 0.15 0.28 0.28 0.13 0.02    0

The data source isn't important, but beyond the terrific output we see that average_r, the average inter-item correlation, is pretty good at .61. The alpha itself is .83, well within the recommended range.

Now let's take a look at a nine-item scale…

library(psych)
alpha(my.df2)
## 
## Reliability analysis   
## Call: alpha(x = my.df2)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
##       0.85      0.85    0.88      0.39 5.9 0.021  4.1 1.1
## 
##  lower alpha upper     95% confidence boundaries
## 0.81 0.85 0.89 
## 
##  Reliability if an item is dropped:
##       raw_alpha std.alpha G6(smc) average_r S/N alpha se
## INN1       0.83      0.83    0.86      0.38 4.9    0.025
## INN2       0.83      0.84    0.85      0.39 5.1    0.024
## INN3       0.84      0.85    0.86      0.41 5.5    0.022
## PRO1       0.83      0.84    0.85      0.39 5.1    0.023
## PRO2       0.84      0.84    0.86      0.40 5.4    0.022
## PRO3       0.86      0.86    0.89      0.44 6.3    0.020
## RISK1      0.83      0.83    0.86      0.38 5.0    0.024
## RISK2      0.83      0.83    0.85      0.37 4.8    0.024
## RISK3      0.83      0.83    0.86      0.38 4.9    0.024
## 
##  Item statistics 
##         n raw.r std.r r.cor r.drop mean  sd
## INN1  113  0.76  0.74  0.70   0.65  3.9 1.9
## INN2  113  0.72  0.70  0.68   0.61  4.2 1.7
## INN3  113  0.64  0.62  0.58   0.52  3.8 1.7
## PRO1  113  0.70  0.70  0.67   0.60  4.1 1.6
## PRO2  113  0.65  0.64  0.59   0.53  4.3 1.7
## PRO3  113  0.46  0.48  0.36   0.33  4.5 1.4
## RISK1 113  0.71  0.72  0.68   0.62  4.0 1.4
## RISK2 113  0.75  0.77  0.75   0.68  3.8 1.4
## RISK3 113  0.73  0.75  0.72   0.65  4.1 1.4
## 
## Non missing response frequency for each item
##          1    2    3    4    5    6    7 miss
## INN1  0.12 0.23 0.06 0.17 0.18 0.16 0.09    0
## INN2  0.06 0.18 0.11 0.19 0.20 0.18 0.08    0
## INN3  0.10 0.13 0.20 0.20 0.18 0.13 0.05    0
## PRO1  0.08 0.11 0.12 0.26 0.25 0.14 0.05    0
## PRO2  0.09 0.08 0.12 0.21 0.25 0.18 0.08    0
## PRO3  0.04 0.09 0.09 0.20 0.36 0.18 0.04    0
## RISK1 0.04 0.13 0.14 0.35 0.20 0.12 0.02    0
## RISK2 0.05 0.15 0.19 0.22 0.28 0.09 0.01    0
## RISK3 0.06 0.10 0.15 0.27 0.28 0.12 0.02    0

Our alpha went up to .85, even better than before! But average_r dropped to .39. The higher alpha comes from the larger number of indicators in the formula, not from the scale actually having higher internal reliability.
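
To see that this is the formula at work rather than a genuinely more reliable scale, here's a minimal sketch using the standardized version of alpha (correlations in place of covariances, so \(\bar{v}\) = 1); plugging in the average_r values from the two outputs above roughly reproduces both std.alpha figures.

# Standardized alpha from the number of items (N) and the average
# inter-item correlation (r_bar): N * r_bar / (1 + (N - 1) * r_bar)
alpha_std <- function(N, r_bar) (N * r_bar) / (1 + (N - 1) * r_bar)

alpha_std(N = 3, r_bar = .61)  # ~0.82, close to the three-item scale's .83
alpha_std(N = 9, r_bar = .39)  # ~0.85, despite the much lower average correlation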

Ok, I'm not being entirely fair to SPSS, and it's always the responsibility of the researcher to understand the nuts and bolts of any analysis he or she runs.

Nonetheless, what I love about R is that it facilitates—empowers—a deeper understanding of the statistical tools that most applied researchers use on a daily basis. Rather than moving more and more of the stats to the background, R brings it to the foreground, giving the analyst the information to draw better conclusions. The added bonus? The R community is so large now that if you are stuck or having trouble, a quick search is usually all you need to find the answer.

The bottom line? R is a tool for data science, and given the replication crisis, we could all use a little more science, and the tools that support it.