Paw Hansen - The policy evaluator’s toolbox: Presenting your next instrumental variable analysis

New post for my “policy evaluator’s toolbox” series! Today, let’s have a stab at an instrumental variable analysis.

As always, I’ll go over some of the basic intuition but the main focus will be on writing some code to produce a few useful graphs and tables. For more in-depth treatment of the topic, check out the references listed below.

Packages used in this post

library(tidyverse)
library(broom)
library(ggsci)
library(modelsummary)
library(marginaleffects)
library(estimatr)

theme_set(theme_minimal(base_size = 12))

Example: How does smoking affect your health?¹

Suppose we wanted to know if smoking affected your health. The core problem is that whether people smoke or not is not random. Rather, it is entangled in all kinds of other factors that might also affect health outcomes: education, gender, job type, and so on. In addition, health might affect smoking status because bad health could cause you to smoke fewer cigarettes or stop smoking altogether.

IV can help us. We need some variable affecting health outcomes only through smoking. Ideas? Economists have naturally had their eyes fixed on taxes — a cigarette tax will affect whether people smoke or not but not their health except through smoking status.

The idea is that we should look for provinces, states, or other units with varying cigarette taxes and then use that variation to estimate the effect of smoking on health outcomes.

To get started, we’ll simulate some data. We begin with a data frame of people and whether they smoke or not:

set.seed(853)

num_observations <- 10000

iv_example_data <- tibble(
  person = c(1:num_observations),
  smoker = 
    sample(x = c(0:1), 
           size = num_observations, 
           replace = TRUE)
  )

Then we add health outcomes:

iv_example_data <-
  iv_example_data |>
  mutate(health = if_else(
    smoker == 0,
    rnorm(n = n(), mean = 1, sd = 1),
    rnorm(n = n(), mean = 0, sd = 1)
  ))

And finally different tax levels on smoking for each of the two provinces:

iv_example_data <- iv_example_data |>
  mutate(
    province = case_when(
      smoker == 0 ~ sample(
        c("Nova Scotia", "Alberta"),
        size = n(),
        replace = TRUE,
        prob = c(1/2, 1/2)
      ),
      smoker == 1 ~ sample(
        c("Nova Scotia", "Alberta"),
        size = n(),
        replace = TRUE,
        prob = c(1/4, 3/4)
      )
    ),
    tax = case_when(province == "Alberta" ~ 0.3, 
                    province == "Nova Scotia" ~ 0.5,
                    TRUE ~ 9999999
                    )
    )

iv_example_data

# A tibble: 10,000 × 5
   person smoker  health province      tax
    <int>  <int>   <dbl> <chr>       <dbl>
 1      1      0  1.11   Alberta       0.3
 2      2      1 -0.0831 Alberta       0.3
 3      3      1 -0.0363 Alberta       0.3
 4      4      0  2.48   Alberta       0.3
 5      5      0  0.617  Nova Scotia   0.5
 6      6      0  0.748  Alberta       0.3
 7      7      0  0.499  Alberta       0.3
 8      8      0  1.05   Nova Scotia   0.5
 9      9      1  0.113  Alberta       0.3
10     10      1 -0.0105 Alberta       0.3
# ℹ 9,990 more rows

A first inspection of the data indicates that smoking does have some effect on health—healt status is lower for smokers in both provinces:

iv_example_data |> 
  mutate(smoker = as.factor(smoker)) |> 
  ggplot(aes(smoker, health)) +
  geom_jitter(aes(color = smoker), 
                  alpha = .2) + 
  facet_wrap(vars(province)) +
  scale_color_uchicago() +
  labs(y = "Health status",
       x = "Smoker or not") + 
  theme(legend.position = "bottom")

Figure 1: Comparing health status across smoking status, by provinces with different cigarette tax rates

To estimate the effect, we can run two regressions and take their ratio:

A regression predicting health from tax levels
A regression prediction smoker status from tax levels
The ratio of the two tax coefficients will then represent the effect of smoking on health status:

health_on_tax <- lm(health ~ tax, data = iv_example_data)
smoker_on_tax <- lm(smoker ~ tax, data = iv_example_data)

tibble(
  coefficient = c("health ~ tax", "smoker ~ tax", "ratio"),
  value = c(
    coef(health_on_tax)["tax"],
    coef(smoker_on_tax)["tax"],
    coef(health_on_tax)["tax"] / coef(smoker_on_tax)["tax"]
  )
)

# A tibble: 3 × 2
  coefficient   value
  <chr>         <dbl>
1 health ~ tax  1.24 
2 smoker ~ tax -1.27 
3 ratio        -0.980

Another way would be to use iv_robust() from the estimatr package:

Code

iv_res <- 
  iv_robust(health ~ smoker | tax, data = iv_example_data)

iv_res |> 
  modelsummary(gof_omit = "IC|Adj.",
               coef_rename = c("smoker" = "Effect of smoking"),
               stars = TRUE)

Table 1: Instrumental variable example using simulated data
	(1)
(Intercept)	0.977***
	(0.041)
Effect of smoking	−0.980***
	(0.081)
Num.Obs.	10000
R2	0.201
RMSE	1.00
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Finally, we could accompany Table 1 with some predictions on the original scale. For our simulated example, the values on “health status” does not have any meaningful interpretation but for a real world application, they might. In any case, it is always a sound approach to not only present the difference between treatment and control since the difference doesn’t convey any information about whether it is “large” or “small”.

Code

preds <- 
  predictions(iv_res, by = "smoker")

And then plot those predictions:

Code

preds |> 
  mutate(lbl = ifelse(smoker == 1, "Smoker", "Not smoker")) |> 
  ggplot(aes(as.factor(smoker), estimate, 
             ymin = conf.low, 
             ymax = conf.high)) + 
  geom_pointrange() + 
  geom_text(aes(label = lbl), 
            nudge_x = .25) + 
  scale_x_discrete(labels = NULL) + 
  labs(x = NULL,
       y = "Health status")

Figure 2: Instrumental variable example using simulated data

Final thoughts: Presenting your next instrumental variable analysis

There you have it. A super-brief introduction to get you started with IV estimation. There are a whole bunch of additional details I have now shown here as the goal was simply to provide some intuition. But now you have the building blocks to start thinking about using IV in your own work.

Cool! Where can I learn more?

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.
Alexander, R. (2023). Telling Stories with Data: With Applications in R. CRC Press.

Footnotes

Code for this example is largely adapted from Telling Stories with Data↩︎

Example: How does smoking affect your health?1

Final thoughts: Presenting your next instrumental variable analysis

Cool! Where can I learn more?

Footnotes

Example: How does smoking affect your health?¹