I am a political scientist who uses, and teaches, quantitative methods in both Stata and R. I write functions for models, techniques, and analysis routines I employ frequently in my own research and to assist students in my classes as they are introduced to statistics using R. These functions save me time when I’m conducting my analyses, and share them here in the hope they will do the same for you.
Each of the functions comes with a list of instructions and examples for usage embedded into the notes at the start of the functions themselves. If you find any errors in these functions, have any questions about their use, or (even better) have made an improvement to them that you would like to share, please don’t hesitate to contact me.
To use these functions, simply download my R Startup file that contains them all, copy and paste the script for the particular function(s) you would like to use into your R session, and then follow the instructions and examples embedded in each function describing the function’s use. I will update these periodically and will include in the notes for each function the date of the last update, as well as a list of any changes.
While the startup file contains a variety of functions, many of which are not listed here, here are some that will likely be of most general interest (the last set of functions work similar to the “margins” command in Stata):
robustse.f: a function that generates robust and cluster-robust standard errors in a format that makes them easily used with packages like stargazer and in plotting.
TwowayME.f: a functions to generate two-way marginal effects plots (for both continuous and dichotomous modifying variables)
ThreewayME.f: a function to generate three-way marginal effects plots, placing stars on the effect lines where the effect is statistically significant.
predict.f: a function to generate predicted values in a format that they are easily plotted by another of my functions:
All of these are described, with examples, below:
This is a function to be run after you estimate a lm, glm, or glm-like model in R. It will estimate Huber-White robust standard errors as well as the HAC cluster-robust standard errors you will likely want to use in contexts like fixed-effects regression. Additionally, it provides an option for a degrees of freedom correction for the clustered standard errors.
When this function is applied to a model object (e.g.
test.lm), it generates a new model object that appends either “rob” or “clustrob” to the original model name, depending on whether you specified robust or cluster-robust errors (e.g. “test.lm.rob” or “test.lm.clustrob”).
This new object contains two new items, in addition to the items in the original model estimation object:
se: which contains the robust standard errors, and
vcovHC/VcovCL: the robust/cluster-robust variance-covariance matrix.
I use the “se” item in this new object with the “se” option in the “stargazer” package to print model results with robust standard errors. Here is a code example using data from Stock and Watson’s California test scores dataset:
## after loading the data (and the stargazer package): test.lm <- lm(testscr ~ str + el_pct, data = test.df) robustse.f(test.lm) #This will print out the results of the model, with robust standard errors, and also create a robust model object containing "se" and "vcovHC": test.lm.rob. stargazer(test.lm.rob, covariate.labels=c("Intercept","Student-teacher ratio","English learner pct."), type="html", style="apsr", title="Effects of STR and EL_PCT on Test Scores", intercept.bottom = FALSE, intercept.top=TRUE, digits=3, star.cutoffs=c(.05,.01,.001), se=list(test.lm.rob$se), notes=c("Robust standard errors in parentheses"), out = PUT PATH HERE... )
Were this panel data (it is not), with scores gathered across states, then we could calculate cluster-robust standard errors for test.lm as follows:
robustse.f(test.lm, ~state, F) #The "F" here means that the state dummies should not count against the model's degrees of freedom.
Together, these next set of functions provide functionality similar to Stata’s “margins” command:
This function provides an easy way to generate a two-way marginal effects plot from your lm or glm model when you have a continuous, as-if continuous, or dichotomous modifying variable in your interaction term. Note that for this to work well, the independent variables must be in the following order (see further notes on this in the function itself): x z controls x*z. The function has a number of options, allowing users to identify the level of confidence intervals shown, whether they would like a density histogram or rug plot for the variable on the x-axis, whether they would like non-robust, robust, or cluster-robust standard errors, etc.
A code example, once again using the test scores data identified above:
#Let's begin with an interactive model (all variables are coded continuously): test.lm <- lm(testscr ~ str + el_pct + str*el_pct, data = test.df) #Now to generate the marginal effects plot: TwowayME.f(test.lm, test.df$str, test.df$el_pct, "English learner pct.", "Marginal effect of STR on test scores", 95, rob, , ,hist) #Note that we've left the "cluster" and "df_correction" fields blank, as we do not want clustered standard errors. But if we did, we would put in the name of the cluster variable and identify whether or not we wanted the df_correction (just as in the robustse.f function). If we left a blank in place of "hist," we would get a rug plot of el_pct on the x-axis instead of a density histogram.
This function works similar to the two-way marginal effects functions, but allows for the plotting of three-way interactions. These plots, of necessity, do not include confidence intervals around the effect lines, replacing these instead with stars on the line that signify where the estimate effect is significant.
This function provides an easy way to graph the predicted values, or conditional expected values, of Y from an lm or glm model. It works similarly to R’s built-in predict function, but allows for predictions with robust, and cluster-robust, standard errors, and it outputs the data in a format that it can be automatically plotted by the predict.plot.f function described below.
As with the stock predict function in R, to use this function, one must first create a predictions dataframe, containing the values in the independent variables one would like to use to predict the expected values of Y. This predictions dataframe is then used in the function to generate our predicted values.
Here’s an example using the test scores data above:
test.lm <- lm(testscr ~ str + el_pct, data = test.df) pred.df <- data.frame("str" = c(20, 40), "el_pct" = c(50,50)) predict.lm.f(test.lm,pred.df,rob) #If I wanted clustered standard errors, I would add the cluster variable, by name preceded by a ~, as in the robustse.f function.
This will generate a new dataframe titled
test.lm.pred.df that contains the predicted values as well as the values used for prediction. The results can then easily be plotted by the
predict.plot.f function as follows:
predict.plot.f(test.lm.pred.df, test.lm.pred.df$str,"Student Teacher Ratio", "Predicted Test Scores")
Note that for the glm models, this function generates predicted probabilities, transforming the regression coefficients. Note also that the
predict.f function generates confidence intervals around the predictions. It does not generate prediction intervals.
This function is designed to work like Stata’s “marginsplot” command, easily graphing the results from the
predict.f function above. It can generate multiple prediction lines with a legend. See the earlier example for its use.