If you’re an avid SAS user, you’re likely very familiar with SAS macros. SAS macros are a key component of writing efficient and concise code. Although R has no direct equivalent of the SAS macro language, it offers other features, such as functions and loops, that can perform the same tasks as SAS macros.
Using apply() to loop over variables
In SAS, if we wanted to run multiple linear regressions using different predictor variables, we could use a simple SAS macro to iterate over the independent variables. In R, we can simplify this even more by making use of the apply() function. The apply() function comes from the R base package and is one of many members of the apply() family. The members of the family (which also contains lapply(), sapply(), mapply(), etc.) differ in the data structures of their inputs and outputs.
apply(X, MARGIN, FUN, ...) takes three main arguments. R is case-sensitive, so the argument names must be written in upper case when they are given by name.
- X is an array or matrix. A data frame is also accepted and is coerced to a matrix first.
- MARGIN indicates whether the function should be applied over rows (MARGIN = 1) or columns (MARGIN = 2).
- FUN indicates what function should be applied. Any R function can be used, including those created by the user.
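As a quick illustration of the three arguments, here is a minimal sketch on a small made-up matrix (the matrix `m` and the variable names are assumptions for illustration only, not part of the original example):

```r
# A small made-up matrix; matrix() fills column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2, ncol = 3)

row_totals <- apply(m, MARGIN = 1, FUN = sum)  # one value per row
col_totals <- apply(m, MARGIN = 2, FUN = sum)  # one value per column

# A user-defined function can be passed in the same way
col_ranges <- apply(m, 2, function(col) max(col) - min(col))

print(row_totals)  # 9 12
print(col_totals)  # 3 7 11
print(col_ranges)  # 1 1 1
```

With MARGIN = 1 the function receives each row as a vector, and with MARGIN = 2 it receives each column.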
For the regression example, we will use the built-in R dataset mtcars (first 6 rows shown below).
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Returning to the question posed earlier: how can we iterate over variables to run multiple regressions with different predictor variables?
The following apply() call takes the dataset mtcars and subsets the variables cyl (number of cylinders), disp (displacement), and wt (weight) as the variables we want to apply the function to. We specify the margin as 2 so that it iterates over the 3 columns. Finally, we specify a user-defined function that takes the independent variable as a parameter and outputs the summary statistics of a linear model where mpg (miles per gallon) is the outcome variable. Because the function's parameter is named ind, each predictor appears as ind in the output, and the list element names ($cyl, $disp, $wt) identify which variable was used.
apply(mtcars[, c("cyl", "disp", "wt")], 2, function(ind) {
  summary(lm(mpg ~ ind, data = mtcars))
})
## $cyl
##
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9814 -2.1185  0.2217  1.0717  7.5186
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
## ind          -2.8758     0.3224   -8.92 6.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
## F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
##
##
## $disp
##
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.8922 -2.2022 -0.9631  1.6272  7.2305
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## ind         -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
##
##
## $wt
##
## Call:
## lm(formula = mpg ~ ind, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## ind          -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
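The same iteration can also be sketched with lapply() over the variable names instead of over the columns themselves. This is one possible alternative, not the approach used above: it builds each formula with reformulate() from the stats package, so the coefficient is labelled with the real variable name rather than ind. The names `predictors`, `models`, and `slopes` are chosen here for illustration.

```r
data(mtcars)

predictors <- c("cyl", "disp", "wt")

# Fit one simple regression per predictor name.
# reformulate() builds the formula mpg ~ <predictor> from a string.
models <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(models) <- predictors

# Collect the slope estimate of each model into one named vector
slopes <- sapply(predictors, function(v) coef(models[[v]])[[v]])
print(round(slopes, 4))

# Full summary for a single model, if needed
summary(models[["wt"]])
```

The slopes should agree with the ind estimates in the output above, since the fitted models are the same. Keeping the lm objects in a named list also makes it easy to call summary(), coef(), or predict() on any one of them later.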
Using a for loop to iterate over variable names
Let’s say we want to examine the sales of 3 products sold over 100 days. Our goal is to sell at least 45 units of each product per day. We can use a for loop to create new dummy variables that indicate whether we sold 45 or more units that day.
First, we create a sample data set from randomly generated integers:
set.seed(22)
product1 <- sample(30:50, 100, replace = TRUE)
product2 <- sample(40:60, 100, replace = TRUE)
product3 <- sample(35:48, 100, replace = TRUE)
sales <- as.data.frame(cbind(product1, product2, product3))
head(sales)
##   product1 product2 product3
## 1       36       43       35
## 2       39       43       41
## 3       50       54       37
## 4       40       52       48
## 5       47       41       36
## 6       45       46       44
Then we can use a for loop to iterate over the variable names in the dataset. The paste function allows us to create a new variable name containing the old variable name and the condition. We can then assign a 0 or 1 to the new variable depending on whether the sales goal of 45 or more was met.
for (p in names(sales)) {
  # Name the new column after the product and the condition, e.g. "product1>=45"
  sales[[paste(p, ">=45", sep = "")]] <- as.numeric(sales[[p]] >= 45)
}
print(head(sales))
##   product1 product2 product3 product1>=45 product2>=45 product3>=45
## 1       36       43       35            0            0            0
## 2       39       43       41            0            0            0
## 3       50       54       37            1            1            0
## 4       40       52       48            0            1            1
## 5       47       41       36            1            0            0
## 6       45       46       44            1            1            0
Problems with using for loops in R
In general, it is cleaner, and often more efficient, to use a vectorized function or one of the apply() functions instead of a for loop when possible. For loops in R tend to be slow on large data sets mainly when each iteration adds new values to an object, for example by growing a data frame with functions like cbind. If a loop is needed, it is better to preallocate a new vector, matrix, or data frame for the loop to fill. By preallocating space, you prevent R from having to copy and expand the object on every iteration.
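To make the preallocation point concrete, here is a minimal sketch that compares growing a vector inside a loop with filling a preallocated one. The function names `grow`, `prealloc`, and `vectorized` are made up for this illustration, and the timings will vary by machine and R version, so no expected timings are shown.

```r
# Growing a vector: each c() call builds a new, longer copy
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Preallocating: the vector is created once and filled by index
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Vectorized: no explicit loop at all
vectorized <- function(n) seq_len(n)^2

n <- 10000

# All three approaches produce the same values
stopifnot(isTRUE(all.equal(grow(n), prealloc(n))))
stopifnot(isTRUE(all.equal(prealloc(n), vectorized(n))))

# Compare the elapsed times
print(system.time(grow(n)))
print(system.time(prealloc(n)))
print(system.time(vectorized(n)))
```

The gap between the first two versions usually widens as n grows, because the cost of re-copying the growing vector increases with its length.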
Using the ifelse() function
One way of getting around using a for loop in our previous example is to use the ifelse function. The benefit of the ifelse function is that it is vectorized, meaning the condition is applied to a whole vector at once rather than to one value at a time.
The ifelse function will read in a vector, check a condition, and then assign one value if the condition is true and a different value if false.
sales$product1Met <- ifelse(sales$product1 >= 45, 1, 0)
sales$product2Met <- ifelse(sales$product2 >= 45, 1, 0)
sales$product3Met <- ifelse(sales$product3 >= 45, 1, 0)
head(sales)
##   product1 product2 product3 product1Met product2Met product3Met
## 1       36       43       35           0           0           0
## 2       39       43       41           0           0           0
## 3       50       54       37           1           1           0
## 4       40       52       48           0           1           1
## 5       47       41       36           1           0           0
## 6       45       46       44           1           1           0
While this can get repetitive if you are creating many new variables, in many cases, the ifelse function may be a sufficient option.
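When the number of variables grows, the repetition can be reduced by combining ifelse() with lapply() and assigning all the new columns in one step. The following is a sketch under stated assumptions: `sales_demo`, `products`, and `goal` are names introduced here for illustration, and the data are randomly generated in the same shape as the example above.

```r
# Made-up sales data in the same shape as the earlier example
set.seed(22)
sales_demo <- data.frame(
  product1 = sample(30:50, 100, replace = TRUE),
  product2 = sample(40:60, 100, replace = TRUE),
  product3 = sample(35:48, 100, replace = TRUE)
)

products <- c("product1", "product2", "product3")
goal <- 45

# lapply() returns one 0/1 vector per product column; assigning the
# list to several new column names adds all of them at once
sales_demo[paste0(products, "Met")] <- lapply(
  sales_demo[products],
  function(x) ifelse(x >= goal, 1, 0)
)

head(sales_demo)
```

Adding a fourth product then only requires extending `products`, rather than writing another ifelse() line.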