Introduction

In the functions chapter, we talked about how important it is to reduce duplication in your code by creating functions instead of copying and pasting. Reducing code duplication has three main benefits:
One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extracting them into independent pieces that can be easily reused and updated. Another tool for reducing duplication is iteration, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. In this chapter you’ll learn about two important iteration paradigms: imperative programming and functional programming. On the imperative side you have tools like for loops and while loops, which are a great place to start because they make iteration very explicit, so it’s obvious what’s happening. However, for loops are quite verbose, and require quite a bit of bookkeeping code that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.

Prerequisites

Once you’ve mastered the for loops provided by base R, you’ll learn some of the powerful programming tools provided by purrr, one of the tidyverse core packages.

For loops

Imagine we have a simple tibble and want to compute the median of each column. You could do it with copy-and-paste:
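A minimal sketch of the copy-and-paste approach, assuming a tibble named `df` with four columns of random normal values (the column names, sizes, and seed are illustrative):

```r
library(tibble)

set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# Assumed example data: four columns of ten random normal values
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

# One hand-written line per column
median(df$a)
median(df$b)
median(df$c)
median(df$d)
```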
But that breaks our rule of thumb: never copy and paste more than twice. Instead, we could use a for loop:
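A sketch of such a loop, assuming the same kind of four-column tibble `df` (illustrative data); the comments mark the three parts of the loop:

```r
library(tibble)

df <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

output <- vector("double", ncol(df))  # 1. output: allocate space up front
for (i in seq_along(df)) {            # 2. sequence: what to loop over
  output[[i]] <- median(df[[i]])      # 3. body: the work done on each iteration
}
output
```

`seq_along(df)` is used rather than `1:length(df)` because it also behaves sensibly for a zero-length input.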
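Every for loop has three components: the output, where space for the results is allocated before the loop starts; the sequence, which determines what to loop over; and the body, the code that does the work on each iteration.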
That’s all there is to the for loop! Now is a good time to practice creating some basic (and not so basic) for loops using the exercises below. Then we’ll move on to some variations of the for loop that help you solve other problems that will crop up in practice.

Exercises
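For loop variations

Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don’t forget about them once you’ve mastered the FP techniques you’ll learn about in the next section. There are four variations on the basic theme of the for loop: modifying an existing object instead of creating a new one, looping over names or values instead of indices, handling outputs of unknown length, and handling sequences of unknown length.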
Modifying an existing object

Sometimes you want to use a for loop to modify an existing object. For example, remember our challenge from the functions chapter. We wanted to rescale every column in a data frame:
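A sketch of the copy-and-paste version, assuming a helper named `rescale01()` that maps a numeric vector onto the range 0 to 1 (the helper and the data are illustrative):

```r
library(tibble)

df <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Hypothetical helper: rescale a numeric vector to lie between 0 and 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```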
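To solve this with a for loop we again think about the three components: the output already exists, because it is the same object as the input; the sequence is the set of columns in the data frame; and the body applies the rescaling function to one column.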
This gives us:
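A sketch of the in-place loop, again assuming a hypothetical `rescale01()` helper and illustrative data:

```r
library(tibble)

df <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# Hypothetical helper: rescale a numeric vector to lie between 0 and 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# No separate output object: each column is overwritten with its rescaled version
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}
```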
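Typically you’ll be modifying a list or data frame with this sort of loop, so remember to use [[, not [.

Looping patterns

There are three basic ways to loop over a vector. So far I’ve shown you the most general: looping over the numeric indices with seq_along() and extracting each value with [[. The other two forms loop over the elements directly, or over the names.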
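The three looping patterns can be sketched as follows, using a small named vector `x` that is purely illustrative:

```r
x <- c(a = 1, b = 2, c = 3)  # hypothetical named vector

# 1. Loop over the elements: handy when you only care about side effects
for (value in x) {
  print(value)
}

# 2. Loop over the names: handy when the name is needed, e.g. for a title
for (nm in names(x)) {
  print(paste(nm, x[[nm]]))
}

# 3. Loop over the numeric indices: the most general form
for (i in seq_along(x)) {
  name <- names(x)[[i]]
  value <- x[[i]]
  print(paste(i, name, value))
}
```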
Iteration over the numeric indices is the most general form, because given the position you can extract both the name, with names(x)[[i]], and the value, with x[[i]].

Unknown output length

Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:
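A sketch of the tempting but inefficient version, assuming three illustrative means:

```r
means <- c(0, 1, 2)  # hypothetical means for the simulated vectors

output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)                        # random length between 1 and 100
  output <- c(output, rnorm(n, means[[i]]))  # copies everything accumulated so far
}
str(output)
```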
But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get “quadratic” (\(O(n^2)\)) behaviour, which means that a loop with three times as many elements would take nine (\(3^2\)) times as long to run. A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:
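A sketch of the list-based version, with the same illustrative means:

```r
means <- c(0, 1, 2)  # hypothetical means for the simulated vectors

out <- vector("list", length(means))  # one slot per iteration, allocated up front
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])    # fill a slot; nothing is copied
}

combined <- unlist(out)               # combine once, after the loop
str(combined)
```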
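Here unlist() flattens a list of vectors into a single vector. This pattern occurs in other places too. Instead of paste()ing a long string together piece by piece, save the pieces in a character vector and combine them with paste(output, collapse = ""). Instead of rbind()ing a data frame together row by row, save the pieces in a list and combine them with dplyr::bind_rows(output).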
Watch out for this pattern. Whenever you see it, switch to a more complex result object, and then combine in one step at the end.

Unknown sequence length

Sometimes you don’t even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can’t do that sort of iteration with the for loop. Instead, you can use a while loop. A while loop is simpler than a for loop because it only has two components, a condition and a body: the body runs repeatedly for as long as the condition is TRUE.
A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can’t rewrite every while loop as a for loop:
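To make the equivalence concrete, here is a sketch that sums an illustrative vector `x` once with a for loop and once with the corresponding while loop:

```r
x <- c(10, 20, 30)  # hypothetical input

# For loop version
total_for <- 0
for (i in seq_along(x)) {
  total_for <- total_for + x[[i]]
}

# Equivalent while loop: the counter is managed by hand
total_while <- 0
i <- 1
while (i <= length(x)) {
  total_while <- total_while + x[[i]]
  i <- i + 1
}
```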
Here’s how we could use a while loop to find how many tries it takes to get three heads in a row:
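A sketch of that simulation, assuming a hypothetical `flip()` helper that returns "H" or "T" with equal probability:

```r
# Hypothetical helper: one fair coin flip
flip <- function() sample(c("T", "H"), 1)

flips <- 0   # total number of flips so far
nheads <- 0  # length of the current run of heads

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0  # a tail resets the run
  }
  flips <- flips + 1
}
flips
```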
I mention while loops only briefly, because I hardly ever use them. They’re most often used for simulation, which is outside the scope of this book. However, it is good to know they exist so that you’re prepared for problems where the number of iterations is not known in advance.

Exercises
For loops vs. functionals

For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly. To see why this is important, consider (again) a simple data frame with four numeric columns. Imagine you want to compute the mean of every column. You could do that with a for loop:
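A sketch of that loop, assuming an illustrative data frame `df` of random normal values:

```r
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]])
}
output
```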
You realise that you’re going to want to compute the means of every column pretty frequently, so you extract it out into a function:
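A sketch of the extracted function; the name `col_mean()` is an assumption chosen for illustration:

```r
# Hypothetical helper: the mean of every column of a data frame
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- mean(df[[i]])
  }
  output
}

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
col_mean(df)
```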
But then you think it’d also be helpful to be able to compute the median, and the standard deviation, so you copy and paste your function twice and replace mean() with median() and sd().
Uh oh! You’ve copied-and-pasted this code twice, so it’s time to think about how to generalise it. Notice that most of this code is for-loop boilerplate and it’s hard to see the one thing (mean(), median(), sd()) that is different between the functions. What would you do if you saw a set of functions like this:
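For instance, three functions that differ only in an exponent (the names `f1`, `f2`, and `f3` are illustrative):

```r
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
```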
Hopefully, you’d notice that there’s a lot of duplication, and extract it out into an additional argument:
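A sketch of the generalised version, where the part that varied becomes an argument `i` (the name `f` is illustrative):

```r
# The exponent, the only thing that differed, is now an argument
f <- function(x, i) abs(x - mean(x)) ^ i

f(c(1, 2, 3), 2)
```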
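You’ve reduced the chance of bugs (because you now have 1/3 of the original code), and made it easy to generalise to new situations. We can do exactly the same thing with the column summary functions, by adding an argument that supplies the function to apply to each column: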
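A sketch of that generalisation; the name `col_summary()` and its `fun` argument are assumptions chosen for illustration:

```r
# Hypothetical helper: apply `fun` to every column and collect the results
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[[i]] <- fun(df[[i]])
  }
  out
}

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
col_summary(df, median)
col_summary(df, mean)
```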
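The idea of passing a function to another function is an extremely powerful idea, and it’s one of the behaviours that makes R a functional programming language. It might take you a while to wrap your head around the idea, but it’s worth the investment. In the rest of the chapter, you’ll learn about and use the purrr package, which provides functions that eliminate the need for many common for loops. The apply family of functions in base R (apply(), lapply(), tapply(), etc.) solves a similar problem, but purrr is more consistent and thus easier to learn. The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces: first solve the problem for a single element of the list and let purrr generalise the solution to every element, then break a complex problem into small steps that can be composed together with the pipe.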
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.

Exercises
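The map functions

The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of output: map() makes a list, map_lgl() makes a logical vector, map_int() makes an integer vector, map_dbl() makes a double vector, and map_chr() makes a character vector.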
Each function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input. The type of the vector is determined by the suffix to the map function. Once you master these functions, you’ll find it takes much less time to solve iteration problems. But you should never feel bad about using a for loop instead of a map function. The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work. The important thing is that you solve the problem that you’re working on, not write the most concise and elegant code (although that’s definitely something you want to strive towards!).

Some people will tell you to avoid for loops because they are slow. They’re wrong! (Well at least they’re rather out of date, as for loops haven’t been slow for many years.) The chief benefit of using functions like map() is not speed, but clarity: they make your code easier to write and to read.

We can use these functions to perform the same computations as the last for loop. Those summary functions returned doubles, so we need to use map_dbl():
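A sketch with a small illustrative data frame `df`; each call returns a named double vector with one entry per column:

```r
library(purrr)

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 10))  # hypothetical data

means <- map_dbl(df, mean)
medians <- map_dbl(df, median)
sds <- map_dbl(df, sd)

means
```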
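Compared to using a for loop, the focus is on the operation being performed (i.e. mean(), median(), sd()), not on the bookkeeping required to loop over every element and store the output.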
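There are a few differences between the map_*() functions and a hand-written column summary function: the second argument, the function to apply, can also be a formula, a character vector, or an integer vector; additional arguments are passed along to the function each time it is called via ...; and the map functions preserve names.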
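Shortcuts

There are a few shortcuts that you can use with .f, the function argument of the map functions, to save a little typing. The syntax for creating an anonymous function in R is quite verbose, so purrr provides a convenient shortcut: a one-sided formula. Inside the formula, . acts as a pronoun: it refers to the current list element. When you’re looking at many models, you might want to extract a summary statistic like the \(R^2\). To do that we need to first run summary() and then extract the component called r.squared: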
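A sketch using the built-in `mtcars` data, split by number of cylinders; the particular model (`mpg ~ wt`) is an assumption chosen for illustration:

```r
library(purrr)

# Fit the same linear model to each cylinder group, using the formula shortcut
models <- map(split(mtcars, mtcars$cyl), ~ lm(mpg ~ wt, data = .))

# Run summary() on each model, then pull out r.squared with another formula
r2 <- map_dbl(map(models, summary), ~ .$r.squared)
r2
```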
But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.
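A sketch of the string shortcut, again assuming models of `mpg ~ wt` fitted to each cylinder group of `mtcars`:

```r
library(purrr)

models <- map(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))
summaries <- map(models, summary)

# A string extracts the component with that name from every element
r2_string <- map_dbl(summaries, "r.squared")

# Same result as spelling out the extraction by hand
r2_manual <- map_dbl(summaries, function(s) s$r.squared)
```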
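You can also use an integer to select elements by position: for example, map_dbl(x, 2) extracts the second element of each component of x.

Base R

If you’re familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions: lapply() is basically identical to map(); sapply() is a wrapper around lapply() that simplifies its output, which makes the result type hard to predict inside a function; and vapply() is a safe alternative to sapply() because you supply an additional argument that defines the output type.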
I focus on purrr functions here because they have more consistent names and arguments, helpful shortcuts, and in the future will provide easy parallelism and progress bars.

Exercises
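Dealing with failure

When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail. When this happens, you’ll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn’t ruin the whole barrel? In this section you’ll learn how to deal with this situation with a new function: safely(). safely() is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function never throws an error. Instead, it always returns a list with two elements, result and error.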
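(You might be familiar with the try() function in base R. It’s similar, but because it sometimes returns the original result and sometimes returns an error object, it’s more difficult to work with.) Let’s illustrate this with a simple example: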
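A sketch that wraps `log()`; the name `safe_log` is illustrative:

```r
library(purrr)

safe_log <- safely(log)

ok <- safe_log(10)    # succeeds: result is filled in, error is NULL
bad <- safe_log("a")  # fails: result is NULL, error holds the condition

str(ok)
str(bad)
```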
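When the function succeeds, the result element contains the result and the error element is NULL. When the function fails, the result element is NULL and the error element contains an error object. safely() is designed to work with map():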
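A sketch with an illustrative input list `x` in which the last element cannot be logged:

```r
library(purrr)

x <- list(1, 10, "a")  # hypothetical input; "a" makes log() fail

# Every element yields a list with `result` and `error`, so map() never aborts
y <- map(x, safely(log))
str(y)
```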
This would be easier to work with if we had two lists: one of all the errors and one of all the output. That’s easy to get with purrr::transpose():
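A sketch of the reshaping step, using the same illustrative input list:

```r
library(purrr)

x <- list(1, 10, "a")  # hypothetical input
y <- map(x, safely(log))

# Turn a list of (result, error) pairs into a pair of lists
y <- transpose(y)
str(y)
```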
It’s up to you how to deal with the errors, but typically you’ll either look at the inputs that produced an error, or work with the outputs that are ok:
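A sketch of both options, with the same illustrative input; the name `is_ok` is an assumption:

```r
library(purrr)

x <- list(1, 10, "a")  # hypothetical input
y <- transpose(map(x, safely(log)))

# TRUE where no error was recorded
is_ok <- map_lgl(y$error, is.null)

failed_inputs <- x[!is_ok]               # inputs that produced an error
good_outputs <- unlist(y$result[is_ok])  # results that succeeded
```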
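Purrr provides two other useful adverbs: possibly(), which always succeeds like safely() but returns a default value that you supply when there is an error, and quietly(), which captures printed output, messages, and warnings instead of errors.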
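A sketch of both adverbs applied to `log()`, with illustrative inputs:

```r
library(purrr)

x <- list(1, 10, "a")  # hypothetical input

# possibly(): substitute NA for any element that errors
logs <- map_dbl(x, possibly(log, NA_real_))

# quietly(): capture the warning from log(-1) instead of signalling it
quiet <- quietly(log)(-1)
str(quiet)
```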
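Mapping over multiple arguments

So far we’ve mapped along a single input. But often you have multiple related inputs that you need to iterate along in parallel. That’s the job of the map2() and pmap() functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with map():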
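A sketch with three illustrative means; because `n` is passed by name, each element of `mu` is matched to the `mean` argument of `rnorm()`:

```r
library(purrr)

mu <- list(5, 10, -3)  # hypothetical means

draws <- map(mu, rnorm, n = 5)
str(draws)
```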
What if you also want to vary the standard deviation? One way to do that would be to iterate over the indices and index into vectors of means and sds:
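A sketch of the index-based version, with illustrative means and standard deviations:

```r
library(purrr)

mu <- list(5, 10, -3)    # hypothetical means
sigma <- list(1, 5, 10)  # hypothetical standard deviations

# Map over positions, then look up the matching mean and sd by hand
draws <- map(seq_along(mu), ~ rnorm(5, mu[[.x]], sigma[[.x]]))
str(draws)
```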
But that obfuscates the intent of the code. Instead we could use map2(), which iterates over two vectors in parallel:
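A sketch with the same illustrative means and standard deviations:

```r
library(purrr)

mu <- list(5, 10, -3)    # hypothetical means
sigma <- list(1, 5, 10)  # hypothetical standard deviations

# mu and sigma vary per call; n = 5 is the same for every call
draws <- map2(mu, sigma, rnorm, n = 5)
str(draws)
```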
Note that the arguments that vary for each call come before the function; arguments that are the same for every call come after. Like map(), map2() is just a wrapper around a for loop:
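A simplified sketch of the idea, not purrr's actual implementation; the name `map2_sketch()` is hypothetical, and it ignores names and recycling:

```r
# Hypothetical, stripped-down stand-in for map2()
map2_sketch <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)  # pair up elements by position
  }
  out
}

map2_sketch(list(1, 2), list(10, 20), `+`)
```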
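You could also imagine map3(), map4(), map5(), and so on, but that would get tedious quickly. Instead, purrr provides pmap(), which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples: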
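A sketch with illustrative values; the list is unnamed, so its elements are matched to the arguments of `rnorm()` by position (`n`, `mean`, `sd`):

```r
library(purrr)

n <- list(1, 3, 5)       # hypothetical sample sizes
mu <- list(5, 10, -3)    # hypothetical means
sigma <- list(1, 5, 10)  # hypothetical standard deviations

args1 <- list(n, mu, sigma)
draws <- pmap(args1, rnorm)
str(draws)
```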
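If you don’t name the list’s elements, pmap() will use positional matching when calling the function. That’s a little fragile, and makes the code harder to read, so it’s better to name the arguments. That generates longer, but safer, calls. Since the arguments are all the same length, it makes sense to store them in a data frame: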
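A sketch of the named, data-frame-based version; the column names must match argument names of `rnorm()`, and the values are illustrative:

```r
library(purrr)

# One row per call; each column is matched to an argument by name
params <- data.frame(
  mean = c(5, 10, -3),
  sd = c(1, 5, 10),
  n = c(1, 3, 5)
)

draws <- pmap(params, rnorm)
str(draws)
```

Because matching is by name, the column order no longer matters.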
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns.
Invoking different functions

There’s one more step up in complexity: as well as varying the arguments to the function, you might also vary the function itself.
To handle this case, you can use invoke_map():
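Recent purrr releases (1.0.0 and later) deprecate `invoke_map()`, so the sketch below swaps in `map2()` with base R's `do.call()` to get the same effect. The function names and parameters are illustrative:

```r
library(purrr)

f <- c("runif", "rnorm", "rpois")  # hypothetical functions to call
param <- list(
  list(min = -1, max = 1),
  list(sd = 5),
  list(lambda = 10)
)

# Call each function with its own arguments plus n = 5, shared by every call
out <- map2(f, param, function(fn, args) do.call(fn, c(args, list(n = 5))))
str(out)
```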
The first argument is a list of functions or a character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function. And again, you can use tribble() to make creating these matching pairs a little easier.
Walk

Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here’s a very simple example:
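A sketch with an illustrative list `x`; `walk()` prints each element and then returns its input invisibly:

```r
library(purrr)

x <- list(1, "a", 3)  # hypothetical input

# print() is called for its side effect; the return value is just x
res <- walk(x, print)
```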
Other patterns of for loops

Purrr provides a number of other functions that abstract over other types of for loops. You’ll use them less frequently than the map functions, but they’re useful to know about. The goal here is to briefly illustrate each function, so hopefully it will come to mind if you see a similar problem in the future. Then you can go look up the documentation for more details.

Predicate functions

A number of functions work with predicate functions that return either a single TRUE or FALSE.
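A sketch of a few purrr functions that take a predicate, applied to an illustrative mixed list:

```r
library(purrr)

x <- list(1, "a", 3, "b")  # hypothetical input

numbers <- keep(x, is.numeric)       # elements where the predicate is TRUE
others <- discard(x, is.numeric)     # elements where the predicate is FALSE
any_chr <- some(x, is.character)     # is the predicate TRUE for any element?
all_chr <- every(x, is.character)    # is the predicate TRUE for every element?
first_chr <- detect(x, is.character) # first element where the predicate is TRUE
```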
Reduce and accumulate

Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This is useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
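A sketch with three small illustrative data frames that share a `name` column:

```r
library(purrr)
library(dplyr)

dfs <- list(
  age = data.frame(name = "John", age = 30),
  sex = data.frame(name = c("John", "Mary"), sex = c("M", "F")),
  trt = data.frame(name = "Mary", treatment = "A")
)

# Join the first two tables, then join that result with the third
joined <- reduce(dfs, full_join, by = "name")
joined
```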
Or maybe you have a list of vectors, and want to find the intersection:
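A sketch with three illustrative vectors; `accumulate()` is shown alongside because it works the same way but keeps every intermediate result:

```r
library(purrr)

vs <- list(
  c(1, 3, 5, 6, 10),
  c(1, 2, 3, 7, 8, 10),
  c(1, 2, 3, 4, 8, 9, 10)
)

# Intersect the first two vectors, then intersect that result with the third
common <- reduce(vs, intersect)

# accumulate() keeps the running results: here, a cumulative sum
running <- accumulate(c(1, 2, 3, 4), `+`)
```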
Exercises