## Tutorial 2: More on Vectors, Data Frames, and Functions
## Clinic on Meaningful Modeling of Epidemiological Data
## International Clinics on Infectious Disease Dynamics and Data Program
## African Institute for Mathematical Sciences, Muizenberg, Cape Town, RSA
##
## David M. Goehring 2004
## Juliet R.C. Pulliam 2008, 2009
## Steve Bellan 2010, 2012
## Carl Pearson 2024
## Stanley Sayianka, Reshma Kassanjee 2025
##
##
## Some Rights Reserved,
## CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)

## By the end of this tutorial you should:
## * Be able to retrieve useful subsets of your data
## * Understand more about data frames
## * Know the methods and uses of logical values in R
## * Be able to generate and use factors
## * Know how to write your own generic functions
##
## Section A. Accessing Vector Elements
##
## Subsection A1. Beyond Numbers: Relational and Logical Operations in R
##
## So far everything you have done in R has involved numbers or
## vectors of numbers. To properly exploit R’s complexity, you need to
## become familiar with relational and logical operations in R.
##
## Relational operations work just like numerical operations, in terms
## of how they are processed. Return for a moment to our first
## calculation from the last tutorial, an addition problem:
3 + 4
## The analogous calculation of a single relational operation is
## something like
5 > 4
## "Is 5 greater than 4?" Yes. And R tells you that this is a TRUE
## statement. Or,
1 + 1 < 1
## Makes sense, right?
##
## The greater-than, >, and less-than, <, symbols are
## straightforward. Similarly, R has greater-than-or-equal-to and
## less-than-or-equal-to symbols, >= and <=, respectively.
##
## Slightly less intuitive are the relational operators for equality,
## ==, and inequality, !=. Try
x <- 4
x == 1 + 3
y <- x != 4
## This last example demonstrates that variables can hold logical
## values.
## These relational operators also operate on logical values,
## as in,
y == FALSE
## Logical operations are operations that only make sense when
## performed on TRUE and FALSE values. These will likely be familiar
## to you, the central operations being AND, OR, and NOT.
##
## The operators used in R are standard: &, |, and !,
## respectively. Let’s see them in action:
!TRUE
to_be <- FALSE
to_be | !to_be
FALSE & (TRUE | FALSE)
## By combining logical and relational operations, we can make complex
## inquiries about values.
##
## Hands off the keyboard! Pick up a writing implement…
##
## a <- TRUE != (4 > 3)
## b <- a | 1 + 1 == 4 - 2
## c <- !FALSE & (log(Inf) == Inf + 1)
##
## What do a, b, and c equal? Now execute the commands and compare
## your answers.

## Note that R has special values for infinity (Inf), not-a-number
## (NaN), and not-available (NA). These generally behave sensibly – a
## mathematical operation on not-a-number is obviously not a number and so is
## returned as NaN. Things are less simple when using logical and relational
## operators. Consider 4 != NaN. In one respect, the answer perhaps
## should be TRUE; that is, 4 definitely isn’t equal to
## not-a-number. But, striving for consistency, R returns NA, much as
## it would for a mathematical operation. Even worse is the situation
x <- NaN
x == NaN
## You might think that this is a reasonable test for whether x has a
## numerical value, but it won’t work for the same reason mentioned above.
## In general, keep this trickiness in mind and remember there is a special
## function is.nan() for determining whether x is defined as not-a-number:
is.nan(x)
## There is also a special function is.finite() for determining whether
## x is a valid (finite) number:
is.finite(x)
## This is all getting thrown at you in very quick succession,
## especially if you do not have much experience programming in other
## languages.
## It is worth noting that information about these
## operations can be pulled up at any time by typing help("&") or
## help(">") or using the help() function with any of the other
## symbols used in these operations.
##
## Subsection A2. Vectors of Logical Values
##
## As a shorthand, TRUE and FALSE can be entered as T and F. This
## allows for rapid entry of vectors of logical values, for example:
logical_vec <- c(TRUE, TRUE, FALSE, TRUE)
logical_vec
## Unfortunately, and rather inexplicably, T and F can be reassigned
## to any arbitrary values. This will render most code utterly
## unpredictable. So, never, never, never do this:
T <- 4 # REALLY BAD, BUT NO ERROR IS PRODUCED
## And, if you ever do something like this (though you shouldn’t!),
## make sure you quickly do this:
rm(T)
## which will set T (or F) back to its default logical value.

## Aside: to ensure your code is robust, we recommend always spelling out
## TRUE and FALSE for logical values. If you have a library like `lintr`
## installed, you can use it to check your code:
## `lintr::lint("your_code.R")`.
##
## Relational or logical operations also act on vectors to produce
## vectors of logical values, as in,
x <- rnorm(10)
x < 0
y <- (x > -.5) & (x < .5)
!y
## This will be especially handy when we look at the concept of
## indexing, below.
##
## Subsection A3. Generating Sequences
##
## There are many occasions in R when you need a patterned sequence of
## numbers. As mentioned in the previous tutorial, most counting can
## be accomplished by use of the seq() function. If you haven’t
## already done so, it is worth taking a look at the help-file on
## seq() because it has a few arguments that can make your life
## easier.
?seq
## For example, seq() can generate a vector of a certain length
## between certain endpoints by typing
x <- seq(0, 1, length.out = 20)
## giving you a vector of length 20 between 0 and 1; to confirm this, type
length(x)
## A very common need in R is to generate vectors with an interval of
## 1 between each element. R has a shorthand for this using the colon
## notation, as follows:
y <- 5:10
## This generates a vector that counts from 5 to 10, inclusive. Note that
## : is evaluated before the arithmetic operators +, -, *, and /, so
## 1:3 + 1 is the same as (1:3) + 1.
##
## Don’t underestimate the value of the colon notation. Even for
## typing a vector of length 2, like "(1,2)" or "(2,1)," using the c()
## function to generate the vector is pretty tedious (e.g., c(1,2)).
## These vectors can be generated in three quick characters by typing
## 1:2 or 2:1, respectively. I will also draw your attention to the
## rep() function, for repeating sequences, which can also save time.
##
## Subsection A4. Indexing
##
## R has an incredibly useful way of accessing items from a
## data set. Each item in a data set has its own index, or numbered
## location, in the object’s structure. Square brackets are used to
## extract an item or items from a data set, but it is crucial to
## understand that there are two completely distinct ways in which
## brackets are used to access items. I will consider the two methods
## for accessing a vector of length n in turn below.
##
## The first option: Logical
## Requirements: Logical vector of length n
## Use it for: Finding a subset of data based on a rule
##
## Logical indexing works as if you’ve asked your indexing vector the
## question, "Do you want this item?" for each of the items in the
## vector.
x <- 1:5
x[c(TRUE, FALSE, FALSE, FALSE, TRUE)]
## If we combine this logical indexing with the relational and logical
## operators you learned above, we have an exceptionally powerful tool
## to retrieve data that meet any set of criteria.
y <- rnorm(10000)
hist(y[!((y > -2) & (y < 0))])
## I will give more insight below when I discuss indexing data
## frames. Stay tuned.

## In any operation in R, vectors will be automatically repeated until
## they reach the necessary length for the operation to make
## sense. For example, note the results of
1:6 + 1:2
## The same repetition holds for logical vectors. For this reason, you should
## be very cautious using operations on vectors that differ in length.
##
## The second option: Numerical
## Requirements: Value or vector of any length with values
## (1 to n) OR (-n to -1)
## Use it for: Single item retrieval or shuffling, sorting, and repeating
##
## Accessing single items with brackets and a single index should be
## straightforward
x <- 3 * (0:5)
x[4]
## One tedious way of creating a new vector of values from a vector’s
## elements would be
c(x[2], x[3], x[4]) # TEDIOUS
## So R makes it much easier by allowing a vector of indices to
## generate a vector. Thereby, the command above becomes
x[2:4]
## There is nothing preventing you from accessing any element any
## number of times.
x[c(2, 2, 2, 5, 5, 5)]
## Additionally, R allows you to use negative indices, indicating
## which items you want to exclude, as in,
x[c(-1, -6)]
## This is fine and productive as long as you remember never to mix
## negative and positive indices – R will not know what you want it to
## do:
x[c(-1, 4)] # BAD CODE: this produces an error

## Subsection A5. Sorting
##
## In Tutorial 1, you were introduced to the sort() function, which is
## handy.
##
## Now that you have been introduced to indexing, you may have an
## inkling of how much more powerful the sorting functions of R can
## become.
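##
## Aside: before moving on, here is a minimal recap sketch of two points
## from Subsection A4. It is an illustration only, not part of the original
## exercises, and the names v, keep, by_logical, by_number, and recycled are
## arbitrary choices. It shows that a logical index and a numerical index
## can pick out the same items, and that R still returns a result (with a
## warning) when it recycles a vector whose length does not divide evenly
## into the longer one.

```r
v <- 3 * (0:5)               # the values 0 3 6 9 12 15

## Logical indexing: keep the items greater than 4 and less than 13
keep <- (v > 4) & (v < 13)
by_logical <- v[keep]

## Numerical indexing: the same items sit at positions 3, 4, and 5
by_number <- v[3:5]
identical(by_logical, by_number)

## Recycling with mismatched lengths: 1:2 is repeated as 1 2 1 2 1,
## and R issues a warning rather than stopping
recycled <- 1:5 + 1:2
recycled
```

## The warning in the last step is easy to miss in a long script, which is
## one reason to be cautious with vectors that differ in length.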
##
## As an introduction, let’s say you have a 4-element vector,
my_vector <- 5:8
## Using numerical indexing, we can manually re-order this vector by
## calling each of its indices once in our preferred order, for
## example
my_vector[c(2, 3, 4, 1)]
## or, for a quick reversal
my_vector[4:1]
## Now, manually generating the vector of indices is not monumentally
## useful, which is where the function order() comes in. As a
## demonstration, imagine we have a vector of student names and a
## corresponding vector of student heights (in meters).
student_names <- c(
  "Dario", "Hloniphile", "Steve", "Innocent", "Abigail", "Cynthia"
)
student_heights <- rnorm(6, 1.7, .12)
## What we definitely don’t want to do is to perform sort() on each of
## these vectors independently. This will eliminate the pairing of the
## name to the height. So how can we sort one vector and have the
## other vector align correctly? Try order() on the names,
order(student_names)
## Note that it returns the indices in the right order, not the values
## themselves.
##
## From what you learned above, you know it is now an easy matter to
## sort both of our vectors, as follows,
student_names[order(student_names)] # same effect as sort()
student_heights[order(student_names)]
## And, obviously, sorting the names by the heights is exactly
## analogous, and it will make for a pretty plot
barplot(
  student_heights[order(student_heights)],
  names.arg = student_names[order(student_heights)],
  ylab = "Height (m)",
  main = "Student Heights"
)
## I have conveniently skipped over an important concept, because R
## handles it fairly intuitively, but I want to mention the
## terminology. The variable student_names and the results of ls(), for
## example, are called vectors of strings, or character vectors. R
## handles them conveniently, so we don’t need to worry too much about
## them, but knowing the terminology will improve your understanding
## of R's in-line help documents.
##
## Section B. Data Frames and Alternatives
##
## Subsection B1. Data frames
##
## Before we cover advanced topics of data frames, I want to point out the
## function data.frame(), which puts data together to form data
## frames. This is a key alternative to using the prefab data frames
## that you used in the previous tutorial.
##
## First I want to generate a vector of student class-years to
## correspond to the student_names before creating a data frame (Freshmen
## as 1, Sophomores as 2, etc.).
student_years <- c(4, 2, 2, 3, 1, 3)
## Now making a data frame is easy (each argument will just add more
## columns to the data), the only trick being that we have to assign
## the constructed data frame to a variable, as follows:
student_df <- data.frame(student_names, student_heights, student_years)
student_df
## Voila! Your own data frame.
##
## You may want to have better column headings than the redundant variable
## names. There are various options to accomplish this. One option is
## to use the names() function with assignment notation. Let’s take a look:
names(student_df)
## What we see is a vector of strings corresponding to the current column
## names. We can change these by assigning replacement strings to the
## indexed values or by substituting our own vector of strings.
names(student_df) <- c("names", "heights", "years")
student_df
## If we think that "years" is ambiguous and might be confused with a student's
## age, we could rename just that column using numerical indexing, e.g.:
names(student_df)[3] <- "class.years"
student_df
## There is also a similar option, row.names(), to access and modify
## the row names. By default, the row names are a series of integers indicating
## the row number:
row.names(student_df)
## The assignments above are the first of many examples in R that seem
## to defy logic: it seems as though we’re assigning something to a
## function, which shouldn’t make sense because a function isn’t a
## variable.
## In fact, you can think of the functions names() and
## row.names() as "access functions" – they do not perform an action,
## but merely grant access to a property of the argument variable, and
## this is why we can make assignments of the sort seen above.
##
## Subsection B2. Indexing data frames
##
## As with vectors, brackets and logical or numerical vectors are still
## the way to access data frames, but with a slight complication,
## because data frames are multidimensional. The solution (which also
## holds for matrices, etc.) is to separate the two dimensions with a
## comma. R treats the first entry as the row number and the second
## entry as the column number; thus, to access the second column of
## the fourth row, type
student_df[4, 2]
## Or the second column of the last three rows,
student_df[4:6, 2]
## There are two further complications.
##
## To access an entire row or entire column, leave the index blank, as
## in,
student_df[, 1] # FIRST COLUMN
student_df[3, ] # THIRD ROW
student_df[, ] # ENTIRE FRAME, equivalent to "student_df"

## Subsection B3. Tidyverse
##
## In R, data frames are a common and versatile way to store and work with
## tabular data. However, there are other object types that you may find useful.
## For example, tibbles (from tidyverse) behave like data frames but print more
## cleanly and are widely used in modern R workflows. Data tables (from the
## data.table package) are another alternative, designed for speed and efficient
## handling of large datasets. Choosing the right structure depends on the
## context of your analysis and the tools you prefer to use.
##
## In the last tutorial, we learned about the tidyverse.
## The tidyverse is a group of R packages such as dplyr, stringr, ggplot2,
## and more.
## You can find the full list here: https://www.tidyverse.org
##
## Instead of loading all tidyverse packages using the library(tidyverse) command,
## we can load only the one we need. For now, we just need the dplyr package.
##
## We now return to indexing data frames, repeating what we did above
## using dplyr to access rows and columns in a data frame; and extending our
## examples, demonstrating both base R and dplyr implementations.
library(dplyr)
## To select a column, we use the select() function.
## The first argument is the data frame, the second is the column we want.
## We can use either the column number or the column name.
select(.data = student_df, 1) # selects the first column by position
select(.data = student_df, names) # selects the column called 'names'
## In select(), the .data argument is where we put the data frame.
## The next argument is the column(s) we want to choose.

## We can also use the pipe symbol (|>) to make the code easier to read.
## Think of the pipe as saying "and then".
## For example, instead of writing select(.data = student_df, 1),
## we can write:
student_df |> select(1)
## This means: "take student_df and then select column 1"
student_df |> select(1) # selects the first column
student_df |> select(names) # selects the column called 'names'
## If we want to save the result into a new data frame, we can assign it like this:
student_names_data <- student_df |> select(names)
## The select() function gives us a one-column data frame.
## If we want just a vector (not a data frame), we can use the pull() function.
student_df |> pull(1) # gets the first column as a vector
student_df |> pull(names) # gets the 'names' column as a vector
## To get a specific row instead of a column, we use the slice() function.
student_df[3, ] # base R: gets the third row
student_df |> slice(3) # dplyr: gets the third row
## We can also use the values in names() or row.names() as indices:
student_df["4", ]
student_df[, "class.years"]
## Putting all of this together, we can quickly generate subsets of
## our data.
## For example, we can create a data frame that includes
## only the students with height greater than the mean height:
tall_students <- student_df[student_df$heights > mean(student_df$heights), ]
tall_students
## A tidyverse-style way to filter rows is to use the `filter()` function.
## This helps us select only the rows that meet a condition.
tall_students <- filter(student_df, heights > mean(heights))
tall_students
## Inside `filter()`, we do not use the `$` operator (like student_df$heights).
## This is because `filter()` automatically looks for column names
## within the data frame you pass as the first argument.

## The same code using a pipe:
tall_students <- student_df |> filter(heights > mean(heights))
tall_students
## Or sort our data by various aspects:
student_df[order(student_df$class.years), ]
## Alternatively, using the `arrange()` function from dplyr, we have
student_df |> arrange(class.years) # in ascending order
student_df |> arrange(desc(class.years)) # in descending order

## Subsection B4. Introduction to factors
##
## When performing statistical analyses, we often want R to look at a
## set of data and compare groups within the data to one another. For
## example, you have the data frame containing data on students in a
## course. There are columns of data representing the students' heights
## and class.years. How can you look at the means of height by class.year?
##
## Or, another example, you have sampled a number of rabbits and have
## a column for weights before a diet treatment and a column for
## weights after a diet treatment and a third column stating the diet
## treatment (e.g., "none," "grain diet," and "grapefruit diet"). How
## can you evaluate the change in weight as affected by diet?
##
## The answer to these questions is to use factors.
##
## Many of the data sets that come with R already have their data
## interpreted as factors.
## Let’s take a look at a data set with
## factors:
data(moths, package = "DAAG")
help(moths, package = "DAAG")
moths
## (Note that you may have to install the DAAG package in order to
## load these data. Do you remember how to do this? If not, ask a neighbor
## for help!)
##
## The help file for the moths data set tells us that our last column,
## habitat, is a factor. What does this mean?
##
## See what happens when we pull up this column by itself:
moths$habitat
## It looks pretty standard, at first, but then we notice that it is
## more than just a list of habitat names – it has another component,
## levels.
##
## Factors have levels. Levels are editable, independent of the data
## itself. To see the levels alone, you can type
levels(moths$habitat)
## When called that way, it has the identity of a vector of strings.
##
## The levels() function behaves just like the names() and row.names()
## functions (i.e., weird), and you can make assignments or
## reassignments to the levels - e.g.,
levels(moths$habitat)[1] <- "NEBank"
## Factors come in exceptionally handy when performing statistical
## tests, but the various plot functions can give you an idea of uses
## of a factored variable, such as,
boxplot(moths$meters ~ moths$habitat)
## The tilde, ~, used in a number of contexts in R, can generally be
## read as "by," which gives a general explanation of its use here –
## visualizing transect length ("meters") by habitat type ("habitat").
##
## Recall that we used the ggplot2 package to make prettier plots.
## We will replicate the `boxplot` above.
## theme_bw is a ggplot "style" that makes plots a bit less fancy and
## sharper; you may think it looks more scientific.
library(ggplot2)
theme_set(theme_bw())
ggplot(data = moths) +
  geom_boxplot(mapping = aes(x = habitat, y = meters))
##
## Subsection B5. Making a factor
##
## Now that you know how to employ a factored variable
## the next step is to know how to make a factor out of a
## variable.
## The general syntax is:
x <- factor(c("A", "B", "A", "A", "A", "B"))
## For vectors of strings, like that one, the results are usually fine
## as is.
##
## But let’s go back to our student_df data frame. We listed
## class.years as a number 1 through 4, but these are discrete
## categories with well-defined names. A more elegant solution is to
## factor the column of the data frame, much like is seen with moths.
student_df$class.years <- factor(student_df$class.years)
levels(student_df$class.years)
## Not ideal, but we can use reassignment to change the names of the
## years.
levels(student_df$class.years) <- c(
  "Freshman", "Sophomore", "Junior", "Senior"
)
## With satisfying (preliminary) results available with:
student_df
boxplot(student_df$heights ~ student_df$class.years)
## Try to replicate the above plot using ggplot2.

## Subsection B6. Applying functions to data frames
##
## Many functions you might like to apply to your data frames will
## produce unpredictable results.
##
## A few work nicely:
nyc_air <- airquality[, c("Wind", "Temp")]
nyc_air
summary(nyc_air)
## But others that you might try do not work as you want:
sum(nyc_air) # sums wind and temperature together
mean(nyc_air) # returns NA with a warning
## FIXME (or don't, see below)

## One solution to these troubles is to use the function apply(),
## which performs the function named in the third argument on the
## first argument by the index specified by the second argument
## (1 for rows, 2 for columns; in this case, by column).
apply(nyc_air, 2, sum)
apply(nyc_air, 2, mean)
apply(nyc_air, 2, var)

## Section C. Composing your own functions
##
## A more advanced (and very important) topic
##
## So far in R we have used the functions that come with R and its various
## packages; however, since you will often want to perform the same series
## of actions on different objects, R makes it relatively straightforward
## to compose your own generic functions and store them in R’s memory.
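##
## As a preview of where this section is heading (the syntax is explained
## step by step below), here is a minimal sketch of packaging a repeated
## series of actions as a function. The name col_spread() is a hypothetical
## helper, not part of base R or of the original tutorial; it reuses apply()
## from Subsection B6 and assumes every column is numeric with no missing
## values.

```r
## Hypothetical helper: the spread (maximum minus minimum) of each column
col_spread <- function(df) {
  apply(df, 2, max) - apply(df, 2, min)
}

## The same series of actions, applied to two different objects
nyc_air <- airquality[, c("Wind", "Temp")]
spread_air <- col_spread(nyc_air)
spread_air

spread_cars <- col_spread(mtcars[, c("mpg", "wt")])
spread_cars
```

## Once defined, col_spread() can be called on any suitable data frame
## without retyping the two apply() calls.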
##
## Before you start writing a function you need to have your mind set
## on three things:
##
## * What you want to give the function as input
## * What you want the function to do
## * What you want the function to give as output
##
## Subsection C1. A trivial example
##
## Imagine you need to repeatedly transform sets of data, but your
## transformation is "non-standard." For this example, I’m imagining that you
## want the natural logarithm of the data, plus one. We know how to perform
## these operations on a number we have stored in a variable, no problem,
x <- 1:10
log(x) + 1
## But what we would really like is a named function which will do
## this in one step, log.plus.one().
##
## What we will do is make an assignment to log.plus.one, but rather
## than assigning a value (or vector, etc.), we assign a function
## which we define on the spot. We use the command function, which
## looks like a function but is not a function. Instead, it’s
## a control element of the R language – it isn’t executed like a
## function, but rather it informs R to treat the code around it in a
## special way.
##
## The command function has an interesting syntax. Its arguments are
## the names of variables which will serve as the arguments for your
## function (the first of three bullets, above). Then, after this
## parenthetical bit, comes the meat of the function – what you want
## it to do and what you want it to give back to you (the last two
## bullets, above). In our log.plus.one() case, what we want it to do
## and what we want it to give back happen to be the same thing,
## therefore we can define it very simply, as follows,
log.plus.one <- function(y) log(y) + 1
## Cool! Let’s test it out:
log.plus.one(x)
## It behaves just like we would want it to.
##
## Aside: you may also see short functions defined using the "lambda"
## syntax (available in R 4.1.0 and later):
log.plus.one <- \(y) log(y) + 1

## Subsection C2. A separate little world
##
## Wait a second.
## I used y in my function definition but called the
## function with my variable x as the argument. What happened to y?
y
## The variable is untouched by the function.
##
## In order to keep functions fully generic, when you give the
## function command, R generates a separate, untouchable variable
## space which has no interactions with your R workspace. This means
## that the names of your function arguments (and any variables
## assigned within your function) can be anything you find convenient
## – there is never any risk of a conflict with your active variables.
##
## Subsection C3. Longer functions
##
## Either because the function is too complex to be
## executed on a single line or because you want to make the
## function’s methods clearer, you will often generate functions
## longer than one line. For this purpose, R introduces another type
## of bracket, curly brackets, { }. These are control brackets and
## indicate the contents should be treated as a unit.
##
## As a final example,
(function(x, y) { z <- x^2 + y^2;
  x + y + z })(0:7, 1)
## Note that the function is written on two lines, but this isn’t an
## issue because of the brackets. Note also that this function is
## anonymous. It is never assigned, but used in place.
##
## A common tendency when first learning to program is to write code
## in a condensed form (such as the anonymous inline function defined
## above) so that it is difficult to follow what is going on when you
## return to the code later on (or when your instructor is helping you
## find a bug that is keeping your code from working correctly). While
## writing code in this way takes a certain amount of cleverness and
## demonstrates that you have understood the concepts, it is better
## practice to write out your code so that it is easy to follow. This
## includes using plenty of whitespace, to make your code easy to
## read and thoroughly commenting your commands as you go.
## The example above is therefore better written as follows:

## @title Sum Values and Sum of Their Squares
## @description
## A function that takes two numerical values as input and returns the sum of
## the values plus the sum of their squares
##
## @param x A numerical vector
## @param y A numerical vector
##
## @details Note that `x` and `y` must have compatible lengths under R's
## recycling rules
##
## @return A numerical vector, x + y + x^2 + y^2
sum_vals_plus_sum_sqs <- function(x, y) {
  z <- x^2 + y^2 # define z as the sum of the values’ squares
  ss <- x + y + z # add the values to the sum of their squares
  return(ss) # and return the result as output
}
## Perform the above function with x equal to the numbers from 0 to
## 7 and y equal to 1:
sum_vals_plus_sum_sqs(0:7, 1)

## Benchmark Questions
##
## This concludes Tutorial 2. Because there are some advanced topics
## here that require practice to get your head around, you should
## make sure to work through the benchmark questions before you
## move on to Tutorial 3.
##
## Question: Semantics?
##
## R sometimes uses confusingly similar names for distinct concepts.
## Define for yourself: names, factors, levels. When would you use each?
##
## Question: Subsetting
##
## You need a subset of the mtcars data set that has only every other
## row of data included.
## a. Do this with numerical indexing.
## b. Do this with logical indexing.
##
## Question: Function Definition
##
## Write a function, `jumble()`, that takes a vector as an argument and
## returns a vector with the original elements in random order. (Note: R
## does have a built-in function that can do this, but the point of this
## question is rather for you to build a new function using the tools that
## have been introduced in the tutorials so far.)