Objects & Functions

This chapter will introduce the concept of an object in R. It will also serve as an introduction to functions, which are also objects. This chapter will give a broad overview of these two ideas. Subsequent chapters will reveal their details. The following quotation should highlight just how important these ideas are.

To understand computations in R, two slogans are helpful:

  • Everything that exists is an object.
  • Everything that happens is a function call.

John Chambers

After reading these notes you should be able to:

Creating Objects

In the previous chapter, you created a number of objects. Everything in R is an object, so of course they were objects.1 In this chapter, we’re going to mostly ignore the details of objects, and instead give you a general sense of what they are and what they can be used for. In the most general sense, objects are data stored in memory that R can access.

Although it is a massive oversimplification, you can broadly group objects into two categories:

  • Objects that store data such as numbers, strings, and logical values.
  • Objects that store code, which for the most part you can think of as functions.2

Let’s create some objects to demonstrate:

42 # a number
[1] 42
"STAT 385" # a string
[1] "STAT 385"
TRUE # a logical value
[1] TRUE

Running the above code, you will see R create output in the console. Note that the output you see in the console is not the object itself, but the result of R printing information about the object for you. This is a very technical distinction that we will return to several times including when we discuss data types and structures as well as the S3 system. The object itself exists only in your computer’s memory.

The trouble is, we made those objects, but now have no way to return to them. They’re in memory, but inaccessible to us. We could recreate them, and they would output to the console again, but those would technically be different objects as they would exist at a different location in memory.

Names and Assignment

What we need now is a variable, which is a way to associate a name with an object. Often you will hear language such as “store object 42 in variable x” but this is rather misleading. A variable is nothing but a name. That name “points” to the object in memory and is more or less a human shortcut for accessing that memory.3

Pointing a name to an object is called assignment. There are multiple ways to do this in R, but the most common is to use either <- or = with the name on the left-hand side (LHS) and the object on the right-hand side (RHS).

x = 42

As a result of this assignment, two things have happened:

  • We have created the object that stores the value 42 which now exists in memory.
  • We have associated the name x with this object.4

Now, to access the object 42 we can use the name we have given it:

x
[1] 42
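
To make the idea that a name “points” to an object more concrete, here is a minimal sketch. The names a and b are chosen only for illustration. Pointing a at a new object does not change what b points to.

```r
a = 42   # the name a points to an object storing 42
b = a    # the name b now points to that same object
a = 7    # a is re-pointed to a new object; b is untouched
a
# [1] 7
b
# [1] 42
```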

A brief note about the difference between = and <-. Both can be used for assignment. That is, the following would have the same effect as the above code:

x <- 42

We demonstrate both because the vast majority of R code that you see will use <-. However, there is a small but seemingly growing group of users who prefer =. There is a very slight technical difference, but you will not encounter it in these notes, and it is possible you never encounter it in practice. The most important thing is that you pick one and are consistent. We recommend = as it is what you will see throughout these notes, it is much easier to type, and it will be less confusing or frustrating for those coming from other languages. For more information on this difference, and some rationale as to why = is preferred, see the Appendix section on Assignment.

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

Giving your objects names is a bit of an art. In notes like these, you’ll often see throw-away variable names like x, y, and z. In practice you’ll see verbose names like a_long_variable_name_describing_the_object. The general heuristic we’ll suggest is: Pick the smallest name possible such that someone reading the code will have a good chance of understanding what the object is, given its surrounding context. Obviously, there is a lot of room for subjectivity here. We’ll return to this later.

When creating names in R, there are a few things you should be aware of:

  • Like everything in R, names are case sensitive.
  • From the R documentation that can be accessed with ?make.names: “A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number.” For reasons we’ll see when we discuss S3, we suggest you avoid dots (.).
  • You cannot use some reserved names. For a list of these, use ?Reserved.
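
The naming rules above can be explored with a short sketch. The names total and Total and the example strings are hypothetical, chosen only for illustration. The make.names() function shows how R would repair a string that is not a syntactically valid name.

```r
# names are case sensitive: these are two different names
total = 1
Total = 2
total == Total
# [1] FALSE

# make.names() turns arbitrary strings into syntactically valid names
make.names(c("valid_name", "2fast", "my var", "if"))
# [1] "valid_name" "X2fast"     "my.var"     "if."
```

Note that the reserved word if has a dot appended, since it cannot be used as a name on its own.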

Global Environment

Let’s make a few more objects:

x = 42
y = "STAT 385"
z = TRUE

After running this code, you might notice that there is no output in the console. So how do you know that those objects were created? You can check via some additional R code, or RStudio will also provide some support here.

objects()
[1] "x" "y" "z"

The objects() function will return the names of all objects in the global environment. R places different objects in different environments, but for now, all the objects you create will be in the global environment.

The ls() function, which you might have used previously in a Unix-type terminal, also lists objects in the global environment.

ls()
[1] "x" "y" "z"
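
If you only care about one particular name, a small sketch like the following may be more convenient than scanning the full listing. The name my_score is hypothetical. The exists() function checks whether a name can be found, and rm() removes a name from the environment.

```r
my_score = 10         # create an object and give it a name
exists("my_score")    # can R find an object with this name?
# [1] TRUE
rm(my_score)          # remove the name from the environment
exists("my_score")
# [1] FALSE
```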

RStudio’s Environment tab, by default in the top-right, will also provide information about objects in the global environment, as they are created. This tool can also be used to inspect objects in other environments.

  • TODO: Add screenshot.

Order of Execution

Be aware that the order that you run your code in is important. The following two examples demonstrate.

p = 1
q = 2
p = q
p
[1] 2

q = 2
p = q
p = 1
p
[1] 1

This is something that we will be vigilant about throughout these notes, and comment on often. Always remember this: When running a line of code, the result of running that code is a function of both the line of code itself, and the current state of the environment! The same line of code may have two different outcomes given different states of the environment.
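
As a minimal sketch of this idea, the exact same line of code is run twice below and produces a different result each time, because the state of the environment has changed in between. The name count is chosen only for illustration.

```r
count = 0
count = count + 1   # before this line count is 0, after it count is 1
count
# [1] 1
count = count + 1   # the exact same line, but now count becomes 2
count
# [1] 2
```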

The c() Function

There are many, many ways to create objects in R. By far, the most common is by using the c() function. This function is used to combine5 values into a vector, the most important data structure in R. We will give a detailed description of vectors (both atomic and generic) in the coming chapters, but for now, let’s simply demonstrate the ability to combine values and objects together.

c(3, 2, 1)
[1] 3 2 1

Here, we have combined the values 3, 2, and 1 into a vector. We use commas (,) inside the function call to separate the individual values we are combining. We could do similar operations with strings and logical values:

c("a string", "another string", "one more string")
[1] "a string"        "another string"  "one more string"
c(TRUE, FALSE, NA, FALSE, TRUE)
[1]  TRUE FALSE    NA FALSE  TRUE

For reasons that will be clear later, (atomic) vectors cannot mix numbers, strings, and logical values.
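
The details come later, but as a quick preview of what happens if you try anyway: R does not raise an error. Instead, it quietly converts the values to a single common type. The values used here are arbitrary.

```r
c(1, "a", TRUE)     # everything becomes a string
# [1] "1"    "a"    "TRUE"
c(1, TRUE, FALSE)   # logical values become numbers
# [1] 1 1 0
```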

Now let’s create some (numeric) vectors, and assign them names.

odd = c(1, 3, 5, 7)
even = c(2, 4, 6, 8)
big = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

We can access these vectors (objects) by a name:

odd
[1] 1 3 5 7
even
[1] 2 4 6 8
big
 [1]  1  2  3  4  5  6  7  8  9 10

In addition to combining values, the c() function can actually combine multiple vectors together:

c(42, odd, even, 42, big)
 [1] 42  1  3  5  7  2  4  6  8 42  1  2  3  4  5  6  7  8  9 10

Notice, we’re mixing both “numbers” and vectors. Actually, that’s not true. We are simply combining together a number of vectors. This is because in R, there is no such thing as a scalar. Something like 42 is actually a vector of length 1. Much more on this next chapter.
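
A short sketch to check this claim for yourself, using only base R functions:

```r
length(42)             # a "single number" has length 1
# [1] 1
is.vector(42)          # and it is a vector
# [1] TRUE
identical(42, c(42))   # wrapping it in c() changes nothing
# [1] TRUE
```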

Writing and Using Functions

You’ve already used a number of functions in R like sqrt() and log(). But before we dive further into some details of using functions, it will probably be useful to write a few of our own functions. To do so, we will need to use the function() function.6

Let’s start with the simplest function we can possibly write.

f = function() {}

By running this code, we have created a function and assigned it a name f.7 It has no inputs. Inputs, which we will call the arguments of the function, are specified inside the () of the function() function. As we have left this blank, our function has no arguments, that is, the function has no way to obtain input. What does this function do? A function’s body is contained within the braces, {}, and specifies what code is run when we use the function.8 So in this case, the function does nothing.

How exciting!

Next, let’s talk about the difference between running f and f().

f
function() {}

Running f, without parentheses, will show you the entire function definition which is the arguments and body together. A function in R is made up of three components, its arguments (sometimes called formal arguments or formals), its body, and its environment. For now, you will only need to understand and use the arguments and body.9

f()
NULL

Running f(), with parentheses, will run the function. That is, it will process the arguments (inputs) together with the code in the body, to produce output. Because this function does nothing, it returned NULL, a specific object we will discuss later.

Additional functions exist to extract and view specific parts of a function, for example, body(), args(), and formals().
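
As a rough sketch of how these might be used, consider a small function add_one, which is hypothetical and defined here only for illustration. Its argument x has a default value of 1.

```r
# add_one is a hypothetical function used only for illustration
add_one = function(x = 1) {
  x + 1
}

names(formals(add_one))   # the names of the arguments
# [1] "x"
formals(add_one)$x        # the default value of the argument x
# [1] 1
body(add_one)             # the code that runs when the function is called
args(add_one)             # a compact display of the arguments
```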

Now let’s write a function that has both input and output.

calc_rect_perim = function(length, height) {
  (2 * length) + (2 * height)
}

Here, we have created a function to calculate the perimeter of a rectangle. It has two arguments: length and height. These are the inputs of the function. The body of the function calculates the perimeter for particular values of length and height.

Let’s demonstrate running this function. To do so, be sure you have run the code above, and that the name calc_rect_perim appears in your global environment. In RStudio, it will be in a section called Functions. How very helpful!

calc_rect_perim(length = 3, height = 4)
[1] 14

The above code can be typed very quickly. The trick is to hit the [tab] key, ⇆, often, which autocompletes function and argument names.

Running the above, we have calculated the perimeter of a rectangle with length 3 and height 4. Essentially, the following code was run10:

(2 * 3) + (2 * 4)
[1] 14

Notice that length and height never made it into your global environment. They only exist temporarily as variables inside the function. More on this later when we talk about scoping rules. Think of length and height as names assigned to temporary objects each time you run the function.
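
The same is true of names created inside the body of a function. The sketch below uses a hypothetical function calc_rect_area with a hypothetical internal name area_value. After the function has run, that name cannot be found from the console.

```r
# calc_rect_area and area_value are hypothetical names for illustration
calc_rect_area = function(length, height) {
  area_value = length * height   # exists only while the function runs
  area_value
}

calc_rect_area(length = 3, height = 4)
# [1] 12
exists("area_value")   # the name was never created outside the function
# [1] FALSE
```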

What happens if we run this function without specifying the inputs, that is, without supplying values to the arguments?

calc_rect_perim()
Error in calc_rect_perim() : 
  argument "length" is missing, with no default

We get an error because when we wrote the function, we added an input length but now didn’t supply a value for that argument. As such, R won’t know how to run the body of the function because length won’t have a value.

When writing R functions, you can specify a default value of an argument. Let’s re-write this function:

calc_rect_perim = function(length = 1, height = 1) {
  (2 * length) + (2 * height)
}

This new version of the function specifies a default value of 1 for both arguments. Let’s try to run the function again without specifying any input.

calc_rect_perim()
[1] 4

This works! R falls back to any default value if a specific value for an argument isn’t given when running the function. Now that we have defaults, we could specify one input but not the other as well:

calc_rect_perim(length = 2)
[1] 6

Let’s show you two more ways of specifying input to a function. We do not recommend using either of these approaches as beginners.

calc_rect_perim(3, 7)
[1] 20

The above is the same as the following:

calc_rect_perim(length = 3, height = 7)
[1] 20

Here we are using what is called positional argument matching. For the most part we don’t recommend doing this for any argument other than the first argument to a function. Not naming the first argument is somewhat common practice, especially for the numerous functions that use x for the first argument. We’ll return to this idea when we talk about style. For now, either name all of your arguments when running a function, or all except the first.
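
Because the perimeter calculation is symmetric in its two arguments, it hides the main risk of positional matching. The following sketch uses a hypothetical function divide, where swapping the inputs changes the answer. With named arguments the order does not matter. With positional arguments it does.

```r
# divide is a hypothetical function used only for illustration
divide = function(numerator, denominator) {
  numerator / denominator
}

divide(numerator = 10, denominator = 2)   # named: order does not matter
# [1] 5
divide(denominator = 2, numerator = 10)
# [1] 5
divide(10, 2)                             # positional: order matters
# [1] 5
divide(2, 10)
# [1] 0.2
```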

There exists another method called partial matching, but pretend we never told you that.11

Let’s write another function.

calc_powers = function(x) {
  zero = x ^ 0
  one = x ^ 1
  two = x ^ 2
  three = x ^ 3
}

Load this function into your environment by running the code above, then run the function:

calc_powers(x = 5)

Notice that nothing is printed! This is because the final line of the function is an assignment. R returns the value of the last expression evaluated in the function body, but the value of an assignment is returned invisibly, so nothing appears in the console. To better understand why this is, type and run 42 then x = 42 in the console. Note that one produces output while the other does not.
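
Strictly speaking, a function whose last line is an assignment still hands back the assigned value. It simply does so invisibly, so the console shows nothing. The sketch below illustrates this with a hypothetical function square_quiet.

```r
# square_quiet is a hypothetical function used only for illustration
square_quiet = function(x) {
  result = x ^ 2   # the last line is an assignment
}

square_quiet(x = 3)         # nothing is printed
out = square_quiet(x = 3)   # but a value is still returned
out
# [1] 9
(square_quiet(x = 3))       # wrapping the call in parentheses forces printing
# [1] 9
```

Relying on this behavior makes code harder to read, which is one reason the edits that follow end the function with an expression that is not an assignment.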

Let’s make an edit:

calc_powers = function(x) {
  zero = x ^ 0
  one = x ^ 1
  two = x ^ 2
  three = x ^ 3
  result = c(zero, one, two, three)
}

This also will not print anything because the final line is still an assignment: it creates an object and assigns it the name result. Another edit:

calc_powers = function(x) {
  zero = x ^ 0
  one = x ^ 1
  two = x ^ 2
  three = x ^ 3
  result = c(zero, one, two, three)
  result
}

This will work, because the last line is not an assignment, so R will output the object that result points to when you run this function. However, we can still do better.

calc_powers = function(x) {
  zero = x ^ 0
  one = x ^ 1
  two = x ^ 2
  three = x ^ 3
  c(zero, one, two, three)
}

We can skip the assignment altogether and simply create the object which will be returned. This is the most common practice when writing R functions. Later, there will be a need to potentially exit a function early, thus there is a return() function. We could instead write the following:

calc_powers = function(x) {
  zero = x ^ 0
  one = x ^ 1
  two = x ^ 2
  three = x ^ 3
  return(c(zero, one, two, three))
}

While many R programmers do not write their functions this way, we recommend it for beginners as it makes it abundantly clear what the output of the function is.
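
As a preview of exiting a function early, here is a minimal sketch. The function safe_sqrt is hypothetical, and it uses an if statement, which is discussed in detail later. When the input is negative, the first return() ends the function immediately and the remaining code is never run.

```r
# safe_sqrt is a hypothetical function used only for illustration
safe_sqrt = function(x) {
  if (x < 0) {
    return(NA)   # exit early: the line below is never reached
  }
  return(sqrt(x))
}

safe_sqrt(x = 16)
# [1] 4
safe_sqrt(x = -1)
# [1] NA
```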

Let’s run this function. Be sure to load the most recent edit we made.

calc_powers(x = 5)
[1]   1   5  25 125

It returns a vector, which of course is an object. Notice that the inputs to our functions were also objects, in this case the object 5 which was temporarily assigned the name x inside the function.12

Since the inputs to functions are objects and the outputs of functions are objects, we can run functions on functions. For example:

mean(calc_powers(x = 5))
[1] 39

This is called function composition, which you have seen expressed mathematically with an expression like the following.

\[ f(g(x)) \]

In our case the output of running calc_powers(x = 5) was supplied as the input to mean(). This idea is extremely powerful, but can sometimes make it difficult to write readable code. Two strategies exist to assist: intermediate variables and piping. We’ll return to these later.
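
As a brief sketch of the first strategy, the composition above can be split into two steps using an intermediate variable. The name powers is chosen only for illustration, and calc_powers is redefined in a compact form so that this example runs on its own.

```r
# a compact version of calc_powers, so this example is self-contained
calc_powers = function(x) {
  return(c(x ^ 0, x ^ 1, x ^ 2, x ^ 3))
}

powers = calc_powers(x = 5)   # step 1: store the intermediate result
mean(powers)                  # step 2: use it as the input to mean()
# [1] 39

# In R 4.1.0 or later, the same computation can be written with a pipe:
# calc_powers(x = 5) |> mean()
```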

Vectorization

When discussing R code, you will often hear of vectorized code, or vectorization. We make some very brief comments in this section, but will return to this idea when we discuss vectors and functions in more detail later.

Welcome to R Club.

  • The first rule of R Club is: Do not use for loops!
  • The second rule of R Club is: Do not use for loops!
  • And the third and final rule: If you have to use a for loop, do not grow vectors!13

— Unknown

This fictitious quotation is a bit over-the-top, and there isn’t actually anything wrong with for loops in R14, but it should serve to get the reader’s attention. Computations that might require a for loop in other languages can often be written without a for loop in R. Furthermore, by avoiding a for loop in R, your code will likely be easier to write, easier for other programmers to understand, and possibly run faster.

# don't do this
x = c(6, 4, 3, 6, 7, 8, 9, 10)
y = 0
for (i in x) {
  y = y + i
}
y
[1] 53

# instead, do this
x = c(6, 4, 3, 6, 7, 8, 9, 10)
sum(x)
[1] 53

# don't do this
x = c(6, 4, 3, 6, 7, 8, 9, 10)
y = c()
for (i in x) {
  y = c(y, i + 1)
}
y
[1]  7  5  4  7  8  9 10 11

# instead, do this
x = c(6, 4, 3, 6, 7, 8, 9, 10)
x + 1
[1]  7  5  4  7  8  9 10 11

For now, simply pretend that you’ve never heard of or seen a for loop. Also, the loops written above are purposefully written poorly! Do not use these as example for loops.

We’ll dive further into vectorized code later, but for now, know that there are many functions in R that take as input a vector and output some function applied to the vector as a whole. Some examples:

# a vector we will perform operations on
x = c(5, 1, 3, 5, 13, 7, 9, 11)
x
[1]  5  1  3  5 13  7  9 11
length(x) # length of the x vector
[1] 8
sum(x) # sum over x
[1] 54
prod(x) # product over x
[1] 675675

In more mathematical notation, the above would be:

\[ \texttt{sum(x)} = \sum_{i = 1}^{n} x_i \]

\[ \texttt{prod(x)} = \prod_{i = 1}^{n} x_i \]

Written this way, it really appears as if a for loop would be useful, but again, notice how much easier sum(x) is to read and write.

min(x) # find the minimum of x
[1] 1
max(x) # find the maximum of x
[1] 13
mean(x) # sample mean of x
[1] 6.75
var(x)  # sample variance of x
[1] 16.5
sd(x)   # sample standard deviation of x
[1] 4.062019

The last three examples, mean(), var(), and sd(), compute descriptive statistics. They are functions of samples. They are not functions of distributions.15

Mathematically, these are:

\[ \texttt{mean(x)} = \frac{1}{n}\sum_{i = 1}^{n} x_i \]

\[ \texttt{var(x)} = \frac{1}{n - 1}\sum_{i = 1}^{n} (x_i - \bar{x}) ^ 2 \]

\[ \texttt{sd(x)} = \sqrt{\frac{1}{n - 1}\sum_{i = 1}^{n} (x_i - \bar{x}) ^ 2} \]

Here, \(n\) is the length of the vector and \(\bar{x}\) is the sample mean of the vector.

Note that mean(x) can also be written as:

sum(x) / length(x)
[1] 6.75

Students have a tendency to use this, but remember, mean(x) is much easier to read and write.
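
The same kind of check works for the variance and standard deviation formulas above. The following sketch translates each formula directly into vectorized code, with n as a name chosen for illustration. Again, this is only to connect the math to the code. In practice, var(x) and sd(x) are much easier to read and write.

```r
x = c(5, 1, 3, 5, 13, 7, 9, 11)   # the same vector used above
n = length(x)

sum((x - mean(x)) ^ 2) / (n - 1)         # matches var(x)
# [1] 16.5
sqrt(sum((x - mean(x)) ^ 2) / (n - 1))   # matches sd(x)
# [1] 4.062019
```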

range(x) # the range of x values
[1]  1 13
summary(x) # a statistical summary of x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    4.50    6.00    6.75    9.50   13.00 
cumsum(x)  # cumulative sum
[1]  5  6  9 14 27 34 43 54
cumprod(x) # cumulative product
[1]      5      5     15     75    975   6825  61425 675675
cummax(x)  # cumulative maximum
[1]  5  5  5  5 13 13 13 13
cummin(x)  # cumulative minimum
[1] 5 1 1 1 1 1 1 1

As a reminder, to view the documentation of any of the above, use ?name_of_function, for example ?mean.

Other functions, especially arithmetic and other mathematical functions, will perform element-by-element operations. Some examples:

# a vector we will perform operations on
y = c(5, 4, 3, 2, 1)
y
[1] 5 4 3 2 1
y + 1
[1] 6 5 4 3 2
y - 2
[1]  3  2  1  0 -1
y * 3
[1] 15 12  9  6  3
y / 2
[1] 2.5 2.0 1.5 1.0 0.5

What’s actually happening in these examples is a bit tricky. We’ll need to revisit this when we talk about length coercion.
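
As a rough mental model only, and not a full explanation, an operation like y + 1 behaves as if the single value 1 were repeated until it matched the length of y:

```r
y = c(5, 4, 3, 2, 1)
y + 1
# [1] 6 5 4 3 2
y + c(1, 1, 1, 1, 1)   # same result, with the 1 written out five times
# [1] 6 5 4 3 2
identical(y + 1, y + c(1, 1, 1, 1, 1))
# [1] TRUE
```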

sqrt(y)
[1] 2.236068 2.000000 1.732051 1.414214 1.000000
log(y)
[1] 1.6094379 1.3862944 1.0986123 0.6931472 0.0000000

You should try some additional mathematical functions from the previous chapter as well.
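
Functions you write yourself from these element-by-element operations are often vectorized as well, without any extra effort. As a sketch, the calc_rect_perim function from earlier, repeated here so the example runs on its own, can calculate the perimeters of three rectangles in a single call:

```r
calc_rect_perim = function(length = 1, height = 1) {
  (2 * length) + (2 * height)
}

# three rectangles: 1 by 4, 2 by 5, and 3 by 6
calc_rect_perim(length = c(1, 2, 3), height = c(4, 5, 6))
# [1] 10 14 18
```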

Summary

  • TODO: You’ve learned to…

What’s Next?

  • TODO: data type, mode, class and structure

Footnotes

  1. If you’re familiar with object-oriented programming, don’t confuse everything being an object with the sorts of objects you might be used to seeing in such a paradigm. In this case objects are just “things,” that is, something that stores data, and won’t necessarily have methods (code) attached to them. R does have multiple systems for OOP, one of which, S3, we will discuss later.↩︎

  2. Code is data, but again, we’re being general and not technical here.↩︎

  3. As you’re first learning R, this may seem like a trivial detail, but it is actually incredibly important. While the full details are outside the scope of this course, it is highly recommended that at some point in your R career you read the Names and Values chapter of Advanced R.↩︎

  4. We use this perhaps odd language to clarify that the object itself doesn’t have a name but that the name points to the object. An object can have multiple names pointing to it, that is, multiple names that will return that specific object.↩︎

  5. You might also hear this function referred to as the concatenate function, but we find that to be confusing as to some users this will imply that you are performing string concatenation.↩︎

  6. Yes there is a function to create functions. Very meta.↩︎

  7. You will sometimes see R programmers create functions without names. These are called anonymous functions. They will be useful later, but for now you should assign a name to all of your functions.↩︎

  8. Technically, you do not need to use braces, and could simply write an expression where the braces are. Any multiple line expression will necessarily require braces.↩︎

  9. You do not need to specify an environment when creating a function. This happens automatically.↩︎

  10. R mostly uses a pass by value evaluation strategy.↩︎

  11. For additional details, see the Argument Matching section of the R Language Definition.↩︎

  12. Objects assigned to names zero, one, two, and three also temporarily existed when the function ran. Notice they did not appear in your global environment.↩︎

  13. We should probably also suggest you iterate over the indexes of a vector rather than the elements of the vector, but that doesn’t have as nice a ring to it. Again, more on this when we discuss iteration later.↩︎

  14. But for loops can be written very poorly in R, which we will discuss later.↩︎

  15. They are estimates (statistics) of population parameters.↩︎