Information
Introduction
This tutorial introduces you to the R language. Our approach is
inspired by R for Data Science
(2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett
Grolemund. You will learn how to work with data sets using the
tidyverse
meta-package. You will learn how to direct the result of one function to
another using the pipe, |>, and how to make a plot
using the ggplot() function.
This tutorial assumes that you have already completed the “Getting Started” tutorial in the tutorial.helpers package. If you haven’t, do so now. It is quick!
From the main Positron menu, start a new window with
File -> New Window. This new window is the location in
which you will do all the work for the tutorial. The current window, the
one in which you are reading these words, is just used to run this
tutorial.
Working with data
Learn how to explore a data set using functions like
summary(), glimpse(), and
slice_sample().
Exercise 1
Before you start doing data science, you must load the packages you
are going to use. Use the function library() to load the
tidyverse package. Click “Run Code.” The check mark
which appears next to “Exercise 1” above indicates that you have
submitted your answer. It doesn’t verify that you have answered the
question correctly.
library(...)
“Library” and “package” are often used interchangeably in R, although,
strictly speaking, a library is the directory in which packages are
installed. Either way, it is the library() command which loads a
package, giving us access to the functions and data which it contains.
Exercise 2
In this tutorial, you will sometimes enter code into the exercise blocks, as you did above. But we will also ask you to run code in the Console. (You will do this in the other Positron window, since the Console in this window is currently busy running this tutorial.) Example:
In the Console, run library(tidyverse).
With Console questions, we will usually ask you to Copy/Paste the Command/Response into an answer block, like the one below. We usually shorten those instructions as CP/CR. Do that now.
Your answer should look like:
> library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
>
Your answer never needs to match ours perfectly. Our goal is just to ensure that you are actually following the instructions.
Exercise 3
Data frames, also referred to as “tibbles,” are spreadsheet-type data sets.
In the Console, run diamonds.
CP/CR.
diamonds
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
Whenever we show output like this after a question, we are showing our answer to the previous question, even if we do not label it as such.
Exercise 4
In the Console, run summary() on
diamonds.
CP/CR.
summary(diamonds)
## carat cut color clarity depth table
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00 1st Qu.:56.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80 Median :57.00
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75 Mean :57.46
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50 3rd Qu.:59.00
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00 Max. :95.00
## J: 2808 (Other): 2531
## price x y z
## Min. : 326 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 2401 Median : 5.700 Median : 5.710 Median : 3.530
## Mean : 3933 Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :18823 Max. :10.740 Max. :58.900 Max. :31.800
##
This function provides a quick statistical overview of each variable in the data set. In some cases, as here, the tutorial displays the same object differently from what you were able to copy/paste. And that is OK! Your answer does not need to match our answer.
Exercise 5
In the Console, run slice_sample() on
diamonds. This selects a random row from the data set.
CP/CR.
slice_sample(diamonds)
## # A tibble: 1 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.36 Good H VS2 63.5 54 568 4.55 4.59 2.9
Your answer will differ from this answer because of the inherent
randomness in functions like slice_sample().
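If you ever need a random draw to be repeatable, one common approach is to set the random seed first. The sketch below is not part of the exercises; set.seed() is base R, and the seed value 42 is an arbitrary choice made for illustration.

```r
library(tidyverse)

# Setting the same seed before each draw makes the "random" row repeatable.
set.seed(42)  # 42 is an arbitrary value chosen for this sketch
first_draw <- slice_sample(diamonds, n = 1)

set.seed(42)
second_draw <- slice_sample(diamonds, n = 1)

# Both draws used the same seed, so they contain the same row.
identical(first_draw, second_draw)
```

Without the calls to set.seed(), the two draws would almost always differ.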
Exercise 6
In the Console, hit the Up Arrow to retrieve the previous command.
Edit it to add the argument n = 4 to
slice_sample(diamonds). This will return 4 random rows
from the diamonds data set.
CP/CR.
slice_sample(diamonds, n = 4)
## # A tibble: 4 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.11 Premium H VS2 62 59 5395 6.66 6.59 4.11
## 2 1.5 Good G I1 57.4 62 3179 7.56 7.39 4.29
## 3 2.09 Premium F SI2 61.3 59 12377 8.19 8.15 5.01
## 4 0.42 Ideal E VS2 62.1 56 1024 4.8 4.77 2.97
Editing code directly in the Console quickly becomes annoying. See the positron.tutorials package for tutorials about using Positron to write and organize your code.
Exercise 7
In the Console, run print() on diamonds.
This returns the same result as typing diamonds.
CP/CR.
print(diamonds)
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
You can choose how many rows to display by using the n
argument in the print() function, and how many columns to
display by using the width argument.
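As a small sketch of both arguments together, the call below shows 5 rows and, assuming you want every column regardless of how wide your Console is, sets width to Inf. The values 5 and Inf are just examples.

```r
library(tidyverse)

# n controls the number of rows shown.
# width = Inf asks print() to show all columns, however wide the output gets.
print(diamonds, n = 5, width = Inf)
```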
Exercise 8
In the Console, run print() on diamonds
with the argument n = 3. This returns the first 3 rows of
the diamonds data set.
CP/CR.
print(diamonds, n = 3)
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## # ℹ 53,937 more rows
print(), by default, gives the top of the tibble, so
your answer should match our answer. slice_sample(), on the
other hand, picks random rows to return. But, in both cases, the result
is a tibble.
A central organizing principle of the Tidyverse is that most functions take a tibble as their first argument and return a tibble. This allows us to “chain” commands together, one after the other.
Exercise 9
In the Console, run ?diamonds.
This will look up the help page for the diamonds tibble
from the ggplot2 package, which is one of the core
packages in the Tidyverse. The help page will
appear on the right side of your Positron window, in the Secondary
Activity Bar, which you might need to activate in order to see.
Copy/paste the Description section of the help page below.
You can find help about an entire package with
help(package = "ggplot2"). It is confusing, but
unavoidable, that package names are sometimes unquoted, as in
library(ggplot2), and sometimes quoted, as in
help(package = "ggplot2"). If one does not work, try the
other.
Exercise 10
In the Console, run glimpse() on diamonds.
CP/CR.
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, …
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very …
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, …
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, …
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 33…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, …
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, …
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, …
glimpse() displays columns running down the page and the
data running across. Note how the “type” of each variable is
listed next to the variable name. For example, price is
listed as <int>, meaning that it is an integer
variable. To learn more about the glimpse() function, run
?glimpse.
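You can also check the type of a single column yourself. The sketch below uses the base R function class() and the $ operator, which pulls one column out of a tibble. Neither is covered in this tutorial, so treat this as an optional aside.

```r
library(tidyverse)

# <int> in glimpse() corresponds to the class "integer".
class(diamonds$price)

# <dbl> columns are reported by class() as "numeric".
class(diamonds$carat)

# <ord> columns are ordered factors, which carry two classes.
class(diamonds$cut)
```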
view() is another useful function, but, because it is
interactive, we should not use it within a tutorial.
Exercise 11
In the Console, run sqrt(144).
CP/CR.
sqrt(144)
## [1] 12
The square root function is one of many built-in functions in R. Most return their result, which R then, by default, prints out.
Exercise 12
In the Console, run x <- sqrt(144).
CP/CR.
x <- sqrt(144)
The symbol <- is the assignment operator. In this
case, we are assigning the value of sqrt(144) to
the variable x. Nothing is printed out because of that
assignment.
Also, you can see x in the “Variables” tab under the
“Session” pane in the Secondary Activity Bar on the right-hand side of
the Positron window.
Exercise 13
In the Console, run x.
CP/CR.
x
## [1] 12
Now that x has been defined in the Console, it is
available for your use. Above, we just print it out. But we could also
use it in other calculations, e.g., x + 5.
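A minimal sketch of reusing x follows. The second variable, y, is a name made up for this example.

```r
x <- sqrt(144)

# Using x in a calculation prints the result: 17
x + 5

# Assigning the result to a new variable prints nothing...
y <- x * 2

# ...until we ask for it by name: 24
y
```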
Pipes and plots
Although the Tidyverse includes hundreds of commands for
data manipulation, the most important are filter(),
select(), arrange(), mutate(),
and summarize().
Exercise 1
Let’s warm up by examining the gss_cat tibble from the
forcats package. Since forcats is a
core tidyverse package, you have already loaded it.
Type gss_cat and hit “Run Code.”
Instead of using the Console, we will be doing the exercises in this section using exercise blocks.
...
As the help page notes, gss_cat is a “sample of
categorical variables from the General
Social Survey.”
Exercise 2
Run summary() on gss_cat.
summary(...)
Note that there are missing values in some columns. The word
NA stands for “Not Available” and is used to represent
missing data in R.
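To see how NA behaves, consider this short sketch. It uses the base R functions is.na() and mean(), plus the $ operator to pull out a single column; none of these are required for the exercises.

```r
library(tidyverse)

# is.na() flags which values are missing.
is.na(c(1, NA, 3))

# Count the missing values in the tvhours column.
sum(is.na(gss_cat$tvhours))

# Most summaries of data containing NA are themselves NA...
mean(gss_cat$tvhours)

# ...unless we ask R to drop the missing values first.
mean(gss_cat$tvhours, na.rm = TRUE)
```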
Exercise 3
Pipe gss_cat to drop_na(). This function
removes rows with missing values. The pipe symbol, |>,
allows us to chain R commands together, one after the other, with each
command connected to the next by a pipe. In this case, we
want:
gss_cat |>
drop_na()
... |>
drop_na()
Note the number of rows in the tibble after drop_na().
Since drop_na() removes rows with missing values, the
number of rows in the tibble will be less than the original number of
rows.
We could achieve the same result by running
drop_na(gss_cat). The symbol |> just
“pipes” gss_cat into drop_na() as its first
argument.
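One way to convince yourself that the two forms are equivalent is to compare their results directly. In this sketch, piped and nested are variable names invented for the comparison.

```r
library(tidyverse)

# The pipe passes gss_cat to drop_na() as its first argument...
piped <- gss_cat |>
  drop_na()

# ...which is the same as calling drop_na() on gss_cat directly.
nested <- drop_na(gss_cat)

# The two results are the same tibble.
identical(piped, nested)

# Compare the number of rows before and after dropping missing values.
nrow(gss_cat)
nrow(piped)
```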
Exercise 4
Pipe gss_cat to filter(). Within
filter(), use the argument year == 2014.
gss_cat |>
...(year == 2014)
This workflow — in which we pipe a tibble to a function, which then outputs another tibble, which we can then pipe to another function, and so on — is very common in R programming.
The resulting tibble has the same number of columns as
gss_cat because filter() only affects the
rows. But there are many fewer rows.
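The claim about rows and columns can be checked with the base R functions nrow() and ncol(). This is only a sketch: gss_2014 is a name chosen for this example. Note that filter() conditions use a double equals sign, ==, to test for equality; a single = is used for naming arguments.

```r
library(tidyverse)

gss_2014 <- gss_cat |>
  filter(year == 2014)

# filter() keeps every column...
ncol(gss_2014) == ncol(gss_cat)

# ...but only the rows for which the condition is TRUE.
nrow(gss_2014) < nrow(gss_cat)
```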
Exercise 5
Continue the code and pipe with select(), using the
argument age, marital, race, relig, tvhours. Note that you
do not need to retype the code from the last exercise. You can just
click the “Copy Code” button.
... |>
select(age, ..., race, ..., tvhours)
Note how the Hint only gives the most recent line of the pipe.
Because select() does not affect the rows, we have the same
number as after filter(). But we only have 5 columns now,
consistent with what we told select() to do.
Exercise 6
Copy previous code. Continue the pipe with summary().
... |>
summary()
Note that there are missing values in the tvhours
column. Let’s remove them.
Exercise 7
Copy previous code. Replace the summary() with
drop_na().
... |>
drop_na()
The number of rows has decreased because we removed rows with missing
values. drop_na() removes all rows which have a missing
value for any of the variables. If we wanted to just remove the rows
which are missing tvhours, we would use
drop_na(tvhours).
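The difference between the two calls can be sketched as follows. The variable names gss_small, all_complete, and tv_complete are made up for this example.

```r
library(tidyverse)

gss_small <- gss_cat |>
  filter(year == 2014) |>
  select(age, marital, race, relig, tvhours)

# Drop rows with a missing value in ANY column.
all_complete <- gss_small |>
  drop_na()

# Drop only rows in which tvhours is missing.
tv_complete <- gss_small |>
  drop_na(tvhours)

# Checking every column can never keep more rows than checking one.
nrow(all_complete) <= nrow(tv_complete)
```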
Exercise 8
Continue the pipe with arrange(), using
tvhours as the argument.
... |>
arrange(...)
The arrange() function sorts the rows of a tibble. By
default, it sorts in ascending order.
Exercise 9
Copy the previous code. Put desc() around
tvhours to sort in descending order.
... |>
arrange(desc(...))
Got to respect someone who watches TV 24 hours a day!
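The two sort orders are easier to compare on a tiny example. The tibble below, toy, is invented purely for illustration and is not part of any package.

```r
library(tidyverse)

# A made-up three-row tibble.
toy <- tibble(
  name = c("a", "b", "c"),
  hours = c(3, 1, 2)
)

# Ascending order (the default): hours runs 1, 2, 3.
toy |>
  arrange(hours)

# Descending order: hours runs 3, 2, 1.
toy |>
  arrange(desc(hours))
```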
Exercise 10
Let’s make a plot. Copy the previous code, and pipe to
ggplot(). Set aes(x = age, y = tvhours).
... |>
ggplot(aes(x = ..., y = ...))
This returns a blank plot. The axes are set by aes(), but we have not yet added a “geom” layer to draw the data.
Exercise 11
Add another layer with geom_jitter() using the
+ sign. Plotting code in the ggplot2
package uses +, not |>, to connect
different commands together. This difference comes from the fact that
ggplot2 was written 10+ years before the pipe was
invented.
... +
geom_jitter()
This is a scatterplot of age versus
tvhours. The x-axis is age, and the y-axis is the number of
hours of TV watched per day.
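To see why jitter is useful here, you could compare it with geom_point(), which this tutorial does not otherwise use. In the sketch below, the plot is stored in a variable named p, a name chosen for this example, so that each layer can be added in turn.

```r
library(tidyverse)

# Build the base plot once. Nothing is drawn until a geom is added.
p <- gss_cat |>
  drop_na(age, tvhours) |>
  ggplot(aes(x = age, y = tvhours))

# geom_point() draws each observation exactly where it falls, so many
# points with the same age and tvhours sit on top of one another.
p + geom_point()

# geom_jitter() adds a small amount of random noise to each position,
# which makes the overlapping points visible.
p + geom_jitter()
```

Because age and tvhours are both recorded as whole numbers, many respondents share the same pair of values, which is the situation jitter is meant for.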
Exercise 12
Finally, add a title, subtitle, and labels for the x and y axes using
labs(). The subtitle should be the one sentence of
information about the graph with which you would hope a reader walks
away. What is the most important fact demonstrated in the graphic?
Consider this example graph:

You can make yours look like ours, if you like.
... +
labs(title = "...",
subtitle = "...",
x = "...",
y = "...")
Note that the code in the exercise block is not saved. If you want to save the code, you can copy/paste it into an R script file.
Summary
This tutorial introduced you to the R language. Our approach was
inspired by R for Data Science
(2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett
Grolemund. You learned how to work with data sets using the tidyverse
meta-package. You learned how to direct the result of one function to
another using the pipe, |>, and how to make a plot
using the ggplot() function.
Download answers
- Click the button to download a file containing your answers.
- Save the file onto your computer in a convenient location.
(If no file seems to download, try right-clicking the download button and choose "Save link as...")