Lesson 4: Data Wrangling

Learning Objectives

  • Use the dplyr package to manipulate your data
    • Introduce logical operators
  • The standard R methods for selecting data
  • Some of our favourite (data wrangling) things
    • Wrangling strings
    • Wrangling dates
    • Columns to rows and back again

Prerequisite

  • Install the packages using install.packages():
    • dplyr
    • lubridate
  • Load the libraries
  • Import the RCT dataset into an object called RCT
install.packages("tidyverse")
library(tidyverse)
RCT <- read.csv("../data/course_exemplar2.csv")

First wrangle!

library(dplyr)
filter(RCT, age >= 65)
##    id     recruit age gender random ps0h ps3h ps12 ps24h ps168h move12h
## 1   1  1 Jan 2014  80      F  drain    0    0    0     2      2       2
## 2   2  2 Jan 2014  72      f  drain    1    2    2     1      2       2
## 3   3  3 Jan 2014  72      F  drain    0    1    0     0      0       0
## 4   5  5 Jan 2014  84      f   skin    0    0    2     3     NA       1
## 5   6  6 Jan 2014  72      F   skin    1    2    2     2     NA       1
## 6   7  7 Jan 2014  65      F   skin    0    1    0     0      0       0
## 7   8  8 Jan 2014  75      F  drain    0    0    0     0      0       0
## 8  12 12 Jan 2014  72      F   skin    0    1    2     2      1       2
## 9  13 13 Jan 2014  75      f   skin    0    2    1     0      1       2
## 10 15 15 Jan 2014  95      f   skin    0    1    1     1     NA       1
## 11 18 18 Jan 2014  73      F  drain    0    0    0     0      0       0
## 12 19 19 Jan 2014  73      f  drain    0    0    0     0      0       0
## 13 21 21 Jan 2014  86      M  drain    0    2    2     2      1       3
## 14 23 23 Jan 2014  66      f   skin    0    1    1     1     NA       1
## 15 24 24 Jan 2014  73      f  drain    0    0    0     0     NA       0
## 16 26 26 Jan 2014  80      f  drain    0    1    0     1      0       1
## 17 27 27 Jan 2014  85      f  drain    0    1    1     1     NA       1
## 18 31 31 Jan 2014  73      f  drain    0    0    0     0     NA       0
## 19 33  2 Feb 2014  79      f  drain    0    1    4     1      1       2
## 20 34  3 Feb 2014  77      f  drain    0    0    0     0      0       0
## 21 35  4 Feb 2014  81      F   skin    0    0    5     5     NA       5
## 22 37  6 Feb 2014  67      f  drain    1    0    0     1     NA       2
## 23 38  7 Feb 2014  80      F  drain    0    0    0     1      0       3
## 24 39  8 Feb 2014  66      F   skin    0    3    4     2      3       1
## 25 40  9 Feb 2014  66      F  drain    0    1    3     3      0       2
## 26 43 12 Feb 2014  79      F   skin    0    0    3     0      0       0
## 27 44 13 Feb 2014  70      F   skin    1    1    1     1      0       1
## 28 46 15 Feb 2014  71      F   skin    0    0    0     0      0       0
## 29 49 18 Feb 2014  70      F   skin    0    1    3     0      1       0
## 30 50 19 Feb 2014  65      F   skin    0    2    3     3      2       3
## 31 52 21 Feb 2014  66      F   skin    0    1    1     0      0       1
## 32 54 23 Feb 2014  76      F  drain    0    3    2     2      0       0
## 33 55 24 Feb 2014  74      F   skin    1    2    5     1      1       0
## 34 57 26 Feb 2014  75      F  drain    0    2    2     0      0       2
## 35 58 27 Feb 2014  77      F  drain    0    2    2     1      0       2
## 36 59 28 Feb 2014  77      F   skin    0    0    1     0      0       1
## 37 60  1 Mar 2014  85      F  drain    0    3    3     4      0       3
## 38 62  3 Mar 2014  79      f  drain    0    1    1     1      1       2
## 39 64  5 Mar 2014  80      f  drain    0    0    0     2      1       0
##    move24h move168h tylenol codeine oramorph
## 1        0        0      8g       0        0
## 2        1        2      8g   210mg        0
## 3        0        0      6g       0        0
## 4        1       NA      8g       0     20mg
## 5        1       NA      8g       0     20mg
## 6        0        0      4g       0        0
## 7        0        0       0       0        0
## 8        2        2      8g       0        0
## 9        0        1      4g       0     10mg
## 10       1       NA      8g       0        0
## 11       0        0      8g       0      5mg
## 12       0        1       0       0        0
## 13       2        1      8g       0        0
## 14       1       NA      6g       0        0
## 15       0       NA      4g    60mg        0
## 16       1        0      4g       0        0
## 17       1       NA      4g       0        0
## 18       0       NA       0       0        0
## 19       4        1      4g       0     50mg
## 20       0        0      8g       0        0
## 21       5       NA      8g       0        0
## 22       2       NA      4g       0        0
## 23       3        0      4g       0        0
## 24       1        3      8g       0     30mg
## 25       2        0      3g       0        0
## 26       0        0      1g       0        0
## 27       1        0      8g       0        0
## 28       0        0      2g       0        0
## 29       0        0      4g       0     20mg
## 30       3        1      4g       0        0
## 31       1        0      8g       0        0
## 32       0        0      4g       0     60mg
## 33       0        0      8g    60mg        0
## 34       0        0      8g    30mg        0
## 35       1        0      8g       0        0
## 36       1        0      4g       0        0
## 37       4        0      4g    60mg        0
## 38       2        0      2g       0     20mg
## 39       0        0       0       0        0
##                                              other los
## 1                                                0   3
## 2                                                0   7
## 3                                 Diclofenac 225mg   4
## 4                                                0   3
## 5             Amitryptaline 10mg, pregabalin 300mg  10
## 6                                                0   4
## 7                                                0   3
## 8                                    Morphine 20mg   4
## 9                                 Diclofenac 225mg   3
## 10                                               0   5
## 11                                               0   5
## 12                                               0   2
## 13                                               0   4
## 14                                               0   5
## 15                                               0   2
## 16                                               0   4
## 17                                               0   1
## 18                                               0   0
## 19                  Morphing 2.5mg, tramadol 100mg   5
## 20                                               0   4
## 21 Oxycontin 240mg, pregabalin 200mg, oxynorm 25mg   8
## 22                                               0  NA
## 23                                               0  NA
## 24                       Morphine Sulphate MR 40mg   4
## 25                                 Ibuprofen 800mg  NA
## 26                                               0   2
## 27                                 Ibuprofen 800mg  NA
## 28                                               0  NA
## 29                                               0   2
## 30                                               0   0
## 31                                Diclofenac 225mg  NA
## 32                                               0   5
## 33                                Ibuprofen 1200mg  NA
## 34                       Morphine Sulphate MR 60mg   4
## 35                       Morphine sulphate MR 50mg   4
## 36                                               0  NA
## 37                                               0  NA
## 38                                               0  NA
## 39                                               0   3
##                                        los_reason satisfaction
## 1                   Continued fluid through drain         Good
## 2  Continued fluid through drain and bowel stasis         Poor
## 3                            Urinary incontinence         Good
## 4                   Continued fluid through drain         Good
## 5                     Comorbidities- recurrent PE         Poor
## 6                   Continued fluid through drain    Excellent
## 7                   Continued fluid through drain    Excellent
## 8                   Continued fluid through drain         Good
## 9                   Continued fluid through drain         Good
## 10                  Continued fluid through drain satisfactory
## 11                  Continued fluid through drain    Excellent
## 12                  Continued fluid through drain         Good
## 13                  Continued fluid through drain         Good
## 14                  Continued fluid through drain         Good
## 15                  Continued fluid through drain    Excellent
## 16                  Continued fluid through drain    Excellent
## 17                  Continued fluid through drain         Good
## 18                  Continued fluid through drain    Excellent
## 19                  Continued fluid through drain         Good
## 20                  Continued fluid through drain         Good
## 21                        Pain due to spinal mets         Poor
## 22                                                            
## 23                                                            
## 24                  Continued fluid through drain satisfactory
## 25                                                            
## 26                  Continued fluid through drain    Excellent
## 27                                                            
## 28                                                            
## 29                  Continued fluid through drain    Excellent
## 30                  Continued fluid through drain         Good
## 31                                                            
## 32                  Continued fluid through drain         Good
## 33                                                            
## 34                  Continued fluid through drain         Good
## 35                  Continued fluid through drain         Good
## 36                                                            
## 37                                                            
## 38                  Continued fluid through drain Satisfactory
## 39                  Continued fluid through drain         Good

Done! It filters rows from the RCT data frame where age is >= 65.

Logical operators

  • Most are obvious:
    • > (greater than), >= (greater than or equal to), and similarly for < and <=.
  • Others are less so:
    • != means Not Equal to…

Logical operators 2

  • Others are confusing:
    • == compares equality. Notice that there are two equal signs. This is because in R = means an assigment, you’re making somethign equal to something else:
    • x = 6 means make the variable called x equal 6. If you then do x == 8 is a question, is x equal to 8? Here, the answer is a FALSE

Second Wrangle!

RCT_1 <- filter(RCT, age >= 65) 
RCT_2 <- select(RCT_1, gender)
table(RCT_2)
## RCT_2
##  f  F  M 
## 15 23  1
#The same as:
table(select(filter(RCT, age >= 65), gender))
## 
##  f  F  M 
## 15 23  1

What’s happening?

  • First we filter the rows by a criterion:
    • we are interested in patients 65 and older (>= 65)
  • We then assign that to an named object RCT_1
  • We then select the gender column and assign that to RCT_2, and create a table from that
  • What we have left is a single column of Male/Female from a subset of our data - namely ones that have an Age and are 65 and older

Is there a better way?

filter(RCT, age >= 65)
RCT %>% filter(age >= 65)

What’s that weird sign?!

The %>% operator (created by the dplyr library) is called a pipe, and it (surprise, surprise) pipes data from one command to the next. So in plain English, the above line filters the data where the age is >= 65, then selects the Gender.

What now?

Now that we have our data’s subset, we can pass it onto other functions in R:

RCT %>% filter(age >= 65) %>% 
  select(gender) %>% table()
## .
##  f  F  M 
## 15 23  1

This says, grab my data labeled RCT, filter the rows so that we only find patients who have an Age and are 65 and older, select the column called gender. With that column, make me a summary.

Other dplyr methods

  • In Addition to filter and select, there are:
    • arrange –> sorts rows
    • distinct –> finds unique values
    • mutate –> creates a new column based on some parameters. Hint: you can use other column names here, useful for finding out the Age of patients, length of time they were in hospital…etc
    • recode –> helps you recode values which might have been mislabelled
    • group_by –> This one is great!!

Let’s practice

RCT %>% arrange(age)
RCT %>% arrange(desc(age))

RCT %>% select(gender) %>% distinct()

RCT %>% filter(age >= 65) %>% 
  select(gender) %>% 
  mutate(gender = recode(gender, f = "F"))

Group-by

It breaks down a dataset into specified groups of rows. When you then apply the functions above on the resulting object they’ll be automatically applied by group. Most importantly, all this is achieved by using the same exact syntax you’d use with an ungrouped object.

RCT %>% group_by(gender) %>% summarise(age_avg = mean(age))
## # A tibble: 3 x 2
##   gender age_avg
##   <fct>    <dbl>
## 1 f           NA
## 2 F           NA
## 3 M           86

Sadly this won’t work because mean has a little hissy fit if there are NA’s in the data.

Fix

To fix this we add the argument na.rm = TRUE:

RCT %>% group_by(gender) %>% 
  summarise(age_avg = mean(age, na.rm = TRUE))
## # A tibble: 3 x 2
##   gender age_avg
##   <fct>    <dbl>
## 1 f         66.2
## 2 F         65.8
## 3 M         86
RCT %>% mutate(gender = recode(gender, f = "F")) %>%
  group_by(gender) %>% 
  summarise(age_avg = mean(age, na.rm = TRUE))
## # A tibble: 2 x 2
##   gender age_avg
##   <fct>    <dbl>
## 1 F         65.9
## 2 M         86

Putting things together

How do we combine all these steps to get insight into the data?

Since this RCT looking at the effects of having a drain vs having skin infiltration on postop pain, let’s see what the mean change in pain scores is at 24h vs baseline:

RCT %>% mutate(ps_change = ps24h - ps0h) %>% 
  group_by(random) %>%
  summarise(mean_ps_change = mean(ps_change, na.rm = TRUE))

# If not using pipes:
RCT$ps_change <- RCT$ps24h - RCT$ps0h

drain <- filter(RCT, random == "drain")
mean(drain$ps_change, na.rm = TRUE)

skin <- filter(RCT, random == "skin")
mean(skin$ps_change, na.rm = TRUE)

Complex tables

RCT %>% group_by(random) %>%
  summarise(mean_ps0h = mean(ps0h, na.rm = TRUE),
            mean_ps3h = mean(ps3h, na.rm = TRUE),
            mean_ps12h = mean(ps12, na.rm = TRUE),
            mean_ps24h = mean(ps24h, na.rm = TRUE))
## # A tibble: 2 x 5
##   random mean_ps0h mean_ps3h mean_ps12h mean_ps24h
##   <fct>      <dbl>     <dbl>      <dbl>      <dbl>
## 1 drain     0.0882     0.886      0.886      0.971
## 2 skin      0.310      1.41       2.03       1.59

Other resources