Linear Regression in R Friday. March 18, 2016 - 9 mins I recently attended an excellent Introduction to Regression Analysis course run by the Centre for Applied Statistics Courses at UCL. They taught the course with example output from SPSS, so here I try to replicate their steps in R.

The data they provided is simulated and in an Microsoft Excel (.xls) spreadsheet format and can be found here . So let’s start by loading it into R.

#We will use the readxl package to read the data into R
library ( readxl )
data <- read_excel ( "../data/DataSetFile.xls" )
#The last 2 rows need to be removed, because they are actually annotations describing the dataset. The first column can be removed because they are just observation numbers.
data <- data [ 1 : 42 , 2 : 9 ]
#We need to tell R that there are some categorical variables
data $ female <- as.factor ( as.integer ( data $ female ))
data $ drugstat <- as.factor ( as.integer ( data $ drugstat ))
data $ ethnic <- as.factor ( as.integer ( data $ ethnic ))

So now we have a dataframe of 42 observations and 8 variables:

`frcmax`

= VmaxFRC(L/s), a measure of small airway function of the baby at birth
`female`

= 0 (male), 1 (female)
`drugstat`

= Whether the mother was administered pethidine during labour: 0 (No), 1 (Yes)
`heellen`

= Crown-Heel length (cm) of the baby
`gestag`

= Gestational age (weeks) at birth
`ethnic`

= Ethnicity: 0 (White Caucasian), 1 (Black African), 2 (Asian)
`cd4`

= CD4 count (per mm^3) of the baby
`bwt`

= Birthweight (grams) of the baby We will use regression to answer the question: “Is there any association between the use of pethidine during labour and VmaxFRC of the baby at birth?”

We start by examining the variables, getting a general feel of the data.

library ( Hmisc ) #We will use a function from this package which contains many useful functions for data analysis
library ( car ) #We will use this package to run a scatterplot matrix
describe ( data )

```
## data
##
## 8 Variables 42 Observations
## ---------------------------------------------------------------------------
## frcmax
## n missing unique Info Mean .05 .10 .25 .50
## 41 1 39 1 100.7 33.2 48.4 69.7 92.9
## .75 .90 .95
## 123.0 152.0 181.0
##
## lowest : 17.5 31.0 33.2 47.9 48.4
## highest: 152.0 175.0 181.0 182.0 220.0
## ---------------------------------------------------------------------------
## female
## n missing unique
## 40 2 2
##
## 0 (20, 50%), 1 (20, 50%)
## ---------------------------------------------------------------------------
## drugstat
## n missing unique
## 41 1 2
##
## 0 (31, 76%), 1 (10, 24%)
## ---------------------------------------------------------------------------
## heellen
## n missing unique Info Mean .05 .10 .25 .50
## 41 1 12 0.98 46.1 43 43 45 46
## .75 .90 .95
## 48 49 51
##
## 40 41 43 44 45 46 47 48 49 50 51 52
## Frequency 1 1 5 3 8 5 7 4 3 1 2 1
## % 2 2 12 7 20 12 17 10 7 2 5 2
## ---------------------------------------------------------------------------
## gestag
## n missing unique Info Mean .05 .10 .25 .50
## 42 0 12 0.99 36.88 32.05 33.10 35.00 37.00
## .75 .90 .95
## 39.00 40.00 41.00
##
## 31 32 33 34 35 36 37 38 39 40 41 42
## Frequency 1 2 2 3 4 6 8 4 3 5 3 1
## % 2 5 5 7 10 14 19 10 7 12 7 2
## ---------------------------------------------------------------------------
## ethnic
## n missing unique
## 42 0 3
##
## 0 (7, 17%), 1 (21, 50%), 2 (14, 33%)
## ---------------------------------------------------------------------------
## cd4
## n missing unique Info Mean .05 .10 .25 .50
## 20 22 20 1 3.389 2.523 2.658 2.783 3.542
## .75 .90 .95
## 4.080 4.435 4.507
##
## lowest : 0.300 2.640 2.660 2.719 2.734
## highest: 4.147 4.364 4.427 4.500 4.634
## ---------------------------------------------------------------------------
## bwt
## n missing unique Info Mean .05 .10 .25 .50
## 42 0 40 1 1994 1385 1433 1778 2042
## .75 .90 .95
## 2260 2457 2526
##
## lowest : 1100 1190 1385 1390 1420, highest: 2460 2501 2527 2580 2590
## ---------------------------------------------------------------------------
```

scatterplotMatrix ( ~ frcmax + heellen + gestag + cd4 + bwt , diagonal = "density" , smoother = FALSE , data = data )

stripchart ( frcmax ~ drugstat , vertical = TRUE , offset = .5 , pch = 19 , ylab = "FRC Max" , xlab = "Whether Pethidine was given" , data = data )

stripchart ( frcmax ~ ethnic , vertical = TRUE , offset = .5 , pch = 19 , ylab = "FRC Max" , xlab = "Ethnicity" , data = data )

stripchart ( frcmax ~ female , vertical = TRUE , offset = .5 , pch = 19 , ylab = "FRC Max" , xlab = "Sex" , data = data )

We can now try to fit a linear model, our outcome variable of interest is `frcmax`

, and we need to include the predictor `drugstat`

in our model. But let’s first fit this model `frcmax = a + b(bwt)`

model1 <- lm ( frcmax ~ bwt , data = data )
summary ( model1 )

```
##
## Call:
## lm(formula = frcmax ~ bwt, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.689 -25.798 -4.778 28.878 103.291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.62897 34.78117 0.766 0.4485
## bwt 0.03722 0.01716 2.169 0.0363 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.58 on 39 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1076, Adjusted R-squared: 0.08474
## F-statistic: 4.703 on 1 and 39 DF, p-value: 0.03626
```

plot ( frcmax ~ bwt , pch = 19 , data = data )
abline ( model1 )

Now let’s add a few more predictors and include the `drugstat`

variable.

model2 <- lm ( frcmax ~ bwt + gestag + factor ( ethnic ) + drugstat , data = data )
summary ( model2 )

```
##
## Call:
## lm(formula = frcmax ~ bwt + gestag + factor(ethnic) + drugstat,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.614 -18.792 -3.731 17.483 80.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 226.13882 92.83294 2.436 0.0202 *
## bwt 0.07550 0.02323 3.250 0.0026 **
## gestag -6.48755 3.32436 -1.952 0.0593 .
## factor(ethnic)1 -46.91154 19.53192 -2.402 0.0219 *
## factor(ethnic)2 -27.22016 18.62705 -1.461 0.1531
## drugstat1 -17.43254 15.62062 -1.116 0.2722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.42 on 34 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4108, Adjusted R-squared: 0.3241
## F-statistic: 4.741 on 5 and 34 DF, p-value: 0.002158
```

So what does this mean?

In this model, the equation is `frcmax = 226.14 + 0.08(bwt) - 6.49(gestag) - 49.91(ethnic = African) - 27.22(ethnic = Asian) - 17.43(drugstat = 1)`

.

The Adjusted R^2 is 0.3241, meaning 32.41% of the variability of `frcmax`

can be accounted for by the predictors in the model. `bwt`

and `ethnic = African`

had p-values <0.05, meaning these were significant independent predictors of `frcmax`

. For every 1 unit increase in `bwt`

, there is a 6.49 unit increase in `frcmax`

, and `ethnic = African`

resulted in a mean `frcmax`

of -46.91 compared to `ethnic = White Caucasian`

.

Since the p-value for `drugstat = 1`

is 0.2722, it means that after adjusting for the other variables `bwt`

, `gestag`

, `ethnic`

in this model, the null hypothesis is true and there is no statistically significantassociation between the use of pethidine during labour and VmaxFRC of the baby at birth.

Danny Wong Anaesthetist & Health Services Researcher at the NIAA-HSRC & UCL-DAHR