If in doubt, aim for consistency in every column. Never try to record more than one “type” of thing in a column.
...,-3,-2,0,1,2,3,...
versus 3.141529
A string is any sequence of characters.
180mg
is not a number!TRUE
or FALSE
statements.1
or 0
is a common shorthand- apples
- oranges
- pears
1,2,3
for convenience but not because 1<2<3
.- Strongly disagree
- Disagree
- Neither agree/disagree
- Agree
- Strongly agree
1,2,3,4,5
for convenience and understands that 1<2<3<4<5
In R studio,
integers <- as.integer(c(1, 3, 15, 16))
decimals <- c(1.4, 3.5, 15.55, 16.4)
bools <- c(T, T, F, T)
dates <- as.Date(c("22/04/2016", "13/05/1997"), format = "%d/%m/%Y")
strings <- c("These are", "Strings")
factors <- as.factor(c("Apples", "Pears", "Lemons"))
ordered_factors <- factor(c("Good", "Better", "Best"), ordered = TRUE)
then use str()
to see the data ‘structure’
str(integers)
## int [1:4] 1 3 15 16
str(decimals)
## num [1:4] 1.4 3.5 15.6 16.4
str(bools)
## logi [1:4] TRUE TRUE FALSE TRUE
str(dates)
## Date[1:2], format: "2016-04-22" "1997-05-13"
str(strings)
## chr [1:2] "These are" "Strings"
str(factors)
## Factor w/ 3 levels "Apples","Lemons",..: 1 3 2
str(ordered_factors)
## Ord.factor w/ 3 levels "Best"<"Better"<..: 3 2 1
Put all your variables in columns - the thing you’re measuring, like ‘weight’, ‘temperature’ or ‘SBP’. Break things down into their most basic constituents, and keep units in your headers only.
Put each observation in its own row. Think very carefully about what constitutes your basic observation. Often it’s your patient, but it may not be as intuitive as you think.
Have a single ‘header’ row to label your columns
Don’t combine multiple pieces of information in one cell.
Leave the raw data raw - don’t mess with it! That means no formulas anywhere in your spreadsheet!
Export the cleaned data to a text based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.
just add some white space and dividers
And all excel does is present it to us in an easy to use format.
But you always need to remember that you need to go back and forth between both formats.
So merged cells, colours, comments will both be lost and confuse.
Identify and fix these common mistakes