Chapter 18 Organizing your research data

It is very important to start the data curation process as you prepare to close data collection for your project. This helps to ensure that no data are lost, that any paperwork and digital data are consistent with each other, and crucially, that the PI knows what materials are available and how they have been archived.

18.1 Is data collection closed?

Because many of the analyses we carry out are based on Bayesian statistics, we do not always need to recruit to a particular, pre-registered N. So in some cases we may stop-and-restart data collection for a particular project. At each of these potential stopping moments though, it is important to log what data we have, what decisions we have made about it, and archive consent forms. When data collection is rolling and managed by one constant person, we might take a less formal approach. But usually, the close of data collection, even if it might be restarted again, should be accompanied by a Data Round-up Meeting.

18.2 The Data Round-up Meeting

It’s often the case that the person collecting the data is not the only person analyzing it. But the person responsible for data collection is the only person who knows details about how each participants’ sessions progressed, whether anything strange happened during a session that merits exclusion, and where the physical and digital data are. We do not want to lose this knowledge. The Data Round-up Meeting ensures that we preserve important information the data collector has as soon as possible after data collection is over.

Everyone involved in data collection should attend the Data Round-up Meeting, because we will at the same time make decisions about participant exclusions. Everyone who collected data should bring their lab notebook. If a face-to-face meeting including all personnel is impossible, data collectors should at least make their lab notebook available to the PI for this meeting, and have prepared a summary of which participants they think must be left out and why.

18.2.1 Preparing for the Data Round-up Meeting

Everyone attending the meeting should prepare in advance, so that it can progress smoothly. If you were responsible for data collection, you must:

  1. Gather all paperwork participants completed and bring it to the meeting. If any data was recorded on paper, there should be instructions for putting it into a digital format. This should also be done ahead of the meeting.
  2. Scan all consent forms and send the .pdfs to the PI.
  3. Make sure all the data you collected or coded is saved to its designated directory on labBackUp_cmorey.
  4. If you are analyzing data in software other than R, think about what output you need from the raw data set, and what form it should take.

All of these actions must be finished before we can begin.

18.2.2 Pull all the data together

First, the team shall compile the data associated with the project from the labBackUp_cmorey folder. It is essential that this folder is up-to-date ahead of the meeting - data collectors, please ensure that you have copied all data to the back-up ahead of time.

At the meeting, we will use this code to compile each individual backed-up data file into a single data frame, called d in this example:

d = map_df(list.files(path = "", full.names = TRUE), 
                 read_csv, col_names = T)

This code creates a dataframe called d that includes all the .csv files in the project directory, which will be entered into the path argument.

18.2.3 Anonymization

We shall then examine d’s headings for information that could identify the participant. Anything that has to do with the date and time of the session must be omitted. Recorded birthdates must be omitted. Usually, we would not have recorded anything more personal than that, but if so, it must be omitted at this stage.

Obviously, we do need the participants’ ages to describe our sample. We can calculate that. First, it is a good idea to first look at the entries for any year-of-birth or birthdate variable to make sure they are plausible.

unique(d$YoB)

This code will produce a list of each unique value at YoB in the data frame. Check that they all seem plausible. If they are not, consider whether a weird value might have been a typo by figuring out who interacted with that participant and consulting the relevant notebook. The people in the room must reach an agreement on what to do in the event of a weird value. Options include: considering the age missing for that participant, omitting the participant if they fall outside the recruited age range, judging that a transposition was entered (e.g., entering a YoB of 2020 rather than 2002 was an error because the participant was definitely an adult, or that a 2-digit value like 98 actually meant 1998). In cases where the typo seems obvious, we can decide to fix it, but this must be documented in the script.

If we recorded Year of Birth, we can now create a new variable converting birth year to age at the time of the study. For most purposes, this does not need to be very precise. We can do this with a simple mutate.

d = d %>% mutate(age = 2018 - YoB) # current year - column name containing the birth year

If we recorded the full birthdate, we probably need age in more precision. The R package lubridate is useful for sorting out birthdates and calculating precise ages from birthdates and run dates.

Once we have created whatever new, anonymous variables we need from the potentially identifying data, we can create a new data set with the identifiable data omitted. We would do that with script like this, using the select function:

# Remove columns that contain ids or timestamps (times could limit the pool to specific people given that the location of the test is known)
symsp = symsp %>% select(-c(Clock.Information, Email, SessionDate, SessionTime)) # The "-" negates the select, meaning that everything *except* the named columns remain.

Include all the variables you want to omit within the -c() list. The resulting data frame will include everything except those variables. View the resulting data frame to check that this indeed worked as intended.

This new data frame is the one we will analyze. Save it now to the PI’s local directory, and also upload it to the OSF page for this project so all of the lab personnel who need to use it can access it.

# Save it - this is the version that goes on OSF
write_csv(symsp, "symSpan_RuG_anon.csv")

We may occasionally need the identifiable data, particularly the timestamps in it. If we need to prove that the data were collected after a certain date, these timestamps are strong evidence. But the original data need to be kept only in a secure place that is never accessible publicly and that is regularly automatically backed up. Currently the only available spaces that meet this requirement are the Shared and Home drives. Our policy: the compiled, non-anonymized data are stored on Candice Morey’s Home drive. She will store the script we write to anonymize the data and the original data on a sub-directory of My Documents/dataSets_cardiff on her Home Drive. But ideally, after the Data Round-up Meeting, the data do not need to be re-compiled again, and all analysts have the same document to work from via OSF.

18.2.4 Participant exclusions

We need to confirm that all of the data we just rounded-up are valid for analysis. Before looking at individuals’ performance, we may already know that some participants must be excluded. Depending on the study, we might have recorded responses to eligibility criteria in the data, or may have this information recorded on paper. At this point, we go through this information and compile a list of participant id numbers to leave out of analyses. We want to compile this list as early as possible in the data analysis script, so that any sub-selecting we do occurs to a data frame that already excludes ineligible participants. We also want to keep documentation of why each participant was omitted. We do this in an R markdown notebook. Here is an example code chunk documenting reasons for exclusion and carrying out the exclusions:

# The participants we are excluding based on inability to label one or more pictures: 007, 015, 018, 044

# Checking your lab's log or notes, describe any participant number who should be excluded based on those notes here.
# 37 was videoed but mouth was not visible (and there was no coder in real-time)
# 54: program malfunction
# 061: experimenter (CCM) accidentally modelled speech rehearsal during practice
# 095, 96, 97 were siblings or friends who were run, but not members of the intended age groups. These participants do not need to be counted as "excluded" for the exclusions in Table B, because they were not sampled according to our plan.
# Q: Actually 96 is 6 years-old - can we include, or was there another problem? A: There is ambiguity about 96's age. Value in lab js and log are different. Experimenter is sure that the offline recording of 96's age is the correct one.

# Put all the participant ids who are to be excluded inside this list
exSubs = c("37", "007", "015", "018", "044", "54","061", "095", "96", "97")

d = d %>% filter(!(participant_id%in%(exSubs)))

# Note excluded participants ages for inclusion in Table B
# 007 - 5 years
# 015 - 6 years
# 018 - 6 years
# 044 - 6 years
# 54 - 5 years
# 061 - 6 years
# 37 - 6 years
# Totals:
# 5 years: 2
# 6 years: 5

The object exSubs lists each participant id to be omitted. It is important that this list reflects how the data were recorded. We should check at this point whether the participant ids were recorded as characters (and should therefore be inside ““s) or as numbers (no quotes). When they are characters, we should confirm that each participant id is unique and always recorded with the exact same character string (and that that character string is the one we put in exSubs).

The filter function performed on data frame data instructs R to keep all rows where the participant id does not match one from the exSubs list.

We will also conduct any performance-based checks that may be needed at this point, and add participants who do not meet them to the exSubs list and run the filtering step.

18.2.5 Data wrangling

For lab members who are not using R for their analysis, we will discuss the form in which data are needed for your analysis. What variables do you need? Are there dependent variables that need to be calculated? Is any additional filtering (e.g., excluding fast or slow RTs) required? Should it be in long or wide format? To ensure that you exclude the same data that we agreed should be excluded in this meeting, we shall output the data for your analysis from the working R script.

18.2.6 Milestone: Data posted on OSF

At the end of this meeting, the anonymized data file that we begin all scripted analyses with should be posted to OSF. This helps to prevent inadvertent data loss, and means we start all analyses with an identical data file. The exclusion script should be made available to everyone as well. Any R analyses starting from the anonymized file should preserve the participant exclusion code agreed in the meeting.