---
title: "MB1 Data Reading and Merge"
author: "The ManyBabies Analysis Team"
date: '`r format(Sys.time(), "%a %b %d %X %Y")`'
output:
html_document:
toc: true
toc_float: true
number_sections: yes
---
# Intro
This Rmd is the first preprocessing file for the primary ManyBabies 1 (IDS Preference) dataset. The goal is to get everything into a single datafile that can be used for subsequent analyses.
These data are **extremely messy**. Every single variable has a variety of deviations from format in unpredictable ways. Thus, no property of the dataset can be taken for granted; everything must be carefully tested.
Data analytic decisions:
* Download and clean all data in a local copy - pulling from drive is impossible because there are so many messy aspects of the data that need to be manually corrected.
* Try to fix as many things as possible programmatically, so that the cleaning can be reproduced from the raw data.
* Some things are very hard to fix programmatically, e.g. duplicate subject IDs and misnumbering. These have been fixed in the raw data, and each fix is documented on the issues page of the mb1-analysis repository: issues and contact with labs are recorded there, and issues are closed and updated once the copy of the data in `processed_data/participants_cleaned/` or `processed_data/trials_cleaned/` has been modified.
* Test each variable to ensure that it has the properties we want; correct errors; retest.
Outline of data analysis:
* `01_read_and_merge` reads and merges the data. The goal of this file is to create a single inclusive file that has all data from all labs.
* `02_variable_validation` corrects errors in variables and ensures that formats are correct.
* `03_exclusion` reports exclusions and creates diff files.
* `04_confirmatory_analysis` is the set of confirmatory analyses that were preregistered (see [here](https://osf.io/grqau/)).
* `05_exploratory_analysis` contains other, non-preregistered analyses for mb1
```{r setup, echo=FALSE, message=FALSE}
source("helper/common.R")
```
Data import functions are factored into a helper functions file.
```{r}
source("helper/preprocessing_helper.R")
```
# Participant Import
Participant data were reported in very non-uniform formats. It is quite painful to coerce these surprisingly variant datafiles into a single file, so we take a lot of care here to check that these steps didn't introduce issues into the participant data. Lots of hand-checking is useful.
Note that so many columns were misnamed or renamed that we output a set of columns and hand-map them to their correct equivalents. This is done by reading in all files, munging col names and then outputting the unique ones.
```{r}
participant_files <- dir("processed_data/participants_cleaned/",pattern = "*")
col_names <- map_df(participant_files, function(fname) {
pd <- read_multiformat_file(path = "processed_data/participants_cleaned/",
fname = fname)
return(tibble(column = names(pd), file = fname)) # data_frame() is deprecated in favor of tibble()
}) %>%
mutate(column = clean_names(column)) %>%
group_by(column) %>%
mutate(file = ifelse(n() > 1, "many", file)) %>%
distinct()
write_csv(col_names, "metadata/participants_columns_used.csv")
```
Now we create the hand-mapped column key, read it in, and re-read the files using this.
```{r}
participants_columns <- read_csv("metadata/participants_columns_remapping.csv")
#validate that all columns used by participants are in the remapping file
# see_if(all(col_names$column %in% participants_columns$column))
pd_raw <- map_df(participant_files, clean_participant_file)
```
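The remapping itself happens inside `clean_participant_file()` (defined in `helper/preprocessing_helper.R`), but the idea can be sketched as a simple lookup table. The column names and key below are made up for illustration; the real key lives in `metadata/participants_columns_remapping.csv`.

```r
# Hypothetical sketch of the column-remapping step; the real key file has one
# row per raw column name observed across labs, paired with its canonical name.
key <- data.frame(
  column  = c("subject_id", "subid", "age.days", "age_days"),
  mapping = c("subid", "subid", "age_days", "age_days"),
  stringsAsFactors = FALSE
)

remap_columns <- function(df, key) {
  idx <- match(names(df), key$column)
  names(df)[!is.na(idx)] <- key$mapping[idx[!is.na(idx)]]
  df
}

pd_example <- data.frame(subject_id = "mb01", age.days = 120, check.names = FALSE)
pd_example <- remap_columns(pd_example, key)
names(pd_example)  # "subid" "age_days"
```

Names absent from the key pass through unchanged, which is what the (currently commented-out) `see_if` validation above is meant to catch.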
Visualize dataset. (Transposed so that we can see the variable labels).
```{r}
pd_raw %>%
vis_dat() +
coord_flip()
```
```{r}
n_participant_rows = nrow(pd_raw)
unique_participants_by_lab = pd_raw %>%
group_by(lab) %>%
summarize(participants = n_distinct(subid))
```
Right after participant import, there are `r n_participant_rows` rows in the pd_raw dataframe; this includes data from `r length(unique_participants_by_lab$lab)` labs. The number of participants coming from each lab is as follows:
```{r}
print(unique_participants_by_lab, n=1e4)
```
# Trial Import
Do the reading.
```{r}
trial_files <- dir("processed_data/trials_cleaned/", pattern = "*")
td_raw <- map_df(trial_files, read_trial_file)
```
Visualization of data.
```{r}
vis_dat(td_raw)
```
```{r}
n_trial_rows = nrow(td_raw)
```
Directly after read-in, there are `r n_trial_rows` rows in td_raw, representing data from `r length(unique(td_raw$lab))` labs.
# Programmatic Pre-Merge Adjustments
There are many merge problems due to inconsistencies in lab/subid across the two datasets that are reported. Therefore, we do a bunch of adjustments to name formatting - as much as possible is done here, and anything done by hand is documented via github issues (the files in `processed_data/participants_cleaned` and `trials_cleaned` are a result of that work.)
```{r}
td <- td_raw
pd <- pd_raw
# brcl is mb_0101 for trials and MB_0101 for participants
# leeds lcdu uses mb01 in trials, MB01 in participants
# POCD-Northwestern is lowercase in participants and uppercase in trials
td$lab <- tolower(td$lab)
pd$lab <- tolower(pd$lab)
td$subid <- tolower(td$subid)
pd$subid <- tolower(pd$subid)
# konstanz is MB01 in trials and mb_01 in participants
pd$subid[pd$lab == "babylab-konstanz"] <- str_replace(pd$subid[pd$lab == "babylab-konstanz"],
"_", "")
# pocd-northwestern zero-padded subids (mb01..mb09) in the participant data but not in trials; strip the padding
pd$subid[pd$lab == "pocd-northwestern"] <-
str_replace(pd$subid[pd$lab == "pocd-northwestern"], "mb0","mb")
# nijmegen need "_9-12" appended to all subids 9-12mos in the trials file.
affected_nijmegen <- td$lab == "babylab_nijmegen" & !str_detect(td$subid,"_6-9")
td$subid[affected_nijmegen] <- str_c(td$subid[affected_nijmegen], "_9-12")
# essex needs transformation of subid in the participant data: mb9_0* -> mb9_12_0* and mb12_0* -> mb12_15_0*
essex_add12 <- pd$lab == "babylablang-essex" & str_detect(pd$subid, "mb9_")
pd$subid[essex_add12] <- gsub("mb9_", "mb9_12_", pd$subid[essex_add12])
essex_add15 <- pd$lab == "babylablang-essex" & str_detect(pd$subid, "mb12_")
pd$subid[essex_add15] <- gsub("mb12_", "mb12_15_", pd$subid[essex_add15])
# bcrl-unlv has a problem in the trial data where one participant was labelled with consecutive numbers
td$subid[td$lab == "bcrl-unlv" & str_detect(td$subid, "mb_36")] <- "mb_3604"
# fixing a typo
td$subid[td$lab == "ethos-rennes"& td$subid=="mu850"] <- "mu805"
# louisville also has a numbering issue
td$subid[td$lab == "infantcoglab-louisville" & str_detect(td$subid, "234580-")] <- "234580-2"
# lscp-psl subids have "lscp-psl" attached in the trial data; append it in the participant data to match
pd$subid[pd$lab == "lscp-psl"] <- str_c(pd$subid[pd$lab == "lscp-psl"], "lscp-psl")
# lancaster is "lancslab" in participants
pd$lab[pd$lab == "lancslab"] <- "lancaster"
# koku-hamburg is "hamburg" in participants
pd$lab[pd$lab == "hamburg"] <- "koku-hamburg"
# commenting these lines out for now, as the original data files have been replaced and this code no longer applies
# # "paris descartes_manybabies1" is "lpp_parisdescartes2" in participants
# td$lab[td$lab == "paris descartes_manybabies1"] <- "lpp_parisdescartes2"
#
# # paris descartes has suffix on subid names in trial, remove these
# td$subid[td$lab == "lpp_parisdescartes2"] <-
# str_extract(td$subid[td$lab == "lpp_parisdescartes2"], "[A-z]+")
# td$subid[td$lab == "lpp_parisdescartes2"] <-
# str_replace(td$subid[td$lab == "lpp_parisdescartes2"], "_","")
#
# # paris descartes has prefix on subid names in participants, remove these since not present in trials
# pd$subid[pd$lab == "lpp_parisdescartes2"] <-
# str_replace(pd$subid[pd$lab == "lpp_parisdescartes2"], "mb[0-9]+_","")
# cfn-uofn is cfn-uon in participants
pd$lab[pd$lab == "cfn-uon"] <- "cfn_uofn"
# babylab-westernsydney in the participant data is babylab_kingswood in trials
pd$lab[pd$lab == "babylab-westernsydney"] <- "babylab_kingswood"
# babylabkingswood has 'mb' prefix on subject names in pd only; add them to td
td$subid[td$lab == "babylab_kingswood"] <- str_c("mb", td$subid[td$lab == "babylab_kingswood"])
```
Dropping % from language exposure columns. This was originally part of variable_validation, but was causing an issue when reading the data file generated by this read_and_merge script (numeric values were being changed to NAs).
```{r}
pd$lang1_exposure <- lang_exp_to_numeric(pd$lang1_exposure)
pd$lang2_exposure <- lang_exp_to_numeric(pd$lang2_exposure)
pd$lang3_exposure <- lang_exp_to_numeric(pd$lang3_exposure)
pd$lang4_exposure <- lang_exp_to_numeric(pd$lang4_exposure)
```
Values in langX_exposure with decimals were being changed to NAs when the data were read into 02_variable_validation. Truncating here (dropping the fractional part) to avoid that issue.
```{r}
pd$lang1_exposure <- trunc(pd$lang1_exposure)
pd$lang2_exposure <- trunc(pd$lang2_exposure)
pd$lang3_exposure <- trunc(pd$lang3_exposure)
pd$lang4_exposure <- trunc(pd$lang4_exposure)
```
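To be explicit about the behaviour: `trunc()` drops the fractional part (truncating toward zero) rather than rounding to the nearest integer.

```r
# trunc() discards the fractional part; it does not round to nearest.
trunc(87.5)            # 87
trunc(c(12.25, 99.9))  # 12 99
```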
General sanitization of lab and subid variables:
```{r}
pd_len = nrow(pd)
td_len = nrow(td)
pd <- pd %>%
mutate(lab = str_replace_all(lab, '[^[:alnum:]]',''))%>%
mutate(subid = str_replace_all(subid, '[^[:alnum:]]',''))
td <- td %>%
mutate(lab = str_replace_all(lab, '[^[:alnum:]]',''))%>%
mutate(subid = str_replace_all(subid, '[^[:alnum:]]',''))
assert_that(pd_len == nrow(pd), td_len == nrow(td))
```
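For example, stripping every non-alphanumeric character collapses the various separator conventions seen above into one canonical form. The ids below are made up; `gsub()` here is the base-R equivalent of the `str_replace_all()` call above.

```r
# Hypothetical ids showing the effect of the sanitization step.
ids <- c("MB_01", "mb-01", "mb 01", "mb01")
clean_ids <- gsub('[^[:alnum:]]', '', tolower(ids))
clean_ids  # all four collapse to "mb01"
```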
# Pre-Merge Checking
In this section, we first need to do some formatting/data cleaning to ensure that merging participant- and trial-level data works correctly; note that there should *not* be any dropping of participants/trials for 'real' reasons (i.e., fuss-outs) in this section (that happens in `03_exclusion`).
To document any inconsistencies that remain, here is a table of all participant IDs that
are missing from one or another of the data files. Each of these should either eventually be
removed from this list (via better data cleaning above), or documented after the merge (saved to metadata in some form).
```{r}
participants_premerge_td <- td %>%
group_by(lab, subid) %>%
summarize(trialcount = n_distinct(trial_num),
rowcount = length(trial_num), trial_error = last(trial_error_type),present_td=TRUE)
participants_premerge_pd <- pd %>%
group_by(lab, subid) %>%
summarize(lines_in_pd = n(), age_days = first(age_days), notes = first(notes), session_error_type = first(session_error_type), present_pd=TRUE)
all_participants_premerge <- merge(participants_premerge_td, participants_premerge_pd, all.x=TRUE, all.y=TRUE) %>%
replace_na(list(present_pd = FALSE, present_td=FALSE))
unmatched_participants_premerge = filter(all_participants_premerge, !(present_td & present_pd))
#Check if any of these are in the metadata that records resolved-unmatched records (pilot, no-trial real kids, subIDs not actually used...)
conf <- read_csv('metadata/true_unmatched_participants.csv') %>%
filter(Confirmed == 'X')%>%
select(subid, lab, Confirmed)
unmatched_participants_premerge <- merge(unmatched_participants_premerge, conf, all.x=TRUE) %>%
replace_na(list(Confirmed = FALSE))%>%
filter(Confirmed != 'X')
write_csv(unmatched_participants_premerge, 'metadata/unconfirmed_unmatched_participants.csv')
```
Here come some tables for summarizing types of errors that occur (helpful for knowing which labs will need checking/contacting by hand):
<!-- TODO: Add diagnostics: Is it **ever** okay for a participant to disappear during the merge? If so, what are the cases? Make sure the dropped participants fit those criteria. -->
```{r}
unique_participants_by_lab = pd %>%
group_by(lab) %>%
summarize(participants = n_distinct(subid))
unique_participants_by_lab_from_trials <- td %>%
group_by(lab) %>%
summarize(participants = n_distinct(subid))
lab_checker <- merge(unique_participants_by_lab,
unique_participants_by_lab_from_trials,
by="lab",
all.x=TRUE, all.y=TRUE,
suffixes = c(".pd",".td")) %>%
mutate(participant_concord = participants.pd == participants.td)
```
Looking at participants by lab, there are a few kinds of inconsistencies.
* These labs have more participants in the trial data than in the participant data - particularly confusing! The following issues remain.
```{r}
filter(lab_checker, participants.td > participants.pd)
```
* These labs have fewer participants in the trial data than in the participant data. This may be valid if the 'extras' are babies who came to the lab but didn't finish any trials; however, this should be checked more extensively.
```{r}
filter(lab_checker, participants.td < participants.pd)
```
* Finally, these labs are missing a data source altogether (or have bad lab-id values). As of 8/28, no labs appear here, suggesting we have all the datafiles!
```{r}
filter(lab_checker, is.na(participants.td) | is.na(participants.pd))
```
These next two tests should be passed if the tables above looked OK.
```{r}
validate_that(length(unique(pd$lab)) == length(unique(td$lab)))
validate_that(all(lab_checker$participant_concord))
```
# Merge
Use `inner_join` to get matching participants. Our process is then to use testing to identify where we are missing participants in this join.
```{r}
d <- inner_join(td %>% select(-file), pd %>% select(-file))
```
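As a toy illustration of the join behaviour (ids made up; base `merge()` with default arguments is an inner join on the shared columns, like `inner_join` above): rows whose (lab, subid) pair is missing from either side simply drop out.

```r
# mb02 has trial rows but no participant row, so it disappears in the join.
td_toy <- data.frame(lab = "lab_a",
                     subid = c("mb01", "mb01", "mb02"),
                     looking_time = c(9, 7, 12))
pd_toy <- data.frame(lab = "lab_a", subid = "mb01", age_days = 120)
merged_toy <- merge(td_toy, pd_toy)  # joins on lab and subid
nrow(merged_toy)  # 2: only the mb01 trial rows survive, now carrying age_days
```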
# Test Merge
```{r}
participants_postmerge <- d %>%
group_by(lab, subid) %>%
summarize(trialcount = n_distinct(trial_num),
rowcount = length(trial_num))
```
Compare the number of rows and trials to pre-merge values! In the resulting dataframe, we have `r nrow(setdiff(select(participants_premerge_td, lab, subid, trialcount, rowcount), participants_postmerge))` participants missing from TD and `r nrow(setdiff(select(participants_premerge_pd, lab, subid), select(participants_postmerge, lab, subid)))` from PD. Where did they go, and why did they drop?
Here's a full list of participants (and labs) who get lost during the merge. Use `anti_join` to detect lost participants. (As a sanity check, this should match the pre-merge list of 'problem' subids.)
```{r}
lost_participants_td <- anti_join(participants_premerge_td, participants_postmerge)
lost_participants_pd <- anti_join(participants_premerge_pd, participants_postmerge)
validate_that(nrow(lost_participants_td) + nrow(lost_participants_pd) == nrow(unmatched_participants_premerge))
datatable(lost_participants_td)
datatable(lost_participants_pd)
```
When the participants/trials merge is working perfectly, the following tests will be passed.
```{r}
validate_that(nrow(lost_participants_td) == 0)
validate_that(nrow(lost_participants_pd) == 0)
```
# Uniqify subid
Note that `subid` is not unique except within lab; this would compromise our regression models in the final paper ([issue](https://github.com/manybabies/mb1-analysis-public/issues/8)). We fix this here by creating a new variable.
```{r}
d$subid_unique <- str_c(d$lab, ":", d$subid)
```
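For instance, if two hypothetical labs both used `mb01`, the combined id keeps them distinct (`paste0()` here is equivalent to the `str_c()` call above):

```r
# Made-up example: identical subids in different labs stay distinct.
d_toy <- data.frame(lab = c("lab_a", "lab_b"), subid = c("mb01", "mb01"))
d_toy$subid_unique <- paste0(d_toy$lab, ":", d_toy$subid)
d_toy$subid_unique  # "lab_a:mb01" "lab_b:mb01"
```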
# Output
Output intermediate files.
```{r}
write_csv(d, "processed_data/01_merged_ouput.csv")
```