Simplify fine-tuning ETL

Instead of manipulating data during preprocessing to identify cases and controls, I think it would be a lot simpler to let the user optionally supply a list of patient IDs with their labels, and retrieve only those patients.

In a Jupyter notebook (say), we could build a suitable cohort of cases and controls, and then select only these IDs during chunk iteration. I think this could speed up ETL significantly, since rows for patients outside the cohort would be dropped as soon as each chunk is read.
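
To make the idea concrete, here is a rough sketch of what the filtering step might look like. It uses only the standard library, and everything in it is an assumption for illustration: the `patient_id` column name, the `iter_chunks` and `select_cohort` helpers, and the CSV input are hypothetical and not taken from the existing ETL code.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_chunks(handle, chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of up to chunk_size rows from a CSV file handle (hypothetical helper)."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def select_cohort(
    chunks: Iterable[List[dict]],
    labels: Dict[str, int],
    id_column: str = "patient_id",  # assumed column name
) -> Iterator[dict]:
    """Keep only rows whose ID is in the supplied cohort and attach its label."""
    for chunk in chunks:
        for row in chunk:
            label = labels.get(row[id_column])
            # Compare against None so that a label of 0 (control) is kept.
            if label is not None:
                yield {**row, "label": label}


# Toy data standing in for the real source table.
raw = io.StringIO(
    "patient_id,code\n"
    "p1,A\n"
    "p2,B\n"
    "p3,C\n"
    "p1,D\n"
    "p4,E\n"
)

# Cohort built beforehand (e.g. in a notebook): patient ID -> label.
cohort = {"p1": 1, "p4": 0}

selected = list(select_cohort(iter_chunks(raw, chunk_size=2), cohort))
```

Because the cohort is a dict, each membership check is a constant-time lookup, so the filter adds little overhead per chunk regardless of cohort size.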