- Quickstart Guide
- Download Data from Manifest File Using the GDC Client
- Run Processing Pipeline
- Sample Subtype Classification using Gene Expression Data
- Sample Subtype Classification using DNA Methylation Data
Install requirements - detailed instructions are found on the Requirements page:
- Install Python 3+
- Install GDC Data Transfer Tool Client
Ensure that the steps on the Requirements page are completed (this includes creating a working environment, signing in, and manually downloading required data).
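A quick way to confirm the two installs before continuing is to check that the executables are on your PATH. This is a minimal sketch, not part of the repository: `check_tool` is a hypothetical helper, and it assumes the GDC Data Transfer Tool is installed under its usual executable name, `gdc-client`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository):
# report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Assumption: the GDC Data Transfer Tool is installed as "gdc-client".
# Adjust the names if your installation differs.
for tool in python3 gdc-client; do
    check_tool "$tool" || true
done
```

Any line starting with `missing:` points to an install step on the Requirements page that still needs attention.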
Download Gene Expression Data
bash scripts/gdc_download.sh PAAD
This will create subfolders in `data-raw/GEXP` and place the GDC molecular matrices there.
Options for cancer cohort include: ALL, BLCA, BRCA, COADREAD, ESO, HNSC, KID, LGGGBM, LIHCCHOL, LUNG, OV, PAAD, SARC, SKCM, UCEC
For more details on each cancer cohort option, see the Cohort Options page.
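Because the cohort name is passed straight to the scripts, it can help to validate it first. The sketch below is an illustration only: `is_valid_cohort` and the `COHORTS` variable are hypothetical names, and the actual download call is left commented out.

```bash
#!/usr/bin/env bash
# Cohort options as listed in this guide.
COHORTS="ALL BLCA BRCA COADREAD ESO HNSC KID LGGGBM LIHCCHOL LUNG OV PAAD SARC SKCM UCEC"

# Hypothetical helper: succeed only if $1 is one of the listed cohorts.
is_valid_cohort() {
    for c in $COHORTS; do
        [ "$c" = "$1" ] && return 0
    done
    return 1
}

cohort="PAAD"
if is_valid_cohort "$cohort"; then
    echo "OK: $cohort"
    # bash scripts/gdc_download.sh "$cohort"   # uncomment to run the download
else
    echo "Unknown cohort: $cohort" >&2
fi
```

The comparison is case-sensitive, so `paad` would be rejected.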
Example for the PAAD cohort:
bash scripts/process.sh PAAD data/prep
This creates the file data/prep/<CANCER>_GEXP/<CANCER>_GEXP_prep2_<TYPE>.tsv, which is prepped for distance calculations.
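To confirm that processing produced the expected files, the output path can be built from the cohort and sample type. This is a sketch under assumptions: `prep_file` is a hypothetical helper, the output directory is `data/prep` as in the example command, and `<TYPE>` is taken to be `Tumor` or `Model`, the two values that appear in the classifier example later in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the expected prep file path.
#   $1 = cancer cohort (e.g. PAAD)
#   $2 = sample type (assumed to be Tumor or Model)
prep_file() {
    printf 'data/prep/%s_GEXP/%s_GEXP_prep2_%s.tsv\n' "$1" "$1" "$2"
}

# Check whether each expected file exists for the PAAD cohort.
for type in Tumor Model; do
    f=$(prep_file PAAD "$type")
    if [ -f "$f" ]; then
        echo "ready: $f"
    else
        echo "not found: $f"
    fi
done
```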
Options for cancer cohort include: ALL, BLCA, BRCA, COADREAD, ESO, HNSC, KID, LGGGBM, LIHCCHOL, LUNG, OV, PAAD, SARC, SKCM, UCEC
For more details on each cancer cohort option, see the Cohort Options page.
The goal of this analysis is to get cancer subtype predictions for HCMI samples (organoids, cell cultures, xenografts, etc.). To accomplish this, we use the top-performing pre-trained machine learning models (dockerized TMP models trained on pre-processed TCGA data). Specifically, we are interested in using gene expression from the HCMI samples and, eventually, comparing primary tumors to their corresponding models (organoids, cell cultures, xenografts, etc.).
The TMP models (pre-trained models) are specific to TCGA cancer cohorts (TCGA abbreviations); therefore, we split the HCMI data into TCGA cancer cohorts based on sample metadata.
Run gene expression classifier pipeline:
# Arguments: cancer, tumor-file, model-file, transformed-dir
bash scripts/run_classify_GEXP.sh \
PAAD \
data/prep/PAAD_GEXP/PAAD_GEXP_prep2_Tumor.tsv \
data/prep/PAAD_GEXP/PAAD_GEXP_prep2_Model.tsv \
data/classifier_gexp/ml_ready_qrank
Results can be found in data/classifier_gexp/ml_predictions_qrank/combo/HCMI_TMPsubtype_qRank_<CANCER>.tsv
Note: LUNG includes LUAD and LUSC, and ESO includes GEA and ESCC, during transformation and classification; these are then merged in the post-classification summary.
The goal of this analysis is to get cancer subtype predictions for HCMI samples (organoids, cell cultures, xenografts, etc.). To accomplish this, we use the top-performing pre-trained machine learning models (dockerized TMP models trained on pre-processed TCGA data). Specifically, we are interested in using DNA methylation data from the HCMI samples and, eventually, comparing primary tumors to their corresponding models (organoids, cell cultures, xenografts, etc.).
The TMP models (pre-trained models) are specific to TCGA cancer cohorts (TCGA abbreviations); therefore, we split the HCMI data into TCGA cancer cohorts based on sample metadata.
Run DNA methylation classifier pipeline:
# Arguments: cancer, feature-matrix-file
bash scripts/run_classify_METHYL.sh \
SKCM \
data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_SKCM.tsv
Results can be found in data/classifier_methyl/ml_predictions/combo/HCMI_METH_TMPsubtypes.<CANCER>.tsv
Note: LUNG includes LUAD and LUSC, and ESO includes GEA and ESCC, during transformation and classification; these are then merged in the post-classification summary.
Second Example for Combination Cohort
bash scripts/run_classify_METHYL.sh \
LUNG \
data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_LUNG.tsv
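The two methylation examples differ only in the cohort name, so the command can be generated per cohort. The sketch below is a dry run that prints the commands rather than executing them. `methyl_cmd` is a hypothetical helper, and it assumes that feature matrices for other cohorts follow the same `20231211_..._<CANCER>.tsv` naming as the SKCM and LUNG files shown above; verify the file names in data/classifier_methyl/processed before relying on it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the classifier command for one cohort.
#   $1 = cancer cohort
# Assumption: the feature matrix name follows the pattern of the
# SKCM and LUNG examples in this guide.
methyl_cmd() {
    printf 'bash scripts/run_classify_METHYL.sh %s data/classifier_methyl/processed/20231211_HCMI_TMP_subtype_prediction_feature_matrix_%s.tsv\n' "$1" "$1"
}

# Dry run: print the commands instead of executing them.
for cohort in SKCM LUNG; do
    methyl_cmd "$cohort"
done
```

Once the printed commands look right, each line can be copied into the terminal and run from the repository root.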