diff --git a/DOCUMENTATION.md b/DOCUMENTATION.md index a3e1e6e..f0528ac 100644 --- a/DOCUMENTATION.md +++ b/DOCUMENTATION.md @@ -70,7 +70,7 @@ The convention for this, when updating docs.yml is the following: - : 'index.md' example: -- fa-file-pdf-o Download PDF /API/PDF/API_UG.pdf: 'index.md' +- fa-file-pdf-o Download PDF /output/dist/API/PDF/fafiletext_Download_PDF_/API/PDF/API_UG.pdf: 'index.md' ### Documentation Conventions diff --git a/css/styles.css b/css/styles.css index c3156e9..3b18856 100755 --- a/css/styles.css +++ b/css/styles.css @@ -117,6 +117,7 @@ h5 { } h1.hero-heading { font-size: 40px; + color: #fff; } .hero-search-container { display: flex; @@ -223,7 +224,7 @@ h1.hero-heading { } .my-container { - width: 80%; + width: 92%; display: block; } @@ -453,9 +454,6 @@ footer a:visited { background-color: rgb(206, 202, 202); } -.navbar-nav li a:hover { - color: white; -} #navbarDropdownMenuLink:hover { color: white; diff --git a/docs/API/PDF/index.md b/docs/API/PDF/index.md index 5865d0b..0138107 100644 --- a/docs/API/PDF/index.md +++ b/docs/API/PDF/index.md @@ -1 +1 @@ -A PDF version of this document is available [here](/API/PDF/API_UG.pdf). +A PDF version of this document is available [here](/output/dist/API/PDF/fafiletext_Download_PDF_/API/PDF/API_UG.pdf). diff --git a/docs/Data/PDF/index.md b/docs/Data/PDF/index.md index 123aca4..34f728c 100644 --- a/docs/Data/PDF/index.md +++ b/docs/Data/PDF/index.md @@ -1 +1 @@ -A PDF version of this document is available [here](/Data/PDF/Data_UG.pdf). +A PDF version of this document is available [here](/output/dist/Data_Submission/PDF/fafiletext_Download_PDF_/Data_Submission_Portal/PDF/Data_Submission_Portal_UG.pdf). diff --git a/docs/Data_Portal/PDF/index.md b/docs/Data_Portal/PDF/index.md index b6489f7..ceb16b3 100644 --- a/docs/Data_Portal/PDF/index.md +++ b/docs/Data_Portal/PDF/index.md @@ -1 +1 @@ -A PDF version of this document is available [here](/Data_Portal/PDF/Data_Portal_UG.pdf). +A PDF version of this document is available [here](/output/dist/Data_Portal/PDF/fafiletext_Download_PDF_/Data_Portal/PDF/Data_Portal_UG.pdf). diff --git a/docs/Data_Submission_Portal/PDF/index.md b/docs/Data_Submission_Portal/PDF/index.md index 5ecab0e..1a5b034 100644 --- a/docs/Data_Submission_Portal/PDF/index.md +++ b/docs/Data_Submission_Portal/PDF/index.md @@ -1,4 +1,4 @@ -A PDF version of this document is available [here](/Data_Submission_Portal/PDF/Data_Submission_Portal_UG.pdf). +A PDF version of this document is available [here](/output/dist/Data_Submission/PDF/fafiletext_Download_PDF_/Data_Submission_Portal/PDF/Data_Submission_Portal_UG.pdf). diff --git a/output/dist/Data/Bioinformatics_Pipelines/Bioinformatics_Pipeline:_DNASeq_Analysis/index.html b/output/dist/Data/Bioinformatics_Pipelines/Bioinformatics_Pipeline:_DNASeq_Analysis/index.html index 2c49e26..ce5790f 100644 --- a/output/dist/Data/Bioinformatics_Pipelines/Bioinformatics_Pipeline:_DNASeq_Analysis/index.html +++ b/output/dist/Data/Bioinformatics_Pipelines/Bioinformatics_Pipeline:_DNASeq_Analysis/index.html @@ -652,7 +652,7 @@

Reference Genome

-

DNA-Seq Alignment Pipeline

+

DNA-Seq Alignment Pipeline

DNA-Seq Alignment Command Line Parameters

Note that version numbers may vary in files downloaded from the GDC Portal due to ongoing pipeline development and improvement.

Step 1: Converting BAMs to FASTQs with Biobambam - biobambam2

diff --git a/output/dist/Data/Bioinformatics_Pipelines/Bioinformatics_Pipeline:_mRNA_Analysis/index.html b/output/dist/Data/Bioinformatics_Pipelines/Bioinformatics_Pipeline:_mRNA_Analysis/index.html index 1004125..b0c11f8 100644 --- a/output/dist/Data/Bioinformatics_Pipelines/Bioinformatics_Pipeline:_mRNA_Analysis/index.html +++ b/output/dist/Data/Bioinformatics_Pipelines/Bioinformatics_Pipeline:_mRNA_Analysis/index.html @@ -27,623 +27,1048 @@ /> - + - + - + - - - - + +
-

mRNA Analysis Pipeline


-

Introduction

-

The GDC mRNA quantification analysis pipeline measures gene level expression with STAR as raw read counts. Subsequently the counts are augmented with several transformations including Fragments per Kilobase of transcript per Million mapped reads (FPKM), upper quartile normalized FPKM (FPKM-UQ), and Transcripts per Million (TPM). These values are additionally annotated with the gene symbol and gene bio-type. These data are generated through this pipeline by first aligning reads to the GRCh38 reference genome and then by quantifying the mapped reads. To facilitate harmonization across samples, all RNA-Seq reads are treated as unstranded during analyses.

-

Data Processing Steps

-

RNA-Seq Alignment Workflow

-

The mRNA Analysis pipeline begins with the Alignment Workflow, which is performed using a two-pass method with STAR. STAR aligns each read group separately and then merges the resulting alignments into one. Following the methods used by the International Cancer Genome Consortium ICGC (github), the two-pass method includes a splice junction detection step, which is used to generate the final alignment. This workflow outputs a genomic BAM file, which contains both aligned and unaligned reads. Quality assessment is performed pre-alignment with FASTQC and post-alignment with Picard Tools.

-

Files that were processed after Data Release 14 have associated transcriptomic and chimeric alignments in addition to the genomic alignment detailed above. This only applies to aliquots with at least one set of paired-end reads. The chimeric BAM file contains reads that were mapped to different chromosomes or strands (fusion alignments). The genomic alignment files contain chimeric and unaligned reads to facilitate the retrieval of all original reads. The transcriptomic alignment reports aligned reads with transcript coordinates rather than genomic coordinates. The transcriptomic alignment is also sorted differently to facilitate downstream analyses. BAM index file pairing is not supported by this method of sorting, which does not allow for BAM slicing on these alignments. The splice-junction file for these alignments are also available.

-

Files that were processed after Data Release 25 will have associated gene fusion files.

-

As of Data Release 32 the reference annotation will be updated to GENCODE v36 and HT-Seq will no longer be used.

-

RNA Alignment Pipeline

- - - - - - - - - - - - - - - - - - - - -
I/OEntityFormat
InputSubmitted Unaligned Reads or Submitted Aligned ReadsFASTQ or BAM
OutputAligned ReadsBAM
-

RNA-Seq Alignment Command Line Parameters

-

Note that version numbers may vary in files downloaded from the GDC Data Portal due to ongoing pipeline development and improvement.

-
# STAR-2
+        
+

mRNA Analysis Pipeline

+
+

Introduction

+

+ The GDC mRNA quantification analysis pipeline measures gene level + expression with + STAR + as raw read counts. Subsequently the counts are augmented with + several transformations including Fragments per Kilobase of + transcript per Million mapped reads (FPKM), upper quartile + normalized FPKM (FPKM-UQ), and Transcripts per Million (TPM). These + values are additionally annotated with the gene symbol and gene + bio-type. These data are generated through this pipeline by first + aligning reads to the GRCh38 + reference genome + and then by quantifying the mapped reads. To facilitate + harmonization across samples, all RNA-Seq reads are treated as + unstranded during analyses. +

+

Data Processing Steps

+

RNA-Seq Alignment Workflow

+

+ The mRNA Analysis pipeline begins with the + Alignment Workflow, which is performed using a two-pass method with + STAR. STAR aligns each + read group + separately and then merges the resulting alignments into one. + Following the methods used by the International Cancer Genome + Consortium ICGC (github), the two-pass method includes a splice junction detection step, + which is used to generate the final alignment. This workflow outputs + a genomic BAM file, which contains both aligned and unaligned reads. + Quality assessment is performed pre-alignment with + FASTQC + and post-alignment with + Picard Tools. +

+

+ Files that were processed after Data Release 14 have associated + transcriptomic and chimeric alignments in addition to the genomic + alignment detailed above. This only applies to aliquots with at + least one set of paired-end reads. The chimeric BAM file contains + reads that were mapped to different chromosomes or strands (fusion + alignments). The genomic alignment files contain chimeric and + unaligned reads to facilitate the retrieval of all original reads. + The transcriptomic alignment reports aligned reads with transcript + coordinates rather than genomic coordinates. The transcriptomic + alignment is also sorted differently to facilitate downstream + analyses. BAM index file pairing is not supported by this method of + sorting, which does not allow for BAM slicing on these alignments. + The splice-junction file for these alignments are also available. +

+

+ Files that were processed after Data Release 25 will have associated + gene fusion files. +

+

+ As of Data Release 32 the reference annotation will be updated to + GENCODE v36 and HT-Seq will no longer be used. +

+

+ RNA Alignment Pipeline +

+ + + + + + + + + + + + + + + + + + + + +
I/OEntityFormat
Input + Submitted Unaligned Reads + or + Submitted Aligned Reads + FASTQ or BAM
Output + Aligned Reads + BAM
+

RNA-Seq Alignment Command Line Parameters

+

+ Note that version numbers may vary in files downloaded from the + GDC Data Portal due to ongoing pipeline development and + improvement. +

+
# STAR-2
 
 ### For users with access to the ICGC pipeline:
 
@@ -740,7 +1165,7 @@ 

RNA-Seq Alignment Command Line Parameters

--outSAMheaderHD @HD VN:1.4 --outSAMattrRGline <formatted RG line provided by wrapper>
-
# STAR-2
+          
# STAR-2
 
 STAR \
 --readFilesIn <fastq_files> \
@@ -775,7 +1200,7 @@ 

RNA-Seq Alignment Command Line Parameters

--runThreadN <threads> \ --twopassMode Basic
-
# STAR Genome Index
+          
# STAR Genome Index
 STAR
 --runMode genomeGenerate
 --genomeDir <star_index_path>
@@ -820,47 +1245,92 @@ 

RNA-Seq Alignment Command Line Parameters

--chimMainSegmentMultNmax 1 \ --outSAMattributes NH HI AS nM NM ch
-

*These indices are available for download at the GDC Website and do not need to be built again.

-

mRNA Expression Workflow

-

The primary counting data is generated by STAR and includes a gene ID, unstranded, and stranded counts data. Following alignment, the raw counts files produced by STAR are augmented with commonly used counts transformations (FPKM, FPKM-UQ, and TPM) along with basic annotations as part of the RNA Expression Workflow. These data are provided in a tab-delimited format. GENCODE v36 was used for gene annotation.

-

Note that the STAR counting results will not count reads that are mapped to more than one different gene. Below are two files that list genes that are completely encompassed by other genes and will likely display a value of zero.

- - - - - - - - - - - - - - - - - - - - - -
I/OEntityFormat
InputAligned ReadsBAM
OutputGene ExpressionTXT
-

mRNA Quantification Command Line Parameters

-

HTSeq

-
Counts are produced by STAR concurrent with alignment.
+          

+ *These indices are available for download at the + GDC Website + and do not need to be built again. +

+

mRNA Expression Workflow

+

+ The primary counting data is generated by STAR and includes a gene + ID, unstranded, and stranded counts data. Following alignment, the + raw counts files produced by STAR are augmented with commonly used + counts transformations (FPKM, FPKM-UQ, and TPM) along with basic + annotations as part of the + RNA Expression Workflow. These data are provided in a tab-delimited format. + GENCODE v36 + was used for gene annotation. +

+

+ Note that the STAR counting results will not count reads that are + mapped to more than one different gene. Below are two files that + list genes that are completely encompassed by other genes and will + likely display a value of zero. +

+ + + + + + + + + + + + + + + + + + + + + +
I/OEntityFormat
Input + Aligned Reads + BAM
Output + Gene Expression + TXT
+

mRNA Quantification Command Line Parameters

+

HTSeq

+
Counts are produced by STAR concurrent with alignment.
 
-
htseq-count \
+          
htseq-count \
 -m intersection-nonempty \
 -i gene_id \
 -r pos \
 -s no \
 - gencode.v22.annotation.gtf
 
-
htseq-count \
+          
htseq-count \
 -f bam \
 -r name \
 -s no \
@@ -871,80 +1341,197 @@ 

mRNA Quantification Command Line Parameters

<input_bam> \ <gtf_file> > <counts_file>
-

mRNA Expression Transformation

-

RNA-Seq expression level read counts produced by the workflow are normalized using three commonly used methods: FPKM, FPKM-UQ, and TPM. Normalized values should be used only within the context of the entire gene set. Users are encouraged to normalize raw read count values if a subset of genes is investigated.

-

FPKM

-

The fragments per kilobase of transcript per million mapped reads (FPKM) calculation aims to control for transcript length and overall sequencing quantity.

-

Upper Quartile FPKM

-

The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in which the protein coding gene in the 75th percentile position is substituted for the sequencing quantity. This is thought to provide a more stable value than including the noisier genes at the extremes.

-

TPM

-

The transcripts per million calculation is similar to FPKM, but the difference is that all transcripts are normalized for length first. Then, instead of using the total overall read count as a normalization for size, the sum of the length-normalized transcript values are used as an indicator of size.

-

Calculations

-

FPKM Calculations

-

Note: The read count is multiplied by a scalar (109) during normalization to account for the kilobase and 'million mapped reads' units.

-

Examples

-

Sample 1: Gene A

-
    -
  • Gene length: 3,000 bp
  • -
  • 1,000 reads mapped to Gene A
  • -
  • 1,000,000 reads mapped to all protein-coding regions
  • -
  • Read count in Sample 1 for 75th percentile gene: 2,000
  • -
  • Number of protein coding genes on autosomes: 19,029
  • -
  • Sum of length-normalized transcript counts: 9,000,000
  • -
-

FPKM for Gene A = 1,000 * 10^9 / (3,000 * 50,000,000) = 6.67

-

FPKM-UQ for Gene A = 1,000) * 10^9 / (3,000 * 2,000 * 19,029) = 8.76

-

TPM for Gene A = (1,000 _ 1,000 / 3,000) _ 1,000,000 / (9,000,000) = 37.04

-

Fusion Pipelines

-

The GDC uses two pipelines for the detection of gene fusions.

-

STAR-Fusion Pipeline

-

The GDC gene fusion pipeline uses the STAR-Fusion v1.6 algorithm to generate gene fusion data. -STAR-Fusion pipeline processes the output generated by STAR aligner to map junction reads -and spanning reads to a junction annotation set. It utilizes a chimeric junction file from -running the STAR aligner and produces a tab-limited gene fusion prediction file. -The prediction file provides fused gene names, junction read count and breakpoint information.

-

Arriba Fusion Pipeline

-

The Arriba gene fusion pipeline uses Arriba v1.1.0 to detect gene fusions from the RNA-Seq data of tumor samples.

-

scRNA-Seq Pipeline (single-nuclei)

-

The GDC processes single-cell RNA-Seq (scRNA-Seq) data using the Cell Ranger pipeline to calculate gene expression followed by Seurat for secondary expression analysis.

-

scRNA Gene Expression Pipeline

-

The gene expression pipeline, which uses Cell Ranger, generates three files:

-
    -
  • Aligned reads file (BAM)
  • -
  • Raw counts matrix - contains all barcodes in Market Exchange Format (MEX)
  • -
  • Filtered counts matrix - contains only detected cellular barcodes (MEX)
  • -
-

scRNA Analysis Pipeline

-

The analysis pipeline, which uses the Seurat software, generates three files from an input of Filtered counts matrix:

-
    -
  • Analysis - PCA, UMAP, tSNE values, and graph-based clustering results with associated metadata (TSV).
  • -
  • Differential gene expression - DEG information comparing cells from one cluster to the rest of the cells (TSV).
  • -
  • Full Seurat analysis log as a loom object in HDF5 format.
  • -
-

When the input RNA was extracted from nuclei instead of cytoplasm, a slightly modified quantification method is implemented to include introns. Currently, these single-nuclei RNA-Seq (snRNA-Seq) analyses share the same experimental strategy (scRNA-Seq) in the Data Portal, and can be filtered by querying for aliquot.analyte_type = "Nuclei RNA".

-

File Access and Availability

-

To facilitate the use of harmonized data in user-created pipelines, RNA-Seq gene expression is accessible in the GDC Data Portal at several intermediate steps in the pipeline. Below is a description of each type of file available for download in the GDC Data Portal.

- - - - - - - - - - - - - - - - - - - - -
TypeDescriptionFormat
RNA-Seq AlignmentRNA-Seq reads that have been aligned to the GRCh38 build. Reads that were not aligned are included to facilitate the availability of raw read sets.BAM
STAR Read CountsThe number of reads aligned to each gene, calculated by STAR, along with values using common normalization methods.TSV
+

+ mRNA Expression Transformation +

+

+ RNA-Seq expression level read counts produced by the workflow are + normalized using three commonly used methods: FPKM, FPKM-UQ, and + TPM. Normalized values should be used only within the context of the + entire gene set. Users are encouraged to normalize raw read count + values if a subset of genes is investigated. +

+

FPKM

+

+ The fragments per kilobase of transcript per million mapped reads + (FPKM) calculation aims to control for transcript length and overall + sequencing quantity. +

+

Upper Quartile FPKM

+

+ The upper quartile FPKM (FPKM-UQ) is a modified FPKM calculation in + which the protein coding gene in the 75th percentile position is + substituted for the sequencing quantity. This is thought to provide + a more stable value than including the noisier genes at the + extremes. +

+

TPM

+

+ The transcripts per million calculation is similar to FPKM, but the + difference is that all transcripts are normalized for length first. + Then, instead of using the total overall read count as a + normalization for size, the sum of the length-normalized transcript + values are used as an indicator of size. +

+

Calculations

+

+ FPKM Calculations +

+

+ Note: The read count is multiplied by a scalar + (109) during normalization to account for the kilobase + and 'million mapped reads' units. +

+

Examples

+

Sample 1: Gene A

+
    +
  • Gene length: 3,000 bp
  • +
  • 1,000 reads mapped to Gene A
  • +
  • 1,000,000 reads mapped to all protein-coding regions
  • +
  • Read count in Sample 1 for 75th percentile gene: 2,000
  • +
  • Number of protein coding genes on autosomes: 19,029
  • +
  • Sum of length-normalized transcript counts: 9,000,000
  • +
+

+ FPKM for Gene A = 1,000 * 10^9 / (3,000 * + 50,000,000) = 6.67 +

+

+ FPKM-UQ for Gene A = 1,000) * 10^9 / (3,000 * 2,000 + * 19,029) = 8.76 +

+

+ TPM for Gene A = (1,000 _ 1,000 / 3,000) _ + 1,000,000 / (9,000,000) = 37.04 +

+

Fusion Pipelines

+

The GDC uses two pipelines for the detection of gene fusions.

+

STAR-Fusion Pipeline

+

+ The GDC gene fusion pipeline uses the STAR-Fusion v1.6 algorithm to + generate gene fusion data. STAR-Fusion pipeline processes the output + generated by STAR aligner to map junction reads and spanning reads + to a junction annotation set. It utilizes a chimeric junction file + from running the STAR aligner and produces a tab-limited gene fusion + prediction file. The prediction file provides fused gene names, + junction read count and breakpoint information. +

+

Arriba Fusion Pipeline

+

+ The + Arriba gene fusion pipeline + uses Arriba v1.1.0 to detect gene fusions from the RNA-Seq data of + tumor samples. +

+

+ scRNA-Seq Pipeline (single-nuclei) +

+

+ The GDC processes single-cell RNA-Seq (scRNA-Seq) data using the + Cell Ranger pipeline + to calculate gene expression followed by + Seurat for secondary + expression analysis. +

+

scRNA Gene Expression Pipeline

+

+ The gene expression pipeline, which uses Cell Ranger, generates + three files: +

+
    +
  • Aligned reads file (BAM)
  • +
  • + Raw counts matrix - contains all barcodes in + Market Exchange Format + (MEX) +
  • +
  • + Filtered counts matrix - contains only detected cellular barcodes + (MEX) +
  • +
+

scRNA Analysis Pipeline

+

+ The analysis pipeline, which uses the Seurat software, generates + three files from an input of Filtered counts matrix: +

+
    +
  • + Analysis - PCA, UMAP, tSNE values, and graph-based clustering + results with associated metadata (TSV). +
  • +
  • + Differential gene expression - DEG information comparing cells + from one cluster to the rest of the cells (TSV). +
  • +
  • + Full Seurat analysis log as a loom object in + HDF5 + format. +
  • +
+

+ When the input RNA was extracted from nuclei instead of cytoplasm, a + slightly modified quantification method is implemented to include + introns. Currently, these single-nuclei RNA-Seq (snRNA-Seq) analyses + share the same experimental strategy (scRNA-Seq) in the Data Portal, + and can be filtered by querying for aliquot.analyte_type = "Nuclei + RNA". +

+

+ File Access and Availability +

+

+ To facilitate the use of harmonized data in user-created pipelines, + RNA-Seq gene expression is accessible in the GDC Data Portal at + several intermediate steps in the pipeline. Below is a description + of each type of file available for download in the GDC Data Portal. +

+ + + + + + + + + + + + + + + + + + + + +
TypeDescriptionFormat
RNA-Seq Alignment + RNA-Seq reads that have been aligned to the GRCh38 build. + Reads that were not aligned are included to facilitate the + availability of raw read sets. + BAM
STAR Read Counts + The number of reads aligned to each gene, calculated by STAR, + along with values using common normalization methods. + TSV
+
@@ -1017,4 +1604,4 @@

File Access and Availability

crossorigin="anonymous" > - \ No newline at end of file + diff --git a/output/dist/Data_Portal/PDF/fafiletext_Download_PDF_/Data_Portal/PDF/Data_Portal_UG.pdf/index.html b/output/dist/Data_Portal/PDF/fafiletext_Download_PDF_/Data_Portal/PDF/Data_Portal_UG.pdf/index.html deleted file mode 100644 index e66d8cd..0000000 --- a/output/dist/Data_Portal/PDF/fafiletext_Download_PDF_/Data_Portal/PDF/Data_Portal_UG.pdf/index.html +++ /dev/null @@ -1,686 +0,0 @@ - - - - - - - fa-file-text Download PDF /Data_Portal/PDF/Data_Portal_UG.pdf - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/output/dist/index.html b/output/dist/index.html index 5213f9b..dc1bd9a 100644 --- a/output/dist/index.html +++ b/output/dist/index.html @@ -27,578 +27,807 @@ /> - + - + - + - - - - + +
@@ -775,15 +1004,13 @@

For Developers

- - + - + - + - - \ No newline at end of file +