Oncogenic Results of Analyzing the Genome

ORANGE summarizes the key outputs from all algorithms in the Hartwig suite into a single PDF and JSON file (see orange-datamodel).

  1. The algo depends exclusively on config and data produced by the Hartwig genomic pipeline and hence can always be run as the final step without any additional local data or config.
  2. ORANGE respects the mode in which the pipeline has been run (tumor-only vs tumor-reference, targeted (panel) vs whole genome). In case RNA data is provided, the algo combines the RNA and DNA data to present an integrated DNA/RNA analysis of a tumor sample.
  3. ORANGE can be configured to convert all germline driver variants to somatic driver variants, thereby obfuscating the germline driver part of the analysis without actually losing this data.
  4. Every event that is labeled as a driver by any of the Hartwig algorithms is displayed in the PDF along with the driver likelihood.
  5. An additional exhaustive WGS and WTS scan is performed for anything interesting that may be relevant but was not picked up as a driver. Details of what is considered interesting are described below.
  6. A comprehensive range of QC measures and plots is displayed which provides in-depth details about the data quality of the samples provided.

Example reports based on the publicly available melanoma cell line COLO829 can be found here:

| Type | File | Note |
|---|---|---|
| WGTS Tumor-reference | (please inquire at Hartwig) | We have no RNA data for COLO829 |
| WGS Tumor-reference | COLO829_WGS_TumorReference.pdf | |
| Targeted Tumor-only | COLO829_Targeted_TumorOnly.pdf | (Derived from the tumor-reference data) |

Note that neither this readme nor the report itself contains any documentation about the Hartwig algorithms and their output. For questions in this area, please refer to the documentation of the specific algorithm.

The front page of the ORANGE report lists all high-level stats about the sample along with genome-wide visualisations of all mutations and SNV/Indel clonality. In addition to this front page, the following chapters are present in the ORANGE report:

  • Somatic Findings: What potentially relevant mutations have been found in the tumor specifically?
  • Germline Findings: What potentially relevant mutations have been found in the germline data?
  • Immunology: What can we tell about the immunogenicity of the tumor sample?
  • RNA Findings: What potentially relevant findings have we detected in RNA?
  • Cohort Comparison: How do the various properties of this tumor compare to existing cancer cohorts?
  • Quality Control: Various stats and graphs regarding the quality of the data and interpretation thereof.

Note that the JSON file contains every mutation found in the analysis and hence is much more extensive than the PDF. The JSON file is meant to be used by downstream applications that further interpret the results of the molecular analysis.
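
For a downstream application, a structural overview of the JSON is often a useful first step. The following is a minimal Python sketch that loads the file and describes each top-level entry. It deliberately assumes nothing about field names, since the schema is defined in orange-datamodel; the function names and the file name in the usage note are hypothetical.

```python
import json
from pathlib import Path


def load_orange_json(json_path):
    """Parse an ORANGE JSON file into plain Python objects."""
    with Path(json_path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize_top_level(data):
    """Return {key: description} for every top-level entry of a JSON object."""
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            # Containers: report the type and how many entries they hold.
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            # Scalars (str, int, float, bool, None): report the type only.
            summary[key] = type(value).__name__
    return summary
```

A call such as `summarize_top_level(load_orange_json("tumor_sample.orange.json"))` (hypothetical file name) then shows which sections are present before any section-specific parsing is written.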

Running ORANGE

ORANGE requires the output of various Hartwig algorithms, along with some resource files (doid json, cohort distributions, driver genes, known fusions and ensembl data cache). The resource files required to run ORANGE can be found here for either 37 or 38 reference genome version.

Base (targeted/panel) tumor-only DNA mode

java -jar orange.jar \
    -experiment_type "PANEL" \
    -tumor_sample_id tumor_sample \
    -primary_tumor_doids "doid1;doid2" \
    -ref_genome_version "37" \
    -output_dir /path/to/where/to/write/output \
    -doid_json /path/to/input_doid_tree.json \
    -cohort_mapping_tsv /path/to/input_cohort_mapping.tsv \
    -cohort_percentiles_tsv /path/to/input_cohort_percentiles.tsv \
    -driver_gene_panel /path/to/driver_gene_panel.tsv \
    -known_fusion_file /path/to/known_fusion_file.tsv \
    -ensembl_data_dir /path/to/ensembl_data_directory \
    -tumor_sample_wgs_metrics_file /path/to/tumor_sample_wgs_metrics \
    -tumor_sample_flagstat_file /path/to/tumor_sample_flagstats \
    -sage_dir /path/to/sage_somatic_output \
    -purple_dir /path/to/purple_output \
    -purple_plot_dir /path/to/purple_plots \
    -linx_dir /path/to/linx_somatic_output \
    -linx_plot_dir /path/to/optional_linx_somatic_output_plots \
    -lilac_dir /path/to/lilac_output 

Note that linx_plot_dir is an optional parameter and can be left out completely in case linx has not generated any plots.

Note that primary_tumor_doids can be left blank (""). This parameter is used to look up cancer-type-specific percentiles for various tumor characteristics. If primary tumor doids are not provided, percentiles are calculated against the full HMF database only.
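
When ORANGE is launched from a script, the command above can be assembled programmatically. The sketch below is one possible Python wrapper, not part of ORANGE: the function name, the `paths` dictionary and the default jar location are assumptions, while the flags are the ones shown in the command above. It omits `-linx_plot_dir` unless a directory is given and passes an empty string when no DOIDs are known.

```python
# Path flags taken from the base (targeted/panel) tumor-only command.
REQUIRED_PATH_FLAGS = (
    "output_dir", "doid_json", "cohort_mapping_tsv", "cohort_percentiles_tsv",
    "driver_gene_panel", "known_fusion_file", "ensembl_data_dir",
    "tumor_sample_wgs_metrics_file", "tumor_sample_flagstat_file",
    "sage_dir", "purple_dir", "purple_plot_dir", "linx_dir", "lilac_dir",
)


def build_panel_command(paths, tumor_sample_id, doids=(), linx_plot_dir=None,
                        ref_genome_version="37", jar="orange.jar"):
    """Assemble the argument list for a targeted tumor-only run.

    `paths` maps every flag in REQUIRED_PATH_FLAGS (without the leading
    dash) to a file or directory path.
    """
    missing = [flag for flag in REQUIRED_PATH_FLAGS if flag not in paths]
    if missing:
        raise ValueError(f"missing required paths: {', '.join(missing)}")

    command = [
        "java", "-jar", jar,
        "-experiment_type", "PANEL",
        "-tumor_sample_id", tumor_sample_id,
        # No DOIDs yields "", i.e. the blank value described above.
        "-primary_tumor_doids", ";".join(doids),
        "-ref_genome_version", ref_genome_version,
    ]
    for flag in REQUIRED_PATH_FLAGS:
        command += [f"-{flag}", str(paths[flag])]
    if linx_plot_dir is not None:
        # Optional: only pass the plot directory when LINX produced plots.
        command += ["-linx_plot_dir", str(linx_plot_dir)]
    return command
```

The returned list can be handed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems, including for the empty DOID value.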

Additional parameters when whole genome tumor DNA data is available

    -virus_dir /path/to/virus_interpreter_output \
    -chord_dir /path/to/chord_output \
    -cuppa_dir /path/to/cuppa_output \
    -sigs_dir /path/to/sigs_output

Also, the value of the -experiment_type parameter should be set to WGS for all whole genome configurations.

Additional parameters when whole genome germline DNA data is available

    -reference_sample_id reference_sample \
    -ref_sample_wgs_metrics_file /path/to/reference_sample_wgs_metrics \
    -ref_sample_flagstat_file /path/to/reference_sample_flagstats \
    -sage_germline_dir /path/to/sage_germline_output \
    -linx_germline_dir /path/to/linx_germline_output \
    -peach_dir /path/to/peach_output.tsv 

Additional parameters when whole genome RNA data is available

    -rna_sample_id rna_sample \
    -isofox_gene_distribution /path/to/isofox_gene_distribution.csv \
    -isofox_alt_sj_cohort /path/to/isofox_alt_sj_cohort.csv \
    -isofox_dir /path/to/isofox_output 
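
The three groups of additional parameters above can be combined depending on which data is available. The following Python sketch is illustrative only: the function name and the dictionary layout are hypothetical, and it assumes that supplying any of the whole genome groups makes the run a whole genome configuration, so that `-experiment_type` becomes WGS.

```python
# Flags per data type, as listed in the three parameter groups above.
MODE_FLAGS = {
    "wgs": ("virus_dir", "chord_dir", "cuppa_dir", "sigs_dir"),
    "germline": ("reference_sample_id", "ref_sample_wgs_metrics_file",
                 "ref_sample_flagstat_file", "sage_germline_dir",
                 "linx_germline_dir", "peach_dir"),
    "rna": ("rna_sample_id", "isofox_gene_distribution",
            "isofox_alt_sj_cohort", "isofox_dir"),
}


def mode_specific_arguments(wgs=None, germline=None, rna=None):
    """Return (experiment_type, extra_args) for the available data.

    Each argument is None (data not available) or a dict mapping every
    flag of that group (without the leading dash) to its value.
    """
    provided = {"wgs": wgs, "germline": germline, "rna": rna}
    extra_args = []
    for mode, values in provided.items():
        if values is None:
            continue
        missing = [flag for flag in MODE_FLAGS[mode] if flag not in values]
        if missing:
            raise ValueError(f"{mode}: missing {', '.join(missing)}")
        for flag in MODE_FLAGS[mode]:
            extra_args += [f"-{flag}", str(values[flag])]

    # Assumption: any whole genome group implies a whole genome configuration.
    whole_genome = any(values is not None for values in provided.values())
    return ("WGS" if whole_genome else "PANEL"), extra_args
```

The extra arguments are meant to be appended to the base command, with the returned experiment type replacing "PANEL".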

Additional optional parameters across all modes

| Argument | Description |
|---|---|
| pipeline_version_file | Path to the file containing the (platinum) pipeline version used. |
| sampling_date | Sets the sampling date to the specified date if set. Expected format is YYMMDD. If omitted, the current date is used as sampling date. |
| convert_germline_to_somatic | If set, converts all germline driver variants to somatic driver variants, thereby obfuscating the germline driver part of the analysis without actually losing this data. Note that the data in other germline tables, except the pharmacogenetics table, is removed from this page. |
| add_disclaimer | If set, adds a "research use only" disclaimer to the footer of every page. |
| limit_json_output | If set, limits all lists in the JSON output to a single entry to facilitate manual inspection of the JSON output. |
| log_debug | If set, additional DEBUG logging is generated. |
| log_level | If set, overrides the default log level (INFO). Values can be ERROR, WARN, INFO, DEBUG and TRACE. |
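
The YYMMDD format expected by sampling_date uses a two-digit year, which is easy to get wrong by hand. A small Python sketch of producing the value is shown below; the helper names are hypothetical and not part of ORANGE.

```python
from datetime import date


def format_sampling_date(day):
    """Format a date as YYMMDD (two-digit year, zero-padded month and day)."""
    return day.strftime("%y%m%d")


def sampling_date_arguments(day=None):
    """Return the arguments for -sampling_date, or [] to keep the default.

    Leaving the parameter out makes ORANGE use the current date.
    """
    if day is None:
        return []
    return ["-sampling_date", format_sampling_date(day)]
```

For example, `format_sampling_date(date(2024, 3, 7))` returns `"240307"`.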

Additional run modes

Instead of individual algo directories, it is possible to configure a single directory in the following modes:

| Parameter | Description |
|---|---|
| pipeline_sample_root_dir | If this path is set, all individual algo paths are derived from this path, assuming the data was produced by the HMF pipeline. |
| sample_data_dir | If this path is set, all data is expected to exist in the root of this path. |

Somatic Findings

In addition to all somatic drivers (SNVs/Indels, copy numbers, structural variants and fusions) the following is considered potentially interesting and added to the report:

  • Other potentially relevant variants:
    1. Variants that are hotspots or near hotspots but not part of the reporting gene panel.
    2. Exonic variants that are not reported but are phased with variants that are reported.
    3. Variants that are considered relevant for tumor type classification according to Cuppa.
    4. Variants with synonymous impact on the canonical transcript of a reporting gene but with a reportable worst impact
    5. Variants in splice regions that are not reported in genes with splice variant reporting enabled.
  • Other regions with amps, or with deletions in other autosomal regions:
    1. Gains in genes for which we report amplifications with a relative minimum copy number between 2.5 and 3 times ploidy.
    2. Any chromosomal band location with at least one gene lost or fully amplified is considered potentially interesting. A maximum of 10 additional gains (sorted by minimum copy number) and 10 additional losses are reported as potentially interesting:
      • For a band with more than one gene amplified, the gene with the highest minimum copy number is picked.
      • For a band with a loss that has no losses reported in this band already, an arbitrary gene is picked.
  • Potentially interesting chromosomal rearrangements:
    1. 1q trisomy: In case 98% of 1q has copy number > 2.8 AND 90% of 1q has copy number < 3.5
    2. 1p19q co-deletion: In case 98% of 1p and 98% of 19q have MACN < 0.2
  • Other potentially relevant fusions. A maximum of 10 additional fusions (picked arbitrarily) are reported as potentially interesting:
    1. Any fusion that is not reported and has a reported type other than NONE.
    2. Any fusion in a gene that is configured as an oncogene in the driver gene panel.
  • Other potentially interesting in-frame fusions in case no high driver events are detected:
    1. In case no high driver events are detected, any in-frame, non-chain-terminated fusion that is not already reported.
  • Other viral presence:
    1. Any viral presence that is not otherwise reported.
  • Potentially interesting gene disruptions:
    1. Any unreported but disruptive gene disruption that is disrupting an exon which lies within a promiscuous exon range based on the fusion knowledgebase.
  • Potentially interesting LOH events:
    1. In case MSI is detected, LOH (if present) is shown for the following genes: MLH1, MSH2, MSH6, PMS2, EPCAM
    2. In case HRD (based on CHORD) is detected, LOH (if present) is shown for the following genes: BRCA1, BRCA2, RAD51C, PALB2
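
The two chromosomal rearrangement rules in the list above (1q trisomy and 1p19q co-deletion) can be sketched in Python as follows. This is a minimal illustration and not ORANGE's implementation: it assumes that "98% of 1q" means at least 98% of the arm's bases, and the `Segment` type and function names are hypothetical.

```python
from typing import Callable, NamedTuple, Sequence


class Segment(NamedTuple):
    """Hypothetical copy number segment on a single chromosome arm."""
    start: int              # inclusive start position
    end: int                # inclusive end position
    copy_number: float      # total copy number
    minor_allele_cn: float  # minor allele copy number (MACN)


def fraction_where(segments: Sequence[Segment],
                   predicate: Callable[[Segment], bool]) -> float:
    """Fraction of the covered bases lying in segments that match predicate."""
    total = sum(s.end - s.start + 1 for s in segments)
    if total == 0:
        return 0.0
    matching = sum(s.end - s.start + 1 for s in segments if predicate(s))
    return matching / total


def has_1q_trisomy(arm_1q: Sequence[Segment]) -> bool:
    # Assumption: the percentages are read as "at least this share of bases".
    return (fraction_where(arm_1q, lambda s: s.copy_number > 2.8) >= 0.98
            and fraction_where(arm_1q, lambda s: s.copy_number < 3.5) >= 0.90)


def has_1p19q_codeletion(arm_1p: Sequence[Segment],
                         arm_19q: Sequence[Segment]) -> bool:
    # Both arms must have MACN < 0.2 over at least 98% of their bases.
    return all(fraction_where(arm, lambda s: s.minor_allele_cn < 0.2) >= 0.98
               for arm in (arm_1p, arm_19q))
```

The upper bound of 3.5 in the 1q rule is what separates a single extra copy from a higher-level gain: an arm that is amplified throughout passes the first condition but fails the second.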

In case ORANGE was run in DNA+RNA mode, DNA findings will be annotated with RNA:

  • Drivers and potentially interesting variants are annotated with RNA depth
  • Drivers and potentially interesting amps/dels are annotated with TPM, and corresponding percentile and fold change for database and applicable tumor type
  • Drivers and potentially interesting fusions are annotated depending on fusion type:
    1. EXON_DEL_DUP and other intra-gene fusions are annotated with exon-skipping novel splice junctions
    2. @IG fusions are annotated with TPM of the 3' fusion gene
    3. Other fusions are annotated with RNA fusion details (detected fusions in RNA, and corresponding fragment support and depth of 5' and 3' junction)

Germline Findings

In addition to all germline SNV/Indel tumor drivers determined by PURPLE, the following is added to the report:

  • Other potentially relevant variants
    1. Any hotspots that are not configured to be reported.
    2. Any hotspots that are filtered based on quality.
  • Potentially pathogenic germline deletions
  • Potentially pathogenic germline LOH events
  • Potentially pathogenic germline homozygous disruptions
  • Potentially pathogenic germline gene disruptions
  • Missed variant likelihood (MVLH) per gene, presenting the likelihood that a pathogenic variant would have been missed had one been present.
  • (Large-scale) germline CN aberrations.
    • Germline CN aberrations are determined by PURPLE and include aberrations such as Klinefelter syndrome or trisomy X.
  • Pharmacogenetics (DPYD & UGT1A1 status)


Immunology

The immunology chapter reports on various immunological properties of the tumor sample.

The chapter presents the following:

  • HLA-A/B/C details
    1. QC Status
    2. Detected alleles, annotated with #total fragments and somatic annotation (tumor copy number, #mutations)

In case ORANGE was run in DNA+RNA mode, the alleles will be annotated by RNA fragment support.

  • Genetic immune escape analysis (inspired by this paper). ORANGE attempts to detect the following mechanisms:
    • HLA-1 loss-of-function, detected in case one of the following mutations is present in either HLA-A, HLA-B or HLA-C:
      • MACN < 0.3 without the presence of a loss (proxy for LOH)
      • A clonal variant with canonical coding effect NONSENSE_OR_FRAMESHIFT or SPLICE
      • A clonal, biallelic variant with canonical coding effect MISSENSE
      • A full or partial loss
      • A homozygous disruption
    • Antigen presentation pathway inactivation, detected in case one of the following mutations is present in either B2M, CALR, TAP1, TAP2, TAPBP, NLRC5, CIITA or RFX5:
      • A clonal variant with canonical coding effect NONSENSE_OR_FRAMESHIFT or SPLICE
      • A clonal, biallelic variant with canonical coding effect MISSENSE
      • A full or partial loss
      • A homozygous disruption
    • IFN gamma pathway inactivation, detected in case one of the following mutations is present in either JAK1, JAK2, IRF2, IFNGR1, IFNGR2, APLNR or STAT1:
      • A clonal variant with canonical coding effect NONSENSE_OR_FRAMESHIFT or SPLICE
      • A clonal, biallelic variant with canonical coding effect MISSENSE
      • A full or partial loss
      • A homozygous disruption
    • (Potential) PD-L1 overexpression, detected in case CD274 is fully amplified.
    • CD58 inactivation, detected in case any of the following mutations happened in CD58:
      • A clonal variant with canonical coding effect NONSENSE_OR_FRAMESHIFT or SPLICE
      • A clonal, biallelic variant with canonical coding effect MISSENSE
      • A full or partial loss
      • A homozygous disruption
    • Epigenetics driven immune escape via SETDB1, detected in case SETDB1 is fully amplified.
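
Most of the mechanisms above share the same four inactivation criteria, with HLA-1 adding a MACN-based criterion. The Python sketch below illustrates that shared structure. It is not ORANGE's implementation: the `Variant` and `GeneEvidence` types and all function names are hypothetical, and only the gene lists, coding effects and thresholds come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence

IFN_GAMMA_GENES = ("JAK1", "JAK2", "IRF2", "IFNGR1", "IFNGR2", "APLNR", "STAT1")
HLA_GENES = ("HLA-A", "HLA-B", "HLA-C")


@dataclass
class Variant:
    clonal: bool
    biallelic: bool
    canonical_coding_effect: str  # e.g. "NONSENSE_OR_FRAMESHIFT", "SPLICE", "MISSENSE"


@dataclass
class GeneEvidence:
    variants: List[Variant] = field(default_factory=list)
    has_loss: bool = False                   # full or partial loss
    has_homozygous_disruption: bool = False
    minor_allele_cn: Optional[float] = None  # MACN, if known


def is_inactivated(evidence: GeneEvidence) -> bool:
    """Apply the four criteria shared by the inactivation mechanisms."""
    for variant in evidence.variants:
        if not variant.clonal:
            continue  # only clonal variants count
        if variant.canonical_coding_effect in ("NONSENSE_OR_FRAMESHIFT", "SPLICE"):
            return True
        if variant.canonical_coding_effect == "MISSENSE" and variant.biallelic:
            return True
    return evidence.has_loss or evidence.has_homozygous_disruption


def is_hla_loss_of_function(evidence: GeneEvidence) -> bool:
    """HLA-1 adds one criterion: MACN < 0.3 without a loss (proxy for LOH)."""
    loh_proxy = (evidence.minor_allele_cn is not None
                 and evidence.minor_allele_cn < 0.3
                 and not evidence.has_loss)
    return loh_proxy or is_inactivated(evidence)


def mechanism_detected(evidence_by_gene: Dict[str, GeneEvidence],
                       genes: Sequence[str], rule=is_inactivated) -> bool:
    """A mechanism is detected when any of its genes satisfies the rule."""
    return any(rule(evidence_by_gene[gene])
               for gene in genes if gene in evidence_by_gene)
```

For instance, `mechanism_detected(evidence, IFN_GAMMA_GENES)` checks IFN gamma pathway inactivation, while `mechanism_detected(evidence, HLA_GENES, rule=is_hla_loss_of_function)` applies the extended HLA-1 rule.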

RNA Findings

If run with RNA, this chapter displays potentially interesting RNA details:

  • QC Details
  • Driver gene panel genes with high TPM (>90th percentile database & tumor type) or low TPM (<5th percentile database & tumor type)
  • Potentially interesting support for known or promiscuous fusions not detected in our DNA analysis pipeline
  • Potentially interesting novel splice junctions
    1. Exon-skipping events in EXON_DEL_DUP fusion genes
    2. Novel exon/intron events in driver gene panel genes
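
The high and low TPM criteria above amount to a simple threshold check. The Python sketch below assumes that "database & tumor type" means both percentiles must cross the threshold and that percentiles are expressed as fractions between 0 and 1; the function name is hypothetical.

```python
def classify_expression(percentile_database, percentile_tumor_type):
    """Return "HIGH", "LOW" or None for a driver gene panel gene.

    Both arguments are TPM percentiles as fractions in [0, 1]: one against
    the full database, one against the applicable tumor type.
    """
    percentiles = (percentile_database, percentile_tumor_type)
    # Assumption: both percentiles must exceed the 90th percentile.
    if all(p > 0.90 for p in percentiles):
        return "HIGH"
    # Assumption: both percentiles must fall below the 5th percentile.
    if all(p < 0.05 for p in percentiles):
        return "LOW"
    return None
```

Under this reading, a gene at the 95th percentile in the database but only the 50th percentile within its tumor type is not flagged.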

Cohort Comparison

The cohort comparison reports all the properties of a tumor sample that Cuppa considers for determining tumor type. The cohort comparison displays the prevalence of the tumor's properties with respect to the cohorts that Cuppa could potentially assign the sample to:

  • Genomic position distribution of SNVs and their tri-nucleotide signature
  • Sample traits of the tumor (for example, number of LINE insertions)
  • (Driver) features of the tumor.

Do note that RNA features and cohort comparison thereof are only included if ORANGE was run in combined DNA/RNA mode.

Quality Control

The quality control chapter provides extensive details that can help with interpreting the overall PURPLE QC status or investigating potential causes for QC failure.

  • The high-level QC from PURPLE
  • Various details from the flagstats and coverage stats of the tumor and reference samples
  • Various plots from PURPLE
  • BQR plots from both reference and tumor sample from SAGE

