-
Notifications
You must be signed in to change notification settings - Fork 1
3. Running Blessy
blessy performs two tasks:
- Create the DoCo dictionary
blessy intersects a transcript and a domain annotation, to map domains to each transcript based on genomic coordinates. Transcripts and domain annotations can be provided by the user, or automatically fetched from the UCSC Genome Browser. blessy will return a table ('dictionary') mappings gene, DoCo and transcript.
- Aggregate transcript counts to DoCo counts
Once the information of the DoCo dictionary is generated, blessy can aggregate transcript counts into a DoCo-level count table.
Both these tasks may be performed with a single wrapper function of blessy. It requires the transcript annotation version and protein domain name to create the DoCo class. The default mode of blessy uses annotation identifiers of the UCSC Genome Browser, along with the reference genome version. Additionally, an RNA-Seq transcript count must be provided for aggregating counts at transcript-level to DoCo-level.
# Load library
library(longhaul)
# Wrapper function for blessy
blessy_outs <- blessy(genomeAssembly, transcriptAnnotation, domainAnnotation, transcriptCount, unique_domain = FALSE, coordinates = TRUE)genomeAssembly - A string specifying the UCSC reference genome version used for the two transcriptAnnotation and domainAnnotation below. This string corresponds to the identifier which can be found at the end of the 'Assembly' option in the UCSC Table Browser, or the 'Database' seen in the 'Data format description' page of a specific track. (e.g., "hg38").
transcriptAnnotation - A string specifying the UCSC table identifier of a transcript track. This string corresponds to the 'Table' option in the UCSC Table Browser, or the 'Primary Table' seen in the 'Data format description' page of a specific track. (e.g., "wgEncodeGencodeBasicV44"). Note that the transcript annotation needs to match what was used for transcript quantification, as the transcriptIDs in the count matrix are used to assign DoCo.
domainAnnotation - A string specifying the UCSC table identifier of a domain track. This string corresponds to the 'Table' option in the UCSC Table Browser, or the 'Primary Table' seen in the 'Data format description' page of a specific track. (e.g., "unipDomain").
transcriptCount - An R data frame, matrix or sparse matrix containing RNA-Seq transcript identifiers and count. The row names must store transcript identifiers, and the IDs must be compatible with what is found in transcriptAnnotation. The columns in this data frame must contain numeric count across different biological samples or cells.
unique_domain - A logical value indicating whether to deduplicate domains by keeping only unique domain values per transcript. Setting this value to TRUE reduces the number of unique DoCo values in a given annotation. For example, 'Domain A, Domain B' and 'Domain A, Domain A, Domain B' DoCos would be considered the same DoCo, due to deduplication of the latter to simply 'Domain A, Domain B'. Default is set to FALSE.
coordinates - A boolean value to decide if the coordinates of protein domains would be kept in the final DoCo string or not. Generally, setting this value to FALSE further deduplicates and reduces the number of unique DoCo values in a given annotation. For example, 'Domain A:chr1:100-200, Domain B:chr1:300-400' and 'Domain A:chr1:150-250, Domain B:chr1:300-400' would be collapsed together into 'Domain A, Domain B'. Default is set to TRUE.
NOTE: We highly recommend using the GENCODE or NCBI RefSeq tracks for transcript annotation and UniProt or Pfam tracks for domain annotation, as blessy is tailored around these annotations.
For demonstration, we included a small example transcript count table from two cell lines, H9 and HCT116, which can be found in /ins/extdata, and a runnable code example on how to perform differential Doco analysis below. This data was derived from the SG-NEx long-read dataset.
Firstly, blessy is loaded via longhaul. For the transcript count, it must be loaded as an R data frame. The first column must be named 'TranscriptID', and the other columns must store numeric count across different samples.
# Load library
library(longhaul)
# Load transcript count
tx_count_file = system.file("extdata", "example_transcript_count.txt", package = "longhaul")
tx_count <- read.table(tx_count_file,header=TRUE,row.names=1)
tx_count$GENEID<-NULL # Remove the GENEID column
head(tx_count)
sample1.H9 sample2.H9 sample3.H9 sample4.HCT116
ENST00000000233.10 984.1012 960.7484 1108.1225 849.99381
ENST00000000412.8 235.7551 223.9015 221.3439 400.99353
ENST00000001008.6 1515.0000 1544.0000 1504.9986 1356.00000
ENST00000001146.7 0.0000 0.0000 0.0000 41.00000
ENST00000002125.9 146.1039 196.4843 164.6473 29.39952
ENST00000002165.11 366.0000 329.9591 347.0000 321.00000
sample5.HCT116 sample6.HCT116
ENST00000000233.10 845.44985 882.4294
ENST00000000412.8 305.19009 408.9188
ENST00000001008.6 1148.00000 1291.0000
ENST00000001146.7 19.00000 10.0000
ENST00000002125.9 24.85344 31.6166
ENST00000002165.11 293.00000 329.0000
In this example, we perform blessy using the The GENCODE Genes track for transcript annotation and the UniProt/SwissProt for domain annotation. blessy can simply be run as:
# Run blessy
blessy_outs <- blessy(genomeAssembly = "hg38",
transcriptAnnotation = "wgEncodeGencodeCompV47",
domainAnnotation = c("ucscGenePfam","unipInterest","unipDomain",
"unipLocExtra","unipLocTransMemb"),
transcriptCount = tx_count,
unique_domain=TRUE,
coordinates=FALSE)
Pipeline completed. Returning results...
Warning messages:
...
5: In blessy.createDoCoCount(phasing_dict, transcriptCount) :
Number of uncommon transcripts grouped: 5151Typically, a blessy run takes from 3 to 6 minutes depending on the size of the transcript count table and the number of domain annotations selected. As the DoCo count is generated from looking up the transcript IDs in the initial count using the phasing dictionary, there may be transcript not found in our dictionary. These uncommon transcript includes novel transcripts (e.g. from assemblers), or transcripts not present in the used annotation. blessy returns a warning on the total number of uncommon transcripts, and group them to the ';;;' DoCo, so that their counts can still be used for library size normalisation.
You may also get some warnings about alternative chromosome names. These can be ignored.
Once blessy finished, it returns a dictionary showing the hierarchical relationship between gene, DoCo and transcript, as well as the count data now at DoCo level.
# Get phasing dictionary and DoCo count
phasing_dict <- blessy_outs$phasing_dict
doco_count <- blessy_outs$doco_countThe DoCo count can be used to perform a new analysis termed 'Differential DoCo Usage', or DDU for short. Like DTU, DDU helps discovering the domain phasings/combinations with the largest change in proportional usage within a gene between biological conditions. Here, we perform an example DDU analysis on the created DoCo count above.
# Downstream analysis: Differential DoCo Usage
# Load library
library(edgeR)
# Get the GeneIDs from the DoCo name
geneID <- sub(".*;;;\\s*", "", rownames(doco_count))
# Create DGEList object for differential analysis
genes <- data.frame(GeneID = geneID, DoCo=rownames(doco_count))
group <- c(rep(1:2, each=3))
y <- DGEList(counts=doco_count, group=group, genes=genes)
# Filter low expressed DoCo and uncommon/unassigned transcripts
keep <- filterByExpr(y, group=group)
keep[";;;"]=FALSE
table(keep)
y <- y[keep,,keep.lib.sizes=TRUE]
#Normalise the samples
y <- normLibSizes(y)
y$samples
#Plot the samples on an MDS
plotMDS(y)
# Estimate dispersion
design <- model.matrix(~group)
y <- estimateDisp(y,design)
# Fit models
fit <- glmQLFit(y,design)
lrt <- glmQLFTest(fit)
# Perform Differential DoCo Usage
sp <- diffSpliceDGE(fit, geneid="GeneID", exonid ="DoCo")
DDU_top_DoCos <- topSpliceDGE(sp, test = 'exon', n = Inf)
DDU_top_genes <- topSpliceDGE(sp, test = 'gene', n = Inf)
# DDU result at DoCo level
rownames(DDU_top_DoCos)<-NULL
head(DDU_top_DoCos)
GeneID DoCo logFC
1 RPS2 Disordered,Ribosomal_S5_C,Small ribosoma...;;; RPS2 -3.036701
2 NOP10 Disordered;;; NOP10 -4.707515
3 NOP10 Disordered,Nop10p;;; NOP10 4.131922
4 GNAS Disordered,NESP55;;; GNAS -7.625620
5 MYL6 EF-hand 3,EF-hand 2,EF-hand_5,EF-hand 1;;; MYL6 1.999456
6 RPS6 ;;; RPS6 -4.092046
exon.F P.Value FDR
1 537.8736 2.133795e-17 2.027106e-13
2 1095.4168 2.610092e-15 7.681653e-12
3 1064.9546 3.208463e-15 7.681653e-12
4 287.2771 3.779519e-15 7.681653e-12
5 506.2719 4.278777e-15 7.681653e-12
6 993.0467 4.851570e-15 7.681653e-12
# DDU result at gene level
rownames(DDU_top_genes)<-NULL
head(DDU_top_genes)
GeneID NExons gene.F P.Value FDR
1 RPS2 5 175.83241 7.955538e-17 1.647618e-13
2 NOP10 3 1092.92909 7.984578e-17 1.647618e-13
3 GNAS 6 90.91958 3.938746e-15 5.418401e-12
4 RPS6 3 497.38608 2.340121e-14 2.414419e-11
5 MYL6 4 175.77873 5.681085e-14 4.689167e-11
6 MBD2 2 743.82170 3.161236e-13 2.174404e-10
#Plot the top hit
plotSpliceDGE(sp,rank=1)