Skip to content

3. Running Blessy

Nadia Davidson edited this page May 17, 2025 · 4 revisions

Running blessy, inputs and outputs

blessy performs two tasks:

  1. Create the DoCo dictionary

blessy intersects a transcript and a domain annotation, to map domains to each transcript based on genomic coordinates. Transcripts and domain annotations can be provided by the user, or automatically fetched from the UCSC Genome Browser. blessy will return a table ('dictionary') mappings gene, DoCo and transcript.

  1. Aggregate transcript counts to DoCo counts

Once the information of the DoCo dictionary is generated, blessy can aggregate transcript counts into a DoCo-level count table.

Both these tasks may be performed with a single wrapper function of blessy. It requires the transcript annotation version and protein domain name to create the DoCo class. The default mode of blessy uses annotation identifiers of the UCSC Genome Browser, along with the reference genome version. Additionally, an RNA-Seq transcript count must be provided for aggregating counts at transcript-level to DoCo-level.


# Load library
library(longhaul)

# Wrapper function for blessy
blessy_outs <- blessy(genomeAssembly, transcriptAnnotation, domainAnnotation, transcriptCount, unique_domain = FALSE, coordinates = TRUE)

genomeAssembly - A string specifying the UCSC reference genome version used for the two transcriptAnnotation and domainAnnotation below. This string corresponds to the identifier which can be found at the end of the 'Assembly' option in the UCSC Table Browser, or the 'Database' seen in the 'Data format description' page of a specific track. (e.g., "hg38").

transcriptAnnotation - A string specifying the UCSC table identifier of a transcript track. This string corresponds to the 'Table' option in the UCSC Table Browser, or the 'Primary Table' seen in the 'Data format description' page of a specific track. (e.g., "wgEncodeGencodeBasicV44"). Note that the transcript annotation needs to match what was used for transcript quantification, as the transcriptIDs in the count matrix are used to assign DoCo.

domainAnnotation - A string specifying the UCSC table identifier of a domain track. This string corresponds to the 'Table' option in the UCSC Table Browser, or the 'Primary Table' seen in the 'Data format description' page of a specific track. (e.g., "unipDomain").

transcriptCount - An R data frame, matrix or sparse matrix containing RNA-Seq transcript identifiers and count. The row names must store transcript identifiers, and the IDs must be compatible with what is found in transcriptAnnotation. The columns in this data frame must contain numeric count across different biological samples or cells.

unique_domain - A logical value indicating whether to deduplicate domains by keeping only unique domain values per transcript. Setting this value to TRUE reduces the number of unique DoCo values in a given annotation. For example, 'Domain A, Domain B' and 'Domain A, Domain A, Domain B' DoCos would be considered the same DoCo, due to deduplication of the latter to simply 'Domain A, Domain B'. Default is set to FALSE.

coordinates - A boolean value to decide if the coordinates of protein domains would be kept in the final DoCo string or not. Generally, setting this value to FALSE further deduplicates and reduces the number of unique DoCo values in a given annotation. For example, 'Domain A:chr1:100-200, Domain B:chr1:300-400' and 'Domain A:chr1:150-250, Domain B:chr1:300-400' would be collapsed together into 'Domain A, Domain B'. Default is set to TRUE.

NOTE: We highly recommend using the GENCODE or NCBI RefSeq tracks for transcript annotation and UniProt or Pfam tracks for domain annotation, as blessy is tailored around these annotations.

Example of how to run blessy

For demonstration, we included a small example transcript count table from two cell lines, H9 and HCT116, which can be found in /ins/extdata, and a runnable code example on how to perform differential Doco analysis below. This data was derived from the SG-NEx long-read dataset.

Firstly, blessy is loaded via longhaul. For the transcript count, it must be loaded as an R data frame. The first column must be named 'TranscriptID', and the other columns must store numeric count across different samples.

# Load library
library(longhaul)

# Load transcript count
tx_count_file = system.file("extdata", "example_transcript_count.txt", package = "longhaul")
tx_count <- read.table(tx_count_file,header=TRUE,row.names=1)
tx_count$GENEID<-NULL  # Remove the GENEID column

head(tx_count)
                   sample1.H9 sample2.H9 sample3.H9 sample4.HCT116
ENST00000000233.10   984.1012   960.7484  1108.1225      849.99381
ENST00000000412.8    235.7551   223.9015   221.3439      400.99353
ENST00000001008.6   1515.0000  1544.0000  1504.9986     1356.00000
ENST00000001146.7      0.0000     0.0000     0.0000       41.00000
ENST00000002125.9    146.1039   196.4843   164.6473       29.39952
ENST00000002165.11   366.0000   329.9591   347.0000      321.00000
                   sample5.HCT116 sample6.HCT116
ENST00000000233.10      845.44985       882.4294
ENST00000000412.8       305.19009       408.9188
ENST00000001008.6      1148.00000      1291.0000
ENST00000001146.7        19.00000        10.0000
ENST00000002125.9        24.85344        31.6166
ENST00000002165.11      293.00000       329.0000

In this example, we perform blessy using the The GENCODE Genes track for transcript annotation and the UniProt/SwissProt for domain annotation. blessy can simply be run as:

# Run blessy
blessy_outs <- blessy(genomeAssembly = "hg38", 
                      transcriptAnnotation = "wgEncodeGencodeCompV47",
                      domainAnnotation = c("ucscGenePfam","unipInterest","unipDomain",
                                            "unipLocExtra","unipLocTransMemb"),
                      transcriptCount = tx_count,
                      unique_domain=TRUE, 
                      coordinates=FALSE)

Pipeline completed. Returning results...
Warning messages:
...
5: In blessy.createDoCoCount(phasing_dict, transcriptCount) :
  Number of uncommon transcripts grouped: 5151

Typically, a blessy run takes from 3 to 6 minutes depending on the size of the transcript count table and the number of domain annotations selected. As the DoCo count is generated from looking up the transcript IDs in the initial count using the phasing dictionary, there may be transcript not found in our dictionary. These uncommon transcript includes novel transcripts (e.g. from assemblers), or transcripts not present in the used annotation. blessy returns a warning on the total number of uncommon transcripts, and group them to the ';;;' DoCo, so that their counts can still be used for library size normalisation. You may also get some warnings about alternative chromosome names. These can be ignored.

Once blessy finished, it returns a dictionary showing the hierarchical relationship between gene, DoCo and transcript, as well as the count data now at DoCo level.

# Get phasing dictionary and DoCo count 
phasing_dict <- blessy_outs$phasing_dict
doco_count <- blessy_outs$doco_count

The DoCo count can be used to perform a new analysis termed 'Differential DoCo Usage', or DDU for short. Like DTU, DDU helps discovering the domain phasings/combinations with the largest change in proportional usage within a gene between biological conditions. Here, we perform an example DDU analysis on the created DoCo count above.

# Downstream analysis: Differential DoCo Usage

# Load library
library(edgeR)

# Get the GeneIDs from the DoCo name 
geneID <- sub(".*;;;\\s*", "", rownames(doco_count))

# Create DGEList object for differential analysis
genes <- data.frame(GeneID = geneID, DoCo=rownames(doco_count))
group <- c(rep(1:2, each=3))
y <- DGEList(counts=doco_count, group=group, genes=genes)

# Filter low expressed DoCo and uncommon/unassigned transcripts
keep <- filterByExpr(y, group=group)
keep[";;;"]=FALSE
table(keep)
y <- y[keep,,keep.lib.sizes=TRUE]

#Normalise the samples
y <- normLibSizes(y)
y$samples

#Plot the samples on an MDS
plotMDS(y)

# Estimate dispersion
design <- model.matrix(~group)
y <- estimateDisp(y,design)

# Fit models
fit <- glmQLFit(y,design)
lrt <- glmQLFTest(fit)

# Perform Differential DoCo Usage
sp <- diffSpliceDGE(fit, geneid="GeneID", exonid ="DoCo")
DDU_top_DoCos <- topSpliceDGE(sp, test = 'exon', n = Inf)
DDU_top_genes <- topSpliceDGE(sp, test = 'gene', n = Inf)

# DDU result at DoCo level
rownames(DDU_top_DoCos)<-NULL
head(DDU_top_DoCos)

  GeneID                                                DoCo     logFC
1   RPS2 Disordered,Ribosomal_S5_C,Small ribosoma...;;; RPS2 -3.036701
2  NOP10                                 Disordered;;; NOP10 -4.707515
3  NOP10                          Disordered,Nop10p;;; NOP10  4.131922
4   GNAS                           Disordered,NESP55;;; GNAS -7.625620
5   MYL6     EF-hand 3,EF-hand 2,EF-hand_5,EF-hand 1;;; MYL6  1.999456
6   RPS6                                            ;;; RPS6 -4.092046
     exon.F      P.Value          FDR
1  537.8736 2.133795e-17 2.027106e-13
2 1095.4168 2.610092e-15 7.681653e-12
3 1064.9546 3.208463e-15 7.681653e-12
4  287.2771 3.779519e-15 7.681653e-12
5  506.2719 4.278777e-15 7.681653e-12
6  993.0467 4.851570e-15 7.681653e-12
 

# DDU result at gene level
rownames(DDU_top_genes)<-NULL
head(DDU_top_genes)

  GeneID NExons     gene.F      P.Value          FDR
1   RPS2      5  175.83241 7.955538e-17 1.647618e-13
2  NOP10      3 1092.92909 7.984578e-17 1.647618e-13
3   GNAS      6   90.91958 3.938746e-15 5.418401e-12
4   RPS6      3  497.38608 2.340121e-14 2.414419e-11
5   MYL6      4  175.77873 5.681085e-14 4.689167e-11
6   MBD2      2  743.82170 3.161236e-13 2.174404e-10

#Plot the top hit
plotSpliceDGE(sp,rank=1)

Clone this wiki locally