Skip to content

4. Functions and Advanced Options

Nadia Davidson edited this page May 17, 2025 · 1 revision

Here, we outline the component functions of blessy along with use-cases and arguments for each, to assist users in customizing the module to fit their specific needs. The functions are listed based on their order in the blessy pipeline:

Step Function Description
1-7 blessy Wrapper function for whole pipeline
2-7 blessy.usingCustomAnnotation Wrapper function for pipeline with custom annotations
1-6 blessy.getDictionary Wrapper function for steps to make DoCo dictionary
1 blessy.getTranscriptTrack Fetch transcript-type annotation tracks from the UCSC database.
1 blessy.getDomainTrack Fetch domain-type annotation tracks UCSC database.
2 blessy.mapDomainToTranscript Create a data frame of domains mapped to transcripts.
3 blessy.addStartsEnds Add important coordinates columns for domain deduplication.
4 blessy.domainDeduplication Remove non-exact domain mappings.
5 blessy.domainPhasing Create a 'DoCo' column based on domain phasing.
6 blessy.createPhasingDictionary Summarize gene, DoCo and transcript levels from given annotations.
7 blessy.createDoCoCount Aggregate given transcript count to DoCo level.
- blessy.dfToGRangesList Utility for converting BED-like R data frame to GRangesList.
- blessy.GTFtoBEDdf Utility for converting a GTF to a BED-like R data frame.


1. Domain and Transcript Annotations

Automatically Fetch Annotation Tracks

blessy.getTranscriptTrack(genomeAssembly, transcriptAnnotation) and blessy.getDomainTrack(genomeAssembly, domainAnnotation)

The purpose of these two functions is for retrieving UCSC annotations. As transcript and domain tracks often have different syntax on the UCSC Database, each function is designed to a different track type. Both functions require two arguments: an assembly identifier and an annotation identifier, which correspond to the 'Assembly' and 'Table' options in the UCSC Table Browser respectively. Again, we highly recommend using the GENCODE or NCBI RefSeq tracks for transcript annotation and UniProt or Pfam tracks for domain annotation. The output of blessy.getTranscriptTrack() and blessy.getDomainTrack() are annotations stored in BED-like R data frames.

# Fetch annotation tracks
transcript_annotation <- blessy.getTranscriptTrack("hg38", "wgEncodeGencodeBasicV44")
domain_annotation <- blessy.getDomainTrack("hg38", "unipDomain")

# Output visualization
head(transcript_annotation)
  chrom chromStart  chromEnd              name score strand thickStart  thickEnd itemRgb blockCount
1  chr1   67092164  67134970 ENST00000684719.1     0      -   67093004  67127240   0,0,0          8
2  chr1   67092164  67231852 ENST00000371007.6     0      -   67093004  67127240   0,0,0          8
3  chr1   67092175  67127261 ENST00000371006.5     0      -   67093004  67127240   0,0,0          6
4  chr1   67092175  67127261 ENST00000475209.6     0      -   67093579  67127240   0,0,0          7
5  chr1   67092396  67127261 ENST00000621590.4     0      -   67096311  67127240   0,0,0          3
6  chr1  201283451 201332993 ENST00000263946.7     0      +  201283702 201328836   0,0,0         15

There can be several scenarios where users would prefer NOT to use annotations directly fetched from the UCSC Genome Browser:

Using user-provided annotations

To run blessy using annotations NOT from UCSC, the user must ensure that their provided annotations are first read into BED12-like R data frames. The data frame must include the following columns:

chrom- Chromosome or scaffold name of the feature.

chromStart - Start coordinate on the chromosome or scaffold for the feature considered.

chromEnd - End coordinate on the chromosome or scaffold for the feature considered.

blockCount - Number of blocks (exon for transcripts, domain-block for domains) on the line of the BED-like data frame.

blockSizes - List of values separated by commas corresponding to the size of the blocks (the number of values must correspond to that of the 'blockCount').

blockStarts - List of values separated by commas corresponding to the starting coordinates of the blocks, coordinates calculated relative to those present in the chromStart column (the number of values must correspond to that of the 'blockCount').

thickStart - (for transcript only) Start coordinate of the coding sequence of the transcript.

thickEnd - (for transcript only) End coordinate of the coding sequence of the transcript.

geneName - (for transcript only) Identifier of the gene the transcript belongs to.

For GTF annotations, blessy offers the function blessy.GTFtoBEDdf() to convert a GTF file into the required BED-like format. For example, user can choose a given annotation track from GENCODE and load it as below:

customTranscriptAnnotation <- blessy.GTFtoBEDdf("path/to/gencode.v47.basic.annotation.gtf", geneID = FALSE)

blessy can run on this annotation via the blessy.usingCustomAnnotation() custom wrapper to define the DoCo class and aggregate the provided transcript count:

blessy_outs_custom <- blessy.usingCustomAnnotation(customTranscriptAnnotation, DomainAnnotation, transcriptCount, unique_domain = FALSE, coordinates = TRUE)

customTranscriptAnnotation - A BED-like data frame or a GRangesList representing transcript annotation with required columns/information.

customDomainAnnotation - A BED-like data frame or a GRangesList representing domain annotation with required columns/information.

transcriptCount - An R data frame containing RNA-Seq transcript count. The first column must be named 'TranscriptID' storing string values of transcript identifiers. Other columns are considered numeric count across different biological samples.

unique_domain - A logical value indicating whether to further deduplicate domains by keeping only unique domains per transcript. Default is set to FALSE.

coordinates - A boolean value to decide if the coordinates of protein domains would be kept in the final DoCo string or not. Generally, setting this value to FALSE further deduplicates and reduces the number of DoCo values in a given dataset. Default is set to TRUE.

In this example, we worked with the Basic gene annotation GTF of the GENCODE human annotation Release 47. To acess the GTF, click here.

# Load library
library(longhaul)

# Load count
tx_count <- read.table("example_transcript_count.txt")

# Get domain annotation via blessy
domain_df <- blessy.getDomainTrack("hg38", "unipDomain")

# Load and convert the GTF transcript annotation
tx_df <- blessy.GTFtoDataFrame("gencode.v47.basic.annotation.gtf", geneID = T)

# Run blessy using custom annotation 
gtf_run <- blessy.usingCustomAnnotation(tx_df, domain_df, tx_count)
doco_count <- gtf_run$doco_count
doco_dict <- gtf_run$doco_dict

Modifying and filtering the transcript or domain annotations

Users can choose to retrieve multiple domain annotation tracks via blessy.getDomainTrack() (see Functions and Use Cases), and integrate these domain tracks into one comprehensive domain annotation using simple R commands. Once the final annotation data frame is created, it can be use as an input to blessy.usingCustomAnnotation().

# Load libraries
library(longhaul)
library(dplyr)

# Fetch domain tracks from the UCSC Genome Browser
unipDomain <- blessy.getDomainTrack("hg38", "unipDomain")
unipInterest <- blessy.getDomainTrack("hg38", "unipInterest")

# Combine the two data frames
combined_df <- bind_rows(unipDomain, unipInterest)

# Filter out rows containing "Disordered"
filtered_df <- combined_df %>% 
  filter(!grepl("Disordered", name)) 

# Sort 
domain_df <- filtered_df %>% 
  arrange(chrom, chromStart) 

# Load transcript count
tx_count <- read.table("example_transcript_count.txt")

# Fetch transcript annotation from the UCSC browser
tx_df <- blessy.getTranscriptTrack("hg38", "wgEncodeGencodeCompV44")

# Perform blessy using custom annotations
blessy_outs_custom <- blessy.usingCustomAnnotation(tx_df, domain_df, tx_count)


2. Map Domain to Transcript

blessy.mapDomainToTranscript(transcript_GRL, domain_GRL, transcript_annotation, domain_annotation)

This function serves to map the domains that intersect each transcript based on genomic coordinates. The function takes in the two GRangesLists of domain and transcript, and find their coordinate overlaps using the findOverlaps() function of GenomicRanges. findOverlaps() returns index pairs of matching domains and transcripts, which is used to join columns of the initial transcript and domain annotations based on matching features. The return object of this function is a mapping data frame with information on transcript and its matched domains.

# Example BED-like annotation data frames
transcript_annotation <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  chromStart = c(1000, 2000, 3000),
  chromEnd = c(1500, 2500, 3500),
  name = c("tx1", "tx2", "tx3"),
  strand = c("+", "-", "+"),
  thickStart = c(1000, 2000, 3000),
  thickEnd = c(1500, 2500, 3500),
  blockCount = c(2, 2, 2),
  blockSizes = c("100,200", "150,250", "200,300"),
  blockStarts = c("0,400", "0,500", "0,600"),
  geneName = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)


domain_annotation <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  chromStart = c(1200, 2100, 3100),
  chromEnd = c(1300, 2200, 3200),
  name = c("domainA", "domainB", "domainC"),
  strand = c("+", "-", "+"),
  thickStart = c(1200, 2100, 3100),
  thickEnd = c(1300, 2200, 3200),
  blockStarts = c("0", "0", "0"),
  blockSizes = c("100", "100", "100"),  # Assuming block size is 100 for each domain
  stringsAsFactors = FALSE
)

# Create domain mapping 
mapping_df <- blessy.mapDomainToTranscript(transcript_annotation, domain_annotation)

# Output visualization
head(mapping_df)
  chrom txStart txEnd Transcript strand cdsStart cdsEnd exonCount exonSizes exonRelativeStarts  Gene chromStart chromEnd  Domain
1  chr1    1000  1500        tx1      +     1000   1500         2   100,200              0,400 geneA       1200     1300 domainA
2  chr1    2000  2500        tx2      -     2000   2500         2   150,250              0,500 geneB       2100     2200 domainB
3  chr2    3000  3500        tx3      +     3000   3500         2   200,300              0,600 geneC       3100     3200 domainC
  chromStarts blockSizes
1           0        100
2           0        100
3           0        100


3. Add Block Coordinates to Mapping Data Frame

blessy.addStartsEnds(mapping_df)

This function adds several columns needed for the domain deduplication process to the input mapping data frame, based on existing columns. More specifically, this functions creates exonStarts and exonEnds columns corresponding to the start and end genomic coordinates of exons within a transcript, and likewise for its domains with blockStarts and blockEnds.

# Add block coordinates
mapping_df <- blessy.addStartsEnds(mapping_df)

# Output
head(mapping_df)

 chrom txStart txEnd Transcript strand cdsStart cdsEnd exonCount exonSizes
1  chr1    1000  1500        tx1      +     1000   1500         2   100,200
2  chr1    2000  2500        tx2      -     2000   2500         2   150,250
3  chr2    3000  3500        tx3      +     3000   3500         2   200,300
  exonRelativeStarts  Gene chromStart chromEnd  Domain thickStart thickEnd
1              0,400 geneA       1200     1300 domainA       1200     1300
2              0,500 geneB       2100     2200 domainB       2100     2200
3              0,600 geneC       3100     3200 domainC       3100     3200
  domainBlockStarts blockSizes exonStarts  exonEnds blockStarts blockEnds
1                 0        100  1000,1400 1100,1600        1200      1300
2                 0        100  2000,2500 2150,2750        2100      2200
3                 0        100  3000,3600 3200,3900        3100      3200


4. Domain Mapping Deduplication

blessy.domainDeduplication(tx_domain_df, unique_domain = FALSE)

While the blessy.mapDomainToTranscript function allows for finding coordinate overlaps between domains and transcripts, non-exact matches that incorrectly assigns a domain to a transcript can happen. In the figure below, given a hypothetical transcript A, only domain 1 to domain 4 (D1-D4) should be considered correct matches for transcript A. For D5, we observe a non-consecutive exon match. The first D6 block does not align to an exon and while both blocks aligns with consecutive exons, the second block of D7 exhibited a gap to the left-hand side. Similarly, first block alignment of D8 showed overflown. The blessy.domainDeduplication function thus aims to remove D5 to D8 from this hypothetical domain-transcript mapping, using the 'exonStarts', 'exonEnds', 'blockStarts', 'blockEnds' columns created by blessy.addStartsEnds.

domainDedup_concept

These hypothetical transcripts and domains have been converted to BED files, which can be found as data/example_domains.bed or data/example_transcript.bed. By default, we expect D1, D2, D3 and D4 to remain after deduplication. The use case for this process is as follows:

# Read and rename hypothetical transcript A BED into an R BED-like data frame
tx_df_file = system.file("extdata", "example_transcript.bed", package = "longhaul")
tx_df <- read.table(tx_df_file)
colnames(tx_df) <- c("chrom", "chromStart", "chromEnd", "name", "score", "strand",
                        "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts", "geneName")

# Read and rename hypothetical domains BED into an R BED-like data frame
domain_df_file = system.file("extdata", "example_domains.bed", package = "longhaul")
domain_df <- read.table(domain_df_file)
colnames(domain_df) <- c("chrom", "chromStart", "chromEnd", "name", "score", "strand",
                     "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts")
  
# Map domains to transcripts
mapped_df <- blessy.mapDomainToTranscript(tx_df, domain_df)
  
# Add exon and block starts/ends
starts_ends_df <- blessy.addStartsEnds(mapped_df)
  
# Deduplicate domain mappings
deduplicated_df <- blessy.domainDeduplication(starts_ends_df, unique_domain = FALSE)

> unique(deduplicated_df$Domain)
[1] "D1" "D2" "D3" "D4"

Additionally, the function provides the parameter 'unique_domain' which is set to FALSE by default. By switching unique_domain to TRUE, the function removes repetitive domain names in each transcript so that the domain values are now unique for a transcript. For example, if transcript X is mapped to domain A twice, only one domain A is kept in this mapping. Because of this, unique_domain = TRUE is preferably used in tandem with coordinates = FALSE (see below)



5. Domain Phasing

blessy.domainPhasing(mapping_df, coordinates = TRUE)

The blessy.domainPhasing() generates the DoCo string of transcripts with matched domains and includes them to the input mapping data frame. The function iterates through the domains in each transcript based on their coordinates and strand direction, and output the DoCo string in a new 'DoCo' column. The coordinates boolean parameter decides if the coordinates are kept in the DoCo string or not. Switching coordinates to FALSE will remove the genomic coordinates from a DoCo string, grouping transcripts with domains in different locations but similar names. For example, transcript X with DoCo "D1::chr1:1500-2000;;geneZ" and transcript Y with DoCo "D1::chr1:2000-2500;;geneZ" will both belong to DoCo "D1;;geneZ" when coordinates is set to FALSE.

# Apply domain phasing and create a new 'DoCo' column
phased_df <- blessy.domainPhasing(deduplicated_df)

# Output
head(phased_df)
# A tibble: 3 × 8
  Transcript Domain  chrom chromStart chromEnd strand Gene  DoCo                                                
  <chr>      <chr>   <chr>      <dbl>    <dbl> <chr>  <chr> <chr>                                               
1 tx1        domainA chr1        1000     1100 +      geneA domainA::chr1:1000-1100(+),domainB::chr1:2000-2100(…
2 tx1        domainB chr1        2000     2100 -      geneA domainA::chr1:1000-1100(+),domainB::chr1:2000-2100(…
3 tx2        NA      chr2        3000     3100 +      geneB ;;; geneB                                           


6. Create Phasing Dictionary

blessy.createPhasingDictionary(mapping_df, transcript_annotation)

This function summarizes the hierarchical relationship between Gene, DoCo and transcript for transcripts with and without matched domains, in the chosen annotation. The function requires the data frame consisting mapped transcripts, and the initial transcript annotation for handling unmapped transcripts. The output of the function is a data frame with 'Gene', 'DoCo' and 'Transcript' columns, which will be used to aggregate transcripts belonging to the same DoCo in a given RNA-Seq transcript count.

# Create the phasing dictionary
final_dict <- blessy.createPhasingDictionary(phased_df, transcript_annotation)

# Output 
phasing_dict
# A tibble: 4 × 3
  Transcript  DoCo                                                         Gene
  <chr>       <chr>                                                        <chr>
1 TranscriptA D1::chr1:1250-2250(+),D2::chr1:1250-1750(+),D4::chr1:1250-1GeneX
2 tx1         ;;; geneA                                                    geneA
3 tx2         ;;; geneB                                                    geneB
4 tx3         ;;; geneC                                                    geneC

final_dict$DoCo
[1] "D1::chr1:1250-2250(+),D2::chr1:1250-1750(+),D4::chr1:1250-1350(+),D3::chr1:1750-2050(+);;; GeneX"
[2] ";;; geneA"
[3] ";;; geneB"
[4] ";;; geneC"


7. Create DoCo Count from Transcript Count

blessy.createDoCoCount(phasing_dict, transcriptCount)

This function aggregates an input transcript count to acquire DoCo-level count. Transcripts are grouped based on their DoCo assignment from an input dictionary data frame, and counts are summed across biological samples. Transcripts not found in the dictionary are grouped into the DoCo class ";;;". For the transcript count input, the first column must be named 'TranscriptID' storing string values of transcript identifiers. Other columns are considered numeric count across different biological samples.

# Example dictionary data frame
dict <- data.frame(
  Transcript = c("Tx1", "Tx2", "Tx3"),
  DoCo = c("D1,D2;;; GeneA", "D3;;; GeneB", ";;; GeneC"),
  stringsAsFactors = FALSE
)

# Example count data frame
count_df <- data.frame(
  row.names = c("Tx1", "Tx2", "Tx4", "Tx5"),
  Sample1 = c(10, 20, 5, 7),
  Sample2 = c(15, 25, 8, 10),
  stringsAsFactors = FALSE
)

# Create DoCo-level counts
doco_count <- blessy.createDoCoCount(dict, count_df)
Warning messages:
1: In blessy.createDoCoCount(dict, count_df) :
  Uncommon transcripts found, grouping them into the empty ';;;' DoCo group.
2: In blessy.createDoCoCount(dict, count_df) :
  Number of uncommon transcripts grouped: 2

# Output visualization
head(doco_count)
            DoCo       Sample1     Sample2
1           ;;;            12          18
2     D1,D2;;; GeneA       10          15
3      D3;;; GeneB         20          25

Auxiliary Functions

Convert BED-like R Data Frame to GRangesList

blessy.dfToGRangesList(annotation_df)

blessy.dfToGRangesList takes in a BED-like annotation data frame and convert it into a big GRanges object, based on the columns 'chrom', 'chromStart', 'chromEnd', and 'strand'. Next, this single GRanges objects will be splitted into a GRangesList where each GRanges object correspond to a transcript or a domain in the initial annotation using this function.

# Example BED-like annotation data frame
transcript_annotation <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  chromStart = c(1000, 2000, 3000),
  chromEnd = c(1500, 2500, 3500),
  name = c("tx1", "tx2", "tx3"),
  strand = c("+", "-", "+"),
  thickStart = c(1000, 2000, 3000),
  thickEnd = c(1500, 2500, 3500),
  blockCount = c(2, 2, 2),
  blockSizes = c("100,200", "150,250", "200,300"),
  blockStarts = c("0,400", "0,500", "0,600"),
  geneName = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Convert the data frame to a GRangesList
transcript_GRL <- blessy.dfToGRangesList(transcript_annotation)

# Output visualization
head(transcript_GRL)

GRangesList object of length 3:
$`1`
GRanges object with 1 range and 7 metadata columns:
      seqnames    ranges strand |        name thickStart  thickEnd blockCount  blockSizes blockStarts    geneName
         <Rle> <IRanges>  <Rle> | <character>  <numeric> <numeric>  <numeric> <character> <character> <character>
  [1]     chr1 1000-1500      + |         tx1       1000      1500          2     100,200       0,400       geneA
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$`2`
GRanges object with 1 range and 7 metadata columns:
      seqnames    ranges strand |        name thickStart  thickEnd blockCount  blockSizes blockStarts    geneName
         <Rle> <IRanges>  <Rle> | <character>  <numeric> <numeric>  <numeric> <character> <character> <character>
  [1]     chr1 2000-2500      - |         tx2       2000      2500          2     150,250       0,500       geneB
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$`3`
GRanges object with 1 range and 7 metadata columns:
      seqnames    ranges strand |        name thickStart  thickEnd blockCount  blockSizes blockStarts    geneName
         <Rle> <IRanges>  <Rle> | <character>  <numeric> <numeric>  <numeric> <character> <character> <character>
  [1]     chr2 3000-3500      + |         tx3       3000      3500          2     200,300       0,600       geneC
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths