Clarification on Sequence Count in Rfam novel Dataset #2

Open
@ZhiyuanChen

Thank you for presenting this work!

I noticed a discrepancy between the sequence counts mentioned in the paper and the provided RDa file for the ncRNA_deep-Rfam dataset.

The paper outlines the following pipeline for data processing:
1. Extracted 650,790 ncRNA sequences distributed across 2,570 functional families.
2. Removed sequences with non-canonical bases (i.e., any letter other than A, C, G, or U).
3. Excluded classes corresponding to long non-coding RNAs or classes with average sequence lengths >200 nucleotides.
4. Filtered out classes strongly dependent on sequence length, identified using a C5.0 decision tree model, to ensure robustness against trivial classification signals.
5. Removed underrepresented classes with <400 sequences, resulting in 306,016 sequences across 88 Rfam functional families.
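
To make my reading of the pipeline concrete, here is a rough sketch of steps 2, 3, and 5 on toy data. It is only an illustration under my own assumptions, not the authors' code: `filter_dataset`, its parameter names, and the `(family, sequence)` record layout are hypothetical, the long non-coding RNA annotation part of step 3 is left out, and step 4 (the C5.0 length-dependence filter) is omitted entirely.

```python
from collections import defaultdict

CANONICAL = set("ACGU")

def filter_dataset(records, max_mean_len=200, min_class_size=400):
    """Filter (family, sequence) pairs; hypothetical helper, not the paper's code."""
    # Step 2: drop sequences containing any letter other than A, C, G, U.
    clean = [(fam, seq) for fam, seq in records if set(seq.upper()) <= CANONICAL]

    # Group the surviving sequences by family.
    by_family = defaultdict(list)
    for fam, seq in clean:
        by_family[fam].append(seq)

    kept = []
    for fam, seqs in by_family.items():
        mean_len = sum(len(s) for s in seqs) / len(seqs)
        # Step 3 (length part only): drop families with average length > threshold.
        if mean_len > max_mean_len:
            continue
        # Step 5: drop underrepresented families.
        if len(seqs) < min_class_size:
            continue
        kept.extend((fam, s) for s in seqs)
    return kept

# Toy data with a tiny class-size threshold so the effect is visible.
toy = [
    ("fam_a", "ACGU"), ("fam_a", "ACGN"), ("fam_a", "GGCC"),
    ("fam_b", "A" * 300), ("fam_b", "C" * 250),
    ("fam_c", "AU"),
]
toy_kept = filter_dataset(toy, max_mean_len=200, min_class_size=2)
print(toy_kept)
```

On the toy data, only the two canonical `fam_a` sequences survive: `ACGN` is removed in step 2, `fam_b` is too long on average, and `fam_c` is too small.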

According to the paper, after processing, the final dataset should contain 306,016 sequences. However, in the provided RDa file, I found the following sequence counts:
• Training: 105,864 sequences
• Validation: 17,324 sequences
• Testing: 25,342 sequences

This adds up to 148,530 sequences, which is about 48.5% of the 306,016 reported in the paper, leaving 157,486 sequences unaccounted for.
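
The arithmetic can be checked with a few lines of Python. The split counts are the ones I read from the file and the total is the figure from the paper; the variable names are just mine.

```python
# Sequence counts found in the provided file, per split.
split_counts = {"training": 105_864, "validation": 17_324, "testing": 25_342}
# Final dataset size reported in the paper.
reported_total = 306_016

observed_total = sum(split_counts.values())
missing = reported_total - observed_total
fraction = observed_total / reported_total

print(f"observed total: {observed_total:,}")        # observed total: 148,530
print(f"missing: {missing:,}")                      # missing: 157,486
print(f"fraction of reported: {fraction:.1%}")      # fraction of reported: 48.5%
```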

Could you kindly clarify why there is this difference between the expected and actual number of sequences in the dataset?

Thank you for your help!
