Description
Thank you for presenting this work!
I noticed a discrepancy between the sequence counts mentioned in the paper and the provided RDa file for the ncRNA_deep-Rfam dataset.
The paper outlines the following pipeline for data processing:
1. Extracted 650,790 ncRNA sequences distributed across 2,570 functional families.
2. Removed sequences with non-canonical bases (i.e., letters other than A, C, G, and U).
3. Excluded classes corresponding to long non-coding RNAs or classes with average sequence lengths >200 nucleotides.
4. Filtered out classes strongly dependent on sequence length, identified using a C5.0 decision tree model, to ensure robustness against trivial classification signals.
5. Removed underrepresented classes with <400 sequences, resulting in 306,016 sequences across 88 Rfam functional families.
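For reference, here is a minimal sketch of how I understand steps 2, 3, and 5, in case it helps pinpoint where my reading differs from yours. This is not the paper's code: the function name `filter_families`, its parameters, and the input layout (a dict of family name to sequence list) are my own assumptions. Step 4 (the C5.0 length-dependence filter) is not reproduced, and I am assuming the average length is computed after the non-canonical sequences are removed.

```python
from statistics import mean

CANONICAL = set("ACGU")


def filter_families(families, max_mean_len=200, min_count=400):
    """Hypothetical helper: families maps a family name to a list of RNA strings."""
    kept = {}
    for name, seqs in families.items():
        # Step 2: drop sequences containing anything other than A, C, G, U.
        clean = [s for s in seqs if s and set(s.upper()) <= CANONICAL]
        if not clean:
            continue
        # Step 3: drop families whose average sequence length exceeds the cutoff.
        # (Assumption: the average is taken over the cleaned sequences.)
        if mean(len(s) for s in clean) > max_mean_len:
            continue
        # Step 4 (C5.0 length-dependence filter) is intentionally omitted here.
        # Step 5: drop families with fewer than min_count sequences.
        if len(clean) < min_count:
            continue
        kept[name] = clean
    return kept


if __name__ == "__main__":
    toy = {
        "fam_ok": ["ACGU"] * 3,
        "fam_noncanonical": ["ACGN"] * 3,
        "fam_long": ["A" * 300] * 3,
        "fam_small": ["ACGU"],
    }
    # A small min_count is used only so the toy data can pass step 5.
    print(sorted(filter_families(toy, min_count=2)))
```

If the released file was produced with additional steps (for example deduplication or subsampling), that would explain why a count based on these steps alone does not match.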
According to the paper, after processing, the final dataset should contain 306,016 sequences. However, in the provided RDa file, I found the following sequence counts:
• Training: 105,864 sequences
• Validation: 17,324 sequences
• Testing: 25,342 sequences
This adds up to 148,530 sequences, which is about 48.5% of the 306,016 reported in the paper (157,486 fewer).
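The comparison can be double-checked with a few lines of Python. The variable names are my own, and the numbers are simply the counts quoted above.

```python
# Total reported in the paper after processing.
paper_total = 306_016

# Per-split counts I found in the provided RDa file.
splits = {"training": 105_864, "validation": 17_324, "testing": 25_342}

file_total = sum(splits.values())
missing = paper_total - file_total
fraction = file_total / paper_total

print(f"Sequences in file: {file_total:,}")
print(f"Missing relative to paper: {missing:,}")
print(f"Fraction of paper total: {fraction:.1%}")
```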
Could you kindly clarify why the provided dataset contains fewer sequences than the paper reports?
Thank you for your help!