I noticed a discrepancy between the sequence counts mentioned in the paper and the provided RDa file for the ncRNA_deep-Rfam dataset.
The paper outlines the following pipeline for data processing:
1. Extracted 650,790 ncRNA sequences distributed across 2,570 functional families.
2. Removed sequences with non-canonical bases (i.e., any letter other than A, C, G, or U).
3. Excluded classes corresponding to long non-coding RNAs or classes with average sequence lengths >200 nucleotides.
4. Filtered out classes strongly dependent on sequence length, identified using a C5.0 decision tree model, to ensure robustness against trivial classification signals.
5. Removed underrepresented classes with <400 sequences, resulting in 306,016 sequences across 88 Rfam functional families.
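For readers who want to reproduce the counts, the filtering steps above can be roughly sketched as follows. This is only an illustration, not the authors' script: the function name `filter_families`, its parameters, and the toy records are hypothetical, step 4 (the C5.0 length-dependence filter) and the long non-coding RNA exclusion in step 3 are omitted, and the thresholds are lowered in the usage example so that the toy data exercises each rule.

```python
import re
from collections import defaultdict

# Matches sequences made up only of the four canonical RNA bases.
CANONICAL = re.compile(r"[ACGU]+")


def filter_families(records, max_mean_len=200, min_count=400):
    """Apply steps 2, 3 (length part only) and 5 to (family, sequence) pairs.

    Hypothetical helper: returns a dict mapping each retained family to its sequences.
    """
    by_family = defaultdict(list)
    for family, seq in records:
        # Step 2: drop sequences containing anything other than A, C, G, U.
        if CANONICAL.fullmatch(seq):
            by_family[family].append(seq)

    kept = {}
    for family, seqs in by_family.items():
        mean_len = sum(len(s) for s in seqs) / len(seqs)
        # Step 3 (length part only): drop families whose mean length exceeds the cutoff.
        if mean_len > max_mean_len:
            continue
        # Step 5: drop underrepresented families.
        if len(seqs) < min_count:
            continue
        kept[family] = seqs
    return kept


# Toy data (made up): famA has one non-canonical sequence, famB is too small,
# famC is too long on average.
toy = [
    ("famA", "ACGU"), ("famA", "GGCA"), ("famA", "ACGN"),
    ("famB", "AUGC"),
    ("famC", "A" * 300), ("famC", "C" * 250),
]
result = filter_families(toy, max_mean_len=200, min_count=2)
print(result)  # {'famA': ['ACGU', 'GGCA']}
```

Counting the sequences that survive such a filter on the real data is what should be compared against the 306,016 figure.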
According to the paper, after processing, the final dataset should contain 306,016 sequences. However, in the provided RDa file, I found the following sequence counts:
• Training: 105,864 sequences
• Validation: 17,324 sequences
• Testing: 25,342 sequences
This adds up to 148,530 sequences, which is less than half of the number reported in the paper.
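For reference, the arithmetic can be checked with a few lines of Python. The variable names are my own, and the numbers are the ones quoted in this issue.

```python
# Split sizes found in the RDa file, as reported in this issue.
split_counts = {"training": 105_864, "validation": 17_324, "testing": 25_342}

# Final dataset size stated in the paper.
paper_total = 306_016

rda_total = sum(split_counts.values())
fraction = rda_total / paper_total

print(rda_total)           # 148530
print(round(fraction, 3))  # 0.485
```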
Could you kindly clarify why there is this difference between the expected and actual number of sequences in the dataset?
Thank you for your help!
ZhiyuanChen changed the title from "Number of data mismatch" to "Clarification on Sequence Count in Rfam novel Dataset" on Dec 17, 2024
The reported numbers align with those presented in the paper if you ignore the final data-balancing step. That step was mistakenly retained in the data preparation script, even though data balancing is typically performed during the training phase.
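To illustrate how a balancing step can shrink a dataset, here is a minimal sketch that caps the number of sequences per class by random downsampling. The thread does not say which balancing scheme the preparation script uses, so the capping strategy, the function name `cap_per_class`, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def cap_per_class(labels, cap, seed=0):
    """Return sorted indices that keep at most `cap` items per class.

    Hypothetical balancing scheme (random downsampling); the actual
    preparation script may balance the data differently.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for idxs in by_class.values():
        if len(idxs) > cap:
            # Oversized class: keep a random subset of size `cap`.
            kept.extend(rng.sample(idxs, cap))
        else:
            # Small class: keep everything.
            kept.extend(idxs)
    return sorted(kept)


# Toy labels (made up): class "a" is overrepresented.
labels = ["a"] * 5 + ["b"] * 2 + ["c"] * 3
kept = cap_per_class(labels, cap=3)
balanced_counts = Counter(labels[i] for i in kept)

print(len(labels), len(kept))  # 10 8
```

With a step like this left in the preparation script, the saved splits are smaller than the filtered dataset described in the paper, which would account for the kind of gap reported here.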
Thank you for presenting this work!