Clarification on Sequence Count in Rfam novel Dataset #2

Open
ZhiyuanChen opened this issue Dec 17, 2024 · 1 comment
Comments

@ZhiyuanChen

Thank you for presenting this work!

I noticed a discrepancy between the sequence counts mentioned in the paper and the provided RDa file for the ncRNA_deep-Rfam dataset.

The paper outlines the following pipeline for data processing:
1. Extracted 650,790 ncRNA sequences distributed across 2,570 functional families.
2. Removed sequences containing non-canonical bases (i.e., any letter other than A, C, G, or U).
3. Excluded classes corresponding to long non-coding RNAs or classes with average sequence lengths >200 nucleotides.
4. Filtered out classes strongly dependent on sequence length, identified using a C5.0 decision tree model, to ensure robustness against trivial classification signals.
5. Removed underrepresented classes with <400 sequences, resulting in 306,016 sequences across 88 Rfam functional families.
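For reference, the simpler filters in the list above (steps 2, 3, and 5) can be sketched as follows. This is a minimal illustration on toy data, not the authors' preparation script: the function name `filter_families`, its parameters, and the `(family, sequence)` record layout are all assumptions, and the C5.0-based filter from step 4 and the lncRNA class exclusion are omitted.

```python
from collections import defaultdict

CANONICAL = set("ACGU")


def filter_families(records, max_mean_len=200, min_class_size=400):
    """Hypothetical helper: records is an iterable of (family, sequence) pairs."""
    # Step 2: drop sequences containing anything other than A, C, G, U.
    clean = [(fam, seq) for fam, seq in records if seq and set(seq) <= CANONICAL]

    # Group the surviving sequences by family.
    by_family = defaultdict(list)
    for fam, seq in clean:
        by_family[fam].append(seq)

    kept = []
    for fam, seqs in by_family.items():
        mean_len = sum(len(s) for s in seqs) / len(seqs)
        # Step 3 (length criterion only): drop families averaging more than 200 nt.
        if mean_len > max_mean_len:
            continue
        # Step 5: drop families with fewer than min_class_size sequences.
        if len(seqs) < min_class_size:
            continue
        kept.extend((fam, s) for s in seqs)
    return kept


# Toy data; the threshold is lowered to 2 so the example stays small.
toy = (
    [("famA", "ACGU" * 10)] * 3   # short, canonical, 3 sequences: kept
    + [("famA", "ACGN")]          # contains N: removed in step 2
    + [("famB", "A" * 300)] * 3   # mean length 300 > 200: family removed
    + [("famC", "GGCC")]          # single sequence: below the size threshold
)
result = filter_families(toy, min_class_size=2)
print(len(result), {fam for fam, _ in result})  # 3 {'famA'}
```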

According to the paper, after processing, the final dataset should contain 306,016 sequences. However, in the provided RDa file, I found the following sequence counts:
• Training: 105,864 sequences
• Validation: 17,324 sequences
• Testing: 25,342 sequences

This adds up to 148,530 sequences, which is less than half of the number reported in the paper.
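The arithmetic can be checked with a few lines of Python. The split sizes and the reported total are the figures quoted above; the variable names are only illustrative.

```python
reported_total = 306_016  # final sequence count stated in the paper
splits = {"training": 105_864, "validation": 17_324, "testing": 25_342}

file_total = sum(splits.values())
fraction = file_total / reported_total

print(file_total)         # 148530
print(f"{fraction:.1%}")  # 48.5%
```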

Could you kindly clarify why there is this difference between the expected and actual number of sequences in the dataset?

Thank you for your help!

ZhiyuanChen changed the title from "Number of data mismatch" to "Clarification on Sequence Count in Rfam novel Dataset" on Dec 17, 2024
@luucee
Contributor

luucee commented Dec 23, 2024

If you ignore the final data-balancing step, the sequence counts match those reported in the paper. That balancing step was mistakenly left in the data preparation script, even though balancing is typically performed during the training phase.
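To illustrate how a balancing step can shrink a dataset this much, here is a minimal sketch that caps every class at a fixed size by random downsampling. This thread does not say how the balancing was actually implemented, so the capping strategy, the function name `downsample_to_cap`, and the toy data are all assumptions.

```python
import random
from collections import defaultdict


def downsample_to_cap(records, cap, seed=0):
    """Hypothetical balancing: keep at most `cap` sequences per family."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_family = defaultdict(list)
    for fam, seq in records:
        by_family[fam].append(seq)

    balanced = []
    for fam, seqs in by_family.items():
        # Families at or below the cap are kept whole; larger ones are subsampled.
        chosen = seqs if len(seqs) <= cap else rng.sample(seqs, cap)
        balanced.extend((fam, s) for s in chosen)
    return balanced


# Toy data: one large family and one family already at the cap.
toy = [("famA", f"seqA{i}") for i in range(1000)]
toy += [("famB", f"seqB{i}") for i in range(400)]

balanced = downsample_to_cap(toy, cap=400)
print(len(toy), len(balanced))  # 1400 800
```

With a cap like this, large families lose most of their sequences while the number of classes stays the same, which is consistent with a total that drops sharply after balancing.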
