Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in gapped kmer #55

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

thomasopsomer
Copy link

Hello @jmschrei,

I had issues in the function that computes scores for the gapped kmers: _extract_gkmers in _seqlet_to_gkmers. It seems that if there is less gapped kmers extracted than the parameter max_entries there is a bug. It's just a quick fix that ensure we don't have size issues but I didn't really think this a lot :)

@jmschrei
Copy link
Owner

jmschrei commented Oct 9, 2024

Thanks for the contribution. Would you mind undoing the formatting changes? It's making it challenging to see what the actual contribution is.

@thomasopsomer
Copy link
Author

Ah yes sorry it was ruff stuff I removed the formatting

@jmschrei
Copy link
Owner

My recollection is that the number of gapped k-mers that could be considered for a sequence is not data-dependant. The only filtering being done here is to make sure that each considered gkmer is a valid gkmer. Why wouldn't you just reduce max_entries to a value more appropriate for your data? It is supposed to be much smaller than the number of potential gkmers.

@thomasopsomer
Copy link
Author

I went back to see the data that caused the bug, and I understand better now:

I used a sliding_window_size of 20 and flank_size of 10, which creates to seqlets of length 40. However when computing the affinity matrix (in seqlets_to_patterns), it keeps only the positions of the topn (=20 by default) best positions considering their attribution. By default the max gap size is 15. For sequence of 40 there could be gaps of more than 15 and so it could produce much less gkmers than max_entries...

Yes it's possible get ride of the bug by decreasing flank_size. Or by tweaking the parametersmax_entries, top_n or max_gap in cosine_similarity_from_seqlets, however these parameters are not exposed to the main TFMoDISco method.

So finally I'm wondering if there is much to do ^^. Maybe raising an Error explaining what parameters to change in order to avoid the bug ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants