-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#43 - Reproducibility #45
base: main
Are you sure you want to change the base?
Conversation
43 reprod issue
Hello, I will go quickly through the changes and check if I everything seems ok - see code comments. I cannot fully test the thing right now, maybe in the following days. I will post new updates under issue: #43. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some comments and suggestions. However, please note that I'm not familiar with the overall code structure, so some comments may not be relevant.
src/autofeat/featsel.py
Outdated
@@ -223,36 +242,40 @@ def select_features( | |||
|
|||
# select good features in k runs in parallel | |||
# by doing sort of a cross-validation (i.e., randomly subsample data points) | |||
def run_select_features(i: int): | |||
def run_select_features(i: int, seed:int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use space after colon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for consistency maybe call it random_seed
here as well. but actually I think that setting the same random seed here might be problematic since then all runs could yield in the same result, defeating the purpose of doing this multiple times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I fixed formatting.
@cod3licious I fixed the issue like this, let me know whether it makes sense to you
np.random.seed(random_seed)
loop_seed = np.random.randint(10**6) # Added to random_seed to make sure that the 1run seed is different for each run, but globally reproducible
seed = random_seed + loop_seed if random_seed is not None else loop_seed
Thanks for looking into this! I think you definitely found some places where setting random seeds should help, but I think there are also some places where it is not necessary. To avoid overcomplicating the implementation, it would be great if in the end you could also check whether some of your additions could be removed again while still keeping it reproducible. I'd be happy to later comment on the places where I think it might be possible to keep the existing implementation while still getting reproducible results. |
Hello @janezlapajne and @cod3licious, thanks a lot for your comments! I will try to continue with it and incorporate the changes next week. |
Hello, thanks for your suggestions. As I mentioned, unfortunately, the implementation is not fully reproducible yet. There is for sure improvement from the last time, but there is probably one more randomness that I didn't capture. However, once I will be looking at it next week, I will do my best to remove extra random seeds and discover the remaining unsolved randomness. |
43 reprod issue
@janezlapajne and @cod3licious, Thanks again for your great comments. I did my best to implement all of those. As I mentioned before, we have made some improvements, but the feature selector is not fully reproducible yet. And unfortunately, I don't have time to dive into this deeper again. Once you re-review the PR, we can agree on the next steps. We can either keep the PR open, or merge at least the current changes. Whatever you think is better. |
Mod - using KFold with all CV models; move random_seed_generator to u…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work so far! However, I'm against merging this before it is truly ready. This also needs some proper tests (beyond a manually run notebook) to ensure that future changes don't cause a regression.
Mod - replaced custom function by np.random.default_rng(); fixed the …
I agree, it will be better to fully finalize, rather than merge half done. As I said, I am quite busy those days, but I will try to find time for it once possible :) |
Hello everyone, I spent another time with debugging today. However, I didn't manage to identify the remaining issue - there is still some randomness, even when running FeatureSelector only for featsel_run=1. Therefore, I am not sure whether I will be able to finish. In case anyone would like to take over, I can explain/summarize what I have done so far, or we can team up and look at it together. All the best, |
Fix several reproducibility issues.
However, I didn't manage to fix it in the whole code; there is still some uncaptured randomness that I haven't been able to solve so far.