Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#43 - Reproducibility #45

Draft
wants to merge 24 commits into
base: main
Choose a base branch
from
Draft

#43 - Reproducibility #45

wants to merge 24 commits into from

Conversation

jtimko16
Copy link

Fix several reproducibility issues.

However, I didn't manage to fix it in the whole code; there is still some uncaptured randomness that I haven't been able to solve so far.

@jtimko16 jtimko16 mentioned this pull request Jul 25, 2024
@janezlapajne
Copy link

Hello, I will go quickly through the changes and check if I everything seems ok - see code comments. I cannot fully test the thing right now, maybe in the following days. I will post new updates under issue: #43.

Copy link

@janezlapajne janezlapajne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some comments and suggestions. However, please note that I'm not familiar with the overall code structure, so some comments may not be relevant.

src/autofeat/featsel.py Outdated Show resolved Hide resolved
src/autofeat/featsel.py Outdated Show resolved Hide resolved
src/autofeat/featsel.py Outdated Show resolved Hide resolved
@@ -223,36 +242,40 @@ def select_features(

# select good features in k runs in parallel
# by doing sort of a cross-validation (i.e., randomly subsample data points)
def run_select_features(i: int):
def run_select_features(i: int, seed:int):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use space after colon.

Copy link
Owner

@cod3licious cod3licious Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also for consistency maybe call it random_seed here as well. but actually I think that setting the same random seed here might be problematic since then all runs could yield in the same result, defeating the purpose of doing this multiple times

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I fixed formatting.

@cod3licious I fixed the issue like this, let me know whether it makes sense to you

 np.random.seed(random_seed)
 loop_seed = np.random.randint(10**6)  # Added to random_seed to make sure that the 1run seed is different for each run, but globally reproducible
 seed = random_seed + loop_seed if random_seed is not None else loop_seed

src/autofeat/featsel.py Outdated Show resolved Hide resolved
@cod3licious
Copy link
Owner

Thanks for looking into this! I think you definitely found some places where setting random seeds should help, but I think there are also some places where it is not necessary. To avoid overcomplicating the implementation, it would be great if in the end you could also check whether some of your additions could be removed again while still keeping it reproducible. I'd be happy to later comment on the places where I think it might be possible to keep the existing implementation while still getting reproducible results.

@jtimko16
Copy link
Author

Hello @janezlapajne and @cod3licious, thanks a lot for your comments! I will try to continue with it and incorporate the changes next week.

@jtimko16
Copy link
Author

Thanks for looking into this! I think you definitely found some places where setting random seeds should help, but I think there are also some places where it is not necessary. To avoid overcomplicating the implementation, it would be great if in the end you could also check whether some of your additions could be removed again while still keeping it reproducible. I'd be happy to later comment on the places where I think it might be possible to keep the existing implementation while still getting reproducible results.

Hello, thanks for your suggestions. As I mentioned, unfortunately, the implementation is not fully reproducible yet. There is for sure improvement from the last time, but there is probably one more randomness that I didn't capture.

However, once I will be looking at it next week, I will do my best to remove extra random seeds and discover the remaining unsolved randomness.

@jtimko16
Copy link
Author

jtimko16 commented Aug 5, 2024

@janezlapajne and @cod3licious, Thanks again for your great comments. I did my best to implement all of those.

As I mentioned before, we have made some improvements, but the feature selector is not fully reproducible yet. And unfortunately, I don't have time to dive into this deeper again.

Once you re-review the PR, we can agree on the next steps. We can either keep the PR open, or merge at least the current changes. Whatever you think is better.

src/autofeat/featsel.py Outdated Show resolved Hide resolved
src/autofeat/featsel.py Outdated Show resolved Hide resolved
jtimko16 and others added 2 commits August 6, 2024 18:47
Mod - using KFold with all CV models; move random_seed_generator to u…
Copy link
Owner

@cod3licious cod3licious left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your work so far! However, I'm against merging this before it is truly ready. This also needs some proper tests (beyond a manually run notebook) to ensure that future changes don't cause a regression.

src/autofeat/utils.py Outdated Show resolved Hide resolved
src/autofeat/featsel.py Outdated Show resolved Hide resolved
jtimko16 and others added 2 commits August 8, 2024 17:24
Mod - replaced custom function by np.random.default_rng(); fixed the …
@jtimko16
Copy link
Author

jtimko16 commented Aug 8, 2024

Thank you for your work so far! However, I'm against merging this before it is truly ready. This also needs some proper tests (beyond a manually run notebook) to ensure that future changes don't cause a regression.

I agree, it will be better to fully finalize, rather than merge half done. As I said, I am quite busy those days, but I will try to find time for it once possible :)

@jtimko16
Copy link
Author

Hello everyone,

I spent another time with debugging today. However, I didn't manage to identify the remaining issue - there is still some randomness, even when running FeatureSelector only for featsel_run=1.

Therefore, I am not sure whether I will be able to finish. In case anyone would like to take over, I can explain/summarize what I have done so far, or we can team up and look at it together.

All the best,
J.

@jtimko16 jtimko16 marked this pull request as draft August 27, 2024 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants