
Shapley values #335

Closed · wants to merge 13 commits

Conversation

Alex6022
Contributor

@Alex6022 Alex6022 commented Aug 5, 2024

Introduced SHAP (SHapley Additive exPlanations) analysis of the surrogate model to analyze the feature importance of finished campaigns. This is especially interesting in combination with the molecular encodings that are already built-in into BayBE.

In a previous project from the AC-BO-Hackathon, different molecular encodings were tested to screen molecules for high corrosion inhibition. Analyzing the highly successful MORDRED campaign with the new SHAP functionality yields the following summary plot:
[Image: SHAP summary plot of the MORDRED campaign]

Besides the measurement parameters "Time_h" and "Salt_Concentrat_M", the Mordred-specific features "SMILES_MORDRED_NdS" and "SMILES_MORDRED_nS" suggest the importance of sulphur groups for corrosion inhibition. Interestingly, this is in agreement with previous literature in the field. I hope that this new feature will be of interest to many other applications in the future.

@CLAassistant

CLAassistant commented Aug 5, 2024

CLA assistant check
All committers have signed the CLA.

@AdrianSosic
Collaborator

Hi @Alex6022, it's absolutely great to see that you took the time to contribute to BayBE 😃🥇 In fact, the feature importance part has been requested for quite a while (see #78), and there are currently quite a few changes underway (i.e. a major refactoring of the surrogate modules) that prepare our code for proper integration of this feature (and many others). The SHAP integration was one of the first things I had planned for after the refactoring is complete. (Ping: @brandon-holt, who specifically asked about SHAP)

That said, I'm glad to see that you already took the first step. And I'd be more than happy if we could finalize the PR together. However, I think a proper integration requires the refactoring to be finished first and the dev/surrogates branch to be merged. I had actually wanted to do so by the end of last week, when we also released version 0.10.0, but there were some unexpected hurdles, which forced me to postpone it to this week.

So if you don't mind: let me finish the refactoring first and then ping you once done, so that we can proceed with your PR? In the meantime, I can already highlight some points that we'd need to address with your PR before it can be merged. Perhaps, if you have the time, you can already start to work on it / think about the necessary design changes:

  • We need (a) test(s) for your new feature
  • Could it be that you've tested your code locally only with parameters that have a 1-d computational representation / no encoding, such as NumericalContinuousParameter/NumericalDiscreteParameter and CategoricalParameter with integer encoding? I'm asking because I think your call of the surrogate model would break for other parameter types, since the model is defined on the computational data representation while you pass it the experimental representation. That is, by the way, one of the things that will change in my current refactoring, making the necessary function calls much simpler. Afterwards, neither the entire torch-to-numpy conversion nor any experimental-to-computational transformation will be required.
  • For faster package import times, we make use of lazy loading for torch, i.e. top-level imports of torch are forbidden in modules that are not torch-centric (such as campaign.py); the CI pipeline will fail otherwise. (This will probably become obsolete after my refactoring anyway, as explained above.)
  • Because SHAP will very likely not remain the only "feature importance" method in the future, we need to ensure that things are properly decoupled from the rest of the baybe code. That is, I'd prefer that the computation is not embedded directly into existing classes like Campaign; instead, we should make clever use of composition / dependency injection, so that we can later run the same code with other feature importance analyzers – just like you can swap out other model parts against each other (keep in mind: baybe is designed with modularity as a primary goal). Things like plotting also need to be decoupled from the computational code.
  • Related to the above point: the design should be such that the shap package can stay an optional dependency.
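The last two points above can be sketched in code. The following is purely illustrative – all class and function names are hypothetical, not BayBE's actual API – but it shows the intended design: feature-importance analyzers are injected rather than hard-wired into Campaign, and `shap` stays an optional dependency via a lazy, guarded import.

```python
# Hypothetical sketch of a decoupled feature-importance design.
# None of these names exist in BayBE; they only illustrate the idea.
from typing import Callable, Protocol

import numpy as np


class FeatureImportanceAnalyzer(Protocol):
    """Anything that can score feature importance for a fitted model."""

    def analyze(
        self, predict: Callable[[np.ndarray], np.ndarray], data: np.ndarray
    ) -> np.ndarray: ...


class ShapKernelAnalyzer:
    """SHAP-based analyzer; `shap` is imported only when actually used."""

    def analyze(self, predict, data):
        try:
            import shap  # lazy import keeps `shap` an optional dependency
        except ImportError as ex:
            raise ImportError(
                "Install the optional 'shap' package to use ShapKernelAnalyzer."
            ) from ex
        return shap.KernelExplainer(predict, data).shap_values(data)


class VarianceAnalyzer:
    """Trivial alternative analyzer, showing that implementations can be
    swapped without touching the calling code."""

    def analyze(self, predict, data):
        return data.var(axis=0)


def run_importance_analysis(analyzer, predict, data):
    """The caller injects the analyzer, so no SHAP specifics leak into
    campaign-level code; plotting can live in a separate utility."""
    return analyzer.analyze(predict, data)
```

Because the analyzer is passed in, campaign-level code never imports `shap` itself, and a plain package install without the optional dependency keeps working until someone actually requests a SHAP analysis.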

I can gladly give you guidance on individual points if you need more information 👍🏼 And I'll keep you updated on the ongoing refactoring work (here is the current open PR: #325).

Once again, thanks for your contribution!!

@AdrianSosic AdrianSosic added the new feature New functionality label Aug 5, 2024
@AdrianSosic AdrianSosic changed the title Feature/feature importance Shapley values Aug 5, 2024
@brandon-holt
Contributor

brandon-holt commented Aug 14, 2024

@AdrianSosic @Alex6022 Hey, just checking the status of this! Also, if you could include a quick example of how to perform the SHAP analysis on a baybe surrogate, that would be much appreciated.

@Alex6022
Contributor Author

Hi @AdrianSosic, thank you for the great feedback! To respect the modularity of BayBE, implementing the SHAP calculation as a hook and then providing a plotting utility probably makes a lot of sense. This should solve issues 4 and 5 – would you agree? As you suggested, addressing the other points in detail may make more sense once the refactoring of the surrogate model has been completed. Please feel free to take your time with this; I am currently also on vacation until the end of August.

@brandon-holt, the current way to use my (arguably slightly hacked) implementation is to call `shap_feature_importance(plot=True)` on your `Campaign` object. This method also returns the SHAP values provided by the model-agnostic `KernelExplainer` from the SHAP package, in case you want to do more than just display the summary plot.

I really like the overall idea and vision for this package, so I am really excited to contribute with the implementation. Looking forward to continuing this 😃!

@brandon-holt
Contributor

brandon-holt commented Aug 15, 2024

@Alex6022 amazing, thank you! Would this work on a campaign object that was trained on a different version of baybe, then saved, reloaded, and had that function called in your version of baybe?

Or would it have to be trained on your version?

Edit: It appears that you cannot reload a campaign object that was saved via pickle using an earlier version of baybe which doesn't include your injected code. The following error appears: `UnpicklingError: NEWOBJ class argument must be a type, not function`. By contrast, if I load the model using the older version of baybe, there is no error.

Would it be possible to update the code to handle this scenario? Perhaps some of the modularity changes suggested by @AdrianSosic would address this? Let me know your thoughts – these campaign objects took over a week to train, so recreating them every time there is a new version of baybe is not ideal.

@AdrianSosic
Collaborator

Hi @brandon-holt, let me shed some light on what is happening behind the scenes:

  • Pickling is nice since it's generic and easy, but it doesn't replace a proper serialization mechanism. In essence, the problem you observe is the one described directly in the Python docs:
    [Screenshot of the Python pickle documentation: class instances are pickled by reference to their class, so a compatible class definition must still be importable at unpickling time.]
    So when the class definitions change, you might no longer be able to unpickle your objects.
  • So instead of pickling objects, I'd generally recommend making use of our serialization engine, which exports your BayBE objects to JSON.
  • Right now, we can only "guarantee" that serialization roundtrips work as long as you stay within the same version of BayBE, simply because implementing the necessary tests and adapters would slow us down too much in the current state of the project. However, once we get closer to version 1.0.0 and our APIs become more stable, we'll eventually also try to support cross-version serialization – at least to a certain extent.
  • Regarding your current problem, there might still be a comparably easy manual workaround, though. When you serialize your campaign to JSON, you'll see that the produced string is basically just a collection of the corresponding campaign fields, i.e. there will be an entry for your measurements, one for your searchspace, etc. When you try to deserialize it using a newer BayBE version and run into errors, this is most likely because some of the internal variable names have changed (e.g., I think the measurements dataframe was previously called just `data`). So the deserializer will probably complain about missing/additional unknown fields. A simple renaming in your JSON will likely solve the problem.
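The manual workaround above can be sketched as a tiny helper, assuming the only mismatch is a renamed top-level field. The specific field names below are examples only; the actual names depend on the BayBE versions involved.

```python
# Illustrative helper (not part of BayBE): rename a stale top-level field
# in an exported campaign JSON string so that a newer version's
# deserializer finds the field name it expects.
import json


def rename_json_field(campaign_json: str, old: str, new: str) -> str:
    """Return the JSON string with a top-level field renamed, if present."""
    obj = json.loads(campaign_json)
    if old in obj:
        obj[new] = obj.pop(old)
    return json.dumps(obj)


# Example: older exports may have used "data" where newer versions expect
# "measurements" (assumed names, to be checked against the actual JSON).
migrated = rename_json_field(
    '{"data": [], "searchspace": {}}', "data", "measurements"
)
```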

Let me know if I can help you any further.

@Scienfitz
Collaborator

Scienfitz commented Aug 30, 2024

Hello @Alex6022, there has been some movement on this issue and I wanted to reach out:

  • Expose surrogate #355 adds functionality to directly access the last fitted surrogate at the campaign (and possibly recommender) level.
  • In that PR, @AdrianSosic has already provided a nice self-contained example of how it could be used for SHAP explanations.
  • In Upcoming Insights Package #357 I propose how it could be integrated into the package. In short: we prefer a diagnostics package over the direct addition to the campaign you chose in this PR. Also, the newly required dependencies should be made optional (we can provide guidance for this, as has been done for other parts of the code).

Do you think that's reasonable? Do you want to have a go at this? Even if not, your response here would be appreciated. I'd judge the effort to do this as fairly small – perhaps even less code than this PR.

@Scienfitz
Collaborator

Closing this, as the way forward is not compatible with this PR:

  • In the meantime, easy access to the surrogate model has been provided in #355 via `get_surrogate`.
  • SHAP integration will be done via the diagnostics package discussed in #357.

@Alex6022
Contributor Author

Alex6022 commented Oct 4, 2024

Hello @Alex6022, there has been some movement on this issue and I wanted to reach out:

  • Expose surrogate #355 adds functionality to directly access the last fitted surrogate at the campaign (and possibly recommender) level.
  • In that PR, @AdrianSosic has already provided a nice self-contained example of how it could be used for SHAP explanations.
  • In Upcoming Diagnostics Package #357 I propose how it could be integrated into the package. In short: we prefer a diagnostics package over the direct addition to the campaign you chose in this PR. Also, the newly required dependencies should be made optional (we can provide guidance for this, as has been done for other parts of the code).

Do you think that's reasonable? Do you want to have a go at this? Even if not, your response here would be appreciated. I'd judge the effort to do this as fairly small – perhaps even less code than this PR.

Dear @Scienfitz,

I am afraid my previous response did not get through; hopefully this did not cause any inconvenience. I recently made a suggestion for this feature in PR #391, as mentioned in issue #357.
