
Shapley values #335

Closed · wants to merge 13 commits

Conversation

Alex6022
Contributor

@Alex6022 Alex6022 commented Aug 5, 2024

Introduced SHAP (SHapley Additive exPlanations) analysis of the surrogate model to analyze the feature importance of finished campaigns. This is especially interesting in combination with the molecular encodings that are already built-in into BayBE.

In a previous project from the AC-BO-Hackathon, different molecular encodings were tested to screen molecules for high corrosion inhibition. Analyzing the highly successful MORDRED campaign with the new SHAP functionality yields the following summary plot:
[Image: SHAP summary plot of the MORDRED campaign]

Besides the measurement parameters "Time_h" and "Salt_Concentrat_M", the Mordred-specific features "SMILES_MORDRED_NdS" and "SMILES_MORDRED_nS" suggest the importance of sulphur groups for corrosion inhibition. Interestingly, this is in agreement with previous literature in the field. I hope that this new feature will be of interest to many other applications in the future.

@CLAassistant

CLAassistant commented Aug 5, 2024

CLA assistant check
All committers have signed the CLA.

@AdrianSosic
Collaborator

Hi @Alex6022, it's absolutely great to see that you took the time to contribute to BayBE 😃🥇 In fact, the feature importance part has been requested for quite a while (see #78), and there are currently quite a few changes underway (i.e. a major refactoring of the surrogate modules) that prepare our code for proper integration of this feature (and many others). The SHAP integration was one of the first things I had planned for after the refactoring is complete. (Ping: @brandon-holt, who specifically asked about SHAP)

That said, I'm glad to see that you already took the first step. And I'd be more than happy if we could finalize the PR together. However, I think a proper integration requires the refactoring to be finished first and the dev/surrogates branch to be merged. I had actually wanted to do so by the end of last week, when we also released version 0.10.0, but there were some unexpected hurdles, which forced me to postpone it to this week.

So if you don't mind: let me finish the refactoring first and then ping you once done, so that we can proceed with your PR? In the meantime, I can already highlight some points that we'd need to address with your PR before it can be merged. Perhaps, if you have the time, you can already start to work on it / think about the necessary design changes:

  • We need (a) test(s) for your new feature
  • Could it be that you've tested your code locally only with parameters that have a 1-d computational representation / no encoding, such as NumericalContinuousParameter/NumericalDiscreteParameter and CategoricalParameter with integer encoding? I'm asking because I think your call of the surrogate model would break for other parameter types, since the model is defined on the computational data representation while you pass it the experimental representation. That is, by the way, one of the things that will change in my current refactoring, making the necessary function calls much simpler. Afterwards, neither the entire torch-to-numpy conversion nor any experimental-to-computational transformation will be required.
  • For faster package import times, we make use of lazy loading for torch, i.e. top-level imports of torch are forbidden in modules that are not torch-centric (such as campaign.py); the CI pipeline will fail otherwise. (This will probably become obsolete after my refactoring anyway, as explained above.)
  • Because SHAP will very likely not remain the only "feature importance" method in the future, we need to ensure that things are properly decoupled from the rest of the baybe code. That is, I'd prefer that the computation is not embedded directly into existing classes like Campaign; instead, we should make clever use of composition / dependency injection, so that we can later run the same code with other feature importance analyzers – just like you can swap out other model parts against each other (keep in mind: baybe is designed with modularity as a primary goal). Things like plotting also need to be decoupled from the computational code.
  • Related to the above point: the design should be such that the shap package can stay an optional dependency.
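The last two points above can be sketched in code. The following is purely illustrative – all class and function names are hypothetical, not BayBE's actual API – but it shows the intended design: feature-importance analyzers are injected rather than hard-wired into Campaign, and `shap` stays an optional dependency via a lazy, guarded import.

```python
# Hypothetical sketch of a decoupled feature-importance design.
# None of these names exist in BayBE; they only illustrate the idea.
from typing import Callable, Protocol

import numpy as np


class FeatureImportanceAnalyzer(Protocol):
    """Anything that can score feature importance for a fitted model."""

    def analyze(
        self, predict: Callable[[np.ndarray], np.ndarray], data: np.ndarray
    ) -> np.ndarray: ...


class ShapKernelAnalyzer:
    """SHAP-based analyzer; `shap` is imported only when actually used."""

    def analyze(self, predict, data):
        try:
            import shap  # lazy import keeps `shap` an optional dependency
        except ImportError as ex:
            raise ImportError(
                "Install the optional 'shap' package to use ShapKernelAnalyzer."
            ) from ex
        return shap.KernelExplainer(predict, data).shap_values(data)


class VarianceAnalyzer:
    """Trivial alternative analyzer, showing that implementations can be
    swapped without touching the calling code."""

    def analyze(self, predict, data):
        return data.var(axis=0)


def run_importance_analysis(analyzer, predict, data):
    """The caller injects the analyzer, so no SHAP specifics leak into
    campaign-level code; plotting can live in a separate utility."""
    return analyzer.analyze(predict, data)
```

Because the analyzer is passed in, campaign-level code never imports `shap` itself, and a plain package install without the optional dependency keeps working until someone actually requests a SHAP analysis.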

I can gladly give you guidance on individual points if you need more information 👍🏼 And I'll keep you updated on the ongoing refactoring work (here is the current open PR: #325).

Once again, thanks for your contribution!!

@AdrianSosic AdrianSosic added the new feature New functionality label Aug 5, 2024
@AdrianSosic AdrianSosic changed the title Feature/feature importance Shapley values Aug 5, 2024
@brandon-holt
Contributor

brandon-holt commented Aug 14, 2024

@AdrianSosic @Alex6022 Hey, just checking the status of this! Also, if you could include a quick example of how to perform the SHAP analysis on a baybe surrogate, that would be much appreciated.

@Alex6022
Contributor Author

Hi @AdrianSosic, thank you for the great feedback! To respect the modularity of BayBE, implementing the SHAP calculation as a hook and then providing a plotting utility probably makes a lot of sense. This should solve issues 4 and 5 – would you agree? As you suggested, addressing the other points in detail may make more sense once the refactoring of the surrogate model has been completed. Please feel free to take your time with this; I am currently also on vacation until the end of August.

@brandon-holt, the current way to use my (arguably slightly hacked) implementation is to call `shap_feature_importance(plot=True)` on your `Campaign` object. This method also returns the SHAP values provided by the model-agnostic `KernelExplainer` from the SHAP package, in case you want to do more than just display the summary plot.

I really like the overall idea and vision for this package, so I am really excited to contribute with the implementation. Looking forward to continuing this 😃!

@brandon-holt
Contributor

brandon-holt commented Aug 15, 2024

@Alex6022 amazing, thank you! Would this work on a campaign object that was trained on a different version of baybe, then saved, reloaded, and had that function called in your version of baybe?

Or would it have to be trained on your version?

Edit: It appears that you cannot reload a campaign object that was saved via pickle using an earlier version of baybe which doesn't include your injected code. The following error appears: `UnpicklingError: NEWOBJ class argument must be a type, not function`. By contrast, if I load the model using the older version of baybe, there is no error.

Would it be possible to update the code to handle this scenario? Perhaps some of the modularity changes suggested by @AdrianSosic would address this? Let me know your thoughts – these campaign objects took over a week to train, so recreating them every time there is a new version of baybe is not ideal.

@AdrianSosic
Collaborator

Hi @brandon-holt, let me shed some light on what is happening behind the scenes:

  • Pickling is nice since it's generic and easy, but it doesn't replace a proper serialization mechanism. In essence, the problem you observe is the one described directly in the Python docs:
    [Screenshot of the Python pickle documentation: class instances are pickled by reference to their class, so a compatible class definition must still be importable at unpickling time.]
    So when the class definitions change, you might no longer be able to unpickle your objects.
  • So instead of pickling objects, I'd generally recommend making use of our serialization engine, which exports your BayBE objects to JSON.
  • Right now, we can only "guarantee" that serialization roundtrips work as long as you stay within the same version of BayBE, simply because implementing the necessary tests and adapters would slow us down too much in the current state of the project. However, once we get closer to version 1.0.0 and our APIs become more stable, we'll eventually also try to support cross-version serialization – at least to a certain extent.
  • Regarding your current problem, there might still be a comparably easy manual workaround, though. When you serialize your campaign to JSON, you'll see that the produced string is basically just a collection of the corresponding campaign fields, i.e. there will be an entry for your measurements, one for your searchspace, etc. When you try to deserialize it using a newer BayBE version and run into errors, this is most likely because some of the internal variable names have changed (e.g., I think the measurements dataframe was previously called just `data`). So the deserializer will probably complain about missing/additional unknown fields. A simple renaming in your JSON will likely solve the problem.
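The manual workaround above can be sketched as a tiny helper, assuming the only mismatch is a renamed top-level field. The specific field names below are examples only; the actual names depend on the BayBE versions involved.

```python
# Illustrative helper (not part of BayBE): rename a stale top-level field
# in an exported campaign JSON string so that a newer version's
# deserializer finds the field name it expects.
import json


def rename_json_field(campaign_json: str, old: str, new: str) -> str:
    """Return the JSON string with a top-level field renamed, if present."""
    obj = json.loads(campaign_json)
    if old in obj:
        obj[new] = obj.pop(old)
    return json.dumps(obj)


# Example: older exports may have used "data" where newer versions expect
# "measurements" (assumed names, to be checked against the actual JSON).
migrated = rename_json_field(
    '{"data": [], "searchspace": {}}', "data", "measurements"
)
```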

Let me know if I can help you any further.

@Scienfitz
Collaborator

Scienfitz commented Aug 30, 2024

Hello @Alex6022, there has been some movement on this issue and I wanted to reach out:

  • Expose surrogate #355 adds functionality to directly access the last fitted surrogate at the campaign (and possibly recommender) level.
  • In that PR, @AdrianSosic has already provided a nice self-contained example of how it could be used for SHAP explanations.
  • In Upcoming Insights Package #357 I propose how it could be integrated into the package. In short: we prefer a diagnostics package over the direct addition to the campaign you chose in this PR. Also, the newly required dependencies should be made optional (we can provide guidance for this, as has been done for other parts of the code).

Do you think that's reasonable? Do you want to have a go at this? Even if not, your response here would be appreciated. I'd judge the effort to do this as fairly small – perhaps even less code than this PR.

@Scienfitz
Collaborator

Closing this, as the way forward is not compatible with this PR:

  • In the meantime, easy access to the surrogate model has been provided in #355 via `get_surrogate`.
  • SHAP integration will be done via the diagnostics package discussed in #357.

@Alex6022
Contributor Author

Alex6022 commented Oct 4, 2024

Hello @Alex6022, there has been some movement on this issue and I wanted to reach out:

  • Expose surrogate #355 adds functionality to directly access the last fitted surrogate at the campaign (and possibly recommender) level.
  • In that PR, @AdrianSosic has already provided a nice self-contained example of how it could be used for SHAP explanations.
  • In Upcoming Diagnostics Package #357 I propose how it could be integrated into the package. In short: we prefer a diagnostics package over the direct addition to the campaign you chose in this PR. Also, the newly required dependencies should be made optional (we can provide guidance for this, as has been done for other parts of the code).

Do you think that's reasonable? Do you want to have a go at this? Even if not, your response here would be appreciated. I'd judge the effort to do this as fairly small – perhaps even less code than this PR.

Dear @Scienfitz,

I am afraid my previous response did not get through; hopefully this did not cause any inconvenience. I recently made a suggestion for this feature in PR #391, as mentioned in issue #357.
