Add the feature contribution argument output as an option at predict #39

Open
wants to merge 6 commits into
base: master

Fish-Soup

Add the option to call lightgbm.Booster.predict(..., pred_contrib=True).
This generates an output whose number of columns equals the number of distribution arguments × (number of features + 1).

The output is converted to a DataFrame with two-level multi-index columns: the first level holds the distribution arguments, the second holds the feature contributions plus a Constant column.
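The reshaping described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: `contributions_to_frame` is a hypothetical helper, the flat array is synthetic rather than produced by LightGBM, and the sketch assumes each distribution argument occupies a contiguous block of columns with the constant term last. The level names follow the later rename mentioned in this thread.

```python
import numpy as np
import pandas as pd


def contributions_to_frame(flat, param_names, feature_names):
    """Wrap a flat contribution array in two-level MultiIndex columns.

    Assumed layout (not verified against the PR): for each parameter,
    one column per feature followed by one constant column.
    """
    block = len(feature_names) + 1  # features + Constant
    expected = len(param_names) * block
    if flat.shape[1] != expected:
        raise ValueError(f"expected {expected} columns, got {flat.shape[1]}")
    columns = pd.MultiIndex.from_product(
        [list(param_names), list(feature_names) + ["Constant"]],
        names=["parameters", "feature_contributions"],
    )
    return pd.DataFrame(flat, columns=columns)


# Synthetic stand-in for the raw output: 2 rows, 2 parameters, 2 features.
flat = np.arange(12, dtype=float).reshape(2, 6)
frame = contributions_to_frame(flat, ["loc", "scale"], ["x1", "x2"])
print(frame)
```

Selecting `frame["scale"]` then returns the per-feature contributions (and the constant) for that one parameter.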

A unit test was added for pred_contributions. It checks that summing all contributions and applying the response function gives the same result as predicting the parameters directly.
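The idea behind that check can be sketched like this. It is a minimal illustration on hand-written numbers, not the test from the PR: `params_from_contributions` is a hypothetical helper, and the response functions (identity for `loc`, `exp` for `scale`) are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def params_from_contributions(contribs, response_fns):
    """Rebuild parameters: sum each parameter's block, then apply its response function."""
    rebuilt = {}
    for name, fn in response_fns.items():
        raw = contribs[name].sum(axis=1)  # features + Constant for this parameter
        rebuilt[name] = fn(raw)
    return pd.DataFrame(rebuilt)


columns = pd.MultiIndex.from_product(
    [["loc", "scale"], ["x1", "x2", "Constant"]],
    names=["parameters", "feature_contributions"],
)
contribs = pd.DataFrame(
    [[0.5, -0.2, 1.0, 0.1, 0.3, -0.4],
     [1.0, 2.0, -1.0, 0.5, 0.5, 0.0]],
    columns=columns,
)

# Assumed response functions, for illustration only.
response_fns = {"loc": lambda s: s, "scale": np.exp}
rebuilt = params_from_contributions(contribs, response_fns)
print(rebuilt)
```

In a real test, `rebuilt` would be compared (for example with `np.allclose`) against the parameters returned by the model's own parameter prediction.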

I also noticed that sampling is always applied when predicting, even if the requested output does not require it. This must make predictions a little slower on larger data sets, so I moved the sampling code so that it is only called when required.
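A rough sketch of that control flow, under stated assumptions: the `predict` function below is hypothetical and uses a plain normal distribution, and the `pred_type` values are taken from this thread or assumed for illustration. It is not the library's implementation; it only shows sampling being deferred to the branches that need it.

```python
import numpy as np

# pred_type values that need random draws (assumed for this sketch).
SAMPLING_PRED_TYPES = {"samples", "quantiles"}


def predict(loc, scale, pred_type, n_samples=1000, quantiles=(0.1, 0.9), seed=0):
    """Return the requested output, drawing samples only when required."""
    if pred_type == "parameters":
        # No sampling on this path.
        return {"loc": loc, "scale": scale}
    if pred_type in SAMPLING_PRED_TYPES:
        rng = np.random.default_rng(seed)
        samples = rng.normal(loc[:, None], scale[:, None], size=(len(loc), n_samples))
        if pred_type == "samples":
            return samples
        # One row per observation, one column per quantile.
        return np.quantile(samples, quantiles, axis=1).T
    raise ValueError(f"unknown pred_type: {pred_type!r}")


loc = np.array([0.0, 1.0])
scale = np.array([1.0, 2.0])
params = predict(loc, scale, "parameters")
samples = predict(loc, scale, "samples", n_samples=100)
quants = predict(loc, scale, "quantiles", n_samples=100)
```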

@StatMixedML
Owner

StatMixedML commented Sep 25, 2024

Thanks for opening the PR and for your interest in the project, very much appreciated!

I'd need some time, though, to look into it in detail.

May I ask you to also give an example of how to use and interpret it? That would help, thanks!

@Fish-Soup
Author

Hi, I added an example in the examples section. There is a lot more you can use the output for. At a very high level, it provides SHAP-like information, but directly from LightGBM's internal calculations. When a distribution_arg is used, we can also use it to get the actual contribution to the final parameter value.
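One possible use, sketched on made-up numbers: ranking features by their mean absolute contribution to each parameter, similar to how SHAP values are often summarized. The frame below is synthetic, and the column layout and level names are assumptions based on the description in this thread.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["loc", "scale"], ["x1", "x2", "Constant"]],
    names=["parameters", "feature_contributions"],
)
contribs = pd.DataFrame(
    [[0.5, -0.2, 1.0, 0.1, 0.3, -0.4],
     [1.0, 2.0, -1.0, 0.5, 0.5, 0.0]],
    columns=columns,
)

# Mean absolute contribution per feature, one column per parameter.
# The Constant column is dropped because it is not a feature.
param_names = contribs.columns.get_level_values("parameters").unique()
importance = pd.DataFrame(
    {p: contribs[p].drop(columns="Constant").abs().mean() for p in param_names}
)
print(importance)
```

A larger value in a column suggests that the feature moves that parameter more, on average, across the rows considered.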

@Fish-Soup
Author

Fish-Soup commented Oct 3, 2024

I've added a little code to give the pandas columns a level name based on the pred_type argument. This is helpful when doing pandas operations like stack.

For example, when pred_type="quantiles", the columns of the pandas output will have the name "quantiles". This means we can call

pred_samples.stack("quantiles") to create a multi-index Series.

I've also changed the multi-index names for pred_type="contributions" from ["distribution_args", "FeatureContributions"] to ["parameters", "feature_contributions"], to align with the pred_type naming convention.
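A small sketch of why a named column level helps, using a hand-built frame rather than real model output. The column labels `quant_0.1` and `quant_0.9` are illustrative assumptions; only the level name "quantiles" comes from the comment above.

```python
import pandas as pd

# Stand-in for a quantile prediction: the columns carry the level name "quantiles".
pred_quantiles = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    columns=pd.Index(["quant_0.1", "quant_0.9"], name="quantiles"),
)

# Because the level is named, it can be referenced by name when stacking.
stacked = pred_quantiles.stack("quantiles")
print(stacked)
```

The result is a Series indexed by (row, quantile), which is convenient for group-wise operations or plotting in long format.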

@StatMixedML
Owner

Thanks for your changes. I am currently occupied with the Hyper-Tree paper, so please do expect some delay in my review.

@Fish-Soup
Author

Is there anything I can do to help speed it along? The PR is unit tested and essentially just passes arguments to lightgbm.Booster and then reshapes the output.
