-
Notifications
You must be signed in to change notification settings - Fork 457
TabPFNClassifier now natively handles missing values—no manual preprocessing needed, safer and more user-friendly #507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to improve NA handling and prevent pandas' SettingWithCopyWarning
. However, the implementation has some significant issues. A new preprocess_input
function is added but remains unused within the classifier, making it dead code. Consequently, the core logic of the classifier's data handling is not changed. The accompanying test file, tests/test_na_handling.py
, does not validate the classifier's ability to handle missing values internally, as it manually preprocesses the data before fitting. Furthermore, the test is structured as a script without assertions, which is not a robust way to verify functionality. My review includes suggestions to properly integrate the NA handling logic, improve the test structure, and fix code style issues.
src/tabpfn/classifier.py
Outdated
def preprocess_input(X): | ||
X = X.copy() | ||
|
||
for col in X.columns: | ||
if X[col].dtype == 'object': # Categorical/text columns | ||
X[col] = X[col].fillna('missing').astype(str) | ||
else: # Numeric columns | ||
X[col] = X[col].fillna(0).astype(float) | ||
|
||
return X | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This preprocess_input
function is defined but never called within the TabPFNClassifier
or elsewhere in the project. As a result, it is dead code and does not contribute to the NA handling logic. The classifier's existing preprocessing pipeline should be modified if the intent is to change how missing values are handled. I recommend removing this function until it is properly integrated.
Integrated native NA handling into TabPFNClassifier (fit, predict, predict_proba) – no manual preprocessing required. Added a dedicated pytest test (tests/test_na_handling.py) to validate that the classifier handles missing values internally. Verified locally: all tests passed successfully (python -m pytest -v). |
Hi @VIVEK-MARRI , thanks for the contribution. There seem to be a few things that can be improved in the PR. If you could have a look at the issues pointed out and also fix the styling issues it would be great. We will then be able to review. Thanks! |
hey, thank you for this contribution! TabPFN should already handle missing values, so would it be possible for you to open an issue with a small example dataset that shows when this doesn't work for you? Then we can look into what's going on. |
Thanks for the feedback! I've opened an issue with a reproducible example as requested: #545 . |
As discussed in #545 we think tabpfn is currentlying working as intended, so closing this for now. |
Problem
SettingWithCopyWarning
appears when filling missing values in TabPFNClassifier.Solution
'missing'
.0
.tests/test_na_handling.py
to pass raw data with NAs and added assertions usingpytest
to verify correct predictions.Benefits
Test
python -m pytest tests/test_na_handling.py -v
→ test passed successfully, no errors or warnings.