Skip to content

Conversation

VIVEK-MARRI
Copy link

@VIVEK-MARRI VIVEK-MARRI commented Sep 14, 2025

Problem

  • SettingWithCopyWarning appears when filling missing values in TabPFNClassifier.
  • Previous tests manually preprocessed NAs, so the classifier's internal handling of missing values was never validated.

Solution

  • Integrated proper NA handling directly in the classifier’s preprocessing pipeline.
  • Categorical/text columns are filled with 'missing'.
  • Numeric columns are filled with 0.
  • Updated tests/test_na_handling.py to pass raw data with NAs and added assertions using pytest to verify correct predictions.

Benefits

  • Prevents runtime warnings in pandas.
  • Ensures TabPFNClassifier handles datasets with missing values robustly.
  • Improves code stability and user experience.
  • Provides a proper automated test that will fail if NA handling breaks in the future.

Test

  • Ran python -m pytest tests/test_na_handling.py -v → test passed successfully, no errors or warnings.
  • Classifier trained and predicted on datasets with missing values without any manual preprocessing.

@CLAassistant
Copy link

CLAassistant commented Sep 14, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to improve NA handling and prevent pandas' SettingWithCopyWarning. However, the implementation has some significant issues. A new preprocess_input function is added but remains unused within the classifier, making it dead code. Consequently, the core logic of the classifier's data handling is not changed. The accompanying test file, tests/test_na_handling.py, does not validate the classifier's ability to handle missing values internally, as it manually preprocesses the data before fitting. Furthermore, the test is structured as a script without assertions, which is not a robust way to verify functionality. My review includes suggestions to properly integrate the NA handling logic, improve the test structure, and fix code style issues.

Comment on lines 984 to 994
def preprocess_input(X):
X = X.copy()

for col in X.columns:
if X[col].dtype == 'object': # Categorical/text columns
X[col] = X[col].fillna('missing').astype(str)
else: # Numeric columns
X[col] = X[col].fillna(0).astype(float)

return X

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This preprocess_input function is defined but never called within the TabPFNClassifier or elsewhere in the project. As a result, it is dead code and does not contribute to the NA handling logic. The classifier's existing preprocessing pipeline should be modified if the intent is to change how missing values are handled. I recommend removing this function until it is properly integrated.

@VIVEK-MARRI VIVEK-MARRI changed the title Fix NA handling: avoid pandas SettingWithCopyWarning and fill missing values safely TabPFNClassifier now natively handles missing values—no manual preprocessing needed, safer and more user-friendly Sep 14, 2025
@VIVEK-MARRI
Copy link
Author

Integrated native NA handling into TabPFNClassifier (fit, predict, predict_proba) – no manual preprocessing required.

Added a dedicated pytest test (tests/test_na_handling.py) to validate that the classifier handles missing values internally.

Verified locally: all tests passed successfully (python -m pytest -v).

@bejaeger
Copy link
Contributor

bejaeger commented Oct 9, 2025

Hi @VIVEK-MARRI , thanks for the contribution. There seem to be a few things that can be improved in the PR. If you could have a look at the issues pointed out and also fix the styling issues it would be great. We will then be able to review. Thanks!

@VIVEK-MARRI VIVEK-MARRI requested a review from a team as a code owner October 9, 2025 14:31
@VIVEK-MARRI VIVEK-MARRI requested review from oscarkey and removed request for a team October 9, 2025 14:31
@oscarkey
Copy link
Contributor

hey, thank you for this contribution! TabPFN should already handle missing values, so would it be possible for you to open an issue with a small example dataset that shows when this doesn't work for you? Then we can look into what's going on.

@VIVEK-MARRI
Copy link
Author

Thanks for the feedback! I've opened an issue with a reproducible example as requested: #545 .

@oscarkey
Copy link
Contributor

As discussed in #545 we think tabpfn is currentlying working as intended, so closing this for now.

@oscarkey oscarkey closed this Oct 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants