Closed
Labels
bug 💣 (Something isn't working)
Description
Describe the bug
I wanted to test TabPFNRegressor on the workers compensation dataset, see https://lorentzenchr.github.io/model-diagnostics/examples/regression_on_workers_compensation/.
For reasons unclear to me, I get an InvalidIndexError.
Steps/Code to Reproduce
import pandas as pd
from sklearn import set_config
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNRegressor
df_original = fetch_openml(data_id=42876, parser="auto").frame
df = df_original.query("WeeklyPay >= 200 and HoursWorkedPerWeek >= 20")
df_train, df_test = train_test_split(df, train_size=0.75, random_state=1234321)
X_train = df_train.drop(columns="UltimateIncurredClaimCost", inplace=False)
y_train = df_train["UltimateIncurredClaimCost"]
m = TabPFNRegressor()
m.fit(X_train[:100], y_train[:100])
Expected Results
Successful fit.
Actual Results
Short version
InvalidIndexError: (slice(0, 1, None), slice(None, None, None))
inside RemoveConstantFeaturesStep._fit(self, X, categorical_features)
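The key in the error message looks like a NumPy-style 2D slice being applied to a pandas DataFrame. A minimal sketch of that behavior follows. It uses a made-up toy frame rather than the workers compensation data, and it catches a generic exception because the exact exception type may differ between pandas versions:

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative (not the dataset from this report).
df = pd.DataFrame({"a": [1.0, 1.0, 1.0], "b": [1.0, 2.0, 3.0]})

# On an ndarray, X[0:1, :] selects the first row and keeps it 2D.
first_row_np = df.to_numpy()[0:1, :]

# On a DataFrame, the same key goes through DataFrame.__getitem__, which
# tries to look the tuple of slices up as a column label and fails.
try:
    df[0:1, :]
    error_name = None
except Exception as exc:  # InvalidIndexError with the pandas version in this report
    error_name = type(exc).__name__

# Positional indexing on a DataFrame needs .iloc instead.
first_row_df = df.iloc[0:1, :]

print(first_row_np.shape, error_name, first_row_df.shape)
```

So the failure does not seem specific to this dataset; any DataFrame that reaches this indexing expression unconverted would presumably hit it.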
Long version
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/github/python3_general/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: (slice(0, 1, None), slice(None, None, None))
During handling of the above exception, another exception occurred:
InvalidIndexError Traceback (most recent call last)
Cell In[12], line 11
7 m = TabPFNRegressor(
8 # categorical_features_indices=np.arange(len(x_continuous), len(x_vars)),
9 )
10 #m.fit(X_train.iloc[idx, 1:2], y_train.iloc[idx])
---> 11 m.fit(df_train.copy().drop(columns="UltimateIncurredClaimCost").iloc[:100], y_train.iloc[:100])
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/regressor.py:504, in TabPFNRegressor.fit(self, X, y)
499 self.renormalized_criterion_ = FullSupportBarDistribution(
500 self.bardist_.borders * self.y_train_std_ + self.y_train_mean_,
501 ).float()
503 # Create the inference engine
--> 504 self.executor_ = create_inference_engine(
505 X_train=X,
506 y_train=y,
507 model=self.model_,
508 ensemble_configs=ensemble_configs,
509 cat_ix=self.inferred_categorical_indices_,
510 fit_mode=self.fit_mode,
511 device_=self.device_,
512 rng=rng,
513 n_jobs=self.n_jobs,
514 byte_size=byte_size,
515 forced_inference_dtype_=self.forced_inference_dtype_,
516 memory_saving_mode=self.memory_saving_mode,
517 use_autocast_=self.use_autocast_,
518 )
520 return self
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/base.py:213, in create_inference_engine(X_train, y_train, model, ensemble_configs, cat_ix, fit_mode, device_, rng, n_jobs, byte_size, forced_inference_dtype_, memory_saving_mode, use_autocast_)
200 engine = InferenceEngineOnDemand.prepare(
201 X_train=X_train,
202 y_train=y_train,
(...)
210 save_peak_mem=memory_saving_mode,
211 )
212 elif fit_mode == "fit_preprocessors":
--> 213 engine = InferenceEngineCachePreprocessing.prepare(
214 X_train=X_train,
215 y_train=y_train,
216 cat_ix=cat_ix,
217 ensemble_configs=ensemble_configs,
218 n_workers=n_jobs,
219 model=model,
220 rng=rng,
221 dtype_byte_size=byte_size,
222 force_inference_dtype=forced_inference_dtype_,
223 save_peak_mem=memory_saving_mode,
224 )
225 elif fit_mode == "fit_with_cache":
226 engine = InferenceEngineCacheKV.prepare(
227 X_train=X_train,
228 y_train=y_train,
(...)
238 autocast=use_autocast_,
239 )
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/inference.py:269, in InferenceEngineCachePreprocessing.prepare(cls, X_train, y_train, cat_ix, model, ensemble_configs, n_workers, rng, dtype_byte_size, force_inference_dtype, save_peak_mem)
243 """Prepare the inference engine.
244
245 Args:
(...)
258 The prepared inference engine.
259 """
260 itr = fit_preprocessing(
261 configs=ensemble_configs,
262 X_train=X_train,
(...)
267 parallel_mode="block",
268 )
--> 269 configs, preprocessors, X_trains, y_trains, cat_ixs = list(zip(*itr))
270 return InferenceEngineCachePreprocessing(
271 X_trains=X_trains,
272 y_trains=y_trains,
(...)
279 save_peak_mem=save_peak_mem,
280 )
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/preprocessing.py:664, in fit_preprocessing(configs, X_train, y_train, random_state, cat_ix, n_workers, parallel_mode)
661 worker_func = joblib.delayed(func)
663 seeds = rng.integers(0, np.iinfo(np.int32).max, len(configs))
--> 664 yield from executor( # type: ignore
665 [
666 worker_func(config, X_train, y_train, seed)
667 for config, seed in zip(configs, seeds)
668 ],
669 )
File ~/github/python3_general/lib/python3.12/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
1916 output = self._get_sequential_output(iterable)
1917 next(output)
-> 1918 return output if self.return_generator else list(output)
1920 # Let's create an ID that uniquely identifies the current call. If the
1921 # call is interrupted early and that the same instance is immediately
1922 # re-used, this id will be used to prevent workers that were
1923 # concurrently finalizing a task from the previous call to run the
1924 # callback.
1925 with self._lock:
File ~/github/python3_general/lib/python3.12/site-packages/joblib/parallel.py:1847, in Parallel._get_sequential_output(self, iterable)
1845 self.n_dispatched_batches += 1
1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)
1848 self.n_completed_tasks += 1
1849 self.print_progress()
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/preprocessing.py:571, in fit_preprocessing_one(config, X_train, y_train, random_state, cat_ix)
568 y_train = y_train.copy()
570 preprocessor = config.to_pipeline(random_state=static_seed)
--> 571 res = preprocessor.fit_transform(X_train, cat_ix)
573 # TODO(eddiebergman): Not a fan of this, wish it was more transparent, but we want
574 # to distuinguish what to do with the `ys` based on the ensemble config type
575 if isinstance(config, RegressorEnsembleConfig):
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/model/preprocessing.py:398, in SequentialFeatureTransformer.fit_transform(self, X, categorical_features)
391 """Fit and transform the data using the fitted pipeline.
392
393 Args:
394 X: 2d array of shape (n_samples, n_features)
395 categorical_features: list of indices of categorical features.
396 """
397 for step in self.steps:
--> 398 X, categorical_features = step.fit_transform(X, categorical_features)
399 assert isinstance(categorical_features, list), (
400 f"The {step=} must return list of categorical features,"
401 f" but {type(step)} returned {categorical_features}"
402 )
404 self.categorical_features_ = categorical_features
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/model/preprocessing.py:315, in FeaturePreprocessingTransformerStep.fit_transform(self, X, categorical_features)
310 def fit_transform(
311 self,
312 X: np.ndarray,
313 categorical_features: list[int],
314 ) -> _TransformResult:
--> 315 self.fit(X, categorical_features)
316 # TODO(eddiebergman): If we could get rid of this... anywho, needed for
317 # the AddFingerPrint
318 result = self._transform(X, is_test=False)
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/model/preprocessing.py:341, in FeaturePreprocessingTransformerStep.fit(self, X, categorical_features)
334 def fit(self, X: np.ndarray, categorical_features: list[int]) -> Self:
335 """Fits the preprocessor.
336
337 Args:
338 X: 2d array of shape (n_samples, n_features)
339 categorical_features: list of indices of categorical feature.
340 """
--> 341 self.categorical_features_after_transform_ = self._fit(X, categorical_features)
342 assert self.categorical_features_after_transform_ is not None, (
343 "_fit should have returned a list of the indexes of the categorical"
344 "features after the transform."
345 )
346 return self
File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/model/preprocessing.py:453, in RemoveConstantFeaturesStep._fit(self, X, categorical_features)
451 @override
452 def _fit(self, X: np.ndarray, categorical_features: list[int]) -> list[int]:
--> 453 sel_ = ((X[0:1, :] == X).mean(axis=0) < 1.0).tolist()
455 if not any(sel_):
456 raise ValueError(
457 "All features are constant and would have been removed!"
458 " Unable to predict using TabPFN.",
459 )
File ~/github/python3_general/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File ~/github/python3_general/lib/python3.12/site-packages/pandas/core/indexes/base.py:3811, in Index.get_loc(self, key)
3806 except KeyError as err:
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
-> 3811 raise InvalidIndexError(key)
3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
InvalidIndexError: (slice(0, 1, None), slice(None, None, None))
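The last tabpfn frame evaluates `(X[0:1, :] == X).mean(axis=0) < 1.0`, and its signature annotates `X` as `np.ndarray`. As a rough sketch only, here is the same constant-column check with the input coerced to an array first. `non_constant_mask` is a hypothetical helper name and the toy frame is invented. This is not TabPFN's code, and it is not meant as the proper fix:

```python
import numpy as np
import pandas as pd


def non_constant_mask(X):
    """Hypothetical helper: True for each column holding more than one value."""
    # Coerce DataFrames (or lists) to an ndarray so positional slicing works.
    X_arr = np.asarray(X)
    # Compare every row against the first row; a column is constant when
    # all of its entries match, i.e. the mean of the matches equals 1.0.
    return ((X_arr[0:1, :] == X_arr).mean(axis=0) < 1.0).tolist()


# Toy data: column "a" is constant, column "b" is not.
df = pd.DataFrame({"a": [1.0, 1.0, 1.0], "b": [1.0, 2.0, 3.0]})
mask = non_constant_mask(df)
print(mask)
```

Whether converting the full workers compensation frame this way is appropriate is a separate question, since it also contains non-numeric columns.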
Versions
Collecting system and dependency information...
PyTorch version: 2.2.2
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.3 (x86_64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.1
Libc version: N/A
Python version: 3.12.7 (main, Oct 1 2024, 02:05:46) [Clang 15.0.0 (clang-1500.3.9.4)] (64-bit runtime)
Python platform: macOS-15.3-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
Dependency Versions:
--------------------
tabpfn: 2.0.6
torch: 2.2.2
numpy: 1.26.4
scipy: 1.14.1
pandas: 2.2.2
scikit-learn: 1.6.0
typing_extensions: 4.12.2
einops: 0.8.1
huggingface-hub: 0.28.1