
InvalidIndexError #196

@lorentzenchr

Description

Describe the bug

I wanted to test TabPFNRegressor on the workers' compensation dataset, see https://lorentzenchr.github.io/model-diagnostics/examples/regression_on_workers_compensation/.

For a reason that is unclear to me, the fit fails with an InvalidIndexError.

Steps/Code to Reproduce

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNRegressor


# Workers' compensation dataset from OpenML; .frame keeps it as a pandas DataFrame
df_original = fetch_openml(data_id=42876, parser="auto").frame
df = df_original.query("WeeklyPay >= 200 and HoursWorkedPerWeek >= 20")
df_train, df_test = train_test_split(df, train_size=0.75, random_state=1234321)
X_train = df_train.drop(columns="UltimateIncurredClaimCost")
y_train = df_train["UltimateIncurredClaimCost"]

m = TabPFNRegressor()
m.fit(X_train[:100], y_train[:100])  # raises InvalidIndexError

Expected Results

Successful fit.

Actual Results

Short version

InvalidIndexError: (slice(0, 1, None), slice(None, None, None))

raised inside RemoveConstantFeaturesStep._fit(self, X, categorical_features)
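
As far as I can tell from the traceback, the failing line sel_ = ((X[0:1, :] == X).mean(axis=0) < 1.0).tolist() assumes X is a NumPy array, but here it seems to receive a pandas DataFrame, which does not support this kind of NumPy-style 2-D slicing. A minimal standalone snippet (toy data, made-up column names) reproduces the same exception with pandas 2.2.2 on Python 3.12:

import pandas as pd

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

X.to_numpy()[0:1, :]  # fine: NumPy arrays support 2-D slicing
X.iloc[0:1, :]        # fine: positional indexing on a DataFrame
X[0:1, :]             # raises InvalidIndexError: (slice(0, 1, None), slice(None, None, None))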

Long version

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/github/python3_general/lib/python3.12/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: (slice(0, 1, None), slice(None, None, None))

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Cell In[12], line 11
      7 m = TabPFNRegressor(
      8     # categorical_features_indices=np.arange(len(x_continuous), len(x_vars)),
      9 )
     10 #m.fit(X_train.iloc[idx, 1:2], y_train.iloc[idx])
---> 11 m.fit(df_train.copy().drop(columns="UltimateIncurredClaimCost").iloc[:100], y_train.iloc[:100])

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/regressor.py:504, in TabPFNRegressor.fit(self, X, y)
    499 self.renormalized_criterion_ = FullSupportBarDistribution(
    500     self.bardist_.borders * self.y_train_std_ + self.y_train_mean_,
    501 ).float()
    503 # Create the inference engine
--> 504 self.executor_ = create_inference_engine(
    505     X_train=X,
    506     y_train=y,
    507     model=self.model_,
    508     ensemble_configs=ensemble_configs,
    509     cat_ix=self.inferred_categorical_indices_,
    510     fit_mode=self.fit_mode,
    511     device_=self.device_,
    512     rng=rng,
    513     n_jobs=self.n_jobs,
    514     byte_size=byte_size,
    515     forced_inference_dtype_=self.forced_inference_dtype_,
    516     memory_saving_mode=self.memory_saving_mode,
    517     use_autocast_=self.use_autocast_,
    518 )
    520 return self

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/base.py:213, in create_inference_engine(X_train, y_train, model, ensemble_configs, cat_ix, fit_mode, device_, rng, n_jobs, byte_size, forced_inference_dtype_, memory_saving_mode, use_autocast_)
    200     engine = InferenceEngineOnDemand.prepare(
    201         X_train=X_train,
    202         y_train=y_train,
   (...)
    210         save_peak_mem=memory_saving_mode,
    211     )
    212 elif fit_mode == "fit_preprocessors":
--> 213     engine = InferenceEngineCachePreprocessing.prepare(
    214         X_train=X_train,
    215         y_train=y_train,
    216         cat_ix=cat_ix,
    217         ensemble_configs=ensemble_configs,
    218         n_workers=n_jobs,
    219         model=model,
    220         rng=rng,
    221         dtype_byte_size=byte_size,
    222         force_inference_dtype=forced_inference_dtype_,
    223         save_peak_mem=memory_saving_mode,
    224     )
    225 elif fit_mode == "fit_with_cache":
    226     engine = InferenceEngineCacheKV.prepare(
    227         X_train=X_train,
    228         y_train=y_train,
   (...)
    238         autocast=use_autocast_,
    239     )

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/inference.py:269, in InferenceEngineCachePreprocessing.prepare(cls, X_train, y_train, cat_ix, model, ensemble_configs, n_workers, rng, dtype_byte_size, force_inference_dtype, save_peak_mem)
    243 """Prepare the inference engine.
    244 
    245 Args:
   (...)
    258     The prepared inference engine.
    259 """
    260 itr = fit_preprocessing(
    261     configs=ensemble_configs,
    262     X_train=X_train,
   (...)
    267     parallel_mode="block",
    268 )
--> 269 configs, preprocessors, X_trains, y_trains, cat_ixs = list(zip(*itr))
    270 return InferenceEngineCachePreprocessing(
    271     X_trains=X_trains,
    272     y_trains=y_trains,
   (...)
    279     save_peak_mem=save_peak_mem,
    280 )

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/preprocessing.py:664, in fit_preprocessing(configs, X_train, y_train, random_state, cat_ix, n_workers, parallel_mode)
    661 worker_func = joblib.delayed(func)
    663 seeds = rng.integers(0, np.iinfo(np.int32).max, len(configs))
--> 664 yield from executor(  # type: ignore
    665     [
    666         worker_func(config, X_train, y_train, seed)
    667         for config, seed in zip(configs, seeds)
    668     ],
    669 )

File ~/github/python3_general/lib/python3.12/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
   1916     output = self._get_sequential_output(iterable)
   1917     next(output)
-> 1918     return output if self.return_generator else list(output)
   1920 # Let's create an ID that uniquely identifies the current call. If the
   1921 # call is interrupted early and that the same instance is immediately
   1922 # re-used, this id will be used to prevent workers that were
   1923 # concurrently finalizing a task from the previous call to run the
   1924 # callback.
   1925 with self._lock:

File ~/github/python3_general/lib/python3.12/site-packages/joblib/parallel.py:1847, in Parallel._get_sequential_output(self, iterable)
   1845 self.n_dispatched_batches += 1
   1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)
   1848 self.n_completed_tasks += 1
   1849 self.print_progress()

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/preprocessing.py:571, in fit_preprocessing_one(config, X_train, y_train, random_state, cat_ix)
    568     y_train = y_train.copy()
    570 preprocessor = config.to_pipeline(random_state=static_seed)
--> 571 res = preprocessor.fit_transform(X_train, cat_ix)
    573 # TODO(eddiebergman): Not a fan of this, wish it was more transparent, but we want
    574 # to distuinguish what to do with the `ys` based on the ensemble config type
    575 if isinstance(config, RegressorEnsembleConfig):

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/model/preprocessing.py:398, in SequentialFeatureTransformer.fit_transform(self, X, categorical_features)
    391 """Fit and transform the data using the fitted pipeline.
    392 
    393 Args:
    394     X: 2d array of shape (n_samples, n_features)
    395     categorical_features: list of indices of categorical features.
    396 """
    397 for step in self.steps:
--> 398     X, categorical_features = step.fit_transform(X, categorical_features)
    399     assert isinstance(categorical_features, list), (
    400         f"The {step=} must return list of categorical features,"
    401         f" but {type(step)} returned {categorical_features}"
    402     )
    404 self.categorical_features_ = categorical_features

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/model/preprocessing.py:315, in FeaturePreprocessingTransformerStep.fit_transform(self, X, categorical_features)
    310 def fit_transform(
    311     self,
    312     X: np.ndarray,
    313     categorical_features: list[int],
    314 ) -> _TransformResult:
--> 315     self.fit(X, categorical_features)
    316     # TODO(eddiebergman): If we could get rid of this... anywho, needed for
    317     # the AddFingerPrint
    318     result = self._transform(X, is_test=False)

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/model/preprocessing.py:341, in FeaturePreprocessingTransformerStep.fit(self, X, categorical_features)
    334 def fit(self, X: np.ndarray, categorical_features: list[int]) -> Self:
    335     """Fits the preprocessor.
    336 
    337     Args:
    338         X: 2d array of shape (n_samples, n_features)
    339         categorical_features: list of indices of categorical feature.
    340     """
--> 341     self.categorical_features_after_transform_ = self._fit(X, categorical_features)
    342     assert self.categorical_features_after_transform_ is not None, (
    343         "_fit should have returned a list of the indexes of the categorical"
    344         "features after the transform."
    345     )
    346     return self

File ~/github/python3_general/lib/python3.12/site-packages/tabpfn/model/preprocessing.py:453, in RemoveConstantFeaturesStep._fit(self, X, categorical_features)
    451 @override
    452 def _fit(self, X: np.ndarray, categorical_features: list[int]) -> list[int]:
--> 453     sel_ = ((X[0:1, :] == X).mean(axis=0) < 1.0).tolist()
    455     if not any(sel_):
    456         raise ValueError(
    457             "All features are constant and would have been removed!"
    458             " Unable to predict using TabPFN.",
    459         )

File ~/github/python3_general/lib/python3.12/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/github/python3_general/lib/python3.12/site-packages/pandas/core/indexes/base.py:3811, in Index.get_loc(self, key)
   3806 except KeyError as err:
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
-> 3811         raise InvalidIndexError(key)
   3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.

InvalidIndexError: (slice(0, 1, None), slice(None, None, None))
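
If my reading is correct, a possible fix (only a sketch, not tested against the TabPFN code base) would be either to convert X to a NumPy array before it reaches the preprocessing pipeline, or to coerce it inside the step before the positional slicing, e.g.:

import numpy as np

# Sketch of RemoveConstantFeaturesStep._fit with an explicit coercion added;
# everything apart from the np.asarray call is taken from the traceback above.
def _fit(self, X, categorical_features):
    X = np.asarray(X)  # a pandas DataFrame does not support X[0:1, :]
    sel_ = ((X[0:1, :] == X).mean(axis=0) < 1.0).tolist()
    ...  # rest of the method unchanged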

Versions

Collecting system and dependency information...
PyTorch version: 2.2.2
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.3 (x86_64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.1
Libc version: N/A

Python version: 3.12.7 (main, Oct  1 2024, 02:05:46) [Clang 15.0.0 (clang-1500.3.9.4)] (64-bit runtime)
Python platform: macOS-15.3-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz

Dependency Versions:
--------------------
tabpfn: 2.0.6
torch: 2.2.2
numpy: 1.26.4
scipy: 1.14.1
pandas: 2.2.2
scikit-learn: 1.6.0
typing_extensions: 4.12.2
einops: 0.8.1
huggingface-hub: 0.28.1
