Autosklearn Issue could not convert string to float #662

Open
annawiewer opened this issue Dec 2, 2024 · 3 comments
Labels
framework For issues with frameworks in the current benchmark

Comments

@annawiewer

Hello,

I have tried to run autosklearn with OpenML task ID 362234, but I constantly receive the error

"ValueError: could not convert string to float: 'IBRDB0050'":

Could you help me fix this, please?

Thanks!

Starting job local.medium.2m8c.Loan_Type.9.autosklearn.
Assigning 8 cores (total=8) for new task Loan_Type.
Assigning 542 MB (total=7866 MB) for new Loan_Type task.
Running task Loan_Type on framework autosklearn with config:
TaskConfig({'framework': 'autosklearn', 'framework_params': {'_save_artifacts': ['models', 'debug_as_files'], 'n_jobs': 1}, 'framework_version': 'stable', 'type': 'classification', 'name': 'Loan_Type', 'openml_task_id': 362234, 'test_server': False, 'fold': 9, 'metric': 'logloss', 'metrics': ['logloss', 'acc', 'balacc'], 'seed': 349662246, 'job_timeout_seconds': 1200, 'max_runtime_seconds': 600, 'cores': 8, 'max_mem_size_mb': 542, 'min_vol_size_mb': -1, 'input_dir': '/home/devcontainers/.cache/openml', 'output_dir': '/home/devcontainers/automlbenchmark/stable/autosklearn.medium.2m8c.local.20241202T130926', 'output_predictions_file': '/home/devcontainers/automlbenchmark/stable/autosklearn.medium.2m8c.local.20241202T130926/predictions/Loan_Type/9/predictions.csv', 'tag': None, 'command': 'runbenchmark.py autosklearn medium 2m8c -m local -p 1 -u ~/dev/null -o ./stable -Xmax_parallel_jobs=12 -Xaws.use_docker=False -Xaws.query_frequency_seconds=300', 'git_info': {'repo': 'https://github.com/openml/automlbenchmark.git', 'branch': 'master', 'commit': '500480923d8f85455958f3c5d620a98cbffb771f', 'tags': [], 'status': ['## master...origin/master [ahead 3, behind 5]', ' M resources/benchmarks/medium.yaml', ' M runstable.sh']}, 'measure_inference_time': False, 'ext': {}, 'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'type': 'multiclass', 'output_metadata_file': '/home/devcontainers/automlbenchmark/stable/autosklearn.medium.2m8c.local.20241202T130926/predictions/Loan_Type/9/metadata.json'})
PyOpenML cannot handle string when returning numpy arrays. Use dataset_format="dataframe".
Traceback (most recent call last):
File "/home/devcontainers/automlbenchmark/venv/lib/python3.9/site-packages/openml/datasets/dataset.py", line 629, in _convert_array_format
return np.asarray(data, dtype=np.float32)
File "/home/devcontainers/automlbenchmark/venv/lib/python3.9/site-packages/pandas/core/generic.py", line 2070, in array
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'IBRDB0050'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/devcontainers/automlbenchmark/amlb/benchmark.py", line 605, in run
meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
File "/home/devcontainers/automlbenchmark/frameworks/autosklearn/init.py", line 17, in run
X_enc=dataset.train.X_enc,
File "/home/devcontainers/automlbenchmark/amlb/utils/cache.py", line 77, in decorator
return cache(self, prop_name, prop_fn)
File "/home/devcontainers/automlbenchmark/amlb/utils/cache.py", line 35, in cache
value = fn(self)
File "/home/devcontainers/automlbenchmark/amlb/utils/process.py", line 744, in profiler
return fn(*args, **kwargs)
File "/home/devcontainers/automlbenchmark/amlb/data.py", line 159, in X_enc
return self.data_enc[:, predictors_ind]
File "/home/devcontainers/automlbenchmark/amlb/utils/cache.py", line 77, in decorator
return cache(self, prop_name, prop_fn)
File "/home/devcontainers/automlbenchmark/amlb/utils/cache.py", line 35, in cache
value = fn(self)
File "/home/devcontainers/automlbenchmark/amlb/utils/process.py", line 744, in profiler
return fn(*args, **kwargs)
File "/home/devcontainers/automlbenchmark/amlb/datasets/openml.py", line 275, in data_enc
return self._get_data('array')
File "/home/devcontainers/automlbenchmark/amlb/datasets/openml.py", line 279, in _get_data
self.dataset._load_data(fmt)
File "/home/devcontainers/automlbenchmark/amlb/datasets/openml.py", line 236, in _load_data
train, test = splitter.split()
File "/home/devcontainers/automlbenchmark/amlb/utils/process.py", line 744, in profiler
return fn(*args, **kwargs)
File "/home/devcontainers/automlbenchmark/amlb/datasets/openml.py", line 309, in split
X = self.ds._load_full_data('array')
File "/home/devcontainers/automlbenchmark/amlb/datasets/openml.py", line 241, in load_full_data
X, *
= self._oml_dataset.get_data(dataset_format=fmt)
File "/home/devcontainers/automlbenchmark/venv/lib/python3.9/site-packages/openml/datasets/dataset.py", line 732, in get_data
data = self._convert_array_format(data, dataset_format, attribute_names)
File "/home/devcontainers/automlbenchmark/venv/lib/python3.9/site-packages/openml/datasets/dataset.py", line 631, in _convert_array_format
raise PyOpenMLError(
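
For context, the error message above points at the fix inside openml-python itself: requesting the data as a pandas DataFrame instead of a numpy array. Below is a minimal sketch of the two code paths, using the task ID from this report (API calls as in openml-python 0.13+):

import openml

# Fetch the dataset behind the failing task.
dataset = openml.tasks.get_task(362234).get_dataset()

# This is the path amlb takes for encoded data, and what fails here:
# string values such as 'IBRDB0050' cannot be cast to float32.
# X, *_ = dataset.get_data(dataset_format="array")  # raises PyOpenMLError

# The dataframe path keeps string columns intact.
X, y, categorical, names = dataset.get_data(
    target=dataset.default_target_attribute, dataset_format="dataframe"
)
print(X.dtypes)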

@PGijsbers
Collaborator

Can you let me know the command you used, which Python version you have, and which dependencies are installed in your environment, as well as in the auto-sklearn virtual environment? If you are running locally, the auto-sklearn environment is located at frameworks/autosklearn/venv. Thanks!

@annawiewer
Author

annawiewer commented Dec 2, 2024


Thanks for the reply. Below you can find the specific configuration. In general, I would like to understand whether I need to modify or preconfigure the dataset on OpenML in a certain way so that every framework is able to run it (the target is already nominal). For example, I have uploaded several datasets, and OpenML task ID 362234 works for MLJAR Supervised but for neither Auto-sklearn nor LightAutoML. Since I am setting up an experiment, I just want to understand on which end I need to adjust something; this would be quite important for the discussion section of the results. Thanks for your support.


Command:
(venv) devcontainers@DESKTOP-OC2G953:~/automlbenchmark$ ./runstable.sh --mode=local

runstable.sh:

#!/usr/bin/env bash

FRAMEWORKS=(
autosklearn
#lightautoml
#h2oautoml
# tpot
# oboe
# autoweka
# hyperoptsklearn
# ranger
#sapientml
#TabPFN
)

BENCHMARKS=(
#example
# test
#validation
# small
medium
)

CONSTRAINTS=(
10m8c
)

MODE=(
local
)

mode='local'

usage() {
    echo "Usage: $0 framework_or_benchmark [-c|--constraint] [-m|--mode=<local|docker|aws>]" 1>&2;
}

POSITIONAL=()

for i in "$@"; do
    case $i in
        -h | --help)
            usage
            exit ;;
        -f=* | --framework=*)
            frameworks="${i#*=}"
            shift ;;
        -b=* | --benchmark=*)
            benchmarks="${i#*=}"
            shift ;;
        -c=* | --constraint=*)
            constraints="${i#*=}"
            shift ;;
        -m=* | --mode=*)
            mode="${i#*=}"
            shift ;;
        -p=* | --parallel=*)
            parallel="${i#*=}"
            shift ;;
        -*|--*=) # unsupported args
            usage
            exit 1 ;;
        *)
            POSITIONAL+=("$i")
            shift ;;
    esac
done

if [[ -z $frameworks ]]; then
  frameworks=${FRAMEWORKS[*]}
fi

if [[ -z $benchmarks ]]; then
  benchmarks=${BENCHMARKS[*]}
fi

if [[ -z $constraints ]]; then
  constraints=${CONSTRAINTS[*]}
fi

if [[ -z $parallel ]]; then
    if [[ $mode == "aws" ]]; then
        parallel=60
    else
        parallel=1
    fi
fi

extra_params="-u ~/dev/null -o ./stable -Xmax_parallel_jobs=12 -Xaws.use_docker=False -Xaws.query_frequency_seconds=300"
#extra_params="$extra_params -v /home/devcontainers/automlbenchmark/resources:/bench/resources -v /home/devcontainers/.cache/openml:/input"

# Run the benchmarks in parallel
for c in ${constraints[*]}; do
    for b in ${benchmarks[*]}; do
        for f in ${frameworks[*]}; do
            echo "Starting benchmark: python runbenchmark.py $f $b $c -m $mode -p $parallel $extra_params"
            python runbenchmark.py $f $b $c -m $mode -p $parallel $extra_params &
        done
    done
done

# Wait for all background processes to complete
wait
echo "All benchmarks completed."

Python version (from venv/pyvenv.cfg):
home = /usr/bin
include-system-site-packages = false
version = 3.9.20

(venv) devcontainers@DESKTOP-OC2G953:~/automlbenchmark$ pip list
Package             Version
------------------- ----------
boto3 1.26.98
botocore 1.29.98
certifi 2022.12.7
charset-normalizer 3.1.0
contourpy 1.3.0
cycler 0.12.1
filelock 3.12.0
fonttools 4.55.0
fsspec 2023.6.0
idna 3.4
importlib_resources 6.4.5
jmespath 1.0.1
joblib 1.2.0
kiwisolver 1.4.7
liac-arff 2.5.0
matplotlib 3.9.2
minio 7.1.13
numpy 1.24.2
openml 0.13.1
packaging 24.2
pandas 1.5.3
pillow 11.0.0
pip 24.3.1
psutil 5.9.4
pyarrow 11.0.0
pyparsing 3.2.0
python-dateutil 2.8.2
pytz 2022.7.1
requests 2.28.2
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.7
s3fs 0.4.2
s3transfer 0.6.0
scikit-learn 1.2.2
scipy 1.10.1
setuptools 75.5.0
six 1.16.0
threadpoolctl 3.1.0
urllib3 1.26.15
xmltodict 0.13.0
zipp 3.21.0

(venv) devcontainers@DESKTOP-OC2G953:~/automlbenchmark/frameworks/autosklearn$ pip list
Package             Version
------------------- ----------
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
auto-sklearn 0.15.0
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
ConfigSpace 0.4.21
Cython 3.0.11
dask 2024.8.0
distributed 2024.8.0
distro 1.9.0
emcee 3.1.6
fsspec 2024.10.0
idna 3.10
importlib_metadata 8.5.0
Jinja2 3.1.4
joblib 1.4.2
liac-arff 2.5.0
locket 1.0.0
MarkupSafe 3.0.2
minio 7.2.11
msgpack 1.1.0
numpy 1.24.2
openml 0.15.0
packaging 21.3
pandas 1.5.3
partd 1.4.2
pip 24.3.1
psutil 5.8.0
pyarrow 11.0.0
pycparser 2.22
pycryptodome 3.21.0
pynisher 0.6.4
pyparsing 3.2.0
pyrfr 0.8.3
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
requests 2.32.3
scikit-learn 0.24.2
scipy 1.13.1
setuptools 75.6.0
six 1.16.0
smac 1.2
sortedcontainers 2.4.0
tblib 3.0.0
threadpoolctl 3.5.0
toolz 1.0.0
tornado 6.4.1
tqdm 4.67.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
wheel 0.45.1
xmltodict 0.14.2
zict 3.0.0
zipp 3.21.0

@PGijsbers
Collaborator

PGijsbers commented Dec 3, 2024

It looks like you might also be calling runstable.sh with some additional arguments, since I believe task 362234 is not part of the medium benchmark, which is the only benchmark not commented out in the script. Nevertheless, I could replicate the error with python runbenchmark.py autosklearn openml/t/362234 test -m docker -f 0.

The issue is that the original OpenML dataset has text features, and the AutoML benchmark was never tested for that (the original suites have only numerical and categorical data; at the time, not all frameworks supported text data). We do want to support text features, however, so we might be able to dedicate some time to resolving this systematically.
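
For what it's worth, a quick way to check whether an OpenML dataset carries string (text) features is to inspect its feature metadata; a minimal sketch using openml-python, with the task ID from this report:

import openml

# Print every feature of the dataset behind task 362234 whose OpenML
# data type is 'string' (free text, rather than nominal or numeric).
dataset = openml.tasks.get_task(362234).get_dataset()
for feature in dataset.features.values():
    if feature.data_type == "string":
        print(feature.index, feature.name)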

In the meantime, I believe that removing the lines asking for the encoded data for auto-sklearn (i.e., lines 17, 18, 23, 24 in the __init__.py) should be a feasible workaround; see the sketch below. As far as I can tell, those lines are only needed for older versions of auto-sklearn; without them, the command completes successfully.
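
A hypothetical sketch of that workaround (only X_enc=dataset.train.X_enc on line 17 is confirmed by the traceback above; the other encoded-data arguments and the surrounding structure are assumptions, not the verbatim file):

# frameworks/autosklearn/__init__.py (sketch, not the actual file)
def run(dataset, config):
    data = dict(
        train=dict(
            X=dataset.train.X,
            y=dataset.train.y,
            # X_enc=dataset.train.X_enc,  # line 17: removed
            # y_enc=dataset.train.y_enc,  # line 18 (assumed): removed
        ),
        test=dict(
            X=dataset.test.X,
            y=dataset.test.y,
            # X_enc=dataset.test.X_enc,   # line 23 (assumed): removed
            # y_enc=dataset.test.y_enc,   # line 24 (assumed): removed
        ),
    )
    ...

Accessing .X_enc is what forces amlb to materialize the dataset as a numeric numpy array; that conversion is the call that fails on string features.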

@PGijsbers added the framework label (For issues with frameworks in the current benchmark) on Dec 3, 2024