Closed
Description
This issue is called out in one of the commits in PR #117
The second issue is specific to map():
ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
It appears that in the datasets library, only in the case of num_proc>1,
when we hit the "error converting dtype" case and set the column
to None, the column is still considered a string column rather
than the new type.
This second issue deserves further investigation and may require
a fix to the datasets library.
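One way to reason about the report is with a toy model that does not call the datasets library at all. The sketch below assumes, without confirmation from the library source, that each worker infers the output type from its own shard and that a shard whose converted values are all None keeps the input column type. The names `convert`, `infer_shard_type`, and `map_in_shards` are hypothetical and exist only for this illustration.

```python
# Toy model only: nothing here calls the datasets library. It encodes the
# ASSUMPTION that every shard infers its own output type, and that a shard
# whose converted values are all None keeps the input column type.

def convert(value, dtype):
    """Mirror the issue's convert_column: cast, or use None on ValueError."""
    try:
        return dtype(value)
    except ValueError:
        return None


def infer_shard_type(values, input_type):
    """Assumed inference rule for one shard (hypothetical, not library code)."""
    non_null = {type(v).__name__ for v in values if v is not None}
    if not non_null:
        # Nothing to infer from, so the original column type is kept.
        return input_type
    # This toy model only handles shards with a single non-null type.
    assert len(non_null) == 1
    return non_null.pop()


def map_in_shards(values, dtype, num_proc):
    """Split values into contiguous shards and infer a type per shard."""
    size = -(-len(values) // num_proc)  # ceiling division
    shards = [values[i:i + size] for i in range(0, len(values), size)]
    return [
        infer_shard_type([convert(v, dtype) for v in shard], "str")
        for shard in shards
    ]


scores = ["1.0", "2.5", "n/a", "bad"]
single = map_in_shards(scores, float, num_proc=1)
multi = map_in_shards(scores, float, num_proc=2)
print(single)
print(multi)
```

Under these assumptions a single process sees one consistent type, while two processes can disagree when every value in one shard fails to convert. Whether datasets actually behaves this way is exactly what still needs to be verified.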
The related code in filterblock.py as of that PR is:
def _map_dtype(samples, column, dtype, num_proc=1):
    def convert_column(sample):
        try:
            sample[column] = dtype(sample[column])
        except ValueError as e:
            logger.error(
                "Error converting dtype: %s, filling with None to be filtered later", e
            )
            sample[column] = None
        return sample

    # FIXME: it appears multiprocessing map has issues with
    # None columns. If we pass num_proc>1 here and the error
    # case is triggered above, we get:
    # ValueError: The features can't be aligned ...
    # because the column is still considered a string not
    # the new dtype.
    num_proc = 1
    return samples.map(convert_column, num_proc=num_proc)

We need to investigate this error more deeply to figure out the best fix.
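As a starting point for that investigation, the per-sample conversion can be exercised on plain dictionaries, independent of datasets and multiprocessing. This is a minimal sketch: `make_converter` is a hypothetical wrapper that is not part of filterblock.py, and the `score` column with a `float` target is taken from the error message above.

```python
import logging

logger = logging.getLogger(__name__)


def make_converter(column, dtype):
    """Hypothetical helper: builds the same per-sample function as _map_dtype."""

    def convert_column(sample):
        try:
            sample[column] = dtype(sample[column])
        except ValueError as e:
            logger.error(
                "Error converting dtype: %s, filling with None to be filtered later", e
            )
            sample[column] = None
        return sample

    return convert_column


to_float = make_converter("score", float)

# A parseable string is converted to the target type.
good = to_float({"score": "4.0"})

# An unparseable string raises ValueError inside float() and becomes None.
bad = to_float({"score": "not a number"})

# float(None) raises TypeError, which the except clause does not catch,
# so a column that already contains None would propagate the exception.
try:
    to_float({"score": None})
    none_raises_type_error = False
except TypeError:
    none_raises_type_error = True
```

The conversion logic itself behaves as intended on single samples, which suggests the alignment error arises later, when the mapped results are combined.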