Skip to content

Investigate Dataset.map() multiprocessing failure #123

@russellb

Description

@russellb

This issue is called out in one of the commits in PR #117

The second issue is specific to map():

ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").

It appears the the datasets, only in the case of num_proc>1,
when we hit the "error converting dtype" case and set the column
to None, it ends up being still considered a string column rather
than the new type.

This second issue deserves further investigation and may require
a fix to the datasets library.

The related code in filterblock.py as of that PR is:

def _map_dtype(samples, column, dtype, num_proc=1):
    def convert_column(sample):
        try:
            sample[column] = dtype(sample[column])
        except ValueError as e:
            logger.error(
                "Error converting dtype: %s, filling with None to be filtered later", e
            )
            sample[column] = None
        return sample

    # FIXME: it appears multiprocessing map has issues with
    # None columns. If we pass num_proc>1 here and the error
    # case is triggered above, we get:
    #   ValueError: The features can't be aligned ...
    # because the column is still considered a string not
    # the new dtype.
    num_proc = 1

    return samples.map(convert_column, num_proc=num_proc)

We need to investigate this error more deeply to figure out the best fix

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions