Investigate Dataset.map() multiprocessing failure

This issue is called out in one of the commits in PR #117:

> The second issue is specific to map():
> 
> ```
> ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").
> ```
> 
> It appears that in the datasets library, only in the case of num_proc>1,
> when we hit the "error converting dtype" case and set the column
> to None, it ends up being still considered a string column rather
> than the new type.
> 
> This second issue deserves further investigation and may require
> a fix to the datasets library.

The related code in filterblock.py as of that PR is:

```python
def _map_dtype(samples, column, dtype, num_proc=1):
    def convert_column(sample):
        try:
            sample[column] = dtype(sample[column])
        except ValueError as e:
            logger.error(
                "Error converting dtype: %s, filling with None to be filtered later", e
            )
            sample[column] = None
        return sample

    # FIXME: it appears multiprocessing map has issues with
    # None columns. If we pass num_proc>1 here and the error
    # case is triggered above, we get:
    #   ValueError: The features can't be aligned ...
    # because the column is still considered a string not
    # the new dtype.
    num_proc = 1

    return samples.map(convert_column, num_proc=num_proc)
```

We need to investigate this error more deeply to figure out the best fix.

Investigate Dataset.map() multiprocessing failure #123

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Investigate Dataset.map() multiprocessing failure #123

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions