
Accessing non-scalar, non-innate features in RTDC_Hierarchy objects leads to Error (#165)

Closed · maxschloegel opened this issue Feb 4, 2022 · 7 comments · Fixed by #169

maxschloegel (Contributor) commented Feb 4, 2022

Registering a non-scalar feature (i.e. a temporary feature or a plugin feature) and then accessing it on an RTDC_Hierarchy object leads to *** TypeError: Boolean indexing array has incompatible shape.

This seems to be due to the indexing behaviour of RTDC_Hierarchy instances in their __getitem__() method: the filter is applied as a boolean mask, and boolean indexing of higher-dimensional np.ndarrays follows more complex rules than in the scalar case.
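
To illustrate with a minimal, self-contained sketch (the 4x4 "images" and the filter mask below are made up, not dclab data): plain numpy accepts a 1-D boolean mask along the first axis of an N-D array, whereas an HDF5 dataset, as returned for non-scalar features, rejects that kind of selection with exactly this TypeError. Converting the mask to integer indices avoids the problem:

```python
import numpy as np

# Toy stand-in for a non-scalar feature: 5 events, each a 4x4 "image"
# (shapes and values are made up for illustration).
images = np.arange(5 * 4 * 4).reshape(5, 4, 4)
mask = np.array([True, False, True, False, True])  # hierarchy filter

# Plain numpy accepts a 1-D boolean mask along the first axis ...
filtered_bool = images[mask]

# ... but an h5py dataset rejects this selection on N-D data with
# "TypeError: Boolean indexing array has incompatible shape".
# Converting the mask to integer indices side-steps the restriction:
idxs = np.where(mask)[0]            # array([0, 2, 4])
filtered_idx = images[idxs]

assert np.array_equal(filtered_bool, filtered_idx)
print(filtered_idx.shape)           # (3, 4, 4)
```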

maxschloegel (Contributor, Author) commented

One option to solve this issue is to replace the boolean mask used in the __getitem__() method with an array of integer indices obtained via np.where(mask).

This can be done by changing:

            item = self.hparent[key]
            return item[self.hparent.filter.all]

to

            item = self.hparent[key]
            idxs = np.where(self.hparent.filter.all)
            return item[idxs]

The "only" downside is that the non-scalar features are then no longer returned as HDF5 datasets (unlike the other non-scalar features) but as np.ndarrays, which means the filtered array is loaded completely into memory.
Since this is an edge case and I will only need it once, for exporting hierarchy datasets as HDF5 datasets (where the hierarchy datasets are not very large), this might not be an issue; on the other hand, it is not a 100% clean solution.
What do you think, @paulmueller?

maxschloegel (Contributor, Author) commented

While running the tests, I had the feeling that they took a bit longer than usual. I will time the tests for the original and the new code and compare the results. I don't want to introduce inefficient code here.

paulmueller (Member) commented Feb 7, 2022

Another option would be to introduce a wrapper class that returns a lazy object which can then be indexed via __getitem__ - similar to how it should work with image and mask.

EDIT: I think you would want to (a) generalize the ChildImage, ChildImageBG and ChildMask classes into one class (pass the feature name to __init__) and (b) use the plugin feature's "feature shape" metadata (it should not contain NaNs; otherwise raise a NotImplementedError) to determine whether we have to return N-dimensional data.

I did this for data export like so: https://github.com/ZELLMECHANIK-DRESDEN/dclab/blob/0.39.14/dclab/rtdc_dataset/writer.py#L180-L207

maxschloegel (Contributor, Author) commented

OK, I implemented a generalized ChildNonScalar class that covers non-scalar PluginFeatures, TemporaryFeatures, and the innate features "image", "image_bg" and "mask".
Since the issue is not specific to PluginFeatures but also occurred with TemporaryFeatures, I wanted to use the shape or ndim of ch.hparent[key]. However, that broke the TDMS tests (specifically for ImageColumn: in one test the TDMS dataset did not contain the feature "image", so the "image" list was empty and its shape could not be computed). We can talk about that at some point.
Due to these problems, I now test it slightly differently:

def __getitem__(self, key):
    # contour, image, and traces are added automatically
    # to `self._events` in `self.apply_filter`.
    if key in self._events:
        return self._events[key]
    elif key in self.features_scalar or key in dfn.FEATURES_NON_SCALAR:
        item = self.hparent[key]
        return item[self.hparent.filter.all]
    else:
        self._events.setdefault(key, ChildNonScalar(self, key))
        return self[key]

Here, or key in dfn.FEATURES_NON_SCALAR was included to deal with the TDMS problem.

I am not 100% happy with this solution, because I basically had to add or key in dfn.FEATURES_NON_SCALAR just for the TDMS problem, but tracking down the root cause of that issue felt like quite a deep rabbit hole.

maxschloegel (Contributor, Author) commented

For completeness' sake, here is the implementation of ChildNonScalar:

class ChildNonScalar(ChildBase):
    def __init__(self, child, key):
        super(ChildNonScalar, self).__init__(child)
        self.key = key

    def __getitem__(self, idx):
        # map the index within the filtered child dataset to the
        # corresponding index in the parent dataset
        pidx = map_indices_child2parent(child=self.child,
                                        child_indices=idx)
        hp = self.child.hparent
        return hp[self.key][pidx]

    @property
    def shape(self):
        # number of events after filtering plus the per-event shape
        hp = self.child.hparent
        return tuple([len(self)] + list(hp[self.key][0].shape))

Other suggestions for the name are more than welcome!
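
For anyone reading along, the lazy-view idea can be sketched standalone like this (LazyNDView and the simplified map_indices_child2parent below are illustrative stand-ins, not dclab API; the real function resolves indices via the parent's filter object):

```python
import numpy as np

def map_indices_child2parent(mask, child_indices):
    # Simplified stand-in: positions where the mask is True are the
    # parent indices of the child's events, in order.
    return np.where(mask)[0][child_indices]

class LazyNDView:
    """Minimal sketch of the ChildNonScalar idea: index lazily into
    the parent data instead of copying the filtered array into memory."""

    def __init__(self, parent_data, mask):
        self.parent_data = parent_data  # could be an HDF5 dataset
        self.mask = mask

    def __len__(self):
        return int(np.sum(self.mask))

    def __getitem__(self, idx):
        pidx = map_indices_child2parent(self.mask, idx)
        return self.parent_data[pidx]

    @property
    def shape(self):
        return tuple([len(self)] + list(self.parent_data[0].shape))

# Usage with made-up data: 5 events, each a 4x4 "image".
images = np.arange(5 * 4 * 4).reshape(5, 4, 4)
mask = np.array([True, False, True, False, True])
view = LazyNDView(images, mask)
print(view.shape)  # (3, 4, 4)
assert np.array_equal(view[1], images[2])
```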

paulmueller (Member) commented Feb 8, 2022

Maybe ChildNDArray (to set it apart from ChildContour and ChildTrace) would be a better name.

The tdms thing looks like a bug and should probably be handled in a separate new issue. It might make sense to extend that not-100%-clean case you added by checking that the root parent of the hierarchy child is a tdms dataset and linking to the new issue (until it is fixed).

Regarding temporary features, maybe the user should specify the dimension here as well. There should be a general consensus on how to address the dimensionality of features in dclab. Maybe we will have to make sure that either (A) users are forced to specify the number of dimensions or (B) dclab gets smarter in figuring out the dimensionality. It is probably smart to turn this into another new issue. (dclab became smarter when I fixed #166, at least when it comes to data export to HDF5).
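
A sketch of what option (A) could look like (ndim_from_feature_shape and the "feature shape" tuple passed to it are hypothetical, not existing dclab API):

```python
import math

def ndim_from_feature_shape(feature_shape):
    """Derive the dimensionality of a feature from its "feature shape"
    metadata, e.g. (80, 250) for an image feature. Any NaN entry means
    the per-event shape is unknown, in which case we refuse to guess."""
    if feature_shape is None:
        raise NotImplementedError("feature shape metadata missing")
    if any(isinstance(s, float) and math.isnan(s) for s in feature_shape):
        raise NotImplementedError("feature shape contains NaNs")
    # +1 accounts for the event axis of the full dataset
    return len(feature_shape) + 1

print(ndim_from_feature_shape((80, 250)))  # 3
```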

maxschloegel (Contributor, Author) commented

ChildNDArray sounds good, I will use it instead.
The tdms bug was addressed in PR #167.

Regarding the registration of features happening in two different places (apply_filter() and __getitem__()), I created another issue, #168. For now I will implement the changes needed to resolve the current issue; #168 addresses improving the code's cleanliness.
