Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backward/forward compatibility wrt v0.6.0 group structure/variable name change #606

Closed
leewujung opened this issue Mar 30, 2022 · 10 comments
Assignees
Milestone

Comments

@leewujung
Copy link
Member

leewujung commented Mar 30, 2022

Context

We are coming up against multiple planned breaking changes in v0.6.0. The changes include netCDF group structures as well as coordinates and variable names and attributes.

We know that many users, including both OOI and NCEI have converted some non-trivial volume of data into zarr or nc using echopype v0.5.x, so it is imperative that all processing can accommodate these already converted data.

In light of the group structure changes, we also want to ease the transition for ourselves such that we do not disturb the entire codebase significantly all at once (so that we can focus on convention compliance for v0.6.0).

Problems and Approaches

There are two types of changes we need to deal with:

  1. group structure changes: more involved
  2. variable name changes: more straightforward

Group structure changes

After discussions (#567) we concluded that a cleaner approach is for us to use DataTree functionality to allow accessing the data like a dictionary or like in the netcdf4 library, such as EchoData["Sonar/Beam_groupX"], because our current attribute setup would require EchoData.sonar.beam_group1 which can be quite involved to implement. The downside to this are:

  • DataTree could be not yet stable, though it seems pretty much on track now
  • we would need to change how data in different groups are accessed for all the downstream functions
  • if we go ahead and directly use open_tree on v0.5.x data, the groups will not be in the right place

The last one requires some more discussion, so I wonder if an intermediate step we can take in v0.6.0 is to avoid this (for now) by making our code forward compatible to the v0.6.0 data:
Basically, we would hook up:

  • EchoData.beam to root/Sonar/Beam_group1
  • EchoData.beam_power (if exists) to root/Sonar/Beam_group2
  • EchoData.sonar still to root/Sonar (no action)
  • EchoData.nmea still to root/Platform/NMEA (no action)
    This way all the downstream functions do not have to change in v0.6.0, and can be dealt with later once we figure out how to use DataTree to handle v0.5.x data.

Variable name changes

For these changes, I propose that instead of taking the forward compatibility route as above, here we actually change the variable names in the code once the correct groups are read into xarray datasets.

For example, we would:

  • read in range_bin --> immediately change that to range_sample --> change all the downstream processing that uses range_bin also to range_sample
  • read in frequency --> immediately change that to channel --> change all the downstream processing that uses range_bin also to range_sample

Decision?

I want to list out all the changes we need to make in v0.6.0 here, but that requires that we decide what approach to take for the group structure changes problem. So, pinging @emiliom @b-reyes @lsetiawan for inputs!

Once we arrive at a decision I'll add specific tasks to this issue.

@b-reyes
Copy link
Contributor

b-reyes commented Mar 30, 2022

Thank you @leewujung for putting this together! I have been working on this conversion problem, so I am happy to work on several of these items. I think that once we establish the sensor-specific files in #596, then we can proceed with making the code forward compatible. The following is the first step I would take (input is welcome) to address this issue. Note that this is primarily focused on changes to downstream items related to open_converted.

  • Find the version of echopype that created the zarr/nc file and use the sensor-specific files created in Modify SONAR-netCDF4 convention file #596 to open and read in the file into a EchoData object.
    • This will require changes to __load_file.
      • Replace self.group_map.items() with a sensor-specific dict obtained from Modify SONAR-netCDF4 convention file #596.
      • Replace lines 616-620 with a function that correctly assigns the group based on the sensor-specific file.
      • Add a depreciation warning/prompt to save to zarr/nc, if conversion is necessary
      • Apply change to variable names, if necessary

From my understanding, if we end up implementing datatree, we would only need to modify the above items slightly.

@emiliom
Copy link
Collaborator

emiliom commented Mar 30, 2022

Wow, thank you @leewujung !! This is really helpful.

Group structure changes

I think what you describe as forward compatibility is what we had agreed on. That is, to temporarily retain (at least in 0.6.0) the current EchoData group referencing (.beam, .sonar, etc) but map them to the 0.6.0 group paths (ie, correct SONAR-netCDF4 v1 paths).

When using open_converted to read a 0.5.x file (which would use open_datatree behind the scenes), the group paths and names would be "wrong" (ie, 0.5.x paths) vis a vis 0.6.x. As @b-reyes discussed above (thanks for the additional details, @b-reyes!), the group mappings we're discussing in #596 (and #598) are intended to address this challenge.

I think there are still challenges we need to work through, and let's discuss them at our meeting tomorrow. For example, all these group mappings and variable renamings (and other changes -- see below) have a consequence I hadn't thought about: open_converted applied on a 0.5.x file might not be doable with lazy loading alone.

There are two types of changes we need to deal with:

Actually, there's a third type that's related to "variable name changes" but a lot more consequential: changes to a dimension/coordinate that go beyond just a name change. Specifically:

@leewujung
Copy link
Member Author

all these group mappings and variable renamings (and other changes -- see below) have a consequence I hadn't thought about: open_converted applied on a 0.5.x file might not be doable with lazy loading alone.

I am not sure "what it takes" to rename the coordinate variables (i.e. does the renaming operation requires actually loading data into memory?), but this part to hook up the v0.6.0 groups and subgroups to our current flattened EchoData.GROUP as part of open_converted should be fine with lazy loading.

For the third type of changes -- Thanks! I agree this is a different type that should be listed to the main issue description! I'll add it later today.

  • Addition of the new beam dimension (Add beam dimension to Beam group #520). EK80 is the only exception, where the change really is just a simple name change to the existing quadrant dimension.

This is true, but I think the processing of the other sonar models can be achieve without much disturbance of the code by using squeeze to remove that extra dimension beam with dim=1.

Fundamental change to the meaning of the frequency dimension (#566). It's not just a simple renaming to channel. The frequency information will now be stored somewhere else, in the new frequency_nominal(channel) variable. Code that will need to be changed includes query patterns that relied on the frequency coordinate, and open_raw operations that populate the frequency information.

Agreed. I think this will be a tedious PR thought I don't see immediately major obstacles.

@emiliom
Copy link
Collaborator

emiliom commented Mar 30, 2022

but this part to hook up the v0.6.0 groups and subgroups to our current flattened EchoData.GROUP as part of open_converted should be fine with lazy loading.

100% correct. I'm glad you pointed it out, since it's easy (for me at least) to get lost in the details.

This is true, but I think the processing of the other sonar models can be achieve without much disturbance of the code by using squeeze to remove that extra dimension beam with dim=1.

Oh, that's brilliant!!! I hadn't thought of that. Hopefully it will pan out that way.

@b-reyes
Copy link
Contributor

b-reyes commented Mar 30, 2022

Actually, there's a third type that's related to "variable name changes" but a lot more consequential: changes to a dimension/coordinate that go beyond just a name change. Specifically:

Thank you for pointing this out @emiliom, I will be sure to keep this in mind. Perhaps I can lay the groundwork for this forward compatibility by just implementing the restructuring of the groups and changing range_bin to range_sample, then we can work together to see the best way to go about the change to channel.

@leewujung
Copy link
Member Author

just implementing the restructuring of the groups and changing range_bin to range_sample

That will be great! This will also let us know if renaming a coordinate requires loading the coordinate data into memory. But I think even if it is, the penalty may not be a problem for the size of data files we are talking about at the moment -- until we fix the problem with combine_echodata to produce a gigantic zarr file (since right now people run out of memory to combine too many files 😅, so that difficult scenario to have to rename the coordinate from a gigantic dataset actually does not happen -- one of the times where we have the blessing of missing functionalities...?).

@emiliom
Copy link
Collaborator

emiliom commented Mar 31, 2022

Perhaps I can lay the groundwork for this forward compatibility by just implementing the restructuring of the groups and changing range_bin to range_sample then we can work together to see the best way to go about the change to channel

That sounds great, @b-reyes ! Specially the range_bin > range_sample conversion (in open_converted, right?). For the restructuring of the groups, once we settle down on the form of the mapping dict (#596 (comment)), you can start working on it. But bear in mind that @lsetiawan said he'll work on the EchoData.grouplabel <> EchoData.getter["group-path"] mappings on Friday. Your work on open_converted to map groups from 0.5.x to 0.6.x will probably have to interact with Don's work; though I don't have a good feel for how much of a handshake there needs to be.

@leewujung
Copy link
Member Author

But bear in mind that @lsetiawan #567 (comment) he'll work on the EchoData.grouplabel <> EchoData.getter["group-path"] mappings on Friday. Your work on open_converted to map groups from 0.5.x to 0.6.x will probably have to interact with Don's work; though I don't have a good feel for how much of a handshake there needs to be.

I suggest that we talk this through and agree on who will do what in the call tomorrow morning. There are 2 steps to consider:

  1. open_tree can provide the mechanism for linking both the v0.5.x and v0.6.x files to the current flattened EchoData.GROUP structure without us changing much in the downstream computations (other than coordinate and variable name fixes).

  2. To actually enable the EchoData.grouplabel <> EchoData.getter["group-path"] mapping. Directly using open_tree wouldn't work, since EchoData["Sonar/Beam_groupX"] only works for v0.6.x and EchoData["Beam"] only works for v0.5.x, so something has to happen there.

@b-reyes
Copy link
Contributor

b-reyes commented Mar 31, 2022

@emiliom yes, once we settle #596 I will start working on this, I plan to follow the outline I created above. I will be sure to discuss this with @lsetiawan in the meeting today.

@leewujung
Copy link
Member Author

Repository owner moved this from In Progress to Done in Echopype May 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

3 participants