Implement DataTree for EchoData #611

Merged
merged 20 commits into from
Apr 12, 2022

Conversation

lsetiawan
Member

@lsetiawan lsetiawan commented Mar 31, 2022

This PR implements xarray-datatree for the underlying EchoData structure to ease access. This work is part of issues #567 and #606.

TODO

  • Make root accessible via echodata['Top-level'] or echodata['/'].
  • Make repr show the tree relationships
  • Make repr_html to have dropdowns for each tree
  • Add a way to access echopype software version
  • Add group paths listing with /
  • Tweak group texts (Don't include paths!)
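
A minimal sketch of the first and fifth TODO items, assuming a hypothetical `GroupTree` class that stands in for the tree held by EchoData. Plain dicts take the place of xarray datasets, and none of these names are echopype's actual API.

```python
class GroupTree:
    """Hypothetical stand-in for the group tree behind EchoData."""

    # Keys that should all resolve to the root group.
    ROOT_ALIASES = ("Top-level", "/", "")

    def __init__(self, groups):
        # groups maps a slash-separated path (no leading slash) to its data;
        # the root group is stored under the empty string.
        self._groups = dict(groups)

    def __getitem__(self, key):
        if key in self.ROOT_ALIASES:
            return self._groups[""]
        try:
            # Accept both "Sonar" and "/Sonar".
            return self._groups[key.strip("/")]
        except KeyError:
            raise KeyError(f"no group at path {key!r}") from None

    def group_paths(self):
        # List every group with "/" as the separator, root first.
        return ["/"] + ["/" + p for p in sorted(self._groups) if p]


tree = GroupTree({
    "": {"sonar_convention_name": "SONAR-netCDF4"},
    "Sonar": {"sonar_model": "ER60"},
    "Sonar/Beam_group1": {"beam_mode": "vertical"},
})
print(tree.group_paths())
```

Treating `'Top-level'` and `'/'` as aliases keeps the convention label usable without making it the only way to reach the root.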

@emiliom emiliom requested a review from b-reyes March 31, 2022 19:42
@codecov-commenter

codecov-commenter commented Mar 31, 2022

Codecov Report

Merging #611 (b67e241) into dev (ebb160b) will decrease coverage by 0.25%.
The diff coverage is 88.13%.

@@            Coverage Diff             @@
##              dev     #611      +/-   ##
==========================================
- Coverage   78.66%   78.41%   -0.26%     
==========================================
  Files          40       42       +2     
  Lines        3600     3743     +143     
==========================================
+ Hits         2832     2935     +103     
- Misses        768      808      +40     
Flag Coverage Δ
unittests 78.41% <88.13%> (-0.26%) ⬇️


Impacted Files Coverage Δ
echopype/echodata/echodata.py 77.05% <78.40%> (-10.23%) ⬇️
echopype/echodata/widgets/widgets.py 92.30% <92.30%> (ø)
echopype/convert/api.py 83.84% <100.00%> (+1.20%) ⬆️
echopype/echodata/api.py 83.33% <100.00%> (ø)
echopype/echodata/convention/utils.py 100.00% <100.00%> (ø)
echopype/echodata/widgets/utils.py 100.00% <100.00%> (ø)
echopype/convert/set_groups_ad2cp.py 98.90% <0.00%> (+0.01%) ⬆️
echopype/calibrate/api.py 83.33% <0.00%> (+2.08%) ⬆️
echopype/preprocess/api.py 89.79% <0.00%> (+2.61%) ⬆️


@lsetiawan lsetiawan marked this pull request as ready for review March 31, 2022 23:46
@lsetiawan lsetiawan requested review from emiliom and leewujung March 31, 2022 23:46
@lsetiawan
Member Author

Demo

Open Raw

from echopype import open_raw
echodata = open_raw('echopype/test_data/ek60/ncei-wcsd/Summer2017-D20170615-T190214.raw', sonar_model='EK60')

In [5]: print(repr(echodata))
EchoData: standardized raw data from Internal Memory
  > top: (Top-level) contains metadata about the SONAR-netCDF4 file format.
  > environment: (Environment) contains information relevant to acoustic propagation through water.
  > platform: (Platform) contains information about the platform on which the sonar is installed.
  > nmea: (Platform/NMEA) contains information specific to the NMEA protocol.
  > provenance: (Provenance) contains metadata about how the SONAR-netCDF4 version of the data were obtained.
  > sonar: (Sonar) contains specific metadata for the sonar system.
  > beam: (Sonar/Beam_group1) contains backscatter data (either complex samples or uncalibrated power samples) and other beam or channel-specific data, including split-beam angle data when they exist.
  > vendor: (Vendor specific) contains vendor-specific information about the sonar and the data.

In [6]: print(echodata)
EchoData: standardized raw data from Internal Memory
DataTree('Top-level', parent=None)
│   Dimensions:  ()
│   Data variables:
│       *empty*
│   Attributes:
│       conventions:                 CF-1.7, SONAR-netCDF4-1.0, ACDD-1.3
│       keywords:                    EK60
│       sonar_convention_authority:  ICES
│       sonar_convention_name:       SONAR-netCDF4
│       sonar_convention_version:    1.0
│       summary:                     
│       title:                       
│       date_created:                2017-06-15T19:02:14Z
│       survey_name:                 
├── DataTree('Environment')
│   Dimensions:                 (frequency: 3, ping_time: 19)
│   Coordinates:
│     * frequency               (frequency) float64 1.8e+04 3.8e+04 1.2e+05
│     * ping_time               (ping_time) datetime64[ns] 2017-06-15T19:02:14.20...
│   Data variables:
│       absorption_indicative   (frequency, ping_time) float64 0.002226 ... 0.04069
│       sound_speed_indicative  (frequency, ping_time) float64 1.507e+03 ... 1.50...
├── DataTree('Platform')
│   │   Dimensions:        (location_time: 72, frequency: 3, ping_time: 19)
│   │   Coordinates:
│   │     * location_time  (location_time) datetime64[ns] 2017-06-15T19:02:15.4450001...
│   │     * frequency      (frequency) float64 1.8e+04 3.8e+04 1.2e+05
│   │     * ping_time      (ping_time) datetime64[ns] 2017-06-15T19:02:14.206000128 ....
│   │   Data variables:
│   │       latitude       (location_time) float64 dask.array<chunksize=(72,), meta=np.ndarray>
│   │       longitude      (location_time) float64 dask.array<chunksize=(72,), meta=np.ndarray>
│   │       sentence_type  (location_time) <U3 dask.array<chunksize=(72,), meta=np.ndarray>
│   │       pitch          (frequency, ping_time) float64 dask.array<chunksize=(3, 19), meta=np.ndarray>
│   │       roll           (frequency, ping_time) float64 dask.array<chunksize=(3, 19), meta=np.ndarray>
│   │       heave          (frequency, ping_time) float64 dask.array<chunksize=(3, 19), meta=np.ndarray>
│   │       water_level    (frequency, ping_time) float64 dask.array<chunksize=(3, 19), meta=np.ndarray>
│   └── DataTree('NMEA')
│       Dimensions:        (location_time: 688)
│       Coordinates:
│         * location_time  (location_time) datetime64[ns] 2017-06-15T19:02:14.2059996...
│       Data variables:
│           NMEA_datagram  (location_time) <U73 '$SDVLW,2.084,N,2.084,N' ... '$INHDT,...
│       Attributes:
│           description:  All NMEA sensor datagrams
├── DataTree('Provenance')
│   Dimensions:  ()
│   Data variables:
│       *empty*
│   Attributes:
│       conversion_software_name:     echopype
│       conversion_software_version:  0.5.6.dev42+gebb160b.d20220331
│       conversion_time:              2022-03-31T23:48:28Z
│       src_filenames:                echopype/test_data/ek60/ncei-wcsd/Summer201...
│       duplicate_ping_times:         0
├── DataTree('Sonar')
│   │   Dimensions:           (beam_group: 1)
│   │   Dimensions without coordinates: beam_group
│   │   Data variables:
│   │       beam_group_name   (beam_group) <U11 'Beam_group1'
│   │       beam_group_descr  (beam_group) <U131 'contains backscatter power (uncalib...
│   │   Attributes:
│   │       sonar_manufacturer:      Simrad
│   │       sonar_model:             ER60
│   │       sonar_serial_number:     
│   │       sonar_software_name:     
│   │       sonar_software_version:  2.4.3
│   │       sonar_type:              echosounder
│   └── DataTree('Beam_group1')
│       Dimensions:                         (frequency: 3, ping_time: 19,
│                                            range_sample: 3888)
│       Coordinates:
│         * frequency                       (frequency) float64 1.8e+04 3.8e+04 1.2e+05
│         * ping_time                       (ping_time) datetime64[ns] 2017-06-15T19:...
│         * range_sample                    (range_sample) int64 0 1 2 ... 3886 3887
│       Data variables: (12/30)
│           channel_id                      (frequency) <U37 'GPT  18 kHz 009072058c8...
│           beam_type                       (frequency) int64 1 1 1
│           beamwidth_receive_alongship     (frequency) float64 10.3 6.8 7.3
│           beamwidth_receive_athwartship   (frequency) float64 10.3 6.8 7.2
│           beamwidth_transmit_alongship    (frequency) float64 10.3 6.8 7.3
│           beamwidth_transmit_athwartship  (frequency) float64 10.3 6.8 7.2
│           ...                              ...
│           data_type                       (frequency, ping_time) float64 3.0 ... 3.0
│           count                           (frequency, ping_time) float64 3.888e+03 ...
│           offset                          (frequency, ping_time) float64 0.0 ... 0.0
│           transmit_mode                   (frequency, ping_time) float64 0.0 ... 0.0
│           angle_athwartship               (frequency, ping_time, range_sample) float64 ...
│           angle_alongship                 (frequency, ping_time, range_sample) float64 ...
│       Attributes:
│           beam_mode:              vertical
│           conversion_equation_t:  type_3
└── DataTree('Vendor')
    Dimensions:           (frequency: 3, pulse_length_bin: 5)
    Coordinates:
      * frequency         (frequency) float64 1.8e+04 3.8e+04 1.2e+05
      * pulse_length_bin  (pulse_length_bin) int64 0 1 2 3 4
    Data variables:
        sa_correction     (frequency, pulse_length_bin) float64 0.0 -0.83 ... -0.34
        gain_correction   (frequency, pulse_length_bin) float64 20.3 23.35 ... 26.62
        pulse_length      (frequency, pulse_length_bin) float64 0.000512 ... 0.00...

Open Converted

from echopype import open_converted
echodata = open_converted('echopype/test_data/ek60/ncei-wcsd/Summer2017-D20170615-T190214.zarr')

# Note: without backward-compatibility handling, the current version misses the beam group!
In [18]: print(repr(echodata))
EchoData: standardized raw data from /home/lsetiawan/GitRepos/GitHub/echopype/echopype/test_data/ek60/ncei-wcsd/Summer2017-D20170615-T190214.zarr
  > top: (Top-level) contains metadata about the SONAR-netCDF4 file format.
  > environment: (Environment) contains information relevant to acoustic propagation through water.
  > platform: (Platform) contains information about the platform on which the sonar is installed.
  > nmea: (Platform/NMEA) contains information specific to the NMEA protocol.
  > provenance: (Provenance) contains metadata about how the SONAR-netCDF4 version of the data were obtained.
  > sonar: (Sonar) contains specific metadata for the sonar system.
  > vendor: (Vendor specific) contains vendor-specific information about the sonar and the data.

In [19]: print(echodata)
EchoData: standardized raw data from /home/lsetiawan/GitRepos/GitHub/echopype/echopype/test_data/ek60/ncei-wcsd/Summer2017-D20170615-T190214.zarr
DataTree('Top-level', parent=None)
│   Dimensions:  ()
│   Data variables:
│       *empty*
│   Attributes:
│       conventions:                 CF-1.7, SONAR-netCDF4-1.0, ACDD-1.3
│       date_created:                2017-06-15T19:02:14Z
│       keywords:                    EK60
│       sonar_convention_authority:  ICES
│       sonar_convention_name:       SONAR-netCDF4
│       sonar_convention_version:    1.0
│       summary:                     
│       survey_name:                 
│       title:                       
├── DataTree('Beam')
│   Dimensions:                         (frequency: 3, ping_time: 19,
│                                        range_bin: 3888)
│   Coordinates:
│     * frequency                       (frequency) float64 1.8e+04 3.8e+04 1.2e+05
│     * ping_time                       (ping_time) datetime64[ns] 2017-06-15T19:...
│     * range_bin                       (range_bin) int64 0 1 2 3 ... 3885 3886 3887
│   Data variables: (12/30)
│       angle_alongship                 (frequency, ping_time, range_bin) float64 ...
│       angle_athwartship               (frequency, ping_time, range_bin) float64 ...
│       angle_offset_alongship          (frequency) float64 ...
│       angle_offset_athwartship        (frequency) float64 ...
│       angle_sensitivity_alongship     (frequency) float64 ...
│       angle_sensitivity_athwartship   (frequency) float64 ...
│       ...                              ...
│       transducer_offset_y             (frequency) float64 ...
│       transducer_offset_z             (frequency) float64 ...
│       transmit_bandwidth              (frequency, ping_time) float64 ...
│       transmit_duration_nominal       (frequency, ping_time) float64 ...
│       transmit_mode                   (frequency, ping_time) float64 ...
│       transmit_power                  (frequency, ping_time) float64 ...
│   Attributes:
│       beam_mode:              vertical
│       conversion_equation_t:  type_3
├── DataTree('Environment')
│   Dimensions:                 (frequency: 3, ping_time: 19)
│   Coordinates:
│     * frequency               (frequency) float64 1.8e+04 3.8e+04 1.2e+05
│     * ping_time               (ping_time) datetime64[ns] 2017-06-15T19:02:14.20...
│   Data variables:
│       absorption_indicative   (frequency, ping_time) float64 ...
│       sound_speed_indicative  (frequency, ping_time) float64 ...
├── DataTree('Platform')
│   │   Dimensions:        (frequency: 3, ping_time: 19, location_time: 72)
│   │   Coordinates:
│   │     * frequency      (frequency) float64 1.8e+04 3.8e+04 1.2e+05
│   │     * location_time  (location_time) datetime64[ns] 2017-06-15T19:02:15.4450001...
│   │     * ping_time      (ping_time) datetime64[ns] 2017-06-15T19:02:14.206000128 ....
│   │   Data variables:
│   │       heave          (frequency, ping_time) float64 ...
│   │       latitude       (location_time) float64 ...
│   │       longitude      (location_time) float64 ...
│   │       pitch          (frequency, ping_time) float64 ...
│   │       roll           (frequency, ping_time) float64 ...
│   │       sentence_type  (location_time) <U3 ...
│   │       water_level    (frequency, ping_time) float64 ...
│   └── DataTree('NMEA')
│       Dimensions:        (location_time: 688)
│       Coordinates:
│         * location_time  (location_time) datetime64[ns] 2017-06-15T19:02:14.2059996...
│       Data variables:
│           NMEA_datagram  (location_time) <U73 ...
│       Attributes:
│           description:  All NMEA sensor datagrams
├── DataTree('Provenance')
│   Dimensions:  ()
│   Data variables:
│       *empty*
│   Attributes:
│       conversion_software_name:     echopype
│       conversion_software_version:  0.4.1.dev438+g7aa2cd0.d20210409
│       conversion_time:              2021-04-16T17:48:07Z
│       src_filenames:                ./echopype/test_data/ek60/ncei-wcsd/Summer2...
├── DataTree('Sonar')
│   Dimensions:  ()
│   Data variables:
│       *empty*
│   Attributes:
│       sonar_manufacturer:      Simrad
│       sonar_model:             ER60
│       sonar_serial_number:     
│       sonar_software_name:     
│       sonar_software_version:  2.4.3
│       sonar_type:              echosounder
└── DataTree('Vendor')
    Dimensions:           (frequency: 3, pulse_length_bin: 5)
    Coordinates:
      * frequency         (frequency) float64 1.8e+04 3.8e+04 1.2e+05
      * pulse_length_bin  (pulse_length_bin) int64 0 1 2 3 4
    Data variables:
        gain_correction   (frequency, pulse_length_bin) float64 ...
        pulse_length      (frequency, pulse_length_bin) float64 ...
        sa_correction     (frequency, pulse_length_bin) float64 ...

@lsetiawan
Member Author

From Slack conversations, access needs to change to ['/Groupname'] for subgroups of the root, aka 'Top-level'. There is an ongoing discussion about whether to access Top-level as echodata['Top-level'] or not.

@b-reyes asked:

I understand why you have it. From what I understand, that is not how netcdf python does it (please correct me if I am wrong). I believe that ed would give the root group if we were following netcdf access patterns. Is that possible and if so, is this better?

My current thoughts:

The thing is, "EchoData" is essentially a wrapper object for the SONAR-netCDF4 file. The problem with having EchoData (ed) as this root group is that we would have to put a wrapper around the xarray Dataset, since when you fetch these groups you get the actual underlying dataset. So in my mind it makes sense to keep this separation: call the actual groups with ['Group'] and leave EchoData as our own object that we can highly customize with various functions, etc.

When you retrieve a group such as ['/Sonar'], ideally I'd like the repr to have a header with the name of that group, its description, and any children, but I'm not sure how to do that since our return value is an xarray Dataset... I wonder if I can somehow put a wrapper around the ds repr 🤔
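
One way the "wrapper around the ds repr" idea could look, as a rough sketch only. `GroupView` is a hypothetical name, and a plain dict stands in for the xarray Dataset so the example runs without extra dependencies; the wrapper prepends a header and forwards every other attribute to the wrapped object.

```python
class GroupView:
    """Hypothetical wrapper that adds a group header to the wrapped object's repr."""

    def __init__(self, path, description, children, dataset):
        self.path = path
        self.description = description
        self.children = list(children)
        self.dataset = dataset

    def __getattr__(self, name):
        # Only called when normal lookup fails, so unknown attributes are
        # forwarded to the wrapped dataset. Guard against recursion if
        # `dataset` itself has not been set yet.
        if name == "dataset":
            raise AttributeError(name)
        return getattr(self.dataset, name)

    def __repr__(self):
        header = [f"Group: {self.path}", f"  {self.description}"]
        if self.children:
            header.append("  children: " + ", ".join(self.children))
        return "\n".join(header) + "\n" + repr(self.dataset)


view = GroupView(
    "/Sonar",
    "contains specific metadata for the sonar system.",
    ["Beam_group1"],
    {"sonar_model": "ER60"},  # stand-in for an xarray Dataset
)
print(view)
```

The trade-off is the one raised above: callers would receive a wrapper rather than the dataset itself, so anything that checks the exact type would need the `.dataset` attribute.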

@emiliom
Collaborator

emiliom commented Apr 1, 2022

Regarding netcdf4 group path references, including (especially!) the "root group":

First, a distinction. The encoding in 1.0.yml, the SONAR-netCDF4 convention, and our own discussions are dealing with two parallel but not identical entities: a label (string) we may use verbally to conveniently refer (not access) to a group vs the group path (string) used to actually access a group (and we can add a third: if the group path is "parent/child", the group name is "child"). In 1.0.yml, the group label is the group item's name attribute and the group path is the ep_group attribute. Here's a fragment from that file:

groups:
  top:
    name: Top-level
    description: contains metadata about the SONAR-netCDF4 file format.
    ep_group:
...

  beam_power:
    name: Sonar/Beam_group2
    description: >-
      contains backscatter power (uncalibrated) and other beam or channel-specific data,
      including split-beam angle data when they exist.
      Only exists if complex backscatter data they already in Sonar/Beam_group1
    ep_group: Sonar/Beam_group2
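
To make the label / path / name distinction concrete, here is a small sketch. It assumes the YAML fragment above has already been loaded into a dict, with the empty `ep_group` of the top-level entry read as `None`; the `describe` helper is hypothetical and not part of echopype.

```python
# Two entries mirroring the 1.0.yml fragment (descriptions omitted).
convention_groups = {
    "top": {"name": "Top-level", "ep_group": None},
    "beam_power": {"name": "Sonar/Beam_group2", "ep_group": "Sonar/Beam_group2"},
}


def describe(entry):
    """Return (label, path, name) for one group entry."""
    label = entry["name"]                    # what we call the group verbally
    path = "/" + (entry["ep_group"] or "")   # what we use to access it
    name = path.rsplit("/", 1)[-1]           # last path component, "" for the root
    return label, path, name


for key, entry in convention_groups.items():
    print(key, describe(entry))
```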

Second, the root group is a special group. It can have all the elements of a group (variables, dimensions, other groups), but its path is simply "/". There's no group name per se. When you open a netCDF4 file with the Python netcdf4 package using Dataset, what's returned is the root group. Its groups property lists all the groups under it, but not itself. From the Python netcdf4 package documentation:

So, at one level, the root group is just another group. At another, it's a very special group with no name other than "/" (but that's its path and not really a name). SONAR-netCDF4 describes it in a way that's completely parallel to all other groups. The conventions gives labels to all groups, b/c it's convenient, and the label for the root group is "Top-level".

As you can see in netcdf4 an explicit analogy is made between group paths and unix directory paths. But I'd say its implementation is a bit loose. In xarray open_dataset you can interchangeably use or omit the initial "/" when using the group parameter to specify a group. Same difference. You can get the root group in 3 ways: by omitting the group parameter, or by passing either an empty string "" or a slash "/". In the netcdf4 package, slashes are not a part of the group name, but of course they're part of the path string. Hence, SONAR-netCDF4's reference to, say, "/Sonar" as the name of a group is somewhat pedantic and maybe confusing. It'd be necessary only if a "Sonar" group could occur somewhere other than at the /Sonar path.
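
The equivalences described above can be captured by normalizing every spelling to one canonical form before lookup. This is a sketch with a hypothetical helper name, not something xarray or netcdf4 exposes; it also collapses repeated slashes, which is an extra assumption.

```python
def canonical_group_path(group=None):
    """Map None, "", "/", "Sonar" and "/Sonar" style inputs to one absolute path."""
    if group is None:
        return "/"
    # Drop empty components so leading, trailing and doubled slashes vanish.
    parts = [part for part in group.split("/") if part]
    return "/" + "/".join(parts)


# The three spellings of the root group, then two spellings of a subgroup.
for spelling in (None, "", "/", "Sonar", "/Sonar"):
    print(repr(spelling), "->", canonical_group_path(spelling))
```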

Whew.

The EchoData object follows the SONAR-netCDF4 convention of listing all groups equally, including the root group. The convention label for the root group is "Top-level". Check. Because the EchoData object is our own customized thing that exists for convenience, we get to define its features. The implicit decisions we've made in the past are that each group returns an xarray dataset. xarray datasets don't know anything about groups. That's ok, we choose to stick with xarray datasets for their usability. So, ed["Top-level"] is perfectly fine as a group like any other which returns an xarray dataset and doesn't know anything about group hierarchies. ed["Sonar"] is also an xarray dataset that doesn't know the netcdf4 Sonar group has Beam_groupX children. The group hierarchy will only be found in the EchoData getters and the group_paths attribute @lsetiawan has created.
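
Since the datasets themselves carry no hierarchy, the parent/child relationships have to be derived from the flat list of paths. A minimal sketch, assuming the list holds absolute paths like the ones in the demo output; `children_by_parent` is a hypothetical helper, not the actual `group_paths` implementation.

```python
def children_by_parent(paths):
    """Build {path: [child names]} from a flat list of absolute group paths."""
    tree = {}
    for path in paths:
        tree.setdefault(path, [])
        if path == "/":
            continue  # the root has no parent
        parent, name = path.rsplit("/", 1)
        # A first-level group such as "/Sonar" splits to an empty parent: the root.
        tree.setdefault(parent or "/", []).append(name)
    return tree


paths = ["/", "/Environment", "/Platform", "/Platform/NMEA", "/Sonar", "/Sonar/Beam_group1"]
hierarchy = children_by_parent(paths)
print(hierarchy["/Sonar"])
```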

Could this arrangement cause confusion? Sure. But does it have its own self-consistency and clear motivations and benefits? Absolutely.

One possible change we could make, to be more explicit, is to set the top-level ep_group to "/" rather than None, and for all other groups make sure they start with "/". It's not necessary, but it might add clarity. If we make this change, we'll have to also change other modules and tests.
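
A sketch of what that change could amount to, again on a dict stand-in for the convention file. `with_explicit_paths` is a hypothetical helper; listing the entries whose `ep_group` changed gives a rough idea of which lookups in other modules and tests would need updating.

```python
def with_explicit_paths(groups):
    """Return a copy where every ep_group is an absolute path ("/" for the root)."""
    updated = {}
    for key, entry in groups.items():
        raw = entry.get("ep_group") or ""  # None (top-level) becomes ""
        updated[key] = {**entry, "ep_group": "/" + raw.lstrip("/")}
    return updated


original = {
    "top": {"name": "Top-level", "ep_group": None},
    "beam_power": {"name": "Sonar/Beam_group2", "ep_group": "Sonar/Beam_group2"},
}
explicit = with_explicit_paths(original)

# Every entry whose stored path differs after the change.
changed = [k for k in original if original[k]["ep_group"] != explicit[k]["ep_group"]]
print(changed)
```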

@lsetiawan
Member Author

Demo

Below is the demo of the echodata representations

repr or str

Screenshot from 2022-04-11 11-29-32

html repr

Screenshot from 2022-04-11 11-29-20

@emiliom
Collaborator

emiliom commented Apr 11, 2022

That's awesome, @lsetiawan! Both reprs look good. My only comment is about indentation, especially for the Beam_groupX groups. For both the text and HTML reprs, b/c the group description is long the text wraps into a new line. But the wrapped line is not indented, so it looks awkward and breaks the visual arrangement a bit. For the HTML repr it's a small, minor effect, b/c there's still an overall indentation. For the text repr it's a bigger impact.

@lsetiawan
Member Author

lsetiawan commented Apr 11, 2022

it looks awkward and breaks the visual arrangement a bit.

Yea... that's because of my small screen... I'm not sure how to fix that. It's the browser/window auto line wrap. If you or Brandon have any suggestions, please let me know. I'm at a loss on that one 😅

It almost seems like somehow the code needs to figure out the window size in realtime and create line breaks in the text to make it work.
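
For the text repr, one possible approach is to query the terminal width when the repr is built and wrap with a hanging indent using the standard library. This is a sketch with a hypothetical `wrap_group_line` helper; it would not help in a browser, where the wrapping is done by the page rather than by Python.

```python
import shutil
import textwrap


def wrap_group_line(name, label, description, width=None):
    """Wrap one group line so continuation lines align under the description."""
    if width is None:
        # Falls back to 80 columns when the size cannot be determined.
        width = shutil.get_terminal_size(fallback=(80, 24)).columns
    prefix = f"  > {name}: ({label}) "
    # Keep some room for text even when the window is narrower than the prefix.
    width = max(width, len(prefix) + 20)
    return textwrap.fill(
        description,
        width=width,
        initial_indent=prefix,
        subsequent_indent=" " * len(prefix),
    )


wrapped = wrap_group_line(
    "beam",
    "Sonar/Beam_group1",
    "contains backscatter data (either complex samples or uncalibrated power "
    "samples) and other beam or channel-specific data, including split-beam "
    "angle data when they exist.",
    width=60,
)
print(wrapped)
```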

@emiliom
Collaborator

emiliom commented Apr 11, 2022

Yea... that's because of my small screen...

Well, the small screen exposes the wrapping issue. But the wrapping issue is there.

Looking at the example of the current (0.5.x) text repr you included earlier, the long text doesn't wrap; it just extends beyond the size of the window. Is that just the default markdown block behavior vs the behavior in a terminal?

One alternative is to clip the line at a certain maximum length. That's done in the DataTree text repr (you also included an example above last week), where the line gets clipped and a "..." string is added. Anyway, I'm not saying I think that's a better option. Just an alternative to consider.
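
The clipping alternative is simple to sketch. The helper name and the default width of 80 are assumptions for illustration, not what DataTree or echopype actually use.

```python
def clip_line(line, max_width=80, marker="..."):
    """Clip a line to max_width characters, ending with marker when clipped."""
    if max_width <= len(marker):
        raise ValueError("max_width must be larger than the marker")
    if len(line) <= max_width:
        return line
    # Reserve room for the marker and drop any trailing space before it.
    return line[: max_width - len(marker)].rstrip() + marker


long_line = (
    "  > beam: (Sonar/Beam_group1) contains backscatter data (either complex "
    "samples or uncalibrated power samples) and other beam or channel-specific data"
)
print(clip_line(long_line, max_width=72))
```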

@lsetiawan
Member Author

I'll take a look at the css and see if I can make adjustments there. I think the ... is good.

Is that just the default markdown block behavior vs the behavior in a terminal?

That example is the default behaviour of the markdown block. If you try this on a terminal, it will wrap.

@emiliom
Collaborator

emiliom commented Apr 11, 2022

I'll take a look at the css and see if I can make adjustments there. I think the ... is good.

Maybe wait to see what @b-reyes and @leewujung think? I think I prefer the ... behavior, but I could go either way.

@leewujung
Member

I don't think the lack of indent is a problem in the HTML repr since the arrowhead is clear on what the expected behavior is. I agree this is a bigger problem for the text repr, but I think terminal in general has that behavior and it is expected by the user. Going with ... is fine with me also.

I would advocate for getting this merged and shelving the text repr "problem" to a separate issue that could potentially be fixed in the future.

@b-reyes
Contributor

b-reyes commented Apr 11, 2022

@lsetiawan what happens if you throw a newline in the string?

@emiliom
Collaborator

emiliom commented Apr 11, 2022

I would advocate for getting this merged and shelving the text repr "problem" to a separate issue that could potentially be fixed in the future.

That would be fine with me, too

@lsetiawan
Member Author

I would advocate for getting this merged and shelving the text repr "problem" to a separate issue that could potentially be fixed in the future.

Yea. I think this will take more thinking in terms of the repr behavior. And probably better to put it to a separate PR since this PR is getting really big.

@lsetiawan
Member Author

Yea... looks like even default xarray and datatree repr wraps funky in a small enough screen.
Screenshot from 2022-04-11 14-18-13

@emiliom
Collaborator

emiliom commented Apr 11, 2022

I just realized that when I generated a screenshot of the spacing between groups in the HTML repr just now, I ended up running my set of "tests" on this PR! That's because I ran the notebook where I use the datatree accessors in different ways. SO, I can say that this PR is good to go!

Collaborator

@emiliom emiliom left a comment


Woo-hoo!!

@lsetiawan lsetiawan added this to the 0.6.0 milestone Apr 11, 2022
@lsetiawan lsetiawan self-assigned this Apr 11, 2022
Contributor

@b-reyes b-reyes left a comment


@lsetiawan thank you for all of your great work Don!

Two items that I think need to be reviewed:

  • There are a couple of places that are trying to import from echopype. However, these references seem to be incorrect and will attempt to import using whatever echopype is in your environment. Please see my suggested changes.

  • In a past meeting, I think we talked about changing the name group_paths to group_names or something like that.

@lsetiawan
Member Author

I think we are good to go now. Once I have @b-reyes's seal of approval, I will merge 😄

@lsetiawan lsetiawan merged commit 5033dc7 into OSOceanAcoustics:dev Apr 12, 2022
@lsetiawan lsetiawan deleted the datatree branch April 12, 2022 00:05
Successfully merging this pull request may close these issues.

💡 Idea: Potentially update echodata object repr to using datatree
6 participants