-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datatree prototyping #360
base: main
Are you sure you want to change the base?
Datatree prototyping #360
Conversation
|
def with_url(self, url: str) -> DatatreeBuilder: | ||
""" Sets the input url """ | ||
self._url = url.rstrip(os.path.sep) | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside, but wouldn't DatatreeBuilder(url=...)
be more straightforward than DatatreeBuilder().with_url(...)
? This seems like we're tempting fate to have somebody go
databuilder1 = DatatreeBuilder().with_url(url1)
databuilder2 = databuilder1.with_url(url2)
Contrived, I know, just wondering why you did it this way specifically...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a builder design pattern chosen because of the multiplicity of options for configuring the Datatree. There are variants where the required arguments (and the url is certainly requireD) are part of the constructor so your suggestion is fairly standard.
Also to be clear, the prototype exists to guide the Working Group's decision making and smoke test any obvious representational flaws (primarily with inherited coordinates, but there may be others).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't expect the builder to be necessary once the WG has settled on a representation.
2.0, | ||
2.5, | ||
3.0 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You would have been forgiven to just pick your favourite one ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks :-) This does however allow us to explore multiple options and identify possible pitfalls in each one.
f"{self._destination_url} does not have a .zarr extension " | ||
f"and will not be automatically recognised by xarray as a zarr store.\n" | ||
f"The engine will need to be explicitly specified via " | ||
f"xarray.open_datatree({self._destination_url}, engine='zarr')") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe .ps
is currently used? No strong opinion, though the collision with PostScript feels strange to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... also might want to move this to with_destination
, maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The destination url can be named whatever the user wishes, but the warning is there to give the user a hint as to why a plain xarray.open_datatree(url)
will fail (The zarr backend expects a .zarr
extension at the zarr root group).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... also might want to move this to
with_destination
, maybe?
Certainly possible, we could change this if the Builder was necessary going forward. I suspect it won't be once we settle on an option. The prototype exists to explore the design space.
dest = os.path.join(destination_root, partition) | ||
elif dataset == "antenna_xds": | ||
# The antenna dataset is a child of the tree root | ||
# For prototyping purposes, we only write it once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, would normally need to check equivalence. Honestly not sure myself whether it's worth the extra effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can leave a note here that if we choose this option, equivalence tests are possible with ds1.identical(ds2)
where ds1
and ds2
are xarray.Dataset
variables. This should also work with xarray.Datatree
but checking the entire tree would IMO be overkill here.
pass | ||
|
||
|
||
@xarray.register_datatree_accessor("msa") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, would at least drop the "a". This needs to be memorable - alternative might be msv4
? Though not sure what we will do with image or calibration datamodels then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
msa
is an acronym for Measurement Set Accessor
which signals its purpose fairly clearly IMO. To me ms
or msv4
impllies that the Measurement Set is an attribute of the main Datatree structure, when actually the Measurement Set is equivalent to the Datatee.
Regarding image
and calibration
data models, it should be possible to create separate accessors for them. After, having written the previous paragraph, dt.cala
and dt.imga
seem aesthetically unappealing to me => dt.cal
and dt.{img,image}
are nicer IMO.
maybe dt.ms
is a good compromise for visibility data. Other ideas:
dt.msx
dt.imgx
dt.calx
which phonetically might imply the accessor idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I's effectively a namespace, so I'm still somewhat partial to xrd
. msx
also has a nice ring to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I's effectively a namespace, so I'm still somewhat partial to
xrd
.msx
also has a nice ring to it.
I would argue that xrd
is an implementation of the spec, while ms{v4}
is the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the example details a geo
accessor (presumably for geosciences) how about using a radio astronomy related acronoym:
xra
:xarray radio astronomy
rax
:radio astronomy xarray/axxessor
ra
may be to close to right ascension
|
||
def __init__(self, datatree: DataTree): | ||
self._dt = datatree | ||
root = self._dt.root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering whether this would also be the place for a schema check... Could see both sides: I'd be nice if you could just go ds.msv4
, and you immediately had some likely assumptions checked. However, might become bothersome if we'd end up artificially gating something behind it. We'd likely want to provide some sort of "skip schema check" environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that this is called on every Datatree node so their might be performance implications. Another place to do the checks might be on the accessor. datatree.ms.validate()
for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the accessor gets created lazily - so the check would "only" happen the first time you use the accessor on any particular datatree node. Which would seem reasonable to me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a quick check and it seems that it creates an accessor for each node. Filtering doesn't not seem to create new nodes (or accessors) but admittedly I haven't tried anything complicated re: filtering.
$ python src/xradio/datatree/dt_convert.py ~/data/VLBA_TL016B_split
_lsrk.vis.zarr/
Datatree Build Strategy
* Datatree proposal option:'1.0'
* Data in '/home/simon/data/VLBA_TL016B_split_lsrk.vis.zarr' will be copied to
'/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr'
* '/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr' will be overwritten
* Metadata will be consolidated at the Datatree root
dt = xarray.open_datatree("/home/simon/data/VLBA_TL016B_split_lsrk.vis_datatree.zarr", engine="zarr")
Creating MS Accessor /
Creating MS Accessor /VLBA_TL016B_split_0
Creating MS Accessor /VLBA_TL016B_split_1
Creating MS Accessor /VLBA_TL016B_split_2
Creating MS Accessor /VLBA_TL016B_split_3
Creating MS Accessor /VLBA_TL016B_split_0/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_0/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_0/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_0/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_0/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_0/weather_xds
Creating MS Accessor /VLBA_TL016B_split_1/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_1/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_1/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_1/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_1/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_1/weather_xds
Creating MS Accessor /VLBA_TL016B_split_2/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_2/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_2/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_2/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_2/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_2/weather_xds
Creating MS Accessor /VLBA_TL016B_split_3/antenna_xds
Creating MS Accessor /VLBA_TL016B_split_3/correlated_xds
Creating MS Accessor /VLBA_TL016B_split_3/field_and_source_xds_base
Creating MS Accessor /VLBA_TL016B_split_3/gain_curve_xds
Creating MS Accessor /VLBA_TL016B_split_3/system_calibration_xds
Creating MS Accessor /VLBA_TL016B_split_3/weather_xds
elif self._option == 2.0: | ||
antenna_ds = self._dt.children[name] | ||
elif self._option == 2.5: | ||
antenna_ds = self._dt.root[name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear - proposal 2.5 was meant to suggest that datasets such as antenna_xds
could appear both below the correlated data dataset or at any node above. That's why it's 2.5: It is a generalisation. And same logic should apply to weather etc (shared weather table doesn't sound crazy at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. My immediate thought here is that the flexibility is going to be confusing for software development, but probably solvable with links (which are not been explored in this PR, yet).
Would it be possible to divide secondary datasets into shareable/non-shareable categories where the former are located as a child of the tree root, while the rest are children or siblings of the partition?
I like option 2.5 as it uses Datatree's coordinate inheritance which should work seamlessly with @Jan-Willem 's work on interpolating visibility coordinates in secondary dataset. This seems to work on the single VLBA I've been testing on but probably needs to be smoke tested on the rest of the MSv4 test data.
This might be better to discuss in person in the wG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but probably solvable with links (which are not been explored in this PR, yet).
One of the findings I take from this prototype is that it's fairly easy to navigate the various options using the standard DataTree
methods (root, parent, children, siblings).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear - proposal 2.5 was meant to suggest that datasets such as antenna_xds could appear both below the correlated data dataset or at any node above. That's why it's 2.5: It is a generalisation. And same logic should apply to weather etc (shared weather table doesn't sound crazy at all).
Maybe this is possible without links. What about this idea?
- If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
- Same logic applies for other secondary datasets
This could be enforced via the dt.msa.antenna
accessor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is possible without links. What about this idea?
- If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
- Same logic applies for other secondary datasets
This could be enforced via the
dt.msa.antenna
accessor.
The danger with this approach maybe global/parition datasets getting out of sync. Perhaps it should be enforced as global xor partition by both the schema checker and accessor.
This would be a relaxation of the following:
Would it be possible to divide secondary datasets into shareable/non-shareable categories where the former are located as a child of the tree root, while the rest are children or siblings of the partition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is possible without links. What about this idea?
- If a partition has a child/sibling antenna dataset then that takes precedence over a global antenna dataset
- Same logic applies for other secondary datasets
Yes, that was the proposal. I'm giving pseudo-code on Confluence:
# Option 2.5
for parent in pathlib.parents(path):
if 'ANTENNA' in parent.children:
antenna = parent.children['ANTENNA']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And just to be clear - I would just propose to support this in the accessor, but not necessarily go out of our way to actually make systematic use of it. It's just to retain the optimisation option.
This draft PR contains a script that re-arranges a Processing Set to produce a Datatree corresponding to the options present in this document:
Invoke the script with
There are multiple options to configure the resulting Datatree, call the script with
--help
to discover them.This PR has only been tested with the
VLBA_TL016B_split_lsrk
dataset.Accessors are still TODO, but it might be useful to start getting a feel for the layout.Accessors have been implemented for the various options/cc @Jan-Willem @scpmw