MNT: Refactoring changes to CSV adapter + CSVArrayAdapter #803
I see two structural changes here:
Some time ago, we encountered these examples and some others in the wild. We did not like the look of these Adapters: Thus, we added columns to the Asset-DataSource many-to-many relation table that encode the role each Asset plays and its order. To be specific, with each row in the many-to-many relation table we also store, in addition to
I really like having standard, separate paths for introspection (ii) and construction from DataSource/Asset (iii). That's a huge improvement. I see in the implementation of Now on shakier ground, just brainstorming... Can we get the best of both worlds? As a starting point, a convenience function like this might work:

from collections import defaultdict

def from_node_and_data_source(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_node_and_data_source(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)

(A further convenience could wrap this and figure out the Thus, we would drop the We would retain two great aspects of this PR:
There may be better implementations of these goals available, but I hope this is a promising example.
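To make the brainstormed helper concrete, here is a minimal runnable sketch. `Asset`, `DataSource`, `Node`, and `ToyTIFFSequence` are hypothetical stand-ins written for this example; they are not tiled's actual classes, which carry more fields. The URIs are made up.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Optional


# Hypothetical stand-ins for the catalog objects (assumptions for illustration).
@dataclass
class Asset:
    data_uri: str
    parameter: str
    num: Optional[int] = None


@dataclass
class DataSource:
    structure: Any
    assets: list = field(default_factory=list)


@dataclass
class Node:
    metadata: dict
    specs: list


class ToyTIFFSequence:
    "Hypothetical adapter whose constructor takes a list of URIs."

    def __init__(self, tiff_images, metadata, specs, structure):
        self.tiff_images = tiff_images
        self.metadata = metadata
        self.specs = specs
        self.structure = structure


def from_node_and_data_source(adapter_cls, node, data_source):
    "Build adapter kwargs from the assets, then call the plain constructor."
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # Parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # Parameter that takes a list of URIs (appended in iteration order).
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(
        metadata=node.metadata,
        specs=node.specs,
        structure=data_source.structure,
        **parameters,
    )


node = Node(metadata={"sample": "demo"}, specs=[])
data_source = DataSource(
    structure=None,
    assets=[
        Asset("file://localhost/data/image1.tif", "tiff_images", 1),
        Asset("file://localhost/data/image2.tif", "tiff_images", 2),
    ],
)
adapter = from_node_and_data_source(ToyTIFFSequence, node, data_source)
```

The adapter's `__init__` stays a plain constructor with explicit arguments, and all knowledge of the database layout lives in the helper.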
Two minor weak points in both the implementation on
Notes from discussion with @genematx

from collections import defaultdict

def from_catalog(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_catalog(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
asset-datasource relation
asset_id  data_source_id  parameter    num
103       1               tiff_image   1
103       2               calib_image  1
parameter    num
---------    ---
# HDF5
hdf5_file    NULL
# TIFF sequence
tiff_images  1
tiff_images  2
# TIFF stack
tiff_stack   NULL
"hdf5_file"
["image1", "image2"]
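The mapping from relation-table rows to the Python values above can be sketched as follows. This is an illustration under assumptions: the helper name `rows_to_kwargs` and the URIs are made up, and sorting by `num` is an addition of this sketch (the code in the discussion appends in iteration order), included because row order from a database query is not guaranteed.

```python
from collections import defaultdict


def rows_to_kwargs(rows):
    "Turn (parameter, num, data_uri) rows into constructor keyword arguments."
    kwargs = {}
    numbered = defaultdict(list)
    for parameter, num, data_uri in rows:
        if num is None:
            # NULL num: the parameter takes a single URI (a str).
            kwargs[parameter] = data_uri
        else:
            # Integer num: the parameter takes a list of URIs.
            numbered[parameter].append((num, data_uri))
    for parameter, items in numbered.items():
        # Assumption of this sketch: order list entries by their num column.
        kwargs[parameter] = [uri for _, uri in sorted(items)]
    return kwargs


hdf5_rows = [("hdf5_file", None, "file://localhost/data/scan.h5")]
tiff_rows = [
    ("tiff_images", 2, "file://localhost/data/image2.tif"),
    ("tiff_images", 1, "file://localhost/data/image1.tif"),
]

hdf5_kwargs = rows_to_kwargs(hdf5_rows)
tiff_kwargs = rows_to_kwargs(tiff_rows)
```

A `NULL` in the `num` column yields a plain string, while numbered rows yield a list, matching the two representations shown above.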
Asset:
- data_uri
- size
- hash_meth
HDF5Adapter(hdf5_file: str, ...)
TIFFSequence(tiff_images: List[str], ...)
TIFFStack(tiff_image: str, ...)
class Storage:
    filesystem: str
    sql: str

class MyAdapter:
    def __init__(self, foo_file: str, metadata, structure, specs, ...):
        "Construct Adapter from info extracted from node, data_source and its assets"
        if foo_file == "":
            raise ValueError
        self._foo_file = foo_file

    def init_storage(self, storage, data_source) -> DataSource:
        "Allocate assets for writing data."
        # Always creates some Assets and attaches them.
        # Sometimes adds/alters data_source.parameters.
        # Could look at data_source.mimetype...
        return data_source  # includes data_source.assets

    @classmethod
    def from_uris(cls, *files) -> "MyAdapter":
        "Accept inherently heterogeneous/unsorted files with unknown structure and introspect."
        ...
        return cls(...)

    @classmethod
    def from_catalog(cls, node, data_source):
        return from_catalog(cls, node, data_source)
# in tiled/catalog/adapter.py
adapter_cls = adapters_by_mimetype[data_source.mimetype]
adapter = from_catalog(adapter_cls, node, data_source)
from_catalog(CSVAdapter, node, data_source)
CSVAdapter.from_catalog(node, data_source)
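The two call styles above are interchangeable when the classmethod is a thin wrapper around the module-level function. A small sketch, assuming a toy adapter and plain dicts in place of the real node and data source (`ToyCSVAdapter`, the dict layout, and the URI are hypothetical):

```python
def from_catalog(adapter_cls, node, data_source):
    # Simplified stand-in: node and data_source are plain dicts here.
    return adapter_cls(metadata=node["metadata"], **data_source["parameters"])


class ToyCSVAdapter:
    "Hypothetical adapter; not the CSVAdapter from the PR."

    def __init__(self, data_uri, metadata):
        self.data_uri = data_uri
        self.metadata = metadata

    @classmethod
    def from_catalog(cls, node, data_source):
        # Inside a method body the bare name resolves to the module-level
        # function, not to this classmethod, so this does not recurse.
        return from_catalog(cls, node, data_source)


# Registry lookup, as in the dispatch-by-mimetype lines above.
adapters_by_mimetype = {"text/csv": ToyCSVAdapter}

node = {"metadata": {"sample": "demo"}}
data_source = {
    "mimetype": "text/csv",
    "parameters": {"data_uri": "file://localhost/data/table.csv"},
}

adapter_cls = adapters_by_mimetype[data_source["mimetype"]]
via_function = from_catalog(adapter_cls, node, data_source)
via_classmethod = ToyCSVAdapter.from_catalog(node, data_source)
```

The function form suits generic dispatch code that only holds a class looked up at runtime; the classmethod form reads better when the adapter class is known at the call site.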
from collections import defaultdict

def asset_parameters_to_adapter_kwargs(data_source):
    "Transform database representation to Python representation."
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return parameters

def from_catalog(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_catalog(CSVAdapter, node, data_source)"
    parameters = asset_parameters_to_adapter_kwargs(data_source)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
- CSVAdapter to accept kwargs for pd.read_csv (e.g. separator)
- dataframe_adapter property from CSVAdapter
- multipart/related;type=text/csv mimetype
- CSVArrayAdapter backed by an ArrayAdapter (instead of TableAdapter). It can be used to load homogeneous numerical arrays stored as CSV files. The distinction between the two is intended to be made by the mimetype: "text/csv;header=present" for tables and "text/csv;header=absent" for arrays.
- from_assets and from_uris being the two primary methods).

Checklist
- Add the ticket number which this PR closes to the comment section
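The mimetype-based distinction between tables and arrays described in the PR summary could be sketched roughly as below. This is not the PR's implementation: it uses the standard-library `csv` module instead of pandas, and `parse_mimetype` and `load_csv` are hypothetical names introduced for this example. It assumes mimetype parameters follow a simple `key=value` syntax separated by semicolons.

```python
import csv
import io


def parse_mimetype(mimetype):
    "Split 'type/subtype;key=value' into ('type/subtype', {'key': 'value'})."
    base, *raw_params = (part.strip() for part in mimetype.split(";"))
    params = dict(p.split("=", 1) for p in raw_params if p)
    return base, params


def load_csv(text, mimetype):
    "Return ('table', columns) or ('array', rows) depending on the header parameter."
    base, params = parse_mimetype(mimetype)
    if base != "text/csv":
        raise ValueError(f"unsupported mimetype: {mimetype!r}")
    rows = list(csv.reader(io.StringIO(text)))
    header = params.get("header")
    if header == "present":
        # Table: first row names the columns; values are kept as strings.
        names, *body = rows
        return "table", {name: [row[i] for row in body] for i, name in enumerate(names)}
    if header == "absent":
        # Array: every cell is expected to be numeric.
        return "array", [[float(value) for value in row] for row in rows]
    raise ValueError(f"mimetype must specify header=present or header=absent: {mimetype!r}")


table = load_csv("a,b\n1,2\n3,4\n", "text/csv;header=present")
array = load_csv("1,2\n3,4\n", "text/csv;header=absent")
```

Encoding the choice in the mimetype means the registry lookup alone decides which adapter handles a file, without opening the file to guess whether its first row is a header.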