
MNT: Refactoring changes to CSV adapter + CSVArrayAdapter #803

Open · wants to merge 21 commits into main
Conversation

genematx
Contributor

@genematx genematx commented Oct 29, 2024

  • Allow CSVAdapter to accept kwargs for pd.read_csv (e.g. separator)
  • Remove dataframe_adapter property from CSVAdapter
  • Allow simple instantiation with either a single uri or a list of uris
  • Specify multipart/related;type=text/csv mimetype
  • Experimental: Add a new CSVArrayAdapter backed by an ArrayAdapter (instead of a TableAdapter). It can be used to load homogeneous numerical arrays stored as CSV files. The two are distinguished by mimetype: "text/csv;header=present" for tables, and "text/csv;header=absent" for arrays.
  • Unify the classmethod interfaces across the internal adapters (with from_assets and from_uris being the two primary methods).
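To illustrate the table/array distinction carried by the mimetype, here is a minimal sketch using only the standard library. The function names are hypothetical; the real adapters are backed by pandas and Tiled's TableAdapter/ArrayAdapter, not by these helpers.

```python
import csv
import io


def read_csv_table(text):
    """Parse CSV text that has a header row into a dict of named columns
    (roughly the shape a TableAdapter-backed CSVAdapter exposes)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def read_csv_array(text):
    """Parse headerless CSV text into rows of floats
    (roughly the shape an ArrayAdapter-backed CSVArrayAdapter exposes)."""
    reader = csv.reader(io.StringIO(text))
    return [[float(cell) for cell in row] for row in reader]


def read_csv(text, mimetype):
    # The mimetype parameter carries the table-vs-array distinction.
    if mimetype == "text/csv;header=present":
        return read_csv_table(text)
    if mimetype == "text/csv;header=absent":
        return read_csv_array(text)
    raise ValueError(f"Unrecognized mimetype: {mimetype!r}")
```

The same bytes can parse either way; only the registered mimetype decides which adapter (and hence which structure) the client sees.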

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

@genematx genematx marked this pull request as ready for review October 29, 2024 18:46
@genematx genematx changed the title MNT: small refactoring changes to CSV adapter MNT: Refactoring changes to CSV adapter + CSVArrayAdapter Jan 14, 2025
@danielballan
Member

danielballan commented Jan 16, 2025

I see two structural changes here:

  1. In main, we have registries that map mimetypes to any callable that returns an Adapter. In this PR, the registries must map to an Adapter class. These Adapter classes adhere to an interface with several methods, each serving a distinct function:
    i. "I want to create a new empty dataset of this format, to write into it." (init_storage)
    ii. "I found some files in this format that I want to introspect, to register them in my catalog database." (from_uris)
    iii. "I previously introspected or wrote some files, and now I want to access them based on this information from my catalog database." (from_assets)
    As before, __init__, the "single privileged entrypoint", expects exactly what it needs, and its signature varies from class to class.

  2. There is an additional change involved in (iii), from_assets. In this PR, it expects a flat list of Asset objects, which may point to:
    i. Files of mixed types and roles, such as TIFFs in an ordered sequence mixed with "sidecar" JSON or YAML metadata files
    ii. Files that must be tracked by the database for accounting purposes but are actually not used directly by the Adapter (e.g. HDF5 "data" files in a virtual dataset)

Some time ago, we encountered these examples and some others in the wild. We did not like the look of these Adapters:
(a) Their signatures did not explain to the caller what was expected: just "list of URIs".
(b) The Adapter code had to assign roles, distinguish scalars from lists, and potentially sort lists. That seemed wasteful and error-prone. When we introspect the files (ii), we already know this information. We should write it down at that time and use it!

Thus, we added additional columns to the Asset--DataSource many-to-many relation table which encode the role and order each Asset plays. To be specific, with each row in the many-to-many relation table we also store, in addition to asset_id and data_source_id, a parameter and a num:

| parameter | num     | What to do                                                        |
| --------- | ------- | ----------------------------------------------------------------- |
| `<name>`  | NULL    | `Adapter.from_assets(<name>=asset.data_uri)`                      |
| `<name>`  | `<int>` | `Adapter.from_assets(<name>=[asset.data_uri, ...])`               |
| NULL      | NULL    | `Adapter.from_assets(...)` (do not include asset in parameters)   |
| NULL      | `<int>` | Disallowed by database triggers                                   |

I really like having standard separate paths for introspection (ii) and construction from DataSource/Asset (iii). That's a huge improvement.

I see in the implementation of from_assets a regression that (a) puts us back to homogeneous, vague signatures (assets: list[Asset]) which mask diverse specific requirements and (b) leaves the Adapters redoing work (and perhaps making mistakes or missing errors) that was already done and recorded when the data was registered or written, as the case may be.


Now on shakier ground, just brainstorming...

Can we get the best of both worlds? As a starting point, a convenience function like this might work:

```python
def from_node_and_data_source(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_node_and_data_source(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
```

(A further convenience could wrap this and figure out the adapter_cls automatically by looking up data_source.mimetype in an adapter_by_mimetype registry, i.e. f(node, data_source) -> Adapter.)
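That further convenience might look like the following sketch. The names `adapter_from_catalog`, `adapter_by_mimetype`, and the inline grouping logic are illustrative stand-ins, not the PR's actual implementation.

```python
from collections import defaultdict


def from_node_and_data_source(adapter_cls, node, data_source):
    # Same grouping rule as in the sketch above:
    # num=None -> scalar URI; num=<int> -> append to a list under that parameter.
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            parameters[asset.parameter] = asset.data_uri
        else:
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(
        metadata=node.metadata,
        specs=node.specs,
        structure=data_source.structure,
        **parameters,
    )


def adapter_from_catalog(node, data_source, adapter_by_mimetype):
    "Resolve the Adapter class from a mimetype registry, then construct it."
    adapter_cls = adapter_by_mimetype[data_source.mimetype]
    return from_node_and_data_source(adapter_cls, node, data_source)
```

With this wrapper, callers never name an Adapter class directly; the DataSource's mimetype is the single source of truth for dispatch.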

Thus, we would drop the from_assets constructor in favor of using the explicit cls.__init__, which declares in its signature precisely what it expects: whether it be one URI, a list of URIs, or multiple different URIs/lists with different formats and/or roles to play.

We would retain two great aspects of this PR:

  • an easy way to construct Adapters from data_sources and their assets
  • a clear separation between the introspection/registration path and the "from assets" path

There may be better implementations of these goals available, but I hope this is a promising example.

@danielballan
Member

Two minor weak points in both the implementation on main and my sketch above:

  1. The namespace of parameters is shared with the namespace of Tiled-reserved terms like metadata, structure, and specs. Echoing @tacaswell from the other day: when "user" names (user = Adapter author) and framework names share a namespace, that is often painful. OTOH, having clean function signatures that describe their expected inputs is nice! And the set of reserved terms is unlikely to change much.
  2. The names we use for parameters right now---often data_uri or data_uris---are not very descriptive. Their utility becomes more obvious in special cases like HDF5 with "master" and "data" files, or arrangements of mixed formats like TIFF+YAML. Maybe we should migrate to more specific parameter names, like "image_file"?
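The collision concern in point 1 could be guarded mechanically. The following is a hypothetical sketch, not part of this PR: reject any Adapter parameter name that shadows a Tiled-reserved keyword before it ever reaches `**parameters`.

```python
# Reserved keyword arguments that the framework itself passes to Adapter
# constructors. (Set contents taken from the discussion above.)
RESERVED = {"metadata", "structure", "specs"}


def validate_parameter_names(parameter_names):
    """Hypothetical guard: raise if any Adapter parameter name collides
    with a Tiled-reserved term, instead of silently shadowing it."""
    collisions = RESERVED.intersection(parameter_names)
    if collisions:
        raise ValueError(
            f"Parameter names collide with reserved terms: {sorted(collisions)}"
        )
```

Because the reserved set is small and unlikely to change, a check like this could run once at registration time rather than on every construction.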

@danielballan
Member

Notes from discussion with @genematx

```python
def from_catalog(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_catalog(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
```


asset-datasource relation

| asset_id | data_source_id | parameter   | num |
| -------- | -------------- | ----------- | --- |
| 103      | 1              | tiff_image  | 1   |
| 103      | 2              | calib_image | 1   |

```
 parameter    num
 ---------    ---
# HDF5
 hdf5_file    NULL

# TIFF sequence
 tiff_images  1
 tiff_images  2

# TIFF stack
 tiff_stack   NULL
```

"hdf5_file"
["image1", "image2"]

 Asset:
 - data_uri
 - size
 - hash_meth


 HDF5Adapter(hdf5_file: str, ...)
 TIFFSequence(tiff_images: List[str], ...)
 TIFFStack(tiff_image: str, ...)


 class Storage:
     filesystem: str
     sql: str



```python
class MyAdapter:
    def __init__(self, foo_file: str, metadata, structure, specs, ...):
        "Construct Adapter from info extracted from node, data_source and its assets"
        if foo_file == "":
            raise ValueError
        self._foo_file = foo_file

    def init_storage(self, storage, data_source) -> DataSource:
        "Allocate assets for writing data."
        # Always creates some Assets and attaches them.
        # Sometimes adds/alters data_source.parameters.
        # Could look at data_source.mimetype...
        return data_source  # includes data_source.assets

    @classmethod
    def from_uris(cls, *files) -> "MyAdapter":
        "Accept inherently heterogeneous/unsorted files with unknown structure and introspect."
        ...
        return cls(...)

    @classmethod
    def from_catalog(cls, node, data_source):
        return from_catalog(cls, node, data_source)
```


```python
# in tiled/catalog/adapter.py
adapter_cls = adapters_by_mimetype[data_source.mimetype]
adapter = from_catalog(adapter_cls, node, data_source)

from_catalog(CSVAdapter, node, data_source)

CSVAdapter.from_catalog(node, data_source)
```


```python
def asset_parameters_to_adapter_kwargs(data_source):
    "Transform database representation to Python representation."
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return parameters


def from_catalog(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_catalog(CSVAdapter, node, data_source)"
    parameters = asset_parameters_to_adapter_kwargs(data_source)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
```
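The two-helper split above can be exercised end to end with minimal stand-ins. Everything below is hypothetical scaffolding for illustration: the dataclasses approximate the catalog ORM objects, and DummyCSVAdapter merely records its constructor arguments.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Asset:
    data_uri: str
    parameter: str
    num: Optional[int] = None


@dataclass
class DataSource:
    mimetype: str
    structure: object
    assets: list = field(default_factory=list)


@dataclass
class Node:
    metadata: dict
    specs: list


def asset_parameters_to_adapter_kwargs(data_source):
    "Transform database representation to Python representation."
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            parameters[asset.parameter] = asset.data_uri
        else:
            parameters[asset.parameter].append(asset.data_uri)
    return parameters


class DummyCSVAdapter:
    "Stand-in for CSVAdapter; records what it was constructed with."
    def __init__(self, metadata, specs, structure, data_uris=()):
        self.data_uris = list(data_uris)


def from_catalog(adapter_cls, node, data_source):
    kwargs = asset_parameters_to_adapter_kwargs(data_source)
    return adapter_cls(
        metadata=node.metadata,
        specs=node.specs,
        structure=data_source.structure,
        **kwargs,
    )
```

The point of the exercise: `from_catalog` never inspects asset roles itself; the (parameter, num) columns recorded at registration time drive the construction directly.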
