
MNT: Refactoring changes to CSV adapter + CSVArrayAdapter #803

Open · wants to merge 21 commits into main
Conversation

genematx
Contributor

@genematx genematx commented Oct 29, 2024

  • Allow CSVAdapter to accept kwargs for pd.read_csv (e.g. separator)
  • Remove dataframe_adapter property from CSVAdapter
  • Allow simple instantiation with either a single uri or a list of uris
  • Specify multipart/related;type=text/csv mimetype
  • Experimental: Add a new CSVArrayAdapter backed by an ArrayAdapter (instead of a TableAdapter). It can be used to load homogeneous numerical arrays stored as CSV files. The two are distinguished by mimetype: "text/csv;header=present" for tables, and "text/csv;header=absent" for arrays.
  • Unify the classmethod interfaces across the internal adapters (with from_assets and from_uris being the two primary methods).
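To illustrate the table/array distinction carried by the mimetype, here is a minimal sketch using only the standard library. The function names are hypothetical; the real adapters are backed by pandas and Tiled's TableAdapter/ArrayAdapter, not by these helpers.

```python
import csv
import io


def read_csv_table(text):
    """Parse CSV text that has a header row into a dict of named columns
    (roughly the shape a TableAdapter-backed CSVAdapter exposes)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def read_csv_array(text):
    """Parse headerless CSV text into rows of floats
    (roughly the shape an ArrayAdapter-backed CSVArrayAdapter exposes)."""
    reader = csv.reader(io.StringIO(text))
    return [[float(cell) for cell in row] for row in reader]


def read_csv(text, mimetype):
    # The mimetype parameter carries the table-vs-array distinction.
    if mimetype == "text/csv;header=present":
        return read_csv_table(text)
    if mimetype == "text/csv;header=absent":
        return read_csv_array(text)
    raise ValueError(f"Unrecognized mimetype: {mimetype!r}")
```

The same bytes can parse either way; only the registered mimetype decides which adapter (and hence which structure) the client sees.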

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

@genematx genematx marked this pull request as ready for review October 29, 2024 18:46
@genematx genematx changed the title MNT: small refactoring changes to CSV adapter MNT: Refactoring changes to CSV adapter + CSVArrayAdapter Jan 14, 2025
@danielballan
Member

danielballan commented Jan 16, 2025

I see two structural changes here:

  1. In main, we have registries that map mimetypes to any callable that returns an Adapter. In this PR, the registries must map to an Adapter class. These Adapter classes adhere to an interface with several methods, each serving a distinct function:
    i. "I want to create a new empty dataset of this format, to write into it." (init_storage)
    ii. "I found some files in this format that I want to introspect, to register them in my catalog database." (from_uris)
    iii. "I previously introspected or wrote some files, and now I want to access them based on this information from my catalog database." (from_assets)
    As before, __init__, the "single privileged entrypoint", expects exactly what it needs, and its signature varies from class to class.

  2. There is an additional change involved in (iii), from_assets. In this PR, it expects a flat list of Asset objects, which may point to:
    i. Files of mixed types and roles, such as TIFFs in an ordered sequence mixed with "sidecar" JSON or YAML metadata files
    ii. Files that must be tracked by the database for accounting purposes but are actually not used directly by the Adapter (e.g. HDF5 "data" files in a virtual dataset)

Some time ago, we encountered these examples and some others in the wild. We did not like the look of these Adapters:
(a) Their signatures did not explain to the caller what was expected: just "list of URIs".
(b) The Adapter code had to assign roles, distinguish scalars from lists, and potentially sort lists. That seemed wasteful and error-prone. When we introspect the files (ii), we already know this information. We should write it down at that time and use it!

Thus, we added additional columns to the Asset--DataSource many-to-many relation table which encode the role and order each Asset plays. To be specific, with each row in the many-to-many relation table we also store, in addition to asset_id and data_source_id, a parameter and a num:

| parameter | num     | What to do                                                        |
| --------- | ------- | ----------------------------------------------------------------- |
| `<name>`  | NULL    | `Adapter.from_assets(<name>=asset.data_uri)`                      |
| `<name>`  | `<int>` | `Adapter.from_assets(<name>=[asset.data_uri, ...])`               |
| NULL      | NULL    | `Adapter.from_assets(...)` (do not include asset in parameters)   |
| NULL      | `<int>` | Disallowed by database triggers                                   |

I really like having standard separate paths for introspection (ii) and construction from DataSource/Asset (iii). That's a huge improvement.

I see in the implementation of from_assets a regression that (a) puts us back to homogeneous, vague signatures (assets: list[Asset]) which mask diverse specific requirements and (b) leaves the Adapters redoing work (and perhaps making mistakes or missing errors) that was already done and recorded when the data was registered or written, as the case may be.


Now on shakier ground, just brainstorming...

Can we get the best of both worlds? As a starting point, a convenience function like this might work:

```python
def from_node_and_data_source(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_node_and_data_source(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
```

(A further convenience could wrap this and figure out the adapter_cls automatically by looking up data_source.mimetype in an adapter_by_mimetype registry, i.e. f(node, data_source) -> Adapter.)
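That further convenience might look like the following sketch. The names `adapter_from_catalog`, `adapter_by_mimetype`, and the inline grouping logic are illustrative stand-ins, not the PR's actual implementation.

```python
from collections import defaultdict


def from_node_and_data_source(adapter_cls, node, data_source):
    # Same grouping rule as in the sketch above:
    # num=None -> scalar URI; num=<int> -> append to a list under that parameter.
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            parameters[asset.parameter] = asset.data_uri
        else:
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(
        metadata=node.metadata,
        specs=node.specs,
        structure=data_source.structure,
        **parameters,
    )


def adapter_from_catalog(node, data_source, adapter_by_mimetype):
    "Resolve the Adapter class from a mimetype registry, then construct it."
    adapter_cls = adapter_by_mimetype[data_source.mimetype]
    return from_node_and_data_source(adapter_cls, node, data_source)
```

With this wrapper, callers never name an Adapter class directly; the DataSource's mimetype is the single source of truth for dispatch.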

Thus, we would drop the from_assets constructor in favor of using the explicit cls.__init__, which declares in its signature precisely what it expects: whether it be one URI, a list of URIs, or multiple different URIs/lists with different formats and/or roles to play.

We would retain two great aspects of this PR:

  • an easy way to construct Adapters from data_sources and their assets
  • a clear separation between the introspection/registration path and the "from assets" path

There may be better implementations of these goals available, but I hope this is a promising example.

@danielballan
Member

Two minor weak points in both the implementation on main and my sketch above:

  1. The namespace of parameters is shared with the namespace of Tiled-reserved terms like metadata, structure, and specs. Echoing @tacaswell from the other day: when "user" names (user = Adapter author) and framework names share a namespace, that is often painful. OTOH, having clean function signatures that describe their expected inputs is nice! And the set of reserved terms is unlikely to change much.
  2. The names we use for parameters right now---often data_uri or data_uris---are not very descriptive. Their utility becomes more obvious in special cases like HDF5 with "master" and "data" files, or arrangements of mixed formats like TIFF+YAML. Maybe we should migrate to more specific parameter names, like "image_file"?
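The collision concern in point 1 could be guarded mechanically. The following is a hypothetical sketch, not part of this PR: reject any Adapter parameter name that shadows a Tiled-reserved keyword before it ever reaches `**parameters`.

```python
# Reserved keyword arguments that the framework itself passes to Adapter
# constructors. (Set contents taken from the discussion above.)
RESERVED = {"metadata", "structure", "specs"}


def validate_parameter_names(parameter_names):
    """Hypothetical guard: raise if any Adapter parameter name collides
    with a Tiled-reserved term, instead of silently shadowing it."""
    collisions = RESERVED.intersection(parameter_names)
    if collisions:
        raise ValueError(
            f"Parameter names collide with reserved terms: {sorted(collisions)}"
        )
```

Because the reserved set is small and unlikely to change, a check like this could run once at registration time rather than on every construction.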

@danielballan
Member

Notes from discussion with @genematx

```python
def from_catalog(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_catalog(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
```


asset-datasource relation

| asset_id | data_source_id | parameter   | num |
| -------- | -------------- | ----------- | --- |
| 103      | 1              | tiff_image  | 1   |
| 103      | 2              | calib_image | 1   |

```
 parameter    num
 ---------    ---
# HDF5
 hdf5_file    NULL

# TIFF sequence
 tiff_images  1
 tiff_images  2

# TIFF stack
 tiff_stack   NULL
```

"hdf5_file"
["image1", "image2"]

 Asset:
 - data_uri
 - size
 - hash_meth


 HDF5Adapter(hdf5_file: str, ...)
 TIFFSequence(tiff_images: List[str], ...)
 TIFFStack(tiff_image: str, ...)


 class Storage:
     filesystem: str
     sql: str



```python
class MyAdapter:
    def __init__(self, foo_file: str, metadata, structure, specs, ...):
        "Construct Adapter from info extracted from node, data_source and its assets"
        if foo_file == "":
            raise ValueError
        self._foo_file = foo_file

    def init_storage(self, storage, data_source) -> DataSource:
        "Allocate assets for writing data."
        # Always creates some Assets and attaches them.
        # Sometimes adds/alters data_source.parameters.
        # Could look at data_source.mimetype...
        return data_source  # includes data_source.assets

    @classmethod
    def from_uris(cls, *files) -> "MyAdapter":
        "Accept inherently heterogeneous/unsorted files with unknown structure and introspect."
        ...
        return cls(...)

    @classmethod
    def from_catalog(cls, node, data_source):
        return from_catalog(cls, node, data_source)
```


```python
# in tiled/catalog/adapter.py
adapter_cls = adapters_by_mimetype[data_source.mimetype]
adapter = from_catalog(adapter_cls, node, data_source)

from_catalog(CSVAdapter, node, data_source)

CSVAdapter.from_catalog(node, data_source)
```


```python
def asset_parameters_to_adapter_kwargs(data_source):
    "Transform database representation to Python representation."
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return parameters


def from_catalog(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_catalog(CSVAdapter, node, data_source)"
    parameters = asset_parameters_to_adapter_kwargs(data_source)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
```
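The two-helper split above can be exercised end to end with minimal stand-ins. Everything below is hypothetical scaffolding for illustration: the dataclasses approximate the catalog ORM objects, and DummyCSVAdapter merely records its constructor arguments.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Asset:
    data_uri: str
    parameter: str
    num: Optional[int] = None


@dataclass
class DataSource:
    mimetype: str
    structure: object
    assets: list = field(default_factory=list)


@dataclass
class Node:
    metadata: dict
    specs: list


def asset_parameters_to_adapter_kwargs(data_source):
    "Transform database representation to Python representation."
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            parameters[asset.parameter] = asset.data_uri
        else:
            parameters[asset.parameter].append(asset.data_uri)
    return parameters


class DummyCSVAdapter:
    "Stand-in for CSVAdapter; records what it was constructed with."
    def __init__(self, metadata, specs, structure, data_uris=()):
        self.data_uris = list(data_uris)


def from_catalog(adapter_cls, node, data_source):
    kwargs = asset_parameters_to_adapter_kwargs(data_source)
    return adapter_cls(
        metadata=node.metadata,
        specs=node.specs,
        structure=data_source.structure,
        **kwargs,
    )
```

The point of the exercise: `from_catalog` never inspects asset roles itself; the (parameter, num) columns recorded at registration time drive the construction directly.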
