The Big Spool Refactor #482
Replies: 3 comments
-
Hey @d-chambers,
-
I really don't know yet. I will try to come up with a minimal viable implementation, then we can run some benchmarks to compare. Right now, the index querying is very fast, so as long as it is still a good experience for users, I am inclined to sacrifice some performance for more features/flexibility. For example, if reasonably large queries go from 10 ms to 20 ms, most users probably won't even notice the performance degradation but will probably appreciate the new features. I will post some updates here once I have something working.
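Something like the following micro-benchmark could be used to compare index query times before and after the refactor; the spool path and query window below are just placeholders.

```python
import time

import dascore as dc

# Placeholder: a local directory of DAS files indexed by a DirectorySpool.
sp = dc.spool("path_to_my_spool").update()

# Time a reasonably large index-backed query (placeholder time window).
start = time.perf_counter()
sub = sp.select(time=("2024-01-01", "2024-01-02"))
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"{len(sub)} patches matched; index query took {elapsed_ms:.1f} ms")
```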
-
I am working on the refactor and expanding the flexibility of the `spool` inputs. For example:

```python
from pathlib import Path

import dascore as dc
from upath import UPath

# Standard ways to init a spool on a local file system. These should all still work fine.
sp_local = dc.spool("path_to_my_spool")
sp_local = dc.spool(Path("path_to_my_spool"))
sp_local = dc.spool(UPath("path_to_my_spool"))

# Using an http resource.
upath_http = UPath("https://github.com/dasdae/test_data/raw/master/das/terra15_v5_test_file.hdf5")
sp_http = dc.spool(upath_http)

# Using an s3 bucket (note: fs-specific parameters, such as authentication, are handled by the UPath).
upath_s3 = UPath("s3://gdr-data-lake/FORGE/DAS/Neubrex/April_2024/v1.0.0", anon=True)
sp_s3 = dc.spool(upath_s3)
```

Of course, we will simply convert string-like inputs to UPaths under the hood, so just passing the URL will work if no extra parameters are needed.

The other thing is to make it easy to add/combine spools as you would lists. For example, the following should work:

```python
# Using a sequence.
sp = dc.spool([upath_http, upath_s3])

# Or adding spools together.
sp = sp_http + sp_s3 + sp_local
```

I also figure it will be useful to add the ability to just write out the index file. These should be able to store the chunking/selecting of the entire spool:

```python
# Write out only the index, which will also save info about how the spool was selected/chunked.
sp.write_index("path_to_index.db", full_paths=True)

# Read only the index, which has enough info to point to the data files.
sp = dc.spool(index="path_to_index.db")
```

I still have a lot of work to do on the database bits and spool refactor, but I wanted to put the extended API out there before I implement it to see if anyone has any feedback.
-
TLDR: Swapping out the PyTables backend of the `DirectorySpool` for a proper RDBMS will likely remove significant limitations and enable new features. We can also clean up the spool and chunking implementations. I am leaning towards using DuckDB.

After several discussions, it has become apparent that we need to refactor the `Spool` class, particularly the way the `DirectorySpool` indexes file archives. This discussion details the reasons why, the benefits, and a plan moving forward.

Current Implementation

Currently, the `DirectorySpool` creates a small HDF5 index using the table features of PyTables. All of the indexed features are stored in a single table. The columns of this table can be printed using the `get_contents` method of a `DirectorySpool`. They are:

'file_format', 'time_min', 'time_step', 'path', 'station', 'network', 'tag', 'instrument_id', 'dims', 'data_type', 'experiment_id', 'time_max', 'file_version', 'data_category'.
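For reference, something like the following will print these columns for an existing spool (the path below is a placeholder):

```python
import dascore as dc

# Placeholder path to a local directory of DAS files.
sp = dc.spool("path_to_my_spool").update()

# get_contents returns the index as a pandas DataFrame whose columns are the fields above.
contents = sp.get_contents()
print(list(contents.columns))
```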
Any other information, such as the data units, extents of other dimensions, etc., is not known until a patch is loaded into memory. The limitations of this index are the cause of many open issues, such as:

#252 - There is no distance information stored.
#362, #436 - The `relative` and `samples` keywords probably require more information to work.
#417 - `Spool.concatenate` doesn't have enough information from the current index to work on `DirectorySpool` instances.
#437 - `Spool.chunk` can't use units since we don't store coordinate units.
#447 - The index doesn't have enough information to use samples on the distance coord since it doesn't store any information about it.
However, keeping everything in a single table means it will be difficult to support indexing arbitrary coordinates and attributes (which I would like to do). Moreover, every time we change the schema, it will require all users to re-index their directory spools, so it will be best to implement all the anticipated changes in one go so users only have to do this once.
A New Schema
Using a proper RDBMS will allow us to use multiple tables. Specifically, with 3 source tables I think we can support arbitrary coordinates and attributes. These would be a Patch table, a Coord table, and an Attr table.
Source Patch Table
Source Coordinate Table
Source Attr Table
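To make the idea concrete, here is a minimal sketch of what these three source tables might look like in DuckDB. The table and column names are illustrative assumptions only, not the final schema:

```python
import duckdb

# Illustrative schema only; names and types are assumptions, not the final design.
con = duckdb.connect()  # in-memory DuckDB instance

# One row per indexed patch (i.e. per file or file section).
con.execute("""
    CREATE TABLE patch_source (
        patch_id INTEGER PRIMARY KEY,
        path VARCHAR,          -- local path or remote address
        file_format VARCHAR,
        file_version VARCHAR,
        data_type VARCHAR,
        dims VARCHAR
    )
""")

# One row per coordinate per patch (supports arbitrary coordinate names).
# A real implementation would also need to handle datetime coordinates.
con.execute("""
    CREATE TABLE coord_source (
        patch_id INTEGER,      -- references patch_source.patch_id
        name VARCHAR,          -- e.g. 'time', 'distance'
        dtype VARCHAR,
        units VARCHAR,
        coord_min DOUBLE,
        coord_max DOUBLE,
        coord_step DOUBLE
    )
""")

# One row per attribute per patch (supports arbitrary attribute names).
con.execute("""
    CREATE TABLE attr_source (
        patch_id INTEGER,      -- references patch_source.patch_id
        name VARCHAR,          -- e.g. 'tag', 'network', 'station'
        value VARCHAR
    )
""")
```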
Advantages and Other Plans
This schema can then index coordinates and attributes with arbitrary names. To create the contents dataframe (e.g. from `spool.get_contents()`) we only need to pivot the coordinate and attribute tables and join them back to the patch table (a rough sketch of such a query is shown below).

While we are doing this refactor, we can add support for fsspec so that the address column can represent either a local or remote resource. Since DuckDB already has support for the major cloud stores, DASCore spools could handle remote and local data in the same way.
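Here is a rough sketch of that pivot-and-join, continuing from the illustrative connection `con` and tables sketched above (conditional aggregation is used here, though DuckDB's PIVOT statement could likely do the same job):

```python
# Build a wide "contents" dataframe by pivoting the coord/attr tables and
# joining them back to the patch table; the column choices are illustrative.
contents = con.execute("""
    SELECT
        p.path,
        p.file_format,
        MAX(CASE WHEN c.name = 'time' THEN c.coord_min END) AS time_min,
        MAX(CASE WHEN c.name = 'time' THEN c.coord_max END) AS time_max,
        MAX(CASE WHEN c.name = 'distance' THEN c.coord_min END) AS distance_min,
        MAX(CASE WHEN c.name = 'distance' THEN c.coord_max END) AS distance_max,
        MAX(CASE WHEN a.name = 'tag' THEN a.value END) AS tag
    FROM patch_source AS p
    LEFT JOIN coord_source AS c ON c.patch_id = p.patch_id
    LEFT JOIN attr_source AS a ON a.patch_id = p.patch_id
    GROUP BY p.patch_id, p.path, p.file_format
""").df()  # .df() converts the result to a pandas DataFrame
```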
We could also refactor the MemorySpool to simply use an in-memory DuckDB instance. Then, combining spools (e.g., with `spool = spool1 + spool2`) only involves combining the corresponding source tables, either through an append or an upsert operation (see the sketch below).

Additionally, I suspect much of the current chunking implementation can be improved by pushing some of the operations into the database layer (e.g. using as_of merges). I will need to explore this more, but we may be able to have the chunking tables in the database as well.
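For example, appending one spool's rows into another's tables might look roughly like this, again continuing from the illustrative schema above; `other_patches` is a hypothetical dataframe standing in for a second spool's patch rows:

```python
import pandas as pd

# Hypothetical rows representing a second spool's patch table.
other_patches = pd.DataFrame(
    {
        "patch_id": [100],
        "path": ["s3://some-bucket/some_file.hdf5"],
        "file_format": ["DASDAE"],
        "file_version": ["1"],
        "data_type": ["strain_rate"],
        "dims": ["time,distance"],
    }
)

# Register the dataframe with the DuckDB connection, then upsert its rows.
# The primary key on patch_id resolves duplicates; a plain INSERT would be
# the pure append case.
con.register("other_patches", other_patches)
con.execute("INSERT OR REPLACE INTO patch_source SELECT * FROM other_patches")
```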
edit: Added dtype