The Big Spool Refactor #482
Replies: 3 comments
-
Hey @d-chambers,
-
I really don't know yet. I will try to come up with a minimal viable implementation, then we can run some benchmarks to compare. Right now, the index querying is very fast, so as long as it is still a good experience for users, I am inclined to sacrifice some performance for more features/flexibility. For example, if reasonably large queries go from 10 ms to 20 ms, most users probably won't even notice the performance degradation but will probably appreciate the new features. I will post some updates here once I have something working.
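Something like the following micro-benchmark could be used to compare index query times before and after the refactor; the spool path and query window below are just placeholders.

```python
import time

import dascore as dc

# Placeholder: a local directory of DAS files indexed by a DirectorySpool.
sp = dc.spool("path_to_my_spool").update()

# Time a reasonably large index-backed query (placeholder time window).
start = time.perf_counter()
sub = sp.select(time=("2024-01-01", "2024-01-02"))
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"{len(sub)} patches matched; index query took {elapsed_ms:.1f} ms")
```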
-
I am working on the refactor and expanding the flexibility of the `spool` inputs. For example:

```python
from pathlib import Path

import dascore as dc
from upath import UPath

# Standard ways to init a spool on a local file system. These should all still work fine.
sp_local = dc.spool("path_to_my_spool")
sp_local = dc.spool(Path("path_to_my_spool"))
sp_local = dc.spool(UPath("path_to_my_spool"))

# Using an http resource.
upath_http = UPath("https://github.com/dasdae/test_data/raw/master/das/terra15_v5_test_file.hdf5")
sp_http = dc.spool(upath_http)

# Using an s3 bucket (note: fs-specific parameters, such as authentication, are handled by the UPath).
upath_s3 = UPath("s3://gdr-data-lake/FORGE/DAS/Neubrex/April_2024/v1.0.0", anon=True)
sp_s3 = dc.spool(upath_s3)
```

Of course, we will simply convert string-like inputs to UPaths under the hood, so just passing the URL will work if no extra parameters are needed.

The other thing is to make it easy to add/combine spools as you would lists. For example, the following should work:

```python
# Using a sequence.
sp = dc.spool([upath_http, upath_s3])

# Or adding spools together.
sp = sp_http + sp_s3 + sp_local
```

I also figure it will be useful to add the ability to just write out the index file. These should be able to store the chunking/selecting of the entire spool:

```python
# Write out only the index, which will also save info about how the spool was selected/chunked.
sp.write_index("path_to_index.db", full_paths=True)

# Read only the index, which has enough info to point to the data files.
sp = dc.spool(index="path_to_index.db")
```

I still have a lot of work to do on the database bits and spool refactor, but I wanted to put the extended API out there before I implement it to see if anyone has any feedback.
-
TLDR: Swapping out the PyTables backend of the `DirectorySpool` for a proper RDBMS will likely remove significant limitations and enable new features. We can also clean up the spool and chunking implementations. I am leaning towards using DuckDB.

After several discussions, it has become apparent that we need to refactor the `Spool` class, particularly the way the `DirectorySpool` indexes file archives. This discussion details the reasons why, the benefits, and a plan moving forward.

Current Implementation

Currently, the `DirectorySpool` creates a small HDF5 index using the table features of PyTables. All of the indexed features are stored in a single table. The columns of this table can be printed using the `get_contents` method of a `DirectorySpool`. They are:

'file_format', 'time_min', 'time_step', 'path', 'station', 'network', 'tag', 'instrument_id', 'dims', 'data_type', 'experiment_id', 'time_max', 'file_version', 'data_category'.
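For reference, something like the following will print these columns for an existing spool (the path below is a placeholder):

```python
import dascore as dc

# Placeholder path to a local directory of DAS files.
sp = dc.spool("path_to_my_spool").update()

# get_contents returns the index as a pandas DataFrame whose columns are the fields above.
contents = sp.get_contents()
print(list(contents.columns))
```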
Any other information, such as the data units, extents of other dimensions, etc., is not known until a patch is loaded into memory. The limitations of this index are the cause of many open issues, such as:

#252 - There is no distance information stored.
#362, #436 - The `relative` and `samples` keywords probably require more information to work.
#417 - `Spool.concatenate` doesn't have enough information from the current index to work on `DirectorySpool` instances.
#437 - `Spool.chunk` can't use units since we don't store coordinate units.
#447 - The index doesn't have enough information to use samples on the distance coord since it doesn't store any information about it.
However, keeping everything in a single table means it will be difficult to support indexing arbitrary coordinates and attributes (which I would like to do). Moreover, every time we change the schema, it will require all users to re-index their directory spools, so it will be best to implement all the anticipated changes in one go so users only have to do this once.
A New Schema
Using a proper RDBMS will allow us to use multiple tables. Specifically, with 3 source tables I think we can support arbitrary coordinates and attributes. These would be a Patch table, a Coord table, and an Attr table.
Source Patch Table
Source Coordinate Table
Source Attr Table
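To make the idea concrete, here is a minimal sketch of what these three source tables might look like in DuckDB. The table and column names are illustrative assumptions only, not the final schema:

```python
import duckdb

# Illustrative schema only; names and types are assumptions, not the final design.
con = duckdb.connect()  # in-memory DuckDB instance

# One row per indexed patch (i.e. per file or file section).
con.execute("""
    CREATE TABLE patch_source (
        patch_id INTEGER PRIMARY KEY,
        path VARCHAR,          -- local path or remote address
        file_format VARCHAR,
        file_version VARCHAR,
        data_type VARCHAR,
        dims VARCHAR
    )
""")

# One row per coordinate per patch (supports arbitrary coordinate names).
# A real implementation would also need to handle datetime coordinates.
con.execute("""
    CREATE TABLE coord_source (
        patch_id INTEGER,      -- references patch_source.patch_id
        name VARCHAR,          -- e.g. 'time', 'distance'
        dtype VARCHAR,
        units VARCHAR,
        coord_min DOUBLE,
        coord_max DOUBLE,
        coord_step DOUBLE
    )
""")

# One row per attribute per patch (supports arbitrary attribute names).
con.execute("""
    CREATE TABLE attr_source (
        patch_id INTEGER,      -- references patch_source.patch_id
        name VARCHAR,          -- e.g. 'tag', 'network', 'station'
        value VARCHAR
    )
""")
```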
Advantages and Other Plans
This schema can then index coordinates and attributes with arbitrary names. To create the contents dataframe (e.g. from `spool.get_contents()`) we only need to pivot the coordinate and attribute tables and join them back to the patch table (a rough sketch of such a query is shown below).

While we are doing this refactor, we can add support for fsspec so that the address column can represent either a local or remote resource. Since DuckDB already has support for the major cloud stores, DASCore spools could handle remote and local data in the same way.
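Here is a rough sketch of that pivot-and-join, continuing from the illustrative connection `con` and tables sketched above (conditional aggregation is used here, though DuckDB's PIVOT statement could likely do the same job):

```python
# Build a wide "contents" dataframe by pivoting the coord/attr tables and
# joining them back to the patch table; the column choices are illustrative.
contents = con.execute("""
    SELECT
        p.path,
        p.file_format,
        MAX(CASE WHEN c.name = 'time' THEN c.coord_min END) AS time_min,
        MAX(CASE WHEN c.name = 'time' THEN c.coord_max END) AS time_max,
        MAX(CASE WHEN c.name = 'distance' THEN c.coord_min END) AS distance_min,
        MAX(CASE WHEN c.name = 'distance' THEN c.coord_max END) AS distance_max,
        MAX(CASE WHEN a.name = 'tag' THEN a.value END) AS tag
    FROM patch_source AS p
    LEFT JOIN coord_source AS c ON c.patch_id = p.patch_id
    LEFT JOIN attr_source AS a ON a.patch_id = p.patch_id
    GROUP BY p.patch_id, p.path, p.file_format
""").df()  # .df() converts the result to a pandas DataFrame
```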
We could also refactor the MemorySpool to simply use an in-memory DuckDB instance. Then, combining spools (e.g., with `spool = spool1 + spool2`) only involves combining the corresponding source tables, either through an append or an upsert operation (see the sketch below).

Additionally, I suspect much of the current chunking implementation can be improved by pushing some of the operations into the database layer (e.g. using as_of merges). I will need to explore this more, but we may be able to have the chunking tables in the database as well.
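For example, appending one spool's rows into another's tables might look roughly like this, again continuing from the illustrative schema above; `other_patches` is a hypothetical dataframe standing in for a second spool's patch rows:

```python
import pandas as pd

# Hypothetical rows representing a second spool's patch table.
other_patches = pd.DataFrame(
    {
        "patch_id": [100],
        "path": ["s3://some-bucket/some_file.hdf5"],
        "file_format": ["DASDAE"],
        "file_version": ["1"],
        "data_type": ["strain_rate"],
        "dims": ["time,distance"],
    }
)

# Register the dataframe with the DuckDB connection, then upsert its rows.
# The primary key on patch_id resolves duplicates; a plain INSERT would be
# the pure append case.
con.register("other_patches", other_patches)
con.execute("INSERT OR REPLACE INTO patch_source SELECT * FROM other_patches")
```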
edit: Added dtype