
support I/O with zarr #1766

Open · wants to merge 15 commits into base: main
Conversation

@Berkant03 (Collaborator) commented Jan 24, 2025

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

Added functions to load a zarr array into memory and to save a DNDarray into a zarr file.

Issue/s resolved: #1632
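The intended parallel round trip could be sketched as follows. This is a stdlib-only simulation (not the PR's actual code; `local_slice` and the dict-based store are stand-ins): on save, each process writes the slice of the global array it owns into the store, and on load the chunks are reassembled into the full array.

```python
# Stdlib sketch of split-aware zarr-style I/O; a dict stands in for the store.
import math

def local_slice(gshape, split, rank, nprocs):
    """Slice of the global array owned by `rank` when split along `split`."""
    n = gshape[split]
    chunk = math.ceil(n / nprocs)
    start = min(rank * chunk, n)
    stop = min(start + chunk, n)
    slices = [slice(None)] * len(gshape)
    slices[split] = slice(start, stop)
    return tuple(slices)

# "Save": 4 processes each write their rows of an 8x3 array into the store.
gshape, split, nprocs = (8, 3), 0, 4
data = [[i * 3 + j for j in range(3)] for i in range(8)]
store = {}
for rank in range(nprocs):
    sl = local_slice(gshape, split, rank, nprocs)
    store[rank] = data[sl[0]]  # analogous to zarr_array[slices] = larray

# "Load": concatenate the chunks back into the global array.
loaded = [row for rank in range(nprocs) for row in store[rank]]
assert loaded == data
```

In the real implementation the per-rank slice comes from the communicator's `chunk` method and the writes go through the zarr API; this sketch only shows the ownership logic.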

Type of change

  • New feature

Memory requirements

For loading, the memory requirement is as large as the dataset being loaded.
Saving an array can require up to double the array's size in memory.

Performance

Save performance depends on the chunk sizes of the zarr array, which in turn depend on factors such as the number of
processes and the shape of the array.
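To illustrate the dependence on process count, here is a small sketch (an assumption about the chunking scheme, not the PR's code): if each process writes one zarr chunk along the split axis, the chunk shape, and therefore the number and granularity of files written, changes with the number of processes.

```python
# Chunk shape along the split axis when each process owns one chunk.
import math

def chunk_shape(gshape, split, nprocs):
    shape = list(gshape)
    shape[split] = math.ceil(gshape[split] / nprocs)
    return tuple(shape)

for nprocs in (1, 2, 4):
    print(nprocs, chunk_shape((1000, 64), 0, nprocs))
# 1 (1000, 64)
# 2 (500, 64)
# 4 (250, 64)
```

Fewer, larger chunks mean fewer files but coarser parallel writes; more processes mean smaller chunks and more files, which is where the performance variation comes from.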

Does this change modify the behaviour of other functions? If so, which?

no

@github-actions github-actions bot added core features testing Implementation of tests, or test-related issues labels Jan 24, 2025
@Berkant03 Berkant03 force-pushed the features/1632-support_I/O_with_zarr branch from 50c1b4e to 04a30be Compare January 24, 2025 09:12
@Berkant03 Berkant03 added enhancement New feature or request I/O labels Jan 24, 2025
@Berkant03 Berkant03 requested a review from mrfh92 January 24, 2025 09:22
@Berkant03 Berkant03 added PR talk dependencies Pull requests that update a dependency file labels Jan 27, 2025
Contributor

Thank you for the PR!


codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 11.11111% with 48 lines in your changes missing coverage. Please review.

Project coverage is 91.91%. Comparing base (3082dd9) to head (c0eac1a).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
heat/core/io.py 11.11% 48 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1766      +/-   ##
==========================================
- Coverage   92.26%   91.91%   -0.36%     
==========================================
  Files          84       84              
  Lines       12447    12501      +54     
==========================================
+ Hits        11484    11490       +6     
- Misses        963     1011      +48     
Flag Coverage Δ
unit 91.91% <11.11%> (-0.36%) ⬇️

Flags with carried forward coverage won't be shown.


@mrfh92 (Collaborator) commented Jan 28, 2025

It looks like the zarr-specific tests have not been run on the codebase CI. The tests are clearly there, but the lines are not covered according to codecov; I guess that's because zarr is an optional dependency and is not installed for the codebase tests, so the tests are all skipped.

@Berkant03 (Collaborator, Author)

Yes, someone with access to the runner needs to add zarr to the pip installation.

@mrfh92 (Collaborator) commented Jan 31, 2025

I have done this together with the PyTorch 2.6.0 support in #1775.

Contributor

Thank you for the PR!

@mrfh92 (Collaborator) commented Jan 31, 2025

@Berkant03 the CI now takes zarr into account. There seems to be a small error somewhere: you can see the error message by clicking on "Details" next to the ci/codebase entry above, and you can then access the full logs of the run by clicking on test-amd. test-cuda has been cancelled, as the error seems to introduce a deadlock.

# Wait for the file creation to finish
MPI_WORLD.handle.Barrier()
Collaborator
You should also be able to use MPI_WORLD.Barrier(); I guess there is some functionality that makes a Heat communicator fall back directly to the .handle attribute if it doesn't know some method.
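The fallback the reviewer guesses at is the classic `__getattr__` delegation pattern. This is a stdlib sketch (an assumption, not Heat's actual implementation; `Communicator` and `FakeMPIComm` are stand-ins): a wrapper that forwards unknown attribute lookups to its underlying MPI handle, so `wrapper.Barrier()` works even if the wrapper itself only defines a subset of the MPI API.

```python
# Delegating unknown attributes to the wrapped MPI handle.
class Communicator:
    def __init__(self, handle):
        self.handle = handle

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; forward to the handle.
        return getattr(self.handle, name)

class FakeMPIComm:
    def Barrier(self):
        return "barrier reached"

comm = Communicator(FakeMPIComm())
assert comm.Barrier() == "barrier reached"  # resolved via the .handle fallback
```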

_, _, slices = MPI_WORLD.chunk(dndarray.gshape, dndarray.split)

zarr_array[slices] = (
dndarray.larray.numpy() # Numpy array needed as zarr can only understand numpy dtypes and infers it.
Collaborator

This seems to cause a problem in the tests if the larray is actually on a GPU.
See https://pytorch.org/docs/stable/generated/torch.Tensor.numpy.html for the options.
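The fix the linked docs point to is moving the tensor to the CPU before converting, i.e. `.cpu().numpy()` instead of `.numpy()`. A stdlib stand-in tensor (torch is deliberately not imported here; `FakeTensor` is a mock) demonstrates the pattern and why the direct call fails on a device tensor.

```python
# Mock tensor: .numpy() fails off-CPU, mirroring torch's behaviour.
class FakeTensor:
    def __init__(self, values, device="cuda"):
        self.values, self.device = values, device

    def cpu(self):
        # Return a copy of the tensor on the CPU (no-op if already there).
        return FakeTensor(self.values, device="cpu")

    def numpy(self):
        if self.device != "cpu":
            # torch raises TypeError for .numpy() on a CUDA tensor
            raise TypeError("can't convert cuda tensor to numpy")
        return list(self.values)

t = FakeTensor([1, 2, 3], device="cuda")
assert t.cpu().numpy() == [1, 2, 3]  # safe regardless of device
```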

if os.path.exists(path) and not overwrite:
raise RuntimeError("Given Path already exists.")

# Zarr functions by chunking the data, where a chunk is a file inside the store.
Collaborator

It might be a good idea to add this as

Notes
-----
...

to the docstring of the function.
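The suggestion could look like the stub below (a sketch only; the real function lives in heat/core/io.py and its name and signature may differ), with the chunking explanation moved into a NumPy-style Notes section.

```python
# Illustrative stub showing a docstring with a Notes section.
def save_zarr(dndarray, path, overwrite=False):
    """Save a DNDarray to a zarr store at `path`.

    Notes
    -----
    Zarr functions by chunking the data, where a chunk is a file inside
    the store. Save performance therefore depends on the chunk sizes of
    the zarr array, which in turn depend on the number of processes and
    the shape of the array.
    """
    raise NotImplementedError("illustrative stub, not the PR's implementation")

assert "Notes" in save_zarr.__doc__
```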

Labels: core, dependencies (Pull requests that update a dependency file), enhancement (New feature or request), features, I/O, PR talk, testing (Implementation of tests, or test-related issues)
Development

Successfully merging this pull request may close these issues: support I/O with zarr.
2 participants