
support I/O with zarr #1766

Open · wants to merge 15 commits into base: main
Conversation

@Berkant03 (Collaborator) commented Jan 24, 2025

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

Added functions to load a zarr array into memory and to save a DNDarray into a zarr file.

Issue/s resolved: #1632
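The intended parallel round trip could be sketched as follows. This is a stdlib-only simulation (not the PR's actual code; `local_slice` and the dict-based store are stand-ins): on save, each process writes the slice of the global array it owns into the store, and on load the chunks are reassembled into the full array.

```python
# Stdlib sketch of split-aware zarr-style I/O; a dict stands in for the store.
import math

def local_slice(gshape, split, rank, nprocs):
    """Slice of the global array owned by `rank` when split along `split`."""
    n = gshape[split]
    chunk = math.ceil(n / nprocs)
    start = min(rank * chunk, n)
    stop = min(start + chunk, n)
    slices = [slice(None)] * len(gshape)
    slices[split] = slice(start, stop)
    return tuple(slices)

# "Save": 4 processes each write their rows of an 8x3 array into the store.
gshape, split, nprocs = (8, 3), 0, 4
data = [[i * 3 + j for j in range(3)] for i in range(8)]
store = {}
for rank in range(nprocs):
    sl = local_slice(gshape, split, rank, nprocs)
    store[rank] = data[sl[0]]  # analogous to zarr_array[slices] = larray

# "Load": concatenate the chunks back into the global array.
loaded = [row for rank in range(nprocs) for row in store[rank]]
assert loaded == data
```

In the real implementation the per-rank slice comes from the communicator's `chunk` method and the writes go through the zarr API; this sketch only shows the ownership logic.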

Type of change

  • New feature

Memory requirements

For loading, the memory requirement is as large as the dataset being loaded.
Saving an array can require up to double the array's size in memory.

Performance

Save performance depends on the chunk sizes of the zarr array, which in turn depend on factors such as the number of
processes and the shape of the array.
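To illustrate the dependence on process count, here is a small sketch (an assumption about the chunking scheme, not the PR's code): if each process writes one zarr chunk along the split axis, the chunk shape, and therefore the number and granularity of files written, changes with the number of processes.

```python
# Chunk shape along the split axis when each process owns one chunk.
import math

def chunk_shape(gshape, split, nprocs):
    shape = list(gshape)
    shape[split] = math.ceil(gshape[split] / nprocs)
    return tuple(shape)

for nprocs in (1, 2, 4):
    print(nprocs, chunk_shape((1000, 64), 0, nprocs))
# 1 (1000, 64)
# 2 (500, 64)
# 4 (250, 64)
```

Fewer, larger chunks mean fewer files but coarser parallel writes; more processes mean smaller chunks and more files, which is where the performance variation comes from.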

Does this change modify the behaviour of other functions? If so, which?

no

@github-actions github-actions bot added core features testing Implementation of tests, or test-related issues labels Jan 24, 2025
@Berkant03 Berkant03 force-pushed the features/1632-support_I/O_with_zarr branch from 50c1b4e to 04a30be Compare January 24, 2025 09:12
@Berkant03 Berkant03 added enhancement New feature or request I/O labels Jan 24, 2025
@Berkant03 Berkant03 requested a review from mrfh92 January 24, 2025 09:22
@Berkant03 Berkant03 added PR talk dependencies Pull requests that update a dependency file labels Jan 27, 2025
Contributor

Thank you for the PR!


codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 11.11111% with 48 lines in your changes missing coverage. Please review.

Project coverage is 91.91%. Comparing base (3082dd9) to head (c0eac1a).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
heat/core/io.py 11.11% 48 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1766      +/-   ##
==========================================
- Coverage   92.26%   91.91%   -0.36%     
==========================================
  Files          84       84              
  Lines       12447    12501      +54     
==========================================
+ Hits        11484    11490       +6     
- Misses        963     1011      +48     
Flag Coverage Δ
unit 91.91% <11.11%> (-0.36%) ⬇️

Flags with carried forward coverage won't be shown.


@mrfh92 (Collaborator) commented Jan 28, 2025

It looks like the zarr-specific tests have not been run on the codebase CI. The tests are clearly there, but the lines are not covered according to codecov; I guess that's because zarr is an optional dependency and is not installed for the codebase tests, so the tests are all skipped.

@Berkant03 (Collaborator, Author)

Yes, someone with access to the runner needs to add zarr to the pip installation.

@mrfh92 (Collaborator) commented Jan 31, 2025

I have done this together with the PyTorch 2.6.0 support in #1775.

Contributor

Thank you for the PR!

@mrfh92 (Collaborator) commented Jan 31, 2025

@Berkant03 the CI now takes zarr into account. There seems to be a small error somewhere: you can see the error message by clicking on "Details" next to the ci/codebase entry above, and you can then access the full logs of the run by clicking on test-amd. test-cuda has been cancelled, as the error seems to introduce a deadlock.

# Wait for the file creation to finish
MPI_WORLD.handle.Barrier()
Collaborator
You should also be able to use MPI_WORLD.Barrier(); I guess there is some functionality that makes a Heat communicator fall back directly to the .handle attribute if it doesn't know some method.
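The fallback the reviewer guesses at is the classic `__getattr__` delegation pattern. This is a stdlib sketch (an assumption, not Heat's actual implementation; `Communicator` and `FakeMPIComm` are stand-ins): a wrapper that forwards unknown attribute lookups to its underlying MPI handle, so `wrapper.Barrier()` works even if the wrapper itself only defines a subset of the MPI API.

```python
# Delegating unknown attributes to the wrapped MPI handle.
class Communicator:
    def __init__(self, handle):
        self.handle = handle

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; forward to the handle.
        return getattr(self.handle, name)

class FakeMPIComm:
    def Barrier(self):
        return "barrier reached"

comm = Communicator(FakeMPIComm())
assert comm.Barrier() == "barrier reached"  # resolved via the .handle fallback
```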

_, _, slices = MPI_WORLD.chunk(dndarray.gshape, dndarray.split)

zarr_array[slices] = (
dndarray.larray.numpy() # Numpy array needed as zarr can only understand numpy dtypes and infers it.
Collaborator

This seems to cause a problem in the tests if the larray is actually on a GPU.
See https://pytorch.org/docs/stable/generated/torch.Tensor.numpy.html for the options.
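The fix the linked docs point to is moving the tensor to the CPU before converting, i.e. `.cpu().numpy()` instead of `.numpy()`. A stdlib stand-in tensor (torch is deliberately not imported here; `FakeTensor` is a mock) demonstrates the pattern and why the direct call fails on a device tensor.

```python
# Mock tensor: .numpy() fails off-CPU, mirroring torch's behaviour.
class FakeTensor:
    def __init__(self, values, device="cuda"):
        self.values, self.device = values, device

    def cpu(self):
        # Return a copy of the tensor on the CPU (no-op if already there).
        return FakeTensor(self.values, device="cpu")

    def numpy(self):
        if self.device != "cpu":
            # torch raises TypeError for .numpy() on a CUDA tensor
            raise TypeError("can't convert cuda tensor to numpy")
        return list(self.values)

t = FakeTensor([1, 2, 3], device="cuda")
assert t.cpu().numpy() == [1, 2, 3]  # safe regardless of device
```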

if os.path.exists(path) and not overwrite:
raise RuntimeError("Given Path already exists.")

# Zarr functions by chunking the data, where a chunk is a file inside the store.
Collaborator

It might be a good idea to add this as

Notes
-----
...

to the docstring of the function.
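The suggestion could look like the stub below (a sketch only; the real function lives in heat/core/io.py and its name and signature may differ), with the chunking explanation moved into a NumPy-style Notes section.

```python
# Illustrative stub showing a docstring with a Notes section.
def save_zarr(dndarray, path, overwrite=False):
    """Save a DNDarray to a zarr store at `path`.

    Notes
    -----
    Zarr functions by chunking the data, where a chunk is a file inside
    the store. Save performance therefore depends on the chunk sizes of
    the zarr array, which in turn depend on the number of processes and
    the shape of the array.
    """
    raise NotImplementedError("illustrative stub, not the PR's implementation")

assert "Notes" in save_zarr.__doc__
```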

Labels: core, dependencies (Pull requests that update a dependency file), enhancement (New feature or request), features, I/O, PR talk, testing (Implementation of tests, or test-related issues)
Development

Successfully merging this pull request may close these issues: support I/O with zarr.
2 participants