pfackeldey
Collaborator

This PR adds to and from safetensors conversions. They're extremely fast at the cost of file size because they do not include any compression. The idea is that all buffers are saved as a long sequence of uncompressed bytes along with metadata that remembers where each buffer starts and stops (similar to an awkward array). Loading mmaps the file, and accessing an individual buffer loads only the corresponding slice into memory. This is basically what zarr does with compression turned off, but with a dynamic chunk size instead of a static one (which is good for us, because we don't have rectangular arrays).
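
To make the layout described above concrete, here is a minimal standard-library sketch of the idea: named buffers written back to back without compression, preceded by a small JSON table of start/stop offsets, and read back through `mmap` so that only the requested slice is copied. The function names `write_buffers` and `read_buffer`, the file name, and the buffer keys are made up for illustration. This is a simplified stand-in, not the safetensors library or the code in this PR.

```python
import json
import mmap
import struct
import tempfile
from pathlib import Path


def write_buffers(path, buffers):
    """Write named byte buffers back to back, preceded by a JSON offset table."""
    header, offset = {}, 0
    for name, data in buffers.items():
        # [start, stop) of this buffer, relative to the start of the data section
        header[name] = [offset, offset + len(data)]
        offset += len(data)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        for data in buffers.values():
            f.write(data)  # raw, uncompressed bytes


def read_buffer(path, name):
    """Memory-map the file and copy out only the slice that belongs to `name`."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_len,) = struct.unpack("<Q", mm[:8])
        header = json.loads(mm[8 : 8 + header_len].decode("utf-8"))
        start, stop = header[name]
        base = 8 + header_len  # the data section starts right after the header
        return bytes(mm[base + start : base + stop])


# Hypothetical buffers for a ragged array [[1.1, 2.2], [], [3.3, 4.4, 5.5]]
buffers = {
    "offsets": struct.pack("<4q", 0, 2, 2, 5),
    "data": struct.pack("<5d", 1.1, 2.2, 3.3, 4.4, 5.5),
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.bin"
    write_buffers(path, buffers)
    offsets = struct.unpack("<4q", read_buffer(path, "offsets"))
    values = struct.unpack("<5d", read_buffer(path, "data"))
```

Because each buffer has its own start and stop, the "chunks" can have different lengths, which is what a ragged array needs.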


codecov bot commented Oct 17, 2025

Codecov Report

❌ Patch coverage is 83.92857% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.69%. Comparing base (b749e49) to head (960c99c).
⚠️ Report is 445 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/awkward/operations/ak_to_safetensors.py | 80.76% | 5 Missing ⚠️ |
| src/awkward/operations/ak_from_safetensors.py | 85.71% | 4 Missing ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| src/awkward/operations/__init__.py | 100.00% <100.00%> (ø) |
| src/awkward/operations/ak_from_safetensors.py | 85.71% <85.71%> (ø) |
| src/awkward/operations/ak_to_safetensors.py | 80.76% <80.76%> (ø) |

... and 197 files with indirect coverage changes


@pfackeldey pfackeldey marked this pull request as ready for review October 17, 2025 12:20

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3685

@pfackeldey pfackeldey requested a review from ianna October 17, 2025 13:47
@pfackeldey
Collaborator Author

Something is looking weird with the API docs of these two functions, but I don't see what I did wrong... Any ideas?

Collaborator

@ianna ianna left a comment


@pfackeldey - excellent work! A few minor comments; please check. Also, the docstring correctly says that destination can be a str, a pathlib.Path, or a file-like object, but the implementation does not explicitly normalize Path objects. While safetensors.numpy.save_file accepts paths, an explicit cast like:

import os
from pathlib import Path

if isinstance(destination, Path):
    destination = os.fspath(destination)

can make behavior more predictable across platforms.
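
A slightly more general variant of that cast, as a sketch only: `normalize_destination` is a hypothetical helper name, not something in the PR. It converts any `str` or `os.PathLike` (which includes `pathlib.Path`) to a plain string with `os.fspath` and passes file-like objects through untouched.

```python
import io
import os
from pathlib import Path


def normalize_destination(destination):
    """Return a plain string for path-like inputs; leave file-like objects as-is."""
    if isinstance(destination, (str, os.PathLike)):
        return os.fspath(destination)
    return destination


as_path = normalize_destination(Path("out") / "array.safetensors")
as_str = normalize_destination("array.safetensors")
file_like = io.BytesIO()
as_file = normalize_destination(file_like)
```

Checking for `os.PathLike` rather than `Path` also covers third-party path types that implement `__fspath__`.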

@ianna ianna added the pr-next-release Required for the next release label Oct 17, 2025
@ianna
Collaborator

ianna commented Oct 17, 2025

> Something is looking weird with the API docs of these two functions, but I don't see what I did wrong... Any ideas?

Ah, this should come first:

    """
    Args:
...

and then the function description, I think.
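
As an illustration of the ordering suggested here, a docstring with the `Args:` block first and the description after it could look like the sketch below. The function name, parameters, and wording are placeholders and are not taken from the PR.

```python
def to_example_format(array, destination):
    """
    Args:
        array: Data to write (placeholder parameter).
        destination: Where to write it (placeholder parameter).

    Writes `array` to `destination`. This description is a placeholder that
    comes after the argument list.
    """
    raise NotImplementedError


doc = to_example_format.__doc__.strip()
```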

@ikrommyd
Collaborator

ikrommyd commented Oct 17, 2025

@pfackeldey do you wanna add tests for every single layout type? You can just copy the layouts from tests/test_3608_to_packed_for_typetracer_backed_arrays.py. I remember adding all the layouts there recently at least. Or tell an LLM to do it actually :)

@pfackeldey
Collaborator Author

> @pfackeldey do you wanna add tests for every single layout type? You can just copy the layouts from tests/test_3608_to_packed_for_typetracer_backed_arrays.py. I remember adding all the layouts there recently at least. Or tell an LLM to do it actually :)

No, this uses ak.to_buffers/ak.from_buffers under the hood, which are already well tested. I don't think it makes sense to add redundant test cases. The conversion here works as long as ak.to_buffers/ak.from_buffers works.

@ikrommyd
Collaborator

ikrommyd commented Oct 20, 2025

@pfackeldey maybe I missed something in the code, but shouldn't you materialize before writing to safetensors? to_buffers doesn't by itself. It spits out VirtualNDArray instances. Maybe to_packed is worth it too?

@pfackeldey
Collaborator Author

> @pfackeldey maybe I missed something in the code, but shouldn't you materialize before writing to safetensors? to_buffers doesn't by itself. It spits out VirtualNDArray instances. Maybe to_packed is worth it too?

good point! I'll add that 👍

@ikrommyd
Collaborator

And I had one more thing that I just thought of. Maybe there should be a check that the array is not typetracer-backed when writing? I'm not sure what other IO functions do; I didn't check before writing this. I am saying this because to_buffers and to_packed will work, but then you'd try to convert a typetracer to bytes, which will probably not give a super clean error.

@pfackeldey
Collaborator Author

> And I had one more thing that I just thought of. Maybe there should be a check that the array is not typetracer-backed when writing? I'm not sure what other IO functions do; I didn't check before writing this. I am saying this because to_buffers and to_packed will work, but then you'd try to convert a typetracer to bytes, which will probably not give a super clean error.

It already fails with a correct and clear error:

... 
TypeError: cannot call 'to_buffers' on an array without concrete data

@ikrommyd
Collaborator

Ah good. I was under the impression from buffers would be fine. I should have tried it before speaking I guess. Thanks for checking.


Labels

pr-next-release Required for the next release


3 participants