-
Notifications
You must be signed in to change notification settings - Fork 98
feat: add to_/from_safetensors #3685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files
🚀 New features to boost your workflow:
|
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3685 |
Something is looking weird with the API docs of these two functions, but I don't see what I did wrong... Any ideas? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pfackeldey - excellent work! A few minor comments, please, check. Also you correctly support str
, pathlib.Path
, or file-like objects for destination in docstring, but the implementation does not explicitly normalize Path
objects. While safetensors.numpy.save_file
accepts paths, an explicit cast like:
import os
from pathlib import Path
if isinstance(destination, Path):
destination = os.fspath(destination)
can make behavior more predictable across platforms.
Ah, this should come first: """
Args:
... and then the function description, I think. |
@pfackeldey do you wanna add tests for every single layout type? You can just copy the layouts from |
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
no, this uses to/from_buffers under-the-hood which is well-tested already. I don't think it makes sense to add redundant test cases. This conversion here works as long as |
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
Co-authored-by: Ianna Osborne <[email protected]>
@pfackeldey maybe I missed something in the code, but shouldn't you materialize before writing to safetensors? |
good point! I'll add that 👍 |
And I had one more thing that I just thought of. Maybe there should be a check that the array is not typetracer-backed when writing? I'm not sure what other IO functions to, I didn't check before writing this . I am saying this because |
it fails with a correct and good error already:
|
Ah good. I was under the impression from buffers would be fine. I should have tried it before speaking I guess. Thanks for checking. |
This PR adds to and from safetensors conversions. They're extremely fast at the cost of file size because they to not include any compression. The idea is that all buffers are saved as a long sequence of uncompressed bytes along with metadata that remembers where each buffers starts and stops (similar to an awkward array). Loading it mmaps the file and accessing individual buffers loads only the corresponding slice into memory. This is basically what zarr does but with a dynamic chunk size instead of a static one (which is good for us, because we don't have rectangular arrays) and when one turns off compression.