@t-reents

This PR adds a readline method to PackedObjectReader to address #174. This matters because the missing method causes errors in aiida-workgraph and aiida-pythonjob.

According to the test and some manual comparison, this seems to be working fine. It would be great if we could iterate quickly and get this in, as @superstar54 was also keen on fixing this for WorkGraph/PythonJob.

@t-reents
Author

Pinging @agoscinski @GeigerJ2 @khsrali for a review, thanks!

@codecov

codecov bot commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 74.82517% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.91%. Comparing base (7b7b593) to head (763f903).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| disk_objectstore/utils.py | 73.72% | 36 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #194      +/-   ##
==========================================
- Coverage   99.63%   97.91%   -1.73%     
==========================================
  Files           8        8              
  Lines        1931     2060     +129     
==========================================
+ Hits         1924     2017      +93     
- Misses          7       43      +36     

@t-reents
Author

mypy is complaining, but I'm not sure whether we actually need the readline method for the other classes as well. In particular, the zip one doesn't seem necessary for the pickle use case, which is the background of this PR.

Contributor

@agoscinski agoscinski left a comment

Explaining the mypy error: PackedObjectReader accepts file handles of type StreamSeekBytesType, which is defined as

StreamSeekBytesType = Union[
    BinaryIO,
    'PackedObjectReader',
    'CallbackStreamWrapper',
    'ZlibLikeBaseStreamDecompresser',
]

Since PackedObjectReader's readline method calls the file handle's readline method, the type checker correctly reports that you need to implement this method for these file handles as well. If you decide not to implement it in the other classes, you at least need to do an isinstance check and raise an error for any type that does not support readline.
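
A minimal sketch of the isinstance guard described above. The classes `NoReadline` and `ReaderSketch` and the tuple `_NO_READLINE_TYPES` are hypothetical stand-ins for illustration, not the real disk_objectstore classes, and the choice of `io.UnsupportedOperation` as the error type is an assumption:

```python
import io
from typing import Optional


class NoReadline:
    """Hypothetical stream class that offers read() but no readline()."""

    def __init__(self, data: bytes) -> None:
        self._stream = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        return self._stream.read(size)


# Hypothetical: the handle types that are known not to implement readline().
_NO_READLINE_TYPES = (NoReadline,)


class ReaderSketch:
    """Hypothetical, stripped-down wrapper; not the real PackedObjectReader."""

    def __init__(self, fhandle) -> None:
        self._fhandle = fhandle

    def readline(self, size: Optional[int] = -1) -> bytes:
        # Fail early with a clear message instead of an AttributeError
        # raised from deep inside the wrapped handle.
        if isinstance(self._fhandle, _NO_READLINE_TYPES):
            raise io.UnsupportedOperation(
                f"readline is not supported for {type(self._fhandle).__name__}"
            )
        return self._fhandle.readline(-1 if size is None else size)
```

With this guard, `ReaderSketch(io.BytesIO(b"a\nb\n")).readline()` returns the first line, while wrapping a `NoReadline` instance raises `io.UnsupportedOperation`.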

return b''

readline_size = remaining if (size is None or size < 0) else min(size, remaining)
line = self._fhandle.readline(readline_size)
Contributor

@agoscinski agoscinski Aug 15, 2025

So this reads only up to some terminator. For BinaryIO at least, that is b'\n' (some file handles may use a different one), and readline_size is just an upper bound on how much is read. I am not sure this is the intended behavior, since my understanding of PackedObjectReader is that it already contains "one line". It seems like you just make reading slower, because you chunk the data arbitrarily wherever a b'\n' occurs, which has no meaning for arbitrary packed objects (unlike the text example in your tests). For packed data, a fixed maximum size that limits memory consumption is more meaningful (for example 65536 bytes, i.e. 64 KiB). You would, however, lose the behavior for text, where the terminator makes sense; IMO that is not important for this abstraction, but maybe I am wrong.
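
To illustrate the point about terminators versus a size cap, here is a small sketch using only `io.BytesIO` from the standard library. The helper names `split_by_readline` and `split_by_chunks` and the sample payload are made up for this example:

```python
import io
from typing import List

# Binary payload in which b"\n" carries no meaning; it is just byte 0x0A.
payload = b"\x00\x01\n\x02\x03\x04\n\x05"


def split_by_readline(data: bytes, size: int = -1) -> List[bytes]:
    """Consume the data with readline(size) until EOF and collect the pieces."""
    stream = io.BytesIO(data)
    return list(iter(lambda: stream.readline(size), b""))


def split_by_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Consume the data with read(chunk_size) until EOF and collect the pieces."""
    stream = io.BytesIO(data)
    return list(iter(lambda: stream.read(chunk_size), b""))


# readline stops at every b"\n", wherever it happens to fall in the payload.
by_line = split_by_readline(payload)

# The size argument is only an upper bound; a newline still ends a piece early.
by_line_capped = split_by_readline(payload, 2)

# Fixed-size reads ignore the terminator and bound memory use per call.
by_chunk = split_by_chunks(payload, 4)
```

Here `by_line` has three pieces of uneven length, while `by_chunk` has two pieces of four bytes each; in both cases joining the pieces gives back the original payload.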

Author

Thanks Alex!
I generally agree, and I was also wondering whether b'\n' really makes sense or whether something "fixed" would be better. In the end, I followed the "new-line" approach, based on what I saw in the readline method of the BytesIO / BufferedReader that would be used when the repository isn't packed: https://github.com/python/cpython/blob/main/Lib/_pyio.py#L509-L518. Moreover, I think I was mostly biased by the pickle application. I just had a quick look there, and it seems that pickle also mostly relies on b'\n' in its readline method.
But again, I agree that this might not be the most reasonable approach for such a packed file.
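
As a rough illustration of why pickle is affected, the sketch below shows that `pickle.load` rejects a file object that has `read` but no `readline`. `ReadOnlyStream` and `ReadlineStream` are hypothetical classes written for this example; the exact exception type depends on whether the C or the pure-Python unpickler is in use, so both are caught:

```python
import io
import pickle


class ReadOnlyStream:
    """Hypothetical stream exposing read() but no readline()."""

    def __init__(self, data: bytes) -> None:
        self._stream = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        return self._stream.read(size)


class ReadlineStream(ReadOnlyStream):
    """Same stream with readline() added, delegating to the inner BytesIO."""

    def readline(self, size: int = -1) -> bytes:
        return self._stream.readline(size)


def try_load(stream):
    """Return the unpickled object, or the exception raised by pickle.load."""
    try:
        return pickle.load(stream)
    except (TypeError, AttributeError) as exc:
        return exc


obj = {"answer": 42}
binary = pickle.dumps(obj)            # default (binary) protocol
text = pickle.dumps(obj, protocol=0)  # protocol 0 uses newline-terminated opcodes

without_readline = try_load(ReadOnlyStream(binary))
with_readline = try_load(ReadlineStream(binary))
with_readline_p0 = try_load(ReadlineStream(text))
```

Here `without_readline` is an exception instance, while both `with_readline` and `with_readline_p0` are equal to `obj`.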

@t-reents
Author

> Since PackedObjectReader's readline method calls the file handle's readline method, the type checker correctly reports that you need to implement this method for these file handles as well. If you decide not to implement it in the other classes, you at least need to do an isinstance check and raise an error for any type that does not support readline.

Sure. My comment was only asking whether it's necessary to implement those (from our practical perspective). If we agree not to do it for now and leave it for another PR, I'd of course add such a check; I totally agree on that.

t-reents and others added 3 commits October 31, 2025 08:37
This commit adds a `readline` method to the `PackedObjectReader`, which
was necessary to make it compatible with pickle. The current version
causes problems in `aiida-workgraph` and `aiida-pythonjob`, see aiidateam#174.
@GeigerJ2 GeigerJ2 force-pushed the fix/PackedObjectReader_readline branch from 8f66e0d to caceab7 Compare October 31, 2025 07:37
@GeigerJ2
Contributor

GeigerJ2 commented Nov 3, 2025

Notes from discussion with @giovannipizzi:

  • Whenever we add readline also add readlines: https://github.com/python/cpython//blob/38d4b436ca767351db834189b3a5379406cd52a8/Lib/_pyio.py#L561 (self becomes self.readline())
  • Add readline to all classes affected
    • BytesIO has it by default
    • PackedObjectReader done in this PR by @t-reents
    • ZlibLikeBaseStreamDecompresser -> still has to be done. Needed when pack_all_loose is called with compress=True (same as the original issue)
      • implement readline on top of read; do not re-implement the complex logic inside read.
      • for readlines just copy from IOBase
      • possibly use the peek implementation (stripping out everything from it that we don't need). Without peek:
      def readline(self, size=-1):
          res = bytearray()
          while size < 0 or len(res) < size:
              b = self.read(1)
              if not b:
                  break
              res += b
              if res.endswith(b"\n"):
                  break
          return bytes(res)
      -> Add benchmarks for this: one long string of random data, and one that is only 1 without \n (10 MB); zip each. Call read with the full size and readline with the full size, and time both. Benchmark an incompressible and a compressible input, and also reading directly from file. The version above is very slow because it calls self.read(n) with small n, while with large n it is wrong if the string contains intermediate \n. Look into the peek implementation.
      -> We are already using _CHUNKSIZE for ZlibLikeBaseStreamDecompresser
      -> Internal code from inner while in separate function for re-use
      -> or implementation without peek
    • CallbackStreamWrapper: implement readline on top of the underlying stream (as currently done for .read, also copy-pasting the callback code; expand the tests in test_callback_stream_wrapper)
      • long string with 3 \n; check that the number of callback calls is as expected
  • Fix typing of StreamSeekBytesType in utils.py (maybe a separate issue; infinite loop of types)
  • Test that methods work for all StreamSeekBytesTypes, performance benchmark for large file
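
One possible shape for a readline built on read that avoids both byte-by-byte reads and the intermediate-\n problem, sketched here without peek. This is an assumption-laden illustration, not the implementation in this PR: `LineBufferedReader` is a hypothetical wrapper, `chunk_size` stands in for something like `_CHUNKSIZE`, and `readlines` follows the hint semantics of `IOBase.readlines`:

```python
import io
from typing import List


class LineBufferedReader:
    """Hypothetical wrapper adding readline()/readlines() to any object with read(n)."""

    def __init__(self, stream, chunk_size: int = 65536) -> None:
        self._stream = stream
        self._chunk_size = chunk_size
        self._buffer = b""  # bytes fetched from the stream but not yet returned

    def read(self, size: int = -1) -> bytes:
        # Serve buffered bytes first so read() and readline() can be mixed.
        if size is None or size < 0:
            data, self._buffer = self._buffer + self._stream.read(), b""
            return data
        if len(self._buffer) < size:
            self._buffer += self._stream.read(size - len(self._buffer))
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def readline(self, size: int = -1) -> bytes:
        limit = None if size is None or size < 0 else size
        while True:
            search_end = len(self._buffer) if limit is None else limit
            newline_at = self._buffer.find(b"\n", 0, search_end)
            if newline_at >= 0:
                end = newline_at + 1  # include the terminator
                break
            if limit is not None and len(self._buffer) >= limit:
                end = limit  # size cap reached before any newline
                break
            chunk = self._stream.read(self._chunk_size)
            if not chunk:
                end = len(self._buffer)  # EOF: return whatever is left
                break
            self._buffer += chunk
        # Bytes after the line stay buffered for the next call.
        line, self._buffer = self._buffer[:end], self._buffer[end:]
        return line

    def readlines(self, hint: int = -1) -> List[bytes]:
        lines: List[bytes] = []
        total = 0
        while True:
            line = self.readline()
            if not line:
                break
            lines.append(line)
            total += len(line)
            if hint is not None and hint > 0 and total >= hint:
                break
        return lines


reader = LineBufferedReader(io.BytesIO(b"first\nsecond\n" + b"x" * 10), chunk_size=4)
first = reader.readline()
capped = reader.readline(3)
mixed = reader.read(2)
rest = reader.readlines()
```

The design choice is to keep the leftover bytes of each chunk in a buffer rather than discarding them, which is why `read` must drain that buffer too. The buffer is re-scanned from the start after every chunk, so a benchmark like the one proposed above would still be needed to judge it for very long lines.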

@GeigerJ2 GeigerJ2 changed the title Add readline method to PackedObjectReader Implement readline(s) to all stream classes Nov 4, 2025
@GeigerJ2 GeigerJ2 changed the title Implement readline(s) to all stream classes Implement readline(s) for all stream-seek classes Nov 4, 2025