Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block digest verification fails on some copied record #123

Open
dlazesz opened this issue Dec 20, 2020 · 5 comments
Open

Block digest verification fails on some copied record #123

dlazesz opened this issue Dec 20, 2020 · 5 comments

Comments

@dlazesz
Copy link

dlazesz commented Dec 20, 2020

I have a WARC archive created with a previous version of warcio library about a year ago. Copying some records to another record is done without error (with the current version of warcio), but the later verification fails. See the attached code and example warc (input.warc.gz):

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

print('Validate input')
with open('input.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Negate the full condition for simplicity (Select the problematic record)
        if not (record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') ==
                'https://www.origo.hu/hir-archivum/2019/20190119.html'):
            writer.write_record(record)

print('Validate output without problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Select the problematic record
        if record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') == \
         'https://www.origo.hu/hir-archivum/2019/20190119.html':
            writer.write_record(record) 

print('Validate output just the problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):  # This will fail
        pass

The output is the following:

Validate input
Validate output without problematic URL
Validate output just the problematic URL
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    for record in ArchiveIterator(stream, check_digests='raise'):
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 119, in _iterate_records
    self.read_to_end()
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 212, in read_to_end
    b = self.record.raw_stream.read(BUFF_SIZE)
  File "/usr/local/lib/python3.6/dist-packages/warcio/limitreader.py", line 27, in read
    return self._update(buff)
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 99, in _update
    self.digest_checker.problem('block digest failed: {}'.format(self.block_digest))
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 31, in problem
    raise ArchiveLoadFailed(value)
warcio.exceptions.ArchiveLoadFailed: block digest failed: sha1:HRYQXT3HWIGWZMDEAAGYGEJGK334YEGM

The expected behavior would be to raise exception earlier:

  • When input.warc.gz is read
  • When the problematic record is copied to the output.warc.gz

The current state makes the user imply that the output.warc.gz is valid until it is re-read.

BTW: The behavior of check_digests=True equals to check_digests=False which is not what one would expect.

@wumpus
Copy link
Collaborator

wumpus commented Dec 22, 2020

There are multiple possible bugs here, one possibility is that the copy is writing the wrong block digest, perhaps because it changed the block and kept the same digest. If I can see the input file, it would be helpful.

The check_digests API bug is a separate one, I'm not sure how I made that mistake, but I'll open a separate bug for it (#124)

@dlazesz
Copy link
Author

dlazesz commented Dec 22, 2020

@wumpus

If I can see the input file, it would be helpful.

The input file is attached to the OP and the bug should be reproducible with it: https://github.com/webrecorder/warcio/files/5721385/input.warc.gz

Thank you for investigating the issue!

@wumpus
Copy link
Collaborator

wumpus commented Dec 23, 2020

Thanks, I didn't notice input.warc.gz was a link.

The difference between input.warc and output.warc is that they have the same digest, but the content length is one octet shorter for out.warc. And lo and behold, right at the top, I see

input: HTTP/1.1 200 ^M # trailing whitespace, no OK
out: HTTP/1.1 200^M # no trailing whitespace

Which is to say, there's trailing whitespace in the http headers in input and not in out.

How... interesting! I was thinking my digest-checking code was the guilty party, but instead it could be that the copy is dropping that http header trailing whitespace while repeating the same digest? Changing the output by dropping trailing whitespace is dodgy, repeating the digest is much dodgier.

@wumpus
Copy link
Collaborator

wumpus commented Dec 23, 2020

While I'm here I'll also mention that Transfer-Encoding: chunked is in both input and output headers, but it's not actually chunked. This is a common problem in warcs. warcio happily reads these warcs by falling back to non-chunked.

@wumpus
Copy link
Collaborator

wumpus commented Dec 23, 2020

@ikreymer I see two choices, you probably have an opinion:

  1. Save the verbatim header bytes while reading, and use them on writing if the user has not modified anything, keeping the digests exactly the same if it's an exact copy
  2. Keep the current header parsing, which might change the header on reading, and always recompute digests on write

Option 1 is a "first, do no harm" philosophy, but it will be a little ugly to notice changes to the headers between read and write. Option 2 is a small code change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants