Block digest verification fails on some copied record #123

Open
@dlazesz

I have a WARC archive created about a year ago with an earlier version of the warcio library. Copying some of its records to another archive with the current version of warcio completes without error, but verifying the copy afterwards fails. See the code below and the attached example WARC (input.warc.gz):

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

print('Validate input')
with open('input.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Negate the full condition for simplicity (Select the problematic record)
        if not (record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') ==
                'https://www.origo.hu/hir-archivum/2019/20190119.html'):
            writer.write_record(record)

print('Validate output without problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Select the problematic record
        if record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') == \
         'https://www.origo.hu/hir-archivum/2019/20190119.html':
            writer.write_record(record) 

print('Validate output just the problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):  # This will fail
        pass

The output is the following:

Validate input
Validate output without problematic URL
Validate output just the problematic URL
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    for record in ArchiveIterator(stream, check_digests='raise'):
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 119, in _iterate_records
    self.read_to_end()
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 212, in read_to_end
    b = self.record.raw_stream.read(BUFF_SIZE)
  File "/usr/local/lib/python3.6/dist-packages/warcio/limitreader.py", line 27, in read
    return self._update(buff)
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 99, in _update
    self.digest_checker.problem('block digest failed: {}'.format(self.block_digest))
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 31, in problem
    raise ArchiveLoadFailed(value)
warcio.exceptions.ArchiveLoadFailed: block digest failed: sha1:HRYQXT3HWIGWZMDEAAGYGEJGK334YEGM
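
For reference, the value in the error message has the usual `sha1:` plus base32 form of a WARC block digest. The following is a minimal sketch of how such a value could be recomputed independently for comparison, assuming the digest is a SHA-1 over the raw record block (for a response record, the HTTP headers plus the body). The function name `warc_sha1_digest` is hypothetical and not part of warcio.

```python
import base64
import hashlib


def warc_sha1_digest(block: bytes) -> str:
    """Return a 'sha1:<base32>' digest string for the given record block.

    Assumption: the digest covers exactly the bytes passed in; the caller
    is responsible for supplying the raw block of the record.
    """
    raw = hashlib.sha1(block).digest()           # 20-byte binary digest
    encoded = base64.b32encode(raw).decode('ascii')  # 32 base32 characters
    return 'sha1:' + encoded
```

Comparing the result for the block extracted from input.warc.gz and from out.warc.gz would show whether the block bytes changed during the copy or whether only the header value differs.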

The expected behavior would be to raise an exception earlier, at one of these points:

  • When input.warc.gz is read
  • When the problematic record is copied to out.warc.gz

As it stands, the user is led to assume that out.warc.gz is valid until it is re-read.

BTW: check_digests=True behaves the same as check_digests=False, which is not what one would expect.
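
A possible way to surface failures without raising is sketched below. It rests on an assumption that should be checked against the installed warcio version: that with check_digests=True each record exposes a `digest_checker` object with a `passed` attribute (True, False, or None when there is no digest to check) and a `problems` list, and that these are only final once the record body has been consumed. The helper name `digest_failures` is hypothetical.

```python
def digest_failures(records, chunk_size=8192):
    """Collect (target URI, problems) pairs for records whose digest check failed.

    Assumption: each record has `raw_stream`, `rec_headers.get_header()` and a
    `digest_checker` with `passed` and `problems`, as described in the lead-in.
    """
    failures = []
    for record in records:
        # Consume the whole body so that the digest can be evaluated.
        while record.raw_stream.read(chunk_size):
            pass
        checker = record.digest_checker
        if checker.passed is False:
            uri = record.rec_headers.get_header('WARC-Target-URI')
            failures.append((uri, list(checker.problems)))
    return failures


# Intended usage (requires warcio; not executed here):
# from warcio.archiveiterator import ArchiveIterator
# with open('out.warc.gz', 'rb') as stream:
#     for uri, problems in digest_failures(ArchiveIterator(stream, check_digests=True)):
#         print(uri, problems)
```

If this assumption holds, check_digests=True would differ from check_digests=False only in that the result is recorded silently, which would explain why no difference is visible from the outside.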
