Block digest verification fails on some copied record

I have a WARC archive that was created about a year ago with an earlier version of the warcio library. Copying some of its records to another archive with the current version of warcio completes without error, but verifying the copy afterwards fails. See the code below and the attached example WARC ([input.warc.gz](https://github.com/webrecorder/warcio/files/5721385/input.warc.gz)):

```python
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

print('Validate input')
with open('input.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Copy every record except the problematic one (the selecting condition is negated)
        if not (record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') ==
                'https://www.origo.hu/hir-archivum/2019/20190119.html'):
            writer.write_record(record)

print('Validate output without problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Select the problematic record
        if (record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') ==
                'https://www.origo.hu/hir-archivum/2019/20190119.html'):
            writer.write_record(record)

print('Validate output just the problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):  # This will fail
        pass
```

The output is the following:

```text
Validate input
Validate output without problematic URL
Validate output just the problematic URL
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    for record in ArchiveIterator(stream, check_digests='raise'):
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 119, in _iterate_records
    self.read_to_end()
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 212, in read_to_end
    b = self.record.raw_stream.read(BUFF_SIZE)
  File "/usr/local/lib/python3.6/dist-packages/warcio/limitreader.py", line 27, in read
    return self._update(buff)
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 99, in _update
    self.digest_checker.problem('block digest failed: {}'.format(self.block_digest))
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 31, in problem
    raise ArchiveLoadFailed(value)
warcio.exceptions.ArchiveLoadFailed: block digest failed: sha1:HRYQXT3HWIGWZMDEAAGYGEJGK334YEGM
```
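
For reference, the value in the error message has the shape of a Base32-encoded SHA-1 hash with a `sha1:` prefix. Below is a minimal sketch of how such a digest could be recomputed by hand, assuming it covers the raw record block (everything after the WARC headers). `block_digest` is a hypothetical helper written for illustration, not part of warcio:

```python
import base64
import hashlib


def block_digest(block: bytes) -> str:
    """Return a WARC-style digest string for the given record block.

    Assumption: the digest is SHA-1 over the raw block bytes, Base32-encoded.
    """
    sha1 = hashlib.sha1(block).digest()
    # 20 bytes of SHA-1 encode to exactly 32 Base32 characters, no padding
    return 'sha1:' + base64.b32encode(sha1).decode('ascii')
```

Comparing the result for the block of the copied record with its `WARC-Block-Digest` header should show whether the header or the block changed during the copy.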

The expected behavior would be to raise the exception earlier, either:
- when `input.warc.gz` is read, or
- when the problematic record is copied to `out.warc.gz`.

As it stands, the user is led to assume that `out.warc.gz` is valid until it is re-read.

BTW: `check_digests=True` appears to behave the same as `check_digests=False`, which is not what one would expect.

