I have a WARC archive that was created about a year ago with an earlier version of the warcio library. Copying some of its records to another archive with the current version of warcio completes without error, but verifying the result afterwards fails. See the attached code and example WARC (input.warc.gz):
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

print('Validate input')
with open('input.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Negate the full condition for simplicity (select everything except the problematic record)
        if not (record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') ==
                'https://www.origo.hu/hir-archivum/2019/20190119.html'):
            writer.write_record(record)

print('Validate output without problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Select only the problematic record
        if record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') == \
                'https://www.origo.hu/hir-archivum/2019/20190119.html':
            writer.write_record(record)

print('Validate output just the problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):  # This will fail
        pass
The output is the following:
Validate input
Validate output without problematic URL
Validate output just the problematic URL
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    for record in ArchiveIterator(stream, check_digests='raise'):
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 119, in _iterate_records
    self.read_to_end()
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 212, in read_to_end
    b = self.record.raw_stream.read(BUFF_SIZE)
  File "/usr/local/lib/python3.6/dist-packages/warcio/limitreader.py", line 27, in read
    return self._update(buff)
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 99, in _update
    self.digest_checker.problem('block digest failed: {}'.format(self.block_digest))
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 31, in problem
    raise ArchiveLoadFailed(value)
warcio.exceptions.ArchiveLoadFailed: block digest failed: sha1:HRYQXT3HWIGWZMDEAAGYGEJGK334YEGM
The expected behavior would be to raise the exception earlier, either:
- when input.warc.gz is read, or
- when the problematic record is copied to out.warc.gz.
As it stands, the user is led to assume that out.warc.gz is valid until it is re-read.
BTW: check_digests=True behaves the same as check_digests=False, which is not what one would expect.
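For anyone who wants to cross-check a digest independently of warcio: the value in the error message above has the form `sha1:` followed by 32 base32 characters, which is how a SHA-1 block digest is commonly written in WARC headers. The sketch below recomputes such a value with the standard library only. The function names are hypothetical (they are not part of warcio), and it assumes you have already extracted the raw record block, i.e. the `Content-Length` bytes that follow the WARC headers.

```python
import base64
import hashlib


def warc_sha1_digest(block: bytes) -> str:
    """Return 'sha1:' + base32(SHA-1(block)), the usual WARC digest notation.

    Hypothetical helper for illustration; not a warcio function.
    """
    raw = hashlib.sha1(block).digest()
    return 'sha1:' + base64.b32encode(raw).decode('ascii')


def block_digest_matches(block: bytes, header_value: str) -> bool:
    """Compare a recomputed digest with a WARC-Block-Digest header value."""
    return warc_sha1_digest(block) == header_value


if __name__ == '__main__':
    # Digest of an empty block, as a quick sanity check.
    print(warc_sha1_digest(b''))
```

Running the same comparison on the problematic record's block, both in input.warc.gz and in out.warc.gz, should show whether the header value or the block content is what differs after the copy.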