Block digest verification fails on some copied record #123
Comments
There are multiple possible bugs here. One possibility is that the copy is writing the wrong block digest, perhaps because it changed the block and kept the same digest. It would be helpful if I could see the input file. The check_digests API bug is a separate one; I'm not sure how I made that mistake, but I'll open a separate bug for it (#124).
The input file is attached to the OP and the bug should be reproducible with it: https://github.com/webrecorder/warcio/files/5721385/input.warc.gz Thank you for investigating the issue!
Thanks, I didn't notice the attachment. The difference between input.warc and output.warc is that they have the same digest, but the content length is one octet shorter for output.warc. And lo and behold, right at the top, I see input:
Which is to say, there's trailing whitespace in the HTTP headers in the input and not in the output. How... interesting! I was thinking my digest-checking code was the guilty party, but instead it could be that the copy is dropping that trailing whitespace in the HTTP headers while repeating the same digest. Changing the output by dropping trailing whitespace is dodgy; repeating the digest is much dodgier.
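To make the failure mode above concrete, here is a minimal stdlib-only sketch. It does not use warcio and the block contents are made up for illustration (the position of the trailing space is an assumption); it only shows that dropping one octet of trailing whitespace from the HTTP headers shortens the block by one and invalidates a digest that was computed over the original bytes.

```python
import base64
import hashlib


def warc_sha1(data: bytes) -> str:
    """Return a digest in the 'sha1:<base32>' form commonly seen in WARC headers."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


# Hypothetical record block: HTTP headers followed by a body.
# The first line carries one trailing space (an assumption for illustration).
original_block = (
    b"HTTP/1.1 200 OK \r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"hello"
)

# Simulate a copy that rewrites header lines without their trailing whitespace.
head, sep, body = original_block.partition(b"\r\n\r\n")
stripped_head = b"\r\n".join(line.rstrip() for line in head.split(b"\r\n"))
copied_block = stripped_head + sep + body

# The copy is shorter, and the digest of the original no longer describes it.
length_delta = len(original_block) - len(copied_block)
digests_match = warc_sha1(original_block) == warc_sha1(copied_block)

print(length_delta)
print(digests_match)
```

If a writer changes the block like this, it would also need to recompute the digest; carrying the old value over produces exactly the mismatch that shows up on re-read.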
While I'm here I'll also mention that
@ikreymer I see two choices, you probably have an opinion:
Option 1 is a "first, do no harm" philosophy, but it will be a little ugly to notice changes to the headers between read and write. Option 2 is a small code change.
I have a WARC archive created about a year ago with a previous version of the warcio library. Copying some records to another archive completes without error (with the current version of warcio), but the later verification fails. See the attached code and example WARC (input.warc.gz):
The output is the following:
The expected behavior would be to raise an exception earlier:
The current behavior leads the user to assume that output.warc.gz is valid until it is re-read.
BTW: the behavior of check_digests=True is the same as check_digests=False, which is not what one would expect.
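As a rough illustration of what "raise an exception earlier" could look like, here is a stdlib-only sketch of a check that could run on a record's block before it is written. This is not warcio's API and not the attached code: `DigestMismatch` and `check_block_digest` are hypothetical names, and the header value is assumed to be in the `algorithm:base32digest` form.

```python
import base64
import hashlib


class DigestMismatch(ValueError):
    """Hypothetical error for a block whose digest header is stale."""


def check_block_digest(header_value: str, block: bytes) -> None:
    """Raise DigestMismatch if `block` does not hash to `header_value`.

    `header_value` is assumed to look like 'sha1:<base32 digest>'.
    """
    algorithm, _, expected = header_value.partition(":")
    computed = base64.b32encode(hashlib.new(algorithm, block).digest()).decode("ascii")
    # Compare without base32 padding, since headers may or may not include it.
    if computed.rstrip("=") != expected.upper().rstrip("="):
        raise DigestMismatch(
            f"block digest mismatch: header says {expected}, computed {computed}"
        )


# Usage with made-up data: the digest matches the original block only.
block = b"HTTP/1.1 200 OK \r\n\r\nhello"
good = "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")

check_block_digest(good, block)  # passes silently

try:
    # Same digest, but the block lost its trailing space before the CRLF.
    check_block_digest(good, block.replace(b" \r\n", b"\r\n", 1))
except DigestMismatch as exc:
    print(exc)
```

Running a check like this at write time would surface the stale digest while copying, instead of only when the output is re-read.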