Check whether URL is in cdxj index before trying to generate a thumbnail #475

Closed
lwrubel opened this issue Jun 17, 2022 · 3 comments · Fixed by #490
Labels: web archiving for June-July 2022 work cycle

Comments

@lwrubel
Contributor

lwrubel commented Jun 17, 2022

If a seed is registered but the related WARCs have not yet been indexed (typically because they are stuck in a wasCrawlPreassemblyWF step, although other situations can cause this), the thumbnail for the seed will show a "not found" playback page. Example of a seed with a thumbnail showing a "not found" error.

Before the wasSeedPreassembly thumbnail-generator step's capture method is called, check whether the URL is found in the cdxj.

It should be possible to query the pywb cdxj API for a URL.

Query for a site that is in the index:
https://was-pywb-stage.stanford.edu/was/cdx?url=https://library.stanford.edu

Site that is not in the index:
https://was-pywb-stage.stanford.edu/was/cdx?url=https://www.loc.gov

If the response contains no lines, the step should log an error and stop. Because no thumbnail is created in that case, the step can be re-run successfully once the web archive has the content.
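The check described above could look roughly like the following. This is a minimal Python sketch and not the project's actual implementation: the names `fetch_cdx`, `url_in_index`, `ensure_indexed`, and `UrlNotIndexedError` are hypothetical, the stage endpoint is taken from the example queries in this issue, and it assumes the CDX endpoint answers an unindexed URL with an empty body rather than an HTTP error.

```python
import logging
import urllib.parse
import urllib.request

logger = logging.getLogger(__name__)

# Endpoint from the example queries in this issue; treat it as a placeholder.
CDX_ENDPOINT = "https://was-pywb-stage.stanford.edu/was/cdx"


class UrlNotIndexedError(RuntimeError):
    """Hypothetical error raised when a seed URL has no index entries."""


def fetch_cdx(endpoint, url, timeout=30):
    """Return the raw CDX response body for `url` as text.

    Assumption: an unindexed URL yields an empty body, not an HTTP error.
    """
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{endpoint}?{query}", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def url_in_index(url, endpoint=CDX_ENDPOINT, fetch=fetch_cdx):
    """Return True if the CDX response contains at least one non-blank line."""
    body = fetch(endpoint, url)
    return any(line.strip() for line in body.splitlines())


def ensure_indexed(url, endpoint=CDX_ENDPOINT, fetch=fetch_cdx):
    """Log an error and raise if `url` is not in the index yet."""
    if not url_in_index(url, endpoint, fetch):
        logger.error("%s not found in cdxj index; not generating a thumbnail", url)
        raise UrlNotIndexedError(url)
```

The `fetch` parameter is injectable so the check can be exercised without network access. A caller would run `ensure_indexed(seed_url)` before invoking the capture and let the raised error fail the workflow step, leaving it available to re-run later.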

@lwrubel
Contributor Author

lwrubel commented Jun 17, 2022

@peterchanws and @andrewjbtw: how does this approach sound to you? At one point, we had discussed creating a placeholder thumbnail image instead. However, replacing that placeholder thumbnail with a real thumbnail sounded like a pain point. The main culprit leading to content not being in the archive was the wasMetadataGenerator step, which has since been removed, so the "not found" situation should be less frequent.

@andrewjbtw
Collaborator

If re-running the thumbnail generator step after the URL is indexed in SWAP reliably generates a thumbnail, that would be a big improvement. If there are still cases where you have to manually skip thumbnail generation in the workflow and supply the thumbnail yourself, there will probably still be an improvement, but its size will depend on how much more reliable the thumbnail generation step becomes.

In the current system, thumbnails cannot be generated for a few reasons in addition to content not being found:

  • system hits a timeout while waiting for the page to load
  • too many redirects

In those cases, supplying a thumbnail manually will still be required. A placeholder to get past the step could be an improvement here because the failure of the workflow step itself is a pain point, in some ways more of a pain point than the steps to manually supply the thumbnail.

@lwrubel
Contributor Author

lwrubel commented Jun 21, 2022

#483 adds a 30-second wait before puppeteer takes the snapshot, addressing #446. That is intended to resolve most of the timeout issue. I've added a note about the too-many-redirects scenario to #81.
