If a seed is registered but the related WARCs have not yet been indexed (typically, they have been stuck in a wasCrawlPreassemblyWF step but other situations can cause this), the thumbnail for a seed will be a "not found" playback page. Example of a seed with a thumbnail showing a not found error.
Before the wasSeedPreassembly thumbnail-generator step's capture method is called, check whether the URL is found in the cdxj.
It should be possible to check the pywb cdxj API for a URL.
Query for a site that is in the index: https://was-pywb-stage.stanford.edu/was/cdx?url=https://library.stanford.edu
Site that is not in the index: https://was-pywb-stage.stanford.edu/was/cdx?url=https://www.loc.gov
If there are no lines in the response, the step should log an error and stop the workflow step. Because no thumbnail is created in that case, the step can be re-run successfully once the web archive has the content.
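As a rough illustration of the check described above, here is a minimal Python sketch. It is not the robot's actual code: the names `cdx_query_url`, `has_captures`, `fetch_text`, `ensure_indexed`, and `SeedNotIndexedError` are hypothetical, and the endpoint is simply the stage URL quoted in this issue. It assumes that an empty response body from the pywb cdx endpoint means the URL has no captures in the index.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the stage endpoint quoted in this issue; adjust per environment.
CDX_ENDPOINT = "https://was-pywb-stage.stanford.edu/was/cdx"


class SeedNotIndexedError(RuntimeError):
    """Raised when the seed URL has no lines in the cdx response (hypothetical)."""


def cdx_query_url(seed_url, endpoint=CDX_ENDPOINT):
    """Build the cdx lookup URL; the seed URL is percent-encoded as a parameter."""
    return endpoint + "?" + urlencode({"url": seed_url})


def has_captures(cdx_body):
    """Return True if the response body contains at least one non-blank line."""
    return any(line.strip() for line in cdx_body.splitlines())


def fetch_text(url, timeout=30):
    """Fetch a URL and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def ensure_indexed(seed_url, fetch=fetch_text):
    """Stop before thumbnail capture if the seed URL is not in the index."""
    body = fetch(cdx_query_url(seed_url))
    if not has_captures(body):
        raise SeedNotIndexedError(
            f"{seed_url} not found in cdx index; re-run this step after indexing"
        )
```

The `fetch` parameter is injectable so the check can be exercised without network access. In the workflow, the raised error would be logged and the step left in an error state so it can be re-run later.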
@peterchanws and @andrewjbtw: how does this approach sound to you? At one point, we had discussed creating a placeholder thumbnail image instead. However, replacing that placeholder thumbnail with a real thumbnail sounded like a pain point. The main culprit leading to content not being in the archive was the wasMetadataGenerator step that's been removed so the "not found" situation should be less frequent.
If re-running the thumbnail generator step after the URL is indexed in SWAP would reliably generate a thumbnail, that would be a big improvement. If there will still be cases where you have to manually skip the thumbnail generation in the workflow and then supply the thumbnail yourself, then there will probably still be an improvement but how much of an improvement will depend on how much more reliable the thumbnail generation step gets.
In the current system, thumbnails cannot be generated for a few reasons in addition to content not being found:
- system hits a timeout while waiting for the page to load
- too many redirects
In those cases, supplying a thumbnail manually will still be required. A placeholder to get past the step could be an improvement here because the failure of the workflow step itself is a pain point, in some ways more of a pain point than the steps to manually supply the thumbnail.
#483 adds a 30-second wait before puppeteer takes the snapshot, addressing #446. That is intended to resolve most of the timeout cases. I've added a note about the too-many-redirects scenario to #81.