Clarification of playback index requirements #69
Comments
I think we know some of this, but I'm after a clarification. For example (this is based on some comments from @ato, thanks!): For OutbackCDX, we can populate it with records generated using pywb's cdx-indexer tool, or another tool that meets that specification. Where is that specification? And does this need a recent version of OutbackCDX? (I seem to remember there was some issue with POST request indexing before)? And is the cdx-indexer one up-to-date? I thought you'd changed it recently and it wasn't everywhere yet? For matching, OutbackCDX has a -y command-line for loading a pywb fuzzy matching file (rules.yaml). Is this the same fuzzy matching ruleset/approach as ReplayWeb.page? Do we have to worry about keeping the match rules in sync? |
The short answer is that an index produced by cdx-indexer or cdxj-indexer should work when added to OutbackCDX, as long as prefix queries work. The longer answer is that the process of converting a request/response pair to a URL for inclusion in the cdx(j) index should definitely be documented better. It is currently implemented across several libraries, in both Python and JavaScript.
The actual fuzzy matching is done on the client side of the index fetch, either in wabac.js or in pywb, by performing a prefix query against the index server. The fuzzy matching is implemented slightly differently in wabac.js and pywb, but the end results are mostly the same. The domain-specific rules exist at the client level, as the prefix-based query allows for more flexibility in how the fuzzy matching is done. (An alternative that avoids prefix queries is to create 'fake cdx' entries which can be queried with an exact match. That makes the dependency on the index a bit more strict, so this approach is only used when prefix querying is not available.) Currently, connecting ReplayWeb.page to OutbackCDX is not yet fully supported, but definitely could be, especially in combination with nla/outbackcdx#79. For this, it should be possible to use cdx created via cdxj-indexer/cdx-indexer. But the transformation to a post-aware URL should definitely be documented somewhere as a first step. |
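As a rough illustration of the prefix-query approach described above, here is a minimal sketch. It is not the pywb or wabac.js implementation: the in-memory `INDEX`, the `IGNORED_PARAMS` rule, and all function names are hypothetical, chosen only to show how a client can fetch candidates by prefix and then apply its own matching rules.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical in-memory stand-in for an index server.
# A real index would hold CDX(J) records, not bare URLs.
INDEX = sorted([
    "https://example.com/api/items?id=7&_=1111",
    "https://example.com/api/items?id=8&_=2222",
    "https://example.com/other",
])

# Assumed rule for this sketch: "_" is a cache-busting parameter to ignore.
IGNORED_PARAMS = {"_"}


def prefix_query(index, prefix):
    """Return every indexed URL that starts with the given prefix."""
    return [url for url in index if url.startswith(prefix)]


def significant_params(url):
    """Return the sorted query parameters that are not ignored by the rule."""
    query = urlsplit(url).query
    pairs = parse_qsl(query, keep_blank_values=True)
    return sorted((k, v) for k, v in pairs if k not in IGNORED_PARAMS)


def fuzzy_lookup(index, url):
    """Try an exact match first, then fall back to a prefix query plus rules."""
    if url in index:
        return url
    prefix = url.split("?", 1)[0]
    wanted = significant_params(url)
    for candidate in prefix_query(index, prefix):
        if significant_params(candidate) == wanted:
            return candidate
    return None
```

Because the rules live in the client, they can be changed without touching the index; the index server only has to answer prefix queries.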
Thanks for clarifying @ikreymer - so, the pywb rules are now supported by OutbackCDX but it is not strictly necessary to deploy them there, right? i.e. you can do it if it looks like yielding better performance, but it's not a requirement? |
I've just been experimenting with, so I'm going to drop some notes here (I guess I'll want to review and add on top of webrecorder/pywb#588 ?). Firstly, support for POST parameters was added to OutbackCDX here, and as such, OutbackCDX >= 0.8.0 is needed to support this feature. It's also worth noting that the OutbackCDX implementation does not use the urlkey provided by the indexer, but instead calculates its own key, and adds the Also note that at the current time, webarchive-discovery does not support this convention. |
|
Actually, I'm looking at an old replayweb.page WARC where it altered the WARC-Target-URI, so I've likely just confused myself. |
Yeah OK, indeed the pywb cdx-indexer and cdxj-indexer only append to the canonicalised URL field and leave the original URL untouched. OutbackCDX's normal behaviour is to discard the canonicalised field, as it uses its own urlkey generation. PR 91, which copies the __wb_post_data parameter back to the original URL, is not enough, as JSON requests don't even use that field but rather create a synthetic query string based on the JSON structure. To properly implement that strategy it'd need to look for __wb_method= and copy it and everything after it. What our custom indexing code at NLA does (and what I wrongly thought cdx-indexer did) is add the extra parameters to the original URL field. That then works fine with pywb reading it and with all versions of OutbackCDX. |
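The "look for __wb_method= and copy it and everything after it" strategy could be sketched roughly as follows. This is a literal reading of the comment above, not code from OutbackCDX or pywb; the function name `append_post_params` is hypothetical, and it assumes the marker appears in the canonicalised key field exactly as written.

```python
MARKER = "__wb_method="


def append_post_params(urlkey: str, original_url: str) -> str:
    """Copy the POST-derived suffix from the canonicalised key to the original URL.

    Returns the original URL unchanged when the key has no marker, or when
    the original URL already carries the extra parameters.
    """
    idx = urlkey.find(MARKER)
    if idx == -1 or MARKER in original_url:
        return original_url
    suffix = urlkey[idx:]  # the marker and everything after it
    separator = "&" if "?" in original_url else "?"
    return original_url + separator + suffix
```

One caveat with this sketch: the canonicalised key has already been normalised, so the copied parameters may differ from the raw request data (for example in letter case).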
Also, (and yes this needs to be documented and will be soon), cdxj-indexer and pywb cdx-indexer also add |
Is that just |
I think there are two separate(ish) problems: One is making sure OutbackCDX+ On the former question, I think I'm right in saying that, at the current moment, I can't use |
Yes, it seems currently OutbackCDX isn't quite in sync with the latest indexing, but should be fixable. |
So maybe we just need an option that, instead of writing separate "url", "requestBody", "method" fields, combines them into the URL field (initially I was just hesitant to not store the original URL at all). That would be optional for cdxj and a requirement (since new fields can't be added) for cdx output with POST canonicalization. |
I just updated to Pywb 2.6.0 and also implemented form-urlencoded POST request encoding in our indexing pipeline (previously we were only doing it for JSON requests). While testing that I discovered that Pywb wasn't actually passing the POST-encoded version of the url through to OutbackCDX. It turns out the pages I was testing earlier didn't care too much about getting the correct graphql response as long as they just got any response, so it was "working" purely by accident. 🤦 For now I seem to have gotten things working by patching XmlQueryIndexSource to prefer params['alt_url'] (which has the POST data encoded into it) over params['url']. But I'm not at all confident that's the correct solution. |
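To make the proposed option concrete, here is a small sketch that folds the separate "method" and "requestBody" fields of a CDXJ line into its "url" field. It is not an existing cdxj-indexer option: the function name `fold_post_fields` is hypothetical, and it assumes that "requestBody" already holds a query-string style value and that the `__wb_method=` convention mentioned earlier in the thread is used as the separator.

```python
import json


def fold_post_fields(cdxj_line: str) -> str:
    """Merge "method" and "requestBody" into "url" for one CDXJ line.

    A CDXJ line is assumed to look like: "<urlkey> <timestamp> <json>".
    """
    urlkey, timestamp, raw_json = cdxj_line.split(" ", 2)
    fields = json.loads(raw_json)
    method = fields.pop("method", None)
    body = fields.pop("requestBody", None)
    if method is not None:
        separator = "&" if "?" in fields["url"] else "?"
        extra = "__wb_method=" + method
        if body:
            extra += "&" + body
        # Note: after this step the original URL is no longer stored on its own.
        fields["url"] += separator + extra
    return f"{urlkey} {timestamp} {json.dumps(fields)}"
```

The trade-off is the one raised in the comment: the combined form works for plain cdx output, which cannot carry extra fields, but the untouched original URL is lost.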
I'm working on our indexer and re-reading this thread, and still struggling to know what to do. I'm using Should we try to hold a call to work out what the details should be? @ikreymer @ato ? |
I am the author of #91 and updating to pywb 2.6.x while using outbackcdx. Currently seeing the following points:
If I understand @ikreymer correctly, (1) could also be solved in OutbackCDX by integrating the fields For 2) I could imagine a re-implementation of my original pywb pull request webrecorder/pywb#587 which adds the It would be the original outbackcdx and pywb pull requests adjusted to the new requirements. I could try to make these changes, but if there's a simpler solution I'm more than happy. @ato @ikreymer @anjackson What do you think about this? PS. I just realized that this discussion is maybe taking place in the wrong project (replayweb.page). Should we split up and move to pywb and outbackcdx? |
OutbackCDX can now store arbitrary CDXJ fields (currently this is gated behind the
I think it might be nicer to actually add method and requestBody as query parameters to the CDX server API.
|
@ato 🚀 Really cool, I'm going to test this and report back. For me, the implemented solution makes absolute sense, many thanks! Just an idea/question: would it maybe make sense to just store the content of PUT/POST as a hash value? Could we run into size problems (length of PUT/POST data)? I also saw you're working on an index upgrader 🎉 I'm working on renewing the Dockerfile, as building the tools now fails with maven:3-eclipse-temurin-17. Do you think there's any chance of upgrading the rocksdbjni image to a newer version? (It is set to 6.20.3 in pom.xml, current is 8.1.1.1.) It is hard for me to estimate the changes at API level, but I could try. |
I think the main downside to this option is it means you can't change fuzzy matching rules (e.g. fields to ignore for matching because they contain random, time-specific or user-agent specific data) without going back to the source WARC records, which can take a lot of processing time for large collections. I guess it also makes the records less detailed for other purposes like troubleshooting or research. Using hashes would indeed certainly make the storage simpler though, particularly for long requests. My current personal goal is to at least make OutbackCDX compatible with the CDX/CDXJ "--post-append" indexes that the Webrecorder suite of tools are currently producing, as that's already seeing quite a bit of use. I expect this is an area that's going to see more refinement and experimentation over time though. There's the case of responses that differ based on request headers -- an obvious example being content type negotiation, but I'm sure someone's made at least one site somewhere that passes essential API parameters as custom HTTP request headers.
The Pywb requestBody transformation truncates the converted/canonicalized requestBody to 4096 bytes.
Yep. Intending to do that for the next release as well. I haven't tried the very latest release but I did try 7.x not so long ago and there weren't any API changes. I've opened an issue to remind myself. nla/outbackcdx#114 |
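The two storage options discussed above (truncating the request body versus storing only a hash) can be compared with a short sketch. This is not pywb's code: the helper names are hypothetical, the 4096-byte limit is the figure quoted in the thread, and SHA-256 is an arbitrary choice made for illustration.

```python
import hashlib

MAX_BODY_BYTES = 4096  # limit quoted in the thread for pywb's requestBody


def truncate_body(body: str, limit: int = MAX_BODY_BYTES) -> str:
    """Cut the body to at most `limit` UTF-8 bytes.

    A multi-byte character split by the cut is dropped rather than kept broken.
    """
    return body.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def hash_body(body: str) -> str:
    """Replace the body with a fixed-length digest (assumed scheme: SHA-256)."""
    return "sha256:" + hashlib.sha256(body.encode("utf-8")).hexdigest()
```

A truncated body still exposes individual parameters, so matching rules can be changed later without re-reading the WARCs; a hash has a fixed size but only supports exact matching on the whole body.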
When playing back web archives using ReplayWeb.page, we can get very high quality playback, and I think this is down to:
For large-scale web archives, we need to find ways to support this via OutbackCDX or SolrWayback/webarchive-discovery. Can you give us a clear summary of what we need to do to ensure that pywb (and future ReplayWeb.page?) can achieve the same level of playback quality?