OutbackCDX does not get parameters of POST request #585
Comments
OutbackCDX does not currently have any support for indexing POST requests (pull requests welcome though).
@ato I added a first pull request at nla/outbackcdx#91. It would be great if you could have a look at it - feedback and discussion welcome. 😀
Sorry for not responding earlier! Most of the POST matching is done on form data (application/x-www-form-urlencoded), where the POST fields can be matched like GET query params. The base64-encoded __wb_post_data was added as a last-resort option, in case it would be useful for other kinds of data, and it looks like it actually is useful here! Often, the values in the POST do not match exactly, and then it falls back to fuzzy matching. Given this use case, I wonder if JSON data should be treated differently as well, perhaps just added as __wb_json_data=, which could be more helpful for inexact matches? Just an idea; the current PR will probably be a quicker way to get this supported, but this change may be interesting to consider.
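To make the distinction above concrete, here is a minimal sketch of turning a POST body into query parameters. It is not pywb's actual code: the function names post_body_to_query and append_query are hypothetical, and the only behaviour taken from this thread is that form data becomes ordinary query params while other bodies fall back to a base64-encoded __wb_post_data value.

```python
import base64
from urllib.parse import parse_qsl, urlencode


def post_body_to_query(body: bytes, content_type: str) -> str:
    """Return a query-string fragment for a POST body (illustrative only)."""
    mime = content_type.split(";")[0].strip().lower()
    if mime == "application/x-www-form-urlencoded":
        # Form data: reuse the fields as ordinary query parameters,
        # so they can be matched like GET query params.
        pairs = parse_qsl(body.decode("utf-8", errors="replace"),
                          keep_blank_values=True)
        return urlencode(pairs)
    # Any other body (e.g. JSON): fall back to an opaque base64 blob.
    encoded = base64.b64encode(body).decode("ascii")
    return urlencode({"__wb_post_data": encoded})


def append_query(url: str, extra: str) -> str:
    """Append a query-string fragment to a URL that may already have a query."""
    if not extra:
        return url
    separator = "&" if "?" in url else "?"
    return url + separator + extra
```

With this shape, two JSON POSTs to the same URL end up with different lookup keys only if their bodies differ byte for byte, which is why an inexact-match scheme for JSON is being discussed here.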
@ikreymer Thanks for the input! I agree that it would make sense to implement a solution for JSON POST data. But wouldn't this break existing solutions (which currently have __wb_post_data in the SURT) and require reindexing? Independent of the JSON issue, for replay to work with OutbackCDX, a small change is needed in the pywb RemoteIndexSource to pass the __wb_post_data with the url field (since OutbackCDX does its own canonicalization and ignores the urlkey parameter). I opened a pull request for this in #587. This currently only works with base64-encoded data, but I could of course change it to allow any format.
Closing this, as I believe all issues related to POST here should now be resolved (and it appears this dashboard has been updated to not use POST anymore).
Describe the bug
When using OutbackCDX as an index server, the __wb_post_data is not sent with the url to the outbackcdx server. On webpages with multiple XHR POSTs to the same URL, this will return the wrong data. Using a local CDXJ file index works as expected.
Steps to reproduce the bug
cdx-indexer -p -s corona-data.warc.gz | curl -X POST --data-binary @- http://127.0.0.1:8078/collection
Expected behavior
The replayed POST requests should contain correct responses (so the diagrams can be drawn)
Screenshots
Replayed page with invalid (white) diagrams. The reason for this is that the CDX information for the POST requests to _dash-update-components is not passed with the query.
Environment
Additional context
I tried to track this down to the _get_api_url function in warcserver/indexsource.py. The url used does not contain the __wb_post_data. FileIndexSource uses the key parameter. So I see the following options:
- pass the key using the urlkey parameter of outbackcdx (and update the documentation)
- add __wb_post_data to the url parameter
There might also be other options to consider. Also, __wb_post_data changed to __warc_post_data with cdxj-indexer, so maybe there is more development going on. I'd be interested in contributing a fix, but need some guidance as to the best way.
Update. Quote from the OutbackCDX page: "The canonicalized URL (first field) is ignored, OutbackCDX performs its own canonicalization." - indexing in OutbackCDX seems to ignore the __wb_post_data parameter, so this might need further evaluation/coordination.
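For illustration, the second option (adding __wb_post_data to the url parameter) could look roughly like the sketch below. This is not the existing _get_api_url implementation: build_remote_query is a hypothetical helper, the example.com page URL and the encoded value are placeholders, and it only covers the query side. As the update above notes, OutbackCDX would also need to keep the parameter at indexing time.

```python
from urllib.parse import urlencode


def build_remote_query(api_base: str, url: str, post_query: str = "") -> str:
    """Build an index lookup URL, carrying POST data inside the url parameter.

    api_base   -- collection endpoint of the remote index server
    url        -- the URL being replayed
    post_query -- already-encoded fragment such as '__wb_post_data=...'
    """
    if post_query:
        # Attach the POST data to the replayed URL itself, so the server's
        # own canonicalization sees it as part of the URL.
        separator = "&" if "?" in url else "?"
        url = url + separator + post_query
    # urlencode percent-escapes the whole URL so it survives as one parameter.
    return api_base + "?" + urlencode({"url": url})


if __name__ == "__main__":
    print(build_remote_query(
        "http://127.0.0.1:8078/collection",
        "http://example.com/_dash-update-component",  # placeholder page URL
        "__wb_post_data=eyJ4IjoxfQ%3D%3D",            # placeholder payload
    ))
```

Keeping the POST data inside the url value, instead of sending a precomputed key, matches the constraint that the server performs its own canonicalization.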