[XMLProcessor] Skip DTD #205
base: trunk
…ssor_on_the_current_tag test to use a root node
Mixed thoughts on this one. On the one hand, it’s nice to skip over the DTD, but particularly when there are ENTITY definitions it could be dangerous to continue on to parse the rest of the document. Those entities can be meaningful.
IIUC, this and the other two PRs make XMLProcessor more of a non-validating processor:
The spec still requires internal entity processing:
Even if there was a recommendation to support custom entities, I would rather not implement that at all for security reasons, e.g. billion laughs attack. Here's an excerpt I like from https://pypi.org/project/defusedxml/, a library for secure XML parsing in Python:
This leaves us with a few smaller features that we could likely support safely, e.g. default attribute values. At this point, however, I'd rather skip even those, as we're getting into a confusing minefield of "which DTD features are available and which are not". Perhaps we could reject DTDs on sight, and then require an opt-in, e.g. via
this seems more reasonable to me because otherwise we’re intentionally misparsing documents.
There is a middle ground here, which I believe is reasonable. We can even use the concept of a budget, like we did with the HTML API, to hand out expansions, and we can limit expansion recursion depth. If these kinds of constraints are violated then we can reject the document, making it possible to parse almost every normative non-malicious document while rejecting the ones which are troublesome. It'd probably even go far to simply expand only those entities which themselves contain no other entities beyond character references.
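To make the budget idea concrete, here is a minimal sketch in Python (not the project's language, and not XMLProcessor code). The names `expand_entities`, `EntityBudgetExceeded`, `max_expansions`, and `max_depth` are hypothetical, and the entity table is assumed to have been collected from the DTD already. It hands out a fixed number of expansions per document and caps nesting depth, rejecting the document when either limit is exceeded.

```python
import re


class EntityBudgetExceeded(Exception):
    """Raised when a document needs more expansions or nesting than allowed."""


# Matches general entity references such as &name; but not character
# references such as &#38; (those start with '#').
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w:.\-]*);")


def expand_entities(text, entities, max_expansions=100, max_depth=4):
    """Expand references found in `text` using the `entities` table.

    `entities` maps entity names to replacement text (hypothetical input,
    assumed to come from the parsed DTD). References to names that are not
    in the table are left untouched.
    """
    remaining = [max_expansions]  # shared, mutable budget for the whole call

    def expand(value, depth):
        if depth > max_depth:
            raise EntityBudgetExceeded("entity nesting is too deep")

        def replace(match):
            name = match.group(1)
            if name not in entities:
                return match.group(0)
            if remaining[0] <= 0:
                raise EntityBudgetExceeded("expansion budget exhausted")
            remaining[0] -= 1
            # Replacement text may itself contain references, so recurse.
            return expand(entities[name], depth + 1)

        return ENTITY_REF.sub(replace, value)

    return expand(text, 0)
```

With a table like `{"a": "x", "b": "&a;&a;"}`, the input `&b;!` costs three expansions. A "billion laughs" style table runs out of budget long before the output grows large, and a self-referencing entity trips the depth limit instead of recursing forever.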
of the documents you have seen, have you assessed what they contain in the DTD? or are these documents just sitting there in tests vs. in the real world?
for better or worse this is pretty much how all XML parsers work because the spec leaves so much wiggle room, and places computationally impossible demands on proper parsing. we already have somewhat limited spec support anyway, right? what stands out to me the most is not what a library supports, but the difference between what a library claims to support and what it actually supports. if we refuse to parse something we know we don’t understand, people get frustrated but they can accept that. if we claim to support a document and produce a different parse than software which actually supports it, then we’ve given those people reason to be frustrated.
Description
Allows XMLProcessor to parse documents containing inline DTD definitions similar to these:
This is done by recognizing, parsing, and skipping all the DOCTYPE internals: <!ELEMENT>, <!ATTLIST>, <!ENTITY>, <!NOTATION>, and nested processing instructions and conditional sections. XMLProcessor does not use the parsed DTD information in any way or even expose it to the caller. It just skips right past it.
Example
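As a rough illustration of the skipping behavior described above, here is a minimal Python sketch. It is not the PR's implementation: `skip_doctype` is a hypothetical helper, and it only tracks quoted literals, comments, processing instructions, and bracket depth rather than parsing each declaration. Returning `None` on truncated input stands in for the "need more bytes" case of a streaming parser.

```python
def skip_doctype(xml, start=0):
    """Return the index just past the DOCTYPE declaration at `start`.

    Returns None when the declaration is incomplete, so a streaming
    caller can wait for more input and retry.
    """
    if not xml.startswith("<!DOCTYPE", start):
        raise ValueError("not positioned at a DOCTYPE declaration")

    i = start + len("<!DOCTYPE")
    depth = 0  # nesting of '[' ... ']' (internal subset, conditional sections)

    while i < len(xml):
        ch = xml[i]
        if ch in "\"'":
            # Quoted literal: may contain '>' or ']' that must be ignored.
            end = xml.find(ch, i + 1)
            if end == -1:
                return None
            i = end + 1
        elif xml.startswith("<!--", i):
            end = xml.find("-->", i + 4)
            if end == -1:
                return None
            i = end + 3
        elif xml.startswith("<?", i):
            end = xml.find("?>", i + 2)
            if end == -1:
                return None
            i = end + 2
        elif ch == "[":
            depth += 1
            i += 1
        elif ch == "]" and depth > 0:
            depth -= 1
            i += 1
        elif ch == ">" and depth == 0:
            # Closing '>' of the DOCTYPE itself.
            return i + 1
        else:
            # Anything else, including '>' that ends a declaration
            # inside the internal subset.
            i += 1

    return None
```

For a document such as `<!DOCTYPE root [ <!ELEMENT root (#PCDATA)> ]><root/>`, the returned index points at `<root/>`, and nothing from the DTD is retained. A real processor would also validate the declarations it steps over, which this sketch deliberately does not attempt.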
Testing instructions
CI. This PR enables more than 1,000 tests from the W3C XML test suite and also adds new XMLProcessorTest cases to confirm we can stream-parse incomplete documents involving inline DTD declarations.
Supersedes #204
cc @dmsnell @sirreal