
Conversation

@adamziel (Collaborator) commented Nov 2, 2025

Description

Allows XMLProcessor to parse documents containing inline DTD definitions such as:

<?xml version="1.0"?>
<!DOCTYPE book [
    <!ELEMENT book (#PCDATA)>
    <!ATTLIST book genre CDATA #IMPLIED>
    <!ENTITY writer "Homer">
]>

This is done by recognizing, parsing, and skipping all of the DOCTYPE internals: <!ELEMENT>, <!ATTLIST>, <!ENTITY>, <!NOTATION>, and nested processing instructions and conditional sections. XMLProcessor does not use the parsed DTD information in any way, nor does it expose it to the caller; it simply skips past it.
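
To illustrate the approach, here is a simplified, hypothetical sketch of the skipping strategy (not the actual XMLProcessor code; the function name is made up). The real implementation must also honor quoted literals (a ">" inside an entity value) and nested conditional sections, which this naive version ignores:

// Hypothetical sketch: advance past every declaration in the DOCTYPE
// internal subset without interpreting it.
function skip_internal_subset( string $xml, int $at ): ?int {
    $end = strlen( $xml );
    while ( $at < $end && ']' !== $xml[ $at ] ) {
        if ( '<' === $xml[ $at ] ) {
            // <!ELEMENT ...>, <!ATTLIST ...>, <!ENTITY ...>, <!NOTATION ...>, or <?...?>.
            $close = strpos( $xml, '>', $at );
            if ( false === $close ) {
                return null; // Incomplete input: pause and wait for more bytes.
            }
            $at = $close + 1;
        } else {
            ++$at; // Whitespace between declarations.
        }
    }
    return $at < $end ? $at + 1 : null; // Index just past the closing "]".
}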

Example

$xml = <<<XML
<?xml version="1.0"?>
<!DOCTYPE book [
    <!ELEMENT book (#PCDATA)>
    <!ATTLIST book genre CDATA #IMPLIED>
    <!ENTITY writer "Homer">
]>
<book genre="epic">&writer;</book>
XML;

$processor = XMLProcessor::create_from_string( $xml );
$processor->next_tag();                    // lands on <book>
$processor->set_attribute( '', 'genre', 'classic' );

echo $processor->get_updated_xml();

/*
<?xml version="1.0"?>
<!DOCTYPE book [
    <!ELEMENT book (#PCDATA)>
    <!ATTLIST book genre CDATA #IMPLIED>
    <!ENTITY writer "Homer">
]>
<book genre="classic">&writer;</book>
*/

Testing instructions

CI. This PR enables more than 1,000 tests from the W3C XML test suite and also adds new XMLProcessorTest cases confirming we can stream-parse incomplete documents that involve inline DTD declarations.

Supersedes #204

cc @dmsnell @sirreal

@dmsnell (Member) left a comment

Mixed thoughts on this one. On the one hand, it’s nice to skip over the DTD, but particularly when there are ENTITY definitions it could be dangerous to continue on to parse the rest of the document. Those entities can be meaningful.

@adamziel (Collaborator, Author) commented Nov 3, 2025

IIUC, this and the other two PRs make XMLProcessor more of a non-validating processor:

Non-validating processors are required to check only the document entity, including the entire internal DTD subset, for well-formedness.

Note that when processing invalid documents with a non-validating processor the application may not be presented with consistent information

Certain well-formedness errors, specifically those that require reading external entities, may fail to be detected by a non-validating processor. Examples include the constraints entitled Entity Declared, Parsed Entity, and No Recursion, as well as some of the cases described as forbidden in 4.4 XML Processor Treatment of Entities and References

The spec still requires internal entity processing:

they must use the information in those declarations to normalize attribute values, include the replacement text of internal entities, and supply default attribute values
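
For example, a processor that does read the subset below must report genre="epic" on <book> even though the attribute is absent from the tag, and must replace &writer; with Homer:

<?xml version="1.0"?>
<!DOCTYPE book [
    <!ATTLIST book genre CDATA "epic">
    <!ENTITY writer "Homer">
]>
<book>&writer;</book>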

Even if there were a recommendation to support custom entities, I would rather not implement that at all for security reasons, e.g. the billion laughs attack (see the payload sketch after this list). Here's an excerpt I like from https://pypi.org/project/defusedxml/, a library for secure XML parsing in Python:

How to avoid XML vulnerabilities:

  • Don’t allow DTDs
  • Don’t expand entities
  • Don’t resolve externals
  • Limit parse depth
  • Limit total input size
  • Limit parse time
  • Favor a SAX or iterparse-like parser for potential large data
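
Coming back to the billion laughs attack: the classic payload is just a few nested entity declarations whose replacement text grows tenfold at each level. Truncated to three levels here, &lol3; alone already expands to 100 copies of "lol"; the classic version nests nine levels deep:

<?xml version="1.0"?>
<!DOCTYPE lolz [
    <!ENTITY lol "lol">
    <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
    <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>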

This leaves us with a few smaller features that we could likely safely support, e.g. default attribute values. At this point, however, I'd rather skip even those as we're getting into a confusing minefield territory of "which DTD features are available and which are not".

Perhaps we could reject DTDs on sight, and then require an opt-in, e.g. via $processor->dangerously_skip_dtds(). Or an error token. Currently I'm unhappy about not being able to even scan a document with a custom DTD.
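
Roughly (purely hypothetical, nothing here exists yet):

// Hypothetical opt-in; no such method exists today.
$processor = XMLProcessor::create_from_string( $xml );
$processor->dangerously_skip_dtds(); // Caller accepts that entities declared in the DTD won't resolve.
while ( $processor->next_tag() ) {
    // Tokens after the skipped DTD scan normally.
}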

@dmsnell (Member) commented Nov 3, 2025

Perhaps we could reject dtd on sight

this seems more reasonable to me because otherwise we’re intentionally misparsing documents.

Even if there was a recommendation to support custom entities, I would rather not implement that at all for security reasons, e.g. billion laughs attack.

There is middle-ground here, which I believe is reasonable. We can even use the concept of a budget like we did with the HTML API to hand out expansions, and we can limit expansion recursion depth. If these kinds of constraints are violated then we can reject the document, making it possible to parse almost every normative non-malicious document while rejecting the ones which are troublesome.

It’d probably even go far to simply expand entities which themselves contain no other entities beyond character references.
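
A sketch of what that budget could look like (hypothetical names; assuming internal entities were already parsed into a name-to-replacement-text array; the HTML API's actual budget mechanism may differ):

final class Entity_Expansion_Limit_Exceeded extends \Exception {}

// Hypothetical sketch: expand internal entities against a shared budget
// and a recursion-depth cap, rejecting the whole document the moment
// either limit is exceeded. Predefined entities (&amp; etc.) and
// character references pass through untouched here.
function expand_entities( string $text, array $entities, int &$budget, int $depth = 0 ): string {
    if ( $depth > 8 ) {
        throw new Entity_Expansion_Limit_Exceeded( 'Entity nesting too deep.' );
    }
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9._-]*);/',
        function ( array $m ) use ( $entities, &$budget, $depth ): string {
            if ( ! isset( $entities[ $m[1] ] ) ) {
                return $m[0]; // Not an internal entity: leave the reference alone.
            }
            if ( --$budget < 0 ) {
                throw new Entity_Expansion_Limit_Exceeded( 'Expansion budget exhausted.' );
            }
            // Replacement text may itself reference entities; recurse one level deeper.
            return expand_entities( $entities[ $m[1] ], $entities, $budget, $depth + 1 );
        },
        $text
    );
}

With a budget of, say, 1,000 total substitutions, the billion-laughs payload above is rejected almost immediately, while ordinary documents never come close to the limit.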

I'm unhappy about not being able to even scan any document with a custom dtd

of the documents you have seen, have you assessed what they contain in the DTD? or are these documents just sitting there in tests vs. in the real world?

we're getting into a confusing minefield territory of "which dtd features are available and which are not".

for better or worse this is pretty much how all XML parsers work because the spec leaves so much wiggle room, and places computationally impossible demands on proper parsing.

we already have somewhat limited spec support anyway, right? what stands out to me the most is not what a library supports, but the difference between what a library claims to support and what it actually supports. if we refuse to parse something we know we don’t understand, people get frustrated but they can accept that. if we claim to support a document and produce a different parse than software which actually supports it, then we’ve given those people reason to be frustrated.
