
Conversation

@adamziel (Collaborator) commented Nov 2, 2025

Description

Allows XMLProcessor to parse documents containing inline DTD definitions such as:

<?xml version="1.0"?>
<!DOCTYPE book [
    <!ELEMENT book (#PCDATA)>
    <!ATTLIST book genre CDATA #IMPLIED>
    <!ENTITY writer "Homer">
]>

This is done by recognizing, parsing, and skipping all of the DOCTYPE internals: <!ELEMENT>, <!ATTLIST>, <!ENTITY>, <!NOTATION>, and nested processing instructions and conditional sections. XMLProcessor does not use the parsed DTD information in any way, nor does it expose it to the caller; it simply skips past it.
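
To illustrate the approach, here is a simplified, hypothetical sketch of the skipping strategy (not the actual XMLProcessor code; the function name is made up). The real implementation must also honor quoted literals (a ">" inside an entity value) and nested conditional sections, which this naive version ignores:

// Hypothetical sketch: advance past every declaration in the DOCTYPE
// internal subset without interpreting it.
function skip_internal_subset( string $xml, int $at ): ?int {
    $end = strlen( $xml );
    while ( $at < $end && ']' !== $xml[ $at ] ) {
        if ( '<' === $xml[ $at ] ) {
            // <!ELEMENT ...>, <!ATTLIST ...>, <!ENTITY ...>, <!NOTATION ...>, or <?...?>.
            $close = strpos( $xml, '>', $at );
            if ( false === $close ) {
                return null; // Incomplete input: pause and wait for more bytes.
            }
            $at = $close + 1;
        } else {
            ++$at; // Whitespace between declarations.
        }
    }
    return $at < $end ? $at + 1 : null; // Index just past the closing "]".
}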

Example

$xml = <<<XML
<?xml version="1.0"?>
<!DOCTYPE book [
    <!ELEMENT book (#PCDATA)>
    <!ATTLIST book genre CDATA #IMPLIED>
    <!ENTITY writer "Homer">
]>
<book genre="epic">&writer;</book>
XML;

$processor = XMLProcessor::create_from_string( $xml );
$processor->next_tag();                    // lands on <book>
$processor->set_attribute( '', 'genre', 'classic' );

echo $processor->get_updated_xml();

/*
<?xml version="1.0"?>
<!DOCTYPE book [
    <!ELEMENT book (#PCDATA)>
    <!ATTLIST book genre CDATA #IMPLIED>
    <!ENTITY writer "Homer">
]>
<book genre="classic">&writer;</book>
*/

Testing instructions

CI. This PR enables more than 1,000 tests from the W3C XML test suite and also adds new XMLProcessorTest cases confirming we can stream-parse incomplete documents that involve inline DTD declarations.

Supersedes #204

cc @dmsnell @sirreal

@dmsnell (Member) left a comment

Mixed thoughts on this one. On the one hand, it’s nice to skip over the DTD, but particularly when there are ENTITY definitions it could be dangerous to continue on to parse the rest of the document. Those entities can be meaningful.

@adamziel (Collaborator, Author) commented Nov 3, 2025

IIUC, this and the other two PRs make XMLProcessor more of a non-validating processor:

Non-validating processors are required to check only the document entity, including the entire internal DTD subset, for well-formedness.

Note that when processing invalid documents with a non-validating processor the application may not be presented with consistent information

Certain well-formedness errors, specifically those that require reading external entities, may fail to be detected by a non-validating processor. Examples include the constraints entitled Entity Declared, Parsed Entity, and No Recursion, as well as some of the cases described as forbidden in 4.4 XML Processor Treatment of Entities and References

The spec still requires internal entity processing:

they must use the information in those declarations to normalize attribute values, include the replacement text of internal entities, and supply default attribute values
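
For example, a processor that does read the subset below must report genre="epic" on <book> even though the attribute is absent from the tag, and must replace &writer; with Homer:

<?xml version="1.0"?>
<!DOCTYPE book [
    <!ATTLIST book genre CDATA "epic">
    <!ENTITY writer "Homer">
]>
<book>&writer;</book>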

Even if there were a recommendation to support custom entities, I would rather not implement that at all for security reasons, e.g. the billion laughs attack (see the payload sketch after this list). Here's an excerpt I like from https://pypi.org/project/defusedxml/, a library for secure XML parsing in Python:

How to avoid XML vulnerabilities:

  • Don’t allow DTDs
  • Don’t expand entities
  • Don’t resolve externals
  • Limit parse depth
  • Limit total input size
  • Limit parse time
  • Favor a SAX or iterparse-like parser for potential large data
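
Coming back to the billion laughs attack: the classic payload is just a few nested entity declarations whose replacement text grows tenfold at each level. Truncated to three levels here, &lol3; alone already expands to 100 copies of "lol"; the classic version nests nine levels deep:

<?xml version="1.0"?>
<!DOCTYPE lolz [
    <!ENTITY lol "lol">
    <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
    <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>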

This leaves us with a few smaller features that we could likely safely support, e.g. default attribute values. At this point, however, I'd rather skip even those as we're getting into a confusing minefield territory of "which DTD features are available and which are not".

Perhaps we could reject DTDs on sight, and then require an opt-in, e.g. via $processor->dangerously_skip_dtds(). Or an error token. Currently I'm unhappy about not being able to even scan a document with a custom DTD.
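
Roughly (purely hypothetical, nothing here exists yet):

// Hypothetical opt-in; no such method exists today.
$processor = XMLProcessor::create_from_string( $xml );
$processor->dangerously_skip_dtds(); // Caller accepts that entities declared in the DTD won't resolve.
while ( $processor->next_tag() ) {
    // Tokens after the skipped DTD scan normally.
}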

@dmsnell (Member) commented Nov 3, 2025

Perhaps we could reject dtd on sight

this seems more reasonable to me because otherwise we’re intentionally misparsing documents.

Even if there was a recommendation to support custom entities, I would rather not implement that at all for security reasons, e.g. billion laughs attack.

There is middle-ground here, which I believe is reasonable. We can even use the concept of a budget like we did with the HTML API to hand out expansions, and we can limit expansion recursion depth. If these kinds of constraints are violated then we can reject the document, making it possible to parse almost every normative non-malicious document while rejecting the ones which are troublesome.

It’d probably even go far to simply expand entities which themselves contain no other entities beyond character references.
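
A sketch of what that budget could look like (hypothetical names; assuming internal entities were already parsed into a name-to-replacement-text array; the HTML API's actual budget mechanism may differ):

final class Entity_Expansion_Limit_Exceeded extends \Exception {}

// Hypothetical sketch: expand internal entities against a shared budget
// and a recursion-depth cap, rejecting the whole document the moment
// either limit is exceeded. Predefined entities (&amp; etc.) and
// character references pass through untouched here.
function expand_entities( string $text, array $entities, int &$budget, int $depth = 0 ): string {
    if ( $depth > 8 ) {
        throw new Entity_Expansion_Limit_Exceeded( 'Entity nesting too deep.' );
    }
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9._-]*);/',
        function ( array $m ) use ( $entities, &$budget, $depth ): string {
            if ( ! isset( $entities[ $m[1] ] ) ) {
                return $m[0]; // Not an internal entity: leave the reference alone.
            }
            if ( --$budget < 0 ) {
                throw new Entity_Expansion_Limit_Exceeded( 'Expansion budget exhausted.' );
            }
            // Replacement text may itself reference entities; recurse one level deeper.
            return expand_entities( $entities[ $m[1] ], $entities, $budget, $depth + 1 );
        },
        $text
    );
}

With a budget of, say, 1,000 total substitutions, the billion-laughs payload above is rejected almost immediately, while ordinary documents never come close to the limit.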

I'm unhappy about not being able to even scan any document with a custom dtd

of the documents you have seen, have you assessed what they contain in the DTD? or are these documents just sitting there in tests vs. in the real world?

we're getting into a confusing minefield territory of "which dtd features are available and which are not".

for better or worse this is pretty much how all XML parsers work because the spec leaves so much wiggle room, and places computationally impossible demands on proper parsing.

we already have somewhat limited spec support anyway, right? what stands out to me the most is not what a library supports, but the difference between what a library claims to support and what it actually supports. if we refuse to parse something we know we don’t understand, people get frustrated but they can accept that. if we claim to support a document and produce a different parse than software which actually supports it, then we’ve given those people reason to be frustrated.
