Commit 7241121

[XMLProcessor] ASCII fast path for parsing names (#201)
Speeds up XMLProcessor by consuming runs of ASCII bytes with `strspn` and skipping the UTF-8 decoder for most tag names. In the PHPUnit test suite for WXR files, parsing the 10MB WXR file drops from ~1.7s on average to ~0.6s on average. This PR also moves from `utf8_codepoint_at` to `_wp_scan_utf8` for UTF-8 decoding without any speed penalty – see #200 for prior context. cc @dmsnell
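
The fast path described above can be sketched in isolation. This is a minimal illustration, not the code from this commit: `ascii_name_prefix_length` is a hypothetical helper, and it deliberately returns 0 when the first byte is not an ASCII NameStartChar, leaving non-ASCII names to a real UTF-8 decoder.

```php
<?php
// Hypothetical helper, not part of XMLProcessor: counts the leading bytes at
// $offset that form an all-ASCII XML name prefix, using only strspn().
function ascii_name_prefix_length( string $xml, int $offset ): int {
	// ASCII NameStartChar: ":", "_", A-Z, a-z. The length argument of 1
	// limits the scan to a single byte.
	$start = strspn( $xml, ':ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz', $offset, 1 );
	if ( 0 === $start ) {
		return 0;
	}

	// ASCII NameChar additionally allows "-", ".", and digits.
	return $start + strspn(
		$xml,
		':ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz-.0123456789',
		$offset + $start
	);
}

$xml = '<wp:post_title>Hello</wp:post_title>';
echo ascii_name_prefix_length( $xml, 1 ), "\n"; // 13, the byte length of "wp:post_title"
```

Because `strspn` scans in native code, a tag name such as `wp:post_title` is measured in two calls instead of one decoder call per character.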
1 parent 72c1176 commit 7241121

File tree

1 file changed

+90
-40
lines changed

components/XML/class-xmlprocessor.php

Lines changed: 90 additions & 40 deletions
@@ -5,7 +5,8 @@
 use WP_HTML_Span;
 use WP_HTML_Text_Replacement;
 
-use function WordPress\Encoding\utf8_codepoint_at;
+use function WordPress\Encoding\compat\_wp_scan_utf8;
+use function WordPress\Encoding\utf8_ord;
 
 /**
  * XML API: XMLProcessor class
@@ -17,11 +18,14 @@
  * It implements a subset of the XML 1.0 specification (https://www.w3.org/TR/xml/)
  * and supports XML documents with the following characteristics:
  *
- * * XML 1.0
- * * Well-formed
- * * UTF-8 encoded
- * * Not standalone (so can use external entities)
- * * No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections (will fail on them)
+ * – XML 1.0
+ * – Well-formed
+ * – UTF-8 encoded
+ * – Not standalone (so can use external entities)
+ * – No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections (will fail on them)
+ *
+ * XML 1.1 is explicitly not a design goal here. Version 1.1 is a
+ * more complex specification and is not as widely supported.
  *
  * ### Possible future direction for this module
  *
@@ -41,12 +45,6 @@
  * * <!NOTATION, see https://www.w3.org/TR/xml/#sec-entity-decl
  * * Conditional sections, see https://www.w3.org/TR/xml/#sec-condition-sect
  *
- * @TODO: Support XML 1.1.
- *
- * @TODO: Evaluate the performance of utf8_codepoint_at() against using the mbstring
- * extension. If mbstring is faster, then use it whenever it's available with
- * utf8_codepoint_at() as a fallback.
- *
  * @package WordPress
  * @subpackage HTML-API
  * @since WP_VERSION
@@ -1198,8 +1196,8 @@ protected function parse_next_token() {
 		/**
 		 * Compute fully qualified attributes and assert:
 		 *
-		 * * All attributes have valid namespaces.
-		 * * No two attributes have the same (local name, namespace) pair.
+		 * All attributes have valid namespaces.
+		 * No two attributes have the same (local name, namespace) pair.
 		 *
 		 * @see https://www.w3.org/TR/2006/REC-xml-names11-20060816/#uniqAttrs
 		 */
@@ -1690,8 +1688,8 @@ private function parse_next_tag() {
 		 * names.
 		 *
 		 * Reference:
-		 * * https://www.w3.org/TR/xml/#NT-STag
-		 * * https://www.w3.org/TR/xml/#NT-Name
+		 * https://www.w3.org/TR/xml/#NT-STag
+		 * https://www.w3.org/TR/xml/#NT-Name
 		 */
 		$tag_name_length = $this->parse_name( $at + 1 );
 		if ( false === $tag_name_length ) {
@@ -2328,48 +2326,100 @@ private function skip_whitespace() {
 	 * @return int
 	 */
 	private function parse_name( $offset ) {
-		static $i = 0;
 		$name_byte_length = 0;
+		$at = $offset;
+
+		// Fast path: consume any ASCII NameStartChar bytes.
+		$name_byte_length += strspn(
+			$this->xml,
+			':ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz',
+			$offset + $name_byte_length,
+			1
+		);
+
 		while ( true ) {
 			/**
 			 * Parse the next unicode codepoint.
 			 *
-			 * We use a custom UTF-8 decoder here. No other method
-			 * is reliable and available enough to depend on it in
-			 * WordPress core:
+			 * We use the `_wp_scan_utf8` UTF-8 decoder introduced in WordPress 6.9. No other method
+			 * is reliable and available enough to depend on it in WordPress core:
 			 *
-			 * * mb_ord() – is not available on all hosts.
-			 * * iconv_substr() – is not available on all hosts.
-			 * * preg_match() – can fail with PREG_BAD_UTF8_ERROR when the input
+			 * mb_ord() – is available on 99.5%+ of hosts, but not on all hosts.
+			 * iconv_substr() – is not available on all hosts.
+			 * preg_match() – can fail with PREG_BAD_UTF8_ERROR when the input
 			 * contains an incomplete UTF-8 byte sequence – even
 			 * when that sequence comes after a valid match. This
 			 * failure mode cannot be reproduced with just any string.
 			 * The runtime must be in a specific state. It's unclear
 			 * how to reliably reproduce this failure mode in a
 			 * unit test.
 			 *
-			 * Performance-wise, character-by-character processing via utf8_codepoint_at
-			 * is still much faster than relying on preg_match(). The mbstring extension
-			 * is likely faster. It would be interesting to evaluate the performance
-			 * and prefer mbstring whenever it's available.
+			 * Performance-wise, character-by-character processing via _wp_scan_utf8
+			 * is pretty slow. The ASCII fast path below enables skipping most of the
+			 * UTF-8 decoder calls.
+			 *
+			 * If the UTF-8 decoder performance ever becomes a bottleneck, there are a
+			 * few ways to significantly improve it:
+			 *
+			 * – Call a native grapheme_ function when available.
+			 * – Introduce a custom UTF-8 decoder optimized for codepoint-by-codepoint processing.
+			 * It could be the streaming version of the UTF-8 decoder, such as `_wp_iterate_utf8`,
+			 * that avoids the repeated strspn() calls. Alternatively, the older `utf8_codepoint_at`
+			 * function could be restored if its codepoint-by-codepoint decoding performance is
+			 * better than that of _wp_scan_utf8.
+			 */
+
+			/**
+			 * The ASCII speedup includes all ASCII NameStartChar, which are also valid
+			 * NameChar, making it possible to quickly scan past these bytes without
+			 * further processing.
+			 */
+			$name_byte_length += strspn( $this->xml, ":ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz-.0123456789\u{B7}", $offset + $name_byte_length );
+
+			/*
+			 * Quickly check if the next byte is an ASCII byte that is not allowed in XML
+			 * NameStartChar. If so, we can break out of the loop without calling the UTF-8 decoder.
+			 *
+			 * Even though this does not seem to be different from the ASCII fast path in the
+			 * _wp_scan_utf8 function, skipping that function call still provides a ~50% speed
+			 * improvement.
 			 */
-			$codepoint = utf8_codepoint_at(
+			$is_non_name_ascii_byte = strspn(
 				$this->xml,
+				"\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" .
+				"\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" .
+				" !\"#$%&'()*+,./;<=>?@[\\]^`{|}~\x7f",
 				$offset + $name_byte_length,
-				$bytes_parsed
-			);
-			if (
-				// Byte sequence is not a valid UTF-8 codepoint.
-				( 0xFFFD === $codepoint && 0 === $bytes_parsed ) ||
-				// No codepoint at the given offset.
-				null === $codepoint ||
-				// The codepoint is not a valid part of an XML NameChar or NameStartChar.
-				! $this->is_valid_name_codepoint( $codepoint, 0 === $name_byte_length )
-			) {
+				1
+			) > 0;
+			if ( $is_non_name_ascii_byte ) {
+				break;
+			}
+
+			// EOF.
+			if ( $offset + $name_byte_length >= strlen( $this->xml ) ) {
+				break;
+			}
+
+			// The next byte sequence is, very likely, a UTF-8 codepoint. Let's
+			// try to decode it.
+			$at = $offset + $name_byte_length;
+			$new_at = $at;
+			$invalid_length = 0;
+			if ( 1 !== _wp_scan_utf8( $this->xml, $new_at, $invalid_length, null, 1 ) ) {
+				// EOF or invalid utf-8 byte sequence.
+				break;
+			}
+
+			$codepoint_byte_length = $new_at - $at;
+			$codepoint = utf8_ord( substr( $this->xml, $at, $codepoint_byte_length ) );
+
+			// The codepoint is not a valid part of an XML NameChar or NameStartChar.
+			if ( ! $this->is_valid_name_codepoint( $codepoint, 0 === $name_byte_length ) ) {
 				break;
 			}
-			$codepoint = null;
-			$name_byte_length += $bytes_parsed;
+			$name_byte_length += $codepoint_byte_length;
+			$at = $new_at;
 		}
 
 		return $name_byte_length;
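
The NameChar mask in `parse_name` above ends with `\u{B7}`. `strspn` compares bytes, not codepoints, so it may be worth checking how a multi-byte mask entry behaves. The following is a small standalone sketch with a shortened, illustrative mask and a made-up input string. It only demonstrates `strspn` semantics and makes no claim about how the surrounding parser handles the result.

```php
<?php
// Illustrative only: a shortened stand-in for the NameChar mask in parse_name().
$mask = "abc-.0123456789\u{B7}";

// PHP expands "\u{B7}" (MIDDLE DOT) to the UTF-8 bytes 0xC2 0xB7, so the
// mask above contains both of those bytes as separate entries.
$middle_dot_hex = bin2hex( "\u{B7}" );

// U+00A9 is encoded as 0xC2 0xA9 and is not an XML NameChar, but its
// lead byte 0xC2 is in the mask, so strspn() counts it.
$input    = "ab\u{A9}c";
$consumed = strspn( $input, $mask );

echo $middle_dot_hex, "\n"; // c2b7
echo $consumed, "\n";       // 3: "a", "b", and the 0xC2 lead byte
```

If that one-byte overrun matters for a caller, one option would be to keep the `strspn` mask purely ASCII and let the UTF-8 decoder path handle U+00B7.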
