-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WP_HTML_Tag_Processor overview issue #44410
Comments
#42485) Introduce WP_HTML_Tag_Processor for reliably modifying HTML attributes. Dynamic blocks often need to inject a CSS class name or set <img src /> in the rendered block HTML markup but lack the means to do so. WP_HTML_Tag_Processor solves this problem. It scans through an HTML document to find specific tags, then transforms those tags by adding, removing, or updating the values of the HTML attributes within that tag (opener). Importantly, it does not fully parse HTML or _recurse_ into the HTML structure. Instead WP_HTML_Tag_Processor scans linearly through a document and only parses the HTML tag openers. Example: ``` $p = new WP_HTML_Tag_Processor('<div id="first"><img /></div>'); $p->next_tag('img')->set_attribute('src', '/wp-content/logo.png'); echo $p; // <div id="first"><img src="/wp-content/logo.png" /></div> ``` For more details and context, see the original GitHub Pull Request at #42485 and the overview issue at #44410. Co-authored-by: Adam Zieliński <[email protected]> Co-authored-by: Dennis Snell <[email protected]> Co-authored-by: Grzegorz Ziółkowski <[email protected]> Co-authored-by: Sören Wrede <[email protected]> Co-authored-by: Colin Stewart <[email protected]>
Frankly I think #42485 was not ready for merging. Seems it lacks any security features, references, docs, how-to's, etc. Even doesn't mention what security and escaping requirements it has. It seems open to exploits in its current state. This issue mentions some of that, but misses few things. For example validity of new attribute names. Also, it doesn't mention the needed security related inline docs and examples which are crucial for new APIs. As I commented on #42485 (comment) the security side of this API still awaits a decision. Imho it needs to implement some restrictions that will make is much more robust and problems-free in the future, and now is the time for these restrictions. If this is not done (in time) there is a risk that WP will end up with another exploits-prone API just like the shortcodes. If you don't believe me please see the huge, slow, cumbersome regex-based shortcodes sanitization functions that had to be added in order to try to secure badly written, exploitable code in plugins. Please consider restricting the illegal chars in attribute names and html-entity encoding the Also see #42485 (comment). |
@azaozz – #42485 was merged to continue iterating in smaller PRs, not because every kink was worked out. I agree with your points about html-entity encoding and restricting illegal chars. The former is now merged, the latter is in motion. The list of todos also reflects updating the docs. About this point:
I don't think there's an effective way to do that. Say we restrict $p->set_attribute( '__onclick', 'alert()' );
$markup = str_replace( '__onclick', 'onclick', $p ); Notably, the processor does not allow injecting new tags, so there's no risk of injecting a rogue |
A getter like |
Wild idea: Tag Processor could use seekable PHP streams instead of strings and internal pointers:
Potential benefits: Simpler, faster code that uses less memory (if we can get rid of |
6.2 dev note draft below. @dmsnell @gziolo @ockham feel free to edit this comment and adjust as you see fit: WordPress 6.2 ships with WP_HTML_Tag_Processor – a tool to adjust HTML tag attributes in the block markup: $p = new WP_HTML_Tag_Processor( '<a href="#"></a>' );
$p->next_tag( 'a' );
$p->remove_attribute( 'href' );
echo $p->get_updated_html();
// <a></a> Before WordPress 6.2, the block markup was typically updated using regular expressions or the
The tradeoff is that it only offers a simplified API to modify HTML attributes. If you want to replace an Here's how to use Remove the $p = new WP_HTML_Tag_Processor( $html );
$p->next_tag( 'a' );
$p->remove_attribute( 'href' ); Add a $p = new WP_HTML_Tag_Processor( $html );
$p->next_tag();
$p->set_attribute( 'style', 'display: none' ); Add a CSS class to the first tag having the $p = new WP_HTML_Tag_Processor( $html );
$p->next_tag( array(
'class_name' => 'wp-block-media-text__content'
) );
$p->add_class( 'wp-foo-bar' ); Add the $p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_tag() ) {
if (
isset( $p->get_attribute( 'src' ) ) &&
! isset( $p->get_attribute( 'srcset' )
) {
$srcset = build_srcset( $p->get_attribute( 'src' ) );
$p->set_attribute( 'srcset', $srcset );
}
} For additional documentation, refer to GitHub Pull Request. If you want to learn more about the motivation for this new API, check the post that proposed A new system for simply and reliably updating HTML attributes. |
I reviewed the listed todo items, and I have an impression that we can close this issue after landing HTML Tag Process in WordPress core. If there are any remaining issues, we can open issues that target individual use cases.
There is
This happened as part of the WP core merge.
Is that still necessary? Could we use the bookmarking system instead? |
@gziolo I refreshed the list of related issues – there's still some work related to the character decoding, the documentation, and the bookmarks API. |
Thank you for updating the description with the recently started tasks.
Documentation is going to be automatically generated from PHPDoc comments for all classes and methods. It will be exposed in the Code Reference section of Official WordPress Developer Resources, similar to for example WP_Block_Type class. What type of handbook page you would like to include in addition to that? Is that in the Block Editor Handbook? In the codebase, there is the following comment included: ### Possible future direction for this module
*
* - Prune the whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c".
* This would increase the size of the changes for some operations but leave more
* natural-looking output HTML.
* - Decode HTML character references within class names when matching. E.g. match having
* class `1<"2` needs to recognize `class="1<"2"`. Currently the Tag Processor
* will fail to find the right tag if the class name is encoded as such.
* - Properly decode HTML character references in `get_attribute()`. PHP's
* `html_entity_decode()` is wrong in a couple ways: it doesn't account for the
* no-ambiguous-ampersand rule, and it improperly handles the way semicolons may
* or may not terminate a character reference. Does #47040 from the list cover points 2 and 3? It looks like we still need an issue for point 1 about removing whitespaces. There is also an unlisted PR that introduces a new class for sourcing block attributes that gets referenced in the PR from the section with new APIs: I see that some listed tasks reference or extend that work, they explore opening modifications to chunks of HTML:
What do you all see as a top priority and how people can get involved to help with efforts? |
I'm thinking of a proper guide on how to use WP_HTML_Tag_Processor. API docs don't quite cut it.
Point 3 – yes.
Good spot 👍 Added to the list
I think the two open PRs may be blockers for the draft PR. Can you confirm @ockham ? If yes, then those would be the priority to me. Then the |
Yes, completely. There's an enhancement possible noted in the decoder class which is adding a method like "decoded_strpos" or "decoded_has_substring" but when I was testing the performance wasn't impacted by calling |
Yes, that's correct. Plus the non-attribute updates one depends on the bookmark invalidation one. So currently, it's a cascade of dependencies, including one more PR that I filed only recently. They are, in order: |
Surfacing my explorations of parsing HTML into the correct tree structure, e.g. |
We can close this issue as done. The work continues in WordPress Trac: HTML API: Introduce HTML Processor, a higher-level partner to the Tag Processor The list of open tickets is grouped under HTML API component. There still are two open PRs:
This item is pretty vague, and it seems to be nice to have at this stage.
|
Thanks @gziolo. The two open PRs are good to stay open here. I don't know if we need a Trac ticket or not because I expect them to remain open for a while. #46345 will likely never merge but only serve as a guide to what we end up merging (which is one of the reasons I keep these open).
|
Can the existing |
@strarsis the Tag Processor doesn't support replacing any tags, but the HTML Processor (which is being built on top of it) will support that. because of the complexities of the semantic rules involved in replacing an element in HTML though, this is coming at a slower pace. |
The follow-up work once the #42485 is merged:
get_attribute
andset_attribute
. Define what should happen in cases such as$p->set_attribute( 'id', 'My‐id' );
and$p->get_attribute( 'id' )
whenid="My‐id"
.set_attribute( )
to make adding inline JavaScript harder (see @peterwilsoncc's comment and the reasoning behind not banning them).WP_HTML_Tag_Processor
#44478set_content_inside_balanced_tags
#47036has_next_tag
methodhas_class
method to the public WP_HTML_Tag_Processor API #46232 (rationale)(Try adding layout classnames to inner block wrapper #44600 (comment)))
(string) $w
(related comment). Done:@w->get_updated_html()
Code is located in two folders:
cc @gziolo @dmsnell @georgeh
The text was updated successfully, but these errors were encountered: