Attempt of getting <pre> to not parse inner contents similar to <script> #582

qknight · 2025-03-11T18:45:06Z

This branch implements the fix for issue #580.

Motivation

With the current implementation, the parser evaluates arbitrary HTML tags inside <pre>...</pre>. With this patch, <pre> behaves more like <script>.

This behaviour should be optional, as it sometimes makes sense to parse tags inside a <pre>, for instance for styling. Most often, however, the content inside a <pre> should be largely ignored and copied 1:1 from the source document into the generated output document. It should not be reformatted (no removal of spaces, newlines or tabs), nor should the parsed content have any influence on the overall consistency of the document.

That said:

<html><pre></html>test foo</pre></html> should not be fixed into
<html><pre>test foo</pre></html>
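As a rough illustration of the script-like behaviour described above, the sketch below treats everything between a <pre> start tag and the next </pre> as raw text. The helper name raw_pre_content is hypothetical and is not part of html5ever; it also assumes attribute values contain no '>' character.

```rust
/// Returns the verbatim content of the first <pre> element in `input`,
/// treating everything up to the next "</pre" as raw text (script-like).
/// Hypothetical helper for illustration only; not html5ever's tokenizer.
fn raw_pre_content(input: &str) -> Option<&str> {
    // ASCII lowercasing changes no byte lengths, so offsets stay valid for `input`.
    let lower = input.to_ascii_lowercase();
    let mut from = 0;
    loop {
        let open = from + lower[from..].find("<pre")?;
        let after = open + 4;
        match lower[after..].chars().next() {
            // Only accept a real <pre> tag, not e.g. <prefix>.
            Some('>') | Some(' ') | Some('\t') | Some('\n') | Some('/') => {
                // Skip to the end of the start tag (assumes no '>' inside attributes).
                let start = after + lower[after..].find('>')? + 1;
                let len = lower[start..].find("</pre")?;
                return Some(&input[start..start + len]);
            }
            Some(_) => from = after,
            None => return None,
        }
    }
}
```

For the input from the motivation, raw_pre_content("<html><pre></html>test foo</pre></html>") yields the stray </html> together with the text, unchanged, instead of a repaired tree.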

Status

branch: servo_issue_580 with hash: 2094a85

This evaluates:

<hello>XML</hello><pre>\n<bad> </bad>text-in pre</pre><p>asdf</p><script>script</html> magic string</script>

into

<html><head></head><body><hello>XML</hello><pre>\n<bad> </bad>text-in pre</pre><p>asdf</p><script>script</html> magic string</script></body></html>

This shows that the content inside <pre>...</pre> is already captured without being parsed. However, the result should not be an HTML-escaped string but a 1:1 copy of the original tags.
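To make the difference between an escaped string and a 1:1 copy concrete, here is a minimal sketch. Both function names and the raw flag are assumptions made for this example; the escaping is a simplified version of what an HTML serializer usually applies to text nodes.

```rust
/// Escapes the characters an HTML serializer usually escapes in text nodes
/// (simplified: only '&', '<' and '>').
fn escape_text(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}

/// Writes <pre> content either escaped or verbatim.
/// The `raw` flag is hypothetical and only used for this illustration.
fn write_pre(content: &str, raw: bool) -> String {
    let body = if raw {
        content.to_string()
    } else {
        escape_text(content)
    };
    format!("<pre>{}</pre>", body)
}
```

With raw set to false the inner tags come out as &lt;bad&gt;; with raw set to true they are copied unchanged, which is the behaviour the branch is aiming for.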

This can be evaluated by running:

clear && cargo run --example html2html

Todo

Figure out why process_to_completion is called for <script> but not for <pre>.
Implement a parser option that controls whether <pre>...</pre> content is parsed or not.
Write the PreData as a String and not HTML-escaped.
Write a bunch of tests to make sure it works.

jdm · 2025-03-11T19:10:56Z

Is this behavior specified in the HTML parsing specification?

qknight · 2025-03-12T01:39:06Z

@jdm your question is hard to answer!

html standard related to `<pre>`

i like the grok summary i created https://x.com/i/grok/share/AI7crMuXH2BoIAxC57P9v8VIg but it does not have sources.

my new understanding is now:

everything in <pre>...</pre> needs to have a fixed layout, no changes on spaces, tabs or newlines
the parser 'can' parse tags but must not do any 'fixes' if incorrect

something along these lines. i'll try to figure out how virtual-dom does it.

virtual-dom (works)

i write this technical blog at https://lastlog.de/blog/libnix_volth's_work.html and i'm using pandoc to generate <pre><code> sections and when i serialize and deserialize the html document using https://github.com/Matt-Esch/virtual-dom it just works correctly.

the motivation to move away from this is the usage of rust compiled to WASM. i always wanted to make modifications to the way 'new virtual-dom patches are applied' with visual cues which i can't do with virtual-dom.

rphtml (fails)

first i tried to replace virtual-dom with rphtml. but i discovered problems with rphtml: fefit/rphtml#4
i tried to fix them but the code is very hard to read and after a few days of hacking i gave up.

notable mention: the rphtml issue was very hard to track down as it works 'half' of the time: text nodes in combination with tags only sometimes yield correct html documents after doc.render(...)

qknight · 2025-03-12T15:59:25Z

@jdm I checked the tests in html-serializer.rs and they seem correct!

test!(pre_lf_0, "<pre>foo bar</pre>");
test!(pre_lf_1, "<pre>\nfoo bar</pre>", "<pre>foo bar</pre>");
test!(pre_lf_2, "<pre>\n\nfoo bar</pre>", "<pre>\nfoo bar</pre>");
test!(pre_lf_3, "<pre>\n  <p>adf</p>\nfoo\n\tbar</pre>", "<pre>  <p>adf</p>\nfoo\n\tbar</pre>");
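The expected values in these tests are consistent with the HTML parsing rule that a single line feed directly after the <pre> start tag is ignored, while all other whitespace is kept. A minimal sketch of that rule, with a function name made up for this example:

```rust
/// Drops exactly one leading line feed, mirroring how an HTML parser treats
/// a newline that directly follows a <pre> start tag. Everything else,
/// including further newlines, spaces and tabs, is left untouched.
/// Hypothetical helper; html5ever applies this rule inside its tree builder.
fn strip_pre_leading_newline(content: &str) -> &str {
    content.strip_prefix('\n').unwrap_or(content)
}
```

Applied to the inner content of the four test inputs, this gives the inner content of the expected outputs.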

I think that html5ever handles it correctly. In sauron, in order to do DOM / vDOM diffs and patches, there needs to be a translation into a different node abstraction.

I've added a check on the tag name for "pre" and execute this code instead:

fn process_handle<MSG>(node: &Handle) -> Result<Option<Node<MSG>>, ParseError> {
    let children: Vec<Node<MSG>> = node
        .children
        .borrow()
        .iter()
        .filter_map(|child| process_handle(child).ok().flatten())
        .collect();

    match &node.data {
        NodeData::Document => {
            let child_nodes_len = children.len();
            match child_nodes_len {
                0 => Ok(Some(node_list([]))),
                1 => Ok(Some(children.into_iter().next().unwrap())),
                _ => Ok(Some(node_list(children))),
            }
        }
        NodeData::Text { contents } => {
            let content = contents.borrow().to_string();
            Ok(Some(text(content)))
        }
        NodeData::Element { name, attrs, .. } => {
            let tag_name = name.local.to_string();
            
            if tag_name == "pre" {
                // Serialize the contents of <pre> back to markup instead of
                // converting its children into nodes.
                let mut buffer: Vec<u8> = Vec::new();
                let document: SerializableHandle = node.clone().into();
            
                serialize(&mut buffer, &document, Default::default()).expect("serialization failed");
                let writer_string = String::from_utf8(buffer).expect("Could not write buffer as string");
                println!("--- {} ---", writer_string);
                let content: String = format!("<pre>{}</pre>", writer_string);
                Ok(Some(text(content)))
            }
            else {

In words: when a <pre> tag occurs, I use serialize / SerializableHandle to convert it into a correctly formatted String and insert it as NodeData::Text. I'm not yet sure this does exactly what I want, but so far it is looking good. I'll close this ticket shortly if that is the case.
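The idea can be shown without html5ever or sauron on a toy tree. Everything below (ToyNode, to_markup, collapse_pre) is a hypothetical stand-in written for this sketch, not the RcDom or sauron node types.

```rust
/// Minimal stand-in for a DOM node; hypothetical, not the RcDom or sauron types.
#[derive(Debug, Clone, PartialEq)]
enum ToyNode {
    Text(String),
    Element { tag: String, children: Vec<ToyNode> },
}

/// Serializes a node back to markup without escaping (illustration only).
fn to_markup(node: &ToyNode) -> String {
    match node {
        ToyNode::Text(t) => t.clone(),
        ToyNode::Element { tag, children } => {
            let inner: String = children.iter().map(to_markup).collect();
            format!("<{0}>{1}</{0}>", tag, inner)
        }
    }
}

/// Copies the tree, replacing every <pre> subtree with a single text node
/// that holds the subtree's serialized markup.
fn collapse_pre(node: &ToyNode) -> ToyNode {
    match node {
        ToyNode::Element { tag, .. } if tag.as_str() == "pre" => {
            ToyNode::Text(to_markup(node))
        }
        ToyNode::Element { tag, children } => ToyNode::Element {
            tag: tag.clone(),
            children: children.iter().map(collapse_pre).collect(),
        },
        ToyNode::Text(t) => ToyNode::Text(t.clone()),
    }
}
```

The trade-off of this approach is that the diffing layer sees the whole <pre> block as one opaque string, so its whitespace survives but changes inside it cannot be patched at a finer granularity.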

test

-------------------------------
html:
<div><p> test </p>
<pre><code><p>foo1</p>
  <p>foo2</p><p>foo3</p>
  3</code></pre></div>
-------------------------------
render_to_string:
<html><head></head><body><div><p> test </p>
<!--separator--><pre><code><p>foo1</p>
  <p>foo2</p><p>foo3</p>
  3</code></pre></div></body></html>
-------------------------------
render_to_string_pretty:
<html>
  <head></head>
  <body>
    <div>
      <p> test </p>


      <!--separator-->
      <pre><code><p>foo1</p>
  <p>foo2</p><p>foo3</p>
  3</code></pre>
      </div>
    </body>
  </html>
-------------------------------

qknight · 2025-03-13T03:21:54Z

The html5ever implementation for <pre> is alright. My problem was caused by the post-processing that happens later on, after let dom = parse_document(RcDom::default(), opts).one(input);

Attempt of getting <pre> to not parse inner contents similar to <script>

2094a85

qknight added 2 commits March 11, 2025 20:45

process_to_completion now processes PreData correctly

35b479e

parse_pre option support for TreeBuilderOpts

f0e4e4a

qknight closed this Mar 13, 2025
