Attempt of getting <pre> to not parse inner contents similar to <script> #582

Closed
wants to merge 3 commits into from


qknight

@qknight qknight commented Mar 11, 2025

This branch implements the fix required for issue #580.

Motivation

With the current implementation the parser evaluates arbitrary HTML tags inside a <pre>...</pre>. With this patch, <pre> behaves more like <script>.

This behaviour should be optional, because it sometimes makes sense to parse tags inside a <pre>, for instance for styling. Most often, though, the content inside a <pre> should be left alone: copied 1:1 from the source document into the generated output document, not reformatted (no removal of spaces, newlines or tabs), and the parsed content should have no influence on the overall consistency of the document.

That said:

  • <html><pre></html>test foo</pre></html> should not be fixed into
  • <html><pre>test foo</pre></html>

Status

branch: servo_issue_580 with hash: 2094a85

This evaluates:

<hello>XML</hello><pre>\n<bad> </bad>text-in pre</pre><p>asdf</p><script>script</html> magic string</script>

into

<html><head></head><body><hello>XML</hello><pre>\n&lt;bad&gt; &lt;/bad&gt;text-in pre</pre><p>asdf</p><script>script</html> magic string</script></body></html>

This shows that the content inside the <pre>...</pre> is already captured without being parsed. The result, however, should not be an HTML-escaped string but a 1:1 copy of the original tags.
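To illustrate where the &lt; and &gt; in the output above come from, here is a minimal sketch of the text-node escaping that HTML serialization applies outside attribute values. The function name escape_text is hypothetical and this is not html5ever's code, only a std-only model of the rule:

```rust
/// Hypothetical helper (not part of html5ever): escape a text node the way
/// HTML serialization does outside attribute values.
fn escape_text(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            // U+00A0 NO-BREAK SPACE is written as a named reference.
            '\u{00A0}' => out.push_str("&nbsp;"),
            _ => out.push(c),
        }
    }
    out
}
```

Once the <pre> content is stored as a single text node, any serializer following this rule will escape the tags in it, which matches the behaviour seen above.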

This can be evaluated by running:

clear && cargo run --example html2html

Todo

  • Figure out why process_to_completion is called for <script> but not for <pre>
  • Implement a parser option that controls whether <pre>...</pre> content is parsed or not
  • Write the PreData as a String and not HTML-escaped
  • Write a bunch of tests to make sure it works

@jdm
Member

jdm commented Mar 11, 2025

Is this behavior specified in the HTML parsing specification?

@qknight
Author

qknight commented Mar 12, 2025

@jdm your question is hard to answer!

html standard related to <pre>

i like the grok summary i created https://x.com/i/grok/share/AI7crMuXH2BoIAxC57P9v8VIg but it does not have sources.

my new understanding is now:

  • everything in <pre>...</pre> needs to have a fixed layout, no changes on spaces, tabs or newlines
  • the parser 'can' parse tags but must not do any 'fixes' if incorrect

something along these lines. i'll try to figure out how virtual-dom does it.

virtual-dom (works)

i write this technical blog at https://lastlog.de/blog/libnix_volth's_work.html and i'm using pandoc to generate <pre><code> sections and when i serialize and deserialize the html document using https://github.com/Matt-Esch/virtual-dom it just works correctly.

the motivation to move away from this is the usage of rust compiled to WASM. i always wanted to make modifications to the way 'new virtual-dom patches are applied' with visual cues which i can't do with virtual-dom.

rphtml (fails)

first i tried to replace virtual-dom with rphtml. but i discovered problems with rphtml: fefit/rphtml#4
i tried to fix them but the code is very hard to read and after a few days of hacking i gave up.

notable mention: the rphtml issue was very hard to track down because it works 'half' of the time: text nodes in combination with tags only sometimes yield correct html documents after doc.render(...)

@qknight
Author

qknight commented Mar 12, 2025

@jdm I checked the tests in html-serializer.rs and they seem correct!

test!(pre_lf_0, "<pre>foo bar</pre>");
test!(pre_lf_1, "<pre>\nfoo bar</pre>", "<pre>foo bar</pre>");
test!(pre_lf_2, "<pre>\n\nfoo bar</pre>", "<pre>\nfoo bar</pre>");
test!(pre_lf_3, "<pre>\n  <p>adf</p>\nfoo\n\tbar</pre>", "<pre>  <p>adf</p>\nfoo\n\tbar</pre>");
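The expected values in these tests follow from one parsing rule: a single line feed directly after the <pre> start tag is ignored, and everything after it is kept as-is. A minimal sketch of that rule applied to the inner content of a <pre>, using a hypothetical helper name (this is not how html5ever implements it internally):

```rust
/// Hypothetical helper: model the rule that one line feed immediately
/// after the <pre> start tag is dropped. All other whitespace is kept.
fn strip_pre_leading_newline(inner: &str) -> &str {
    // strip_prefix removes at most one leading '\n'.
    inner.strip_prefix('\n').unwrap_or(inner)
}
```

Applied to the inner content of pre_lf_1 and pre_lf_2 this drops exactly one newline, and it leaves pre_lf_0 unchanged.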

I think that html5ever handles it correctly. In sauron, in order to do DOM / vDOM diffs and patches, there needs to be a translation into a different node abstraction.

I've added a check on the tag name for "pre" and execute this code instead:

fn process_handle<MSG>(node: &Handle) -> Result<Option<Node<MSG>>, ParseError> {
    let children: Vec<Node<MSG>> = node
        .children
        .borrow()
        .iter()
        .filter_map(|child| process_handle(child).ok().flatten())
        .collect();

    match &node.data {
        NodeData::Document => {
            let child_nodes_len = children.len();
            match child_nodes_len {
                0 => Ok(Some(node_list([]))),
                1 => Ok(Some(children.into_iter().next().unwrap())),
                _ => Ok(Some(node_list(children))),
            }
        }
        NodeData::Text { contents } => {
            let content = contents.borrow().to_string();
            Ok(Some(text(content)))
        }
        NodeData::Element { name, attrs, .. } => {
            let tag_name = name.local.to_string();
            
            if tag_name == "pre" {
                // Serialize the <pre> subtree back to markup so it can be kept verbatim.
                let mut buffer: Vec<u8> = Vec::new();
                let document: SerializableHandle = node.clone().into();

                // The default SerializeOpts write only the children of the node,
                // so the <pre> wrapper is re-added by hand below.
                serialize(&mut buffer, &document, Default::default()).expect("serialization failed");
                let writer_string = String::from_utf8(buffer).expect("serialized output is not valid UTF-8");
                // Debug output of the serialized <pre> content.
                println!("--- {} ---", writer_string);
                let content: String = format!("<pre>{}</pre>", writer_string);
                Ok(Some(text(content)))
            }
            else {

In words: When a <pre> tag occurs, I use serialize / SerializableHandle to convert it into a correctly formatted String and insert it as a text node. Not yet sure this does exactly what I want, but so far it is looking good. I'll close this ticket shortly if that is the case.
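The idea can be sketched without html5ever or sauron. The Dom type and both function names below are hypothetical stand-ins, and text is written back without escaping to keep the sketch short, so this only illustrates the shape of the transformation:

```rust
/// Hypothetical stand-in for a parsed DOM; not html5ever's or sauron's types.
#[derive(Debug, Clone, PartialEq)]
enum Dom {
    Element { tag: String, children: Vec<Dom> },
    Text(String),
}

/// Write a subtree back to markup without re-indenting (and, as a
/// simplification of this sketch, without escaping text).
fn to_markup(node: &Dom) -> String {
    match node {
        Dom::Text(t) => t.clone(),
        Dom::Element { tag, children } => {
            let inner: String = children.iter().map(to_markup).collect();
            format!("<{0}>{1}</{0}>", tag, inner)
        }
    }
}

/// Convert the tree, collapsing every <pre> subtree into one text node
/// that holds its markup verbatim.
fn collapse_pre(node: &Dom) -> Dom {
    match node {
        Dom::Element { tag, .. } if tag == "pre" => Dom::Text(to_markup(node)),
        Dom::Element { tag, children } => Dom::Element {
            tag: tag.clone(),
            children: children.iter().map(collapse_pre).collect(),
        },
        Dom::Text(t) => Dom::Text(t.clone()),
    }
}
```

Because the <pre> subtree becomes a single opaque text node, a later pretty-printer or diff step has no inner nodes to re-indent. The cost is that this text node must then be rendered without escaping.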

test

-------------------------------
html:
<div><p> test </p>
<pre><code><p>foo1</p>
  <p>foo2</p><p>foo3</p>
  3</code></pre></div>
-------------------------------
render_to_string:
<html><head></head><body><div><p> test </p>
<!--separator--><pre><code><p>foo1</p>
  <p>foo2</p><p>foo3</p>
  3</code></pre></div></body></html>
-------------------------------
render_to_string_pretty:
<html>
  <head></head>
  <body>
    <div>
      <p> test </p>


      <!--separator-->
      <pre><code><p>foo1</p>
  <p>foo2</p><p>foo3</p>
  3</code></pre>
      </div>
    </body>
  </html>
-------------------------------

@qknight
Author

qknight commented Mar 13, 2025

The html5ever implementation for <pre> is alright. My problem was caused by my own post-processing of the DOM returned by let dom = parse_document(RcDom::default(), opts).one(input);

@qknight qknight closed this Mar 13, 2025