parse_html ignoring white-spaces and newlines for <pre><code> ... </pre></code> html #4

Open
qknight opened this issue Mar 4, 2025 · 1 comment

qknight commented Mar 4, 2025

I'm running into ivanceras/sauron#107 and suspected this parser was the cause.

However, reproducing the cases in tests/html.rs sheds a different light on the issue: only the last test fails. The <pre><code>...</code></pre> handling is fine, in that your parser does not remove spaces, newlines or (presumably) tabs.

So maybe you want to add these tests as well and fix the implementation, if you think it is worth fixing.

correct and working

#[test]
fn test_pre_code() {
    let html = r#"<div><p> test </p>
<pre><code>
0
  1
  <p>foo</p>
  2
3</code></pre>
</div>"#;
    let expected = "<div><p> test </p>\n<pre><code>\n0\n  1\n  <p>foo</p>\n  2\n3</code></pre>\n</div>";
    let doc = parse(html).unwrap();
    println!("html: {}", html);
    println!("render: {}", render(&doc));
    assert_eq!(expected, render(&doc));
}

#[test]
fn test_pre_code_2() {
    let html = r#"<pre><code>
<span>asdf</span>
  <span>asdf</span>
  <span>asdf</span>
</code></pre>"#;
    let expected = r#"<pre><code>
<span>asdf</span>
  <span>asdf</span>
  <span>asdf</span>
</code></pre>"#;

    let doc = parse(html).unwrap();
    println!("html: {}", html);
    println!("render: {}", render(&doc));
    assert_eq!(expected, render(&doc));
}

#[test]
fn test_no_pre_no_code_2() {
    let html = r#"<span>asdf</span>
  <span>asdf</span>
  <span>asdf</span>"#;

    let expected = r#"<span>asdf</span><span>asdf</span><span>asdf</span>"#;

    let options = RenderOptions {
        lowercase_tagname: true,
        minify_spaces: true,
        ..Default::default()
    };
    let doc = parse(html).unwrap();
    println!("html: {}", html);
    println!("render: {}", doc.render(&options));
    assert_eq!(expected, doc.render(&options));
}

incorrect

#[test]
fn test_pre_code3() {
    let html = r#"<div><p> test </p>
0
  1
  2
3
</div>"#;
    // It currently returns this:
    // "<div><p> test </p>\n0\n1\n2\n3\n</div>"
    // but it should return this:
    let expected = r#"<div><p>test</p>0 1 2 3</div>"#;
    let options = RenderOptions {
        lowercase_tagname: true,
        minify_spaces: true,
        decode_entity: true,
        encode_content: true,
        remove_endtag_space: true,
        always_close_void: true,
        remove_attr_quote: true,
        remove_comment: true,
        ..Default::default()
    };
    let doc = parse(html).unwrap();
    println!("html: {}", html);
    println!("render: {}", doc.render(&options));
    assert_eq!(expected, doc.render(&options));
}
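
To spell out what the failing test expects: every run of spaces and newlines in a text node outside <pre> should collapse to a single space, with leading and trailing whitespace dropped. Here is a minimal sketch of that rule using only the standard library. `collapse_whitespace` is a name I made up for illustration; it is not part of rphtml and says nothing about how `minify_spaces` is actually implemented.

```rust
/// Collapse every run of whitespace (spaces, tabs, newlines) into a single
/// space and drop leading and trailing whitespace.
/// Hypothetical helper for illustration only; not an rphtml function.
fn collapse_whitespace(text: &str) -> String {
    // `split_whitespace` yields the non-empty chunks between whitespace runs.
    text.split_whitespace().collect::<Vec<&str>>().join(" ")
}
```

Applied to the text node from the test, `collapse_whitespace("\n0\n  1\n  2\n3\n")` gives `"0 1 2 3"`, and `collapse_whitespace(" test ")` gives `"test"`.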

qknight commented Mar 7, 2025

rphtml is at fault! It parses the <p> as node_type: Tag, but because of the enclosing <pre> it should be node_type: Text.

I didn't see this earlier because most of the tests use doc.render, and the <pre> handling is corrected during output generation using:

struct RenderStatus {
	inner_type: RenderStatuInnerType,
	is_in_pre: bool,
	root: bool,
}
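
To illustrate why the render-based tests hide the problem, here is a rough sketch of a renderer that carries an `is_in_pre` flag down the tree and skips minification below a <pre>. The `Node` enum, `render_node` and `render_minified` are hypothetical and heavily simplified; they are not rphtml's types or its actual rendering code, only the general idea behind a flag like `is_in_pre`.

```rust
/// Hypothetical, simplified node type for illustration; not rphtml's `Node`.
enum Node {
    Text(String),
    Element { name: String, children: Vec<Node> },
}

/// Render `node` into `out`, collapsing whitespace in text nodes unless the
/// node sits somewhere below a `<pre>` element.
fn render_node(node: &Node, is_in_pre: bool, out: &mut String) {
    match node {
        // Inside <pre>: emit the text exactly as it was parsed.
        Node::Text(text) if is_in_pre => out.push_str(text),
        // Elsewhere: collapse whitespace runs and trim both ends.
        Node::Text(text) => {
            let words: Vec<&str> = text.split_whitespace().collect();
            out.push_str(&words.join(" "));
        }
        Node::Element { name, children } => {
            // The flag stays set for all descendants of a <pre>.
            let child_in_pre = is_in_pre || name.eq_ignore_ascii_case("pre");
            out.push_str(&format!("<{}>", name));
            for child in children {
                render_node(child, child_in_pre, out);
            }
            out.push_str(&format!("</{}>", name));
        }
    }
}

/// Convenience wrapper that starts outside of any `<pre>`.
fn render_minified(root: &Node) -> String {
    let mut out = String::new();
    render_node(root, false, &mut out);
    out
}
```

With a flag like this the rendered output looks right even when the tree below <pre> contains Tag nodes, which is why a test that only compares rendered strings cannot catch it.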

Proof of parser fault

An HTML parser MUST NOT parse inside a <pre>...</pre>. Its contents must be treated as text only; inner tags may be interpreted by the browser, but for CSS styling only.

#[test]
fn test_childs() -> HResult {
    let code = r##"<pre><p>aaa</p></pre>"##;
    let doc = parse(code)?;
    let root = doc.get_root_node();
    let childs = &root.borrow().childs;
    let childs = childs.as_ref().unwrap();

    // Dump every top-level node so the parsed tree can be inspected.
    for child in childs {
        println!(" - child: {:#?}\n", child);
    }

    // Fail on purpose so that `cargo test` prints the captured stdout.
    assert_eq!(1, 2);
    Ok(())
}
---- test_childs stdout ----
 - child: RefCell {
    value: Node {
        index: 0,
        node_type: Tag,
        begin_at: 0,
        end_at: 0,
        content: None,
        childs: Some(
            [
                RefCell {
                    value: Node {
                        index: 0,
                        node_type: Tag,
                        begin_at: 5,
                        end_at: 5,
                        content: None,
                        childs: Some(
                            [
                                RefCell {
                                    value: Node {
                                        index: 0,
                                        node_type: Text,
                                        begin_at: 8,
                                        end_at: 11,
                                        content: Some(
                                            [
                                                'a',
                                                'a',
                                                'a',
                                            ],
                                        ),
                                        childs: None,
                                        meta: None,
                                        end_tag: None,
                                        parent: true,
                                        root: true,
                                        document: false,
                                    },
                                },
                            ],
                        ),
                        meta: Some(
                            RefCell {
                                value: TagMeta {
                                    code_in: Wait,
                                    is_void: false,
                                    self_closed: false,
                                    auto_fix: false,
                                    name: [
                                        'p',
                                    ],
                                    attrs: [],
                                    lc_name_map: {},
                                },
                            },
                        ),
                        end_tag: Some(
                            RefCell {
                                value: Node {
                                    index: 0,
                                    node_type: TagEnd,
                                    begin_at: 11,
                                    end_at: 15,
                                    content: Some(
                                        [
                                            'p',
                                        ],
                                    ),
                                    childs: None,
                                    meta: None,
                                    end_tag: None,
                                    parent: true,
                                    root: false,
                                    document: false,
                                },
                            },
                        ),
                        parent: true,
                        root: true,
                        document: false,
                    },
                },
            ],
        ),
        meta: Some(
            RefCell {
                value: TagMeta {
                    code_in: Wait,
                    is_void: false,
                    self_closed: false,
                    auto_fix: false,
                    name: [
                        'p',
                        'r',
                        'e',
                    ],
                    attrs: [],
                    lc_name_map: {},
                },
            },
        ),
        end_tag: Some(
            RefCell {
                value: Node {
                    index: 0,
                    node_type: TagEnd,
                    begin_at: 15,
                    end_at: 21,
                    content: Some(
                        [
                            'p',
                            'r',
                            'e',
                        ],
                    ),
                    childs: None,
                    meta: None,
                    end_tag: None,
                    parent: true,
                    root: false,
                    document: false,
                },
            },
        ),
        parent: true,
        root: true,
        document: false,
    },
}
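
For comparison, the text-only treatment of <pre> I argue for above could be sketched like this: once the opening tag has been seen, everything up to the matching close tag is taken as one raw string instead of being tokenized. `raw_pre_content` is a hypothetical helper written for this comment, not an rphtml function, and it deliberately ignores attributes, upper-case tag names and nested <pre> elements.

```rust
/// Return the raw text between the first `<pre>` and the next `</pre>`.
/// Hypothetical sketch: no attributes, no case folding, no nesting.
fn raw_pre_content(html: &str) -> Option<&str> {
    let open = "<pre>";
    let close = "</pre>";
    // Byte offset just past the opening tag.
    let start = html.find(open)? + open.len();
    // Length of the raw content, measured from `start`.
    let len = html[start..].find(close)?;
    Some(&html[start..start + len])
}
```

For the input from test_childs, `raw_pre_content("<pre><p>aaa</p></pre>")` returns `Some("<p>aaa</p>")`, which is the single Text node I would expect to see in the dump instead of a nested Tag.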
