The nodecontent function changes the encoding format of wide characters when they are processed, resulting in a garbled display. #184

deahhh · 2023-10-10T04:20:58Z

using EzXML

doc = EzXML.parsehtml("<body><p>hello</p><p>中国</p><p>深圳</p></body>")

primates = root(doc)

for p in eachelement(primates)
    println(nodecontent(p))
end

julia draft.jl

Output:
hello
ä¸å
æ·±å³

deahhh · 2023-10-10T05:27:59Z

The bug can be fixed, roughly, by passing "utf-8" instead of the encoding variable to htmlReadMemory in EzXML's parsehtml:

function parsehtml(htmlstring::AbstractString)
    if isempty(htmlstring)
        throw(ArgumentError("empty HTML string"))
    end
    url = C_NULL
    encoding = C_NULL
    options = 1
    doc_ptr = @check ccall(
        (:htmlReadMemory, libxml2),
        Ptr{_Node},
        (Cstring, Cint, Cstring, Cstring, Cint),
        htmlstring, sizeof(htmlstring), url, "utf-8", options) != C_NULL
    show_warnings()
    return Document(doc_ptr)
end

noxthot · 2023-10-31T12:21:33Z

We just had the same problem using Genie.jl and traced the root cause to the new version of XML2_jll.jl, v2.11.5. Pinning that package to the previously released version, v2.10.4, makes the problem disappear:

pkg> add XML2_jll@2.10.4

Note that versions 2.11.0 to 2.11.4 were never published by XML2_jll.jl, so they cannot be tested directly.

Then:

julia> using EzXML

julia> doc = EzXML.parsehtml("<body><p>hello</p><p>中国</p><p>深圳</p></body>")
EzXML.Document(EzXML.Node(<HTML_DOCUMENT_NODE@0x0000000001afee70>))

julia> primates = root(doc)
EzXML.Node(<ELEMENT_NODE[html]@0x0000000001c9f680>)

julia> for p in eachelement(primates)
           println(nodecontent(p))
       end
hello
中国
深圳

Of course this is also a problem when using umlauts.

Not sure whether this is already covered by (or should be reported as) an issue at https://gitlab.gnome.org/GNOME/libxml2/-/issues

hhaensel · 2024-12-31T14:05:01Z

Coming back to this rather old topic.
Wouldn't it be a good idea to expose a keyword argument encoding on parsehtml() and have it default to "utf-8"?
That is the encoding of Julia's String type, and people will expect parsing to simply work with any Julia string.

function parsehtml(htmlstring::AbstractString; encoding::String = "utf-8")
    if isempty(htmlstring)
        throw(ArgumentError("empty HTML string"))
    end
    url = C_NULL
    options = 1
    doc_ptr = @check ccall(
        (:htmlReadMemory, libxml2),
        Ptr{_Node},
        (Cstring, Cint, Cstring, Cstring, Cint),
        htmlstring, sizeof(htmlstring), url, encoding, options) != C_NULL
    show_warnings()
    return Document(doc_ptr)
end

Interestingly, parsexml() doesn't show this problem, although the underlying code in libxml2 looks quite similar.

julia> parsehtml("äöüϕ").root |> nodecontent
"Ã¤Ã¶Ã¼Ï\u95"

julia> parsexml("<xml>äöüϕ</xml>").root |> nodecontent
"äöüϕ"

Happy to provide a PR including tests, if desired.
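
The garbled parsehtml() result above is exactly what you get when the UTF-8 bytes of the input are each read as one Latin-1 (ISO-8859-1) character. That this is what libxml2 does internally is an inference from the output, not something confirmed in this thread. A minimal sketch in plain Julia, with no EzXML involved and with hypothetical helper names latin1_misread and undo_misread:

```julia
# Hypothetical helper: treat every UTF-8 byte of `s` as one Latin-1 character.
latin1_misread(s::AbstractString) = join(Char(b) for b in codeunits(s))

# Hypothetical helper: map each character back to a single byte and decode
# the bytes as UTF-8. Throws an InexactError for code points above 0xFF.
undo_misread(s::AbstractString) = String(UInt8[UInt8(codepoint(c)) for c in s])

garbled = latin1_misread("äöüϕ")

# Same string as the parsehtml() REPL output quoted in this comment.
@assert garbled == "Ã¤Ã¶Ã¼Ï\u95"

# The misreading is lossless, so the original text can be recovered.
@assert undo_misread(garbled) == "äöüϕ"
```

Because the misreading is reversible, already-garbled strings can be repaired this way, but that only treats the symptom; the parser still needs to be told the input is UTF-8.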

hhaensel · 2024-12-31T14:47:56Z

Found the critical code in libxml2:
https://gitlab.gnome.org/GNOME/libxml2/-/blob/2.13/HTMLparser.c?ref_type=heads#L4761-4769

    xmlDetectEncoding(ctxt);

    /*
     * This is wrong but matches long-standing behavior. In most cases,
     * a document starting with an XML declaration will specify UTF-8.
     */
    if (((ctxt->input->flags & XML_INPUT_HAS_ENCODING) == 0) &&
        (xmlStrncmp(ctxt->input->cur, BAD_CAST "<?xm", 4) == 0))
        xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_UTF8);

And I verified that

julia> parsehtml("<?xml>äöüϕ").root |> nodecontent
"äöüϕ"

produces the expected result.
So all in all, I'd recommend introducing the encoding keyword argument to both parsexml() and parsehtml(), defaulting it to C_NULL for parsexml() and to "utf-8" for parsehtml().
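
Until parsehtml() accepts an encoding, the observation above suggests a stopgap that needs no change to EzXML: prefix the input so that libxml2 takes the "<?xm" branch and switches to UTF-8. This is only a sketch. The wrapper name parsehtml_utf8 is hypothetical, and it leans on behavior that the libxml2 source itself calls wrong, so it may stop working in a later release.

```julia
using EzXML

# Hypothetical wrapper, not part of EzXML: prepend "<?xml>" so that the
# libxml2 HTML parser switches to UTF-8, as verified above for "<?xml>äöüϕ".
parsehtml_utf8(html::AbstractString) = parsehtml("<?xml>" * html)

doc = parsehtml_utf8("äöüϕ")
println(nodecontent(root(doc)))
```

Pinning XML2_jll to v2.10.4, as described earlier in this thread, avoids the problem without touching the input.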

noxthot mentioned this issue Oct 31, 2023

Bug: Genie.Renderer.Html.html changes encoding when used with filepath GenieFramework/Genie.jl#687

Closed
