The nodecontent function changes the encoding format of wide characters when they are processed, resulting in a garbled display. #184
Comments
The bug can be roughly fixed by replacing "encoding" with "utf-8" in the Julia wrapper:
function parsehtml(htmlstring::AbstractString)
    if isempty(htmlstring)
        throw(ArgumentError("empty HTML string"))
    end
    url = C_NULL
    encoding = C_NULL
    options = 1
    doc_ptr = @check ccall(
        (:htmlReadMemory, libxml2),
        Ptr{_Node},
        (Cstring, Cint, Cstring, Cstring, Cint),
        htmlstring, sizeof(htmlstring), url, "utf-8", options) != C_NULL
    show_warnings()
    return Document(doc_ptr)
end
We just had the same problem using pkg> add [email protected] Note that versions Then:
julia> using EzXML
julia> doc = EzXML.parsehtml("<body><p>hello</p><p>中国</p><p>深圳</p></body>")
EzXML.Document(EzXML.Node(<HTML_DOCUMENT_NODE@0x0000000001afee70>))
julia> primates = root(doc)
EzXML.Node(<ELEMENT_NODE[html]@0x0000000001c9f680>)
julia> for p in eachelement(primates)
           println(nodecontent(p))
       end
hello中国深圳
Of course this is also a problem when using umlauts. Not sure whether this is already (or should be) in the scope of https://gitlab.gnome.org/GNOME/libxml2/-/issues
Coming back to this rather old topic.
function parsehtml(htmlstring::AbstractString; encoding::String = "utf-8")
    if isempty(htmlstring)
        throw(ArgumentError("empty HTML string"))
    end
    url = C_NULL
    options = 1
    doc_ptr = @check ccall(
        (:htmlReadMemory, libxml2),
        Ptr{_Node},
        (Cstring, Cint, Cstring, Cstring, Cint),
        htmlstring, sizeof(htmlstring), url, encoding, options) != C_NULL
    show_warnings()
    return Document(doc_ptr)
end
Interestingly,
julia> parsehtml("äöüϕ", ).root |> nodecontent
"äöüÏ\u95"
julia> parsexml("<xml>äöüϕ</xml>", ).root |> nodecontent
"äöüϕ"
Happy to provide a PR including tests, if desired.
Found the critical code in libxml2:
xmlDetectEncoding(ctxt);
/*
 * This is wrong but matches long-standing behavior. In most cases,
 * a document starting with an XML declaration will specify UTF-8.
 */
if (((ctxt->input->flags & XML_INPUT_HAS_ENCODING) == 0) &&
    (xmlStrncmp(ctxt->input->cur, BAD_CAST "<?xm", 4) == 0))
    xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_UTF8);
And I verified that
julia> parsehtml("<?xml>äöüϕ", ).root |> nodecontent
"äöüϕ"
produces the expected result.
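Until a fix lands, the observation above suggests a stopgap: make the input start with "<?xm" so that libxml2 switches to UTF-8 on its own. A minimal sketch of that idea; `with_utf8_hint` is a hypothetical helper (not part of EzXML), and it has only been reasoned about against the single example in this comment, so the extra prefix may well affect the parsed tree in other cases:

```julia
# Hypothetical helper, not part of EzXML or libxml2.
# Ensures the input begins with "<?xm", the 4-byte marker that the libxml2
# snippet quoted above compares against before switching to UTF-8.
function with_utf8_hint(html::AbstractString)
    # Leave input alone if it already starts with the marker.
    startswith(html, "<?xm") ? String(html) : "<?xml>" * html
end

println(with_utf8_hint("äöüϕ"))   # the string that was verified to parse correctly above
```

Assuming EzXML is loaded, the intended use would be parsehtml(with_utf8_hint(s)).root |> nodecontent. Passing the encoding explicitly to htmlReadMemory, as in the patches above, still looks like the cleaner fix.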
Output:
helloä¸åæ·±å³
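The garbled strings in this thread are consistent with the UTF-8 bytes of the input being decoded one byte at a time as Latin-1 (ISO-8859-1). A minimal sketch in plain Julia, with no EzXML involved, that reproduces the "äöüÏ\u95" result quoted earlier; both helper names are made up for illustration, and this only mimics the symptom, it does not call libxml2:

```julia
# Hypothetical helpers for illustration; not part of EzXML or libxml2.

# Treat every UTF-8 byte of `s` as one Latin-1 character (code point == byte value).
latin1_misread(s::AbstractString) = join(Char(b) for b in codeunits(s))

# Reverse the damage: map each character back to a single byte and
# reinterpret the bytes as UTF-8. Throws InexactError if a character is above U+00FF.
undo_latin1_misread(s::AbstractString) = String(UInt8[UInt8(c) for c in s])

garbled = latin1_misread("äöüϕ")
println(repr(garbled))                        # "äöüÏ\u95"
println(undo_latin1_misread(garbled))         # äöüϕ
```

The last byte of ϕ (0x95) maps to a C1 control character, which is why it shows up as the \u95 escape and why part of the garbled text can look like it is missing when printed.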