Maremma seems to be suffering from this UTF-8 bug: lostisland/faraday#139
Basically, Excon does not mark the response body as UTF-8. As a result the string is treated as ASCII, and its special characters are replaced with "?" in the parse_response method.
Example:
url = "https://api.crossref.org/works/10.1038/nature14474/transform/application/vnd.crossref.unixsd+xml"
response = Maremma.get(url, accept: "text/xml;charset=utf-8", raw: true)
<?xml version="1.0" encoding="UTF-8"?>
<crossref_result xmlns="http://www.crossref.org/qrschema/3.0" version="3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/qrschema/3.0 http://www.crossref.org/schemas/crossref_query_output3.0.xsd">
<query_result>
<head>
<doi_batch_id>none</doi_batch_id>
</head>
<body>
<query status="resolved">
<doi type="journal_article">10.1038/nature14474</doi>
<crm-item name="publisher-name" type="string">Springer Science and Business Media LLC</crm-item>
<crm-item name="prefix-name" type="string">Springer Science and Business Media LLC</crm-item>
<crm-item name="member-id" type="number">297</crm-item>
<crm-item name="citation-id" type="number">75327788</crm-item>
<crm-item name="journal-id" type="number">3415</crm-item>
<crm-item name="deposit-timestamp" type="number">20191101103854578</crm-item>
<crm-item name="owner-prefix" type="string">10.1038</crm-item>
<crm-item name="last-update" type="date">2019-11-01T11:11:21Z</crm-item>
<crm-item name="created" type="date">2015-05-12T15:48:08Z</crm-item>
<crm-item name="citedby-count" type="number">290</crm-item>
<doi_record>
<crossref xmlns="http://www.crossref.org/xschema/1.1" xsi:schemaLocation="http://www.crossref.org/xschema/1.1 http://doi.crossref.org/schemas/unixref1.1.xsd">
<journal>
<journal_metadata language="en">
<full_title>Nature</full_title>
<abbrev_title>Nature</abbrev_title>
<issn media_type="print">0028-0836</issn>
<issn media_type="electronic">1476-4687</issn>
</journal_metadata>
<journal_issue>
<publication_date media_type="print">
<month>6</month>
<year>2015</year>
</publication_date>
<journal_volume>
<volume>522</volume>
</journal_volume>
<issue>7554</issue>
</journal_issue>
<journal_article publication_type="full_text">
<titles>
<title>Observation of the rare Bs0 ?????+????? decay from the combined analysis of CMS and LHCb data</title>
</titles>
...
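The mangled title above can be reproduced without any network access. This is a minimal sketch, assuming the body arrives tagged as binary (ASCII-8BIT) and is then transcoded with the same replace options that appear in parse_response. The variable names are only for illustration.

```ruby
# The characters that went missing from the title: "→µ+µ−".
title_fragment = "\u2192\u00B5+\u00B5\u2212"

# Assumption: the HTTP adapter hands back valid UTF-8 bytes, but the string
# is tagged as binary. String#b returns such a binary-tagged copy.
raw_body = title_fragment.b

# Transcoding binary to UTF-8 with replacement: every byte >= 0x80 has no
# defined conversion, so each single byte becomes one "?".
mangled = raw_body.encode(
  Encoding::UTF_8,
  invalid: :replace,
  undef: :replace,
  replace: "?"
)

puts mangled # prints "?????+?????"
```

The arrow and the minus sign are three bytes each and µ is two bytes, which accounts for the five question marks on either side of the "+".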
I think there are lots of ways to solve this, but here are two suggestions:
Force the encoding
Maremma.class_eval do
  def self.parse_response(string, options = {})
    string = string.dup
    string =
      if options[:skip_encoding]
        string
      else
        string.force_encoding('utf-8').encode(
          Encoding.find("UTF-8"),
          invalid: :replace,
          undef: :replace,
          replace: "?"
        )
      end
    return string if options[:raw]
    from_json(string) || from_xml(string) || from_string(string)
  end
end
Note the addition of force_encoding('utf-8').
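To see the effect of the relabeling in isolation, here is a small standalone sketch. normalize_utf8 is a hypothetical helper, not part of Maremma, and it swaps in String#scrub for the encode call to replace any bytes that are not valid UTF-8, so it is a variation on the patch rather than a copy of it.

```ruby
# Hypothetical stand-in for the encoding step of parse_response.
def normalize_utf8(string)
  # dup so the caller's string is not mutated. force_encoding only relabels
  # the bytes as UTF-8; scrub then replaces byte sequences that are invalid.
  string.dup.force_encoding(Encoding::UTF_8).scrub("?")
end

# Valid UTF-8 bytes that were tagged as binary survive intact.
fixed = normalize_utf8("\u2192\u00B5+\u00B5\u2212".b)

# A byte that is not valid UTF-8 is still replaced with "?".
scrubbed = normalize_utf8("ok \xFF".b)

puts fixed
puts scrubbed
```

Relabeling is only safe if the bytes really are UTF-8, which is why this is the blunt option.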
faraday-encoding middleware
Another option would be to use the faraday-encoding middleware. That's probably a less blunt solution, but I didn't try implementing it. https://github.com/ma2gedev/faraday-encoding
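The general idea behind a charset-aware approach can be sketched with the standard library alone: read the charset from the Content-Type header and tag the body accordingly, instead of always assuming UTF-8. This is not the middleware's actual code. tag_with_charset and its default argument are hypothetical names used only for this illustration.

```ruby
# Hypothetical helper: tag a response body with the charset declared in its
# Content-Type header, falling back to a default when none is usable.
def tag_with_charset(body, content_type, default: Encoding::UTF_8)
  # Pull out the charset parameter, e.g. "utf-8" from "text/xml;charset=utf-8".
  charset = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1]

  encoding =
    begin
      charset ? Encoding.find(charset) : default
    rescue ArgumentError
      # Encoding.find raises ArgumentError for names Ruby does not know.
      default
    end

  # Relabel a copy of the body; the bytes themselves are left untouched.
  body.dup.force_encoding(encoding)
end

utf8_body   = tag_with_charset("\u00B5".b, "text/xml;charset=utf-8")
latin1_body = tag_with_charset("\xB5".b, "text/plain; charset=ISO-8859-1")
fallback    = tag_with_charset("abc".b, "text/plain; charset=no-such-charset")

puts utf8_body.encoding
puts latin1_body.encode(Encoding::UTF_8)
puts fallback.encoding
```

Compared with forcing UTF-8 everywhere, this respects servers that declare a different charset, at the cost of trusting the header.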