Maremma does not encode utf8 string properly #17

@orangewolf

Maremma seems to be affected by the UTF-8 bug described in lostisland/faraday#139.
Basically, Excon does not tag the response body as UTF-8. The string is then treated as ASCII, and its non-ASCII characters are replaced with "?" in the parse_response method.

Example:

url = "https://api.crossref.org/works/10.1038/nature14474/transform/application/vnd.crossref.unixsd+xml"

response = Maremma.get(url, accept: "text/xml;charset=utf-8", raw: true)
<?xml version="1.0" encoding="UTF-8"?>
<crossref_result xmlns="http://www.crossref.org/qrschema/3.0" version="3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/qrschema/3.0 http://www.crossref.org/schemas/crossref_query_output3.0.xsd">
  <query_result>
    <head>
      <doi_batch_id>none</doi_batch_id>
    </head>
    <body>
      <query status="resolved">
        <doi type="journal_article">10.1038/nature14474</doi>
        <crm-item name="publisher-name" type="string">Springer Science and Business Media LLC</crm-item>
        <crm-item name="prefix-name" type="string">Springer Science and Business Media LLC</crm-item>
        <crm-item name="member-id" type="number">297</crm-item>
        <crm-item name="citation-id" type="number">75327788</crm-item>
        <crm-item name="journal-id" type="number">3415</crm-item>
        <crm-item name="deposit-timestamp" type="number">20191101103854578</crm-item>
        <crm-item name="owner-prefix" type="string">10.1038</crm-item>
        <crm-item name="last-update" type="date">2019-11-01T11:11:21Z</crm-item>
        <crm-item name="created" type="date">2015-05-12T15:48:08Z</crm-item>
        <crm-item name="citedby-count" type="number">290</crm-item>
        <doi_record>
          <crossref xmlns="http://www.crossref.org/xschema/1.1" xsi:schemaLocation="http://www.crossref.org/xschema/1.1 http://doi.crossref.org/schemas/unixref1.1.xsd">
            <journal>
              <journal_metadata language="en">
                <full_title>Nature</full_title>
                <abbrev_title>Nature</abbrev_title>
                <issn media_type="print">0028-0836</issn>
                <issn media_type="electronic">1476-4687</issn>
              </journal_metadata>
              <journal_issue>
                <publication_date media_type="print">
                  <month>6</month>
                  <year>2015</year>
                </publication_date>
                <journal_volume>
                  <volume>522</volume>
                </journal_volume>
                <issue>7554</issue>
              </journal_issue>
              <journal_article publication_type="full_text">
                <titles>
                  <title>Observation of the rare Bs0 ?????+????? decay from the combined analysis of CMS and LHCb data</title>
                </titles>
...
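The run of question marks in the title above is consistent with a binary-tagged string being passed through encode with replacement options. Here is a minimal sketch of that effect using only core Ruby. The sample title text is an assumption chosen for illustration, not taken from the Crossref response, and no Maremma or Excon code is involved.

```ruby
# Sample text with multi-byte characters (assumed for illustration).
title = "Bs0 \u2192\u00B5+\u00B5\u2212 decay"

# String#b returns a copy tagged ASCII-8BIT (binary), which stands in here
# for a response body that arrives without UTF-8 tagging.
body = title.b

# Converting binary to UTF-8 with replacement turns every byte >= 0x80 into "?".
mangled = body.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
# => "Bs0 ?????+????? decay"

# Re-tagging the same bytes as UTF-8 keeps the characters intact.
repaired = body.dup.force_encoding("UTF-8")
```

Each "?" corresponds to one byte, so a three-byte character followed by a two-byte character produces five question marks.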

I think there are lots of ways to solve this, but here are two suggestions:

Force the encoding

Maremma.class_eval do
  def self.parse_response(string, options = {})
    # Work on a copy so force_encoding does not mutate the caller's string.
    string = string.dup
    unless options[:skip_encoding]
      # Tag the raw bytes as UTF-8 before encoding; without this step the
      # non-ASCII bytes are replaced with "?".
      string = string.force_encoding('utf-8').encode(
        Encoding.find("UTF-8"),
        invalid: :replace,
        undef: :replace,
        replace: "?"
      )
    end
    return string if options[:raw]

    from_json(string) || from_xml(string) || from_string(string)
  end
end

Note the addition of force_encoding('utf-8').

faraday-encoding middleware

Another option would be to use the faraday-encoding middleware. That's probably a less blunt solution, but I didn't try implementing it. https://github.com/ma2gedev/faraday-encoding
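To illustrate why honoring the declared charset is less blunt than always forcing UTF-8, here is a rough sketch in plain Ruby. The helper tag_with_declared_charset is hypothetical: it is not part of Maremma, Faraday, or faraday-encoding, and it only approximates the general idea of reading the charset from the Content-Type header.

```ruby
# Hypothetical helper (an assumption for illustration, not a library API).
# Tags a response body with the charset named in its Content-Type header,
# falling back to UTF-8 when the header has no charset or names an unknown one.
def tag_with_declared_charset(body, content_type, default: Encoding::UTF_8)
  # Pull the charset label out of e.g. "text/xml;charset=utf-8".
  charset = content_type.to_s[/charset=\s*"?([^\s;"]+)/i, 1]
  encoding =
    begin
      charset ? Encoding.find(charset) : default
    rescue ArgumentError
      default # unknown charset label: fall back instead of raising
    end
  # Re-tag a copy of the bytes; no transcoding happens here.
  body.dup.force_encoding(encoding)
end

utf8_body   = tag_with_declared_charset("caf\u00E9".b, "text/xml;charset=utf-8")
latin1_body = tag_with_declared_charset("caf\xE9".b, "text/html; charset=ISO-8859-1")
```

With this approach an ISO-8859-1 response is tagged correctly and can be transcoded afterwards with latin1_body.encode("UTF-8"), whereas forcing UTF-8 on those bytes would yield an invalid string.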
