Maremma does not encode UTF-8 strings properly #17

Maremma seems to be affected by the UTF-8 bug described in https://github.com/lostisland/faraday/issues/139.
In short, Excon does not tag the response body as UTF-8. The body then seems to be treated as ASCII, and the `parse_response` method replaces every byte of its non-ASCII characters with `?`.
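
To make the failure mode concrete, here is a minimal sketch in plain Ruby with no Maremma or Excon involved. It assumes the body arrives with UTF-8 bytes but a binary encoding tag; the title string is a hypothetical stand-in, not the exact Crossref text.

```ruby
# Hypothetical title containing multibyte characters (arrow, micro sign, minus sign).
title = "Bs0 \u2192\u00B5+\u00B5\u2212 decay"

# Assumption: the adapter hands back the same bytes tagged as binary, not UTF-8.
body = title.b

# Transcoding from binary to UTF-8 cannot map bytes above 0x7F,
# so each of those bytes is replaced individually.
lossy = body.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

puts lossy
```

Every byte of a multibyte character becomes its own `?`, which would explain runs of `?` like the ones in the output below.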

Example:
```
require "maremma"

url = "https://api.crossref.org/works/10.1038/nature14474/transform/application/vnd.crossref.unixsd+xml"

# raw: true skips the JSON/XML parsing step in parse_response
response = Maremma.get(url, accept: "text/xml;charset=utf-8", raw: true)
```

The returned body (note the `?` characters in the `<title>` element):

```
<?xml version="1.0" encoding="UTF-8"?>
<crossref_result xmlns="http://www.crossref.org/qrschema/3.0" version="3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/qrschema/3.0 http://www.crossref.org/schemas/crossref_query_output3.0.xsd">
  <query_result>
    <head>
      <doi_batch_id>none</doi_batch_id>
    </head>
    <body>
      <query status="resolved">
        <doi type="journal_article">10.1038/nature14474</doi>
        <crm-item name="publisher-name" type="string">Springer Science and Business Media LLC</crm-item>
        <crm-item name="prefix-name" type="string">Springer Science and Business Media LLC</crm-item>
        <crm-item name="member-id" type="number">297</crm-item>
        <crm-item name="citation-id" type="number">75327788</crm-item>
        <crm-item name="journal-id" type="number">3415</crm-item>
        <crm-item name="deposit-timestamp" type="number">20191101103854578</crm-item>
        <crm-item name="owner-prefix" type="string">10.1038</crm-item>
        <crm-item name="last-update" type="date">2019-11-01T11:11:21Z</crm-item>
        <crm-item name="created" type="date">2015-05-12T15:48:08Z</crm-item>
        <crm-item name="citedby-count" type="number">290</crm-item>
        <doi_record>
          <crossref xmlns="http://www.crossref.org/xschema/1.1" xsi:schemaLocation="http://www.crossref.org/xschema/1.1 http://doi.crossref.org/schemas/unixref1.1.xsd">
            <journal>
              <journal_metadata language="en">
                <full_title>Nature</full_title>
                <abbrev_title>Nature</abbrev_title>
                <issn media_type="print">0028-0836</issn>
                <issn media_type="electronic">1476-4687</issn>
              </journal_metadata>
              <journal_issue>
                <publication_date media_type="print">
                  <month>6</month>
                  <year>2015</year>
                </publication_date>
                <journal_volume>
                  <volume>522</volume>
                </journal_volume>
                <issue>7554</issue>
              </journal_issue>
              <journal_article publication_type="full_text">
                <titles>
                  <title>Observation of the rare Bs0 ?????+????? decay from the combined analysis of CMS and LHCb data</title>
                </titles>
...
```

There are probably many ways to solve this, but here are two suggestions.

## Force the encoding

```
# Reopen Maremma and override parse_response
Maremma.class_eval do
  def self.parse_response(string, options = {})
    string = string.dup
    string =
      if options[:skip_encoding]
        string
      else
        # Tag the bytes as UTF-8 before transcoding so that
        # multibyte characters are kept instead of replaced
        string.force_encoding('utf-8').encode(
          Encoding.find("UTF-8"),
          invalid: :replace,
          undef: :replace,
          replace: "?"
        )
      end
    return string if options[:raw]

    from_json(string) || from_xml(string) || from_string(string)
  end
end
```

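Note the addition of `force_encoding('utf-8')`. It relabels the existing bytes as UTF-8 without changing them, so the `encode` call that follows no longer replaces the multibyte characters.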
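
As a quick sanity check of that claim, the sketch below uses only core Ruby. The sample body is an assumption on my part (UTF-8 bytes tagged as binary), not data captured from Maremma.

```ruby
# Assumed input: valid UTF-8 bytes carrying a binary encoding tag.
body = "\u00B5+\u00B5".b

# force_encoding changes only the tag; dup keeps the original string untouched.
retagged = body.dup.force_encoding("UTF-8")

same_bytes = retagged.bytes == body.bytes   # bytes are identical
valid      = retagged.valid_encoding?       # and they form valid UTF-8

# The same encode call as in parse_response now leaves the text intact.
result = retagged.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

puts result
```

The trade-off is that this assumes every response really is UTF-8; a body in another charset would be mislabelled.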

## faraday-encoding middleware

Another option would be to use the faraday-encoding middleware: https://github.com/ma2gedev/faraday-encoding. That is probably a less blunt solution, but I have not tried implementing it.
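
As I understand it, the idea behind such a middleware is to tag the body with the charset announced in the `Content-Type` header instead of always assuming UTF-8. The sketch below illustrates that idea in plain Ruby; `tag_body_encoding` is a hypothetical helper of mine, not part of faraday-encoding or Maremma.

```ruby
# Hypothetical helper: relabel a response body using the charset parameter
# of a Content-Type header value. Returns the body unchanged when no
# charset is given or when Ruby does not know the charset name.
def tag_body_encoding(body, content_type)
  charset = content_type.to_s[/charset\s*=\s*"?([^\s;"]+)/i, 1]
  return body if charset.nil?

  begin
    body.dup.force_encoding(Encoding.find(charset))
  rescue ArgumentError
    body # unknown charset name: leave the body untouched
  end
end

binary_body = "\u00B5".b
tagged = tag_body_encoding(binary_body, "text/xml;charset=utf-8")
puts tagged.encoding
```

Compared with forcing UTF-8 everywhere, this respects what the server declares, at the cost of trusting the header to be correct.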
