Data is double-encoded #5
Workaround is to use:

```shell
for file in IRAhandle_tweets_*.csv; do
    echo -n "Converting $file... "
    iconv -f utf8 -t latin1 "$file" > "$file.corrected" &&
        mv -f "$file.corrected" "$file"
    echo "Done"
done
```

This decodes the data once, then writes the result out as Latin-1 (mapping Unicode code points to bytes one-to-one), giving us single-encoded UTF-8 again. It also shaves off 10% of the total byte count, dropping the data from 731 MB to 656 MB.
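The same round-trip can be sketched in Python. This is a minimal illustration of the fix, not the script used on the repository; the sample string and the simulated double encoding are assumptions for the demo:

```python
# Simulate the problem: correctly encoded UTF-8 bytes are mistakenly
# re-interpreted as Latin-1 text and encoded to UTF-8 a second time.
original = "#StandForOurAnthem🇺🇸"  # example text from the issue
double_encoded = original.encode("utf-8").decode("latin-1").encode("utf-8")

# The fix mirrors `iconv -f utf8 -t latin1`: decode UTF-8 once, then
# map each code point back to a single byte via Latin-1.
fixed = double_encoded.decode("utf-8").encode("latin-1")
print(fixed.decode("utf-8"))  # → #StandForOurAnthem🇺🇸
```

Latin-1 works as the "identity" step here because decoding double-encoded UTF-8 once yields only code points 0–255, each of which Latin-1 maps back to exactly one byte.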
Thank you for your suggestion @mjpieters. We have updated the data to remove the double encoding using the script you suggested.
Cool work; it seems there is more to do, though (if we can recover this): #20
The data is double-encoded to UTF-8. For example, line 5 of the first file contains, in part (non-ASCII bytes represented by `\xhh` escape sequences for readability):

Those bytes are each UTF-8 sequences encoding UTF-8 bytes; decoding them once gives us:

which in turn can be decoded as UTF-8 to the text `#StandForOurAnthem🇺🇸`.

This double encoding makes the files needlessly bigger and harder to work with.
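The size cost of double encoding is easy to see in Python. A minimal sketch, using the example text above (the exact byte counts are for this sample string only, not the full files):

```python
text = "#StandForOurAnthem🇺🇸"

once = text.encode("utf-8")   # correct, single-encoded UTF-8
# Re-interpret those bytes as Latin-1 and encode again: every byte
# >= 0x80 now costs two bytes instead of one.
twice = once.decode("latin-1").encode("utf-8")

print(len(once), len(twice))  # → 26 34
```

Every non-ASCII byte doubles in size on the second encoding pass, which is where the roughly 10% bloat across the CSV files comes from.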