Data is double-encoded #5
Workaround is to use:

```shell
for file in IRAhandle_tweets_*.csv; do
    echo -n "Converting $file... "
    iconv -f utf8 -t latin1 "$file" > "$file.corrected" &&
        mv -f "$file.corrected" "$file"
    echo "Done"
done
```

This decodes the data once, then writes the result out as Latin-1 (mapping Unicode code points to bytes one-to-one), giving us single-encoded UTF-8 again. It also shaves off 10% of the total byte count, dropping the data from 731 MB to 656 MB.
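The same round-trip can be sketched in Python. This is a minimal illustration of the fix, not the script used on the repository; the sample string and the simulated double encoding are assumptions for the demo:

```python
# Simulate the problem: correctly encoded UTF-8 bytes are mistakenly
# re-interpreted as Latin-1 text and encoded to UTF-8 a second time.
original = "#StandForOurAnthem🇺🇸"  # example text from the issue
double_encoded = original.encode("utf-8").decode("latin-1").encode("utf-8")

# The fix mirrors `iconv -f utf8 -t latin1`: decode UTF-8 once, then
# map each code point back to a single byte via Latin-1.
fixed = double_encoded.decode("utf-8").encode("latin-1")
print(fixed.decode("utf-8"))  # → #StandForOurAnthem🇺🇸
```

Latin-1 works as the "identity" step here because decoding double-encoded UTF-8 once yields only code points 0–255, each of which Latin-1 maps back to exactly one byte.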
Thank you for your suggestion @mjpieters. We have updated the data to remove the double encoding using the script you suggested.
Cool work; it seems there is more to do, though (if we can recover this): #20
The data is double-encoded to UTF-8. For example, line 5 of the first file contains, in part (non-ASCII bytes represented by `\xhh` escape sequences for readability):

Those bytes are each UTF-8 sequences encoding UTF-8 bytes; decoding them once gives us:

which in turn can be decoded as UTF-8 to the text `#StandForOurAnthem🇺🇸`.

This double encoding makes the files needlessly bigger and harder to work with.
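The size cost of double encoding is easy to see in Python. A minimal sketch, using the example text above (the exact byte counts are for this sample string only, not the full files):

```python
text = "#StandForOurAnthem🇺🇸"

once = text.encode("utf-8")   # correct, single-encoded UTF-8
# Re-interpret those bytes as Latin-1 and encode again: every byte
# >= 0x80 now costs two bytes instead of one.
twice = once.decode("latin-1").encode("utf-8")

print(len(once), len(twice))  # → 26 34
```

Every non-ASCII byte doubles in size on the second encoding pass, which is where the roughly 10% bloat across the CSV files comes from.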