external_author_id is rounded as a floating point #4
Comments
This can be seen in the first line in |
This only seems to be an issue in IRAhandle_tweets_1.csv. |
@edsu I see it sporadically in the other files as well. A grep for scientific notation in the first field returned 461,421 lines across the whole collection. |
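For anyone who wants to repeat that count without grep, here is a minimal Python sketch. It assumes the ID is the first CSV field, as in the grep above; the `author` column name and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
import re

# Matches values such as 9.06000000000e+17: mantissa, 'e', signed exponent.
SCI_NOTATION = re.compile(r"^\d+(?:\.\d+)?[eE][+-]?\d+$")


def count_sci_notation_ids(lines):
    """Count data rows whose first field is written in scientific notation."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    return sum(
        1 for row in reader if row and SCI_NOTATION.match(row[0].strip())
    )


# Hypothetical two-row sample; real files would be opened with open(path, newline="").
sample = io.StringIO(
    "external_author_id,author\n"
    "9.06000000000e+17,EXAMPLE_A\n"
    "2535818742,EXAMPLE_B\n"
)
print(count_sci_notation_ids(sample))
```

Using the `csv` module rather than splitting on commas keeps quoted fields from being miscounted.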
Observed that (a) all of these are e+17, no other exponent; (b) none of the 'integer' fields are anywhere close to 18 digits long; (c) many of the e+17 values contain only a few nonzero digits and a lot of trailing zeros, e.g. 9.06000000000e+17 |
This is a common problem that often arises when importing/exporting Twitter ID fields from/to Excel. In my experience, when this happens, converting the "+17" observations back to integers does not restore the original IDs. |
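A small Python sketch of why the round trip fails. The ID below is a made-up 18-digit number, not one from the dataset, and the 11-decimal format is an assumption chosen to match the shape of values like 9.06000000000e+17 seen in the files.

```python
from decimal import Decimal

# Hypothetical 18-digit ID, used only to illustrate the loss.
original_id = 906000000123456789

# An export in scientific notation keeps only 12 significant digits here.
exported = f"{original_id:.11e}"

# Parsing the text back exactly (no further float rounding) still cannot
# recover the digits that were dropped at export time.
recovered = int(Decimal(exported))

print(exported)
print(recovered)
print(original_id - recovered)

# Even without the text export, a 64-bit float cannot hold every 18-digit
# integer: it has only 53 bits of mantissa.
print(int(float(original_id)) == original_id)
```

The low-order digits are gone from the file itself, so the only fix is to look the IDs up from another source, which is what the rest of this thread is about.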
I've also seen this happen when using jq. |
Sorry, everyone. I learned this the hard way. I think we can fix them, at least for the vast majority of the accounts. |
@patrick-lee-warren There are 454 authors whose correct external_author_id_rounding_fixes.zip The CSV has three columns:
If this data would be more useful in a different format, please let me know. Unfortunately this method doesn’t work for authors who were newly added to the June 2018 list, because that PDF doesn’t include user IDs. I actually called the House Intelligence Committee’s Minority Staff office to ask if they had this information and could make it public. According to the person I spoke with, they just don’t have it… Twitter only provided the Committee with IRA-linked account names (and not corresponding user IDs) for the recent 2018 list. In other words, the Committee just publishes whatever info they get from Twitter, so blame Twitter for not including user IDs in their updated list. I’ve had moderate success uncovering some of these |
I have figured out a way to get almost all of them, using some data I didn't originally provide but want to provide (the article url). I've sent the update to 538 and they are getting it processed to update. Thanks for all your work on this. |
@patrick-lee-warren What’s missing from your list? Literally within the past 10 minutes, I think I stumbled upon a reliable way to retrieve the user ID for any suspended account Edit: It worked! I’ve resolved full user IDs for every author in the dataset, save one. The only exception is author HENY_AMBERH with |
@patrick-lee-warren do you have a fork? This is the kind of thing that would be super useful as a pull request if you've resolved and/or corrected for Captain Genius opening it up and saving it in Excel. |
@bet4a do you have a fork? And if so that would be super useful. Just rename the other one with stupid scientific notation to something sane (like -1) so we can load them all in a signed bigint column. |
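As a rough illustration of the -1 idea, here is a Python sketch that maps any ID still in scientific notation to a sentinel and checks that everything else fits a signed 64-bit integer. The function name `to_bigint` and the constants are hypothetical, and this is not the patch linked later in the thread.

```python
import re

SCI_NOTATION = re.compile(r"^\d+(?:\.\d+)?[eE][+-]?\d+$")
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UNKNOWN_ID = -1  # sentinel for IDs whose digits were lost to rounding


def to_bigint(raw):
    """Return the ID as an int that fits a signed 64-bit column, or UNKNOWN_ID."""
    value = raw.strip()
    if SCI_NOTATION.match(value):
        # The low-order digits are unrecoverable, so do not pretend otherwise.
        return UNKNOWN_ID
    parsed = int(value)  # raises ValueError on anything that is not an integer
    if not INT64_MIN <= parsed <= INT64_MAX:
        raise ValueError(f"ID out of signed 64-bit range: {value}")
    return parsed


print(to_bigint("9.06000000000e+17"))
print(to_bigint("2535818742"))
```

A sentinel keeps the column type simple, at the cost of collapsing all unresolved authors into one value; a nullable column would be the alternative if the loader allows it.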
@bet4a I was unable to get this working, even after applying the patch I have 79 distinct authors with scientific notation in their |
I've committed a patch and schema for this, I can dump the datafiles if we want an updated copy. https://github.com/EvanCarroll/russian-troll-tweets/blob/master/PostgreSQL/patch.psql |
I'm new to GitHub, but I hope I did this right. I forked the original repository, and created a new version that fixes many of the problems pointed out, including the rounding problem, as well as adding several new variables. I put in a pull request to ask 538 to update the main branch with it, but ??? It's https://github.com/patrick-lee-warren/russian-troll-tweets/tree/Version_2 if you want to play around with it. |
They are in the format of "9.06000000000e+17" which I assume is incorrect and instead should be a "large integer".