external_author_id is rounded as a floating point #4

Closed
mthomas opened this issue Jul 31, 2018 · 15 comments · Fixed by #28

Comments

@mthomas

mthomas commented Jul 31, 2018

They are in the format of "9.06000000000e+17" which I assume is incorrect and instead should be a "large integer".

@akobyl

akobyl commented Jul 31, 2018

This can be seen in the first line in IRAhandle_tweets_1.csv for example

@edsu

edsu commented Jul 31, 2018

This only seems to be an issue in IRAhandle_tweets_1.csv.

@driscoll

driscoll commented Aug 1, 2018

@edsu I see it sporadically in the other files as well. A grep for scientific notation in the first field returned 461,421 lines across the whole collection.
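A count like this can be reproduced without grep. The following is a minimal sketch, assuming external_author_id is the first CSV column and that affected values look like 9.06000000000e+17. The regular expression, the function name count_sci_rows, and the sample rows (including the handles and the plain integer ID) are illustrative assumptions, not the command or data actually used above.

```python
import csv
import io
import re

# Assumed shape of an affected ID, e.g. "9.06000000000e+17".
SCI_NOTATION = re.compile(r"^\d(\.\d+)?[eE][+-]?\d+$")

def count_sci_rows(lines):
    """Count data rows whose first field is in scientific notation."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row and SCI_NOTATION.match(row[0]))

# Tiny made-up sample; for a real file, pass open(path, newline="").
sample = io.StringIO(
    "external_author_id,author\n"
    "9.06000000000e+17,EXAMPLE_A\n"
    "2535564756,EXAMPLE_B\n"
)
print(count_sci_rows(sample))
```

Summing the result over every file in the collection would give a total comparable to the grep count, provided the same pattern is used.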

@gsmith-to

Observed that (a) all of these are e+17, with no other exponent; (b) none of the 'integer' fields are anywhere close to 18 digits long; (c) many of the e+17 values contain only a few nonzero digits and a lot of trailing zeros, e.g. 9.06000000000e+17.
Though I did see 8.02673000000e+17.
It looks like these numbers were originally left-justified and zero-padded in an 18-character field and then converted to floats, i.e. 906000000000000000 and 802673000000000000.
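These observations can be checked mechanically. Below is a small sketch, assuming the values are parsed with Python's decimal module so that no binary floating-point rounding is introduced; the helper name describe is hypothetical.

```python
from decimal import Decimal

def describe(value: str):
    """Return (exponent, significant digit count, expanded integer)."""
    d = Decimal(value)
    # Drop trailing zeros to see how many digits carry information.
    significant = "".join(map(str, d.as_tuple().digits)).rstrip("0") or "0"
    return d.adjusted(), len(significant), int(d)

for raw in ("9.06000000000e+17", "8.02673000000e+17"):
    print(raw, describe(raw))
```

For the two values quoted above, this gives exponent 17 in both cases, with 3 and 6 significant digits respectively, and the expanded integers 906000000000000000 and 802673000000000000.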

@giova-p

giova-p commented Aug 3, 2018

This is a common problem that often arises when importing/exporting Twitter ID fields from/to Excel. In my experience, when this happens, converting the "e+17" values back to integers does not restore the original IDs.
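The reason the conversion cannot restore the IDs is that the rounded string no longer carries the low-order digits. A minimal sketch, using the example values 8.95000000000e+17 and 895257961387446272 that are quoted elsewhere in this thread and assuming they belong to the same account; the helper name restored_id is hypothetical.

```python
from decimal import Decimal

def restored_id(rounded: str) -> int:
    """Expand a scientific-notation string to an integer, exactly."""
    return int(Decimal(rounded))

rounded = "8.95000000000e+17"
precise = 895257961387446272  # assumed to be the true ID for this value

guess = restored_id(rounded)
print(guess)            # 895000000000000000
print(precise - guess)  # the part of the ID that rounding discarded
```

Every ID that rounds to the same three leading digits expands to the same integer, so the expansion alone cannot tell the affected accounts apart.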

@edsu

edsu commented Aug 3, 2018

I've also seen this happen when using jq.

@patrick-lee-warren

patrick-lee-warren commented Aug 3, 2018

Sorry, everyone. I learned this the hard way. I think we can fix them, at least for the vast majority of the accounts.

@bet4a

bet4a commented Aug 10, 2018

@patrick-lee-warren There are 454 authors whose correct external_author_ids can be fairly easily resolved by matching authors’ handles with the data from the November 2017 HPSCI PDF. Not sure if this will help you, but I’ve made a CSV file containing the relevant data for all 454 of these users:

external_author_id_rounding_fixes.zip

The CSV has three columns:

  • author: Same as original dataset (Twitter handle in ALL CAPS)
  • external_author_id: Same as original dataset. All of these are in the rounded scientific notation format (e.g. 8.95000000000e+17)
  • user_id_from_hpsci_nov17: The actual non-rounded user IDs, courtesy of the HPSCI Nov 2017 document (e.g. 895257961387446272). All of these precise user IDs should match up with the rounded external_author_ids. For example, if the external_author_id from the dataset is 8.95e+17, then the user ID in this column should begin with 895, or possibly 894 or 896 due to rounding.

If this data would be more useful in a different format, please let me know.
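The consistency rule described for the third column can be sketched as a check. This is an illustration under stated assumptions, not the method used to build the CSV: the function name is_plausible_match and the tolerance parameter are hypothetical, and trailing zeros in the rounded value are treated as padding, which makes the check necessary rather than sufficient.

```python
from decimal import Decimal

def is_plausible_match(rounded: str, user_id: int, tolerance: int = 1) -> bool:
    """Check that a precise ID is consistent with a rounded e+17 value."""
    d = Decimal(rounded)
    id_digits = str(user_id)
    # Both numbers must have the same overall number of digits.
    if len(id_digits) != d.adjusted() + 1:
        return False
    # Leading digits that survived rounding, e.g. "895" for 8.95e+17.
    prefix = "".join(map(str, d.as_tuple().digits)).rstrip("0") or "0"
    head = int(id_digits[: len(prefix)])
    # Allow the last surviving digit to be off by one due to rounding.
    return abs(head - int(prefix)) <= tolerance

print(is_plausible_match("8.95000000000e+17", 895257961387446272))
```

A precise ID beginning with 894 or 896 would also pass for 8.95e+17, matching the allowance described above, while an ID of a different length or with a more distant prefix would be rejected.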


Unfortunately this method doesn’t work for authors who were newly added to the June 2018 list, because that PDF doesn’t include user IDs.

I actually called the House Intelligence Committee’s Minority Staff office to ask if they had this information and could make it public. According to the person I spoke with, they just don’t have it… Twitter only provided the Committee with IRA-linked account names (and not corresponding user IDs) for the recent 2018 list. In other words, the Committee just publishes whatever info they get from Twitter, so blame Twitter for not including user IDs in their updated list.

I’ve had moderate success uncovering some of these new_june_2018 users’ non-rounded IDs through manual search methods… but this is a much slower process. At the moment I’ve only gotten about halfway through this list of users. If it helps, I can post the non-rounded ID matches I’ve been able to find for these authors once I’m done.

@patrick-lee-warren

I have figured out a way to get almost all of them, using some data I didn't originally provide but want to provide (the article URL). I've sent the update to 538, and they are processing it so the repository can be updated. Thanks for all your work on this.

@bet4a

bet4a commented Aug 10, 2018

@patrick-lee-warren What’s missing from your list? Literally within the past 10 minutes, I think I stumbled upon a reliable way to retrieve the user ID for any suspended account.


Edit: It worked! I’ve resolved full user IDs for every author in the dataset, save one. The only exception is author HENY_AMBERH with external_author_id = 8.89661000000e+17. (There are only two tweets with that author/external_author_id combination; HENY_AMBERH is also associated with a different external_author_id that isn’t affected by the rounding issue.)

@EvanCarroll

EvanCarroll commented Aug 25, 2018

@patrick-lee-warren do you have a fork? This is the kind of thing that would be super useful as a pull request if you've resolved and/or corrected for Captain Genius opening it up and saving it in Excel.

@EvanCarroll

@bet4a do you have a fork? If so, that would be super useful. Just rename the one remaining ID with stupid scientific notation to something sane (like -1) so we can load them all into a signed bigint column.
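The sentinel idea can be sketched as a small loader step. This is an assumption-laden illustration in Python: the names to_bigint and SENTINEL are hypothetical, and -1 is used only because it is the value suggested above.

```python
import re

SENTINEL = -1
BIGINT_MAX = 2**63 - 1  # upper bound of a signed 64-bit integer column
PLAIN_INTEGER = re.compile(r"[0-9]+")

def to_bigint(raw: str) -> int:
    """Return the ID as an int, or SENTINEL if it is not a usable integer."""
    text = raw.strip()
    if PLAIN_INTEGER.fullmatch(text) and int(text) <= BIGINT_MAX:
        return int(text)
    # Scientific notation, empty strings, and oversized values land here.
    return SENTINEL

print(to_bigint("895257961387446272"))  # kept as-is
print(to_bigint("8.89661000000e+17"))   # replaced by the sentinel
```

Using a sentinel keeps the column type uniform, at the cost of having to exclude -1 whenever rows are grouped or joined by author ID.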

@EvanCarroll

@bet4a I was unable to get this working. Even after applying the patch, I still have 79 distinct authors with scientific notation in their external_author_id.

@EvanCarroll

I've committed a patch and schema for this; I can dump the data files if we want an updated copy.

https://github.com/EvanCarroll/russian-troll-tweets/blob/master/PostgreSQL/patch.psql

@patrick-lee-warren

I'm new to GitHub, but I hope I did this right. I forked the original repository, and created a new version that fixes many of the problems pointed out, including the rounding problem, as well as adding several new variables. I put in a pull request to ask 538 to update the main branch with it, but ???

It's https://github.com/patrick-lee-warren/russian-troll-tweets/tree/Version_2 if you want to play around with it.

dmil mentioned this issue Aug 27, 2018
dmil closed this as completed in #28 Aug 27, 2018