Irregularities matching authors in the dataset with the 2017/2018 Congressional lists #16
Comments
This is super helpful! Thanks for all the care you've taken here. A major reason we wanted to post all these data is that we knew more eyes would find errors we missed. Nice job. A couple of responses.
1.1) The first is the rounding problem you mentioned above. Somewhere in the sequence Twitter -> Social Studio -> CSV -> Stata -> CSV, some very large integers got converted into scientific notation and truncated. If this happened for some tweets and not for others, you could get two account_ids. I need to go back and see if I can reconstruct some of these, and will as soon as I can.
1.2) Some accounts came out of Social Studio with multiple external IDs under the same handle, either consecutively (like 4MYSQUAD) or interlaced (like MeggieONeil). For some of those, the follower counts, update counts, and behaviors indicated that these were, in fact, the same account, and we kept both in. For others, there were dramatic changes in stats and/or behavior, and we presumed that we simply had two accounts that shared the same handle; in those cases we tried to include the one with the account number indicated in the Congressional release. When in doubt, we included both to allow users to decide, but care should be taken with these.
2.1) As you suspected, we simply overlooked BABCHENKOVA_EVA and KARUCZ_00 when cleaning our data. Our method of gathering included a keyword search using the handle, and those were accidentally swept up. They should be discarded.
2.2) JennaTraveller is a super-interesting case that we went back and forth about including. Jenn_Abrams is one of the best known of the trolls. We believe that JennaTraveller was the first handle of the Jenn_Abrams account. It shares the external_author_id, and the update and follower counts transition smoothly as the handle changes. In a few cases (which we don't really understand), we were able to trace accounts back to old handles in this way. Tourettesn is the same situation: it was the opening handle for PigeonToday. (Interestingly, it also used the handle Politweecs, which shows up independently on the list, but with a different external_author_id.) This is yet another way you could get two external_author_ids for the same handle. We had actually meant to strip these early aliases out before posting the data, but now that it's out there, I have a few more I can post early next week.
2.3) Taraforma is a mixture of 1.2 and 2.2. The account "Taraformation" has three external_author_ids. There were two very early ones that we judged to be sufficiently different from the indicated account (external_author_id=1534083420) that we did not include them in our data. One of those accounts had an alias, Taraforma, that we failed to remove when we removed that version of Taraformation. It is not a troll and should be removed.
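To make the rounding scenario in 1.1 concrete, here is a minimal Python sketch of one way to test whether two stored ID values could be the same number rounded at different precisions. The helper names (`sig_digits`, `round_sig`, `same_after_rounding`) are made up for illustration and are not part of the dataset or any pipeline described here.

```python
def sig_digits(sci: str) -> int:
    """Count the significant digits actually present in a stored ID string.

    Trailing zeros after a decimal point are treated as padding,
    so '7.15893000000e+17' counts as 6 digits.
    """
    mantissa = sci.lower().split("e")[0].lstrip("+-")
    if "." in mantissa:
        mantissa = mantissa.rstrip("0")
    return max(1, len(mantissa.replace(".", "")))


def round_sig(value: float, digits: int) -> float:
    """Round a number to `digits` significant figures via scientific notation."""
    return float(f"{value:.{digits - 1}e}")


def same_after_rounding(a: str, b: str) -> bool:
    """True if a and b agree once both are rounded to the coarser precision."""
    digits = min(sig_digits(a), sig_digits(b))
    return round_sig(float(a), digits) == round_sig(float(b), digits)


# The two KRISTINADRUCKER values quoted in this issue agree at 3 significant digits.
print(same_after_rounding("7.15893000000e+17", "7.16000000000e+17"))
# The two 4MYSQUAD IDs quoted in this issue do not agree at any shared precision.
print(same_after_rounding("4036537452", "3312143142"))
```

A match here only says the two values are consistent with one truncated ID; it cannot recover the lost digits. Reading the ID column as text rather than as a number at every step avoids the problem in the first place.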
Thanks so much for the detailed response! It definitely clears up the confusion I had about these unusual cases. In case you find it useful, I should also mention that I’ve added some additional columns to my author summary Google Sheet. For each author, it now shows:
It’s not really anything revolutionary in and of itself. But it may be useful as a jumping-off point for further data exploration (e.g., what are the general characteristics of the most active accounts?).
I can confirm that JennaTraveller was Jenn_Abrams's original handle. When her material was still up, you could see cases where users had replied to her and the original handle was shown. Got a screenshot somewhere, I think. I was never as certain about the Politweecs and PigeonToday thing. There was massive overlap in the handles, but I couldn't tell if it was because one was constantly retweeting the other and then getting replies. At some point, Politweecs definitely became its own thing, and was operational a good bit longer than PigeonToday.
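For anyone who wants to look for other renamed accounts like these, a minimal sketch: group the rows by external_author_id and report any ID that appears under more than one author. It assumes the rows are dicts with the dataset's `author` and `external_author_id` columns; the function name and the sample IDs (`id-A`, `id-B`) are placeholders, not real values.

```python
from collections import defaultdict


def handles_by_id(rows):
    """Map each external_author_id to its handles, keeping only IDs with 2+ handles."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["external_author_id"]].add(row["author"])
    return {ext_id: sorted(handles) for ext_id, handles in seen.items() if len(handles) > 1}


# Placeholder rows: the IDs are invented, only the handles come from this thread.
sample_rows = [
    {"author": "JENNATRAVELLER", "external_author_id": "id-A"},
    {"author": "JENN_ABRAMS", "external_author_id": "id-A"},
    {"author": "POLITWEECS", "external_author_id": "id-B"},
]
print(handles_by_id(sample_rows))
```

A shared ID is only a starting point. As noted above, follower and update counts that transition smoothly across the handle change are what make the case for a rename.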
In a related discrepancy, there are a handful of authors whose The following authors have
And these two authors have
BABCHENKOVA_EVA, KARUCZ_00 and TARAFORMA should be discarded according to fivethirtyeight/russian-troll-tweets#16 (comment)
Another related discrepancy: there are 5 authors that appear in the Nov 2017 House Intel list PDF, but whose Phew, that was a mouthful. Sorry if that didn’t make sense. Hopefully the data itself is a bit more self-explanatory:
Digging a bit further, for two of these cases, it seems the
- removes accounts that were accidentally included #16
- adds alt_external_id, tweet_id, and article_url
- adds fields that follow http(s)://t.co/ links to their first redirect if they exist in a tweet
- fixes some issues about how ids are displayed
- fixes double encoding issue
- drops some duplicate observations
I’ve made a public Google Sheet that attempts to match up account information from this dataset with the November 2017 and June 2018 lists published by the House Intelligence Committee. Perhaps others will find it helpful for a number of reasons…
- Each author has multiple tweets. But some info (should) remain constant for a particular author across all their tweets: external_author_id, account_type, account_category and the new_june_2018 flag. This spreadsheet provides a summary of all the authors and their associated properties.
- The Nov 2017 PDF lists a user_id for each account, which corresponds to external_author_id. This can be used to resolve some of the floating point external_author_ids raised in issue #4, "external_author_id is rounded as a floating point". (It can’t be used for accounts that were added in the 2018 list because that PDF doesn’t list user_ids.)
- The dataset uppercases author; the lists from Congress retain account names’ original capitalization. It’s often trivial, but the distinction can be semantically meaningful. For example, we can see that the CURTISBIGMAN account from the dataset actually had a Twitter handle of “CurtisBigMan”, not CurtisBigman or CurtIsBigMan.

However, there are two problems that I ran across:
1) There are 17 authors in the dataset who each have two different external_author_ids. Example: in IRAhandle_tweets_1.csv, there are some rows with author 4MYSQUAD and external_author_id 4036537452. Other rows have the same author 4MYSQUAD, but a different external_author_id 3312143142. (FWIW, the Nov 2017 PDF shows 4MySquad having a user_id of 4036537452.)

For some authors this appears to be tied to floating point external_author_ids being rounded, such as KRISTINADRUCKER, who has some tweets with external_author_id 7.15893000000e+17 and others with 7.16000000000e+17. But for other authors, such as the 4MYSQUAD example above, the discrepancy doesn’t appear to be a rounding issue.

Is there a legit reason these accounts are associated with two different external_author_ids, or is this a mistake?

2) There are 5 authors in the dataset that are not listed in either of the PDFs from Congress. It’s not clear why tweets from these five accounts are included in the dataset.
I’ve attached a screenshot below which shows all tweet authors affected by either issue. For a copy-pastable version, go to the Google Sheet, then click the Data menu → Filter views… → author problems.
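In case it helps anyone reproduce the two checks, here is a rough Python sketch of how they could be run. It is not the method used to build the Google Sheet. It assumes rows are dicts keyed by the dataset's `author` and `external_author_id` columns, and that the Congressional handles have already been transcribed into a plain list. The function names, `EXAMPLE_HANDLE`, and the `0000`/`1111` IDs are placeholders.

```python
from collections import defaultdict


def authors_with_multiple_ids(rows):
    """Return {author: [ids]} for every author seen with more than one external_author_id."""
    ids = defaultdict(set)
    for row in rows:
        ids[row["author"]].add(row["external_author_id"])
    return {author: sorted(found) for author, found in ids.items() if len(found) > 1}


def authors_not_listed(rows, listed_handles):
    """Return dataset authors absent from the Congressional handles, ignoring case."""
    listed = {handle.casefold() for handle in listed_handles}
    return sorted({row["author"] for row in rows if row["author"].casefold() not in listed})


# Sample rows: the 4MYSQUAD IDs are the ones quoted above, the rest are placeholders.
sample_rows = [
    {"author": "4MYSQUAD", "external_author_id": "4036537452"},
    {"author": "4MYSQUAD", "external_author_id": "3312143142"},
    {"author": "CURTISBIGMAN", "external_author_id": "0000"},
    {"author": "EXAMPLE_HANDLE", "external_author_id": "1111"},
]
congress_handles = ["4MySquad", "CurtisBigMan"]

print(authors_with_multiple_ids(sample_rows))
print(authors_not_listed(sample_rows, congress_handles))
```

Against the real files, `csv.DictReader` over IRAhandle_tweets_1.csv (and the other parts) yields rows in this shape. It also keeps external_author_id as text, which avoids adding any further floating point rounding.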