Version 2.0, fixing several issues and adding new data #26

Closed
patrick-lee-warren opened this issue Aug 25, 2018 · 10 comments · Fixed by #28

Comments

@patrick-lee-warren

I'm new to GitHub, but I hope I did this right. I forked the original repository and created a new version that fixes many of the problems pointed out, including the rounding problem, as well as adding several new variables. I put in a pull request asking 538 to update the main branch with it, but I'm not sure what happens from here.

It's https://github.com/patrick-lee-warren/russian-troll-tweets/tree/Version_2 if you want to play around with it.

@EvanCarroll

Do you have any experience with databases? I can get you set up with PostgreSQL if you'd like; you should consider giving it a shot. Your version has some improvements, and I can help you make it better.

@EvanCarroll

BTW, you should not be distributing text as zips or otherwise compressed on GitHub; it gets in the way of Git's delta compression.

@patrick-lee-warren
Author

No experience with DBs, so it's tough to tell whether it's worth the startup cost. Re: the .zips, it's odd that GitHub doesn't have a utility to unpack them. From my home connection it's just a huge pain to upload big text files. I'll change them to uncompressed .csv files tomorrow at work.

@EvanCarroll

I can do better: I'll just create a chunked CSV and a schema for you. If you want to use PostgreSQL you can; if not, you can do whatever you like. Once the rows are all in the database, you would simply dump them to a single file and use `split`.
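Something along these lines, assuming a table named `tweets` in a database named `troll_tweets` (both names are placeholders, not the actual schema) and GNU coreutils for `split`:

```sh
# Dump the whole table to one CSV, header included.
psql troll_tweets -c "\copy tweets TO 'tweets.csv' WITH (FORMAT csv, HEADER)"

# Cut it into pieces under GitHub's 100 MB limit; -C splits on line
# boundaries, so no row is broken across two files (GNU split).
split -C 95m -d --additional-suffix=.csv tweets.csv tweets_part_
```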

@patrick-lee-warren
Author

I just work with the whole file, but some folks asked for it to be cut into pieces under 100 MB in a prior issue, so I tried to keep it that way.

@EvanCarroll

I've got my version going up now.

https://github.com/EvanCarroll/russian-troll-tweets

This is self-hosted: there are dump files in the base of the project for people to use; they load into PostgreSQL. Those files can be dumped back out again using dump.sh. I'm clustering by date.

If you download PostgreSQL and you want to load the database, just change into the directory and run the run.psql script. It'll set up the schema, load the data, and configure the indexes.
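If you'd rather load dumps like these by hand than use run.psql, it's roughly a `createdb` plus one `psql -f` per dump file. A sketch, with the database name and file glob as assumptions rather than this repo's actual layout:

```sh
# Create an empty database, then replay each dump file into it.
createdb troll_tweets
for f in *.sql; do
    psql troll_tweets -f "$f"
done
```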

@EvanCarroll

EvanCarroll commented Aug 27, 2018

In this dump there are 28,105 duplicate tweets, such as:

593916934229340161
593917011974950912
593918224883941376
593918272791310337
593918312431669248
593918330626510848
593918374264119296
593918387157336065
593918435081461761
593918450155786240
593918485736116224
593918533450473472
593918546473803776
593918573459943424
593918593735229440
593918637175615488
593918685053591552
593918733061595136
593918772496506880
593918803525926913
593918854935490560
593918904856125440
593918923348824064
593918950074937345
593918963010142208
593919059499962370
593925710051344384
593926455676968961
593927023514537984
593927058612428800
593927089008549888
593927102921101312
593927116716187650
593927160391475200
593927241043734529
593927246848630784
593927253693747200

Can you explain that? Or should I clean it up and delete one row of each duplicate pair?

@patrick-lee-warren
Author

One of each pair should be dropped. The duplicates probably arose from the 50k/day download limit: when we cut into chunks by date, we may have accidentally overlapped our windows by a minute or two.
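If it helps, here's one standard PostgreSQL way to drop one row of each duplicate pair, keeping a single survivor per tweet id. The table and column names (`tweets`, `tweet_id`) are assumptions for illustration, and it's worth eyeballing a few pairs first in case the duplicated rows differ in other columns:

```sh
# Delete every row that shares a tweet_id with a physically earlier
# row, leaving exactly one copy of each id.
psql troll_tweets <<'SQL'
DELETE FROM tweets a
USING tweets b
WHERE a.tweet_id = b.tweet_id   -- same tweet id
  AND a.ctid > b.ctid;          -- keep the physically-first row
SQL
```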

@patrick-lee-warren
Author

I tried to change the .zips to .csvs, but it won't let me upload files above 25 MB.

@EvanCarroll

@patrick-lee-warren I converted the zips to CSVs for you; just pull from patrick-lee-warren#1

You need a script similar to this to pull the text files out into 100 MB chunks.

https://github.com/EvanCarroll/russian-troll-tweets/blob/version_2/PostgreSQL/dump.sh
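For what it's worth, a variant that keeps the CSV header in every chunk, so each piece loads on its own, might look like the sketch below; the file names are placeholders and it assumes GNU split:

```sh
#!/bin/sh
# Split tweets.csv into <100 MB chunks, each with its own header row.
head -n 1 tweets.csv > header.csv
tail -n +2 tweets.csv | split -C 95m -d - body_
for f in body_*; do
    cat header.csv "$f" > "chunk_${f#body_}.csv"
    rm "$f"
done
rm header.csv
```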

@dmil mentioned this issue Aug 27, 2018
@dmil closed this as completed in #28 Aug 27, 2018