Version 2.0, fixing several issues and adding new data #26
Comments
Do you have any experience with a database? I can get you set up with PostgreSQL if you'd like; you should consider giving it a shot. Your version has some improvements, and I can help you make it better.
BTW, you should not distribute text as zips or otherwise compressed on GitHub; it gets in the way of delta creation.
No experience with DBs. Tough to see if it's worth the startup cost. Re: .zips, it's odd that GitHub doesn't have a utility to unpack them. From my home connection, it's just a huge pain to send big text files. I'll change them to uncompressed .csvs tomorrow at work.
I can do better. I'll just create a chunked CSV and schema for you. If you want to use PostgreSQL you can; if not, you can do whatever. If they're all in the database, you would simply dump them to a single file and use
I just work with the whole file, but some folks asked in a prior issue for it to be cut into pieces under 100M, so I tried to keep it that way.
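For anyone who wants to redo the split themselves, here is a rough Python sketch of cutting rows into pieces under a size limit, with the header repeated in each piece. It is not the script either repo uses; the function name `split_csv_rows` and the `max_bytes` parameter are made up for illustration.

```python
import csv
import io


def split_csv_rows(rows, header, max_bytes):
    """Group rows into CSV text chunks of at most max_bytes each (UTF-8).

    Hypothetical helper: every chunk starts with the header line. A single
    row larger than the limit still gets a chunk of its own.
    """
    def encode(row):
        # Let the csv module handle quoting of commas and newlines in tweets.
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerow(row)
        return buf.getvalue()

    header_text = encode(header)
    header_size = len(header_text.encode("utf-8"))

    chunks = []
    current = [header_text]
    size = header_size
    for row in rows:
        text = encode(row)
        n = len(text.encode("utf-8"))
        # Start a new chunk only if this one already holds at least one row.
        if len(current) > 1 and size + n > max_bytes:
            chunks.append("".join(current))
            current = [header_text]
            size = header_size
        current.append(text)
        size += n
    if len(current) > 1:
        chunks.append("".join(current))
    return chunks
```

Each returned string can be written straight to its own file; for the real data, `max_bytes` would be somewhere below 100 * 1024 * 1024.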
I've got my version going up now. https://github.com/EvanCarroll/russian-troll-tweets This is self-hosted: there are dump files in the base of the project for people to use. They will load them up in PostgreSQL. Those files can again be dumped back out using If you download PostgreSQL and you want to load the database just jump in the directory and run the
In this dump there are 28,105 duplicate tweets, such as
Can you explain that? Or should I clean it up and delete one tweet from each duplicate pair?
One of each pair should be dropped. The duplicates probably arose from the 50k/day download limit. When we cut the downloads into chunks by date, we may have accidentally overlapped our windows by a minute or two.
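If the overlap explanation is right, the two rows in each pair should be identical, so keeping the first occurrence is enough. A minimal Python sketch of that cleanup follows; the `tweet_id` column name is an assumption about the dump, and `drop_duplicates` is just an illustrative helper, not something in either repo.

```python
def drop_duplicates(rows, key="tweet_id"):
    """Keep the first row seen for each key value and drop later repeats.

    `rows` is any iterable of dicts (e.g. from csv.DictReader). The default
    key "tweet_id" is an assumed column name; change it to match the data.
    """
    seen = set()
    kept = []
    for row in rows:
        value = row[key]
        if value in seen:
            continue  # a later copy of a tweet we already kept
        seen.add(value)
        kept.append(row)
    return kept
```

Comparing `len(rows) - len(kept)` against the 28,105 figure above would be a quick sanity check that the key identifies the same duplicates.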
I tried to change the .zips to .csvs, but it won't let me upload files above 25M.
@patrick-lee-warren I converted the zips to CSVs for you; just pull from patrick-lee-warren#1. You need a script similar to this one to dump the text files out in 100 MB chunks: https://github.com/EvanCarroll/russian-troll-tweets/blob/version_2/PostgreSQL/dump.sh
I'm new to GitHub, but I hope I did this right. I forked the original repository, and created a new version that fixes many of the problems pointed out, including the rounding problem, as well as adding several new variables. I put in a pull request to ask 538 to update the main branch with it, but ???
It's https://github.com/patrick-lee-warren/russian-troll-tweets/tree/Version_2 if you want to play around with it.