EATM

Acronym for ~~eat meat~~ Extract Archive Tweet Media. It's a set of Python scripts to extract basic information about your tweets from your archive.

Features

Summarize your tweet database into a CSV file
Download all the media from your tweets, retweets, and replies

Requirements

Python 3.10
Modules in requirements.txt (versions may vary)
Experience using a terminal/console

First steps

Twitter Archive

First off, you must request your Twitter Archive, which will be sent to you once it's ready (the time varies depending on the amount of data you have).

Extracting tweets

You will soon notice there are three elements inside this archive file.

    Your Twitter Archive
(1) │   Your archive.html
(2) ├───assets
(3) └───data
        │   account-creation-ip.js
        │   ...
(>)     │   tweets.js
        │   ...
        │   verified.js
        │   
        ├───community_tweet_media
        ├─── - - -
        └───tweets_media

Go inside the data folder and copy the tweets.js file, since it is the only one you'll need to use these scripts.

Format file

Despite the extension, this file can easily be "converted" to a JSON file. Its structure is similar to the following.

window.YTD.tweets.part0 = [
  {
    "tweet" : {
        ...
    }
  },
]

You only have to delete the beginning of the first line (the window.YTD part, up to and including the = sign). After doing so, the result should look like the one below.

[
  {
    "tweet" : {
        ...
    }
  },
]

Finally, change the extension from .js to .json, and then it's ready to go!
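If you'd rather not edit the file by hand, the same steps can be sketched in Python. This is only a minimal sketch, not part of EATM: the function names `strip_js_prefix` and `convert_archive_file` are made up for illustration, and it assumes the only thing before the first `[` is the window.YTD assignment.

```python
import json
from pathlib import Path


def strip_js_prefix(js_text: str) -> str:
    """Return the JSON array that follows the 'window.YTD... =' assignment."""
    # Assumption: everything before the first '[' is the JavaScript prefix.
    start = js_text.index("[")
    return js_text[start:]


def convert_archive_file(src: Path, dst: Path) -> int:
    """Write the cleaned JSON to dst and return the number of entries."""
    cleaned = strip_js_prefix(src.read_text(encoding="utf-8"))
    entries = json.loads(cleaned)  # raises an error if the result is not valid JSON
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(entries)
```

For example, `convert_archive_file(Path("tweets.js"), Path("input/tweets.json"))` would place the result where the scripts look for it by default.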

Commands

Assuming you've already installed Python, configured a venv, and installed the requirements.txt modules, you can run any of the following three scripts.

Tip. Whenever you are unsure how to run a command, the -h option will be your friend!

`get_tweet_csv.py`

This script will collect very basic stuff, like tweet_id, tweet_date, tweet_text and tweet_url (among others).

The complete command is the following:

py get_tweet_csv.py --input_filename <input_filename> --output_filename <output_filename>

All argument default options are the following:

Argument	Default value	Description
`input_filename`	input/tweets.json	The input JSON file containing the Twitter archive.
`output_filename`	output/tweet_info.csv	The output CSV file containing all info of the tweets.
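To give a rough idea of what a script like this does, here is a minimal sketch. It is not the actual code of get_tweet_csv.py: the names `build_parser`, `tweet_to_row` and `write_tweet_csv` are hypothetical, the archive fields `id_str`, `created_at` and `full_text` are assumptions about the tweets.json layout, and the URL pattern is a guess. It also assumes `argparse`, which generates the -h help text on its own.

```python
import argparse
import csv
import json
from pathlib import Path

# Column names taken from the description above; the real script writes more.
COLUMNS = ["tweet_id", "tweet_date", "tweet_text", "tweet_url"]


def build_parser() -> argparse.ArgumentParser:
    """Arguments and defaults copied from the table above."""
    parser = argparse.ArgumentParser(description="Summarize tweets into a CSV file.")
    parser.add_argument("--input_filename", default="input/tweets.json")
    parser.add_argument("--output_filename", default="output/tweet_info.csv")
    return parser


def tweet_to_row(entry: dict) -> dict:
    """Flatten one archive entry ({"tweet": {...}}) into a CSV row."""
    tweet = entry["tweet"]
    tweet_id = tweet["id_str"]  # assumed archive field
    return {
        "tweet_id": tweet_id,
        "tweet_date": tweet.get("created_at", ""),  # assumed archive field
        "tweet_text": tweet.get("full_text", ""),  # assumed archive field
        # Assumed URL pattern; the real script may build it differently.
        "tweet_url": f"https://twitter.com/i/web/status/{tweet_id}",
    }


def write_tweet_csv(input_filename: str, output_filename: str) -> int:
    """Read the JSON archive, write one CSV row per tweet, return the row count."""
    entries = json.loads(Path(input_filename).read_text(encoding="utf-8"))
    out = Path(output_filename)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(tweet_to_row(entry) for entry in entries)
    return len(entries)
```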

`get_media_csv.py`

This script is similar to the previous one. However, this one specializes in collecting the tweets' media ONLY. Its output contains data like tweet_id, tweet_date, tweet_url, media_id, media_type, media_url, etc.

Its arguments are basically the same as the previous command. So the complete command is the following:

py get_media_csv.py --input_filename <input_filename> --output_filename <output_filename>

All argument default options are the following:

Argument	Default value	Description
`input_filename`	input/tweets.json	The input JSON file containing the Twitter archive.
`output_filename`	output/media_info.csv	The output CSV file containing the media information.

Note that this script won't download any media.
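As an illustration of how one tweet can turn into several media rows, consider the sketch below. It is not the script's real code: `media_rows` is a hypothetical helper, and the fields `extended_entities`, `media`, `id_str`, `type` and `media_url_https` are assumptions about how the archive stores media.

```python
def media_rows(entry: dict) -> list[dict]:
    """Return one row per media item attached to a tweet (possibly none)."""
    tweet = entry["tweet"]
    # Assumption: media items live under extended_entities.media.
    media_items = tweet.get("extended_entities", {}).get("media", [])
    return [
        {
            "tweet_id": tweet["id_str"],
            "media_id": item["id_str"],
            "media_type": item["type"],
            "media_url": item["media_url_https"],
        }
        for item in media_items
    ]
```

A tweet without media simply produces no rows, so it never shows up in the media CSV.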

`download_media.py`

As you may deduce from the name, this script lets you download all types of media from these tweets/retweets/replies.

This script, however, requires an output file from get_media_csv.py. The complete command is the following:

py download_media.py --column_tweet_id <column_tweet_id> --column_tweet_unix_timestamp <column_tweet_unix_timestamp> --column_media_id <column_media_id> --column_media_url <column_media_url> --input_filename <input_filename> --output_directory <output_directory>

All argument default options are the following:

Argument	Default value	Description
`column_tweet_id`	tweet_id	The column name where the Tweet ID is stored.
`column_tweet_unix_timestamp`	tweet_unix_timestamp	The column name where the Tweet UNIX timestamp is stored.
`column_media_id`	media_id	The column name where the media ID is stored.
`column_media_url`	media_url	The column name where the media URL is stored.
`input_filename`	output/media_info.csv	The input CSV file containing the Twitter media URLs.
`output_directory`	output/media/	The output folder where the media will be saved.

Don't panic! If you are, indeed, using the get_media_csv.py script output, the basic usage often will be the following:

py download_media.py --input_filename <input_filename> --output_directory <output_directory>

Pretty neat, right?
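In case you are curious about what happens with the CSV, here is a minimal sketch of a download loop. It is not the code of download_media.py: `target_path` and `download_all` are hypothetical names, the `<tweet_id>_<media_id>` file naming is an assumption, and the sketch ignores the tweet_unix_timestamp column because its use in the real script is not described here.

```python
import csv
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def target_path(row: dict, output_directory: str,
                column_tweet_id: str = "tweet_id",
                column_media_id: str = "media_id",
                column_media_url: str = "media_url") -> Path:
    """Build a destination like <output_directory>/<tweet_id>_<media_id>.<ext>."""
    # Keep the file extension found in the URL path (e.g. ".jpg").
    suffix = Path(urlparse(row[column_media_url]).path).suffix
    name = f"{row[column_tweet_id]}_{row[column_media_id]}{suffix}"
    return Path(output_directory) / name


def download_all(input_filename: str = "output/media_info.csv",
                 output_directory: str = "output/media/",
                 fetch=urllib.request.urlretrieve) -> list[Path]:
    """Download every media URL listed in the CSV, skipping files already present."""
    Path(output_directory).mkdir(parents=True, exist_ok=True)
    saved = []
    with open(input_filename, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            destination = target_path(row, output_directory)
            if not destination.exists():
                fetch(row["media_url"], destination)
            saved.append(destination)
    return saved
```

The `fetch` parameter is only there so the loop can be tried out without touching the network; by default it uses `urllib.request.urlretrieve`.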

Tips

The result will contain tweets sorted by their UNIX timestamp in ascending order. However, you can also sort them by the tweet_id column.

If you still insist on the UNIX timestamp, I recommend converting tweet_unix_timestamp into a readable date by adding another spreadsheet column with the following formula (assuming the timestamp is in cell D2): =D2/(24*60*60) + DATE(1970,1,1).
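If you are processing the CSV with Python instead of a spreadsheet, the same conversion can be sketched as follows. The function names are made up for illustration, and the sketch assumes tweet_unix_timestamp is expressed in seconds (if it turns out to be in milliseconds, divide by 1000 first).

```python
from datetime import datetime, timezone


def unix_to_datetime(tweet_unix_timestamp: float) -> datetime:
    """Convert seconds since 1970-01-01 UTC into a timezone-aware datetime."""
    return datetime.fromtimestamp(tweet_unix_timestamp, tz=timezone.utc)


def unix_to_days(tweet_unix_timestamp: float) -> float:
    """Mirror the D2/(24*60*60) part of the formula: days since 1970-01-01."""
    return tweet_unix_timestamp / (24 * 60 * 60)
```

The spreadsheet formula does the same thing in two steps: it turns seconds into days, then adds them to the date 1970-01-01.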

Notes

If there's any error with the scripts or this README, let me know by opening an issue, or just send me a message on my Twitter profile!
