Commit

Update readme with CLI docs
betsybookwyrm committed Feb 22, 2022
1 parent 1ca16bb commit 57ee955
Showing 1 changed file with 114 additions and 8 deletions.
122 changes: 114 additions & 8 deletions README.md
@@ -16,50 +16,138 @@ database file.

- [Collecting Twitter Data](#collecting-twitter-data)
- [Input and Output](#input-and-output)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Feedback and Contributions](#feedback-and-contributions)
- [About tidy_tweet](#about-tidy_tweet)

## Collecting Twitter data

If you do not have a preferred Twitter collection tool already, we recommend [Twarc](https://github.com/DocNow/twarc/).
If you do not have a preferred Twitter collection tool already, we recommend [Twarc][twarc].
tidy_tweet is designed to work directly with Twarc output. Other collection methods may work with tidy_tweet as long
as they output the API result from Twitter with minimal alteration (see [Input and Output](#input-and-output)).
However, at this time we do not have the resources to support Twitter data outputs from tools other than Twarc.

## Input and Output

### Input: Twitter results pages
### Input: Twitter tweet results pages

tidy_tweet takes as input a series of JSON/dict objects, each object of which is a page of Twitter API v2 search or
timeline results. Typically, this will be a JSON file such as those output by `twarc2 search`.
timeline results. Typically, this will be a JSON file such as those output by `twarc2 search`. At present, API endpoints
oriented around things other than tweets, such as the `liking-users` endpoint, are not properly supported, though we
hope to support them in future.

JSON files with multiple pages of results are expected to be newline-delimited, with each line being a distinct results
page object, and no commas between top-level objects.
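
As a rough illustration of this layout, the following Python sketch reads such a file one line at a time and parses
each line as its own results page. The function name `iter_result_pages` is hypothetical and is not part of
tidy_tweet; it only demonstrates the newline-delimited format described above.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_result_pages(path: str) -> Iterator[dict]:
    """Yield one parsed results page per non-empty line of a newline-delimited JSON file.

    Hypothetical helper for illustration only; not part of tidy_tweet.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, such as a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # A file with commas between objects, or one object spread over
                # several lines, fails here rather than being silently misread.
                raise ValueError(f"Line {line_number} of {path} is not valid JSON") from error
```

Calling `len(list(iter_result_pages("my_search_results.json")))` would then give the number of results pages in the
file, which can be a quick sanity check before tidying.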

### Output: SQLite database of tweets and metadata

After processing your Twitter results pages with tidy_tweet (see [Usage](#usage)), you will have an
[SQLite](https://sqlite.org/index.html) database file at the location you specified.
[SQLite][sqlite] database file at the location you specified.

Database schema will be published here as soon as the initial schema is finalised.
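
Until the schema is published, one way to see what a database file contains is to ask SQLite itself. The sketch below
uses only Python's built-in `sqlite3` module and makes no assumptions about tidy_tweet's table names; the function
name `count_rows_per_table` is hypothetical.

```python
import sqlite3
from typing import Dict


def count_rows_per_table(database_path: str) -> Dict[str, int]:
    """Return {table name: row count} for every table in an SQLite database file.

    Hypothetical helper for illustration only; not part of tidy_tweet.
    Note that sqlite3.connect creates an empty file if the path does not exist.
    """
    connection = sqlite3.connect(database_path)
    try:
        # sqlite_master is SQLite's built-in catalogue of tables, indexes and views.
        table_names = [
            name
            for (name,) in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in table_names:
            # Table names cannot be bound as parameters, so quote the identifier instead.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = connection.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        connection.close()
```

For example, `count_rows_per_table("my_dataset.db")` returns a dictionary mapping each table in the file to its
number of rows.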

## Prerequisites

- Python 3.8+
- A command line shell/terminal, such as bash, Mac Terminal, Git Bash, Anaconda Prompt, etc

This tool requires Python 3.8 or later, and the instructions assume you already have Python installed. If you haven't
installed Python before, you might find [Python for Beginners][python_beginners] helpful. Note that tidy_tweet is a
command line application: you don't need to write any Python code to use it (although you can if you want to), you just
need to be able to run Python code!

The instructions assume sufficient familiarity with using a command line to change directories, list files and find
their locations, and execute commands. If you are new to the command line or want a refresher, there are some good
lessons from [Software Carpentry][sc_unix_intro] and the [Programming Historian][ph_bash_intro].

The instructions assume you are working in a suitable Python
[virtual environment][py_venv]. RealPython has a relatively straightforward
[primer on virtual environments][realpy_venv] if you are new to the concept. If you installed Python with
Anaconda/conda, you will want to manage your virtual environments through [Anaconda][anaconda_venv]/[conda][conda_venv]
as well. If you have a virtual environment already set up for using [Twarc][twarc], you can install tidy_tweet in that
same environment.
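
If you are unsure whether your current Python meets these prerequisites, a short check like the one below may help.
It is a minimal sketch using only the standard library: the function name `environment_report` is hypothetical, the
`sys.prefix` comparison detects environments created with `venv` or `virtualenv`, and the `CONDA_DEFAULT_ENV`
variable is set by conda when an environment is active.

```python
import os
import sys
from typing import Dict, Optional, Union


def environment_report() -> Dict[str, Union[bool, Optional[str]]]:
    """Summarise whether the running Python looks suitable for installing tidy_tweet.

    Hypothetical helper for illustration only; not part of tidy_tweet.
    """
    return {
        # tidy_tweet requires Python 3.8 or later.
        "python_ok": sys.version_info >= (3, 8),
        # Inside a venv/virtualenv, sys.prefix differs from the base interpreter's prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        # Name of the active conda environment, or None if conda is not in use.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```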

## Installation

tidy_tweet is a Python package and can be installed with pip.

Short version of installation instructions:
1. Ensure you are using an appropriate Python or Anaconda environment (see [Prerequisites](#prerequisites))

2. Install tidy_tweet and its requirements by running:

```bash
python -m pip install tidy_tweet
```

3. Run the following to check that your environment is ready to run tidy_tweet:

```bash
tidy_tweet --help
```


If you wish to install a specific version of tidy_tweet, for example to replicate past results, you can specify the
desired version when installing with pip. The following installs tidy_tweet version 1.0.1 (which does not currently
exist):

```bash
pip install tidy-tweet
python -m pip install tidy-tweet==1.0.1
```
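
When replicating past results, it can also be useful to record which version is actually installed. The following is
a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8); the function name
`installed_version` is hypothetical, and the sketch assumes the distribution is registered under the name used in the
pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str = "tidy_tweet") -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed.

    Hypothetical helper for illustration only; not part of tidy_tweet.
    """
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "tidy_tweet is not installed in this environment")
```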

## Usage

A command-line interface (CLI) is planned for the future, but is not yet implemented.
tidy_tweet may be used either as a [command line application](#command-line-interface) or as
a [Python library](#python-library). The command line interface (CLI) is recommended for general use and is intended to
be more straightforward to use. The Python library interface is designed for use cases such as integrating tidy_tweet
usage into other tools, scripts, and notebooks.

### Command line interface

After [installing tidy_tweet](#installation), you should be able to run `tidy_tweet` as a command line application:

```bash
tidy_tweet --help
```

Running the above will show you a summary of how to use the tidy_tweet command line interface (CLI). The
tidy_tweet CLI expects you to provide specific arguments in a specific order, as follows:

```bash
tidy_tweet DATABASE JSON_FILE
```

**DATABASE**: This is the filename where you want to save the tidied data as a database. As this is an [SQLite][sqlite]
database, it is conventional for the filename to end in ".db". Example: `my_dataset.db`

### Using tidy_tweet as a Python library
**JSON_FILE**: This is the file of tweets you wish to tidy into the database. For more information,
see [Input and Output](#input-and-output). Example: `my_search_results.json`

Example:

```bash
tidy_tweet tree_search_2022-02-22.db tree_search_2022-02-22.json
```

#### Loading multiple JSON files into a database

tidy_tweet can accept more than one JSON file at a time. If you have multiple JSON files, for example resulting
from different search terms or Twitter accounts, you can list them all in a single `tidy_tweet` command:

```bash
tidy_tweet DATABASE JSON_FILE_1 JSON_FILE_2 JSON_FILE_3
```

For example:

```bash
tidy_tweet tree_searches_2022-02-22.db pine_tree_2022-02-22.json eucalypt_2022-02-22.json jacaranda_2022-02-22.json
```

At present, there is no metadata to tell what data came from which file, but we plan to fix this soon!
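
If you have many JSON files, typing each filename becomes tedious. The sketch below assembles the same
`tidy_tweet DATABASE JSON_FILE_1 JSON_FILE_2 ...` command from every `.json` file in a directory and runs it with
Python's `subprocess` module. Both function names are hypothetical, and the sketch relies only on the command line
form documented above, not on tidy_tweet's Python library.

```python
import subprocess
from pathlib import Path
from typing import List


def build_tidy_tweet_command(database: str, json_directory: str) -> List[str]:
    """Build `tidy_tweet DATABASE JSON_FILE_1 JSON_FILE_2 ...` for all .json files in a directory.

    Hypothetical helper for illustration only; not part of tidy_tweet.
    """
    # Sort the filenames so the command is the same on every run.
    json_files = sorted(str(path) for path in Path(json_directory).glob("*.json"))
    if not json_files:
        raise FileNotFoundError(f"No .json files found in {json_directory}")
    return ["tidy_tweet", database, *json_files]


def run_tidy_tweet(database: str, json_directory: str) -> None:
    """Run the tidy_tweet CLI; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_tidy_tweet_command(database, json_directory), check=True)
```

For example, `run_tidy_tweet("tree_searches_2022-02-22.db", "tree_search_results")` would load every JSON file in a
(hypothetical) `tree_search_results` directory into one database. Passing the arguments as a list avoids shell quoting
problems with filenames that contain spaces.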

### Python library

Here is an example using the test data file included with tidy_tweet:

@@ -86,9 +174,27 @@ Found an issue with tidy_tweet? [Find out how to let us know](contributing.md#fi

Interested in contributing? Find out more in our [contributing.md](contributing.md)

## Acknowledgements

Some of this documentation is copied from [Gab Tidy Data](https://github.com/QUT-Digital-Observatory/gab_tidy_data),
and much of the structure and functionality is also modelled on gab_tidy_data, which was our initial foray into
developing a tool like this.

## About tidy_tweet

Tidy_tweet is created and maintained by the [QUT Digital Observatory](https://www.qut.edu.au/digital-observatory) and
is open-sourced under an MIT license. We welcome contributions and feedback!

A DOI and citation information will be added in future.


[twarc]: https://github.com/DocNow/twarc/
[sqlite]: https://sqlite.org/index.html
[python_beginners]: https://www.python.org/about/gettingstarted/
[sc_unix_intro]: https://swcarpentry.github.io/shell-novice/
[ph_bash_intro]: https://programminghistorian.org/en/lessons/intro-to-bash
[py_venv]: https://docs.python.org/3/tutorial/venv.html
[realpy_venv]: https://realpython.com/python-virtual-environments-a-primer/
[conda_venv]: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
[anaconda_venv]: https://docs.anaconda.com/anaconda/navigator/getting-started/#navigator-managing-environments
