Merge pull request #100 from CCBR/pdq_db
adding pdq db creation and data ingestion features
kopardev authored May 12, 2024
2 parents fdb0d0a + db186c7 commit ab322f0
Showing 13 changed files with 544 additions and 2 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,16 @@

### Bug fixes

## spacesavers2 v0.13.0

### New features

- added new commands `spacesavers2_pdq_create_db` and `spacesavers2_pdq_update_db`
- output TSV files from `spacesavers2_pdq` can now be saved into a sqlite3 db with these commands
- this makes future integration with Grafana possible

### Bug fixes

## spacesavers2 v0.12.1

### New features
1 change: 1 addition & 0 deletions bin/spacesavers2_pdq_create_db
1 change: 1 addition & 0 deletions bin/spacesavers2_pdq_update_db
Binary file added docs/assets/images/pdq_db_schema.png
7 changes: 6 additions & 1 deletion docs/pdq.md
@@ -80,4 +80,9 @@ Here is an example output:
}
}
}
```
`spacesavers2_pdq` creates one TSV (or JSON) file per datamount per run (typically one per date). If run daily, this quickly produces many files to keep track of, so it is best to save the data in a SQLite db using:
- [`spacesavers2_pdq_create_db`](pdq_create_db.md) and
- [`spacesavers2_pdq_update_db`](pdq_update_db.md)
70 changes: 70 additions & 0 deletions docs/pdq_create_db.md
@@ -0,0 +1,70 @@
## spacesavers2_pdq_create_db

pdq = Pretty Darn Quick

[`spacesavers2_pdq`](pdq.md) creates one TSV (or JSON) file per datamount per run (typically one per date). If run daily, this quickly produces many files to keep track of, so it is best to save the data in a SQLite db. This command creates the basic schema for that db. The schema looks like this:

![pdq schema](assets/images/pdq_db_schema.png)

### Inputs
- `--filepath`: where to create the ".db" file.
- `--overwrite`: toggle to overwrite if the ".db" file already exists.

```bash
usage: spacesavers2_pdq_create_db [-h] -f FILEPATH [-o | --overwrite | --no-overwrite] [-v]

spacesavers2_pdq_create_db: create a sqlitedb file with the optimized schema.

options:
-h, --help show this help message and exit
-f FILEPATH, --filepath FILEPATH
spacesavers2_pdq_create_db will create this sqlitedb file
-o, --overwrite, --no-overwrite
overwrite output file if it already exists. Use this with caution as it will delete existing file and its contents!!
-v, --version show program's version number and exit
Version:
v0.13.0-dev
Example:
> spacesavers2_pdq_create_db -f /path/to/sqlitedbfile
```
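
The `--no-overwrite` entry in the usage above comes from declaring `--overwrite` with `argparse.BooleanOptionalAction` (available from Python 3.9). The following is a minimal sketch of how the three cases parse; `build_parser` is a stripped-down stand-in written for illustration, not the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser (an assumption for illustration) with only the two documented options.
    parser = argparse.ArgumentParser(prog="spacesavers2_pdq_create_db")
    parser.add_argument("-f", "--filepath", required=True, type=str)
    parser.add_argument("-o", "--overwrite", action=argparse.BooleanOptionalAction)
    return parser


parser = build_parser()
not_given = parser.parse_args(["-f", "pdq.db"]).overwrite
given = parser.parse_args(["-f", "pdq.db", "--overwrite"]).overwrite
negated = parser.parse_args(["-f", "pdq.db", "--no-overwrite"]).overwrite
print(not_given, given, negated)  # None True False
```

Because the flag is `None` when omitted, a check such as `if not args.overwrite` treats "not given" and `--no-overwrite` the same way.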
### Output
#### db file
A SQLite ".db" file with 4 tables:
```bash
% sqlite3 pdq.db
SQLite version 3.26.0 2018-12-01 12:34:55
Enter ".help" for usage hints.
sqlite> .table
datamounts datapoints dates users
sqlite> .schema
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
username TEXT NOT NULL,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL
);
CREATE TABLE dates (
date_int INTEGER PRIMARY KEY,
date_text TEXT UNIQUE NOT NULL
);
CREATE TABLE datamounts (
datamount_id INTEGER PRIMARY KEY,
datamount_name TEXT UNIQUE NOT NULL
);
CREATE TABLE datapoints (
datapoint_id INTEGER PRIMARY KEY,
date_int INTEGER,
datamount_id INTEGER,
user_id INTEGER,
ninodes INTEGER,
nbytes INTEGER,
FOREIGN KEY (user_id) REFERENCES users(user_id),
FOREIGN KEY (datamount_id) REFERENCES datamounts(datamount_id),
FOREIGN KEY (date_int) REFERENCES dates(date_int)
);
```
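
Once the db is populated, the four tables can be joined to answer questions such as "who uses the most space on a datamount on a given date". The following is a minimal sketch using Python's built-in `sqlite3` module against an in-memory copy of the schema; the helper name `usage_by_user`, the sample rows and the `date_text` format are assumptions made for illustration.

```python
import sqlite3

# Same four tables as in the .schema output above.
SCHEMA = """
CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT NOT NULL,
                    first_name TEXT NOT NULL, last_name TEXT NOT NULL);
CREATE TABLE dates (date_int INTEGER PRIMARY KEY, date_text TEXT UNIQUE NOT NULL);
CREATE TABLE datamounts (datamount_id INTEGER PRIMARY KEY,
                         datamount_name TEXT UNIQUE NOT NULL);
CREATE TABLE datapoints (
    datapoint_id INTEGER PRIMARY KEY,
    date_int INTEGER, datamount_id INTEGER, user_id INTEGER,
    ninodes INTEGER, nbytes INTEGER,
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    FOREIGN KEY (datamount_id) REFERENCES datamounts(datamount_id),
    FOREIGN KEY (date_int) REFERENCES dates(date_int)
);
"""


def usage_by_user(conn, datamount_name, date_int):
    """Return (username, ninodes, nbytes) rows, largest nbytes first."""
    query = """
        SELECT u.username, dp.ninodes, dp.nbytes
        FROM datapoints AS dp
        JOIN users AS u ON u.user_id = dp.user_id
        JOIN datamounts AS dm ON dm.datamount_id = dp.datamount_id
        WHERE dm.datamount_name = ? AND dp.date_int = ?
        ORDER BY dp.nbytes DESC
    """
    return conn.execute(query, (datamount_name, date_int)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up rows, for illustration only.
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)",
                 [(1, "alice", "Alice", "A"), (2, "bob", "Bob", "B")])
conn.execute("INSERT INTO dates VALUES (20240512, '2024-05-12')")
conn.execute("INSERT INTO datamounts VALUES (1, 'CCBR')")
conn.executemany(
    "INSERT INTO datapoints (date_int, datamount_id, user_id, ninodes, nbytes) "
    "VALUES (?, ?, ?, ?, ?)",
    [(20240512, 1, 1, 10, 1000), (20240512, 1, 2, 20, 2000)],
)
rows = usage_by_user(conn, "CCBR", 20240512)
print(rows)  # [('bob', 20, 2000), ('alice', 10, 1000)]
```

Note that SQLite only enforces the `FOREIGN KEY` clauses when `PRAGMA foreign_keys = ON` is set on the connection.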
47 changes: 47 additions & 0 deletions docs/pdq_update_db.md
@@ -0,0 +1,47 @@
## spacesavers2_pdq_update_db

pdq = Pretty Darn Quick

[`spacesavers2_pdq`](pdq.md) creates one TSV (or JSON) file per datamount per run (typically one per date). If run daily, this quickly produces many files to keep track of, so it is best to save the data in a SQLite db. The [`spacesavers2_pdq_create_db`](pdq_create_db.md) command creates the basic schema for that db; this command is then used to populate the database.

![pdq schema](assets/images/pdq_db_schema.png)

### Inputs
- `--tsv`: `.tsv` or `.tsv.gz` created using `spacesavers2_pdq`
- `--database`: `.db` file created using `spacesavers2_pdq_create_db`
- `--datamount`: name of the datamount, e.g. `CCBR` or `CCBR_Pipeliner`
- `--date`: integer date in YYYYMMDD format
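
The `--date` value is an integer such as `20240512`. The following is a minimal sketch of how such a value could be validated and converted to text for the `dates` table; `parse_date_int` is a hypothetical helper, and the `YYYY-MM-DD` text format is an assumption, since the format stored in `date_text` is not documented here.

```python
from datetime import datetime


def parse_date_int(date_int: int) -> str:
    """Validate a YYYYMMDD integer and return it as YYYY-MM-DD text.

    Hypothetical helper; raises ValueError for anything that is not a real date.
    """
    text = str(date_int)
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected an 8-digit YYYYMMDD date, got {date_int!r}")
    # strptime rejects impossible dates such as 20240230.
    return datetime.strptime(text, "%Y%m%d").strftime("%Y-%m-%d")


print(parse_date_int(20240512))  # 2024-05-12
```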

```bash
usage: spacesavers2_pdq_update_db [-h] -t TSV -o DATABASE -m DATAMOUNT -d DATE [-v]

spacesavers2_pdq_update_db: update/append data from TSV to DB file

options:
-h, --help show this help message and exit
-t TSV, --tsv TSV spacesavers2_pdq output TSV file
-o DATABASE, --database DATABASE
database file path (use spacesavers2_pdq_create_db to create it if it does not exist.)
-m DATAMOUNT, --datamount DATAMOUNT
name of the datamount eg. CCBR or CCBR_Pipeliner
-d DATE, --date DATE date in YYYYMMDD integer format
-v, --version show program's version number and exit
Version:
v0.13.0-dev
Example:
> spacesavers2_pdq_update_db -t /path/to/tsv -o /path/to/db -m datamount_name -d date
```
### Output
#### updated db file
The SQLite ".db" file with 4 tables is updated.
> NOTE:
>
> - new users are automatically added to "users" table
> - new datamounts are automatically added to "datamounts" table
> - new dates are automatically added to "dates" table
> - if one or more datapoints already exist in the ".db" for a (date + datamount) combination, a warning is displayed and no data is appended
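
The behaviour described in the note can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not the actual implementation of `spacesavers2_pdq_update_db` (its source is not shown here): `append_datapoints` is a hypothetical helper, the `users` table is left out for brevity, and the sample values are made up.

```python
import sqlite3

# Reduced copy of the schema (assumption: users table omitted to keep the sketch short).
SCHEMA = """
CREATE TABLE dates (date_int INTEGER PRIMARY KEY, date_text TEXT UNIQUE NOT NULL);
CREATE TABLE datamounts (datamount_id INTEGER PRIMARY KEY,
                         datamount_name TEXT UNIQUE NOT NULL);
CREATE TABLE datapoints (
    datapoint_id INTEGER PRIMARY KEY,
    date_int INTEGER, datamount_id INTEGER, user_id INTEGER,
    ninodes INTEGER, nbytes INTEGER
);
"""


def append_datapoints(conn, date_int, date_text, datamount, rows):
    """Append (user_id, ninodes, nbytes) rows; return how many were added."""
    # New dates and datamounts are added on first sight; existing ones are left alone.
    conn.execute("INSERT OR IGNORE INTO dates (date_int, date_text) VALUES (?, ?)",
                 (date_int, date_text))
    conn.execute("INSERT OR IGNORE INTO datamounts (datamount_name) VALUES (?)",
                 (datamount,))
    dm_id = conn.execute(
        "SELECT datamount_id FROM datamounts WHERE datamount_name = ?",
        (datamount,)).fetchone()[0]
    # Refuse to append if this (date + datamount) combination is already loaded.
    existing = conn.execute(
        "SELECT COUNT(*) FROM datapoints WHERE date_int = ? AND datamount_id = ?",
        (date_int, dm_id)).fetchone()[0]
    if existing > 0:
        print(f"WARNING: {existing} datapoints already exist for "
              f"{datamount} on {date_int}; nothing appended")
        return 0
    rows = list(rows)
    conn.executemany(
        "INSERT INTO datapoints (date_int, datamount_id, user_id, ninodes, nbytes) "
        "VALUES (?, ?, ?, ?, ?)",
        [(date_int, dm_id, user_id, ninodes, nbytes)
         for user_id, ninodes, nbytes in rows])
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sample = [(1, 10, 1000), (2, 20, 2000)]  # made-up values
first = append_datapoints(conn, 20240512, "2024-05-12", "CCBR", sample)
second = append_datapoints(conn, 20240512, "2024-05-12", "CCBR", sample)
print(first, second)  # 2 0
```

Checking for existing datapoints before inserting makes a rerun over the same TSV harmless, which matters when the update is driven by a loop over all files in a folder.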
1 change: 1 addition & 0 deletions extras/README.md
@@ -0,0 +1 @@
Location to store extra scripts!
40 changes: 40 additions & 0 deletions extras/create_and_append_db.sh
@@ -0,0 +1,40 @@
#!/bin/bash
# This script:
# 1. creates a sqlite3 database using `spacesavers2_pdq_create_db`
# 2. updates the database for "CCBR" mount related datapoints
# 3. updates the database for "CCBR_Pipeliner" mount related datapoints

module load ccbrpipeliner/6
BIN="/data/CCBR_Pipeliner/Tools/spacesavers2/pdq_db/bin"
DB="/data/CCBR_Pipeliner/userdata/spacesavers2_pdq/pdq.db"

if [[ "1" == "0" ]];then
    # Step 1.
    ${BIN}/spacesavers2_pdq_create_db -f "$DB"
fi

# Step 2.
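for f in /data/CCBR_Pipeliner/userdata/spacesavers2_pdq/_data_CCBR.*.tsv*
do
    [[ -e "$f" ]] || continue  # skip the literal pattern when nothing matches
    bn=$(basename "$f")
    echo "$bn"
    # the date is the 2nd dot-separated field of the file name
    dt=$(echo "$bn" | awk -F"." '{print $2}')
    dm="CCBR"
    "${BIN}/spacesavers2_pdq_update_db" \
        --tsv "$f" \
        --database "$DB" \
        --datamount "$dm" --date "$dt"
done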

# Step 3.
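for f in /data/CCBR_Pipeliner/userdata/spacesavers2_pdq/_data_CCBR_Pipeliner.*.tsv*
do
    [[ -e "$f" ]] || continue  # skip the literal pattern when nothing matches
    bn=$(basename "$f")
    echo "$bn"
    # the date is the 2nd dot-separated field of the file name
    dt=$(echo "$bn" | awk -F"." '{print $2}')
    dm="CCBR_Pipeliner"
    "${BIN}/spacesavers2_pdq_update_db" \
        --tsv "$f" \
        --database "$DB" \
        --datamount "$dm" --date "$dt"
done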
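
The two loops above differ only in the datamount name, and both take the date from the second dot-separated field of the file name. The following is a rough Python sketch of the same parsing; `parse_pdq_filename` and `update_db_command` are hypothetical helpers, and deriving the datamount from the `_data_` prefix is an assumption based on the glob patterns used in this script (the script itself hardcodes the name).

```python
from pathlib import Path


def parse_pdq_filename(path):
    """Split '_data_<datamount>.<YYYYMMDD>.tsv[.gz]' into (datamount, date)."""
    fields = Path(path).name.split(".")
    prefix = "_data_"
    if len(fields) < 3 or not fields[0].startswith(prefix):
        raise ValueError(f"unexpected file name: {path}")
    datamount = fields[0][len(prefix):]
    date = int(fields[1])  # same field that awk -F"." '{print $2}' extracts
    return datamount, date


def update_db_command(tsv, database):
    """Build the spacesavers2_pdq_update_db argument list for one TSV file."""
    datamount, date = parse_pdq_filename(tsv)
    return [
        "spacesavers2_pdq_update_db",
        "--tsv", str(tsv),
        "--database", str(database),
        "--datamount", datamount,
        "--date", str(date),
    ]


cmd = update_db_command("/tmp/_data_CCBR_Pipeliner.20240512.tsv.gz", "/tmp/pdq.db")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the command is on the `PATH`.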
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -104,3 +104,6 @@ nav:
- grubbers: grubbers.md
- usurp: usurp.md
- e2e: e2e.md
- pdq: pdq.md
- pdq_create_db: pdq_create_db.md
- pdq_update_db: pdq_update_db.md
109 changes: 109 additions & 0 deletions spacesavers2_pdq_create_db
@@ -0,0 +1,109 @@
#!/usr/bin/env python3
# pdq = pretty darn quick

from src.VersionCheck import version_check
from src.VersionCheck import __version__
from src.utils import *

version_check()

# import required modules
import os  # used below; imported explicitly instead of relying on "from src.utils import *"
import sqlite3
import textwrap
import argparse
from pathlib import Path

def main():
    elog = textwrap.dedent(
        """\
        Version:
            {}
        Example:
            > spacesavers2_pdq_create_db -f /path/to/sqlitedbfile
        """.format(
            __version__
        )
    )
    parser = argparse.ArgumentParser(
        description="spacesavers2_pdq_create_db: create a sqlitedb file with the optimized schema.",
        epilog=elog,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "-f",
        "--filepath",
        dest="filepath",
        required=True,
        type=str,
        help="spacesavers2_pdq_create_db will create this sqlitedb file",
    )
    parser.add_argument(
        "-o",
        "--overwrite",
        dest="overwrite",
        required=False,
        action=argparse.BooleanOptionalAction,
        help="overwrite output file if it already exists. Use this with caution as it will delete existing file and its contents!!",
    )
    parser.add_argument("-v", "--version", action="version", version=__version__)

    global args
    args = parser.parse_args()

    filepath = args.filepath
    p = Path(filepath).absolute()
    pp = p.parents[0]
    if not os.access(pp, os.W_OK):
        exit("ERROR: {} folder exists but cannot be written to".format(pp))
    if os.path.exists(p):
        if not args.overwrite:
            exit("ERROR: {} file exists and overwrite argument is not selected!".format(p))
        if not os.access(p, os.W_OK):
            exit("ERROR: {} file exists but is not writeable/appendable".format(p))
        if args.overwrite and os.access(p, os.W_OK):
            os.remove(p)

    # Connect to the SQLite database (or create it if it doesn't exist)
    conn = sqlite3.connect(p)
    cursor = conn.cursor()

    # Create the "users" table
    cursor.execute('''CREATE TABLE IF NOT EXISTS users (
        user_id INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL
    )''')

    # Create the "dates" table
    cursor.execute('''CREATE TABLE IF NOT EXISTS dates (
        date_int INTEGER PRIMARY KEY,
        date_text TEXT UNIQUE NOT NULL
    )''')

    # Create the "datamounts" table
    cursor.execute('''CREATE TABLE IF NOT EXISTS datamounts (
        datamount_id INTEGER PRIMARY KEY,
        datamount_name TEXT UNIQUE NOT NULL
    )''')


    # Create the "datapoints" table with foreign key constraints
    cursor.execute('''CREATE TABLE IF NOT EXISTS datapoints (
        datapoint_id INTEGER PRIMARY KEY,
        date_int INTEGER,
        datamount_id INTEGER,
        user_id INTEGER,
        ninodes INTEGER,
        nbytes INTEGER,
        FOREIGN KEY (user_id) REFERENCES users(user_id),
        FOREIGN KEY (datamount_id) REFERENCES datamounts(datamount_id),
        FOREIGN KEY (date_int) REFERENCES dates(date_int)
    )''')

    # Commit changes and close the connection
    conn.commit()
    conn.close()

if __name__ == "__main__":
    main()