Merge branch 'master' into add-coveralls-badge
Showing 17 changed files with 608 additions and 6 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,7 @@ sudo: false | |
|
||
python: | ||
- '2.7' | ||
- '3.4' | ||
- '3.5' | ||
- '3.6' | ||
|
||
|
@@ -13,4 +14,3 @@ install: | |
|
||
script: | ||
- flake8 . | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,203 @@ | ||
# Pshtt as an HTTPS status checker | ||
|
||
Welcome! This document explains how to run pshtt to scan sites for their | ||
HTTPS status. It covers running pshtt at scale on multiple Google Compute | ||
Engine instances and, at the end, running it as a single instance on your | ||
local machine. Setup takes about 30 minutes from start to finish. | ||
|
||
Running pshtt on 150 instances takes about 12 to 15 hours for a million sites. | ||
To size your own run, assume that each site takes at most 10 seconds (the | ||
default timeout) and pick the number of instances that fits your time frame. | ||
|
||
Example: 1000 sites at 10 seconds each is about 2.8 hours of scanning, so | ||
finishing in 2 hours takes 2 instances. | ||
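The sizing rule above can be written as a small calculation. This is a minimal sketch that assumes the worst case of 10 seconds per site; the function name `instances_needed` is hypothetical and not part of pshtt.

```python
import math

# Worst-case time per site in seconds (the default timeout mentioned above).
DEFAULT_TIMEOUT_SECONDS = 10


def instances_needed(num_sites, hours, timeout_seconds=DEFAULT_TIMEOUT_SECONDS):
    """Return how many instances are needed to scan num_sites within hours."""
    total_seconds = num_sites * timeout_seconds  # total worst-case scan time
    budget_seconds = hours * 3600                # time available per instance
    return max(1, math.ceil(total_seconds / budget_seconds))


# The example from the text: 1000 sites in 2 hours.
print(instances_needed(1000, 2))
```

Under the same worst-case assumption, a million sites on 150 instances would take roughly 18.5 hours, so the 12 to 15 hours quoted above presumably reflects sites that respond before the timeout.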
|
||
# How to run Pshtt on Google Cloud Engine | ||
|
||
## Before you run | ||
|
||
1. Set up a [Google Compute Engine | ||
account](https://cloud.google.com/compute/docs/access/user-accounts/). | ||
|
||
2. Make sure you have the correct quota allowances. | ||
|
||
* Go to the [quotas page](https://cloud.google.com/compute/quotas) | ||
and select the project that you want to run this under. | ||
* Request quotas: select the following items in the list and click | ||
"edit quotas" at the top of the page: | ||
* CPUs (all regions) --> 150 | ||
* In-use IP addresses --> 150 | ||
* One region's in-use IP addresses (e.g. us-west1) --> 150 | ||
* The same region's CPUs (e.g. us-west1) --> 150 | ||
|
||
3. Create Instance Group Template. | ||
|
||
You will want to run multiple instances (presumably), and creating an | ||
Instance Group template allows you to make up to 150 machines under the same | ||
template. | ||
|
||
* Go to Compute Engine, then click on the Instance templates | ||
tab and click "Create Instance Template". | ||
* Name --> "pshtt-template". | ||
* Machine type --> 1 CPU (n1-standard-1 (1 vCPU, 3.75 GB memory)). | ||
* Check allow HTTP and HTTPS traffic. | ||
* Boot disk --> Ubuntu 14.04 LTS. | ||
* Automatic restart (under the management tab) --> off. | ||
* Hit create. | ||
|
||
# How to run Pshtt on Google Cloud Engine | ||
|
||
1. Create an SSH key ONLY for the Google Cloud instances and upload it to your | ||
profile. | ||
|
||
This is a security measure. ***DO NOT USE YOUR REGULAR SSH KEY.*** | ||
|
||
* `cd ~/.ssh && ssh-keygen -t rsa -f gce_pshtt_key` | ||
* Go to the [metadata | ||
tab](https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys) and hit edit. | ||
* `cd ~/.ssh && cat gce_pshtt_key.pub` | ||
* Copy the output of the above command and paste it into the console. | ||
|
||
2. Create the instance group. | ||
|
||
It is important to name your instance group something identifiable, | ||
especially if you are sharing a project with others. Remember this instance | ||
group name for a later step. ***We recommend that you try 1 instance at | ||
first to make sure it works***. | ||
|
||
* Go to the instance group tab. | ||
* Click Multi-Zone, and select the region that you requested your | ||
instances for. | ||
* Choose "pshtt-template" under instance template. | ||
|
||
* Hit create. | ||
|
||
* Welcome to your new instance group! | ||
|
||
## Updating Data Files and Setting up to Run | ||
|
||
Run the following commands to set up your run directory. | ||
|
||
1. Download the gcloud command line tool. | ||
|
||
* Follow the [download | ||
link](https://cloud.google.com/sdk/docs/#install_the_latest_cloud_tools_version_cloudsdk_current_version) | ||
and install the correct SDK for your OS. | ||
* If this is your first time installing the gcloud command line tool, | ||
follow the instructions on the page. Do not set any default zones. | ||
* If you already have it installed, follow these instructions: | ||
* `gcloud init` | ||
* Enter `2` to create a new configuration. | ||
* Enter `pshtt-configuration` as the configuration name. | ||
* Choose the appropriate account. | ||
* Enter the number corresponding to your Google project. | ||
* If it complains that the API is not enabled, enable it and retry. | ||
* Do not set a default zone or region. | ||
* At this point, your default project should be this Google project. | ||
You can switch to any of your previous projects by running | ||
`gcloud config set project PROJECTNAME`. | ||
|
||
2. Setting up your directory. | ||
|
||
* `mkdir ~/pshtt_run` | ||
* Creates the dir that you will run your program out of. | ||
* `gcloud compute instances list | sed -n '1!p' | grep | ||
"<instance-group-name>" | awk '{print $5}' > ~/pshtt_run/hosts.txt` | ||
* `<instance-group-name>` is what you named the instance group you created | ||
above. | ||
|
||
3. Copy all .sh scripts and combine_shards.py from this directory into | ||
`~/pshtt_run`: | ||
|
||
* Keep the names of the scripts the same. | ||
* `chmod +x ~/pshtt_run/*.sh` | ||
* Makes all the scripts executable. | ||
* `touch domains.csv` | ||
* Your domain list, one domain per line, in a file whose name ends in | ||
`.csv`. | ||
* Domains must have the scheme (such as `https://`) stripped and no | ||
trailing '/', such as: | ||
* domain.tld | ||
* subdomain.domain.tld | ||
* www.subdomain.domain.tld | ||
* `mkdir ~/pshtt_run/data_results/` | ||
* `mv ~/pshtt_run/combine_shards.py ~/pshtt_run/data_results` | ||
* Places combine_shards.py into data_results/. | ||
* `mkdir ~/pshtt_run/input_files/` | ||
|
||
4. roots.pem | ||
|
||
We want to use our own CA file when running pshtt. We use the Mozilla root | ||
store for this purpose. Follow the instructions in the | ||
[extract-nss-root-certs repository](https://github.com/agl/extract-nss-root-certs). | ||
|
||
5. Updating ssh key | ||
|
||
* If your new ssh key is called "gce_pshtt_key", skip this step. | ||
* If you did not name your ***new*** ssh key gce_pshtt_key, then you will | ||
need to go through and rename the gce_pshtt_key in all the .sh files to | ||
whatever you named your key. | ||
* In vim, this is `:%s/gce_pshtt_key/yourkeynamehere/g` followed by Enter. | ||
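The input format required for `domains.csv` in step 3 (no scheme, no trailing '/') can be enforced before a run. The following is a minimal sketch; `normalize_domain` is a hypothetical helper, not part of pshtt, and it does not remove paths or validate that the entry is a real domain.

```python
def normalize_domain(entry):
    """Strip a leading scheme and any trailing '/' from one input entry."""
    entry = entry.strip()
    # Drop everything up to and including "://", e.g. "https://".
    if "://" in entry:
        entry = entry.split("://", 1)[1]
    # Remove trailing slashes left over from a URL.
    return entry.rstrip("/")


for raw in ["https://domain.tld/", "http://subdomain.domain.tld", "www.subdomain.domain.tld"]:
    print(normalize_domain(raw))
```

Running each line of a raw list through a helper like this before writing `domains.csv` avoids feeding pshtt entries it is not expecting.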
|
||
## How to run | ||
|
||
1. `screen -S pshtt_running` | ||
2. `cd ~/pshtt_run/` | ||
3. `./run_all_scripts <input_file_name> <number_of_shards> <shard_name> > | ||
log.out` | ||
* number of shards == number of hosts | ||
* each machine will contain a shard of the data to run. | ||
* This is the script that sets up all machines and puts all datafiles on | ||
the machines for running. | ||
* `./run_all_scripts top-1m.nocommas.8.31.2017 100 alexa` | ||
* will produce 100 shards all starting with "alexa" in the input_files | ||
dir. | ||
* ex. alexa000.csv | ||
* NOTE: you can ONLY create 999 shards. If you need more than 999 shards, | ||
you will need to change the split_up_dataset.sh file. | ||
4. Detach from screen with `Ctrl+a d`. | ||
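The shard naming described above (for example `alexa000.csv`) can be illustrated with a short sketch. It assumes a three-digit, zero-padded suffix, which is presumably why the shard count is capped; the real `split_up_dataset.sh` is not shown here and may distribute lines differently. Both function names are hypothetical.

```python
def shard_name(prefix, index):
    """Build a shard file name with a three-digit suffix, e.g. alexa000.csv."""
    if not 0 <= index < 1000:
        raise ValueError("a three-digit suffix cannot represent this index")
    return "{}{:03d}.csv".format(prefix, index)


def split_into_shards(domains, number_of_shards):
    """Split domains into number_of_shards contiguous chunks of near-equal size."""
    base, extra = divmod(len(domains), number_of_shards)
    shards = []
    start = 0
    for i in range(number_of_shards):
        # The first `extra` shards take one additional domain each.
        size = base + (1 if i < extra else 0)
        shards.append(domains[start:start + size])
        start += size
    return shards


print(shard_name("alexa", 0))
print([len(s) for s in split_into_shards(list(range(10)), 3)])
```

Since the number of shards equals the number of hosts, each chunk here corresponds to the work assigned to one machine.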
|
||
## During the run | ||
|
||
* `./check_instances.sh` | ||
* Prints the IP of each host along with FINISHED or NOT FINISHED, and | ||
reports whether the host's log contains an error. | ||
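The status check can be sketched in Python to show what the script looks for in each log. This mirrors the two `grep` calls in `check_instances.sh` (the marker `Wrote results` in the tail of the log, and `Traceback` anywhere in it); `log_status` is a hypothetical name and the sketch works on log text that has already been fetched.

```python
def log_status(log_text):
    """Return (finished, error) for the contents of one pshtt log file."""
    # `tail` prints the last 10 lines by default, so only those are searched.
    last_lines = log_text.splitlines()[-10:]
    finished = any("Wrote results" in line for line in last_lines)
    # The error check scans the whole log for a Python traceback.
    error = "Traceback" in log_text
    return finished, error


print(log_status("scanning domain.tld\nWrote results\n"))
```

A host is only reported as finished when the marker is near the end of the log, so a marker followed by many further lines would not count.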
|
||
## After the run | ||
|
||
* `./grab_and_combine_data.sh` | ||
|
||
* will grab all log and result data files, combine data files into one | ||
large result file, and put these into data_results/. | ||
|
||
* Delete your instance group. If you want to run data analysis, jump down to | ||
the data analysis portion. | ||
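The combining step performed by `grab_and_combine_data.sh` can be pictured as follows: each shard result is a JSON list of dicts, and the combined file holds one dict per line. This is a minimal in-memory sketch; `combine_shards` here is a hypothetical function, and the `"domain"` key is made up for illustration rather than taken from pshtt's real output.

```python
import json


def combine_shards(shard_texts):
    """Turn several JSON lists of dicts into one JSON object per line."""
    lines = []
    for text in shard_texts:
        for record in json.loads(text):
            lines.append(json.dumps(record))
    return "\n".join(lines)


# Two tiny made-up shards, each a JSON list of dicts.
shard_a = '[{"domain": "domain.tld"}]'
shard_b = '[{"domain": "subdomain.domain.tld"}, {"domain": "www.subdomain.domain.tld"}]'
print(combine_shards([shard_a, shard_b]))
```

Writing one object per line means the combined file is not a single JSON document, so it has to be read line by line.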
|
||
# Running Pshtt on your local machine | ||
|
||
1. Copy packages_to_install.sh and run it to install the required packages. | ||
* `sudo ./packages_to_install.sh` | ||
2. Clone pshtt. | ||
* `git clone https://github.com/dhs-ncats/pshtt.git` | ||
3. Put roots.pem, running_script.sh, and your input file in the same dir as | ||
pshtt. | ||
* Follow the directions under "Updating Data Files and Setting up to Run" | ||
above to get a roots.pem. | ||
* Domains must have the scheme (such as `https://`) stripped and no | ||
trailing '/', such as: | ||
* domain.tld | ||
* subdomain.domain.tld | ||
* www.subdomain.domain.tld | ||
* `chmod +x running_script.sh` to make it executable. | ||
4. Run `./running_script.sh <input_filename>` | ||
5. Results and profit. | ||
* Results can be found in `<input_filename>.json`. | ||
* If you want to be able to use this json file with any of the colab | ||
notebooks (like the one listed below), you will also need to run | ||
combine_shards.py: | ||
* Copy combine_shards.py into the same dir as the json file. | ||
* `echo <input_filename>.json > to_combine.txt` | ||
* `python combine_shards.py to_combine.txt > final_results.json` | ||
* Log can be found in `time_<input_filename>.txt`. |
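Because combine_shards.py prints one JSON object per line, `final_results.json` is read line by line rather than with a single `json.load`. The sketch below assumes that layout; `parse_results` is a hypothetical helper and the `"domain"` key is made up for illustration.

```python
import json


def parse_results(lines):
    """Parse combine_shards.py output: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# Stand-in for the lines of final_results.json; with a real file, pass the
# open file object instead, since iterating over it yields lines.
sample_lines = ['{"domain": "domain.tld"}\n', '{"domain": "subdomain.domain.tld"}\n', '\n']
records = parse_results(sample_lines)
print(len(records))
```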
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
#!/bin/bash | ||
|
||
# Checks all the instances in hosts and checks the end of the log file | ||
# to see if it's finished. The script prints out FINISHED or NOT FINISHED | ||
# for each host respectively. | ||
|
||
hosts_file='hosts.txt' | ||
list_of_files=$(ls -1q input_files) | ||
i=1 | ||
|
||
# Grab the correct input file for the corresponding machine. | ||
for z in $list_of_files; | ||
do | ||
machine=$(sed "${i}q;d" "$hosts_file") | ||
# Check if the end of the log has 'Wrote results', which indicates that it's finished. | ||
ssh -i ~/.ssh/gce_pshtt_key ubuntu@"${machine}" tail pshtt/time_"${z}".txt | grep -q 'Wrote results' | ||
finished=$? | ||
if [[ "${finished}" -eq 0 ]]; then | ||
echo 'server '"${machine}"' FINISHED' | ||
else | ||
echo 'server '"${machine}"' NOT FINISHED' | ||
fi | ||
# Check the whole log for a Python traceback, which indicates an error. | ||
ssh -i ~/.ssh/gce_pshtt_key ubuntu@"${machine}" cat pshtt/time_"${z}".txt | grep -q 'Traceback' | ||
error=$? | ||
if [[ "${error}" -eq 0 ]]; then | ||
echo 'server '"${machine}"' ERROR ON THIS MACHINE. CHECK INSTANCE.' | ||
else | ||
echo 'server '"${machine}"' NO ERROR.' | ||
fi | ||
((i=i+1)) | ||
done |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
"""Combines pshtt shards into one final data file.""" | ||
import json | ||
import sys | ||
|
||
|
||
def main(): | ||
    if len(sys.argv) < 2: | ||
        print('you need a filename!') | ||
        sys.exit(1) | ||
    # Master file is the file with the list of filenames to intake. | ||
    # Fileception. | ||
    master_file = sys.argv[1] | ||
    filenames = [] | ||
|
||
    # Read in the filenames that are the different shards. | ||
    with open(master_file, 'r') as input_file: | ||
        for line in input_file: | ||
            filenames.append(line.rstrip()) | ||
    # For each shard, read it in and print each of its entries as one | ||
    # JSON object per line. | ||
    for filename in filenames: | ||
        with open(filename, 'r') as input_file: | ||
            json_data = json.load(input_file) | ||
            for entry in json_data: | ||
                print(json.dumps(entry)) | ||
|
||
|
||
|
||
if __name__ == '__main__': | ||
    main() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
#!/bin/bash | ||
|
||
# If pshtt is done on all machines, it grabs both | ||
# the log file and the output file from the machines and | ||
# places them in the data_results/ directory. | ||
|
||
# This script also sets up the files to be combined by | ||
# the combine_shards script. Because pshtt outputs the results | ||
# as a list of dicts, we need to combine all of those lists. | ||
# We output the dicts as a file of dicts, one per line. | ||
hosts_file='hosts.txt' | ||
list_of_files=$(ls -1q input_files) | ||
i=1 | ||
|
||
for z in $list_of_files; | ||
do | ||
machine=$(sed "${i}q;d" "$hosts_file") | ||
echo 'Kicking off '"${machine}"' number '$i | ||
# Grab the actual result file. | ||
echo 'grabbing result file' | ||
scp -i ~/.ssh/gce_pshtt_key ubuntu@"${machine}":~/pshtt/"${z}".json data_results/ | ||
echo $? | ||
# Grab the log file from that machine. | ||
echo 'grabbing log file' | ||
scp -i ~/.ssh/gce_pshtt_key ubuntu@"${machine}":~/pshtt/time_"${z}".txt data_results/ | ||
echo $? | ||
echo 'creating to_combine.txt' | ||
touch data_results/to_combine.txt | ||
echo $? | ||
echo 'putting file name into combine script' | ||
echo "${z}"'.json' >> data_results/to_combine.txt | ||
echo $? | ||
((i=i+1)) | ||
done | ||
|
||
cd data_results || exit 1 | ||
python combine_shards.py to_combine.txt > final_results.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
#!/bin/bash | ||
|
||
# Installs all the necessary packages for pshtt to run. | ||
# Logs which package it is installing as well as its success (0) or failure | ||
# (nonzero). | ||
echo 'UPDATE' | ||
apt-get -y update -qq | ||
echo $? ' ERROR CODE' | ||
echo 'GIT' | ||
apt-get -y install git -qq | ||
echo $? ' ERROR CODE' | ||
echo 'PYTHON3-PIP' | ||
apt-get -y install python3-pip -qq | ||
echo $? ' ERROR CODE' | ||
echo 'LIBFFI6' | ||
apt-get -y install libffi6 libffi-dev -qq | ||
echo $? ' ERROR CODE' | ||
echo 'LIBSSL' | ||
apt-get -y install build-essential libssl-dev libffi-dev python3-dev -qq | ||
echo $? ' ERROR CODE' | ||
echo 'SETUPTOOLS' | ||
pip3 install --upgrade setuptools -qq | ||
echo $? ' ERROR CODE' | ||
echo 'CFFI' | ||
pip3 install cffi -qq | ||
echo $? ' ERROR CODE' | ||
echo 'SSLYZE' | ||
pip3 install sslyze -qq | ||
echo $? ' ERROR CODE' | ||
echo 'PUBLIC SUFFIX' | ||
pip3 install publicsuffix -qq | ||
echo $? ' ERROR CODE' | ||
echo 'REQUESTS' | ||
pip3 install --upgrade requests -qq | ||
echo $? ' ERROR CODE' | ||
echo 'DOCOPT' | ||
pip3 install docopt -qq | ||
echo $? ' ERROR CODE' | ||
echo 'PYOPENSSL' | ||
pip3 install pyopenssl -qq | ||
echo $? ' ERROR CODE' | ||
echo 'PYTABLEWRITER' | ||
pip3 install pytablewriter -qq | ||
echo $? ' ERROR CODE' | ||
echo 'TYPING' | ||
pip3 install typing -qq | ||
echo $? ' ERROR CODE' | ||
echo 'FINISHED INSTALLING PACKAGES' |