cipher1729/js-crawler

Crawl the web with Scrapy, collect JavaScript files, and train a classifier on features extracted from them.


------------------------------------
1. Running the spiders

#start Aquarium
#In a new terminal, run Aquarium as root (or with sudo); logs will print to the screen
cd /home/group9p1/aquarium
docker-compose up
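
Aquarium is a docker-compose setup that runs a load-balanced pool of Splash instances. Assuming the spider renders pages through Splash via scrapy-splash (an assumption; this README does not show the project settings), data_crawler/settings.py would contain something like the standard scrapy-splash wiring:

# A sketch of the standard scrapy-splash configuration, not this
# project's actual settings.
SPLASH_URL = 'http://localhost:8050'   # assumed port; Aquarium exposes Splash behind HAProxy
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'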

#First, download the latest alexa top-1M list into the crawler directory
python read.py
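
read.py is not reproduced in this README; below is a minimal sketch of an Alexa-list updater like it, assuming the list is fetched from Alexa's public top-1m.csv.zip endpoint and written to alexaUrls.txt (the output name mentioned under HELPER files):

# Hypothetical sketch of read.py: fetch the Alexa top-1M list and
# write the ranked domains to alexaUrls.txt.
import csv
import io
import zipfile
from urllib.request import urlopen

data = urlopen('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip').read()
archive = zipfile.ZipFile(io.BytesIO(data))
with archive.open('top-1m.csv') as f, open('alexaUrls.txt', 'w') as out:
    for rank, domain in csv.reader(io.TextIOWrapper(f)):
        out.write(domain + '\n')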

#populate a section of the list for the crawler to scrape
#the populate script takes a start number, an end number, and a padding number
#to scrape sites 1-100 with a padding of 5:
python populate.py 1 100 5
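
populate.py is likewise not shown; the sketch below selects the requested slice of alexaUrls.txt for the spider. How the padding argument is used is an assumption here (extra sites past the end of the range, as slack for hosts that fail to respond), and crawlList.txt is a hypothetical name for the file the spider reads:

# Hypothetical sketch of populate.py: slice the ranked list for the crawler.
import sys

start, end, padding = (int(arg) for arg in sys.argv[1:4])

with open('alexaUrls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# Assumption: padding adds extra sites past `end` as slack for dead hosts.
selected = urls[start - 1:end + padding]

with open('crawlList.txt', 'w') as out:   # hypothetical spider input file
    out.write('\n'.join(selected) + '\n')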

#run the spider
#In a new terminal, run the spider as root (or with sudo); logs will print to the screen
scrapy crawl getData
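
The getData spider lives inside the Scrapy project and is not shown in this README. Below is a minimal sketch of a spider in its spirit, rendering each site through Splash and collecting both inline and external scripts; the class name, input file, and item fields are assumptions, not the project's actual code:

# Hypothetical sketch of a spider like getData.
import scrapy
from scrapy_splash import SplashRequest


class GetDataSpider(scrapy.Spider):
    name = 'getData'

    def start_requests(self):
        with open('crawlList.txt') as f:   # hypothetical output of populate.py
            for domain in (line.strip() for line in f if line.strip()):
                yield SplashRequest('http://' + domain, self.parse,
                                    args={'wait': 2.0})

    def parse(self, response):
        # Inline scripts: bodies of <script> tags that have no src attribute
        for body in response.xpath('//script[not(@src)]/text()').extract():
            yield {'url': response.url, 'script': body}
        # External scripts: fetch each src and store the downloaded body
        for src in response.xpath('//script/@src').extract():
            yield scrapy.Request(response.urljoin(src), callback=self.parse_script)

    def parse_script(self, response):
        yield {'url': response.url, 'script': response.text}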


------------------------------------
2. Database

All collected scripts are stored in /home/group9p1/javascripts/http/<url>/
The database collection 'currDB' is currently in use. To change it, open
data_crawler/settings.py and set MONGODB_COLLECTION to the new collection name.
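
For example:

# in data_crawler/settings.py
MONGODB_COLLECTION = 'currDB'   # replace with the new collection name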

To view the database contents:
python client.py
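
A sketch of what a viewer like client.py might do, assuming a local MongoDB and the collection named above; the database name is not given in this README, so 'crawlDB' below is a placeholder:

# Hypothetical sketch of client.py: print every entry in the current collection.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['crawlDB']['currDB']   # database name is a placeholder

for doc in collection.find():
    print(doc)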

------------------------------------
3. Training the classifier

To train the classifier on the training data set in ml/train2.csv:
cd ml
python train.py
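
The model used by train.py isn't documented here; the sketch below assumes it uses scikit-learn and that ml/train2.csv follows the same layout as the test files in section 4 (a '<no of records>, 28' header line, then 28 feature values plus a label per row). The classifier choice and model file name are assumptions:

# Hypothetical sketch of train.py: fit a classifier on 28-feature rows.
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

data = np.loadtxt('train2.csv', delimiter=',', skiprows=1)  # skip '<n>, 28' header
X, y = data[:, :28], data[:, 28]

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

with open('model.pkl', 'wb') as f:   # assumed model file name
    pickle.dump(model, f)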

------------------------------------
4. Testing the classifier

cd ml
python classifier_test.py <path to test csv>

Each record in the test csv should have 28 comma-separated feature values with the expected label at the 29th position. The first line of the file should be of the form '<no of records>, 28'.
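
A quick way to check that a file matches this layout (a standalone sketch, not part of classifier_test.py):

# Validate the '<no of records>, 28' header and the row shape of a test csv.
import sys

with open(sys.argv[1]) as f:
    n_records, n_features = (int(x) for x in f.readline().split(','))
    assert n_features == 28, 'header should end in 28'
    rows = [line.strip().split(',') for line in f if line.strip()]

assert len(rows) == n_records, 'record count does not match header'
assert all(len(row) == 28 + 1 for row in rows), 'each row needs 28 features + a label'
print('%d records, format OK' % n_records)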


------------------------------------
HELPER files:

python testClass.py: read data from a delimiter-separated list of scripts and write their features to a csv file
python read.py: update the alexa list in alexaUrls.txt to the latest version
python client.py: print the entries in mongo
python checkScript.py: run input scripts through VirusTotal
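
checkScript.py's implementation isn't shown; below is a sketch of submitting one script to VirusTotal's public v2 file-scan endpoint (v2 was the current API when this repo was written; the API key is a placeholder):

# Hypothetical sketch of checkScript.py: submit a script file to
# VirusTotal for scanning via the public v2 API.
import sys

import requests

API_KEY = 'YOUR_VIRUSTOTAL_API_KEY'   # placeholder

with open(sys.argv[1], 'rb') as f:
    response = requests.post(
        'https://www.virustotal.com/vtapi/v2/file/scan',
        params={'apikey': API_KEY},
        files={'file': (sys.argv[1], f)},
    )
print(response.json())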


