cipher1729/js-crawler

Crawl the web with Scrapy, collect JavaScript files, and train a classifier on features extracted from them.


------------------------------------
1. Running the spiders

#start Aquarium
#In a new terminal, run Aquarium as root (or with sudo); logs will print to the screen
cd /home/group9p1/aquarium
docker-compose up
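
Aquarium is a docker-compose setup that runs a load-balanced pool of Splash instances. Assuming the spider renders pages through Splash via scrapy-splash (an assumption; this README does not show the project settings), data_crawler/settings.py would contain something like the standard scrapy-splash wiring:

# A sketch of the standard scrapy-splash configuration, not this
# project's actual settings.
SPLASH_URL = 'http://localhost:8050'   # assumed port; Aquarium exposes Splash behind HAProxy
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'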

#First, download the latest alexa top-1M list into the crawler directory
python read.py
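
read.py is not reproduced in this README; below is a minimal sketch of an Alexa-list updater like it, assuming the list is fetched from Alexa's public top-1m.csv.zip endpoint and written to alexaUrls.txt (the output name mentioned under HELPER files):

# Hypothetical sketch of read.py: fetch the Alexa top-1M list and
# write the ranked domains to alexaUrls.txt.
import csv
import io
import zipfile
from urllib.request import urlopen

data = urlopen('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip').read()
archive = zipfile.ZipFile(io.BytesIO(data))
with archive.open('top-1m.csv') as f, open('alexaUrls.txt', 'w') as out:
    for rank, domain in csv.reader(io.TextIOWrapper(f)):
        out.write(domain + '\n')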

#populate a section of the list for the crawler to scrape
#the populate script takes a start number, an end number, and a padding number
#to scrape sites 1-100 with a padding of 5:
python populate.py 1 100 5
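
populate.py is likewise not shown; the sketch below selects the requested slice of alexaUrls.txt for the spider. How the padding argument is used is an assumption here (extra sites past the end of the range, as slack for hosts that fail to respond), and crawlList.txt is a hypothetical name for the file the spider reads:

# Hypothetical sketch of populate.py: slice the ranked list for the crawler.
import sys

start, end, padding = (int(arg) for arg in sys.argv[1:4])

with open('alexaUrls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# Assumption: padding adds extra sites past `end` as slack for dead hosts.
selected = urls[start - 1:end + padding]

with open('crawlList.txt', 'w') as out:   # hypothetical spider input file
    out.write('\n'.join(selected) + '\n')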

#run the spider
#In a new terminal, run the spider as root (or with sudo); logs will print to the screen
scrapy crawl getData
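
The getData spider lives inside the Scrapy project and is not shown in this README. Below is a minimal sketch of a spider in its spirit, rendering each site through Splash and collecting both inline and external scripts; the class name, input file, and item fields are assumptions, not the project's actual code:

# Hypothetical sketch of a spider like getData.
import scrapy
from scrapy_splash import SplashRequest


class GetDataSpider(scrapy.Spider):
    name = 'getData'

    def start_requests(self):
        with open('crawlList.txt') as f:   # hypothetical output of populate.py
            for domain in (line.strip() for line in f if line.strip()):
                yield SplashRequest('http://' + domain, self.parse,
                                    args={'wait': 2.0})

    def parse(self, response):
        # Inline scripts: bodies of <script> tags that have no src attribute
        for body in response.xpath('//script[not(@src)]/text()').extract():
            yield {'url': response.url, 'script': body}
        # External scripts: fetch each src and store the downloaded body
        for src in response.xpath('//script/@src').extract():
            yield scrapy.Request(response.urljoin(src), callback=self.parse_script)

    def parse_script(self, response):
        yield {'url': response.url, 'script': response.text}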


------------------------------------
2. Database

All collected scripts are stored in /home/group9p1/javascripts/http/<url>/
The database collection 'currDB' is currently in use. To change it, open
data_crawler/settings.py and set MONGODB_COLLECTION to the new collection name.
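
For example:

# in data_crawler/settings.py
MONGODB_COLLECTION = 'currDB'   # replace with the new collection name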

To view the database contents:
python client.py
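
A sketch of what a viewer like client.py might do, assuming a local MongoDB and the collection named above; the database name is not given in this README, so 'crawlDB' below is a placeholder:

# Hypothetical sketch of client.py: print every entry in the current collection.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['crawlDB']['currDB']   # database name is a placeholder

for doc in collection.find():
    print(doc)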

------------------------------------
3. Training the classifier

To train the classifier on the training data set in ml/train2.csv:
cd ml
python train.py
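
The model used by train.py isn't documented here; the sketch below assumes it uses scikit-learn and that ml/train2.csv follows the same layout as the test files in section 4 (a '<no of records>, 28' header line, then 28 feature values plus a label per row). The classifier choice and model file name are assumptions:

# Hypothetical sketch of train.py: fit a classifier on 28-feature rows.
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

data = np.loadtxt('train2.csv', delimiter=',', skiprows=1)  # skip '<n>, 28' header
X, y = data[:, :28], data[:, 28]

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

with open('model.pkl', 'wb') as f:   # assumed model file name
    pickle.dump(model, f)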

------------------------------------
4. Testing the classifier

cd ml
python classifier_test.py <path to test csv>

Each record in the test csv should have 28 comma-separated feature values with the expected label at the 29th position. The first line of the file should be of the form '<no of records>, 28'.
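
A quick way to check that a file matches this layout (a standalone sketch, not part of classifier_test.py):

# Validate the '<no of records>, 28' header and the row shape of a test csv.
import sys

with open(sys.argv[1]) as f:
    n_records, n_features = (int(x) for x in f.readline().split(','))
    assert n_features == 28, 'header should end in 28'
    rows = [line.strip().split(',') for line in f if line.strip()]

assert len(rows) == n_records, 'record count does not match header'
assert all(len(row) == 28 + 1 for row in rows), 'each row needs 28 features + a label'
print('%d records, format OK' % n_records)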


------------------------------------
HELPER files:

python testClass.py: read data from a delimiter-separated list of scripts and write their features to a csv file
python read.py: update the alexa list in alexaUrls.txt to the latest version
python client.py: print the entries in mongo
python checkScript.py: run input scripts through VirusTotal
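
checkScript.py's implementation isn't shown; below is a sketch of submitting one script to VirusTotal's public v2 file-scan endpoint (v2 was the current API when this repo was written; the API key is a placeholder):

# Hypothetical sketch of checkScript.py: submit a script file to
# VirusTotal for scanning via the public v2 API.
import sys

import requests

API_KEY = 'YOUR_VIRUSTOTAL_API_KEY'   # placeholder

with open(sys.argv[1], 'rb') as f:
    response = requests.post(
        'https://www.virustotal.com/vtapi/v2/file/scan',
        params={'apikey': API_KEY},
        files={'file': (sys.argv[1], f)},
    )
print(response.json())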


