This application crawls a website using a queue with a maximum request concurrency of 5, finds all hyperlinks present within it, and saves them to a CSV file.
Two modules are provided to accomplish this: the first uses the async library (file name withAsyncLibrary.JS) and the second does not use the async library (file name withOutAsyncLibrary.js).
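As a rough illustration of the queue mechanism, the sketch below uses the async library's `queue` with a concurrency of 5. It is not the project's actual code; `fetchAndExtractLinks` is a hypothetical placeholder for the page-fetching and link-parsing logic the two modules implement themselves.

```js
const async = require('async');

// Hypothetical helper that fetches a page and returns the hyperlinks found on it;
// the real modules (withAsyncLibrary.JS / withOutAsyncLibrary.js) implement their own logic.
async function fetchAndExtractLinks(url) {
  return []; // placeholder: request the page and parse out <a href="..."> values
}

const visited = new Set();

// async.queue runs at most 5 workers at a time, matching the app's concurrency limit of 5.
const queue = async.queue(async (url) => {
  if (visited.has(url)) return; // avoid crawling the same page twice
  visited.add(url);
  const links = await fetchAndExtractLinks(url);
  links.forEach((link) => queue.push(link)); // newly discovered links are crawled in turn
}, 5);

queue.push('https://example.com'); // seed URL to start crawling from
```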
- Clone or download the repository.
- Go to the application folder.
- Install dependencies by running `npm install`.
- Run `npm run asyncapp` to test the functionality of the withAsyncLibrary.JS module.
- Run `npm run emitterapp` to test the functionality of the withOutAsyncLibrary.js module.
- To check the status of the URL where crawling starts, run `npm run test` (see the sketch after this list). It returns success if the website responds with status code 200; otherwise it throws an error.
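A minimal sketch of what such a status check could look like, assuming Node's built-in `https` module; the URL below is a placeholder, and the real test script presumably reads it from config.json.

```js
const https = require('https');

// Minimal sketch of a status check: succeed on HTTP 200, fail otherwise.
https.get('https://example.com', (res) => {
  if (res.statusCode === 200) {
    console.log('success: site responded with status code 200');
  } else {
    throw new Error(`unexpected status code: ${res.statusCode}`);
  }
  res.resume(); // discard the response body
});
```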
Hyperlinks found while crawling a website are maintained in a *.csv file, and alongside it a *.txt log file is maintained so that specific log entries can be checked with their time of occurrence.
- If you run `npm run asyncapp`, then `asyncapp.csv` and `asyncapp.txt` files will be generated.
- If you run `npm run emitterapp`, then `emitterapp.csv` and `emitterapp.txt` files will be generated.
- A config.json file is provided to configure specific options (see the example below):
  - `urltocrawl` can be changed to any web address you want to crawl.
  - `queueworkers` can be changed to the number of concurrent request connections required to crawl a website.
  - `logfilename` and `csvfilename` can be changed to any file names required for the log and CSV files.
  - `logfilewritemodeflag` and `csvfilewritemodeflag` determine the mode in which the log and CSV files are opened for operation; they can also be changed as required.
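An illustrative config.json with placeholder values; the key names are taken from the description above, and the write-mode flag values assume Node's fs file-system flags (e.g. `"a"` to append), which is an assumption rather than something this README specifies.

```json
{
  "urltocrawl": "https://example.com",
  "queueworkers": 5,
  "logfilename": "asyncapp.txt",
  "csvfilename": "asyncapp.csv",
  "logfilewritemodeflag": "a",
  "csvfilewritemodeflag": "a"
}
```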
- Don't spam web servers with too many requests; their servers might ban your IP.
- Only HTML pages are crawled, but the CSV file contains all hyperlinks found along the way (.css, *.png, etc.), as sketched below.
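A rough, hypothetical sketch of that distinction; the helper names and the extension-based check are assumptions for illustration, not the modules' actual implementation.

```js
const path = require('path');

// Hypothetical stand-ins for the app's real bookkeeping.
const recordToCsv = (link) => console.log(`csv: ${link}`);
const enqueueForCrawling = (link) => console.log(`crawl: ${link}`);

// Sketch: every discovered link is recorded, but only HTML-like URLs are crawled further.
function handleDiscoveredLink(link) {
  recordToCsv(link); // all hyperlinks (.css, .png, ...) end up in the CSV file

  const extension = path.extname(new URL(link).pathname).toLowerCase();
  if (extension === '' || extension === '.html' || extension === '.htm') {
    enqueueForCrawling(link); // only HTML pages are fetched and parsed for more links
  }
}

handleDiscoveredLink('https://example.com/style.css');  // recorded only
handleDiscoveredLink('https://example.com/about.html'); // recorded and crawled
```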