URL Scraper Microservice

URL Scraper is a scalable microservice that scrapes URLs and stores their responses. It can be set up using either of the following two methods.

First Method - Using Docker Compose

The recommended way to install the URL Scraper microservice is through Docker Compose.

# Clone the Repository
git clone https://github.com/spandansingh/url_scraper.git

Build & Run

docker-compose up --build -d

Yay! Everything is now up and running. This command builds and runs three services in separate Docker containers:

  • Microservice (Lumen, the stunningly fast micro-framework by Laravel)
    • a worker to process the URLs
    • an HTTP server exposing the API that reports failed URLs
  • Database server (MySQL)
  • Database client (phpMyAdmin)

These containers can be listed with the following command:

docker-compose ps

Docker Compose creates a local network between these containers.

If the worker fails to scrape a URL, it retries. The maximum number of retries can be changed via the RETRIES_THRESHOLD environment variable in docker-compose.yml; the default is 3. Other environment variables, such as the database credentials, can also be modified in docker-compose.yml.
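For example, the retry threshold might appear in docker-compose.yml like this (the service name `worker` is an assumption here; check the actual file in the repository for the real service names):

```yaml
services:
  worker:                      # service name is an assumption
    environment:
      RETRIES_THRESHOLD: 3     # retries per URL before it is marked as failed
```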

Docker Compose automatically pulls the Docker image. However, the image can also be built locally from the Dockerfile in the root folder. Run the following command to build it locally:

docker build -t spandy/url_scraper .

Let's populate some URLs in the database now! Navigate to phpMyAdmin, which is running at http://localhost:8181.

Create a database named moveinsync and import the SQL file found in the root folder of this repository.

Note: Since the worker is already running, you will see results appear in the table.
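The retry-with-threshold behavior the worker implements can be sketched as follows. This is a hypothetical Python illustration, not the actual PHP worker code; all names here are made up:

```python
import time

def scrape_with_retries(fetch, url, retries_threshold=3):
    """Try to fetch a URL, retrying on failure up to retries_threshold attempts.

    `fetch` is any callable that raises on failure; the names in this
    sketch are illustrative, not taken from the actual worker code.
    """
    attempt = 0
    while True:
        try:
            return fetch(url)
        except Exception:
            attempt += 1
            if attempt >= retries_threshold:
                raise  # give up: the URL would be recorded as failed
            time.sleep(0.1)  # brief back-off before the next attempt

# Example: a fetcher that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "response body"

print(scrape_with_retries(flaky_fetch, "http://example.com"))  # prints "response body"
```

Raising the threshold trades longer processing of permanently dead URLs for more tolerance of transient failures.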

API to get the report of failed URLs: http://localhost:8000/urls/failed

Default database credentials

Username - root
Password - moveinsync
Database - moveinsync

Stop Everything

docker-compose down

Second Method - Using Composer

# Clone the Repository
git clone https://github.com/spandansingh/url_scraper.git
# Install composer
curl -sS https://getcomposer.org/installer | php

Next, run the composer command inside the /app folder.

# Install dependencies
composer install

Now, set the MySQL database credentials in the app/.env file to connect to the database.

The exported SQL file is in the root folder of this repository.

The maximum number of retries can be changed via the RETRIES_THRESHOLD environment variable in the /app/.env file; the default is 3.
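A minimal app/.env sketch, assuming the standard Lumen database keys (DB_*) and illustrative credential values — the real keys and values in the repository's file may differ:

```
DB_HOST=127.0.0.1
DB_PORT=3306
DB_DATABASE=moveinsync
DB_USERNAME=root
DB_PASSWORD=moveinsync
RETRIES_THRESHOLD=3
```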

Now, change to the /app directory and start the worker to process URLs by running the following command.

# Start the worker
php artisan moveinsync:url_scraper

Now open another terminal instance and start the HTTP server.

# Start the HTTP Server
php -S localhost:8000 -t public

To get the report of failed URLs, use the API: http://localhost:8000/urls/failed
