A Node.js application for scraping business information from YellowPages.com (US only). Available with both command-line and web interfaces, now featuring Firebase authentication and clean URL routing.
- Features
- Installation
- Firebase Authentication Setup
- Usage
- Deployment
- Web Interface Features
- Output Format
- Important Notes
- Limitations
- Customization
- How It Works
- License
User Authentication:
- Firebase email/password login
- Password reset functionality
Search Capabilities:
- Search for businesses by type and location
- Collect information like:
- Business names
- Phone numbers
- Websites
- Complete addresses
Data Management:
- Save results as JSON or CSV files
- Browse, preview, and manage saved results
- Mobile-friendly web interface with light and dark themes
- Command-line interface for scripts and automation
Clone the repository:
git clone git@github.com:DevManSam777/yp_scraper.git
cd yp_scraper
Install dependencies:
npm install
Make sure the output directories exist:
mkdir -p json_results csv_results
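The same guard can also be scripted in Node if you prefer. This is a minimal sketch; the file name ensure-dirs.js is only a suggestion, and the directory names are the ones used above.

```js
// ensure-dirs.js - create the output folders if they do not exist yet
const fs = require("node:fs");

for (const dir of ["json_results", "csv_results"]) {
  fs.mkdirSync(dir, { recursive: true }); // no-op when the folder is already there
}
```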
Note: You can run the scraper from the command line without dockerizing it or spinning up the server. To get started with the CLI, scroll down to the Usage section and follow the instructions.
Clone the repository:
git clone git@github.com:DevManSam777/yp_scraper.git
cd yp_scraper
Run using Docker Compose:
docker-compose up
This will:
- Build the Docker image with all dependencies (including Chrome)
- Create and start the container
- Mount the necessary volumes for file storage
- Map port 3000 on your host to the container
- Create a Firebase project at console.firebase.google.com
- Enable Email/Password authentication
- Register a web app in your Firebase project
- Add users manually from the Firebase console, since sign-ups through the web app are intentionally disabled
- Update the Firebase configuration (these are public keys that can be safely exposed) in both of the following files (a sketch of the config block follows this list):
  - public/login.html
  - public/index.html (logout functionality)
- Add your development and production domains to Firebase authorized domains
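The configuration block in those files generally looks like the following. This is a minimal sketch with placeholder values, shown with npm-style module imports; the HTML pages may instead load the same modules from the Firebase CDN, and the real values come from your project's settings in the Firebase console.

```js
// Firebase web config - values are copied from Project settings → "Your apps"
import { initializeApp } from "firebase/app";
import { getAuth, signInWithEmailAndPassword } from "firebase/auth";

const firebaseConfig = {
  apiKey: "YOUR_API_KEY",                      // public identifier, not a secret
  authDomain: "your-project.firebaseapp.com",
  projectId: "your-project",
  storageBucket: "your-project.appspot.com",
  messagingSenderId: "YOUR_SENDER_ID",
  appId: "YOUR_APP_ID",
};

const app = initializeApp(firebaseConfig);
const auth = getAuth(app);

// the login form then calls signInWithEmailAndPassword(auth, email, password)
```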
Run the scraper in interactive mode:
npm run search
You'll be prompted to enter:
- What you're looking for (e.g., "pizza")
- Where (e.g., "Los Angeles, CA")
- Number of results to collect
- How to save the results (JSON or CSV)
Start the web server:
npm start
Open your browser and go to:
http://localhost:3000
Log in with your Firebase credentials
Use the interface to:
- Configure and start searches
- Monitor real-time progress
- View and manage results
- Preview and download files
Render offers native support for running containerized apps and services, making it a great choice for this Dockerized application.
Prerequisites:
- GitHub/GitLab repository with your code
- Firebase project configured for authentication
Step-by-Step Render Deployment:
1. Create Web Service on Render
   - Go to render.com and sign up or log in
   - Click "New +" → "Web Service"
   - Choose "Build and deploy from a Git repository"
   - Connect your GitHub/GitLab account
   - Select your YP Scraper repository
2. Configure Service Settings
   - Name: yp-scraper (or your preferred name)
   - Region: Choose the region closest to your users
   - Branch: main (or your default branch)
   - Runtime: Docker (Render auto-detects your Dockerfile)
   - Build Command: Leave empty (Docker handles this)
   - Start Command: Leave empty (uses Dockerfile CMD)
3. Set Environment Variables (Optional)
   - Add any custom environment variables you need
   - PORT is set automatically by Render (see the sketch after these steps)
4. Configure Firebase
   - In your Firebase console, add your Render domain to the authorized domains
   - Format: your-app-name.onrender.com
5. Deploy
   - Click "Create Web Service"
   - Render builds your Docker image on every push to your repo and stores it in a private, secure container registry
   - The first deployment takes 5-10 minutes
   - Subsequent deployments are faster thanks to layer caching
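As a reference for the PORT note in step 3, this is the usual way a Node/Express server picks up the variable Render injects; it is an illustrative sketch, not the project's actual web-server.js.

```js
// illustrative PORT fallback pattern for an Express server
const express = require("express");
const app = express();

const port = process.env.PORT || 3000; // Render sets PORT; 3000 matches the local default
app.listen(port, () => console.log(`Server listening on port ${port}`));
```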
Important Render Considerations:
Free Tier Limitations:
- Apps sleep after 15 minutes of inactivity (causes ~50 second cold start)
- 750 hours of runtime per month, shared across ALL your projects (not 750 per project)
- No persistent disk storage on free tier
Recommended: Upgrade to a paid plan for persistent storage and no sleep mode.
Free Tier Pro-Tip: Use cron-job.org to send an HTTP request to your app every 10 minutes to prevent it from spinning down due to inactivity (use at your own risk).
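If you would rather run the ping yourself (from any machine or scheduled job that stays online) instead of using cron-job.org, a minimal Node sketch looks like this; the URL is a placeholder for your deployed app, and the same caveats apply.

```js
// keep-alive.js - request the app every 10 minutes so the free instance is not idled
const APP_URL = "https://your-app-name.onrender.com"; // placeholder

setInterval(async () => {
  try {
    const res = await fetch(APP_URL); // global fetch requires Node 18+
    console.log(`Pinged ${APP_URL}: HTTP ${res.status}`);
  } catch (err) {
    console.error("Ping failed:", err.message);
  }
}, 10 * 60 * 1000);
```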
File Storage Strategy: Since Render uses ephemeral storage:
- Generated files (JSON/CSV) are temporary and lost on restarts
- Recommended approach: Users should download files immediately after generation
- Alternative: Upgrade to paid plan with persistent disk storage
- Advanced users: Link a database to store file metadata and results for persistence
Automatic Deployments:
- Every git push to your main branch triggers a new deployment
- Zero-downtime deployments ensure no service interruption
The application can also be deployed to other Docker-supporting platforms:
- Railway: Similar git-based deployment process
- Fly.io: Global deployment with static IPs
- DigitalOcean App Platform: Managed infrastructure
- Heroku: Using container registry deployment
The web interface provides:
- Clean URL Routing: User-friendly URLs without .html extensions (see the routing sketch after this list):
  - /login - Authentication page
  - / - Main application (protected)
- Login Screen: Secure access to the application
- Search Tab: Configure and run searches
- Results Tab: View detailed business information
- Files Tab: Manage saved JSON and CSV files
- Real-time Progress: Monitor search status
- File Preview: Quick view of saved results
- Responsive Design: Works on mobile devices
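Clean URL routing of this kind is typically wired up in Express roughly as follows. This is a sketch only; the route paths match the list above, but the middleware and file paths are assumptions rather than the project's actual web-server.js.

```js
const path = require("node:path");
const express = require("express");
const app = express();

// serve static assets (CSS, client JS, images) from /public
app.use(express.static(path.join(__dirname, "public")));

// clean URLs: no .html extensions in the address bar
app.get("/login", (req, res) => {
  res.sendFile(path.join(__dirname, "public", "login.html"));
});

app.get("/", (req, res) => {
  res.sendFile(path.join(__dirname, "public", "index.html"));
});

app.listen(process.env.PORT || 3000);
```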
JSON results are saved as an array of business objects:
[
{
"businessName": "Pizza Place",
"businessType": "Pizza, Italian Restaurant",
"phone": "(555)123-4567",
"website": "https://example.com",
"streetAddress": "123 Main St",
"city": "Los Angeles",
"state": "CA",
"zipCode": "90001"
}
]
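A saved JSON file can then be post-processed with a few lines of Node. The sketch below lists every business that has a website; the file name is illustrative.

```js
// filter a saved result file for businesses that list a website
const fs = require("node:fs");

const results = JSON.parse(fs.readFileSync("json_results/pizza_los_angeles.json", "utf8"));
for (const biz of results.filter((b) => b.website)) {
  console.log(`${biz.businessName}: ${biz.website}`);
}
```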
CSV results are saved with the following columns:
- Business Name
- Business Type
- Phone
- Website
- Street Address
- City
- State
- ZIP Code
- Neither this application nor its creator is affiliated in any way, shape, or form with YellowPages.com
- For educational and demonstration purposes only
- Only works with YellowPages.com
- Please refer to YellowPages.com Terms of Service before using
- Might break if the website structure changes
- Use carefully and responsibly
- Use at your own discretion and risk
- Limited to ~1000 results per search
- Rotating proxies recommended for extensive use
- Consider file storage persistence for production deployments
- Port: Change the web server port by setting the PORT environment variable
- Results Limit: Modify the maximum results in puppeteer-scraper-module.js and adjust the corresponding UI values in public/app/index.html
- URL Routing: The application uses clean URL paths without file extensions
The scraper uses Puppeteer with stealth plugins to navigate YellowPages search results and extract business information. The application architecture includes:
- puppeteer-scraper-module.js: Core scraper functionality
- puppeteer-scraper-cli.js: Command-line interface
- web-server.js: Web server and API endpoints with clean URL routing
- public/index.html: Web interface
- public/login.html: Authentication interface
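The stealth setup mentioned above generally looks like the following minimal sketch using the puppeteer-extra packages; the search URL and extraction step are illustrative, not the project's exact code.

```js
// minimal puppeteer-extra + stealth plugin setup
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin()); // masks common headless-browser fingerprints

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // example search results page (query and location are URL-encoded)
  await page.goto(
    "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=Los+Angeles%2C+CA"
  );

  // ...extract business name, phone, website, and address from the result cards here...

  await browser.close();
})();
```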
Copyright (c) 2025 DevManSam