A high-performance web scraping system that collects news articles from Google News and Bing News across multiple AWS regions concurrently. The system uses region-based context awareness to gather diverse news perspectives from different geographical locations.
- Multi-Source Collection: Scrapes from both Google News and Bing News
- Region-Based Context: Collects data from 6 different AWS regions for geographical diversity
- Concurrent Processing: Parallel execution for maximum efficiency
- Real-time Saving: Articles are saved immediately as they are collected
- AWS Lambda Integration: Serverless scraping to avoid rate limiting and IP blocking
- Comprehensive Logging: Detailed logging with automatic cleanup and rotation
- Data Processing: Built-in data cleaning, deduplication, and format conversion
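The concurrent, region-based collection can be sketched as follows. The function names and result shape here are illustrative, not the project's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

# The six regions the project collects from
REGIONS = ["us-west-1", "us-east-2", "ap-northeast-1",
           "ap-northeast-2", "eu-west-3", "eu-west-2"]

def scrape_region(topic, region):
    # Placeholder for the per-region, Lambda-backed scrape
    return {"region": region, "topic": topic, "articles": []}

def scrape_topic_concurrently(topic):
    # One worker per region, so all six regions are queried in parallel
    with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
        futures = [pool.submit(scrape_region, topic, r) for r in REGIONS]
        return [f.result() for f in futures]

results = scrape_topic_concurrently("example topic")
```

Each topic thus yields one result set per region, which is what makes the geographical comparison possible.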
- Python 3.8+
- AWS Account with Lambda functions deployed
- Required Python packages (see `requirements.txt`)
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd "Context-Aware Concurrent Data Collection"
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Configure AWS credentials
   - Set up your AWS credentials using the AWS CLI or environment variables
   - Ensure Lambda functions are deployed in the target regions
```
Context-Aware Concurrent Data Collection/
├── README.md
├── requirements.txt
├── start.py                      # Main entry point
├── aws/
│   ├── aws_functions.json        # AWS Lambda configuration
│   ├── lambda_function.py        # Lambda function code
│   └── lambda_updater.py         # Lambda deployment tool
├── config/
│   ├── topic.csv                 # Topics to scrape
│   ├── google_news/
│   │   ├── cookies.json          # Cookie configuration
│   │   └── headers.json          # HTTP headers
│   └── bing_news/
│       ├── cookies.json
│       └── headers.json
├── core/
│   ├── aws_client.py             # AWS Lambda client
│   └── utils.py                  # Utility functions
├── parsers/
│   ├── base_parser.py            # Base parser class
│   ├── google_news_parser.py     # Google News parser
│   └── bing_news_parser.py       # Bing News parser
├── context_aware_concurrent_collector/
│   ├── detail_content_scraper.py # Main scraping engine
│   ├── data_processor.py         # Data processing and saving
│   └── requester.py              # HTTP request utilities
├── user_context_controller/
│   └── config.py                 # Configuration management
└── datasets/                     # Output directory (auto-created)
    └── YYYY-MM-DD/               # Date-based folders
        ├── google_news/
        │   └── region/
        └── bing_news/
            └── region/
```
Edit `config/topic.csv` to specify the topics you want to scrape:

```csv
query
<topic-1>
<topic-2>
<topic-3>
```

Update `aws/aws_functions.json` with your Lambda function ARNs:
```json
{
  "us-west-1": [
    {
      "region": "us-west-1",
      "arn": "arn:aws:lambda:us-west-1:<account-id>:function:<function-name>"
    }
  ],
  "us-east-2": [
    {
      "region": "us-east-2",
      "arn": "arn:aws:lambda:us-east-2:<account-id>:function:<function-name>"
    }
  ],
  "ap-northeast-1": [
    {
      "region": "ap-northeast-1",
      "arn": "arn:aws:lambda:ap-northeast-1:<account-id>:function:<function-name>"
    }
  ],
  "ap-northeast-2": [
    {
      "region": "ap-northeast-2",
      "arn": "arn:aws:lambda:ap-northeast-2:<account-id>:function:<function-name>"
    }
  ],
  "eu-west-3": [
    {
      "region": "eu-west-3",
      "arn": "arn:aws:lambda:eu-west-3:<account-id>:function:<function-name>"
    }
  ],
  "eu-west-2": [
    {
      "region": "eu-west-2",
      "arn": "arn:aws:lambda:eu-west-2:<account-id>:function:<function-name>"
    }
  ]
}
```

Customize cookies and headers in:
- `config/google_news/cookies.json`
- `config/google_news/headers.json`
- `config/bing_news/cookies.json`
- `config/bing_news/headers.json`
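Loading these configuration files is straightforward; the helper below is a sketch, not the project's actual `ConfigManager` code (function names and the `base` parameter are illustrative):

```python
import csv
import json
from pathlib import Path

def load_topics(path="config/topic.csv"):
    # Read the single-column CSV of queries (header row: "query")
    with open(path, newline="", encoding="utf-8") as f:
        return [row["query"] for row in csv.DictReader(f)]

def load_request_config(source="google_news", base="config"):
    # Load the per-source cookies and headers JSON files
    src = Path(base) / source
    cookies = json.loads((src / "cookies.json").read_text(encoding="utf-8"))
    headers = json.loads((src / "headers.json").read_text(encoding="utf-8"))
    return cookies, headers
```

The returned dictionaries can be passed directly to an HTTP client such as `requests` via its `cookies=` and `headers=` arguments.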
```bash
# Run a single test scrape
python start.py

# Test configurations
python start.py test

# Run region mode immediately
python start.py all

# Start the scheduled scraper (runs daily at 00:01)
python start.py schedule

# Collect detail content from the collected URLs
python url2content.py
```

The scraper can also be driven from Python:

```python
from user_context_controller.config import ConfigManager
from context_aware_concurrent_collector.detail_content_scraper import DetailContentScraper

# Initialize
config_manager = ConfigManager('google_news', mode='region')
scraper = DetailContentScraper('google_news')

# Load configuration
topics = config_manager.load_topics()
config = config_manager.get_config_by_mode()

# Run scraping
articles = scraper.sequential_scraping(topics, queries, config, mode='region')
```

Each article is written as a CSV row:

```csv
title,description,url,published_date,source,query,topic,perspective,scraper,user_agent
"<article-title>","<description>","<article-url>","<date>","<source>","<query>","<topic>","default","google_news","Chrome-Windows"
```

The equivalent JSON record:

```json
{
  "title": "<article-title>",
  "description": "<article-description>",
  "url": "<article-url>",
  "published_date": "<date>",
  "source": "<source>",
  "query": "<query>",
  "topic": "<topic>",
  "perspective": "default",
  "scraper": "google_news",
  "user_agent": "Chrome-Windows"
}
```

Output directory layout:

```
datasets/
└── YYYY-MM-DD/
    ├── google_news/
    │   └── region/
    │       ├── <topic>_us-west-1.csv
    │       ├── <topic>_us-east-2.csv
    │       └── ...
    └── bing_news/
        └── region/
            ├── <topic>_ap-northeast-1.csv
            ├── <topic>_ap-northeast-2.csv
            └── ...
```
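The date-based layout maps to a small path helper. This is a sketch, not the project's actual `data_processor` code (function names are illustrative):

```python
import csv
import os
from datetime import date

def output_path(source, topic, region, base="datasets"):
    # datasets/YYYY-MM-DD/<source>/region/<topic>_<region>.csv
    day = date.today().isoformat()
    folder = os.path.join(base, day, source, "region")
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, f"{topic}_{region}.csv")

def append_articles(path, articles, fieldnames):
    # Append rows immediately so data survives a crash (real-time saving)
    new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerows(articles)
```

Appending rather than buffering is what makes the "real-time saving" feature cheap: at most one batch of rows is ever held in memory.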
The system collects data from 6 AWS regions to ensure geographical diversity:
| Region | Location | Purpose |
|---|---|---|
| us-west-1 | US West Coast | North American perspective |
| us-east-2 | US East (Ohio) | North American perspective |
| ap-northeast-1 | Tokyo | Asian perspective |
| ap-northeast-2 | Seoul | Asian perspective |
| eu-west-3 | Paris | European perspective |
| eu-west-2 | London | European perspective |
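Given an entry from `aws/aws_functions.json`, a per-region invocation might look like the sketch below. This is an assumption about the shape of the real client in `core/aws_client.py`, using boto3's standard synchronous `invoke` call; the payload keys are illustrative:

```python
import json

def region_from_arn(arn):
    # ARN format: arn:aws:lambda:<region>:<account-id>:function:<name>
    return arn.split(":")[3]

def invoke_region_lambda(arn, payload):
    import boto3  # imported lazily; requires AWS credentials at call time
    client = boto3.client("lambda", region_name=region_from_arn(arn))
    resp = client.invoke(
        FunctionName=arn,
        InvocationType="RequestResponse",      # synchronous call
        Payload=json.dumps(payload).encode(),  # e.g. {"url": ..., "headers": ...}
    )
    return json.loads(resp["Payload"].read())
```

Because the region is embedded in the ARN, one helper can serve all six regions without extra configuration.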
- Location: `logs/` directory
- Format: `YYYY-MM-DD_scraper_mode_HHMMSS.log`
- Automatic rotation: 10 MB max, 5 backup files
- Auto cleanup: files older than 7 days are removed
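The rotation and cleanup policy above maps directly onto the standard library; a sketch (function and logger names are illustrative):

```python
import logging
import os
import time
from logging.handlers import RotatingFileHandler

def make_logger(log_path):
    # Rotate at 10 MB, keeping 5 backup files
    logger = logging.getLogger("scraper")
    handler = RotatingFileHandler(log_path, maxBytes=10 * 1024 * 1024,
                                  backupCount=5)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

def cleanup_logs(log_dir="logs", max_age_days=7):
    # Delete log files whose modification time is past the cutoff
    cutoff = time.time() - max_age_days * 86400
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
```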
- INFO: General operation status
- WARNING: Non-critical issues
- ERROR: Critical errors
- DEBUG: Detailed debugging information
The system provides real-time progress updates:
```
Scraping start - processing N topics total
Mode: region mode (6 parallel)
[1/N] Processing: <topic>
<timestamp> | <lambda-arn> | <scraper> | news | <topic> | <region> | Page 1 | <count> collected | Total <total>
Config 1 (<region>): <count> articles
Waiting for next query...
```
1. AWS Lambda Timeout
   - Increase the Lambda timeout in the AWS console
   - Reduce `max_articles_per_query` in the code

2. Rate Limiting
   - Adjust the `random_sleep()` parameters
   - Increase delays between requests

3. Cookie/Header Issues
   - Update the cookies and headers in the config files
   - Use browser developer tools to capture fresh values

4. Memory Issues
   - Process smaller batches
   - Enable real-time saving to reduce memory usage
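The "process smaller batches" advice can be as simple as slicing the topic list; a minimal sketch:

```python
def batched(items, size):
    # Yield fixed-size slices so only one batch is in flight at a time
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. process five topics two at a time
batches = list(batched(["t1", "t2", "t3", "t4", "t5"], 2))
```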
Enable detailed logging:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

- Concurrent Processing: 6 parallel Lambda executions per topic
- Real-time Saving: Immediate file writing to prevent data loss
- Efficient Pagination: Smart page limit calculation
- Memory Management: Streaming data processing
- Error Recovery: Automatic retry with exponential backoff
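The retry behaviour listed above can be expressed as a generic helper; this is an illustrative sketch, and the project's actual retry logic may differ:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=1.0):
    # Retry fn(), doubling the delay after each failure, plus jitter
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter term spreads retries out over time, which matters when six regions fail simultaneously and would otherwise retry in lockstep.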
- AWS credentials should be stored securely
- Lambda functions run in isolated environments
- No sensitive data stored in logs
- Rate limiting to respect target websites
This project is for research and educational purposes. Please ensure compliance with the terms of service of the websites being scraped.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For issues and questions:
- Check the troubleshooting section
- Review log files for error details
- Open an issue with detailed error information
Version: 2.0 (Region Mode Only)