🎯 Is your feature request related to a problem?
As more spiders are introduced to the crawler system (such as the recent Glassdoor and Internshala additions), data consistency becomes a challenge. Different job boards structure their raw scraped payloads slightly differently (e.g., varying date formats, mismatched casing, or missing optional fields like salary ranges or exact company locations).
Currently, pushing raw dictionaries directly down the pipeline can cause silent database insertion failures or structural inconsistencies in PostgreSQL.
✨ Describe the proposed solution
I propose introducing a strict data validation and normalization layer using Pydantic v2 right before data is dispatched to Redis Streams or the Postgres database layer.
By defining unified data models, we can:
- Guarantee runtime type safety and fail-fast validation for all inbound scraped jobs/contacts.
- Implement custom Pydantic validators (@field_validator) to normalize data on the fly (e.g., converting strings to standard datetime objects, stripping whitespace, and enforcing lowercased email fields).
- Provide safe fallback defaults for non-mandatory missing attributes.
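To make the three points above concrete, here is a minimal sketch of what such a model could look like in Pydantic v2. The field names (title, company, posted_at, contact_email, salary_range, location) and the accepted date formats are assumptions for illustration only, not the project's actual schema.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, field_validator


class JobIngestModel(BaseModel):
    # Hypothetical fields; the real schema may differ.
    title: str
    company: str
    posted_at: datetime
    contact_email: Optional[str] = None
    salary_range: Optional[str] = None  # fallback default for a missing optional field
    location: Optional[str] = None

    @field_validator("title", "company", mode="before")
    @classmethod
    def strip_whitespace(cls, value):
        # Trim stray whitespace that scrapers often leave around text.
        return value.strip() if isinstance(value, str) else value

    @field_validator("posted_at", mode="before")
    @classmethod
    def parse_posted_at(cls, value):
        # Try a couple of assumed board-specific formats first; anything
        # else is handed to Pydantic's built-in datetime parsing.
        if isinstance(value, str):
            for fmt in ("%d/%m/%Y", "%b %d, %Y"):
                try:
                    return datetime.strptime(value.strip(), fmt)
                except ValueError:
                    continue
        return value

    @field_validator("contact_email")
    @classmethod
    def lowercase_email(cls, value):
        # Normalize casing so the same address always compares equal.
        return value.strip().lower() if value is not None else value


job = JobIngestModel(
    title="  Data Engineer ",
    company="Acme ",
    posted_at="05/03/2024",
    contact_email=" HR@Example.COM ",
)
```

Fields that are not supplied (salary_range, location) simply come back as None, so downstream code can rely on every attribute being present.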
🛠️ Technical Implementation Steps
- Define Schemas: Create a centralized schemas/ directory or update models in the backend to define JobIngestModel and ContactIngestModel using Pydantic.
- Data Cleansing: Add field validators to clean and sanitize text fields, format URLs, and enforce structural constraints.
- Pipeline Integration: Wrap the incoming message consumer or spider output pipeline in a validation block:
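    from pydantic import ValidationError

    try:
        validated_job = JobIngestModel(**raw_scraped_data)
    except ValidationError as e:
        # Log and drop the payload instead of passing it on to Redis/Postgres.
        logger.error(f"Dropping invalid job payload: {e.json()}")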
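For batches of scraped items, the same validation block could be wrapped in a small helper that separates good records from rejected ones, so that invalid payloads can be inspected or routed elsewhere rather than only logged. This is a rough sketch: validate_jobs, the "ingest" logger name, and the reduced JobIngestModel fields are hypothetical.

```python
import logging
from typing import Any, Optional

from pydantic import BaseModel, ValidationError

logger = logging.getLogger("ingest")  # assumed logger name


class JobIngestModel(BaseModel):
    # Reduced stand-in for the real schema; the fields are assumptions.
    title: str
    url: str
    salary_range: Optional[str] = None


def validate_jobs(
    raw_items: list[dict[str, Any]],
) -> tuple[list[JobIngestModel], list[dict[str, Any]]]:
    """Split raw payloads into validated models and rejected payloads."""
    valid: list[JobIngestModel] = []
    rejected: list[dict[str, Any]] = []
    for raw in raw_items:
        try:
            valid.append(JobIngestModel.model_validate(raw))
        except ValidationError as exc:
            # error_count() and json() are provided by Pydantic v2's ValidationError.
            logger.error(
                "Dropping invalid job payload (%d errors): %s",
                exc.error_count(),
                exc.json(),
            )
            rejected.append(raw)
    return valid, rejected


valid, rejected = validate_jobs(
    [
        {"title": "Dev", "url": "https://example.com/1"},
        {"title": "No URL"},  # missing required field, so it is rejected
    ]
)
```

Returning the rejected payloads alongside the valid ones keeps the drop decision visible to the caller, which could later push them to a dead-letter stream if that turns out to be useful.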