The Reddit Data Pipeline Project is designed to stream data from Reddit using Python, broker it through Apache Kafka, analyze it with Apache Spark, and orchestrate the entire pipeline with Docker Compose. This project aims to demonstrate real-time data processing and analytics capabilities.
- Real-Time Data Streaming: Streams live posts and comments from Reddit (a producer sketch follows this list).
- Message Brokering with Kafka: Efficient handling and queuing of streaming data.
- Data Processing with Spark: Real-time data analysis and processing.
- Dockerized Environment: Each component runs in a Docker container for easy setup and deployment.
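To make the streaming and brokering steps concrete, here is a minimal sketch of what the producer side of such a pipeline can look like. It assumes `praw` for the Reddit API and `kafka-python` for the broker; the subreddit, topic name, and environment-variable names are illustrative assumptions, not taken from this repository's code.

```python
import json
import os

import praw  # Reddit API client
from kafka import KafkaProducer  # kafka-python client

# Credentials come from the environment; these variable names are
# illustrative assumptions (see the .env setup below).
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["REDDIT_USER_AGENT"],
)

producer = KafkaProducer(
    bootstrap_servers=os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Stream new submissions from a subreddit and publish each one to Kafka.
for submission in reddit.subreddit("python").stream.submissions(skip_existing=True):
    producer.send(
        "reddit-posts",  # topic name is an assumption
        {
            "id": submission.id,
            "title": submission.title,
            "created_utc": submission.created_utc,
            "score": submission.score,
        },
    )
```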
Before you begin, ensure you have the following installed:
- Docker and Docker Compose
- Git (for version control)
- Clone the Repository:

  ```bash
  git clone https://github.com/YelamanKarassay/RedditDataPipelineProject.git
  cd RedditDataPipelineProject
  ```

- Set up Environment Variables:

  - Copy the `.env.example` file to a new file named `.env`.
  - Fill in your Reddit API credentials and other necessary environment variables (a sample `.env` layout is sketched after this list).

- Build and Run with Docker Compose (a compose-file sketch also follows this list):

  ```bash
  docker-compose up --build
  ```
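As a reference for the environment-variable step, a `.env` file for a pipeline like this typically holds the Reddit API credentials and the broker address. The authoritative variable names are in the repository's `.env.example`; the ones below are a plausible sketch only.

```env
# Reddit API credentials (create an app at https://www.reddit.com/prefs/apps)
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USER_AGENT=RedditDataPipelineProject/0.1 by your_username

# Kafka connection (service name as resolved inside the Docker network)
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
```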
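For orientation, a compose file for this kind of stack usually wires up ZooKeeper, Kafka, the streaming producer, and the Spark job. The services, images, and settings below are a hedged sketch of such a file, not a copy of this project's `docker-compose.yml`.

```yaml
version: "3.8"

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  reddit-streamer:   # hypothetical service name for the Python producer
    build: ./streamer
    env_file: .env
    depends_on: [kafka]

  spark-processor:   # hypothetical service name for the Spark job
    build: ./spark
    env_file: .env
    depends_on: [kafka]
```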
Once the system is running:
- The Reddit Streaming Application will start streaming posts and comments that match the configured criteria.
- Kafka will queue these messages, which will then be processed by the Spark application.
- The processed data can be viewed in the logs, and the pipeline can be extended to store results in a database or display them on a dashboard (a consumer sketch follows this list).
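To illustrate the consuming side described above, here is a hedged sketch of a Spark Structured Streaming job that reads the Kafka topic and writes parsed records to the console, so they appear in the container logs. The topic name and message schema match the assumptions in the earlier producer sketch, not necessarily this repository's code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType

# Requires the spark-sql-kafka connector on the classpath (e.g. via --packages).
spark = (
    SparkSession.builder
    .appName("RedditStreamProcessor")  # app name is an assumption
    .getOrCreate()
)

# Schema of the JSON messages produced in the earlier sketch.
schema = StructType([
    StructField("id", StringType()),
    StructField("title", StringType()),
    StructField("created_utc", DoubleType()),
    StructField("score", LongType()),
])

posts = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # broker address is an assumption
    .option("subscribe", "reddit-posts")  # topic name is an assumption
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("post"))
    .select("post.*")
)

# Write parsed posts to the console so they show up in the logs.
query = posts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```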
Refer to the `architecture.md` file for a detailed description of the project's architecture.
Contributions to the project are welcome. Please follow the standard fork, branch, and pull request workflow.
This project is licensed under the terms described in `LICENSE.md` - see that file for details.
- Reddit API for providing the data source.
- Open-source communities of Apache Kafka, Apache Spark, and Docker for their invaluable resources.