The Reddit Data Pipeline Project is designed to stream data from Reddit using Python, broker it through Apache Kafka, analyze it with Apache Spark, and orchestrate the entire pipeline with Docker Compose. This project aims to demonstrate real-time data processing and analytics capabilities.
- Real-Time Data Streaming: Streams live posts and comments from Reddit (a producer sketch follows this list).
- Message Brokering with Kafka: Efficient handling and queuing of streaming data.
- Data Processing with Spark: Real-time data analysis and processing.
- Dockerized Environment: Each component runs in a Docker container for easy setup and deployment.
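To make the streaming and brokering steps concrete, here is a minimal sketch of what the producer side of such a pipeline can look like. It assumes `praw` for the Reddit API and `kafka-python` for the broker; the subreddit, topic name, and environment-variable names are illustrative assumptions, not taken from this repository's code.

```python
import json
import os

import praw  # Reddit API client
from kafka import KafkaProducer  # kafka-python client

# Credentials come from the environment; these variable names are
# illustrative assumptions (see the .env setup below).
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["REDDIT_USER_AGENT"],
)

producer = KafkaProducer(
    bootstrap_servers=os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Stream new submissions from a subreddit and publish each one to Kafka.
for submission in reddit.subreddit("python").stream.submissions(skip_existing=True):
    producer.send(
        "reddit-posts",  # topic name is an assumption
        {
            "id": submission.id,
            "title": submission.title,
            "created_utc": submission.created_utc,
            "score": submission.score,
        },
    )
```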
Before you begin, ensure you have the following installed:
- Docker and Docker Compose
- Git (for version control)
- Clone the Repository:

  ```bash
  git clone https://github.com/YelamanKarassay/RedditDataPipelineProject.git
  cd RedditDataPipelineProject
  ```

- Set up Environment Variables:

  - Copy the `.env.example` file to a new file named `.env`.
  - Fill in your Reddit API credentials and other necessary environment variables (a sample `.env` layout is sketched after this list).

- Build and Run with Docker Compose (a compose-file sketch also follows this list):

  ```bash
  docker-compose up --build
  ```
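As a reference for the environment-variable step, a `.env` file for a pipeline like this typically holds the Reddit API credentials and the broker address. The authoritative variable names are in the repository's `.env.example`; the ones below are a plausible sketch only.

```env
# Reddit API credentials (create an app at https://www.reddit.com/prefs/apps)
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USER_AGENT=RedditDataPipelineProject/0.1 by your_username

# Kafka connection (service name as resolved inside the Docker network)
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
```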
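For orientation, a compose file for this kind of stack usually wires up ZooKeeper, Kafka, the streaming producer, and the Spark job. The services, images, and settings below are a hedged sketch of such a file, not a copy of this project's `docker-compose.yml`.

```yaml
version: "3.8"

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  reddit-streamer:   # hypothetical service name for the Python producer
    build: ./streamer
    env_file: .env
    depends_on: [kafka]

  spark-processor:   # hypothetical service name for the Spark job
    build: ./spark
    env_file: .env
    depends_on: [kafka]
```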
Once the system is running:
- The Reddit Streaming Application will start streaming posts and comments that match the configured criteria.
- Kafka will queue these messages, which will then be processed by the Spark application.
- The processed data can be viewed in the logs, and the pipeline can be extended to store results in a database or display them on a dashboard (a consumer sketch follows this list).
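To illustrate the consuming side described above, here is a hedged sketch of a Spark Structured Streaming job that reads the Kafka topic and writes parsed records to the console, so they appear in the container logs. The topic name and message schema match the assumptions in the earlier producer sketch, not necessarily this repository's code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType

# Requires the spark-sql-kafka connector on the classpath (e.g. via --packages).
spark = (
    SparkSession.builder
    .appName("RedditStreamProcessor")  # app name is an assumption
    .getOrCreate()
)

# Schema of the JSON messages produced in the earlier sketch.
schema = StructType([
    StructField("id", StringType()),
    StructField("title", StringType()),
    StructField("created_utc", DoubleType()),
    StructField("score", LongType()),
])

posts = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # broker address is an assumption
    .option("subscribe", "reddit-posts")  # topic name is an assumption
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("post"))
    .select("post.*")
)

# Write parsed posts to the console so they show up in the logs.
query = posts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```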
Refer to the `architecture.md` file for a detailed description of the project's architecture.
Contributions to the project are welcome. Please follow the standard fork, branch, and pull request workflow.
This project is licensed under the terms described in `LICENSE.md` - see that file for details.
- Reddit API for providing the data source.
- Open-source communities of Apache Kafka, Apache Spark, and Docker for their invaluable resources.