
Mining Massive Datasets: Homework 2

This project runs a set of scripts to train and evaluate an ALS (Alternating Least Squares) recommendation model with Spark MLlib on the MovieLens dataset. Below are the project details, setup instructions, and other highlights.


Project Structure

.
├── data
│   └── get_data_script.sh
├── docker-compose.yml
├── main_script.sh
├── poetry.lock
├── pyproject.toml
├── README.md
└── scripts
    ├── best_model
    ├── checkpoints
    ├── run_spark_scripts.sh
    └── spark_script.py

Setup and Execution

Prerequisites

  • Docker and Docker Compose installed on your machine.
  • Python 3.10 or above with poetry for dependency management.
  • Spark installed in the Docker container.

Build and Run Guide

  1. Clone the project from GitHub
git clone https://github.com/lupusruber/RNMP_homework2.git
cd RNMP_homework2
  2. Run the main script
source main_script.sh

This script performs the following:

  • Downloads and preprocesses the data.
  • Sets up a Spark cluster in Docker.
  • Trains an ALS model on the MovieLens dataset.
  • Evaluates the model's performance using the metrics: RMSE, Precision@K, Recall@K, and NDCG.

Scripts Overview

main_script.sh

Main script to orchestrate data loading, Spark cluster setup, and model training.

get_data_script.sh

Script to download and preprocess the MovieLens dataset.
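
As a rough illustration of the preprocessing it performs (the real work happens in the shell script itself), the Python sketch below adds a header row and re-encodes a file to UTF-8; the file names, header text, and source encoding are assumptions.

from pathlib import Path

def add_header_and_reencode(src: Path, dst: Path, header: str, src_encoding: str = "latin-1") -> None:
    # Read the raw MovieLens file, prepend a header row, and write it back out as UTF-8.
    raw = src.read_text(encoding=src_encoding)
    dst.write_text(header + "\n" + raw, encoding="utf-8")

# Hypothetical file names and header; adjust to the actual dataset layout.
add_header_and_reencode(
    Path("data/ratings.dat"),
    Path("data/ratings_utf8.dat"),
    "userId::movieId::rating::timestamp",
)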

run_spark_scripts.sh

Script to submit the Spark job (spark_script.py) to the Spark cluster.

spark_script.py

Implements the following (sketched below):

  • Loads the MovieLens dataset.
  • Splits the data into training and testing sets.
  • Trains the ALS model using a cross-validator to determine the best hyperparameters.
  • Evaluates the model using Spark MLlib's evaluation tools.
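
The sketch below illustrates this flow with PySpark; the input path, column names, and hyperparameter grid are assumptions and may differ from the values used in spark_script.py.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("als_movielens").getOrCreate()

# Load the preprocessed ratings (path and schema are assumptions).
ratings = spark.read.csv("data/ratings.csv", header=True, inferSchema=True)

# Split the data into training and testing sets.
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# ALS on user-item interactions; drop NaN predictions so RMSE stays defined.
als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop",
    nonnegative=True,
)

# Hyperparameter grid for cross-validation (example values only).
grid = (
    ParamGridBuilder()
    .addGrid(als.rank, [10, 20])
    .addGrid(als.regParam, [0.05, 0.1])
    .build()
)

rmse_evaluator = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
)

cv = CrossValidator(
    estimator=als, estimatorParamMaps=grid, evaluator=rmse_evaluator, numFolds=3
)

best_model = cv.fit(train).bestModel

# RMSE on the held-out test set.
print("RMSE:", rmse_evaluator.evaluate(best_model.transform(test)))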

Results for the Best Model

  • Root Mean Squared Error: 0.9165499757845692
  • Precision@10: 0.04861612515042121
  • Recall@10: 0.024782289580880097
  • NDCG@10: 0.05164765467524519
  • Mean Average Precision: 0.00887359068973789

Key Features

  • Data Preprocessing

    • Adds headers to the MovieLens dataset files for easier readability.
    • Converts data encoding to UTF-8.
  • ALS Model

    • Trained with user-item interaction data.
    • Hyperparameter tuning using cross-validation.
  • Evaluation

    • Metrics: RMSE, Precision@K, Recall@K, and NDCG are logged for model performance assessment (see the sketch below).
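
Continuing the training sketch above (reusing best_model and test), the ranking metrics can be computed with RankingMetrics from pyspark.mllib.evaluation. K = 10 and the column names are assumptions, and recallAt requires Spark 3.0 or newer.

from pyspark.sql import functions as F
from pyspark.mllib.evaluation import RankingMetrics

K = 10

# Top-K recommendations per user from the trained ALS model.
recs = best_model.recommendForAllUsers(K).select(
    "userId", F.col("recommendations.movieId").alias("predicted")
)

# Ground truth: the items each user actually rated in the test set.
truth = test.groupBy("userId").agg(F.collect_list("movieId").alias("actual"))

# RankingMetrics expects an RDD of (predicted ranking, ground-truth items) pairs.
pairs = recs.join(truth, "userId").rdd.map(lambda row: (row.predicted, row.actual))
metrics = RankingMetrics(pairs)

print("Precision@10:", metrics.precisionAt(K))
print("Recall@10:", metrics.recallAt(K))
print("NDCG@10:", metrics.ndcgAt(K))
print("MAP:", metrics.meanAveragePrecision)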

Docker Integration

The project uses Docker Compose to run the Spark cluster, which provides:

  • Simplified environment setup.
  • Consistency across development and production.

Development Dependencies

Dependencies are managed using poetry. Install them with:

poetry install

Outputs

  • Best Model: saved under scripts/best_model/best_model.model (see the loading sketch below).

  • Metrics: logged to the console.
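
A minimal sketch of reloading the saved artifact, assuming it was written as a plain ALSModel (if the whole CrossValidatorModel was saved instead, use CrossValidatorModel.load):

from pyspark.ml.recommendation import ALSModel

# Reload the persisted model and produce top-10 recommendations per user.
reloaded = ALSModel.load("scripts/best_model/best_model.model")
reloaded.recommendForAllUsers(10).show(truncate=False)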

