Python ↔ C++ Integration
Powered by pybind11
The Flash Recommender System is a high-performance, scalable recommendation engine with a C++ backend that does the heavy lifting of training Alternating Least Squares (ALS) matrix factorization. It aims to be one of the fastest ALS recommender implementations.
- Compiled with Clang/LLVM optimisations: aggressive inlining, loop unrolling, auto-vectorisation, ...
- The C++ backend uses Eigen, a highly optimised linear-algebra library, for fast Cholesky decomposition and other matrix routines.
- OpenMP threading for parallelism, enabling shared-memory multithreading with dynamic workload balancing.
🏗️ CLI UI still under construction
This is how to use it.

Your system needs python3-dev, git-lfs, cmake, unzip, and axel, plus a compiler with the LLVM toolchain:

```shell
sudo [pkg] install clang lldb lld libomp-dev
```
- Clone the repository and install the Python dependencies:

```shell
git clone --recurse-submodules https://github.com/soot-bit/MovieRecommender.git
git lfs pull
pip install -U "pybind11[global]" tqdm optuna
```
- Run the build script:

```shell
$ source build.sh
```
- Train the model, or load trained matrices and vectors for making predictions.
- Make predictions.
To download the full datasets yourself (e.g. the million-user large dataset), bypass `git lfs pull` and run:

```shell
$ ./download_ds.sh
```
Example usage:

```shell
time python main.py --dataset "ml-latest-small"
```

Add the `--flash` flag for flash training. You might not see any output, but the model did train; add the `--plot` flag to confirm that it did execute.
| Operation | Python 🐢 + NumPy | Flash System ⚡ | Speed-up |
|---|---|---|---|
| Matrix Factorization | - | - | 100.1× |
| Recommendation Batch | - | - | 2005.7× |
(Benchmarks still to be measured properly.)
Using DataIndx

This class provides an efficient way to index and preprocess user-movie ratings data for training ALS recommendation models.
It has:
- Careful memory management that avoids duplicating the snapTensor
- Data loading and index mappings for users/movies
- Feature engineering for movies
- Train-test splitting with per-user stratification
- O(1) retrieval of user/movie ratings
- Efficient per-user sampling of ratings via sklearn's `train_test_split`
Args:

- `dataset`: the path to the data (`"ml-latest"`, 25M ratings, or `"ml-latest-small"`, 100K ratings)
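The per-user stratified split can be pictured with a small pure-Python sketch. This is illustrative only: `per_user_split` is a hypothetical helper (DataIndx itself uses sklearn's `train_test_split`), and the 20% test fraction is an assumed default.

```python
import random
from collections import defaultdict

def per_user_split(ratings, test_frac=0.2, seed=0):
    """Split (user, movie, rating) triples so each user contributes
    ~test_frac of their own ratings to the test set.
    Hypothetical sketch of per-user stratification, not the repo's code."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for triple in ratings:
        by_user[triple[0]].append(triple)
    train, test = [], []
    for user, items in by_user.items():
        rng.shuffle(items)
        # Users with a single rating keep it in the train set
        n_test = max(1, int(len(items) * test_frac)) if len(items) > 1 else 0
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

ratings = [(0, m, 4.0) for m in range(10)] + [(1, m, 3.0) for m in range(5)]
train, test = per_user_split(ratings)
print(len(train), len(test))  # 12 3
```

Every user appears in both splits (when they have enough ratings), which is what keeps evaluation fair for light and heavy raters alike.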
Initialization

```python
from src.data import DataIndx

# Load dataset (e.g., ml-latest or ml-latest-small)
data = DataIndx("ml-latest", cache=True)
```
This will:

- Load the dataset from `Data/ml-latest/`
- Create user/movie ID mappings
- Perform a stratified train/test split
- Cache all processed data to disk (only the large dataset is cached)
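The ID-mapping step can be sketched as follows. `build_index` is a hypothetical illustration of how `idx_to_user`-style mappings work conceptually, not the actual implementation:

```python
def build_index(raw_ids):
    """Map arbitrary raw IDs to dense 0..n-1 indices, the way
    DataIndx's idx_to_user / idx_to_movie mappings work conceptually."""
    idx_to_id = sorted(set(raw_ids))                         # dense index -> raw ID
    id_to_idx = {raw: i for i, raw in enumerate(idx_to_id)}  # raw ID -> dense index
    return idx_to_id, id_to_idx

idx_to_user, user_to_idx = build_index([10, 3, 10, 7])
print(idx_to_user)      # [3, 7, 10]
print(user_to_idx[10])  # 2
```

Dense indices let ALS store user/movie factors in contiguous matrices instead of hash maps keyed by raw IDs.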
You can access core components of the class after initialization:

```python
print(data.ratings.head())    # Raw ratings DataFrame
print(len(data.idx_to_user))  # Total number of unique users
print(len(data.idx_to_movie)) # Total number of unique movies

# Get internal tensor for ALS training
tensor = data.snap_tensor
```
The tensor has methods:

- `.add_train(user_idx, movie_idx, rating)`
- `.add_test(user_idx, movie_idx, rating)`

Use it with your ALS trainer directly.
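The tensor's interface can be pictured as a simple triplet store. `SnapTensorSketch` below is a hypothetical pure-Python stand-in (the real tensor lives in the C++ backend); only the `add_train`/`add_test` names come from the docs above:

```python
class SnapTensorSketch:
    """Hypothetical Python stand-in for the C++ snap_tensor:
    stores (user, movie, rating) triples for train/test and
    offers O(1) per-user lookup via a dict of lists."""
    def __init__(self):
        self.train, self.test = [], []
        self._by_user = {}

    def add_train(self, user_idx, movie_idx, rating):
        self.train.append((user_idx, movie_idx, rating))
        self._by_user.setdefault(user_idx, []).append((movie_idx, rating))

    def add_test(self, user_idx, movie_idx, rating):
        self.test.append((user_idx, movie_idx, rating))

    def user_ratings(self, user_idx):
        # O(1) dict lookup of all (movie, rating) pairs for a user
        return self._by_user.get(user_idx, [])

t = SnapTensorSketch()
t.add_train(0, 42, 4.5)
t.add_train(0, 7, 3.0)
t.add_test(1, 42, 5.0)
print(t.user_ratings(0))  # [(42, 4.5), (7, 3.0)]
```

The per-user index is what makes the ALS user-vector update cheap: each update only needs that user's observed ratings.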
🏷️ Get Movie Features
movie_id = 123
features = data.get_features(movie_id)
print(features)
# Output example:
# {
# 'title': 'Toy Story (1995)',
# 'genres': ['Animation', 'Children', 'Comedy'],
# 'tags': ['Pixar', 'funny', 'great animation']
# }
The Data directory should look something like:

```
Data/
└── ml-latest/
    ├── ratings.csv
    ├── movies.csv
    └── tags.csv
```
🧠 ALS algorithm
- Update $U \rightarrow V \rightarrow b_i \rightarrow b_j$ iteratively: user vectors, movie vectors, user biases, movie biases.
Key Notes:

- Interdependence: biases $b_i$ and $b_j$ depend on each other, so update them sequentially, using the most recent values, and split the updates.
- Regularization: $\gamma$ controls the strength of bias regularization; $\tau$ controls the strength of latent-vector regularization.
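As a sketch, the updates above follow from a standard ALS-with-biases objective (an assumed formulation consistent with the $\tau$/$\gamma$ roles described here; the repo's exact conventions may differ):

```latex
% Objective: squared error over observed ratings \Omega,
% with latent regularization \tau and bias regularization \gamma
\mathcal{L} = \sum_{(i,j)\in\Omega} \bigl(r_{ij} - u_i^\top v_j - b_i - b_j\bigr)^2
  + \tau \Bigl(\sum_i \lVert u_i \rVert^2 + \sum_j \lVert v_j \rVert^2\Bigr)
  + \gamma \Bigl(\sum_i b_i^2 + \sum_j b_j^2\Bigr)

% User-vector update over this user's observed movies \Omega_i
% (a symmetric positive-definite solve, amenable to Eigen's Cholesky):
u_i = \Bigl(\sum_{j\in\Omega_i} v_j v_j^\top + \tau I\Bigr)^{-1}
      \sum_{j\in\Omega_i} \bigl(r_{ij} - b_i - b_j\bigr) v_j

% User-bias update (uses the freshest u_i, v_j, b_j, per the
% sequential-update note above; movie updates are symmetric):
b_i = \frac{\sum_{j\in\Omega_i} \bigl(r_{ij} - u_i^\top v_j - b_j\bigr)}{\gamma + \lvert \Omega_i \rvert}
```

Each update is the exact minimiser of $\mathcal{L}$ with all other variables held fixed, which is why cycling $U \rightarrow V \rightarrow b_i \rightarrow b_j$ monotonically decreases the loss.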