This repo consists of the following:

- Luigi `tasks.py` workflow management.
- A task to download IMDb movie datasets from the internet.
- PySpark tasks to process the datasets and save parquet files.
- PySpark ML code to produce a similarity index of all films. **This task can take hours and be costly to run unless on high-end hardware or a Spark cluster. See below.**
- A task to create an embeddable DuckDB database file from the previous two parquet files.
- A FastAPI server to provide fast asynchronous access to the films and similarity data.
- The API server provides JSON search and similarity search functions.
- Tests for the API server; really "integration tests" rather than unit tests.
Install Poetry if you don't have it already, preferably with your system's package manager (brew on macOS) or pipx. A Python Poetry `pyproject.toml` is provided.
$ poetry check -v
# install the necessary pre-requisites
$ poetry config --list
$ poetry install
$ poetry shell
# Notebooks if needed
$ poetry add Jupyter
Before starting any Luigi tasks or running the API server, start a shell in the newly configured environment:
# enter the python virtual env
$ poetry shell
This project uses Spotify's [Luigi](https://luigi.readthedocs.io/en/stable/index.html). The task definitions are contained in `tasks.py`. There are task definitions for all the necessary steps to produce the DuckDB database needed by the JSON API server. The FastAPI server is started and managed separately.
All commands should be run from the project's top-level (root) directory.
$ mkdir -p ./datasets ./output ./duckdb
A simple way to run all the tasks is to call the `AllTasks` meta task as follows:
$ luigi --local-scheduler --module tasks AllTasks
The `tasks.py` module in the project's top-level directory contains the following `luigi.Task` and `luigi.contrib.spark.PySparkTask` runnable classes:

- `AllTasks`: runs all jobs in dependency order which have yet to run. This task can be run as needed until the final DuckDB database file has been generated. This meta task is the only task that needs to be run.
- `SourceDataset(url, path)`: downloads a dataset; it is called by `PrepFilms` if necessary (see the sketch after this list). Datasets are expected to be in the `datasets/` directory and retain their `.tsv.gz` or `.csv.zip` extensions.
- `PrepFilms`: a PySpark job to process the source datasets into an `output/films` parquet file.
- `CosineSimilarity`: a long-running task to calculate the cosine similarity of each film compared to every other film. See the separate description below.
- `CreateDuckDBFilms`: takes `output/films/*.parquet` and creates a `films` table in `./duckdb/films.duckdb`.
- `CreateDuckDBFilmsCosSim`: takes `output/csfilms/*.parquet` and creates a `films_cos_sim` table in the `./duckdb/films.duckdb` database for the FastAPI server.
- `GenreParquets`: creates `output/genres` partitioned by genre.
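For orientation, here is a minimal sketch of how a task such as `SourceDataset` might be defined; the actual implementation in `tasks.py` will differ.

```python
import urllib.request
from pathlib import Path

import luigi


class SourceDataset(luigi.Task):
    """Hypothetical sketch: download one IMDb dataset into the datasets/ directory."""

    url = luigi.Parameter()
    path = luigi.Parameter()

    def output(self):
        # Luigi uses the existence of this target to decide whether the task has run.
        return luigi.LocalTarget(self.path)

    def run(self):
        Path(self.path).parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(self.url, self.path)
```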
Individual tasks can be started from the CLI, triggering any dependencies, using the same pattern:
$ luigi --local-scheduler --module tasks PrepFilms
Currently this project is configured to use four non-commercial TSV (tab-separated-values) tabulated files from the IMDb free corpus.
- title.basics.tsv.gz
- title.principals.tsv.gz
- name.basics.tsv.gz
- title.ratings.tsv.gz
Currently there are two Luigi `PySparkTask` classes, each of which defines an `app` external Python script to perform the Spark transformations:
- workflow/task_prep_films.py
- workflow/task_cosine_similarity.py
Both use DataFrames for the tabular data instead of RDDs, and should benefit from (Catalyst) query optimization.
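As a rough illustration of the DataFrame approach, such a script might begin by loading one of the gzipped TSVs straight into a DataFrame; the options shown here are assumptions, not the project's actual code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prep_films").getOrCreate()

# Spark reads .tsv.gz files directly; "\N" is IMDb's null marker.
title_basics = spark.read.csv(
    "datasets/title.basics.tsv.gz",
    sep="\t",
    header=True,
    nullValue="\\N",
)
title_basics.printSchema()
```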
The IMDb datasets are de-normalised and given new column names to form the parquet schema below.
The `films` DataFrame could be small enough to broadcast() for any joins, while the `csfilms` could be chunked by film_id and iteratively broadcast.

The schema for the `films` parquet looks like this:
root
├── film_id: integer (nullable = true)
├── title: string (nullable = true)
├── year: date (nullable = true)
├── duration: integer (nullable = true)
├── genres: array (nullable = true)
│ ├── element: string (containsNull = false)
├── rating: decimal(4,2) (nullable = true)
├── vote_count: integer (nullable = true)
├── persons: array (nullable = false)
│ ├── element: string (containsNull = false)
- `film_id`: an integer of the IMDb `tconst` id stripped of the leading `tt` string.
- `genres`: a list of all the genres the film has been tagged with.
- `persons`: a list of all major principals for the production, including directors, writers, producers, actors, musicians, conceivers, etc., in the order provided in the title.principals dataset.
The resulting parquet file would have about 1.5M rows, but this is reduced down to about 40k films by only keeping film titles where:
- the title is of type `movie`
- it has an IMDb rating above 6
- it has over 500 votes
- it has at least 1 genre specified, such as Drama or Documentary
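A rough sketch of what this filtering might look like in PySpark follows; `films_raw` and `title_type` are assumed names and thresholds are taken from the list above, not necessarily those used in `workflow/task_prep_films.py`.

```python
from pyspark.sql import functions as F

# films_raw is assumed to already hold the joined, renamed columns described above.
films = (
    films_raw
    .where(F.col("title_type") == "movie")   # keep feature films only
    .where(F.col("rating") > 6)              # IMDb rating above 6
    .where(F.col("vote_count") > 500)        # over 500 votes
    .where(F.size("genres") >= 1)            # at least one genre tagged
)
films.write.mode("overwrite").parquet("output/films")
```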
In order to find similar films to any given film, it is necessary to pre-compute a similarity metric of every film compared to every other film in `films`. The metric needs to measure similarity across a range of features such that films with common cast and crew, writers, directors, genres, themes, titles and so on have a higher correlation than others.
The feature columns used for the similarity metric are 'title', 'genres', 'rating' and 'persons'.
The approach used to calculate the cosine similarity index uses Spark ML functions from `pyspark.ml.feature`:
- concatenate the elements of the feature columns into one long array column
- compute the TF-IDF (Term Frequency-Inverse Document Frequency) of that column to produce vector matrices for each film_id
- compute the cross join of the dot product (cosine similarity) across the 40k or so films DataFrame. This is a lengthy task, taking about 2 hours across 32 worker nodes.
Other similarity measures such as the Jaccard Index and Euclidean Distance wouldn't produce the same results. Word2Vec is not necessary here.
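A condensed sketch of that pipeline using `pyspark.ml.feature` is shown below; the real job in `workflow/task_cosine_similarity.py` will differ in details such as tokenisation, vocabulary size and how the cross join is chunked.

```python
from pyspark.ml.feature import HashingTF, IDF, Normalizer
from pyspark.sql import functions as F

# 1. Concatenate the feature columns into one long array column of "terms".
features = films.withColumn(
    "terms",
    F.concat(
        F.split("title", r"\s+"),
        F.col("genres"),
        F.col("persons"),
        F.array(F.col("rating").cast("string")),
    ),
)

# 2. TF-IDF over the terms, then L2-normalise so a dot product equals cosine similarity.
tf = HashingTF(inputCol="terms", outputCol="tf").transform(features)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)
vecs = Normalizer(inputCol="tfidf", outputCol="norm", p=2.0).transform(tfidf)

# 3. Cross join every pair of films and take the dot product (the expensive step).
dot = F.udf(lambda a, b: float(a.dot(b)), "double")
left = vecs.select("film_id", "norm")
right = vecs.select(F.col("film_id").alias("other_id"), F.col("norm").alias("other_norm"))
cos_sim = (
    left.crossJoin(right)
    .where(F.col("film_id") != F.col("other_id"))
    .select("film_id", "other_id", dot("norm", "other_norm").alias("similarity"))
)
cos_sim.write.mode("overwrite").parquet("output/csfilms")
```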
This parquet file tends to be over 7G in size.
Making data stored in parquet files available for rapid query via an API is challenging, particularly with the start-up time of a pyspark session. This called for an embedded database, such as SQLite, libSQL or DuckDB.
DuckDB reads parquet files, is easy to install, and is supported by SQLAlchemy using the duckdb dialect.
The database consists of two tables:
- films
┌─────────────┬──────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼──────────────┼─────────┼─────────┼─────────┼─────────┤
│ film_id │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ title │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ year │ DATE │ YES │ NULL │ NULL │ NULL │
│ duration │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ genres │ VARCHAR[] │ YES │ NULL │ NULL │ NULL │
│ rating │ DECIMAL(4,2) │ YES │ NULL │ NULL │ NULL │
│ vote_count │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ persons │ VARCHAR[] │ YES │ NULL │ NULL │ NULL │
└─────────────┴──────────────┴─────────┴─────────┴─────────┴─────────┘
- films_cos_sim
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ film_id │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ other_id │ INTEGER │ YES │ NULL │ NULL │ NULL │
│ similarity │ DOUBLE │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
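The table-creation step amounts to little more than DuckDB's native parquet reader. A sketch of what `CreateDuckDBFilms` and `CreateDuckDBFilmsCosSim` might execute; the exact SQL is an assumption.

```python
import duckdb

con = duckdb.connect("./duckdb/films.duckdb")

# DuckDB can materialise a table straight from a parquet glob.
con.execute(
    "CREATE OR REPLACE TABLE films AS "
    "SELECT * FROM read_parquet('output/films/*.parquet')"
)
con.execute(
    "CREATE OR REPLACE TABLE films_cos_sim AS "
    "SELECT * FROM read_parquet('output/csfilms/*.parquet')"
)
con.close()
```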
Start the `uvicorn` server from the project's root folder, optionally specifying binding with `--host 0.0.0.0` and `--port 8000`, for example:
uvicorn main:app --reload
FastAPI provides both Swagger-UI (/docs) and ReDoc (/redoc) self-service OpenAPI documentation front-ends.
Browse to localhost:8000/docs to get started with the self-service API.
Note that /docs uses an external resource, swagger-ui-bundle.js, and pings
The routes related to film search and similarity search are in `./api/routers/films.py`. The API is implemented in `./api/db/films.py` along with Pydantic validation. The ORM models for DuckDB are in `./api/db/core.py`.
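For reference, connecting SQLAlchemy to the embedded database via the duckdb dialect looks roughly like this, assuming the `duckdb-engine` package; the project's actual engine setup lives in `./api/db/core.py`.

```python
from sqlalchemy import create_engine, text

# The duckdb dialect (provided by the duckdb-engine package) accepts a file path URL.
engine = create_engine("duckdb:///./duckdb/films.duckdb")

with engine.connect() as conn:
    count = conn.execute(text("SELECT count(*) FROM films")).scalar()
    print(count)
```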
SlowAPI is used as a rate limiter to limit the number of requests per client per second or minute to an endpoint, as sketched below.
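A minimal sketch of how SlowAPI typically attaches to a FastAPI app; the limit value and route shown are illustrative, not the project's actual settings.

```python
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


@app.get("/film/{film_id}")
@limiter.limit("30/minute")  # illustrative per-client limit
async def get_film(request: Request, film_id: int):
    ...
```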
The endpoints are:

- POST `/film/search`
- GET `/film/{film_id}`
- GET `/film/{film_id}/similar/{threshold}`
The `/film/search` endpoint supports searching on any number of optional criteria. Additionally, the `genre`, `person` and `title` fields provide sub-string regular expression matching, where the list of `persons` is `,`-joined. For example, the following are valid searches:
{
"person": "J\\w+ Will",
"persons": ["George Lucas", "Harrison Ford"]
}
* note the backslash escaping of the regex `\w` as `\\w` in JSON
{
"title": "Santa(''s)?"
}
* apostrophe escaping is by doubling
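For example, with the server running locally, the search and similarity endpoints could be exercised like this; the film id and the 0.5 threshold are illustrative.

```python
import requests

BASE = "http://localhost:8000"

# POST search with regex matching on the person field, as described above.
resp = requests.post(f"{BASE}/film/search", json={"person": "J\\w+ Will"})
print(resp.json())

# Fetch one film, then its similar films above a similarity threshold.
print(requests.get(f"{BASE}/film/78748").json())
print(requests.get(f"{BASE}/film/78748/similar/0.5").json())
```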
Currently these are mostly integration tests. Once the FastAPI uvicorn server is running, check the API is working as expected by running `pytest`.
├── api
│ ├── __init__.py
│ ├── db
│ └── routers
├── datasets
│ ├── name.basics.tsv.gz
│ ├── title.basics.tsv.gz
│ ├── title.principals.tsv.gz
│ └── title.ratings.tsv.gz
├── duckdb
│ ├── films_cos_sim.done
│ ├── films.done
│ └── films.duckdb [5.7G]
├── notebooks
│ ├── task1-1-import-imdb-datasets.ipynb
│ ├── task1-2-write-genre-parquets.ipynb
│ ├── task2-1-cosine-similarity.ipynb
│ ├── task2-2-similarity-functions.ipynb
│ ├── task2-3-export_duckdb.ipynb
│ └── task3-search-query.ipynb
├── output
│ ├── csfilms [6.5G]
│ ├── films [11M]
│ └── genres [9.3M]
├── tests
│ ├── __init__.py
│ ├── integration
│ │ ├── __init__.py
│ │ └── test_api.py
│ ├── unit
│ │ ├── __init__.py
│ └── test_fastapi_up.py
├── workflow
│ ├── __init__.py
│ ├── task_cosine_similarity.py
│ ├── task_create_duckdb.py
│ ├── task_fetch_datasets.py
│ ├── task_jupyter_ipynb.py
│ └── task_prep_films.py
├── luigi.cfg
├── poetry.lock
├── pyproject.toml
├── README.md
├── tasks.py
- Make better use of `luigi.contrib.*` for ETL tasks like table creation.
- Consider lib `runpy` in Luigi, or `ExternalProgramTask`. Run notebooks?
- Better logging.
- DuckDB provides dot product and other relevant search, index and ML capabilities, such as the VSS extension for Vector Similarity Search. Do away with PySpark entirely?
- Use the SQLAlchemy Databricks dialect and dbutils as a comparison.
- Optimise the Cosine Similarity task...
- Testing on a large Spark cluster.
- GET parameter support for search functionality.
- Write more tests, particularly unit tests of the DB API.
- Table and column indexes could improve search speed if the number of titles in films grows much larger.
- The films should have their original title, besides the English translation.
- The original language should be sourced, as well as other attributes.
- Finally, a React/Native front-end could be added.
- Search results paging.