For this hackathon, we will be using the MovieLens data set, which consists of movie ratings collected from nearly 300,000 users of the MovieLens service. Additional data tables provide movie tags, genres, and movie "genome" information.
The data set is contained in multiple .csv files that can be linked together using various ID fields. Teams are not required to use all of the data files -- they can use any or all of them. Teams can also incorporate outside data as they wish.
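As a rough illustration of linking two of the .csv files through a shared ID field, here is a minimal sketch using only the Python standard library. The file names (`ratings.csv`, `movies.csv`) and the `movieId` column are assumptions made for the example; check the README included with the data for the actual file and field names.

```python
import csv
from collections import defaultdict
from typing import TextIO


def read_rows(handle: TextIO) -> list[dict[str, str]]:
    """Read a CSV source with a header row into a list of row dicts."""
    return list(csv.DictReader(handle))


def join_on_id(left: list[dict[str, str]],
               right: list[dict[str, str]],
               key: str) -> list[dict[str, str]]:
    """Inner-join two lists of row dicts on a shared ID column."""
    # Index the right-hand table by the ID so each lookup is cheap.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for row in left:
        # Rows whose ID has no match in the right-hand table are dropped.
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    # Assumed file and column names; adjust to match the README.
    with open("ratings.csv", newline="", encoding="utf-8") as r, \
         open("movies.csv", newline="", encoding="utf-8") as m:
        rated_movies = join_on_id(read_rows(r), read_rows(m), "movieId")
    print(len(rated_movies), "joined rows")
```

All values are read as strings, so numeric fields such as ratings would need converting before analysis. For the full data set, a data-frame library may be more practical than plain Python lists.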
The data sets should be downloaded from the GroupLens website:
There are two versions of the data set:
- Small - a subset of the full data set (1 MB, zipped); useful if you have limited compute resources or want to test your analysis on a smaller version of the data first.
- Full - the full data set (265 MB, zipped)
- Teams can use either the small or full data set for their submitted presentation, but be sure to specify which you used.
- Each data set has a README file describing the data -- be sure to read these in detail!
- Be sure to reference the data set on the last slide of your presentation.
The MovieLens data set has been approved for educational/non-commercial use.