Text embedding models have demonstrated strong semantic understanding, which recommendation systems can leverage to discern subtle differences between items and thereby improve performance. Although general text embedding models have achieved broad success, there is still no embedding model designed specifically for recommendation that excels across diverse recommendation scenarios rather than being developed for a single downstream task or dataset. To bridge this gap, this repo introduces the General Recommendation-Oriented Text Embedding (GRE) and a comprehensive benchmark, GRE-B. Figure 1 illustrates an overview of our work.
We pre-train GRE on a wide array of data curated from various recommendation domains, covering e-commerce, catering, fashion, books, games, videos, and more. To ensure data quality and balance, we employ a Coreset selection method that keeps high-quality texts while maintaining balance across domains. GRE is then further refined by extracting high-quality item pairs from collaborative signals and injecting these signals directly into the model via contrastive learning, as sketched below.
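The following is a minimal PyTorch sketch of that idea, not the actual GRE training code: items that co-occur in user interactions are treated as positive pairs, and an in-batch InfoNCE loss pulls their text embeddings together. The function name, temperature, and batch construction are illustrative assumptions.

```python
# Minimal sketch: in-batch contrastive (InfoNCE) loss over item pairs mined
# from collaborative signals (e.g., items co-interacted by the same users).
# Illustrative only; names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Each anchor item should be closest to its co-interacted partner;
    every other item in the batch serves as an in-batch negative."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature          # (B, B) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for text-encoder outputs.
if __name__ == "__main__":
    B, d = 32, 768
    anchor_emb = torch.randn(B, d, requires_grad=True)   # embeddings of anchor items
    positive_emb = torch.randn(B, d)                     # embeddings of paired items
    loss = info_nce_loss(anchor_emb, positive_emb)
    loss.backward()
    print(f"in-batch InfoNCE loss: {loss.item():.4f}")
```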
After fine-tuning, you can use the model to generate textual item embeddings for recommendation tasks.
The trained GRE models can be downloaded here: small, base, and large.
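Below is a hedged sketch of how a downloaded checkpoint could be used to embed item texts, assuming a Hugging Face-compatible encoder and mean pooling; the checkpoint path, pooling strategy, and example item texts are placeholders, so follow the released model card for the intended usage.

```python
# Hypothetical usage sketch: encode item texts into embeddings with a
# downloaded GRE checkpoint (path and pooling strategy are assumptions).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "path/to/gre-base"  # placeholder for the downloaded checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH)
model.eval()

item_texts = [
    "Wireless noise-cancelling over-ear headphones",
    "Cast-iron skillet, 10 inch, pre-seasoned",
]

with torch.no_grad():
    batch = tokenizer(item_texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state                # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, T, 1)
    item_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling -> (B, d)

print(item_emb.shape)  # textual item embeddings for downstream recommenders
```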
To comprehensively assess our general recommendation-oriented embedding, we established a benchmark on diverse recommendation datasets that are distinct from the training data, guaranteeing a fair evaluation. The benchmark includes 26 datasets in total, categorized into six recommendation scenarios: e-commerce, fashion, books, games, video, and catering.
We use SASRec and DSSM as backbones for the retrieval task and evaluate the results with NDCG and Recall. For the ranking task, DIN and DeepFM are employed and evaluated with AUC and LogLoss.
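The snippet below illustrates these metrics on toy data; it assumes scikit-learn for AUC/LogLoss and a single relevant item per user for Recall@K/NDCG@K, and is not taken from the benchmark code.

```python
# Toy illustration of the evaluation metrics: AUC/LogLoss for ranking,
# Recall@K/NDCG@K for retrieval (single relevant item per user assumed).
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Ranking: binary click labels vs. predicted click probabilities.
y_true = np.array([1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.3, 0.6, 0.4, 0.7])
print("AUC    :", roc_auc_score(y_true, y_prob))
print("LogLoss:", log_loss(y_true, y_prob))

# Retrieval: one ground-truth item per user and a top-K ranked candidate list.
def recall_and_ndcg_at_k(ranked_items, target, k=10):
    topk = ranked_items[:k]
    if target not in topk:
        return 0.0, 0.0
    rank = topk.index(target)               # 0-based position of the hit
    return 1.0, 1.0 / np.log2(rank + 2)     # single relevant item -> IDCG = 1

recall, ndcg = recall_and_ndcg_at_k([42, 7, 13, 99], target=13, k=10)
print("Recall@10:", recall, "NDCG@10:", ndcg)
```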
The statistics of the datasets:
For each test dataset, execute `process.py` and `filter.py`. Note: for Goodreads, GoogleLocalData, and Yelp, change `low_rating_thres` in the corresponding config `*.yaml` to `~` (YAML's null) for retrieval.
Processed data can be downloaded here.
Create the conda environment:

```bash
conda env create -f rec.yaml
```

Set `DATA_MOUNT_DIR` in your environment:

```bash
export DATA_MOUNT_DIR=[DOWNLOAD_PATH]/data
```
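As an optional sanity check before launching experiments (a small sketch that only assumes `DATA_MOUNT_DIR` has been exported as above):

```python
# Optional sanity check: confirm DATA_MOUNT_DIR is set and points at the
# downloaded data before running the ranking/retrieval scripts.
import os
import sys

data_dir = os.environ.get("DATA_MOUNT_DIR")
if not data_dir or not os.path.isdir(data_dir):
    sys.exit(f"DATA_MOUNT_DIR is unset or not a directory: {data_dir!r}")
print(f"Using data under {data_dir}: {sorted(os.listdir(data_dir))[:5]} ...")
```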
For the ranking task, run the naive baselines with:

```bash
bash ctr.sh [CUDA_ID_0] [CUDA_ID_1] [DATASET_NAME]
```

and the text-embedding-enhanced models with:

```bash
bash ctrwlm.sh [CUDA_ID_0] [CUDA_ID_1] [DATASET_NAME] [TEXT_EMBEDDING_PATH] [SAVE_PREFIX]
```
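For intuition only, here is a minimal sketch of what "text-embedding-enhanced" can mean in a ranking model: a frozen, pre-computed text embedding per item is projected and concatenated with a learnable ID embedding. The class name, dimensions, and fusion-by-concatenation are illustrative assumptions, not the repo's DIN/DeepFM implementation.

```python
# Conceptual sketch: fuse a learnable ID embedding with a frozen, projected
# text embedding (e.g., produced by GRE) for each item in a ranking model.
import torch
import torch.nn as nn

class TextEnhancedItemEmbedding(nn.Module):
    def __init__(self, num_items: int, id_dim: int, text_emb: torch.Tensor):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, id_dim)                         # trainable ID table
        self.text_emb = nn.Embedding.from_pretrained(text_emb, freeze=True)   # frozen text vectors
        self.proj = nn.Linear(text_emb.size(1), id_dim)                       # text dim -> ID dim

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        return torch.cat(
            [self.id_emb(item_ids), self.proj(self.text_emb(item_ids))], dim=-1
        )

# Toy usage: 100 items, 768-d text embeddings, 32-d ID embeddings.
emb = TextEnhancedItemEmbedding(100, 32, torch.randn(100, 768))
print(emb(torch.tensor([3, 17, 42])).shape)  # torch.Size([3, 64])
```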
For the retrieval task, run the naive baselines with:

```bash
cd RecStudio
bash gre.sh SASRec [DATASET_PKL_PATH] [TEXT_EMBEDDING_PATH]
```

and the text-embedding-enhanced models with:

```bash
cd RecStudio
bash gre.sh TE_ID_SASRec [DATASET_PKL_PATH] [TEXT_EMBEDDING_PATH]
```
You can change the configurations in `RecStudio/recstudio/model/seq/config/din.yaml`, `RecStudio/recstudio/model/seq/config/sasrec.yaml`, and `RecStudio/recstudio/model/seq/config/te_id_sasrec.yaml` to search for the best results.