This is the official repository of the paper Augment or Not? A Comparative Study of Pure and Augmented Large Language Model Recommenders.
Authors: Wei-Hsiang Huang, Chen-Wei Ke, Wei-Ning Chiu, Yu-Xuan Su, Chun-Chun Yang, Chieh-Yuan Cheng, Yun-Nung Chen, Pu-Jen Cheng.
- Overview
- Pure LLM Recommenders
- Augmented LLM Recommenders
- Experiment
- The Challenge of LLM Recommenders
- Future Direction
LLM Recommenders use LLMs to perform recommendation. In this survey, we further concentrate on settings where the LLM acts as the final decision maker: given a user's interaction history (and, optionally, a candidate set), the LLM directly produces the recommended items.

With the growing interest and parallel development in both pure LLM-based approaches and those augmented with non-LLM techniques, it is crucial to systematically understand both scenarios. Therefore, we categorize LLM Recommenders into Pure and Augmented approaches based on whether a non-LLM augmentation step is applied before the LLM makes its final decision.
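Schematically, this decision process can be written as follows (a sketch in our own notation; the paper's formal definition may differ):

```latex
\underbrace{\hat{i} = f_{\mathrm{LLM}}\big(\mathcal{H}_u,\ \mathcal{C},\ p\big)}_{\text{Pure}}
\qquad
\underbrace{\hat{i} = f_{\mathrm{LLM}}\big(g(\mathcal{H}_u, \mathcal{C}),\ p\big)}_{\text{Augmented}}
```

where `\mathcal{H}_u` is user `u`'s interaction history, `\mathcal{C}` an optional candidate set, `p` the prompt template, and `g` a non-LLM augmentation map (e.g. semantic ID encoding or collaborative embedding projection).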
Pure LLM Recommenders refer to methods that rely solely on the capabilities of LLMs to perform recommendation tasks. These methods can be further categorized into classes such as Naive Embedding Utilization, Naive Pretrained LM Finetuning, Instruction Tuning, Model Architectural Adaptations, Reflect-and-Rethink, and Others.
Naive Embedding Utilization refers to methods that directly leverage the final hidden state or aggregated embeddings produced by LLMs for recommendation tasks.
Naive Pretrained LM Finetuning refers to approaches that formulate recommendation as a natural language task and directly fine-tune pretrained language models.
Instruction tuning adapts LLMs to recommendation tasks by expressing them as instructional prompts.
In addition to standard applications of LLMs, numerous studies have proposed novel architectural adaptations of LLM backbones, specifically designed for recommendation systems.
| Venue | Code | Paper |
|---|---|---|
| Arxiv'24 | None | Rethinking Large Language Model Architectures for Sequential Recommendations |
| Inf. Process. Manag.'25 | Code | Sequential recommendation by reprogramming pretrained transformer |
| Arxiv'25 | None | MoLoRec: A Generalizable and Efficient Framework for LLM-Based Recommendation |
Reflect-and-Rethink methods go beyond standard supervised learning by reflecting on outputs, refining prompts, or interpreting user intent to guide prompt design.
Others focus on designing suitable training objectives, metadata summarization, data essence extraction, among others.
Augmented LLM Recommenders refer to methods that enhance LLM Recommenders by incorporating non-LLM techniques. These methods can be further categorized into Semantic Identifiers Augmentation, Collaborative Modality Augmentation, Prompts Augmentation, and Retrieve-and-Rerank.
Semantic Identifiers (or Semantic IDs) augmentation methods represent user or item IDs as implicit semantic sequences with the help of auxiliary coding techniques.
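One common coding technique in this family is residual quantization (as used by RQ-VAE-style tokenizers): an item embedding is encoded level by level, quantizing the residual against a codebook at each level. The sketch below illustrates the idea with random NumPy codebooks; the codebook sizes, dimensions, and names are our own illustrative choices, not the setup of any particular paper.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Encode an item embedding as a sequence of discrete codes,
    picking the nearest centroid at each level and quantizing the
    residual that remains."""
    codes = []
    residual = embedding.astype(float)
    for codebook in codebooks:  # each codebook: (K, d) array of centroids
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)
        residual = residual - codebook[idx]
    return codes

# Toy setup: 3 levels, 256 centroids each, 32-dim embeddings.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(3)]
item_emb = rng.normal(size=32)
semantic_id = residual_quantize(item_emb, codebooks)  # a 3-token Semantic ID
```

The resulting code sequence can then serve as the item's token identifier in the LLM's vocabulary.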
Collaborative Modality Augmentation methods seek to align collaborative information with language, usually by projecting embeddings derived from traditional collaborative models into the language space.
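A minimal version of this alignment is a learned linear projector that maps a collaborative-filtering embedding into the LLM's hidden dimension, where it is consumed as a soft token. The dimensions and names below are illustrative assumptions only:

```python
import numpy as np

def project_to_language_space(cf_emb, W, b):
    """Map a collaborative-filtering embedding (e.g. from matrix
    factorization) into the LLM hidden dimension via a linear
    projector; the output is typically prepended as a soft token."""
    return cf_emb @ W + b

rng = np.random.default_rng(0)
cf_dim, llm_dim = 64, 4096          # assumed sizes for illustration
W = rng.normal(scale=0.02, size=(cf_dim, llm_dim))
b = np.zeros(llm_dim)
soft_token = project_to_language_space(rng.normal(size=cf_dim), W, b)
```

In practice `W` and `b` are trained jointly with (or against a frozen) LLM so the projected vector is interpretable in the language space.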
Prompts Augmentation methods utilize non-LM techniques to improve the quality of prompts.
Retrieve-and-Rerank methods first retrieve top-ranked candidates using non-LM techniques, and then apply LLMs to rerank them for final recommendation.
| Venue | Code | Paper |
|---|---|---|
| Arxiv'23 | Code | Zero-Shot Next-Item Recommendation using Large Pretrained Language Models |
| PGAI@CIKM'23 | Code | LlamaRec: Two-Stage Recommendation using Large Language Models for Ranking |
| Arxiv'23 | None | PALR: Personalization Aware LLMs for Recommendation |
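The two-stage pattern above can be sketched as follows. Both scorers here are toy stand-ins (a real system would use, e.g., a sequential recommender for stage 1 and a prompted LLM for stage 2):

```python
def retrieve_and_rerank(user, items, retriever_score, llm_rerank, k=20):
    """Stage 1: a lightweight non-LLM retriever scores all items and
    keeps the top-k candidates. Stage 2: the LLM reranks only those."""
    candidates = sorted(items, key=lambda i: retriever_score(user, i),
                        reverse=True)[:k]
    return llm_rerank(user, candidates)

# Toy usage with placeholder scorers.
items = list(range(100))
score = lambda u, i: -abs(i - 42)          # retriever favors items near 42
rerank = lambda u, cands: sorted(cands)    # placeholder "LLM" reranker
top = retrieve_and_rerank("user_1", items, score, rerank, k=5)
```

The key efficiency point is that the LLM only ever sees `k` candidates instead of the full catalog.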
Although many benchmarks exist for recommender systems, comprehensive comparisons between Pure and Augmented LLM Recommenders under consistent, fair, and modern evaluation settings remain lacking. To fill this gap, we design a unified experimental framework and use it to systematically assess both categories. Details of the benchmark datasets can be found in the paper and in Benchmark Formulation. The following are the results of existing representative papers.
For results discussion, please also refer to the paper.
The goal of recommendation systems is to provide accurate suggestions based on collaborative information, such as user-item interaction patterns. To achieve this, it is essential for recommenders to effectively model users' underlying behavior. LLMs, trained on vast text corpora, are expected to implicitly encode some aspects of such patterns. However, recent research has shown that directly leveraging the implicit collaborative knowledge within LLMs remains a challenge.
Even with exhaustive tuning, LLM Recommenders may still be influenced by the pretrained language semantics. This can prevent LLMs from faithfully capturing the true collaborative semantics.
The echo chamber effect refers to a situation in which individuals are predominantly exposed to information that reinforces their preexisting beliefs, often due to selective exposure, algorithmic filtering, or even the underlying social biases inherent in LLMs. In recommender systems, this can result in users repeatedly receiving a narrow range of items, irrespective of their current intent.
Position bias refers to the tendency for the perceived relevance or importance of recommended items to be influenced by their position in the prompt input list, which should ideally yield symmetric outputs under permutations. In recommendation systems, especially in zero-shot prompting scenarios, the position of the ground-truth item within the candidate set is significantly affected by this bias.
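A common mitigation, hedged here as a sketch rather than any specific paper's method, is to query the LLM over several random permutations of the candidate list and majority-vote the answer, so no single position dominates:

```python
import random

def debiased_pick(user, candidates, llm_choose, n_perm=8, seed=0):
    """Mitigate position bias: query the (possibly position-biased)
    chooser over several shuffled candidate orders, then return the
    item chosen most often."""
    rng = random.Random(seed)
    votes = {}
    for _ in range(n_perm):
        order = candidates[:]
        rng.shuffle(order)
        choice = llm_choose(user, order)
        votes[choice] = votes.get(choice, 0) + 1
    return max(votes, key=votes.get)

# Toy chooser standing in for an LLM call.
choose = lambda u, order: "B" if "B" in order else order[0]
picked = debiased_pick("user_1", ["A", "B", "C"], choose)
```

The cost is `n_perm` LLM calls per decision, which is why permutation averaging is usually reserved for evaluation rather than serving.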
Cold-Start and Cross-Domain Generalizability are long-standing challenges in recommendation. LLM Recommenders offer a promising solution due to their ability to understand rich textual metadata, and much recent work pursues this direction; nevertheless, opportunities for enhancement remain.
- Remaining unsolved issue 1: the conditional-probability objective of the decoder tends to overfit to items seen during training, significantly reducing the capability to generate cold-start items.
- Remaining unsolved issue 2: whether incorporating collaborative signals may degrade performance, as collaborative-filtering-based methods tend to suffer more in cold-start scenarios.
- Remaining unsolved issue: this direction remains largely underexplored, with relatively few studies addressing and analyzing the issue.
The dataset preprocessing for the experiments can be reproduced by following the instructions below.
To avoid differences in the `random.seed` mechanism across Python versions or environments, we have stored the datasets we used (the naive numerical IDs dataset and the reranking dataset) on Google Drive. Notice that one might still need to preprocess other information from Dataset Preparation.
```shell
cd prepare_dataset
sh download_data.sh
sh prepare_dataset.sh
```
This will create `./data/` in the main directory with the corresponding downloaded datasets.
- Dataset Explanation
```json
{
  "preprocessed_*.train.json": "Training dataset for sequential recsys.",
  "preprocessed_*.valid.json": "Validation dataset for sequential recsys.",
  "preprocessed_*.test.json": "Testing dataset for sequential recsys.",
  "preprocessed_meta_*.json": "Filtered item meta data (only items present in train + valid + test).",
  "preprocessed_review_*.json": "Train + Valid + Test."
}
```
We assigned each user and item a random, unique naive numerical ID. To preprocess it, you can run:
```shell
cd prepare_dataset
sh random_hashing.sh
```
This will create `Random_*` folders with random naive numerical IDs for users and items inside `./data/`.
- Dataset Explanation
```json
{
  "user_item_hash_table.json": "The table between naive numerical IDs and the original user_id or parent_asin.",
  "meta.json": "Meta data of the given item.",
  "review_*.json": "Review data for the [train / valid / test] scope.",
  "review.json": "All review data (train + valid + test)."
}
```
Besides sequential recommendation, other recommendation tasks include `reranking`, `binary`, `rating`, `explanation`, `conversational`, and so on. We further provide the dataset setup for the reranking task.
The reranking task aims to recommend items from a set of candidates. For LLM Recommenders, the candidate set size is usually set to `n` items, with `1` positive item and `n-1` non-interacted random negative items. In our construction, we set `n=20`.
```shell
cd prepare_dataset && sh prepare_random_negative.sh
```
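Under the hood, the candidate construction follows the 1-positive / `n-1`-negatives recipe described above; a minimal sketch (the helper and IDs here are illustrative, not the script's actual code):

```python
import random

def build_candidates(positive, interacted, all_items, n=20, seed=0):
    """Build a reranking candidate pool: 1 positive item plus n-1
    random negatives the user never interacted with, shuffled so the
    positive's position is random."""
    rng = random.Random(seed)
    pool = [i for i in all_items if i not in interacted and i != positive]
    candidates = rng.sample(pool, n - 1) + [positive]
    rng.shuffle(candidates)
    return candidates

# Toy usage: user interacted with i_1 and i_2; ground truth is i_9.
cands = build_candidates("i_9", {"i_1", "i_2"},
                         [f"i_{k}" for k in range(100)], n=20)
```

Fixing the seed keeps the sampled negatives reproducible across runs, which is the same concern that motivated storing the prepared datasets on Google Drive.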
- Dataset Explanation
```json
{
  "random_numerical or original_ids": "The key is the user and the corresponding list is the item candidate pool.",
  "label_random_numerical or label_original_ids": "The key is the user and the corresponding value is the positive item."
}
```
If you find our survey and this repository beneficial for your research, please kindly cite our paper.
```bibtex
@misc{huang2025augmentnotcomparativestudy,
  title={Augment or Not? A Comparative Study of Pure and Augmented Large Language Model Recommenders},
  author={Wei-Hsiang Huang and Chen-Wei Ke and Wei-Ning Chiu and Yu-Xuan Su and Chun-Chun Yang and Chieh-Yuan Cheng and Yun-Nung Chen and Pu-Jen Cheng},
  year={2025},
  eprint={2505.23053},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2505.23053},
}
```