Since the early days of the internet, spam has been a side effect of this global and public platform. Most people probably associate the word spam with annoying e-mails, but other forms of spam exist as well. Review spam is a phenomenon encountered on online platforms where participants can rate and express their opinion on a particular product or service. Spammers leave untruthful reviews in order to promote their own offerings or to denigrate those of competitors. This type of spam has received little attention in the academic literature so far. In this work, we take a closer look at the state of academic research on this topic with the objective of filling some of the gaps identified.

There are several obstacles to overcome when doing research on review spam. First, real-world data is mostly unlabelled, i.e. either ex post labelling techniques or unsupervised learning must be used. Our analysis is based on two real-world data sets from Amazon and Yelp, in which we identify untruthful reviews using a labelling technique based on the reviews' Jaccard similarity. To the best of our knowledge, with about 50,000 and 10,000 fake reviews detected in the respective data sets, this is the largest set of review spam used in academic research so far. Second, existing studies consider different features, models and data, which limits the comparability of their results. We employ a wide range of features, pre-processing techniques and models in order to find promising techniques for detecting review spam. This includes sophisticated deep-learning models that recently produced state-of-the-art benchmarks on related text classification tasks. In addition, we make use of the large amount of unlabelled data and train continuous word representations from scratch using the word2vec algorithm. These highly problem-specific word vectors improve the performance of our deep-learning model, but ultimately we find that a simple linear classifier based on 2-gram bag-of-words input features outperforms most other approaches, including our deep-learning models.
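As a rough illustration of the labelling idea mentioned above (the actual implementation lives in the R scripts listed below; the 0.9 threshold is only inferred from the file name reviews_final_feature_0.9_Electronics.RData and is an assumption), a pair of reviews can be flagged as near-duplicates once the Jaccard similarity of their word sets exceeds a threshold:

    # Illustrative sketch only -- the repository's R scripts implement the actual logic.
    def jaccard(tokens_a, tokens_b):
        """Jaccard similarity of two token sets: |A intersect B| / |A union B|."""
        a, b = set(tokens_a), set(tokens_b)
        if not a and not b:
            return 0.0
        return len(a & b) / float(len(a | b))

    def is_near_duplicate(text_a, text_b, threshold=0.9):  # threshold value is an assumption
        return jaccard(text_a.lower().split(), text_b.lower().split()) >= threshold

    # Reviews that are near-duplicates of other reviews are labelled as spam ("fake");
    # all remaining reviews are treated as ham ("genuine").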
All code needed to reproduce the results obtained during the experiments is included in this repository.
Software requirements:
- Vowpal Wabbit
- R (3.2.1)
- Python (2.7.10)
- Theano (0.8.2)
Take the following steps to reproduce the results:
Use clean_json_amazon.py to convert reviews_Electronics.json.gz to strict JSON so that it can be processed in R (a sketch of this kind of conversion follows below the dataset links).
Amazon reviews can be found here: http://jmcauley.ucsd.edu/data/amazon/
Yelp academic dataset: https://www.yelp.com/dataset_challenge (2016 data not available)
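The exact conversion is implemented in clean_json_amazon.py. As a rough idea of what such a step can look like (the output file name and parsing details below are assumptions), the loose per-line records of the raw Amazon file can be parsed as Python literals and re-emitted as strict JSON:

    # Hypothetical stand-in for clean_json_amazon.py: the raw file contains one
    # Python-style dict literal per line; each record is re-serialised as strict
    # JSON so that R's JSON readers accept it.
    import ast, gzip, json

    with gzip.open("reviews_Electronics.json.gz", "rb") as fin, \
         open("reviews_Electronics_strict.json", "w") as fout:
        for line in fin:
            record = ast.literal_eval(line)        # tolerant parse of the loose format
            fout.write(json.dumps(record) + "\n")  # strict JSON, one object per line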
Use find_duplicates.R to generate candidate pairs of near-duplicate reviews.
Use create_finale_dataset.R to filter the candidate pairs based on a similarity threshold and minimum word length.
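In the same illustrative spirit as the sketch above, and reusing its jaccard function (the R scripts may use a different blocking strategy, and all constants below are assumptions), candidate pairs can be generated by comparing only reviews that share several tokens, and then filtered with the similarity threshold and a minimum word length:

    # Illustration of candidate generation and filtering; not the actual R implementation.
    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(reviews, min_shared=5, max_posting=1000):
        """reviews: dict mapping review id -> list of tokens."""
        index = defaultdict(set)                 # token -> ids of reviews containing it
        for rid, tokens in reviews.items():
            for tok in set(tokens):
                index[tok].add(rid)
        shared = defaultdict(int)                # (id_a, id_b) -> number of shared tokens
        for ids in index.values():
            if len(ids) > max_posting:           # skip very common tokens
                continue
            for pair in combinations(sorted(ids), 2):
                shared[pair] += 1
        return [pair for pair, n in shared.items() if n >= min_shared]

    def keep_pair(reviews, pair, threshold=0.9, min_words=10):
        a, b = reviews[pair[0]], reviews[pair[1]]
        return min(len(a), len(b)) >= min_words and jaccard(a, b) >= threshold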
Use feature_extraction.R to create features and select reviews using the index files created previously (illustrative feature examples follow after the file list below).
This step produces:
- reviews_final_features_Electronics_full.RData
- reviews_final_feature_0.9_Electronics.RData
- reviews_final_yelp.RData
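The feature set itself is defined in feature_extraction.R; the snippet below only illustrates the kind of simple, hand-crafted review features such a step typically computes (the feature names and formulas here are assumptions, not the script's actual output):

    # Purely illustrative feature examples -- see feature_extraction.R for the real set.
    def simple_review_features(text, rating):
        words = text.split()
        n_words = len(words)
        return {
            "n_words": n_words,
            "avg_word_len": sum(len(w) for w in words) / float(max(n_words, 1)),
            "frac_caps": sum(1 for c in text if c.isupper()) / float(max(len(text), 1)),
            "n_exclamations": text.count("!"),
            "extreme_rating": 1 if rating in (1, 5) else 0,
        }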
POS annotation can be performed using POS-script.R.
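POS-script.R handles the annotation in this repository; for comparison only, a corresponding tagging step in Python could use NLTK (an illustration, not the script's actual method):

    import nltk
    # one-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    tokens = nltk.word_tokenize("This camera is absolutely amazing, best purchase ever!")
    print(nltk.pos_tag(tokens))  # e.g. [('This', 'DT'), ('camera', 'NN'), ...]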
Use vowpal_input.R to transform the data into a format suitable for Vowpal Wabbit (the input format is sketched after the file list below).
This produces files in the form of:
- vw_input_train_[...].vw
- vw_input_test_[...].vw
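For reference, a single VW input line consists of a -1/+1 label (as required by the logistic loss), an optional tag introduced by a single quote, and the features behind a namespace separator. The helper below is a hypothetical illustration of how such lines can be produced; vowpal_input.R does this in the actual pipeline:

    # Rough illustration of the VW input format; ':' and '|' are special characters
    # and must not appear inside the feature text.
    def to_vw_line(is_spam, review_id, text):
        clean = text.replace(":", " ").replace("|", " ").lower()
        return "%d '%s |text %s" % (1 if is_spam else -1, review_id, clean)

    # to_vw_line(True, "r42", "Great product: works well")
    # -> "1 'r42 |text great product  works well"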
Apply the following commands at the VW command line to achieve best performance:
Training: vw -d [path/to/train/data.vw] -c -k -b 28 --ngram 2 --loss_function logistic --passes 300 -f [path/to/model.vw]
Testing: vw -d [path/to/test/data.vw] -t -i [path/to/model.vw] --link logistic -p [path/to/predictions.txt]
To assess model performance outside the VW tool, use vowpal_output.R.
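With --link logistic, predictions.txt contains one probability per line (optionally followed by the example tag). The repository uses vowpal_output.R for the evaluation; a minimal Python equivalent for a simple accuracy check might look like this (illustration only):

    # true_labels is assumed to be a list of 1/-1 labels in the same order as the test file.
    def accuracy(pred_path, true_labels, cutoff=0.5):
        with open(pred_path) as f:
            probs = [float(line.split()[0]) for line in f if line.strip()]
        hits = sum(1 for p, y in zip(probs, true_labels) if (p >= cutoff) == (y == 1))
        return hits / float(len(true_labels))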
Use theano_input.R to produce input data for CNN model training (reviews truncated to a fixed length and the data downsampled).
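The two preprocessing ideas mentioned in parentheses above can be sketched as follows, assuming that downsampling refers to balancing the much larger ham class against the spam class; the fixed length and sampling ratio used by theano_input.R are assumptions here:

    import random

    def truncate_or_pad(token_ids, max_len=200, pad_id=0):   # max_len is an assumption
        """Cut reviews to a fixed length and pad shorter ones for the CNN input."""
        return token_ids[:max_len] + [pad_id] * max(0, max_len - len(token_ids))

    def downsample_ham(spam, ham, ratio=1.0, seed=42):
        """Keep all spam reviews and a random subset of ham of comparable size."""
        random.seed(seed)
        return spam, random.sample(ham, min(len(ham), int(len(spam) * ratio)))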
Use word2vec.ipynb to produce word embeddings based on the reviews (use the data produced by theano_input.R; pre-trained Google News vectors can be downloaded from https://github.com/mmihaltz/word2vec-GoogleNews-vectors). A sketch of the embedding training follows after the file lists below.
This produces files in the form of:
- spam_... / ham_... / test_... / test_results_...
And data in VW format, so that results can be compared on the same data:
- vw_input_train_... / vw_input_test_...
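The notebook word2vec.ipynb contains the actual embedding training; a minimal gensim sketch of the same idea is shown below (the dimensions and other hyper-parameters are assumptions, and in gensim >= 4 the size argument is called vector_size):

    from gensim.models import Word2Vec

    # tokenised_reviews: list of token lists, e.g. built from the theano_input.R output
    model = Word2Vec(tokenised_reviews, size=300, window=5, min_count=5, sg=1, workers=4)
    model.wv.save_word2vec_format("review_vectors.bin", binary=True)

    # The pre-trained Google News vectors linked above can be loaded for comparison:
    # from gensim.models import KeyedVectors
    # KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)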
To train a CNN, use trainGraph.ipynb.
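trainGraph.ipynb defines the actual network. Purely as an illustration of the general architecture class (a Kim-style CNN over word embeddings), a comparable model written against the current Keras functional API could look like the sketch below; all hyper-parameters are assumptions, and the notebook's own code may be structured differently:

    from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout, concatenate
    from keras.models import Model

    max_len, vocab_size, embed_dim = 200, 50000, 300           # assumptions

    inp = Input(shape=(max_len,), dtype="int32")
    emb = Embedding(vocab_size, embed_dim)(inp)                 # can be initialised with the word2vec weights
    pooled = []
    for width in (3, 4, 5):                                     # parallel convolution widths
        conv = Conv1D(filters=100, kernel_size=width, activation="relu")(emb)
        pooled.append(GlobalMaxPooling1D()(conv))
    merged = Dropout(0.5)(concatenate(pooled))
    out = Dense(1, activation="sigmoid")(merged)                # P(spam)

    model = Model(inputs=inp, outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=64)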