Using PySpark, Google Colab, pgAdmin, AWS RDS, and S3
The Amazon Vine program is a service that allows manufacturers and publishers to receive reviews for their products. Companies like SellBy pay a small fee to Amazon and provide products to Amazon Vine members, who are then required to publish a review.
In this project, we had access to approximately 50 datasets. Each one contains reviews of a specific product category, from clothing apparel to wireless products. We picked one of these datasets and used PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into a PostgreSQL database managed through pgAdmin. We also used PySpark to determine whether there is any bias toward favorable reviews from Vine members in the dataset.
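The extract step can be sketched roughly as follows. This is a minimal sketch, not the project's exact notebook code: the function names `file_name_from_url` and `extract_reviews` are hypothetical, the URL is the Home Entertainment dataset used in this project, and the tab separator and header row are assumptions based on the file being a `.tsv` export.

```python
import posixpath
from urllib.parse import urlparse

# Dataset URL used in this project.
DATA_URL = (
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/"
    "amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz"
)


def file_name_from_url(url: str) -> str:
    """Return the last path component of the URL (the downloaded file's name)."""
    return posixpath.basename(urlparse(url).path)


def extract_reviews(spark, url: str = DATA_URL):
    """Download the gzipped TSV and read it into a Spark DataFrame.

    `spark` is an existing SparkSession. pyspark is imported lazily so the
    helper above can be used without pyspark installed.
    """
    from pyspark import SparkFiles

    # Ship the remote file to the Spark nodes, then look up its local path.
    spark.sparkContext.addFile(url)
    local_path = SparkFiles.get(file_name_from_url(url))

    # Assumption: tab-separated values with a header row.
    return spark.read.csv(local_path, sep="\t", header=True, inferSchema=True)


# Usage (requires pyspark, for example in a Google Colab notebook):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("vine-etl").getOrCreate()
#   df = extract_reviews(spark)
#   df.show(5)
```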
"https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz"
- The customers_table DataFrame
- The products_table DataFrame
- The review_id_table DataFrame
- The vine_table DataFrame
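One way to derive these four DataFrames from the raw reviews DataFrame is sketched below. The column lists are assumptions (the source does not spell out each table's schema); they use column names from the public Amazon reviews TSV files, and `build_tables` is a hypothetical helper name.

```python
# Assumed column layout for the tables that are plain column selections.
TABLE_COLUMNS = {
    "products_table": ["product_id", "product_title"],
    "review_id_table": [
        "review_id", "customer_id", "product_id", "product_parent", "review_date",
    ],
    "vine_table": [
        "review_id", "star_rating", "helpful_votes", "total_votes",
        "vine", "verified_purchase",
    ],
}


def build_tables(df):
    """Split the raw reviews DataFrame into the four DataFrames listed above."""
    # One row per customer with the number of reviews they wrote.
    customers = (
        df.groupBy("customer_id")
        .count()
        .withColumnRenamed("count", "customer_count")
    )
    # One row per distinct product.
    products = df.select(*TABLE_COLUMNS["products_table"]).dropDuplicates()
    # Plain column selections for the remaining two tables.
    review_ids = df.select(*TABLE_COLUMNS["review_id_table"])
    vine = df.select(*TABLE_COLUMNS["vine_table"])

    return {
        "customers_table": customers,
        "products_table": products,
        "review_id_table": review_ids,
        "vine_table": vine,
    }
```

Returning a dictionary keyed by table name keeps the later load step a simple loop over names and DataFrames.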
- customers_table
- products_table
- review_id_table
- vine_table
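Loading the DataFrames into these tables over JDBC might look like the sketch below. The endpoint, database name, and credentials are placeholders, `jdbc_url` and `load_tables` are hypothetical helper names, and the sketch assumes the tables already exist in the database and that the PostgreSQL JDBC driver is available on the Spark classpath.

```python
def jdbc_url(host: str, port: int = 5432, database: str = "postgres") -> str:
    """Build a PostgreSQL JDBC URL for the RDS instance."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def load_tables(tables, url: str, user: str, password: str, mode: str = "append"):
    """Write each DataFrame in `tables` (name -> DataFrame) to the table of the same name."""
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    for table_name, frame in tables.items():
        # "append" adds rows to an existing table rather than replacing it.
        frame.write.jdbc(url=url, table=table_name, mode=mode, properties=properties)


# Usage (placeholder values, replace with your own RDS endpoint and credentials):
#   url = jdbc_url("<your-rds-endpoint>", 5432, "<your-database>")
#   load_tables({"vine_table": vine_df}, url, "<user>", "<password>")
```

After the write finishes, the rows can be checked from pgAdmin with a simple `SELECT` against each table.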
- Total reviews: 261 Vine reviews and 24040 non-Vine reviews.
- Five-star reviews: 11005 in total, of which 106 were Vine reviews and 10899 were non-Vine reviews.
- Percentage of five-star reviews: 40.61% of Vine reviews and 45.34% of non-Vine reviews.
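The percentages follow directly from the counts above. The short check below reproduces them in plain Python; `five_star_percentage` is a hypothetical helper name, and in the PySpark notebook the counts themselves would typically come from filtering the vine DataFrame and counting rows.

```python
def five_star_percentage(five_star_count: int, total_count: int) -> float:
    """Return the share of five-star reviews as a percentage, rounded to 2 decimals."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return round(100 * five_star_count / total_count, 2)


# Counts reported in this analysis.
vine_pct = five_star_percentage(106, 261)
non_vine_pct = five_star_percentage(10899, 24040)

print(f"Vine five-star reviews: {vine_pct}%")
print(f"Non-Vine five-star reviews: {non_vine_pct}%")
```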
The gap between the two percentages is too small to conclude whether the Vine program carries any bias. If anything, Vine reviews were five stars slightly less often (40.61%) than non-Vine reviews (45.34%), so this dataset shows no obvious positivity bias. The Vine sample (261 reviews) is still a decent size, although it is far smaller than the non-Vine sample (24040 reviews). As a next step, we can repeat the analysis on verified purchases only and compare the five-star percentages to see whether that reveals any positivity bias.
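One way to put a number on "not enough margin" is a two-proportion z-test on the counts reported above. This test was not part of the original analysis; it is offered only as a sketch of a possible follow-up, and it assumes the reviews can be treated as independent samples.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    # Pooled proportion under the null hypothesis that both groups share one rate.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Five-star counts: Vine (106 of 261) versus non-Vine (10899 of 24040).
z, p_value = two_proportion_z_test(106, 261, 10899, 24040)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

With these counts, z comes out at roughly -1.5 and the p-value is above the conventional 0.05 threshold, which is consistent with the conclusion that the difference is too small to call.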