Project Part 2: Amazon Beauty Products

1.0 Introduction

Importing some libraries.

2.0 Loading Datasets

Getting familiar with the data we have

2.1 meta_All_Beauty.json

This dataset contains all the product details.

2.2 Review_Beauty.json

This file contains all the reviews given by customers.

2.3 All_Beauty.csv

CSV file provided by the lecturer.

2.4 Luxury_Beauty.json

An additional dataset used mainly for the predictive modelling part. This file contains all the reviews given by customers for luxury beauty products.

2.5 meta_Luxury_Beauty.json

An additional dataset used mainly for the predictive modelling part. This file contains all the luxury beauty product details.

3.0 Data Cleaning

Cleaning our data, which includes the steps below:

Step 1 : Remove duplicates or irrelevant observations

Step 2 : Fix structural errors

Step 3 : Identify and handle outliers

Step 4 : Handle missing data
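The four steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual cleaning code: the column names (`asin`, `title`, `price`), the toy rows, the helper name `clean_products` and the price cap are all assumptions.

```python
import pandas as pd

# Toy product rows; column names and values are assumed for illustration only.
raw = pd.DataFrame({
    "asin":  ["A1", "A1", "A2", "A3", "A4"],
    "title": [" Lip Balm ", " Lip Balm ", "Nail Polish", None, "Face Mask"],
    "price": ["$4.99", "$4.99", "$7.50", "$3.00", "$9999.00"],
})

def clean_products(df: pd.DataFrame, max_price: float = 1000.0) -> pd.DataFrame:
    """Hypothetical helper that applies the four cleaning steps in order."""
    # Step 1: remove duplicate observations (same product id).
    out = df.drop_duplicates(subset="asin").copy()
    # Step 2: fix structural errors (stray whitespace, prices stored as text).
    out["title"] = out["title"].str.strip()
    out["price"] = pd.to_numeric(
        out["price"].str.replace("$", "", regex=False), errors="coerce"
    )
    # Step 3: handle outliers; here a simple, assumed price cap.
    out = out[out["price"] <= max_price]
    # Step 4: handle missing data; here rows without a title are dropped.
    out = out.dropna(subset=["title"])
    return out.reset_index(drop=True)

cleaned = clean_products(raw)
```

Whether outliers and missing values should be dropped, capped or imputed depends on the column; the sketch only shows the order of the steps.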

3.1 Cleaning meta_All_Beauty.json

This file is now loaded into Product (a DataFrame).

3.2 Cleaning Review_Beauty.json

This file is now loaded into review (a DataFrame).

3.3 Cleaning Luxury_Beauty.json

Cleaning the luxury beauty product review dataset.

3.4 Cleaning meta_Luxury_Beauty.json

Cleaning the luxury beauty product dataset.

4.0 Data Transformation

4.1 Normalization

4.2 Merging

Merge review_table with product_table

4.2.1 Merging review_table and product_table

Merging the two tables.

After merging both tables, we found 2 products without any reviews, so we drop them from product_table.

In review_table, there are 5120 reviews of products that are not in our dataset; we drop them too.

After cleaning and merging the data, 32485 beauty products and 357283 reviews remain in our dataset.
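One way to find and drop the unmatched rows described above is an outer merge with an indicator column. The sketch below uses toy data; the key column `asin` and the rating column `overall` are assumed names, and the notebook may have used a different approach.

```python
import pandas as pd

# Toy tables; "asin" (product id) and "overall" (rating) are assumed column names.
product_table = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "title": ["Lip Balm", "Nail Polish", "Face Mask"],
})
review_table = pd.DataFrame({
    "asin": ["A1", "A1", "A2", "Z9"],
    "overall": [5, 4, 3, 1],
})

# An outer merge with indicator=True adds a "_merge" column that says
# whether each row was found in the left table, the right table, or both.
merged = product_table.merge(review_table, on="asin", how="outer", indicator=True)

products_without_reviews = merged.loc[merged["_merge"] == "left_only", "asin"]
reviews_without_product = merged.loc[merged["_merge"] == "right_only", "asin"]

# Keep only products with at least one review, and reviews with a known product.
product_table = product_table[~product_table["asin"].isin(products_without_reviews)]
final_review_table = review_table[review_table["asin"].isin(product_table["asin"])]
```

In the toy data, product `A3` has no reviews and the review for `Z9` has no matching product, so both are dropped.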

4.2.2 Final Merged Table

The final merged table has been prepared for later use in data processing.

We will be using product_table and final_review_table for our data analysis.

product_table: Contains 32485 unique beauty products sold on Amazon. Every listed product has at least one review in our dataset. Some entries have a sales rank and/or price of 0.

final_review_table: Contains 357283 product reviews. All recorded reviews have a matching product in our dataset.

4.3 Useful Functions

4.3.1 process_merged_review

4.3.2 process_verified_review

Generate merged_review_table

Generate mv_review_table

Here we have two processed tables:

merged_review_table: All cleaned products with average rating, total number of reviews and more.

mv_review_table: The same product table, but only verified reviews are considered.
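The difference between the two tables can be illustrated with a small per-product aggregation. This is only a sketch: `summarise_reviews` is a hypothetical helper (not the notebook's `process_merged_review` or `process_verified_review`), and the column names `asin`, `overall` and `verified` are assumptions.

```python
import pandas as pd

# Toy review rows; column names are assumed for illustration.
final_review_table = pd.DataFrame({
    "asin": ["A1", "A1", "A1", "A2", "A2"],
    "overall": [5, 4, 1, 3, 5],
    "verified": [True, True, False, False, True],
})

def summarise_reviews(reviews: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per product with rating statistics."""
    return (
        reviews.groupby("asin")
        .agg(avg_rating=("overall", "mean"), num_reviews=("overall", "size"))
        .reset_index()
    )

# All reviews versus verified reviews only.
merged_review_table = summarise_reviews(final_review_table)
mv_review_table = summarise_reviews(final_review_table[final_review_table["verified"]])
```

For product `A1` in the toy data, dropping the single unverified 1-star review changes both the review count and the average rating, which is the kind of difference the two tables are meant to expose.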

4.3.3 time frame

4.3.4 Calculate Density

4.4 Merging For Luxury Beauty

4.4.1 Final_luxuryBeautyReview_table

4.4.2 merged_luxuryBeautyReview_table

4.4.3 mv_luxuryBeautyReview_table

4.4.4 Process for final luxury beauty table

4.5 Binning

4.5.1 Binning for All Beauty Product

4.5.2 Binning for Luxury Beauty Product

5.0 Data Analysis

Important Note: The All Beauty product dataset is our main dataset, and this project mainly analyses it. The Luxury Beauty product dataset is used mainly for the predictive modelling section.

5.1 Corr: avg rating && sales rank

Unlike merged_review_table, num_review_am200 only contains products that have at most 200 reviews.

5.2 Corr: num review && sales rank

5.2.1 num review 4 bins

5.2.2 num of high/low rating

5.2.3 correlation at different year (2000-2018)

Animation of sales rank (y-axis) against number of reviews (x-axis) for every year from 2000 to 2018.

5.3 Analysing "Very Good" Products

Further analysing the products binned as "Very Good". The sales rank of these products falls in the range 35 ~ 870386.75.

There are 11379 products in the "Very Good" category; the highest rank is 35 and the lowest rank is 870359.

We will further bin the products.
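For reference, binning of this kind can be sketched with `pd.cut`. The equal-width choice is an assumption (suggested by the fractional upper edge of the "Very Good" bin), and the ranks below are toy values: the maximum is picked only so that the first bin edge lands on the upper bound quoted above, and it is not taken from the dataset.

```python
import pandas as pd

# Toy sales ranks (lower is better); not taken from the dataset.
sales_rank = pd.Series([35, 1000, 900000, 1800000, 2700000, 3481442])
labels = ["Very Good", "Good", "Moderate", "Bad"]

# Equal-width binning into four intervals; retbins=True also returns the edges.
rank_bin, edges = pd.cut(sales_rank, bins=4, labels=labels, retbins=True)

# Binning further means repeating the same call on one bin's subset.
very_good = sales_rank[rank_bin == "Very Good"]
```

If each bin should hold a similar number of products instead, `pd.qcut` (quantile-based) is the usual alternative.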

5.4 Slice Sales Rank < 5000

5.5 Price VS Sales Rank

5.6 Analysing Review Upvotes

What does 'vote' mean? See https://www.amazon.com/review/top-reviewer-faq.html

Summary: if people find your review helpful, they can give it a vote; if they find it unhelpful and downvote it, a vote is subtracted.

The vote score of each available product is calculated from all the reviews given to it. Of these reviews, only those with at least one vote are selected, and a weighted average is applied to them.

Vote score formula: $$W=\frac {\sum_{i=1}^{n} w_i X_i} {\sum_{i=1}^{n} w_i}$$

$W$: weighted average of the ratings (vote score)

$n$: number of reviews to be averaged

$w_i$: number of votes per review

$X_i$: rating of the review
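The formula can be written as a short function. This is a minimal sketch in plain Python; the function name `vote_score` and the `(rating, votes)` pair layout are assumptions made for illustration.

```python
def vote_score(reviews):
    """Weighted average rating W for one product.

    reviews: iterable of (rating, votes) pairs, i.e. (X_i, w_i).
    Reviews with no votes are excluded, as described above.
    Returns None when no review has at least one vote.
    """
    voted = [(rating, votes) for rating, votes in reviews if votes >= 1]
    total_votes = sum(votes for _, votes in voted)
    if total_votes == 0:
        return None
    return sum(rating * votes for rating, votes in voted) / total_votes

# Toy example: the 4-star review has no votes, so it does not count.
reviews = [(5, 10), (1, 2), (4, 0), (3, 8)]
score = vote_score(reviews)  # (5*10 + 1*2 + 3*8) / (10 + 2 + 8) = 3.8
```

A review with many votes pulls the score towards its own rating, which is why the 5-star review with 10 votes dominates the toy example.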

5.7 Analysing Time-Weighted-Average

5.8 Bivariate Attribute Analysis

5.8.1 Sales Rank

5.8.2 Average Rating

5.8.3 Average Verified Rating

Comparing Average Rating and Average Verified Rating

5.8.4 Number of Reviews

5.8.5 Vote Score

5.8.6 Time Weighted Overall

5.8.7 Price

5.9 Correlation of All Attributes

6.0 Data Mining

6.1 Association Rule Mining (ARM)

6.1.1 Also Buy Products

These products are bought together

[B018J05XSQ]: Mia secret acrylic powder is highly adaptable for any nail tech experience level.

[B00C4207LY]: Nail Art Accessories Real Dry Dried Flowers 12 Colors Bundle Set in Wheel - Ready to Use by Winstonia

Both are nail products

6.1.2 Also View Products

6.1.3 Also Buy and Also View Frequency Analysis

Also Buy Products

Also View Products

6.2 Predictive Modeling

6.2.1 Predictive Model for All Beauty Product's Sales Rank

It is evident from the plot that the AUC for the Multilayer Perceptron Classifier ROC curve is higher than that for the KNeighborsClassifier ROC curve. Multilayer Perceptron Classifier has a higher true positive rate than KNeighborsClassifier. Therefore, we can say that Multilayer Perceptron Classifier did a better job of classifying the positive class in the dataset.
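As a reminder of what is being compared, AUC can be read as the probability that a classifier scores a randomly chosen positive example above a randomly chosen negative one. The sketch below computes it directly in plain Python for a binary case; the labels and scores are made up and do not reproduce the notebook's results. In practice, `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted scores for two hypothetical models.
y_true = [1, 1, 1, 0, 0, 0]
scores_model_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
scores_model_b = [0.6, 0.4, 0.3, 0.5, 0.7, 0.2]

auc_a = auc(y_true, scores_model_a)
auc_b = auc(y_true, scores_model_b)
```

Here the first model ranks 8 of the 9 positive/negative pairs correctly and the second only 4, so the first has the higher AUC.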

6.2.2 Predictive Model for Luxury Beauty Product's Sales Rank

We also tried to predict the sales rank of luxury beauty products in order to verify the effectiveness of reviews in predicting a product's sales rank.

7.0 Data Visualization

7.1 Number of High Ratings VS Sales Rank

Will an increase in the number of high ratings (rating >= 4) affect the sales rank of a beauty product?

The plotted graph shows a weak negative association. This may be due to the imbalance between the number of high rating reviews (rating >= 4) and the number of low rating reviews (rating <= 3). Although the association between the two variables is weak, most of the points in the scatter plot above are loosely distributed in the lower right corner of the graph. This implies that the higher the ratio of high rating reviews to the total number of reviews, the better the sales rank of the product.

If a product receives a good amount of high rating reviews, the sales rank is believed to be better.

7.2 Average Rating VS Sales Rank

What is the correlation between the average rating of products and sales rank? Does a higher average rating imply higher sales?

Based on our findings, the average rating of products has a weak correlation with sales rank. This is because there are many confounding factors: we also need to take into consideration the time of the review, whether it is verified and its content. Simply taking the average rating cannot clearly indicate the sales rank.

7.3 Number of Reviews VS Sales Rank

What is the correlation between the number of reviews and sales rank? Does a higher number of reviews imply a better sales rank?

The scatter plot above shows a moderate non-linear association between the total number of reviews and sales rank. As the total number of reviews increases, the sales rank of the product improves until the total reaches approximately 50 reviews. Beyond 50 reviews, the sales rank stays roughly constant. We conclude that, in our case, about 50 reviews is the point beyond which additional reviews bring little further improvement in sales rank for a beauty product. However, there are some exceptions, located at the top right corner of the graph.

7.4 Price V.S. Sales Rank

What is the correlation between the pricing of products and sales rank?

Based on the scatter plot above, the price of products has a weak negative correlation with sales rank. This is because there are many confounding factors: we also need to take into consideration the time of the review, whether it is verified and its content. Simply taking the price of the product cannot clearly indicate the sales rank.

7.5 Review Up Votes V.S. Sales Rank

Will the number of review upvotes affect the number of buyers?

The vote score has a correlation of -0.3511 with sales rank: the higher the vote score, the lower (better) the sales rank. This suggests that customers may feel more “secure” buying products whose reviews have a good number of upvotes. A sense of support from the community may increase the chances of customers buying the product.
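To make the sign of the coefficient concrete, here is a minimal sketch of a correlation calculation in plain Python. It assumes the Pearson coefficient (the notebook may have used a different one), and the numbers are toy values, not the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy data: higher vote scores tend to go with smaller (better) rank numbers.
vote_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
sales_ranks = [900, 700, 800, 300, 100]

r = pearson(vote_scores, sales_ranks)  # negative: rank number falls as score rises
```

Because a smaller sales rank number means a better-selling product, a negative coefficient here corresponds to a positive relationship between vote score and sales.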

7.6 Timeframe and Reviews

Do people in the current decade prioritise online reviews more compared to the last decade?

The bar chart above shows the cumulative number of reviews for each year from 2000 to 2018. The cumulative total increased steadily from 2000 to 2012, then rose sharply from 2013 to 2017, and reached 357283 in 2018. We can observe that people in the last decade (2001 Jan 1 - 2009 Dec 31) wrote fewer reviews than people in the current decade (2010 Jan 1 - 2019 Dec 31).

A conclusion can be drawn that people in the current decade prioritise online reviews more compared to the last decade.
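The cumulative counts and the per-decade comparison behind this chart can be sketched as follows. The review years are toy values, and the variable names are assumptions.

```python
from collections import Counter
from itertools import accumulate

# Toy review years; in the notebook these would come from the review timestamps.
review_years = [2000, 2005, 2005, 2012, 2013, 2013, 2013, 2017, 2017, 2018]

per_year = Counter(review_years)
years = sorted(per_year)

# Running total of reviews up to and including each year.
cumulative = dict(zip(years, accumulate(per_year[y] for y in years)))

# Reviews written in each decade, using the date ranges given above.
last_decade = sum(n for y, n in per_year.items() if 2001 <= y <= 2009)
current_decade = sum(n for y, n in per_year.items() if 2010 <= y <= 2019)
```

Note that the cumulative total can only grow, so the per-decade counts are the figures that actually support the comparison.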

7.7 Also Buy and Also View of the Product Reviews

Will a product with a high average rating appear more frequently in other products’ also_view / also_buy?

High Average Rating appearing more frequently in also_buy?

This scatter plot shows the average rating of the beauty products against the sales rank. Each point is colored in a different hue to show the frequency of their appearance in other products’ also_buy section. It is shown that products with a high average rating do not appear very often in other products’ also_buy. However, those products that show up more in other products’ also_buy are scattered around different ranges of average rating.

From our dataset, it is clear that a product with high average rating does not appear more frequently in other products’ also_buy.

High Average Rating appearing more frequently in also_view?

This scatter plot shows the average rating of the beauty products against the sales rank. Each point is colored in a different hue to show the frequency of their appearance in other products’ also_view section. It is shown that products with a high average rating do not appear very often in other products’ also_view. However, those products that show up more in other products’ also_view are scattered around different ranges of average rating.

From our dataset, it is clear that a product with a high average rating does not appear more frequently in other products’ also_view.

7.8 Prediction Accuracy for Luxury Beauty Products

Given the proposed hypothesis, the classification model is able to make correct predictions for the sales rank of beauty products with a high accuracy. Is this hypothesis also true for luxury beauty products? By using the same model, can we predict the sales rank of luxury beauty products with high accuracy as well?

The Multilayer Perceptron Classifier that we built has been used to predict the sales rank of luxury beauty products. In this study, we want to verify the effectiveness of this model in predicting the sales rank of a product. In the previous study, an accuracy of 68% was obtained in predicting the sales rank of all beauty products by using a Multilayer Perceptron Classifier with review data. We hypothesize that this model can be used to predict the sales rank of other products. Finally, the classification model was tested on the prepared test sets and obtained an accuracy score of 75%. The confusion matrix for the Multilayer Perceptron Classifier is shown below.

Based on the confusion matrix, both of the models are far more accurate in predicting very good and good sales ranks than the other categories. No moderate or bad sales ranks were predicted correctly. This may be caused by the data imbalance in our dataset: most of the luxury beauty products in our dataset have a very good sales rank, indicating that the products are selling well. This problem could be addressed in the future by using data augmentation to balance the dataset.
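The effect of class imbalance on accuracy can be seen in a small, made-up example. The sketch below builds a confusion matrix in plain Python and reports per-class recall next to overall accuracy; the labels and predictions are toy values and do not reproduce the notebook's results.

```python
from collections import Counter

LABELS = ["Very Good", "Good", "Moderate", "Bad"]

def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else None for i, row in enumerate(matrix)]

# Toy, imbalanced test set: most products are "Very Good".
y_true = ["Very Good"] * 6 + ["Good"] * 2 + ["Moderate", "Bad"]
y_pred = ["Very Good"] * 6 + ["Good", "Very Good"] + ["Good", "Very Good"]

matrix = confusion_matrix(y_true, y_pred)
accuracy = sum(matrix[i][i] for i in range(len(LABELS))) / len(y_true)
recall = per_class_recall(matrix)
```

In this toy case the overall accuracy is 0.7 even though no "Moderate" or "Bad" product is predicted correctly, which is why per-class figures are worth reporting alongside accuracy.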

We can conclude that the hypothesis holds. Using the same combination of descriptors, which consists of review data, we are able to predict the sales rank of luxury beauty products with a high accuracy of 75%.

Conclusion

Our first finding is that sales rank is affected by the number of reviews, high rating reviews, verified reviews, voted reviews and the timing of the reviews. This finding gives us a better view of how product reviews affect sales. Marketing executives should put effort into getting quality reviews from customers, as these boost product trustworthiness and indirectly increase sales. Besides that, marketing executives have to keep their reviews “updated” as well: old reviews are less effective than the latest reviews.

We also built a model that can predict which products can potentially achieve a good sales rank based on their reviews. The model, shown in 6.2, reaches an accuracy of 68% using a Multilayer Perceptron Classifier with review data; we chose it after comparing it with KNeighborsClassifier, which scored lower on our evaluation metric, the ROC AUC. We then took the same model, tuned it again on another category of beauty products called “Luxury Beauty” and obtained an accuracy of 75%. This is a small step towards predicting which products have good potential according to their reviews.

Last but not least, we still have a lot of room for improvement. Our future work should focus on continuously improving the accuracy of our model. Our data might be slightly skewed towards high rating products, so we might need to perform data augmentation to train for lower rating classification. Through the exploration of the data, we also identified a few interesting outliers that don't follow our models. Most of them do not belong to exactly the same category as the other beauty products. Some of them might have really outdated reviews.