Sentiment analysis of the new Samsung Galaxy S9 YouTube ad, studying the comments to understand which aspects people are eager to see in Samsung's flagship smartphone. The sentiment for each aspect is rated on a scale of 1-5 (extremely unhappy to extremely happy).
The YouTube Ad
Getting the Comments
Just a day after the phone's launch, over 10k comments were downloaded to a .csv file from this website
The Script
You can scroll down to the end to see the results :)
Part 1: Data Cleansing
## Importing Libraries
import re
import pandas as pd
import numpy as np
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.feature_extraction import stop_words
from textblob import TextBlob
import nltk
import operator
from functools import reduce
nltk.download('punkt')
nltk.download('brown')
#Importing the dataset
df = pd.read_csv('samsung_comments.csv')
comments = df.iloc[:,0:1].values
replies = df.iloc[:,4:5].values
The comments and their replies are imported as two different columns. Both are relevant to our analysis and need to be concatenated.
#Adding the replies to comments list
comments_replies = []
for i in comments:
    if str(i[0]) != 'nan':
        comments_replies.append(str(i[0]).lower())
for i in replies:
    if str(i[0]) != 'nan':
        comments_replies.append(str(i[0]).lower())
Now we define the function that cleans the comments for easy analysis. PorterStemmer reduces each word to its root form (for example, plurals to singulars).
################### Cleaning Data ###################
stemmer = PorterStemmer()
def clean_text(text):
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    # "won't" and "can't" must be expanded before the generic "n't" rule,
    # otherwise they become "wo not" and "ca not"
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"[-()\"#/@;:<>{}`$+=~|'.!?,]", "", text)
    # stem each word individually; stem() on a whole string only stems the last token
    text = ' '.join(stemmer.stem(word) for word in text.split())
    return text
for i, c in enumerate(comments_replies):
    x = clean_text(c)
    comments_replies[i] = [re.sub(' +', ' ', x)]
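As an illustration, here is a condensed, self-contained sketch of the same idea (contraction expansion first, then punctuation stripping and whitespace collapsing) applied to a sample comment; it uses only a subset of the rules above:

```python
import re

def clean_text_demo(text):
    """Condensed sketch of the cleaning steps above (illustrative subset of rules)."""
    text = text.lower()
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"\'re", " are", text)
    # strip punctuation, then collapse repeated spaces
    text = re.sub(r"[-()\"#/@;:<>{}`$+=~|'.!?,]", "", text)
    return re.sub(' +', ' ', text)

print(clean_text_demo("The camera isn't bad, can't wait!"))
# the camera is not bad cannot wait
```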
More than 2,000 comments contained the keywords 'apple' or 'iphone'. The sentiment in these comments doesn't necessarily reflect people's opinion of the S9 itself, so they were removed.
apple_comments = []
samsung_comments = []
for i in comments_replies:
    if 'apple' in i[0] or 'iphone' in i[0]:
        apple_comments.append(i[0])
    elif len(i[0]) > 0:
        samsung_comments.append(i[0])

Part 2: Predicting Polarity of the comments
TextBlob is a handy Python package for unsupervised sentiment analysis. After predicting the polarity of each comment, the scores are converted to a 1-5 star scale.
polarity = []
for c in samsung_comments:
    blob = TextBlob(c)
    # polarity of the whole comment, in [-1, 1]
    polarity.append(blob.sentiment.polarity)
max_, min_ = max(polarity), min(polarity)
stars = []
for i in range(len(samsung_comments)):
    x = int(round((polarity[i] - min_) * 5 / (max_ - min_)))
    stars.append(x)
stars = np.asarray(stars)
for i in range(len(stars)):
    if stars[i] == 0:
        stars[i] = 1
The last loop above converts all zero-star ratings to one star.
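The rescaling above is a plain min-max transform; as a standalone sketch (with hypothetical polarity bounds of -1 and 1 standing in for the observed min and max):

```python
def polarity_to_stars(polarity, min_, max_):
    """Min-max rescale a polarity score onto a 1-5 star scale (0 bumped to 1)."""
    star = int(round((polarity - min_) * 5 / (max_ - min_)))
    return max(star, 1)

# assuming the observed polarities span the full [-1, 1] range
polarity_to_stars(-1.0, -1.0, 1.0)  # most negative comment -> 1 star
polarity_to_stars(1.0, -1.0, 1.0)   # most positive comment -> 5 stars
```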
# Comments to Pandas DataFrame
samsung_comments = pd.DataFrame(samsung_comments)[0]
For further cleaning of the comments, stop-word lists from both NLTK and scikit-learn are imported and merged.
# Importing StopWords
nltk_stop = stopwords.words('english')
sklearn_stop = list(stop_words.ENGLISH_STOP_WORDS)
# note: this rebinds the name stop_words from the imported module to a plain list
stop_words = list(set(nltk_stop + sklearn_stop))
Splitting datasets based on their corresponding star ratings.
# Getting indices based on predicted star ratings #################
one_star_ind = np.where(stars == 1)[0]
two_star_ind = np.where(stars == 2)[0]
three_star_ind = np.where(stars == 3)[0]
four_star_ind = np.where(stars == 4)[0]
five_star_ind = np.where(stars == 5)[0]
# Splitting comments to words
samsung_split = []
for i in range(len(samsung_comments)):
    x = samsung_comments[i].split()
    samsung_split.append(x)
# Comments based on Predicted stars
one_star = []
two_star = []
three_star = []
four_star = []
five_star = []
for i in one_star_ind:
    one_star.append(samsung_split[i])
for i in two_star_ind:
    two_star.append(samsung_split[i])
for i in three_star_ind:
    three_star.append(samsung_split[i])
for i in four_star_ind:
    four_star.append(samsung_split[i])
for i in five_star_ind:
    five_star.append(samsung_split[i])
Now that the comments are split into words, stop words can be removed from them.
# Removing stopwords from comments
for i in range(len(one_star_ind)):
    one_star[i] = [x for x in one_star[i] if x not in stop_words]
for i in range(len(two_star_ind)):
    two_star[i] = [x for x in two_star[i] if x not in stop_words]
for i in range(len(three_star_ind)):
    three_star[i] = [x for x in three_star[i] if x not in stop_words]
for i in range(len(four_star_ind)):
    four_star[i] = [x for x in four_star[i] if x not in stop_words]
for i in range(len(five_star_ind)):
    five_star[i] = [x for x in five_star[i] if x not in stop_words]
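The five near-identical loops could equivalently be collapsed into a single dictionary keyed by star rating; a minimal sketch with toy comments and a toy stop-word set (not the merged list from the script):

```python
toy_stop_words = {'the', 'a', 'is'}  # toy set standing in for the merged stop-word list
buckets = {
    1: [['the', 'camera', 'is', 'bad']],
    5: [['a', 'great', 'display']],
}
# one loop filters every rating bucket in place
for rating, comments in buckets.items():
    buckets[rating] = [[w for w in c if w not in toy_stop_words] for c in comments]
```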
Part 3: Data Visualization with Word Cloud
This visualization helps verify the sanity of the predicted star values, and also identifies the key aspects fans are looking forward to in the new phone.
# WordCloud generator
def word_cloud_gen(data, font_size = 40, title = ''):
    data = data.lower()
    wordcloud = WordCloud().generate(data)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(title, fontsize = 24)
    plt.show()  # render each cloud in its own figure
word_cloud_gen(str(one_star), title = 'one_star')
word_cloud_gen(str(two_star), title = 'two_star')
word_cloud_gen(str(three_star), title = 'three_star')
word_cloud_gen(str(four_star), title = 'four_star')
word_cloud_gen(str(five_star), title = 'five_star')
all_words = one_star + two_star + three_star + four_star + five_star
word_cloud_gen(str(all_words), title = 'all_words')

To get the buzz words, each word from the comments is mapped to its number of occurrences. Words that occur fewer than 5 times are ignored.
# Creating a dictionary that maps each word to its number of occurrence
word2count = {}
for comment in all_words:
    for word in comment:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
word2count = sorted(word2count.items(), key=operator.itemgetter(1), reverse=True)
# Dropping words that occur fewer than 5 times (deleting from a list while
# iterating over it skips entries, so filter with a comprehension instead)
word2count = [(word, count) for word, count in word2count if count >= 5]
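The counting-and-filtering step can also be done with `collections.Counter`; a sketch on toy data (threshold lowered to 2 so the toy example keeps something):

```python
from collections import Counter

toy_words = [['camera', 'great'], ['camera', 'battery'], ['battery'], ['camera']]
# count every word across all comments in one pass
counts = Counter(w for comment in toy_words for w in comment)
# keep words occurring at least twice, already sorted by frequency
frequent = [(w, c) for w, c in counts.most_common() if c >= 2]
```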
To find the aspects, one- and two-word noun phrases are extracted from the comments using TextBlob. The phrases are then cleaned by removing words that don't help.
# Finding the Key Aspects
aspects_original = []
for r in samsung_comments:
    blob = TextBlob(r)
    aspects_original.append(blob.noun_phrases)
aspects_filtered = []
for phrases in aspects_original:
    for phrase in phrases:
        if str(phrase) not in aspects_filtered:
            aspects_filtered.append(str(phrase))
aspects_split = []
for i in range(len(aspects_filtered)):
    x_split = aspects_filtered[i].lower().split()
    x_split = [x for x in x_split if x not in stop_words]
    aspects_split.append(x_split)
aspects_cleaned = []
for i in range(len(aspects_split)):
    for x in range(len(word2count)):
        if word2count[x][0] in aspects_split[i]:
            if len(aspects_split[i]) < 3:
                aspects_cleaned.append(aspects_split[i])
aspects_cleaned = [list(x) for x in set(tuple(x) for x in aspects_cleaned)]
aspects_final = []
futile_words = ['good', 'great', 'bad', 'awesome', 'poor', 'use', 'new', 'need', 'samsung', 'simple',
'lol', 'high', 'low', 'work', 'works', 'quick', 'lot', 'excellent', 'wow',
'people','s9','upgrade','yeah', 'phone', 'phones', 'phon', 'galaxy', 's8']
for i in range(len(aspects_cleaned)):
    minus = [x for x in aspects_cleaned[i] if x not in futile_words]
    if len(minus) > 0:
        aspects_final.append(minus)
full_list = [list(x) for x in set(tuple(x) for x in aspects_final)]
# Flattening the list of lists; the 0 separators are filtered out later
# by the len(str(x)) > 2 check
full_list = reduce(lambda a, b: a + [0] + b, full_list)
full_list = list(set(full_list))
aspects_selected = []
for i in range(len(word2count)):
    for x in full_list:
        if str(x) in word2count[i][0]:
            if len(str(x)) > 2:
                if x not in aspects_selected:
                    aspects_selected.append(x)
features = []
for i in range(len(aspects_final)):
    for x in aspects_selected:
        if str(x) in aspects_final[i]:
            features.append(aspects_final[i])
features = [list(x) for x in set(tuple(x) for x in features)]
word_cloud_gen(str(features), title = 'features')

Part 4: Evaluation
The top words from the generated features word cloud were selected as aspects.
# Each aspect groups words that convey the same meaning
camera = ['camera','cameras']
display = ['screen','display','edge']
fingerprint = ['fingerprint','finger','scanner','sensor']
price = ['price','cost','expensive','money','afford']
headphone_jack = ['headphone','jack']
emoji = ['emoji', 'ar','animoji','emojis']
battery = ['battery']
aspects = [camera, display, fingerprint, price, headphone_jack, emoji, battery]
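A hypothetical alternative representation: inverting these synonym lists into a single lookup table maps any word straight to its aspect name, which avoids scanning every list per word:

```python
# same grouping idea as above, shown with a subset of the aspects
aspect_synonyms = {
    'camera': ['camera', 'cameras'],
    'display': ['screen', 'display', 'edge'],
    'battery': ['battery'],
}
# invert: each synonym becomes a key pointing at its aspect name
word_to_aspect = {w: name for name, words in aspect_synonyms.items() for w in words}
```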
Not all comments contain the aspect words. We are interested in comments that help us understand what people think about the various improvements and features in the new phone, so only comments containing these words are selected.
# Getting indices of comments containing the aspects
camera_i = []
display_i = []
fingerprint_i = []
price_i = []
headphone_i = []
emoji_i = []
battery_i = []
for i in range(len(samsung_split)):
    for x in aspects[0]:
        if x in samsung_split[i]:
            if i not in camera_i:
                camera_i.append(i)
    for x in aspects[1]:
        if x in samsung_split[i]:
            if i not in display_i:
                display_i.append(i)
    for x in aspects[2]:
        if x in samsung_split[i]:
            if i not in fingerprint_i:
                fingerprint_i.append(i)
    for x in aspects[3]:
        if x in samsung_split[i]:
            if i not in price_i:
                price_i.append(i)
    for x in aspects[4]:
        if x in samsung_split[i]:
            if i not in headphone_i:
                headphone_i.append(i)
    for x in aspects[5]:
        if x in samsung_split[i]:
            if i not in emoji_i:
                emoji_i.append(i)
    for x in aspects[6]:
        if x in samsung_split[i]:
            if i not in battery_i:
                battery_i.append(i)
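The seven copy-pasted inner loops could be generalized into one pass over a dict of aspects; a sketch on toy data with two aspects:

```python
toy_aspects = {'camera': ['camera', 'cameras'], 'battery': ['battery']}
toy_split = [['great', 'camera'], ['battery', 'life'], ['nice', 'phone']]
# one index list per aspect, filled in a single pass over the comments
aspect_indices = {name: [] for name in toy_aspects}
for i, words in enumerate(toy_split):
    for name, synonyms in toy_aspects.items():
        if any(w in words for w in synonyms):
            aspect_indices[name].append(i)
```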
# Getting the corresponding ratings from the retrieved indices
camera_stars = []
display_stars = []
fingerprint_stars = []
price_stars = []
headphone_stars = []
emoji_stars = []
battery_stars = []
for i in camera_i:
    camera_stars.append(stars[i])
for i in display_i:
    display_stars.append(stars[i])
for i in fingerprint_i:
    fingerprint_stars.append(stars[i])
for i in price_i:
    price_stars.append(stars[i])
for i in headphone_i:
    headphone_stars.append(stars[i])
for i in emoji_i:
    emoji_stars.append(stars[i])
for i in battery_i:
    battery_stars.append(stars[i])
The average of these ratings gives an idea of people's opinion on each chosen aspect.
# Calculating the average rating for each aspect
camera_stars = round(sum(camera_stars) / len(camera_stars), 2)
display_stars = round(sum(display_stars) / len(display_stars), 2)
fingerprint_stars = round(sum(fingerprint_stars) / len(fingerprint_stars), 2)
price_stars = round(sum(price_stars) / len(price_stars), 2)
headphone_stars = round(sum(headphone_stars) / len(headphone_stars), 2)
emoji_stars = round(sum(emoji_stars) / len(emoji_stars), 2)
battery_stars = round(sum(battery_stars) / len(battery_stars), 2)
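Each per-aspect score is simply the rounded mean of the matching star ratings; for instance, with hypothetical ratings:

```python
toy_camera_stars = [5, 4, 3, 5]  # hypothetical ratings for illustration
camera_score = round(sum(toy_camera_stars) / len(toy_camera_stars), 2)
# camera_score == 4.25
```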
Inference
With the predicted sentiments (scale 1-5) from the above code, a Tableau bar chart was generated to visualize the results.
Sentiment

From the above chart, it is clear that people are happy with the presence of the "headphone jack" and the new position of the "fingerprint sensor". The camera, the main focus of this year's edition, got generally positive reviews.
"The Samsung Galaxy S8 is a beautiful phone with a critical flaw. The fingerprint sensor is in the wrong damn spot. A phone is only as good as its usability and, to me, the S8 is crippled with its fingerprint sensor located off center. It drives me mad but Samsung finally righted the wrong with the S9." - techcrunch.com

Another new feature, AR Emoji, got below-average reviews from the audience, with many labelling it creepy.
Most talked about

This chart shows the hot topics among people discussing the S9.