How the world reacted to the Samsung Galaxy S9: prediction using TextBlob

Posted on February 26, 2018

Sentiment analysis on the new Samsung Galaxy S9 YouTube ad, studying the comments to understand which aspects of Samsung's flagship smartphone people are eager to see. The sentiment for each aspect is rated on a scale of 1-5 (extremely unhappy to extremely happy).

The YouTube Ad

Getting the Comments

Just a day after the phone's launch, over 10k comments were downloaded to a .csv file from this website

The Script

You can scroll down to the end to see the results :)

Part 1: Data Cleansing
## Importing Libraries 
import re
import pandas as pd
import numpy as np
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.feature_extraction import stop_words
from textblob import TextBlob
import nltk
import operator
from functools import reduce
nltk.download('punkt')
nltk.download('brown')
#Importing the dataset

df = pd.read_csv('samsung_comments.csv')
comments = df.iloc[:,0:1].values
replies = df.iloc[:,4:5].values

The comments and their replies are imported as two separate columns. Both are relevant to our analysis and need to be concatenated.

#Adding the replies to comments list

comments_replies = []
for i in comments:
    if str(i[0]) != 'nan':
        comments_replies.append(str(i[0]).lower())
for i in replies:
    if str(i[0]) != 'nan':
        comments_replies.append(str(i[0]).lower())
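The same concatenation can also be done directly with pandas; a minimal sketch on a toy frame, assuming the same column layout (comments in column 0, replies in column 4):

```python
import pandas as pd

# Hypothetical toy frame mimicking the assumed layout of samsung_comments.csv
df = pd.DataFrame({
    'comment': ['Great Phone!', None],
    'a': 0, 'b': 0, 'c': 0,
    'reply': ['Agreed', None],
})

# Stack both columns, drop the NaN/None entries, and lowercase in one chain
combined = (
    pd.concat([df.iloc[:, 0], df.iloc[:, 4]])
      .dropna()
      .str.lower()
      .tolist()
)
print(combined)  # ['great phone!', 'agreed']
```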

Now we define the function that cleans the comments for easy analysis. The contraction rules expand shortened forms, and PorterStemmer reduces each word to its stem (e.g. plurals to their singular form).

################### Cleaning Data ###################
stemmer = PorterStemmer()

def clean_text(text):
    # Expand contractions before stripping punctuation.
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    # "won't" and "can't" must come before the generic "n't" rule,
    # otherwise they become "wo not" / "ca not".
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"[-()\"#/@;:<>{}`$+=~|'.!?,]", "", text)
    # Stem token by token; stemming the whole string would only touch the last word.
    text = " ".join(stemmer.stem(word) for word in text.split())
    return text

for i,c in enumerate(comments_replies):
    x = clean_text(c)
    comments_replies[i] = [re.sub(' +',' ',x)]
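As a quick sanity check on what the stemmer does (assuming nltk is installed):

```python
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()
# The Porter algorithm strips common suffixes, reducing plurals to their stems
print(demo_stemmer.stem("cameras"))  # camera
print(demo_stemmer.stem("phones"))   # phone
```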

More than 2,000 comments contained the keywords 'apple' or 'iphone'. The sentiment of these comments does not necessarily reflect people's opinion of the S9 itself, so they were removed.

apple_comments = []
samsung_comments = []

for i in comments_replies:
    if 'apple' in i[0] or 'iphone' in i[0]:
        apple_comments.append(i[0])
    else:
        if len(i[0]) > 0:
            samsung_comments.append(i[0])
[Image: Samsung comments data]
Part 2: Predicting Polarity of the comments

TextBlob is a handy Python package for unsupervised sentiment analysis. After the polarity is predicted, it is rescaled to 1-5 stars.

polarity = []
for c in samsung_comments:
    # Polarity of the whole comment, in [-1, 1]
    blob = TextBlob(c)
    polarity.append(blob.sentiment.polarity)

max_, min_ = max(polarity), min(polarity)
stars = []
for i in range(len(samsung_comments)):
    x =  int(round((((polarity[i] - min_) * (5)) / (max_ - min_))))
    stars.append(x)

stars = np.asarray(stars)
for i in range(len(stars)):
    if stars[i] == 0:
        stars[i] = 1

The last loop above converts zero-star ratings to one star, so the final scale is 1-5.
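The rescaling above is plain min-max scaling of the polarity onto [0, 5], followed by rounding and clamping; a small worked example (a sketch, not the original script):

```python
def polarity_to_stars(p, min_, max_):
    # Min-max scale polarity from [min_, max_] onto [0, 5], then round
    star = int(round((p - min_) * 5 / (max_ - min_)))
    # Clamp 0 up to 1, matching the final loop in the script
    return max(star, 1)

# With polarities spanning [-1.0, 1.0]:
print(polarity_to_stars(-1.0, -1.0, 1.0))  # 1 (rounded to 0, then clamped)
print(polarity_to_stars(0.2, -1.0, 1.0))   # 3
print(polarity_to_stars(1.0, -1.0, 1.0))   # 5
```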

# Comments to Pandas DataFrame
samsung_comments = pd.DataFrame(samsung_comments)[0]

For further cleaning of the comments, stop words are imported (combining the NLTK and scikit-learn lists).

# Importing StopWords

nltk_stop = stopwords.words('english')
sklearn_stop = list(stop_words.ENGLISH_STOP_WORDS)
stop_words = list(set(nltk_stop + sklearn_stop))

Splitting the dataset based on the corresponding star ratings.
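np.where on a boolean mask returns the matching positions, which is how the per-star index arrays below are built; a minimal illustration:

```python
import numpy as np

# Toy ratings array
stars_demo = np.array([5, 1, 3, 1, 5])
# Positions of all one-star ratings
demo_ind = np.where(stars_demo == 1)[0]
print(demo_ind)  # [1 3]
```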

# Getting indices based on predicted star ratings #################

one_star_ind = np.where(stars == 1)[0]
two_star_ind = np.where(stars == 2)[0]
three_star_ind = np.where(stars == 3)[0]
four_star_ind = np.where(stars == 4)[0]
five_star_ind = np.where(stars == 5)[0]
# Splitting comments to words
samsung_split = []
for i in range(len(samsung_comments)):
    x = samsung_comments[i].split()
    samsung_split.append(x)

# Comments based on Predicted stars
one_star = []
two_star = []
three_star = []
four_star = []
five_star = []

for i in one_star_ind:
    one_star.append(samsung_split[i])
for i in two_star_ind:
    two_star.append(samsung_split[i])
for i in three_star_ind:
    three_star.append(samsung_split[i])
for i in four_star_ind:
    four_star.append(samsung_split[i])
for i in five_star_ind:
    five_star.append(samsung_split[i])

Now that the comments are split into words, stop words can be removed from the comments.

# Removing stopwords from comments  
for i in range(len(one_star_ind)):
    one_star[i] = [x for x in one_star[i] if x not in stop_words]

for i in range(len(two_star_ind)):
    two_star[i] = [x for x in two_star[i] if x not in stop_words]

for i in range(len(three_star_ind)):
    three_star[i] = [x for x in three_star[i] if x not in stop_words]

for i in range(len(four_star_ind)):
    four_star[i] = [x for x in four_star[i] if x not in stop_words]

for i in range(len(five_star_ind)):
    five_star[i] = [x for x in five_star[i] if x not in stop_words]
Part 3: Data Visualization with Word Cloud

This visualization helps verify the sanity of the predicted star values, and also identifies the key aspects fans are looking forward to in the new phone.

# WordCloud generator

def word_cloud_gen(data, font_size = 40, title = ''):
    data = data.lower()
    wordcloud = WordCloud(max_font_size = font_size).generate(data)
    plt.figure()  # new figure so successive clouds don't overwrite each other
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(title, fontsize = 24)
    plt.show()

word_cloud_gen(str(one_star), title = 'one_star')
word_cloud_gen(str(two_star), title = 'two_star')
word_cloud_gen(str(three_star), title = 'three_star')
word_cloud_gen(str(four_star), title = 'four_star')
word_cloud_gen(str(five_star), title = 'five_star')

all_words = one_star + two_star + three_star + four_star + five_star
word_cloud_gen(str(all_words), title = 'all_words')
[Word clouds for each star rating]

To get the buzzwords, each word from the comments is mapped to its number of occurrences. Words that appear fewer than five times are ignored.

# Creating a dictionary that maps each word to its number of occurrence 
word2count = {}
for comment in all_words:
    for word in comment:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

word2count = sorted(word2count.items(), key=operator.itemgetter(1), reverse=True)
# Keep only words that occur at least five times
word2count = [pair for pair in word2count if pair[1] >= 5]
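The same count-and-filter step can be written compactly with collections.Counter; a sketch on toy data, not the original script:

```python
from collections import Counter

# Hypothetical comments already split into words
all_words_demo = [['camera', 'great'], ['camera', 'battery'], ['camera']]

# Count every word across all comments
counts = Counter(word for comment in all_words_demo for word in comment)

# Keep words occurring at least a given number of times, most frequent first
frequent = [(w, n) for w, n in counts.most_common() if n >= 2]
print(frequent)  # [('camera', 3)]
```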

To find the aspects, one- and two-word noun phrases are extracted from the comments using TextBlob. The phrases are then cleaned by removing words that don't help.

# Finding the Key Aspects 
aspects_original = []
for r in samsung_comments:
    blob = TextBlob(r)
    aspects_original.append(blob.noun_phrases)

aspects_filtered = []
for phrases in aspects_original:
    for phrase in phrases:
        if str(phrase) not in aspects_filtered:
            aspects_filtered.append(str(phrase))

aspects_split = []
for i in range(len(aspects_filtered)):
    x_split = aspects_filtered[i].lower().split()
    x_split = [x for x in x_split if x not in stop_words]
    aspects_split.append(x_split)

aspects_cleaned = []
for i in range(len(aspects_split)):
    for x in range(len(word2count)):
        if word2count[x][0] in aspects_split[i]:
            if len(aspects_split[i]) < 3:
                aspects_cleaned.append(aspects_split[i])

aspects_cleaned = [list(x) for x in set(tuple(x) for x in aspects_cleaned)]
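The set(tuple(...)) round-trip above is needed because lists are unhashable and cannot go into a set directly; a minimal illustration:

```python
# Toy list of phrases with a duplicate
pairs = [['front', 'camera'], ['battery'], ['front', 'camera']]

# Convert inner lists to hashable tuples, dedupe via set, convert back
unique = [list(t) for t in set(tuple(p) for p in pairs)]
print(sorted(unique))  # [['battery'], ['front', 'camera']]
```

Note that the set destroys the original ordering, which is why sorting (or an order-preserving dict-based dedupe) may be wanted when order matters.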

aspects_final = []
futile_words = ['good', 'great', 'bad', 'awesome', 'poor', 'use', 'new', 'need', 'samsung', 'simple',
              'lol', 'high', 'low', 'work', 'works', 'quick', 'lot', 'excellent', 'wow',
              'people','s9','upgrade','yeah', 'phone', 'phones', 'phon', 'galaxy', 's8']
for i in range(len(aspects_cleaned)):
    minus = [x for x in aspects_cleaned[i] if x not in futile_words]
    if len(minus) > 0:
        aspects_final.append(minus)

full_list = [list(x) for x in set(tuple(x) for x in aspects_final)]
# Flatten the deduplicated phrases into a set of unique words
full_list = list(set(word for phrase in full_list for word in phrase))

aspects_selected = []
for i in range(len(word2count)):
    for x in full_list:
        if str(x) in word2count[i][0]:
            if len(str(x)) > 2:
                if x not in aspects_selected:
                    aspects_selected.append(x)

features = []
for i in range(len(aspects_final)):
    for x in aspects_selected:
        if str(x) in aspects_final[i]:
            features.append(aspects_final[i])

features = [list(x) for x in set(tuple(x) for x in features)]
word_cloud_gen(str(features), title = 'features')
[Word cloud: extracted aspects]
Part 4: Evaluation

The top words from the generated features wordcloud were selected as Aspects.

# The aspects have words that convey the same
camera = ['camera','cameras']
display = ['screen','display','edge']
fingerprint = ['fingerprint','finger','scanner','sensor']
price = ['price','cost','expensive','money','afford']
headphone_jack = ['headphone','jack']
emoji = ['emoji', 'ar','animoji','emojis']
battery = ['battery']
aspects = [camera, display, fingerprint, price, headphone_jack, emoji, battery]

Not all comments contain the aspect words. We are interested in comments that help us understand what people think about the various improvements and features in the new phone, so only comments containing these words are selected.

# Getting indices of comments containing the aspects 
camera_i = []
display_i = []
fingerprint_i = []
price_i = []
headphone_i = []
emoji_i = []
battery_i = []

for i in range(len(samsung_split)):
    for x in aspects[0]:
        if x in samsung_split[i]:
            if i not in camera_i:
                camera_i.append(i)
    for x in aspects[1]:
        if x in samsung_split[i]:
            if i not in display_i:
                display_i.append(i)
    for x in aspects[2]:
        if x in samsung_split[i]:
            if i not in fingerprint_i:
                fingerprint_i.append(i)
    for x in aspects[3]:
        if x in samsung_split[i]:
            if i not in price_i:
                price_i.append(i)
    for x in aspects[4]:
        if x in samsung_split[i]:
            if i not in headphone_i:
                headphone_i.append(i)
    for x in aspects[5]:
        if x in samsung_split[i]:
            if i not in emoji_i:
                emoji_i.append(i)
    for x in aspects[6]:
        if x in samsung_split[i]:
            if i not in battery_i:
                battery_i.append(i)
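The seven near-identical loops above can be collapsed into a dictionary keyed by aspect name; a hedged sketch on toy data (the per-aspect lists are kept as-is in the script itself):

```python
# Hypothetical aspect keywords and pre-split comments for illustration
aspects_named = {
    'camera': ['camera', 'cameras'],
    'battery': ['battery'],
}
samsung_split_demo = [['love', 'the', 'camera'], ['battery', 'life'], ['nice', 'colour']]

aspect_indices = {name: [] for name in aspects_named}
for i, words in enumerate(samsung_split_demo):
    for name, keywords in aspects_named.items():
        # Record the comment index once per matching aspect
        if any(k in words for k in keywords):
            aspect_indices[name].append(i)

print(aspect_indices)  # {'camera': [0], 'battery': [1]}
```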


# Getting the corresponding ratings from the retrieved indices               
camera_stars = []
display_stars = []
fingerprint_stars = []
price_stars = []
headphone_stars = []
emoji_stars = []
battery_stars = []

for i in camera_i:
    camera_stars.append(stars[i])
for i in display_i:
    display_stars.append(stars[i])
for i in fingerprint_i:
    fingerprint_stars.append(stars[i])
for i in price_i:
    price_stars.append(stars[i])
for i in headphone_i:
    headphone_stars.append(stars[i])
for i in emoji_i:
    emoji_stars.append(stars[i])
for i in battery_i:
    battery_stars.append(stars[i])

The average of these ratings gives an idea of people's opinion on each chosen aspect.

# Calculating the average rating for each aspect    

camera_stars = float("{0:.2f}".format(sum(camera_stars) / len(camera_stars)))
display_stars = float("{0:.2f}".format(sum(display_stars) / len(display_stars)))
fingerprint_stars = float("{0:.2f}".format(sum(fingerprint_stars) / len(fingerprint_stars)))
price_stars = float("{0:.2f}".format(sum(price_stars) / len(price_stars)))
headphone_stars = float("{0:.2f}".format(sum(headphone_stars) / len(headphone_stars)))
emoji_stars = float("{0:.2f}".format(sum(emoji_stars) / len(emoji_stars)))
battery_stars = float("{0:.2f}".format(sum(battery_stars) / len(battery_stars)))

Inference

With the predicted sentiments (scale 1-5) from the above code, a Tableau bar chart was generated to visualize the results.

[Bar chart: average star rating per aspect]

From the above chart, it is clear that people are happy with the presence of the "headphone jack" and also with the position of the "fingerprint sensor". The camera, the main focus of this year's model, got generally positive reviews.

"The Samsung Galaxy S8 is a beautiful phone with a critical flaw. The fingerprint sensor is in the wrong damn spot. A phone is only as good as its usability and, to me, the S8 is crippled with its fingerprint sensor located off center. It drives me mad but Samsung finally righted the wrong with the S9." - techcrunch.com

Emoji

Another new feature, AR Emoji, got below average reviews from the audience, with many labelling it creepy.

Most talked about
[Bar chart: number of comments per aspect]

This chart shows the hottest topics of discussion around the S9.