Data Scientist vs Data Analyst vs Data Engineer

using Word Cloud - Python

Posted on January 15, 2018

The terms Data Scientist, Data Analyst and Data Engineer are often used interchangeably. Although all three are data-focused roles, subtle differences set them apart. Since even hiring companies switch between the terms, we'll take a different approach to understanding each role.

Let's ask Google

Data Scientist

Data scientists are big data wranglers. They take an enormous mass of messy data points (unstructured and structured) and use their formidable skills in math, statistics and programming to clean, massage and organize them. Then they apply all their analytic powers – industry knowledge, contextual understanding, skepticism of existing assumptions – to uncover hidden solutions to business challenges.

Data Analyst

Data analysts collect, process and perform statistical analyses of data. Their skills may not be as advanced as data scientists (e.g. they may not be able to create new algorithms), but their goals are the same – to discover how data can be used to answer questions and solve problems.

Data Engineer

Data engineers build massive reservoirs for big data. They develop, construct, test and maintain architectures such as databases and large-scale data processing systems. Once continuous pipelines are installed to – and from – these huge “pools” of filtered information, data scientists can pull relevant data sets for their analyses.

The above definitions are a little vague and don't clearly explain what skill set a company expects from a potential candidate for each role.

Word Cloud

A word cloud is an image composed of the words used in a particular text or subject, in which the size of each word indicates its frequency or importance.
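
As a minimal sketch of that idea (not the wordcloud library's own implementation), word frequencies can be counted with Python's standard library and scaled into font sizes. The function name relative_sizes, the sample sentence and the maximum size of 40 are assumptions made for illustration.

```python
from collections import Counter
import re


def relative_sizes(text, max_size=40):
    """Map each word to a font size proportional to its frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; the rest scale linearly.
    return {word: max_size * count / top for word, count in counts.items()}


# Hypothetical sample text: 'data' occurs three times, so it gets the largest size
sample = "data models and data pipelines need clean data"
print(relative_sizes(sample))
```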

Data from LinkedIn

We collected a considerable amount of 'Job description and Qualifications' data for the above roles from LinkedIn postings by multiple companies. Generating word clouds from this data might help us distinguish the roles more clearly.

Generating Word Cloud - Python Code

The extracted data is saved in text files and used to generate the word clouds. This uses the word_cloud library, which can be installed with 'pip install wordcloud'.

          from wordcloud import WordCloud
          import matplotlib.pyplot as plt

          ##### Data analyst responsibilities
          with open('data/Data_analyst_responsibility.txt', 'r') as f:
              data_analyst_resp = f.read()

          ##### Data analyst skills
          with open('data/Data_analyst_skill.txt', 'r') as f:
              data_analyst_skill = f.read()

          ##### Data scientist responsibilities
          with open('data/data_scientist_responsibility.txt', 'r') as f:
              data_scientist_responsibility = f.read()

          ##### Data scientist skills
          with open('data/data_scientist_skills.txt', 'r') as f:
              data_scientist_skills = f.read()

          def word_cloud_job_title(data, font_size = 40, title = ''):
              # Filler terms common to most postings; removed before plotting
              stopwords = ['etc', 'years', 'degree', 'skill', 'using', 'preferred', 'field',
                           'based', 'related', 'including', 'ability', 'experience']
              data = data.lower()
              # Note: replace() strips each term wherever it occurs, even inside longer words
              for word in stopwords:
                  data = data.replace(word, "")

              # Generate a word cloud image; font_size caps the size of the largest word
              wordcloud = WordCloud(max_font_size=font_size).generate(data)

              # Display the generated image the matplotlib way
              plt.imshow(wordcloud, interpolation='bilinear')
              plt.axis("off")
              fig = plt.gcf()
              fig.set_size_inches(15, 10)
              plt.title(title, fontsize = 24)
              plt.show()

          ### Data_analyst responsibility
          word_cloud_job_title(data_analyst_resp, title = 'data_analyst_responsibility')

          ### Data_analyst skill
          word_cloud_job_title(data_analyst_skill, title = 'data_analyst_skill')

          ### Data scientist responsibility
          word_cloud_job_title(data_scientist_responsibility, title = 'data_scientist_responsibility')

          ### Data scientist skills
          word_cloud_job_title(data_scientist_skills, title='data_scientist_skills')

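          ### Data engineer data (file names are assumed; the original listing does not show them)
          with open('data/data_engineer_responsibility.txt', 'r') as f:
              data_engineer_responsibility = f.read()
          with open('data/data_engineer_skills.txt', 'r') as f:
              data_engineer_skills = f.read()

          ### Data engineer responsibility
          word_cloud_job_title(data_engineer_responsibility, title = 'data_engineer_responsibility')
          ### Data engineer skills
          word_cloud_job_title(data_engineer_skills, title='data_engineer_skills')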
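
The replace call in word_cloud_job_title removes a stopword wherever it appears, so 'using' would also clip 'housing' down to 'ho'. A hedged alternative is to drop whole words only. The sketch below uses just the standard library; the helper name remove_stopwords and the sample sentence are made up for illustration. The WordCloud constructor also accepts a stopwords argument, which may serve the same purpose.

```python
import re

# Same filler terms as in word_cloud_job_title
STOPWORDS = {'etc', 'years', 'degree', 'skill', 'using', 'preferred', 'field',
             'based', 'related', 'including', 'ability', 'experience'}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text and drop tokens that exactly match a stopword."""
    tokens = re.findall(r"[a-z0-9+#']+", text.lower())
    return ' '.join(token for token in tokens if token not in stopwords)


# 'housing' survives because only whole-word matches are removed
print(remove_stopwords("Experience using SQL for housing data, etc."))
# prints: sql for housing data
```

The cleaned string can then be passed to WordCloud().generate() in place of the raw text. Note that exact matching keeps plural forms such as 'skills', so those would need their own entries in the set.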

Exported Matplotlib Images

[Image: Gencloud]

Conclusion

Any company that processes large amounts of data will have employees in all three roles working in tandem. In the data engineer skills word cloud, we notice many keywords such as SQL, Spark and Hadoop, which are predominantly used for data processing. Data engineers process big data with these tools and make it easier for data scientists and analysts to work with the collected data.

Both data scientists and data analysts work closely with the business team, advising on decisions based on what they find in the data. Data scientists also develop prediction models, so stronger qualifications in programming, statistics and quantitative aptitude are expected of them. This again shows in the keywords of the data scientist skills word cloud (python, statistics, machine learning).
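
One way to check such observations without reading them off the images is to compare the most frequent terms of two roles directly. This is only a sketch: top_terms and distinctive_terms are hypothetical helpers, and the two short strings are made-up stand-ins for the collected LinkedIn text, not real data.

```python
from collections import Counter
import re


def top_terms(text, n=5):
    """Return the set of the n most frequent words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word for word, _ in Counter(words).most_common(n)}


def distinctive_terms(text_a, text_b, n=5):
    """Top terms of text_a that are not among the top terms of text_b."""
    return top_terms(text_a, n) - top_terms(text_b, n)


# Made-up stand-ins for the job description text of two roles
engineer = "sql spark hadoop pipelines sql spark data"
scientist = "python statistics machine learning python data"

print(sorted(distinctive_terms(engineer, scientist, n=10)))
# prints: ['hadoop', 'pipelines', 'spark', 'sql']
```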

[Image: Diff]

This article is co-authored by Chiranjeevi V, who is also a Machine Learning enthusiast. Check out his GitHub here