March 20, 2017

What is a Data Scientist?

Cloudera's Sean Owen helps CBR explain the role of the data scientist.

By Ellie Burns

A data scientist is a professional who works to extract insights and make sense of large sets of data, or ‘big data’.

As the term is broad and continually evolving, CBR turned to Cloudera’s Sean Owen to help explain this ever-important role.

 

EB: What is a data scientist?

Sean Owen, director of data science, Cloudera

SO: I still see it as a term used to mean a variety of things. To my ears it sounds like “enterprise architect,” a title claimed by everyone from junior DBAs to software engineers writing anything that doesn’t have a UI. Perhaps they’re all correct, but the title is now irritatingly meaningless.

To be sure, it’s something to do with software engineering, domain knowledge, and statistics. Data scientists are those making plots on a laptop in R, but also those building big ML pipelines on Apache Hadoop, even if those are quite different things. The best data scientists can do both, and it has become expected that people good at either will become versed in both.

As the field matures we may fall back to more specific terms like data analyst, data engineer and statistician rather than bundle all of them in “data scientist”.

 

EB: What is the difference between a data analyst/statistician and a data scientist?

SO: Supposedly, the data scientist also possesses software engineering skills. However, a lot of data science profiles I see, I’d more readily call statisticians. They typically have a formal background in applied math, statistics or related fields like operations research. They tend to be skilled in exploratory analytics tools like R, SAS or SPSS and they tend to not have as much exposure to software engineering practices or tools.

These are the original “data scientists”, yet they find themselves at some disadvantage where there isn’t enough distinction between the role of modelers working in a lab and the role of engineers maintaining the production systems these models affect. Suddenly there’s a demand to also write and maintain production software or build models at scale on clusters.

 

EB: Do all organisations need a data scientist?

SO: Only hire if your data house is in order. Hiring a data scientist as a ‘fixer’ for lack of data strategy will be a waste of time. Until data is abundant, clean and accessible, weaving it into magic business results will be an uphill struggle. Build a data engineering discipline first, and you’ll then find that data science is the easy part. It’s also important to be clear about which type of data scientist you need. Some are ‘exploratory’, tend to work in environments like R, Python, and SAS, work on a laptop and create visualisations. Some are ‘operational’ and productionise large scale software implementations of learning systems.

EB: What are the specific skills a data scientist must have?

SO: It’s determined by what kind of “data scientist” is required. The “data engineer” type needs an understanding of information architecture and of tools for storing and structuring data. He or she also needs software engineering skills to create, debug and maintain pipelines that feed and score models. These skills usually centre on the Apache Hadoop ecosystem, including Apache Spark.

The “statistician” type needs some background in linear algebra and statistics, and knowledge of some kind of exploratory analytics environment like Python’s scikit-learn, or R. This type of data scientist is often assumed to have skill in visualisation, though that is more of a bonus.

All the better if a data scientist can do some of both. These skills are usually paired with something like a management consultant’s skill set, an ability to quickly understand the basics of a business domain, extract a solvable data problem from conflicting requirements, and explain the results to the business.

 

EB: Why are data scientists in such high demand at the moment?

SO: The implication is that they’re also in short supply, and I challenge that assumption. Hiring managers would tell you that they see lots of interest in, and applications for, data science requisitions. Can they be both in high supply and demand at once?

Anyone highly skilled in two valuable fields will always be in demand. This is self-evident. Hiring for a jack-of-all-trades profile as a hedge against uncertainty, or in some attempt to make a role more appealing, will cause problems. If it’s really a business analyst that’s needed, with pay on offer to match, then hiring for a bunch of other attributes at the same time would make it seem like there’s a shortage, when there are in fact plenty of business analysts out there wondering why employers keep wanting them to know Scala.

 

Data Scientist – The sexiest job of the 21st century?

The role of the data scientist directly stems from the field of data science, with the latter term having existed for over 30 years.

Initially, the term was used by Peter Naur in 1960 as a substitute for computer science. Naur, a Danish computer science pioneer and Turing Award winner, helped bring ‘data science’ into the mainstream with the 1974 publication of Concise Survey of Computer Methods.

The term gained traction in the IT and academic worlds, coming to a head in 2012 when the Harvard Business Review published Data Scientist: The Sexiest Job of the 21st Century. In the article, DJ Patil claimed to have coined the term ‘data scientist’ in 2008 with Jeff Hammerbacher. Patil, the former Chief Data Scientist of the United States Office of Science and Technology Policy, says they came up with the term after having to define their jobs at LinkedIn and Facebook.

Although the term data scientist is a stalwart of conferences, studies and reports, debate still rages over the distinction between statisticians and data scientists.

Many still argue that there is no distinction at all, since ‘data scientist’ lacks any clear definition.

However, those backing data scientists as clearly separate from statisticians argue that the former has to have much more business acumen than the latter and is an evolution from business analytics.

Looking at the differences between the two roles, data scientists are often required to produce answers in days rather than months and present those answers in dashboards. Statisticians, on the other hand, present results in papers or reports.

 

 
