Aaron Auld, CEO EXASOL.

Data analytics is a $17 billion industry and a huge growth area in today’s competitive business environment. Businesses have woken up to the fact there is value in their data, and as a result the humble data scientist is in high demand.

Last month LinkedIn released its top skills of 2016 survey results and for the third year in a row, data science – “Statistical Analysis and Data Mining” – is amongst the top two, vying for top position with “Cloud and Distributed Computing”. Drilling down by individual country, the UK lists data science as number one, the top skill of 2016.

Data scientists have their own tools and tactics for delving into data, and in this piece we will look at a few of the languages they use. We will give an overview of each language, explain why it is popular for data analytics and provide an insight into the language choices data scientists make.

The languages of Data Analytics

Java is probably the most popular and widely used programming language in computing today. The first public implementation was released by Sun Microsystems in 1995 and it promised that you could “Write Once, Run Anywhere”. This was a compelling idea at the time because many other languages had to be recompiled for different types of computers. Java is different because it compiles into bytecode – as opposed to machine code – that can run on any computer architecture using a “Java Virtual Machine”, or JVM for short.

This portability, along with its scalability, performance and reliability, set Java up for the next 21 years. Popular big data tools such as Hadoop and Cassandra are written in Java, and Spark, although written largely in Scala, runs on the JVM, so it’s no coincidence that the language is popular for data analytics too, especially as the basis for large and complex projects.

R is widely used among statisticians and data miners for data analysis; in fact, it is probably the most popular language for pure data science. With just a few lines of code you can sift through complex data sets, manipulate data using sophisticated modelling functions and create slick graphics. The R language is backed by an active community that is constantly adding new packages and features to its already rich function sets.

R is often considered more of a prototyping language, used to test out ideas and manipulate data before the model is handed off to be rewritten in Java or Python. It was not considered fast or stable enough for production use, but this thinking is becoming outdated as R can now be integrated directly into a fast database to work on the data natively.

Python is another incredibly popular language that is vying with Java for top dog by some measures. The language emphasises readability and clarity – amongst other positive traits – which makes it easy to work with. It is intuitive and fast to learn and the ecosystem for data analytics in Python has grown dramatically in recent years. This means that some of the statistical analysis libraries that were previously reserved for R are now available in Python too.

Two Python libraries in particular – NumPy and SciPy – are designed for numerical and scientific data processing. Python isn’t the fastest language: in most instances it is interpreted rather than compiled, which slows it down, but its ease of use makes it stand out from the crowd.

SQL, or Structured Query Language, is a special-purpose programming language for data manipulation. It doesn’t have all the functionality of the other languages, but it is the de facto way of retrieving, inserting and modifying data in a database. Usually, SQL is used for the most basic data manipulation before the results are handed off to one of the other data analytics languages. Many business intelligence and reporting tools generate SQL queries under the hood to obtain the data before manipulating it internally.
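
To make that hand-off concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for whatever database you actually use, and the "orders" table, its columns and its values are invented purely for illustration. SQL does the retrieval and filtering, and Python then carries out the statistics.

```python
import sqlite3
import statistics

# Hypothetical in-memory table; the names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [
        ("north", 120.0),
        ("north", 80.0),
        ("south", 200.0),
        ("south", 50.0),
        ("south", 110.0),
    ],
)

# Step 1: SQL handles the basic retrieval and filtering.
rows = conn.execute(
    "SELECT amount FROM orders WHERE region = ? ORDER BY amount",
    ("south",),
).fetchall()
conn.close()

# Step 2: the result set is handed to Python for the analysis itself.
amounts = [amount for (amount,) in rows]
median_amount = statistics.median(amounts)
spread = statistics.stdev(amounts)
print(f"south: n={len(amounts)}, median={median_amount}, stdev={spread:.2f}")
```

The split shown here is the same one many reporting tools make: the query narrows the data down, and the general-purpose language does whatever SQL alone would make awkward.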

What does the future hold for Data Analytics Languages?

When Java came along in the mid-nineties, many computer scientists predicted the demise of older languages such as C and C++. However, 21 years later these languages are still popular. This suggests that programming languages have longevity, especially once a loyal following and a community are established.

It is likely that SQL, R, Python and Java will continue to be used for analysing data for years to come. The change that we foresee in coming years is a move towards convenience and integration. Running your programming language of choice right there in the database means that you don’t have to extract the data before analysing it. The database shouldn’t limit you or slow down your thought process. That’s why version 6 of the analytic database EXASOL includes a framework for integrating the programming language of your choice and supports the four main languages of data analytics out-of-the-box.
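
The general idea of running your own code inside the database can be sketched with Python’s built-in sqlite3 module, which lets you register a Python class as an aggregate function that the database engine calls while it executes a query. This is only an illustration of the concept using SQLite; it is not EXASOL’s framework, and the "readings" table, the "py_median" function name and the data are all hypothetical.

```python
import sqlite3
import statistics


class Median:
    """Aggregate written in Python that SQLite calls while running a query."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group to produce the aggregate result.
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
# Register the Python class under a hypothetical SQL function name.
conn.create_aggregate("py_median", 1, Median)

conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("a", 10.0), ("b", 4.0), ("b", 6.0)],
)

# The analysis runs as part of the query; no separate extraction step is needed.
result = conn.execute(
    "SELECT sensor, py_median(value) FROM readings "
    "GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()
print(result)
```

Because the function is invoked from within the query, only the per-group results come back to the caller rather than every underlying row.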

Many predictions tell us that in the coming years, with the proliferation of the Internet of Things and ever-more granular data recording, the volume of data we are analysing is going to grow exponentially. Therefore, having the right tools to analyse this data and using them to extract value from the data is going to become even more important. Being able to find data insights will be vital for businesses to succeed in the future.