Five things you should know about Apache Hive

1. What is Hive?
Apache Hive is data warehouse software built on top of Apache Hadoop that gives structure to big data so that it can be queried, summarised and analysed.

Hive uses a SQL-like language of its own: Hive Query Language, also known as HiveQL or HQL. Queries written in HQL are translated into MapReduce jobs that run against the data stored on Hadoop.
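
As a rough illustration of how a SQL-like query relates to MapReduce work, the Python sketch below mimics the map, shuffle and reduce phases that a simple grouped count would need. It is a toy model under stated assumptions, not Hive's actual query planner: the web_logs table, its page column and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical query this sketch imitates (table and column names are assumptions):
#   SELECT page, COUNT(*) FROM web_logs GROUP BY page;

# Hypothetical sample rows standing in for a web_logs table: (ip, page)
WEB_LOGS = [
    ("10.0.0.1", "/home"),
    ("10.0.0.2", "/about"),
    ("10.0.0.1", "/home"),
    ("10.0.0.3", "/home"),
]


def map_phase(rows):
    """Emit one (page, 1) pair per row, as a mapper would."""
    for _ip, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Run the three phases in order over an iterable of (ip, page) rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for page, hits in sorted(count_by_page(WEB_LOGS).items()):
        print(page, hits)
```

The point of the sketch is only that a declarative GROUP BY can be expressed as a map step that emits keys and a reduce step that aggregates them, which is the kind of work Hive spares the analyst from writing by hand.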

Hive has operated as an open source volunteer project since 2008, with a team of developers, known as Hive committers, who contribute code and run tests to improve the software.

The software used to be known as Hadoop Hive, as it was a subproject of Hadoop, but the committers involved in the volunteer project have been so forthcoming that it has graduated to become a top-level project of its own.

2. Why should you use Hive?

Hortonworks says that a typical use case for Hive is when you need to take large amounts of polystructured data and place it into a structure and view that is easier for business analysts to use.

As well as enabling ad hoc queries, summarisation and data analysis, HQL can be extended with custom scalar functions (user-defined functions) and with aggregations, which turn multiple rows into a single result.

You should not use Hive for real-time queries and row-level updates as it does not have the speed. It is better used for batch jobs over large sets of immutable data, such as web logs.

3. What are the benefits of using Hive?

According to the Apache Hive wiki, HQL is similar to SQL, so users who already know SQL can query the data with little retraining. At the same time, programmers who are familiar with the MapReduce framework can plug in their own custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
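
One common way to plug in a custom mapper is HiveQL's TRANSFORM clause, which streams rows to an external script as tab-separated lines and reads tab-separated lines back. The Python sketch below is a minimal mapper of that kind. The script name extract_domain.py, the web_logs table and its ip and referrer columns are hypothetical, chosen only for illustration, and the script assumes every input row has exactly those two fields.

```python
import sys
from urllib.parse import urlparse

# Hypothetical HiveQL that would call this script (names are assumptions):
#   ADD FILE extract_domain.py;
#   SELECT TRANSFORM (ip, referrer)
#          USING 'python3 extract_domain.py'
#          AS (ip, domain)
#   FROM web_logs;


def transform_line(line):
    """Turn one tab-separated 'ip<TAB>referrer' row into 'ip<TAB>domain'."""
    ip, referrer = line.rstrip("\n").split("\t", 1)
    # urlparse returns an empty netloc when the value is not a full URL
    domain = urlparse(referrer).netloc or "unknown"
    return f"{ip}\t{domain}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive writes input rows to stdin and reads output rows from stdout
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Because the script only reads standard input and writes standard output, it can be tried locally by piping a tab-separated file into it before it is ever attached to a Hive query.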

The Hive system is also easily scalable, making it well suited to managing extensible big data sets. Hadoop developer Hortonworks says this means that more commodity machines can be added to the cluster without a corresponding reduction in performance.

It also works well with other tools: familiar JDBC and ODBC drivers allow many applications to pull Hive data easily for reporting, meaning it can be used across a variety of apps.

4. Who is using Hive?
Hive was originally developed by Facebook, as the social network needed a query language to allow its staff to analyse the data.

According to Facebook, its vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining Hadoop's extensibility and flexibility. Facebook still uses Hive today for summarisation jobs, business intelligence and machine learning, loading 15TB of data on a daily basis.

Hive is also used by media streaming service Netflix for ad hoc queries and analytics. The Netflix data and engineering team says it uses Hadoop clusters with Hive's command-line interface to allow developers to log in remotely and run jobs, as well as to run daily summaries of the site through its cloud-based Hadoop data warehouse.

5. How does Hive compare to Pig?
Pig and Hive are often compared as they both sit on top of Hadoop to give structure to data. Which one suits you depends on what you are looking to do with your data, but the main difference is the way the scripting language is written.

Pig Latin, as the name suggests, is a simpler language, but database developers with a good handle on SQL should have no problem with HQL.

Hive also has more functionality, as it has a web interface that can be used to issue queries, which Pig does not. Hive can store tables in a metastore database prior to determining whether they are acceptable, which makes it a suitable tool for data warehouse design. Pig, by contrast, is effective software for optimising dataflow within big data.
