March 5, 2014, updated 22 Sep 2016 11:09am

Five things you should know about Apache Hive

CBR talks you through the popular data warehouse software.

By Claire Vanner

1. What is Hive?
Apache Hive is data warehouse software built on top of Apache Hadoop that gives structure to big data so that it can be queried, summarised and analysed.

Hive uses a SQL-like language of its own, Hive Query Language (also known as HiveQL or HQL), whose queries are translated into MapReduce jobs that read the data stored on Hadoop.
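
To make the link between an HQL query and MapReduce more concrete, the sketch below imitates in plain Python the map, shuffle and reduce phases that an aggregate query such as `SELECT page, COUNT(*) FROM logs GROUP BY page` broadly corresponds to. The table name `logs`, the column `page` and the sample rows are assumptions made for illustration, and this is a conceptual model rather than the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a 'logs' table: (user, page)
ROWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/about"),
    ("carol", "/home"),
]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed on the GROUP BY column."""
    for _user, page in rows:
        yield page, 1

def shuffle_phase(pairs):
    """Collect the emitted values under their key, as happens between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, giving the COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}

def count_by_page(rows):
    """Chain the three phases to answer the GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))

if __name__ == "__main__":
    print(count_by_page(ROWS))
```

The point of HQL is that an analyst writes only the one-line query, and the equivalent of these three functions is produced for them.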

Operating as an open source volunteer project since 2008, it has a team of developers, or Hive committers, who contribute to the code and run tests to improve the software.

The software used to be known as Hadoop Hive as it was a subproject of Hadoop, but the committers involved in the volunteer project have been so forthcoming that it has graduated to become a top-level project of its own.

 

2. Why should you use Hive?

Hortonworks says that a typical use case for Hive is when you need to take large amounts of polystructured data and place it into a structure and view that is easier for business analysts to use.


As well as enabling ad hoc queries, summarisation and data analysis, HQL can also be extended with custom scalar functions (user-defined functions) which turn multiple rows in databases into

You should not use Hive for real-time queries and row-level updates as it does not have the speed. It is better used for batch jobs over large sets of immutable data, such as web logs.

 

3. What are the benefits of using Hive?

According to the Apache Hive wiki, because HQL is similar to SQL, users who know SQL can query the data directly. At the same time, programmers who are familiar with the MapReduce framework can plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
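
One way a custom mapper is commonly plugged in is as a streaming script: HQL's `TRANSFORM ... USING` clause passes rows to an external program as tab-separated lines on standard input and reads tab-separated rows back from standard output. The sketch below is a hypothetical Python mapper that reduces a URL column to its domain. The column layout (`user_id`, `url`), the script name `extract_domain.py` and the table name `web_logs` are all assumptions for illustration.

```python
import sys
from urllib.parse import urlparse

# Hypothetical invocation from HQL, assuming this file is saved as
# extract_domain.py and ends with a call to main():
#   ADD FILE extract_domain.py;
#   SELECT TRANSFORM (user_id, url)
#   USING 'python extract_domain.py' AS (user_id, domain)
#   FROM web_logs;

def transform_line(line):
    """Turn one 'user_id<TAB>url' input row into 'user_id<TAB>domain'."""
    user_id, url = line.rstrip("\n").split("\t", 1)
    # Fall back to a placeholder when the value has no network location.
    domain = urlparse(url).netloc or "unknown"
    return f"{user_id}\t{domain}"

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from stdin to stdout, skipping blank lines."""
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")
```

Keeping the per-row logic in `transform_line` means it can be checked on a laptop with a few sample strings before the script is ever run against a cluster.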

The Hive system is also easily scalable, making it well suited to managing extensible big data sets. Hadoop developer Hortonworks says this means that more commodity machines can be added to the cluster without a corresponding reduction in performance.

It is also widely accessible, as familiar JDBC and ODBC drivers allow lots of applications to pull Hive data easily for reporting, meaning it can be used across a variety of apps.

4. Who is using Hive?
Hive was originally developed by Facebook, as the social network needed a query language to allow its staff to analyse the data.

According to Facebook, its vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining Hadoop's extensibility and flexibility. It is still used by Facebook today for summarisation jobs, business intelligence and machine learning, loading 15TB of data on a daily basis.

Hive is also used by media streaming service Netflix for ad hoc queries and analytics. The Netflix data and engineering team say they use Hadoop clusters with Hive's command-line interface to allow developers to log in remotely and run jobs, as well as to run daily summaries of the site through its cloud-based Hadoop data warehouse.

 

5. How does Hive compare to Pig?
Pig and Hive are often compared as they both sit on top of Hadoop to structure data. Which one suits you depends on what you are looking to do with your data, but the main difference is the way the scripting language is written.

Pig Latin, as the name suggests, is a simpler language, but database developers with a good handle on SQL should have no problem with HQL.

Hive also has more functionality as it has a web interface that can be used to issue and view queries, which Pig does not. Hive can store table definitions in a metastore database prior to determining whether they are acceptable, which makes it a suitable tool for data warehousing design. Pig, by contrast, is effective software for optimising dataflow within big data.
