As one of the world’s most popular free, Java-based software frameworks, Apache Hadoop is being used by a growing number of companies that can no longer manage their data using traditional methods.

The open-source platform can deal with a wide variety of data, whether it is structured or unstructured, in its native format. This allows companies to put Big Data at the heart of their strategies to manage and grow their business.

Here are some examples of large enterprises that have used Hadoop to optimise their online capabilities:

Amazon (A9)

Amazon’s A9 subsidiary builds the website’s product search indices using Hadoop’s streaming API and processes millions of sessions daily for analytics.

As the world’s largest e-commerce product search operation, A9’s search team manages this big data using C++, Perl and Python tools.
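
Hadoop’s streaming API is what makes this language mix possible: any executable that reads lines from stdin and writes key/value pairs to stdout can act as a mapper or reducer. The sketch below is a hypothetical illustration in Python, not A9’s actual pipeline; the log format (a tab-separated session ID in the first field) and file names are assumptions.

```python
#!/usr/bin/env python3
# mapper.py -- hypothetical Hadoop Streaming mapper: reads raw log lines
# from stdin and emits "<session_id>\t1" for a downstream count.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print(f"{fields[0]}\t1")
```

A matching reducer relies on Hadoop sorting the mapper output by key, so identical session IDs arrive consecutively and can simply be summed:

```python
#!/usr/bin/env python3
# reducer.py -- hypothetical Hadoop Streaming reducer: sums the counts
# emitted by mapper.py for each session ID.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")
```

A job like this would be submitted with the hadoop-streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -input logs/ -output session_counts/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`.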

The A9 product search engine operates on clusters varying from 1 to 100 nodes to analyse data, observe traffic patterns and index every product in Amazon’s catalogue.

Facebook

The world’s most popular social network handles big data in a big way: more than 1 billion pieces of content (web links, news stories, blog posts, photos, etc.) are shared each week.

Hadoop is used to store copies of internal log and dimension data sources, which serve as the source for reporting, analytics and machine learning. The data runs across two major clusters: one of 1,100 machines and another of 300.

On top of these clusters sits a Hadoop Hive warehouse, which structures and queries the data using a SQL-like language called HiveQL.
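
As a rough illustration of what HiveQL looks like in practice, the snippet below runs a hypothetical aggregation through the Hive command-line client; the table and column names are invented for the example and are not Facebook’s actual schema.

```python
import subprocess

# Hypothetical HiveQL: count shared items per content type for one day.
# HiveQL reads like SQL but is compiled into Hadoop jobs by Hive.
query = """
SELECT content_type, COUNT(*) AS shares
FROM shared_content
WHERE dt = '2014-01-01'
GROUP BY content_type
ORDER BY shares DESC;
"""

# The Hive CLI accepts an inline query with -e and prints the result.
subprocess.run(["hive", "-e", query], check=True)
```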

Hulu

Thanks to its comprehensive video streaming service, which draws over 30 million unique viewers a month, Hulu’s metrics team receives around a billion lines of events a day, known internally as beacons.

On a 13-machine cluster hosting HBase, Hulu runs beaconspec, a declarative language it developed in-house, to process this log data. Beacons, combined with baseline assertions about user behaviour, are aggregated into business intelligence.
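
Beaconspec itself is not publicly documented, but the general pattern of landing beacon-style events in HBase and rolling them up can be sketched with the happybase Python client; the table layout, row keys and column names below are assumptions made purely for illustration.

```python
import happybase

# Hypothetical beacon store in HBase; connect via an assumed Thrift gateway.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("beacons")

# One beacon per row, keyed by user ID and timestamp so a day's events
# for a user sit next to each other.
table.put(b"user123|2014-01-01T12:00:00Z", {
    b"event:type": b"playback_start",
    b"event:video_id": b"abc123",
    b"event:device": b"living_room_tv",
})

# A toy aggregation: playback starts for one user on one day.
starts = sum(
    1
    for _row, data in table.scan(row_prefix=b"user123|2014-01-01")
    if data.get(b"event:type") == b"playback_start"
)
print(starts)
```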

Spotify

Spotify is a data-driven company, and Hadoop is an integral part of running it: the platform is used for content generation, data aggregation, reporting and analysis.

To optimise its chained jobs, Spotify wrote its own scheduler, Luigi, which runs more than 7,500 Hadoop jobs a day on a 690-node cluster.
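
Luigi is open source and written in Python, so a minimal task chain looks roughly like the sketch below; the task names, file paths and logic are hypothetical rather than Spotify’s real jobs.

```python
import luigi

class ExtractPlays(luigi.Task):
    """Hypothetical first step: dump one day of play events to a file."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"plays_{self.date}.tsv")

    def run(self):
        with self.output().open("w") as out:
            out.write("user1\ttrack42\n")  # stand-in for real log extraction

class CountPlays(luigi.Task):
    """Second step: Luigi only runs this once ExtractPlays' output exists."""
    date = luigi.DateParameter()

    def requires(self):
        return ExtractPlays(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"play_counts_{self.date}.tsv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as out:
            out.write(f"total\t{sum(1 for _ in src)}\n")

if __name__ == "__main__":
    luigi.run()
```

Running `python pipeline.py CountPlays --date 2014-01-01 --local-scheduler` builds the dependency graph, skips any task whose output already exists, and executes the rest in order.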

For small scripts it also uses Pig, the Yahoo!-developed dataflow language, which is faster than streaming.

Hadoop has helped Spotify leverage its data to improve the service, powering features such as radio recommendations and top lists.

Yahoo!

Yahoo! is a big user of Hadoop and has even developed its own tools to help its services run more efficiently. Its self-developed ZooKeeper coordinates distributed systems, while Pig, its high-level language for parallel data analysis, runs over 60% of its Hadoop jobs.
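
To give a flavour of the coordination primitives ZooKeeper provides, here is a small sketch using the kazoo Python client; the host address and znode paths are hypothetical and unrelated to Yahoo!’s deployment.

```python
from kazoo.client import KazooClient

# Connect to an assumed ZooKeeper ensemble.
zk = KazooClient(hosts="zk-host:2181")
zk.start()

# Register this process as an ephemeral, sequential node: the entry
# vanishes automatically if the process dies, so peers can detect failures.
zk.ensure_path("/workers")
zk.create("/workers/worker-", b"idle", ephemeral=True, sequence=True)

# Use a distributed lock so only one process runs a critical section at a time.
lock = zk.Lock("/locks/index-rebuild", "worker-1")
with lock:
    print("rebuilding index while holding the lock")

zk.stop()
```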

Its biggest cluster is a gargantuan 4,500 nodes, used to support research for Ad Systems and Web Search.

According to Yahoo!’s Hadoop blog, it has set a new Gray sort record (the sort rate achieved while sorting at least 100 terabytes of data): Yahoo!’s team nearly doubled the previous record of 0.725TB per minute, reaching a sort rate of 1.42TB per minute.