As adoption of Hadoop, the open source framework for storing and processing massive datasets, gathers pace among enterprises, there remain substantial challenges that its users need to overcome.
These problem areas have emerged as the technology undergoes an intense period of development that shows little sign of slowing down. It is not long since Hadoop was thought useful only for batch processing of data through MapReduce.
Now the advent of Hadoop 2.0 and the release of Apache YARN have led to a paradigm shift in processing. In tandem with Hadoop 2.0, YARN enables multiple data processing engines, such as interactive SQL, real-time streaming, search and batch processing, to run against data stored in a single platform.
Whether Hadoop developers like it or not, SQL is the de facto language of business users, as evidenced by the fact that more than 30 SQL-on-Hadoop solutions exist just to make Hadoop more usable. As these YARN-based projects gain traction and are packaged as part of Hadoop distributions, a decline in the use of MapReduce is to be expected as it is overtaken by more sophisticated solutions in the Hadoop ecosystem.
MapReduce may have facilitated much interesting work, but it carries substantial overheads, is hard to program and suffers from a degree of inflexibility, all of which makes its decline inevitable. MapReduce has several constraints (for example, writing each intermediate step to disk) that prevented it from moving beyond batch processing workloads. It was nevertheless an important component in building the momentum of Hadoop-based solutions, and it was only natural that it would give way to richer data processing frameworks.
New developments such as Apache Spark, by contrast, are more adaptable, bringing the control and capacity that permit streaming and in-memory computing.
Expectations of Hadoop are also changing with the huge volumes of structured and unstructured data being loaded into the commodity hardware clusters in which it sits. All kinds of enterprises are attracted by its low cost, making it inevitable that search-based workloads will increase, deploying solutions such as Apache Solr.
Take, for example, a company that wants to gauge sentiment among its customers by searching weblog feedback forms on its website for key words such as ‘unhappy’. There could be hundreds of thousands of forms. A search tool can quickly narrow the log files under scrutiny in preparation for further analytics and modelling to discover what is behind the dissatisfaction of such customers.
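The pre-filtering step described above can be sketched in a few lines of Python. This is purely illustrative: the feedback entries and the keyword are hypothetical, and a production deployment would use a search engine such as Solr rather than an in-memory list.

```python
# Minimal sketch: pre-filter customer feedback for a sentiment keyword
# before heavier analytics. Entries and keyword are hypothetical.

def filter_feedback(lines, keyword="unhappy"):
    """Return only the feedback entries mentioning the keyword."""
    return [line for line in lines if keyword in line.lower()]

feedback = [
    "Order 1001: delivery was fast, thanks!",
    "Order 1002: I am unhappy with the packaging",
    "Order 1003: Unhappy about the late refund",
]

flagged = filter_feedback(feedback)
print(len(flagged))  # 2
```

The point is the shape of the workflow, not the matching logic: a cheap keyword filter reduces hundreds of thousands of forms to a small set worth deeper modelling.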
As the Hadoop ecosystem continues to evolve in this way, enterprises wanting to keep pace with the technology will also need to face up to three more challenges, the first of which is security.
In the early days of Hadoop, security was not a main concern as most of the data was public (web pages downloaded by web-crawlers) and use of the technology did not even require a password, as the use cases did not demand it.
Now, however, in an era when even banking data is stored in Hadoop, the stringent requirements of privacy laws and audit controls assume huge importance. The introduction of Apache Knox and Apache Ranger represents two big steps towards addressing this urgent problem: Knox acts as a gatekeeper, sitting outside the cluster nodes, while Ranger aims to provide central security administration for the cluster. Cloudera has introduced Sentry, while MapR has also added security capabilities to address the growing concerns of enterprises. Kerberos support (MapR Tickets in the case of MapR) was added to Hadoop a few years ago to provide authentication.
Despite these advances, each of the projects on top of Hadoop still has its own security considerations, and the difficulty will not be solved by a single magic bullet. Defining security in Apache Hive – the relational project that sits on top of Hadoop – does not mean that data stored in HBase is secure.
One solution is the ‘man in the middle’ approach: place Hadoop behind a physical firewall so that access to the cluster is strictly controlled through a database or an application. Once the firewall is in place, the database or application connects to the Hadoop cluster, and security provisioning is then managed through that database or application itself.
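The gateway pattern above can be sketched as follows. Everything here is a hypothetical stand-in: the user entitlements, the dataset names and the query function merely show the idea that clients never reach the cluster directly, and that the application tier checks entitlements first.

```python
# Sketch of the gateway pattern: the application in front of the firewall
# enforces entitlements before any query reaches the cluster.
# Users, datasets and the query stub are hypothetical.

AUTHORISED_USERS = {
    "analyst1": {"sales_db"},
    "auditor": {"sales_db", "hr_db"},
}

def run_query(user, dataset, query):
    """Refuse the request unless the user is entitled to the dataset."""
    if dataset not in AUTHORISED_USERS.get(user, set()):
        raise PermissionError(f"{user} may not access {dataset}")
    # Stand-in for forwarding the query to the Hadoop cluster.
    return f"results of {query!r} on {dataset}"
```

The design choice is that authorisation lives in one place, the application tier, rather than being re-implemented in every project that sits on top of Hadoop.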
More work needs to be done on developing best practices for security implementation. The current state of play is not perfect, but these steps together offer the soundest defence against security threats and the associated risks.
High Availability and Disaster Recovery
As Hadoop transitions from sandbox to production, customers are looking for additional enterprise-ready features such as high availability and disaster recovery.
For back-up, it is possible to use solutions that copy data between clusters so that they work in parallel, one active and one passive. However, this method still requires manual intervention when the system goes down, and the passive cluster can suffer from neglect: because it is not in day-to-day use, it may not be kept up to date.
It can also be costly: an enterprise pays for round-the-clock cover against an event that may occur on fewer than ten days a year. In response, innovators are developing dual-active solutions for Hadoop in which data is replicated between systems in real time, so there is no passive cluster. However, unlike similar set-ups in the database world, there is no load-balancing to decide which server is free and so optimise use more directly. This remains a gap in Hadoop, and we expect new solutions to be built to plug it.
Data Management and Governance
The third challenge and one that represents a big area of growth is in instituting improvements in data management and governance, for which there are no out-of-the-box solutions.
Data governance is key for enterprises, in particular financial institutions, which are tightly governed by regulations. It is critical that complete data lineage and transformation workflows are available within the Hadoop system for the purposes of backward traceability.
So-called schema-less information management means storing raw data – and then "late-binding" one of many different schemas, or interpretations, to the data at query run time, rather than applying a single interpretation to the data at load time.
This approach is a powerful way of dealing with data whose structure evolves rapidly. If you are capturing raw device log data today, you are probably not modelling that data – or at least not all of it – relationally. If you did, you would risk having to revisit the target data model and the associated ETL (Extract, Transform and Load) processes every time a firmware upgrade enabled the device to capture new attributes about itself and its environment.
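Late binding can be illustrated with a small Python sketch. The device records and field names are hypothetical; the point is that the raw data is stored untouched, and a schema is bound only at read time, so a firmware upgrade that adds a field breaks nothing.

```python
import json

# Sketch of "late binding" / schema-on-read: raw logs are stored as-is,
# and a schema (here, a field projection with defaults) is applied only
# at query time. Records and field names are hypothetical.

raw_events = [
    '{"device": "d1", "temp": 21.5}',
    '{"device": "d2", "temp": 19.0, "humidity": 40}',  # newer firmware adds a field
]

def read_with_schema(raw, schema):
    """Bind a schema to raw records at read time; missing fields get defaults."""
    records = []
    for line in raw:
        rec = json.loads(line)
        records.append({field: rec.get(field, default)
                        for field, default in schema.items()})
    return records

# Two different interpretations of the same raw data:
v1 = read_with_schema(raw_events, {"device": None, "temp": None})
v2 = read_with_schema(raw_events, {"device": None, "temp": None, "humidity": None})
```

Contrast this with schema-on-load, where the second record would have forced a change to the target model and the ETL pipeline before it could be stored at all.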
Yet in Information Technology there is always a "but". Engineering is about trade-off and compromise, and the flip-side of increased flexibility is, well, increased flexibility. Giving users multiple ways of interpreting data typically also gives them several ways of interpreting it incorrectly. Assuring data quality is much more complex without a pre-defined schema, and a well-designed schema-on-load implementation allows access paths to be optimised so that more users can make more use of valuable data.
The concept of schema-on-read is very powerful in Hadoop, but it can be risky if every business user is allowed to apply their own business rules to a common set of business questions without any accountability. It is crucial that a company’s data management strategy clearly defines the business glossary that all staff conducting analytics must employ. Otherwise the subject of discussion at meetings will be the methodology rather than the results.
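One way to make such a glossary concrete is to keep the rule definitions in a single shared module, so every analyst computes a given term the same way. A minimal sketch, with entirely hypothetical rules and fields:

```python
# Sketch of a shared business glossary: one canonical definition per term,
# used by every analyst. Rule names and customer fields are hypothetical.

BUSINESS_RULES = {
    "active_customer": lambda c: c["orders_last_90d"] > 0,
    "high_value": lambda c: c["lifetime_spend"] >= 10_000,
}

def classify(customer, rule_name):
    """Apply a glossary rule by name, so definitions cannot drift per user."""
    return BUSINESS_RULES[rule_name](customer)

customer = {"orders_last_90d": 3, "lifetime_spend": 500}
```

Whether the glossary lives in code, in a metadata catalogue or in documentation matters less than the fact that there is exactly one definition per term.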
When it comes to governance, the lack of ready-made solutions means that professional services will increasingly be called on to provide workarounds for those using Hadoop.
Any enterprise seeking to fully embrace Hadoop must be aware of the challenges that come with it as well as benefits in lower costs and accessibility. The key to keeping those costs down, improving efficiency and maintaining data integrity will be an approach that first subjects all the tools and solutions available to very rigorous examination.
By Fawad A. Qureshi, Principal Consultant, Big Data, Teradata International