Developer Talk: Why Hortonworks?
When it comes to Hadoop distributions, developers have a multitude of choices, including Cloudera, Hortonworks, MapR, and others. At DesignMind, we have certifications in several of these, and work with our clients to develop solutions to meet their unique needs.
Today, let’s talk about five distinguishing features of Hortonworks.
1. Tez
For Hive and Pig jobs, Hortonworks uses Tez rather than classic MapReduce. Built on top of YARN, Tez expresses a job as a Directed Acyclic Graph (DAG), with graph vertices representing application logic and edges representing data flow, letting users express complex query logic intuitively. What was once a series of chained MapReduce jobs is now a single Tez job. Tez also supports dynamically changing graphs: a DAG created at compile time can be re-optimized when new information becomes available at runtime, allowing the dataflow graph to be adjusted for better performance and resource usage. Tez is vital to Hortonworks’ Stinger and Stinger.next initiatives, which aim to deliver sub-second query responses and bring ACID transactions and SQL:2011 analytics to Hive.
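To make this concrete, switching a Hive session between engines is a one-line setting (on HDP, Tez is already the default; the `web_logs` table below is hypothetical, for illustration only):

```sql
-- Run this Hive session on Tez instead of classic MapReduce.
SET hive.execution.engine=tez;

-- A multi-stage query like this, which would otherwise compile to
-- several chained MapReduce jobs, runs as a single Tez DAG.
SELECT country, COUNT(*) AS hits
FROM web_logs
GROUP BY country
ORDER BY hits DESC;
```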
2. Falcon
Growing business means growing data, and growing data means complex tracking. Falcon provides data lifecycle tracking and management: replication for disaster recovery, cleansing and preparation of data for BI tools, removal of data when it is no longer needed, and data auditing. By offering these services in a centralized framework, Falcon replaces the complex coding otherwise required with simple configuration, which can be specified in XML, JSON, Java, or via JAXB. Once configured, Falcon manages the data flow, disaster recovery, and data retention workflows.
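To give a flavor of this configuration-over-code approach, here is a minimal sketch of a Falcon feed entity in XML. All names, paths, and cluster references are hypothetical, and the full schema has more required fields; see the Falcon entity specification for details:

```xml
<!-- Hypothetical Falcon feed: retain 90 days of click data on the
     primary cluster and replicate it to a backup cluster for DR. -->
<feed name="clicksFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <cluster name="backupCluster" type="target">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="etl" group="analytics" permission="0755"/>
</feed>
```

Once this feed is submitted and scheduled, Falcon drives the retention and replication workflows without any hand-written Oozie or MapReduce code.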
3. Slider
Today, organizations want to do more than store and query data with SQL. They need to apply machine learning and handle streaming data, all while performing batch processing on a single cluster. To make better use of cluster resources, Slider deploys these long-running applications (so named because they must keep running to process data whenever it arrives) onto YARN and scales them up or down depending on usage. Slider views an application as a set of components, each with its own configuration, scripts, and so on; these components are managed (scaled, started, stopped) by Slider through the YARN application master.
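As a sketch of how Slider describes an application's components, a resources.json file might look roughly like this (the component name and resource values are illustrative):

```json
{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {},
  "components": {
    "HBASE_REGIONSERVER": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "4",
      "yarn.memory": "1024",
      "yarn.vcores": "1"
    }
  }
}
```

The running application can then be resized from the command line, for example with `slider flex <app-name> --component HBASE_REGIONSERVER 6` to grow from four region servers to six.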
4. Ranger and Knox
Ranger is a security framework for managing access control across the Hadoop ecosystem (Hive, HBase, Storm, etc.). You can define and enforce security policies for users or groups by restricting access to files, databases, tables, or columns, and Ranger also provides auditing, policy analytics, data encryption, and centralized security administration. Knox, meanwhile, provides a secured endpoint to Hadoop clusters: it encapsulates Kerberos security and integrates with your SSO system for secure, extensible access to the cluster. While Ranger secures access to files and tables, Knox secures the systems providing those services.
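To make the Ranger side concrete, a column-level Hive policy submitted to Ranger's policy REST API might look roughly like the JSON below. All names here are hypothetical and the exact schema varies by Ranger version, so treat this as a sketch rather than a copy-paste example:

```json
{
  "policyName": "analysts-read-customers",
  "repositoryType": "hive",
  "databases": "sales",
  "tables": "customers",
  "columns": "id,region",
  "permMapList": [
    { "groupList": ["analysts"], "permList": ["select"] }
  ],
  "isEnabled": true,
  "isAuditEnabled": true
}
```

The effect is that members of the analysts group can SELECT only the id and region columns of sales.customers, and every access is audited.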
5. Ambari
Ambari, in some Indian languages, is the seat atop an elephant from which the rider controls it. Apache Ambari likewise gives users a UI to provision and manage a Hadoop cluster (the elephant) and its components. Administrators can view cluster health and statistics, start and stop services, and manage security policies; developers can use it to check the status of running tasks, write Hive or Pig jobs, and manage files in HDFS. It is particularly useful for analysts who do not (need to) know Hadoop: they can simply log in, run their SQL queries as usual, and generate reports.
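Beyond the UI, Ambari also exposes a REST API, so provisioning can be scripted with a cluster blueprint. A minimal sketch (host group names and the component layout are illustrative) looks like:

```json
{
  "Blueprints": {
    "blueprint_name": "small-cluster",
    "stack_name": "HDP",
    "stack_version": "2.2"
  },
  "host_groups": [
    {
      "name": "master",
      "cardinality": "1",
      "components": [
        { "name": "NAMENODE" },
        { "name": "RESOURCEMANAGER" },
        { "name": "HIVE_SERVER" }
      ]
    },
    {
      "name": "workers",
      "cardinality": "3",
      "components": [
        { "name": "DATANODE" },
        { "name": "NODEMANAGER" }
      ]
    }
  ]
}
```

Posting a blueprint like this to the Ambari server, together with a host mapping, lets you stand up an identically configured cluster repeatably instead of clicking through the install wizard.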
If you’d like to learn more about Hortonworks, check out a recent blog post by my colleague Mike Wilcox, Hortonworks for Hadoop – The Open Source Leader.
At DesignMind, our senior team of multi-vendor certified consultants can build a new environment or optimize your existing one with Hortonworks or another distribution to create a perfect fit for your organization.
Akshay Iyengar, Big Data Consultant at DesignMind, is a Hortonworks Data Platform Certified Developer (HDPCD).