Data management in the Hadoop ecosystem is still in the early stages of development. Progress toward cheaper and more effective ways of collecting, storing, processing and distributing structured and unstructured data (from both internal and external sources) has been impeded by complexity, a shortage of qualified professionals and the difficulty of managing data.
Data movement and management in Hadoop is challenging. It includes data motion, process orchestration, lifecycle management and data discovery. The trick to simplifying data management in Hadoop is to process data in a decentralized fashion by pushing complexity into the platform, enabling data engineers to focus on the processing and business logic.
Apache Falcon is an open source data processing and management solution for the Hadoop ecosystem. It simplifies the management of data by enabling users to define infrastructure endpoints (e.g., clusters, HBase, databases, HCatalog), logical tables/feed/datasets (e.g., location, permissions, source, retention limits, replication targets) and processing rules (e.g., inputs, outputs, schedule, business logic) as configurations.
Apache Falcon addresses each of these challenges: data motion, process orchestration, lifecycle management and data discovery.
Falcon allows users to on-board data sets with a complete understanding of how, when and where their data is managed across its lifecycle. It uses Apache Oozie to coordinate workflows, and workflow templates handle the routine data management tasks. Falcon also provides open APIs that let those workflows be orchestrated more broadly, allowing integration with data warehouse systems (e.g., orchestrating data lifecycle workflows within Hadoop as well as with a Teradata system).
The Berkeley Data Analytics Stack (BDAS) is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos.
One flaw of Hadoop MapReduce is high latency. Given the growing volume, variety and velocity of data, organizations and data scientists require faster analytical platforms. Put simply, speed kills, and Spark gains speed by caching data in memory and optimizing master/worker communication.
Spark is an open source cluster computing system that makes data analytics fast. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce. Download for a test drive: http://spark-project.org/downloads
Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100 times, owing to its ability to perform computations in memory. It is a computation engine built on top of the Hadoop Distributed File System (HDFS) that efficiently supports iterative processing (e.g., ML algorithms) and interactive queries. Spark provides an easy-to-program interface available in Java, Python, and Scala. Spark Streaming is a component of Spark that provides highly scalable, fault-tolerant stream processing. With this functionality, Spark provides integrated support for all major computation models: batch, interactive, and streaming.
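To make the caching model concrete, here is a minimal sketch assuming the SparkR frontend and a hypothetical events.json file (the text itself only names the Java, Python and Scala interfaces):

    library(SparkR)
    sparkR.session(master = "local[*]")                 # start a local Spark session

    events <- read.df("events.json", source = "json")   # load the data once from disk
    cache(events)                                       # pin the dataset in cluster memory

    count(events)   # first query reads from disk and fills the cache
    count(events)   # repeated queries now run against memory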
Shark is a large-scale data warehouse system that runs on top of Spark and is backward-compatible with Apache Hive, allowing users to run unmodified Hive queries on existing Hive warehouses. A port of Apache Hive onto Spark, Shark can answer HiveQL queries up to 100 times faster than Hive when the data fits in memory, and 5-10 times faster when the data is stored on disk, all without modification to the data or queries. Shark is open source as part of BDAS.
By using Spark as the execution engine and employing novel and traditional database techniques, the Shark data analysis warehouse system bridges the gap between MapReduce and MPP databases.
Take Shark for a drive on EC2 (takes 10 minutes to spin up a cluster): http://shark.cs.berkeley.edu
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.
Data quality management includes the continuous analysis, observation, and improvement of all information in the organization. Best practice includes:
1. Data quality assessment, a way for the practitioner to understand the scope of how poor data quality affects the way business processes are intended to run, and to develop a business case for data quality management;
2. Data quality measurement, in which data quality analysts synthesize the results of the assessment and concentrate on the data elements deemed critical based on the selected business users' needs. This leads to the definition of performance metrics that feed management reporting via data quality scorecards (a small measurement sketch in R follows this list);
3. Integrating data quality into the application infrastructure, by way of integrating data requirements analysis across the organization and by engineering data quality into the system development life cycle;
4. Operational data quality improvement, where data stewardship procedures are used to manage identified data quality rules and conformance to acceptability thresholds;
5. Data quality incident management, which allows the data quality analysts to review the degree to which the data does or does not meet the levels of acceptability; report, log, and track issues; and document the processes for remediation and improvement.
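As promised above, a small sketch of step 2 in R: two common quality metrics, completeness and validity, computed over a hypothetical customers table (the data frame, its columns and the threshold are all assumptions):

    # hypothetical customer records with some dirty values
    customers <- data.frame(
      email = c("a@x.com", NA, "b@y.com", ""),
      zip   = c("27513", "2751", "27606", NA),
      stringsAsFactors = FALSE
    )

    completeness <- mean(!is.na(customers$email) & customers$email != "")  # populated emails
    validity     <- mean(grepl("^[0-9]{5}$", customers$zip))               # well-formed ZIPs

    # compare against an acceptability threshold for the scorecard
    c(completeness = completeness, validity = validity) >= 0.75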
Copyright © SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Used with permission.
Optimization answers the questions: How do we do things better? What is the best decision for a complex problem?
Example: Given business priorities, resource constraints and available technology, determine the best way to optimize your IT platform to satisfy the needs of every user.
Optimization supports innovation. It takes your resources and needs into consideration and helps you find the best possible way to accomplish your goals.
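As a toy illustration of this kind of decision problem, the sketch below uses R's lpSolve package (an assumption, and the coefficients are made up) to allocate an IT budget between servers and storage under two resource constraints:

    library(lpSolve)

    # maximize a user-satisfaction score: 8 points per server, 5 per storage unit
    objective <- c(8, 5)

    # budget: 3x + 2y <= 12; rack space: x + 2y <= 6
    const.mat <- matrix(c(3, 2,
                          1, 2), nrow = 2, byrow = TRUE)
    const.dir <- c("<=", "<=")
    const.rhs <- c(12, 6)

    sol <- lp("max", objective, const.mat, const.dir, const.rhs)
    sol$solution   # optimal mix of servers and storage units
    sol$objval     # best achievable satisfaction score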
Predictive Modeling answers the questions: What will happen next? How will it affect my business?
Example: Hotels and casinos can predict which VIP customers will be more interested in particular vacation packages.
If you have 10 million customers and want to do a marketing campaign, who's most likely to respond? How do you segment that group? And how do you determine who's most likely to leave your organization? Predictive modeling provides the answers.
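A minimal sketch of that scoring exercise in R, fitting a logistic regression to a simulated campaign history; every column name here is an assumption:

    # simulated historical campaign data: who responded last time?
    campaign <- data.frame(
      responded   = rbinom(1000, 1, 0.2),
      age         = sample(18:80, 1000, replace = TRUE),
      past_visits = rpois(1000, 3)
    )

    model <- glm(responded ~ age + past_visits, data = campaign, family = binomial)

    # score every customer and target the most likely responders first
    campaign$score <- predict(model, type = "response")
    head(campaign[order(-campaign$score), ])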
Forecasting answers the questions: What if these trends continue? How much is needed? When will it be needed?
Example: Retailers can predict how demand for individual products will vary from store to store.
Forecasting is one of the hottest markets – and hottest analytical applications – right now. It applies everywhere. In particular, forecasting demand helps supply just enough inventory, so you don’t run out or have too much.
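A minimal R sketch, assuming the forecast package and using the built-in AirPassengers series as a stand-in for store-level demand:

    library(forecast)

    fit <- auto.arima(AirPassengers)   # fit a seasonal ARIMA model to monthly demand
    fc  <- forecast(fit, h = 12)       # project demand 12 months ahead
    plot(fc)                           # point forecast plus prediction intervals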
Statistical Analysis answers the questions: Why is this happening? What opportunities am I missing?
Example: Banks can discover why an increasing number of customers are refinancing their homes.
Here we can run more complex analytics, like frequency models and regression analysis. Using the stored data, we can look at why things are happening and begin to answer questions based on that data.
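For instance, a frequency table and a regression take only a few lines of R; here the built-in mtcars data stands in for the stored business data:

    # frequency model: how often does each cylinder/gear combination occur?
    xtabs(~ cyl + gear, data = mtcars)

    # regression: which factors drive fuel economy, and by how much?
    fit <- lm(mpg ~ wt + hp, data = mtcars)
    summary(fit)   # coefficients, significance and fit diagnostics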
Alerts answer the questions: When should I react? What actions are needed now?
Example: Sales executives receive alerts when sales targets are falling behind.
With alerts, you can learn when you have a problem and be notified when something similar happens again in the future. Alerts can appear via e-mail, RSS feeds or as red dials on a scorecard or dashboard.
Query Drilldown (or OLAP) answers the questions: Where exactly is the problem? How do I find the answers?
Example: Sort and explore data about different types of cell phone users and their calling behaviors.
Query drilldown allows for a little bit of discovery. OLAP lets you manipulate the data yourself to find out how many, what color and where.
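In R terms, a drilldown is a grouped aggregation refined step by step; the call-detail data frame below is entirely hypothetical:

    # simulated call-detail records
    cdr <- data.frame(
      user_type = sample(c("prepaid", "contract"), 500, replace = TRUE),
      region    = sample(c("east", "west"), 500, replace = TRUE),
      minutes   = rexp(500, rate = 1/30)
    )

    # roll up: total minutes by user type ...
    aggregate(minutes ~ user_type, data = cdr, FUN = sum)

    # ... then drill down: the same measure split further by region
    aggregate(minutes ~ user_type + region, data = cdr, FUN = sum)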
Ad Hoc Reports answer the questions: How many? How often? Where?
Example: Custom reports that describe the number of hospital patients for every diagnosis code for each day of the week.
At their best, ad hoc reports let you ask the questions and request a couple of custom reports to find the answers.
Standard Reports answer the questions: What happened? When did it happen?
Example: Monthly or quarterly financial reports.
We all know about these. They're generated on a regular basis and describe just "what happened" in a particular area. They're useful to some extent, but not for making long-term decisions.
Managing data is challenging. Many efforts result in siloed information and fragmented views that damage competitiveness and increase costs. In the modern era of "big data," the best practice may be to create one central data repository with a uniform data governance architecture, yet allow each business unit to own its data.
The goal is to provide simple ways for both data scientists and non-technical users to explore, visualize and interpret data to reveal patterns, anomalies, key variables and potential relationships. Data Governance and Master Data Management (MDM) design is key to achieving this goal.
Master data management (MDM) comprises a set of processes and tools that define and manage an organization's master data. MDM lies at the core of many organizations' operations, and the quality of that data shapes decision making. MDM helps organizations leverage trusted business information, helping to increase profitability and reduce risk.
Master data is reference data about an organization's core business entities. These entities include people (customers, employees, suppliers), things (products, assets, ledgers), and places (countries, cities, locations). The applications and technologies used to create and maintain master data are part of a master data management (MDM) system.
Recent developments in business intelligence (BI) aid in regulatory compliance and provide more usable, higher-quality data for smarter decision making and spending. Virtual master data management (Virtual MDM) utilizes data virtualization and a persistent metadata server to implement a multi-level automated MDM hierarchy.
MDM benefits include:
● Improving business agility
● Providing a single trusted view of people, processes and applications
● Allowing strategic decision making
● Enhancing customer relationships
● Reducing operational costs
● Increasing compliance with regulatory requirements
MDM helps organizations handle four key issues:
● Data redundancy
● Data inconsistency
● Business inefficiency
● Supporting business change
MDM provides processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information. MDM seeks to ensure that an organization does not use multiple (potentially inconsistent) versions of the same master data in different parts of its operations, and it solves issues with data quality, consistent classification and identification of data, and data reconciliation.
MDM solutions include source identification, data collection, data transformation, normalization, rule administration, error detection and correction, data consolidation, data storage, data distribution, and data governance.
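As a tiny illustration of the matching-and-consolidation step, the R sketch below collapses near-duplicate customer records onto a single golden record; the normalization rule and the data are assumptions, and real MDM tools use far richer matching logic:

    # the same customer captured differently by two systems of entry
    customers <- data.frame(
      name = c("ACME Corp.", "Acme Corp", "Widgets Ltd"),
      city = c("Raleigh", "Raleigh", "Durham"),
      stringsAsFactors = FALSE
    )

    # crude match key: lower-case letters only
    customers$key <- gsub("[^a-z]", "", tolower(customers$name))

    # survivorship rule: keep the first record seen for each key
    golden <- customers[!duplicated(customers$key), ]
    golden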
MDM tools include data networks, file systems, a data warehouse, data marts, an operational data store, data mining, data analysis, data virtualization, data federation and data visualization.
MDM requires an organization to implement policies and procedures for controlling how master data is created and maintained.
One of the main objectives of an MDM system is to publish an integrated, accurate, and consistent set of master data for use by other applications and users. This integrated set of master data is called the master data system of record (SOR). The SOR is the gold copy for any given piece of master data, and is the single place in an organization where the master data is guaranteed to be accurate and up to date.
Although an MDM system publishes the master data SOR for use by the rest of the IT environment, it is not necessarily the system where master data is created and maintained. The system responsible for maintaining any given piece of master data is called the system of entry (SOE). In most organizations today, master data is maintained by many different applications.
Customer data is an example. A company may, for example, have customer master data that is maintained by multiple Web store fronts, by the retail organization, and by the shipping and billing systems. Creating a single SOR for customer data in such an environment is a complex task.
The long-term goal of an enterprise MDM environment is to solve this problem by creating an MDM system that is not only the SOR for any given type of master data, but the SOE as well.
MDM, then, can be defined as a set of policies, procedures, applications and technologies for harmonizing and managing the system of record and systems of entry for the data and metadata associated with the key business entities of an organization.
Hadoop, whose MapReduce model turns code into map and reduce jobs that Hadoop then runs, is the best-known "Big Data" technology because it allows an organization to store huge quantities of data at very low cost. R is a programming language and software environment for statistical computing and graphics. Put the two together to provide easy-to-use R interfaces to the distributed Hadoop environment and you have one king-hell data crunching tool for serious data analytics.
RHadoop is a small, open source collection of packages developed by Revolution Analytics that binds R to Hadoop and allows MapReduce algorithms to be expressed in R, giving data scientists access to Hadoop's scalability from their favorite language. It lets users write general MapReduce programs with the full power and ecosystem of an existing, established programming language.
RHadoop comprises three packages:
● rmr, which expresses MapReduce jobs as R functions
● rhdfs, which gives R programs access to files in HDFS
● rhbase, which gives R programs access to HBase tables
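A minimal sketch of the rmr package in action, the canonical squares example from the RHadoop tutorials; it assumes a working Hadoop installation and uses rmr2, the package's current incarnation:

    library(rmr2)

    small.ints <- to.dfs(1:1000)    # push an R vector into HDFS

    # map each value v to the key/value pair (v, v^2); no reduce step is needed
    squares <- mapreduce(
      input = small.ints,
      map   = function(k, v) keyval(v, v^2)
    )

    head(as.data.frame(from.dfs(squares)))   # pull the results back into R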
There are two basic types of graph engines:
(1) Graph databases providing real-time, traversal-based algorithms over linked-list graphs represented on a single-server (vendors include Neo4j, OrientDB, DEX, and InfiniteGraph).
(2) Batch-processing engines that apply vertex-centric message passing to a graph partitioned across a cluster of machines (e.g., Hama, GoldenOrb, Giraph, and Pregel).
With Hadoop, instead of focusing on a particular vertex-centric, BSP-based graph-processing package such as Hama or Giraph, the results presented here are computed with plain Hadoop (HDFS + MapReduce). Moreover, instead of developing the MapReduce algorithms in Java, the R programming language is used.
When a graph is on the order of 100+ billion elements (vertices + edges), a single-server graph database will not be able to represent or process the graph; a multi-machine graph engine is required. While Hadoop is not a graph engine, a graph can be represented in its distributed HDFS file system and processed using its distributed MapReduce framework.
The graph generated previously is loaded up in R and a count of its vertices and edges is conducted. Next, the graph is represented as an edge list. An edge list (for a single-relational graph) is a list of pairs, where each pair is ordered and denotes the tail vertex id and the head vertex id of the edge. The edge list can be pushed to HDFS using RHadoop.
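A sketch of those steps, assuming the igraph and rmr2 packages, with a random graph standing in for the one generated previously:

    library(igraph)
    library(rmr2)

    g <- erdos.renyi.game(10000, 0.001)   # stand-in for the generated graph
    vcount(g)                             # count of vertices
    ecount(g)                             # count of edges

    # two-column matrix of (tail vertex id, head vertex id) pairs
    edge.list <- get.edgelist(g)

    # push the edge list into HDFS for MapReduce processing
    edges.dfs <- to.dfs(keyval(edge.list[, 1], edge.list[, 2]))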
Below is a brief introduction to Hadoop and R that describes how to program in the MapReduce framework and offers an alternative way to implement MapReduce programs, one that strikes a delicate compromise between power and usability.
R is an open source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software and doing data analysis. According to Rexer's Annual Data Miner Survey in 2010, R has become the data mining tool used by more data miners (43%) than any other. R is an implementation of the S language, which is often the vehicle of choice for research in statistical methodology, and R provides an open source route to participation in that activity.
R is emerging as a de facto standard for computational statistics and predictive analytics. R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as free software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
● an effective data handling and storage facility
● a suite of operators for calculations on arrays, in particular matrices
● a large, coherent, integrated collection of intermediate tools for data analysis
● graphical facilities for data analysis and display, either on-screen or on hardcopy
● a well-developed, simple and effective programming language with conditionals, loops, user-defined recursive functions and input/output facilities
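A few lines convey the flavor, using the built-in mtcars data:

    fit <- lm(mpg ~ wt, data = mtcars)   # linear model: fuel economy vs. weight
    summary(fit)                         # classical statistical tests on the fit

    plot(mpg ~ wt, data = mtcars,
         xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
    abline(fit)                          # add the fitted regression line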
The emerging "Data Stack" or "Data Layer" is in full transition and can be viewed and defined many different ways. The ability to capture, analyze and learn from data generated at unprecedented scale, combined with means to access that information, on demand, when relevant, creates business opportunities we are only just beginning to appreciate.
One way simply defines data in a three-layer stack: internal data, contextual data and an integrated data model. The top layer of the stack, internal data, is specific to an organization. The contextual layer comes from other sources. The integrated data model supports advanced data analytics applications.
Another, more complex view defines the big data stack in three tiers: a foundation of storage and compute, an analytics layer, and services and applications on top.
At the foundational layer of the big data stack, speed kills, along with scalable persistence and compute power. The middle layer of the big data stack is analytics, where features are extracted from data and fed into classification and prediction algorithms. At the top of the stack are services and applications. This is the level at which consumers experience a data product, whether it be a music recommendation or a traffic route prediction.
At the bottom data tier, free tools (MySQL, Postgres, Hadoop) and their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed, offering faster processing and query times. Several of these players are pushing up toward the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lie the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.
There are three data layer trends: data growth, web application user growth and the explosion of mobile computing.
Data growth [Big Data]. IDC estimates an organization's data will double every two years. Mining this raw data for valuable, actionable insights is challenging. Hadoop and related technologies (HDFS, MapReduce, Cassandra and Hive) are batch-processing oriented and assist in analyzing large data sets.
User growth [NoSQL]. Most new interactive software systems are accessed via browser. If available on the public Internet, these applications now have 2 billion potential users and a 24x7 uptime requirement. Regardless of dataset size, these software systems put unprecedented pressure on the data layer: massive user concurrency; need for predictable, low-latency random access to data to maintain a snappy interactive user experience; and the need for continuous operations, even during database maintenance. Couchbase and MongoDB are open source NoSQL technologies that meet the data management needs of interactive web applications.
Mobile computing growth [Mobile Sync]. Mobile devices are increasingly where we create and consume information. But data aggregation and processing will be accomplished in the cloud. IDC estimates that in 2015, 1.4 of the 4.9 zettabytes created that year will be "touched by the cloud." Delivering the right data to millions of mobile devices, when and where it is needed (and then getting it back again) is the mobile-cloud data sync challenge.
Together, these three trends may define the emerging modern data stack, one that supports the ebb and flow of information from web and mobile applications to the cloud.
The key is to design and build a data warehouse / business intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse datasets.
Data comes from a variety of sources (internal, external, contextual, integrated): data directly created by users of web and mobile applications, observations and metadata related to the use of those applications, external data feeds, and intermediate analysis results. Processing this information produces the data needed by user-facing applications, which is fed into a NoSQL solution.
The NoSQL solution provides low-latency, random access to the data, meeting the needs of web applications. It also allows a mobile synchronization server quick, random access to data needed by mobile users.
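As one concrete sketch of that access pattern (an assumption; the text names Couchbase and MongoDB but no client library), the mongolite package gives R a document-store interface to MongoDB:

    library(mongolite)

    # connect to a hypothetical collection of user profiles
    users <- mongo(collection = "profiles", db = "app",
                   url = "mongodb://localhost")

    users$insert(data.frame(name = "kim", device = "mobile"))  # random write
    users$find('{"name": "kim"}')                              # low-latency random read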
A Mobile Sync Server manages transient connections with mobile devices, delivering data to native mobile applications when and where it is needed, and receiving information in return.