Data management in the Hadoop ecosystem is still in the early stages of development. The goal of cheaper and more effective ways to collect, store, process and distribute structured and unstructured data, from both internal and external sources, has been impeded by complexity, a shortage of qualified professionals and the difficulty of managing the data itself.
Data movement and management in Hadoop is challenging. It spans data motion, process orchestration, lifecycle management and data discovery. The trick to simplifying data management in Hadoop is to process data in a decentralized fashion, pushing complexity into the platform so that data engineers can focus on processing and business logic.
Apache Falcon is an open source data processing and management solution for the Hadoop ecosystem. It simplifies the management of data by enabling users to define infrastructure endpoints (e.g., clusters, HBase, databases, HCatalog), logical tables/feed/datasets (e.g., location, permissions, source, retention limits, replication targets) and processing rules (e.g., inputs, outputs, schedule, business logic) as configurations.
Apache Falcon addresses these data management challenges.
Falcon allows users to on-board data sets with a complete understanding of how, when and where their data is managed across its lifecycle. It uses Apache Oozie to coordinate workflows, and workflow templates are used for data management. Falcon provides open APIs that enable those workflows to be orchestrated more broadly, allowing integration with data warehouse systems (e.g., orchestrating data lifecycle workflows within Hadoop as well as with a Teradata system).
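As an illustration of this configuration-driven approach, the sketch below builds a minimal feed (dataset) definition in Python. The element names loosely follow Falcon's feed entity XML format, but the feed, cluster and path names are hypothetical and the document is simplified; a real definition would be validated against Falcon's schema and submitted, for example, through Falcon's entity-submit command.

```python
# Illustrative sketch: building a simplified Apache Falcon feed definition.
# Element names loosely follow Falcon's feed entity XML; treat this as a
# schematic example, not a schema-complete document.
FEED_TEMPLATE = """<feed name="{name}" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="{cluster}" type="source">
      <validity start="{start}" end="{end}"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="{path}"/>
  </locations>
  <ACL owner="etl-user" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
"""

def write_feed(name, cluster, path, start, end, out_file="feed.xml"):
    """Render the feed entity and write it to disk for submission,
    e.g. via a command such as `falcon entity -type feed -submit -file feed.xml`."""
    with open(out_file, "w") as f:
        f.write(FEED_TEMPLATE.format(name=name, cluster=cluster,
                                     path=path, start=start, end=end))

write_feed("clicks-feed", "primary-cluster",
           "/data/clicks/${YEAR}-${MONTH}-${DAY}",
           "2014-01-01T00:00Z", "2016-01-01T00:00Z")
```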
Benefits of Data Virtualization
Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.
Consuming applications may include: business intelligence, analytics, CRM, enterprise resource planning, and more across both cloud computing platforms and on-premises.
Data Virtualization Benefits:
● Fast access to reliable information for decision makers
● Improved operational efficiency: flexible, agile integration through short-cycle creation of virtual data stores without touching the underlying sources
● Improved data quality due to a reduction in physical copies
● Improved usage through creation of subject-oriented, business-friendly data objects
● Increased revenues
● Lower costs
● Reduced risk
Data virtualization abstracts, transforms, federates and delivers data from a variety of sources, presenting a single access point to the consumer regardless of the physical location or nature of the various data sources.
Data virtualization is based on the abstraction of data contained within a variety of data sources (databases, applications, file repositories, websites, data services vendors, etc.) for the purpose of providing single-point access to that data. Its architecture is based on a shared semantic abstraction layer, as opposed to limited-visibility semantic metadata confined to a single data source.
Data Virtualization software is an enabling technology which provides the following capabilities:
• Abstraction – Abstract the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.
• Virtualized Data Access – Connect to different data sources and make them accessible from one logical place
• Transformation / Integration – Transform, improve quality, and integrate data based on need across multiple sources
• Data Federation – Combine result sets from across multiple source systems.
• Flexible Data Delivery – Publish result sets as views and/or data services, executed by consuming applications or users on request.
In delivering these capabilities, data virtualization also addresses requirements for data security, data quality, data governance, query optimization, caching, etc. Data virtualization software includes functions for development, operation and management.
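To make these capabilities concrete, the following sketch (illustrative only; the VirtualLayer class, source names and sample data are hypothetical) federates rows from two physical sources, an in-memory SQLite table and a CSV document, behind a single logical access point that hides the storage technology from the consumer.

```python
import csv
import io
import sqlite3

# Illustrative data virtualization sketch: two physical sources
# (a SQLite table and a CSV document) exposed through one logical view.

class VirtualLayer:
    def __init__(self):
        self._sources = []          # callables that yield rows

    def register(self, fetch_fn):
        self._sources.append(fetch_fn)

    def query(self, predicate=lambda row: True):
        """Federate results from every registered source."""
        for fetch in self._sources:
            for row in fetch():
                if predicate(row):
                    yield row

# Source 1: relational data in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme', 'EU')")

def from_sqlite():
    for cid, name, region in conn.execute("SELECT id, name, region FROM customers"):
        yield {"id": cid, "name": name, "region": region}

# Source 2: a flat-file (CSV) source.
CSV_DATA = "id,name,region\n2,Globex,US\n"

def from_csv():
    for row in csv.DictReader(io.StringIO(CSV_DATA)):
        yield {"id": int(row["id"]), "name": row["name"], "region": row["region"]}

layer = VirtualLayer()
layer.register(from_sqlite)
layer.register(from_csv)

# The consumer sees one logical 'customers' view, regardless of where rows live.
print(list(layer.query(lambda r: r["region"] in ("EU", "US"))))
```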
Managing data is challenging. Many efforts result in siloed information and fragmented views that damage competitiveness and increase costs. In the modern era of "big data" the best practice may be to create one central data repository with a uniform data governance architecture, yet allow each business unit to own its data.
The goal is to provide simple ways for both data scientists and non-technical users to explore, visualize and interpret data to reveal patterns, anomalies, key variables and potential relationships. Data Governance and Master Data Management (MDM) design is key to achieving this goal.
Master data management (MDM) comprises a set of processes and tools that defines and manages data. MDM lies at the core of many organizations’ operations, and the quality of that data shapes decision making. MDM helps leverage trusted business information—helping to increase profitability and reduce risk.
Master data is reference data about an organization’s core business entities. These entities include people
(customers, employees, suppliers), things (products, assets, ledgers), and places (countries, cities, locations). The
applications and technologies used to create and maintain master data are part of a master data management (MDM) system.
Recent developments in business intelligence (BI) aid in regulatory compliance and provide more usable and quality data for smarter decision making and spending. Virtual master data management (Virtual MDM) utilizes data
virtualization and a persistent metadata server to implement a multi-level, automated MDM hierarchy. Benefits of MDM include:
● Improving business agility
● Providing a single trusted view of people, processes and applications
● Allowing strategic decision making
● Enhancing customer relationships
● Reducing operational costs
● Increasing compliance with regulatory requirements
MDM helps organizations handle four key issues:
● Data redundancy
● Data inconsistency
● Business inefficiency
● Supporting business change
MDM provides processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and
distributing data throughout an organization to ensure consistency and control in the ongoing maintenance and
application use of this information. MDM seeks to ensure that an organization does not use multiple (potentially
inconsistent) versions of the same master data in different parts of its operations, and it addresses data quality, consistent classification and identification of data, and data-reconciliation issues.
MDM solutions include source identification, data collection, data transformation, normalization, rule administration,
error detection and correction, data consolidation, data storage, data distribution, and data governance.
MDM tools include data networks, file systems, a data warehouse, data marts, an operational data store, data mining, data analysis, data virtualization, data federation and data visualization.
MDM requires an organization to implement policies and procedures for controlling how master data is created and maintained.
One of the main objectives of an MDM system is to publish an integrated, accurate, and consistent set of master data for use by other applications and users. This integrated set of master data is called the master data system of record (SOR). The SOR is the gold copy for any given piece of master data, and is the single place in an organization that the master data is guaranteed to be accurate and up to date.
Although an MDM system publishes the master data SOR for use by the rest of the IT environment, it is not
necessarily the system where master data is created and maintained. The system responsible for maintaining any given
piece of master data is called the system of entry (SOE). In most organizations today, master data is maintained by many different systems of entry, rather than by the MDM system itself.
Customer data is an example. A company may, for example, have customer master data that is maintained by multiple Web store fronts, by the retail organization, and by the shipping and billing systems. Creating a single SOR for customer data in such an environment is a complex task.
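As a minimal, hypothetical illustration of this consolidation problem: below, three systems of entry each hold their own copy of the same customer, and a toy match-and-survivorship rule produces one golden record for the system of record. Real MDM matching and survivorship are far more involved (fuzzy matching, stewardship workflows, lineage); this only sketches the idea.

```python
from datetime import date

# Hypothetical systems of entry (SOEs), each holding its own copy of a customer.
web_store = {"email": "a.smith@example.com", "name": "A. Smith",
             "phone": None, "updated": date(2015, 3, 1)}
retail    = {"email": "a.smith@example.com", "name": "Alice Smith",
             "phone": "555-0100", "updated": date(2015, 6, 12)}
billing   = {"email": "a.smith@example.com", "name": "Alice Smith",
             "phone": None, "updated": date(2014, 11, 30)}

def consolidate(records):
    """Toy survivorship rule: records already matched on email; keep the most
    recently updated non-null value for every attribute (the 'golden record')."""
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value is not None:
                golden[field] = value      # later updates overwrite earlier ones
    return golden

# The consolidated record becomes the master data system of record (SOR).
system_of_record = consolidate([web_store, retail, billing])
print(system_of_record)
```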
The long-term goal of an enterprise MDM environment is to solve this problem by creating an MDM system that is not only the SOR for any given type of master data, but also its SOE.
MDM then can be defined as a set of policies, procedures, applications and technologies for harmonizing and
managing the system of record and systems of entry for the data and metadata associated with the key business
entities of an organization.
The emerging "Data Stack" or "Data Layer" is in full transition and can be viewed and defined many different ways. The ability to capture, analyze and learn from data generated at unprecedented scale, combined with means to access that information, on demand, when relevant, creates business opportunities we are only just beginning to appreciate.
One simple view defines the data stack as three layers:
● Internal data (the top layer of the stack), which is specific to an organization
● Contextual data, which comes from other sources
● An integrated data model, which supports advanced data analytics applications
A more complex view, represented in the image above, splits the stack into three tiers: data, analytics, and services.
The foundational layer of the big data stack provides scalable persistence and compute power; at this layer, speed is the key differentiator. At the middle layer of the big data stack is analytics, where features are extracted from data and fed into classification and prediction algorithms. At the top of the stack are services and applications. This is the level at which consumers experience a data product, whether it is a music recommendation or a traffic route prediction.
At the bottom tier, data, free tools are shown in red (MySQL, Postgres, Hadoop), and their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed, offering faster processing and query times. Several of these players are pushing up toward the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lie the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches down into the analytics tier, is the defining competitive advantage.
There are three data layer trends: data growth, web application user growth and the explosion of mobile computing.
Data growth [Big Data]. IDC estimates that an organization's data will double every two years. Mining this raw data for valuable, actionable insights is challenging. Hadoop and related technologies (HDFS, MapReduce, Cassandra and Hive) are batch-processing oriented and assist in analyzing large data sets.
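For illustration, the mapper/reducer pair below is written in the Hadoop Streaming style (reading stdin, writing stdout), the batch-oriented model described above; the script name, input/output paths and the job-submission command in the comment are hypothetical.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style word count (illustrative).
# Run once as the mapper ("map") and once as the reducer ("reduce"), e.g.:
#   hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" \
#       -reducer "wordcount.py reduce" -input /logs -output /counts
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)            # emit (word, 1)

def reducer():
    current, total = None, 0
    for line in sys.stdin:                   # input arrives sorted by key
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```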
User growth [NoSQL]. Most new interactive software systems are accessed via browser. If available on the public Internet, these applications now have 2 billion potential users and a 24x7 uptime requirement. Regardless of dataset size, these software systems put unprecedented pressure on the data layer: massive user concurrency; need for predictable, low-latency random access to data to maintain a snappy interactive user experience; and the need for continuous operations, even during database maintenance. Couchbase and MongoDB are open source NoSQL technologies that meet the data management needs of interactive web applications.
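As a sketch of the low-latency, key-based access pattern these systems serve, the example below uses MongoDB through the pymongo driver; it assumes a MongoDB server running on localhost, and the collection and field names are hypothetical.

```python
from pymongo import MongoClient

# Assumes a local MongoDB instance and the pymongo driver.
client = MongoClient("mongodb://localhost:27017")
db = client.webapp

# Index the lookup key so session reads stay fast under heavy concurrency.
db.profiles.create_index("user_id")

db.profiles.insert_one({
    "user_id": "u-1001",
    "name": "Alice",
    "preferences": {"theme": "dark", "locale": "en-US"},
})

# Low-latency, random access by key - the typical interactive-web pattern.
profile = db.profiles.find_one({"user_id": "u-1001"})
print(profile["preferences"]["theme"])
```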
Mobile computing growth [Mobile Sync]. Mobile devices are increasingly where we create and consume information, but data aggregation and processing will be accomplished in the cloud. IDC estimates that in 2015, 1.4 of the 4.9 zettabytes of data created that year will be "touched by the cloud." Delivering the right data to millions of mobile devices, when and where it is needed (and then getting it back again), is the mobile-cloud data sync challenge.
These three trends may constitute the future emerging modern data stack - one that supports the ebb and flow of information from web and mobile applications to the cloud.
The key is to design and build a data warehouse / business intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse datasets.
Data comes from a variety of sources (internal, external, contextual, integrated): data created directly by users of web and mobile applications, observations and metadata related to the use of those applications, external data feeds, and intermediate analysis results. Processing this data produces the information needed by user-facing applications, which is then fed into a NoSQL solution.
The NoSQL solution provides low-latency, random access to the data, meeting the needs of web applications. It also allows a mobile synchronization server quick, random access to data needed by mobile users.
A Mobile Sync Server manages transient connections with mobile devices, delivering data to native mobile applications when and where it is needed; and receiving information in return.
Microsoft’s Big Data solution unleashes actionable insights for everyone from all their data through familiar tools. It also enables customers to uncover new insights by connecting to the world’s data through an open and flexible platform.
For customers with large or diverse datasets, Microsoft’s Big Data solution unleashes actionable business insights to drive smarter decisions from structured, semi-structured and unstructured data. Unlike the competition, it offers insights to everyone through integration with familiar Microsoft tools such as Excel, PowerPivot and Power View. In addition it enables customers to discover new insights by connecting to publicly available data and services from Azure Marketplace and social media sites such as Twitter and Facebook. Microsoft Big Data offers an Enterprise-ready Hadoop distribution through integration with key Microsoft components including Active Directory and System Center, and an open platform with full compatibility with Apache Hadoop APIs.
Hadoop (MapReduce, where code is turned into map and reduce jobs that Hadoop runs) is great at crunching data, yet it is inefficient for analyzing data because each time you add, change or manipulate data you must stream over the entire dataset.
In most organizations, data is constantly growing, changing and being manipulated, so the time required to analyze it increases significantly.
As a result, for large and diverse data sets, ad-hoc analytics and graph data structures, better alternatives to Hadoop / MapReduce are needed.
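A toy example of the difference: with a batch approach, adding one document forces a pass over the whole corpus, while an incremental approach processes only the new document. The corpus and function names below are hypothetical; the point is simply the cost asymmetry that the systems described next exploit.

```python
from collections import Counter

corpus = {"doc1": "big data", "doc2": "data stack", "doc3": "data layer"}

def batch_rebuild(docs):
    """Batch style: every change re-scans ALL documents (cost ~ corpus size)."""
    counts = Counter()
    for text in docs.values():
        counts.update(text.split())
    return counts

def incremental_update(index, new_text):
    """Incremental style: only the new document is processed (cost ~ change size)."""
    index.update(new_text.split())
    return index

index = batch_rebuild(corpus)                        # initial build
corpus["doc4"] = "mobile data"
index = incremental_update(index, corpus["doc4"])    # no full re-scan needed
print(index["data"])                                 # -> 4
```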
Google (whose MapReduce and Google File System papers inspired Hadoop) thought so, and architected a better, faster data-crunching ecosystem that includes Percolator, Dremel and Pregel. Google is one of the key innovation leaders in large-scale architecture.
Percolator is a system for incrementally processing updates to a large data set. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, you significantly speed up processing and reduce the time needed to analyze data.
Percolator’s architecture provides horizontal scalability and resilience. Percolator reduces latency (the time between a page being crawled and becoming available in the index) by a factor of 100, and it simplifies the indexing algorithm. The big advantage of Percolator is that indexing time is now proportional to the size of the page being indexed, rather than to the size of the whole existing index.
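Percolator's central abstraction is the observer: code registered against a column that runs whenever that column changes, so derived results stay current without a batch pass. The following is a much-simplified, single-process Python sketch of that trigger-on-write idea (hypothetical names; real Percolator adds distributed, transactional semantics on top of Bigtable):

```python
# Simplified observer pattern in the spirit of Percolator: writes to a table
# notify registered observers, which incrementally update derived data.
table = {}           # row -> column -> value
observers = {}       # column -> list of callbacks
inverted_index = {}  # derived, incrementally maintained structure

def observe(column, callback):
    observers.setdefault(column, []).append(callback)

def write(row, column, value):
    table.setdefault(row, {})[column] = value
    for callback in observers.get(column, []):    # notification triggers observers
        callback(row, value)

def index_document(row, contents):
    """Observer: index just the changed page, not the whole crawl."""
    for word in contents.split():
        inverted_index.setdefault(word, set()).add(row)

observe("contents", index_document)

write("com.example/page1", "contents", "falcon data lifecycle")
write("com.example/page2", "contents", "incremental data processing")
print(sorted(inverted_index["data"]))    # both pages, indexed incrementally
```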
Dremel is for ad-hoc analytics: a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees with a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds, roughly 100 times faster than MapReduce. The system scales to thousands of CPUs and petabytes of data, allowing analysts to scan petabytes of data in seconds to answer queries.
Dremel's architecture is similar to Pig and Hive. Yet while Hive and Pig rely on MapReduce for query execution, Dremel uses a query execution engine based on aggregator trees.
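A small sketch of the columnar idea behind Dremel: when each field is stored contiguously, an aggregation reads only the column it needs rather than every full record. The data below is hypothetical, and the sketch ignores Dremel's nested records and multi-level execution trees; it only illustrates the storage-layout point.

```python
# Row-oriented layout: an aggregation must walk every full record.
rows = [
    {"user": "a", "country": "US", "latency_ms": 120},
    {"user": "b", "country": "DE", "latency_ms": 80},
    {"user": "c", "country": "US", "latency_ms": 95},
]
row_total = sum(r["latency_ms"] for r in rows)

# Column-oriented layout (Dremel-style): each field is stored contiguously,
# so the query touches only the one column it aggregates.
columns = {
    "user": ["a", "b", "c"],
    "country": ["US", "DE", "US"],
    "latency_ms": [120, 80, 95],
}
col_total = sum(columns["latency_ms"])

assert row_total == col_total == 295
```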
Pregel is a system for large-scale graph processing and graph data analysis. Pregel is designed to execute graph algorithms faster and use simple code. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use.
Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
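As a minimal, single-process sketch of the vertex-centric, superstep model Pregel popularized: in each superstep every vertex runs the same compute function on the messages it received in the previous superstep and halts when it has nothing new to send. The toy below computes single-source shortest paths over a hypothetical three-vertex graph; in real Pregel, vertices are partitioned across thousands of machines and synchronization is handled by the framework.

```python
# Toy vertex-centric shortest-path computation in the Pregel superstep style.
INF = float("inf")
graph = {            # vertex -> list of (neighbor, edge_weight)
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2)],
    "C": [],
}
distance = {v: (0 if v == "A" else INF) for v in graph}
messages = {v: ([0] if v == "A" else []) for v in graph}

while any(messages.values()):                        # run until every vertex halts
    next_messages = {v: [] for v in graph}
    for vertex, inbox in messages.items():           # one superstep
        if not inbox:
            continue                                 # vertex stays halted
        best = min(inbox)
        if best <= distance[vertex]:
            distance[vertex] = best
            for neighbor, weight in graph[vertex]:   # send messages along out-edges
                next_messages[neighbor].append(best + weight)
    messages = next_messages

print(distance)    # {'A': 0, 'B': 1, 'C': 3}
```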