Data management in the Hadoop ecosystem is still in the early stages of development. The goal of cheaper and more effective ways of collecting, storing, processing and distributing structured and unstructured data (as well as internal and external data sources) has been impeded by complexity, lack of qualified professionals and difficulty in managing data.
Data movement and management in Hadoop is challenging. It includes data motion, process orchestration, lifecycle management and data discovery. The trick to simplifying data management in Hadoop is to process data in a decentralized fashion by pushing complexity into the platform - enabling data engineers to focus on the processing / business logic.
Apache Falcon is an open source data processing and management solution for the Hadoop ecosystem. It simplifies the management of data by enabling users to define infrastructure endpoints (e.g., clusters, HBase, databases, HCatalog), logical tables/feed/datasets (e.g., location, permissions, source, retention limits, replication targets) and processing rules (e.g., inputs, outputs, schedule, business logic) as configurations.
Hadoop Falcon addresses:
Falcon allows users to on-board data sets with a complete understanding of how, when and where their data is managed across its lifecycle. It uses Apache Oozie for coordinating workflows. Workflow templates are used for data management. Falcon provides open APIs that enable those workflows to be orchestrated more broadly to allow integration between data warehouse systems (e.g., orchestrate data lifecycle workflows within Hadoop as well as with a Teradata system).
The Internet of Things (IOT) will soon produce a massive volume and variety of data at unprecedented velocity. If "Big Data" is the product of the IOT, "Data Science" is it's soul.
Let's define our terms:
Internet of Things (IOT): equipping all physical and organic things in the world with identifying intelligent devices allowing the near real-time collecting and sharing of data between machines and humans. The IOT era has already begun, albeit in it's first primitive stage.
Data Science: the analysis of data creation. May involve machine learning, algorithm design, computer science, modeling, statistics, analytics, math, artificial intelligence and business strategy.
Big Data: the collection, storage, analysis and distribution/access of large data sets. Usually includes data sets with sizes beyond the ability of standard software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
We are in the pre-industrial age of data technology and science used to process and understand data. Yet the early evidence provides hope that we can manage and extract knowledge and wisdom from this data to improve life, business and public services at many levels.
To date, the internet has mostly connected people to information, people to people, and people to business. In the near future, the internet will provide organizations with unprecedented data. The IOT will create an open, global network that connects people, data and machines.
Billions of machines, products and things from the physical and organic world will merge with the digital world allowing near real-time connectivity and analysis. Machines and products (and every physical and organic thing) embedded with sensors and software - connected to other machines, networked systems, and to humans - allows us to cheaply and automatically collect and share data, analyze it and find valuable meaning. Machines and products in the future will have the intelligence to deliver the right information to the right people (or other intelligent machines and networks), any time, to any device. When smart machines and products can communicate, they help us and other machines understand so we can make better decisions, act fast, save time and money, and improve products and services.
The IOT, Data Science and Big Data will combine to create a revolution in the way organizations use technology and processes to collect, store, analyze and distribute any and all data required to operate optimally, improve products and services, save money and increase revenues. Simply put, welcome to the new information age, where we have the potential to radically improve human life (or create a dystopia - a subject for another time).
The IOT will produce gigantic amounts of data. Yet data alone is useless - it needs to be interpreted and turned into information. However, most information has limited value - it needs to be analyzed and turned into knowledge. Knowledge may have varying degrees of value - but it needs specialized manipulation to transform into valuable, actionable insights. Valuable, actionable knowledge has great value for specific domains and actions - yet requires sophisticated, specialized expertise to be transformed into multi-domain, cross-functional wisdom for game changing strategies and durable competitive advantage.
Big data may provide the operating system and special tools to get actionable value out of data, but the soul of the data, the knowledge and wisdom, is the bailiwick of the data scientist.
Fast Data applications are typically compute intensive and run on High Performance Computing Architectures. This video explores and unlocks the need for speed in Fast Data. It examines the properties that define Fast Data, analyzes the challenges, and then takes a closer look at the role Flash technologies play in delivering the performance required by applications using Fast Data.