Data movement and management in Hadoop are challenging: they span data motion, process orchestration, lifecycle management, and data discovery. The key to simplifying data management in Hadoop is to process data in a decentralized fashion by pushing complexity into the platform, which lets data engineers focus on processing and business logic.
Apache Falcon is an open-source data processing and management solution for the Hadoop ecosystem. It simplifies data management by letting users define three kinds of entities as configurations: infrastructure endpoints (e.g., clusters, HBase, databases, HCatalog), logical tables, feeds, or datasets (e.g., location, permissions, source, retention limits, replication targets), and processing rules (e.g., inputs, outputs, schedule, business logic). Falcon addresses the following areas:
- Data motion
- Process orchestration and scheduling
- Policy-based lifecycle management
- Data discovery
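
To make the configuration-driven model more concrete, here is a minimal Python sketch of the three kinds of definitions described above: endpoints, feeds, and processing rules. It is an illustration only and does not reproduce Falcon's actual entity schema or API. Every class, field, and function name (`Cluster`, `Feed`, `Process`, `undefined_references`, `expired_instances`) is hypothetical, as are the sample paths and hostnames. The sketch assumes that feeds refer to clusters by name and that processes refer to feeds by name, and it shows two things such declarations make possible: checking that every reference resolves, and applying a retention limit to decide which dataset instances are past their lifetime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


# All names below are hypothetical; this is not Falcon's entity schema.
@dataclass(frozen=True)
class Cluster:
    """An infrastructure endpoint."""
    name: str
    endpoint: str


@dataclass(frozen=True)
class Feed:
    """A logical dataset: where it lives, how long it is kept, where it is copied."""
    name: str
    location: str
    source: str                      # name of the source Cluster
    retention: timedelta             # retention limit
    replication_targets: tuple = ()  # names of target Clusters


@dataclass(frozen=True)
class Process:
    """A processing rule: which feeds it reads and writes, and how often it runs."""
    name: str
    inputs: tuple    # names of input Feeds
    outputs: tuple   # names of output Feeds
    schedule: timedelta


def undefined_references(clusters, feeds, processes):
    """Return messages for cluster or feed names that are used but never defined."""
    cluster_names = {c.name for c in clusters}
    feed_names = {f.name for f in feeds}
    problems = []
    for f in feeds:
        for c in (f.source, *f.replication_targets):
            if c not in cluster_names:
                problems.append(f"feed {f.name}: unknown cluster {c}")
    for p in processes:
        for name in (*p.inputs, *p.outputs):
            if name not in feed_names:
                problems.append(f"process {p.name}: unknown feed {name}")
    return problems


def expired_instances(feed, instance_times, now):
    """Return the instance timestamps that are older than the feed's retention limit."""
    cutoff = now - feed.retention
    return [t for t in instance_times if t < cutoff]


# Example declarations (sample values are made up for illustration).
clusters = [
    Cluster("primary", "hdfs://primary:8020"),
    Cluster("backup", "hdfs://backup:8020"),
]
feeds = [
    Feed("raw_clicks", "/data/raw/clicks", "primary", timedelta(days=90), ("backup",)),
    Feed("click_summary", "/data/summary/clicks", "primary", timedelta(days=365)),
]
processes = [
    Process("summarize", ("raw_clicks",), ("click_summary",), timedelta(hours=1)),
]

print(undefined_references(clusters, feeds, processes))  # prints []
```

Because the definitions are plain data, the platform rather than each pipeline author can take responsibility for validation, retention, and replication, which is the sense in which complexity is pushed into the platform.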
Falcon lets users onboard datasets with a complete understanding of how, when, and where their data is managed across its lifecycle. It uses Apache Oozie to coordinate workflows and uses workflow templates for data management. Falcon also provides open APIs so that those workflows can be orchestrated more broadly, allowing integration with data warehouse systems (e.g., orchestrating data lifecycle workflows within Hadoop as well as with a Teradata system).