Regarding defects in human-built systems, the term "bug" appears to have been coined by Thomas Edison in 1876 to describe problems in his inventions. A bug has been defined as "an unexpected defect, fault, flaw, or imperfection".
Like the "system" or "software" bug - the "data" bug is a defect, fault, flaw, or imperfection in data. Data bugs may be hidden and difficult to find - considering the following:
Further, humans are flawed and have both overt and hidden biases, as well as incentives to skew data to obtain a desired result, including:
The consequences of data bugs for data science results can be extremely serious - data bugs are sometimes very difficult or impossible to detect, and the errors they trigger can cause a myriad of secondary effects, producing an illusion of reality and leading to bad decisions.
Moreover, data bugs may remain undetected for long periods of time. Data has many secondary uses, and the barriers to sharing it, combining it with other data sources, and transforming or manipulating it are low.
What is urgently needed is a new "meta-data reporting system" that labels, defines, rates and categorizes all new and transformed data (structured, semi-structured and raw unstructured). This goes beyond traditional, simple "meta-data" definitions. Meta-data is information about data - describing how, when and by whom a particular set of data was collected, and how the data is formatted. Descriptive meta-data covers the data content and the creation, validation and transformation of the data - as well as specific instances of data application. Structural meta-data provides information about the technical design and specification of data structures.
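To make this concrete, here is a minimal sketch of what one such reporting record might look like. The class and field names are hypothetical, not an existing standard; the sketch is only meant to show descriptive and structural meta-data living side by side in a single record.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List

@dataclass
class DescriptiveMetadata:
    """Content-oriented meta-data: how, when and by whom the data was created."""
    collected_by: str                     # who collected the data
    collected_at: datetime                # when it was collected
    collection_method: str                # how it was collected
    validations: List[str] = field(default_factory=list)      # validation steps applied
    transformations: List[str] = field(default_factory=list)  # transformations applied
    applications: List[str] = field(default_factory=list)     # specific uses of the data

@dataclass
class StructuralMetadata:
    """Technical design and specification of the data structures."""
    data_format: str                      # e.g. "CSV", "Avro", "JSON"
    schema: Dict[str, str] = field(default_factory=dict)      # field name -> declared type

@dataclass
class MetadataRecord:
    """One reporting record attached to a dataset (structured, semi-structured or raw)."""
    dataset_name: str
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata
```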
A new meta-data reporting system should include:
This detailed meta-data information should follow the data, like a "chain of data evidence", for future users of the data. This is especially useful after the data has been sliced, diced and combined with other data sources.
Along the "chain of data evidence" future users can add reports detailing real or potentially hidden data bugs. Included in the data bug reports would be veracity reports, quality reports, defect reports, fault reports, problem reports, trouble reports, and other potential data evidentiary issues.
Data management in the Hadoop ecosystem is still in the early stages of development. Progress toward cheaper and more effective ways of collecting, storing, processing and distributing structured and unstructured data (from both internal and external sources) has been impeded by complexity, a shortage of qualified professionals and the difficulty of managing the data.
Data movement and management in Hadoop is challenging. It encompasses data motion, process orchestration, lifecycle management and data discovery. The trick to simplifying data management in Hadoop is to handle data processing declaratively, pushing complexity into the platform and enabling data engineers to focus on the processing and business logic.
Apache Falcon is an open source data processing and management solution for the Hadoop ecosystem. It simplifies data management by enabling users to define infrastructure endpoints (e.g., clusters, HBase, databases, HCatalog), logical tables/feeds/datasets (e.g., location, permissions, source, retention limits, replication targets) and processing rules (e.g., inputs, outputs, schedule, business logic) as configurations.
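Falcon itself expresses these entities as XML definitions submitted to the Falcon server; the Python sketch below is only meant to illustrate the kinds of properties each entity type carries. All names, paths and values here are hypothetical.

```python
# Illustrative only: real Falcon entities are XML documents (cluster, feed, process),
# not Python dicts. The keys below are hypothetical stand-ins for the kinds of
# properties a Falcon configuration captures.

cluster = {
    "name": "primary-cluster",
    "interfaces": {               # infrastructure endpoints
        "namenode": "hdfs://nn.example.com:8020",
        "resourcemanager": "rm.example.com:8050",
        "workflow": "http://oozie.example.com:11000/oozie",
    },
}

feed = {                          # a logical dataset / feed
    "name": "raw-clickstream",
    "location": "/data/clickstream/${YEAR}/${MONTH}/${DAY}",
    "frequency": "hours(1)",
    "retention": "days(90)",      # retention limit on the source cluster
    "replication_target": "backup-cluster",
    "permissions": {"owner": "etl", "group": "analytics", "mode": "0755"},
}

process = {                       # processing rules: inputs, outputs, schedule, logic
    "name": "sessionize-clicks",
    "inputs": ["raw-clickstream"],
    "outputs": ["sessionized-clicks"],
    "schedule": "hours(1)",
    "workflow": "hdfs://nn.example.com:8020/apps/sessionize/workflow.xml",  # business logic
}
```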
Apache Falcon addresses:
Falcon allows users to on-board datasets with a complete understanding of how, when and where their data is managed across its lifecycle. It uses Apache Oozie to coordinate workflows, and workflow templates are used for data management. Falcon also provides open APIs that enable those workflows to be orchestrated more broadly, allowing integration with data warehouse systems (e.g., orchestrating data lifecycle workflows within Hadoop as well as with a Teradata system).
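As a rough sketch of driving that lifecycle programmatically - assuming a Falcon server at a hypothetical falcon.example.com:15000 and its entity submit/schedule REST endpoints, whose exact paths and authentication you should verify against the Falcon documentation for your version - the flow might look something like this:

```python
# Hedged sketch: endpoint paths, port and the "user.name" parameter are assumptions
# to be checked against your Falcon release; feed.xml is a Falcon feed definition.
import requests

FALCON = "http://falcon.example.com:15000"

def submit_entity(entity_type: str, definition_xml: str, user: str = "etl") -> str:
    """Submit a cluster/feed/process definition; Falcon validates and stores it."""
    resp = requests.post(
        f"{FALCON}/api/entities/submit/{entity_type}",
        params={"user.name": user},
        data=definition_xml,
        headers={"Content-Type": "text/xml"},
    )
    resp.raise_for_status()
    return resp.text

def schedule_entity(entity_type: str, name: str, user: str = "etl") -> str:
    """Schedule a submitted entity; Falcon generates the underlying Oozie workflows."""
    resp = requests.post(
        f"{FALCON}/api/entities/schedule/{entity_type}/{name}",
        params={"user.name": user},
    )
    resp.raise_for_status()
    return resp.text

# Usage:
# submit_entity("feed", open("feed.xml").read())
# schedule_entity("feed", "raw-clickstream")
```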
Of course the traditional data warehouse (DW) is here to stay for the near future, and you must integrate a new, modern data analytics tech ecosystem with the legacy DW. This integration is challenging.
Smart organizations are planning for the future by disfavoring the inflexible data modeling of the traditional DW and favoring more flexible and faster tech like NoSQL databases, Hadoop and in-memory databases.
Virtualized, in-memory analytical engines holding structured, semi-structured and unstructured data are the future. Scalable, in-memory, virtualized analytical DW platforms improve and simplify information management, increase speed and lower costs.
At the 2013 TDWI World Conference and BI Executive Summit in Las Vegas, speakers and attendees chewed over some of the meatiest trends and hottest technologies in the business intelligence and data warehousing market.
One discussion thread centered on strategies for finding business value in big data through the use of technologies such as Hadoop. Another focused on the need to make BI applications more enticing to business users, with more eye-catching designs and interactive elements. Operational intelligence, self-service BI and mobile BI were also on the agenda, as were Agile BI and data warehousing, enterprise information management and big data management.