Data and information silos are a significant obstacle to organizations getting full value from their data. Data silos are separate databases or data files that fall outside an organization's enterprise-wide data administration. An information silo exists where parts of an information management system are unable to communicate freely with other information management systems. A siloed application is an application that does not interact with other applications or information systems.
The data science revolution depends on collecting, storing, analyzing, and distributing massive volumes and varieties of data to turn into knowledge and valuable, actionable insights. The real value comes from mixing different internal and external data sources and sharing information within a culture of collaboration.
Strong evidence suggests that organizations that draw on a variety of both internal and external data sources - in conjunction with data science and business analytics - outperform firms that rely only on internal data and whose data silos prevent data science practice, information sharing, and collaboration.
Unfortunately, managers have strong incentives to silo information to maintain power. As a result, organization leadership must have an information management strategy and a policy of sharing information to break the deadly "data silo" status quo.
Data and information silos often exist because managers control the flow of information and access to the silo, and they perceive that (1) their power and careers depend on controlling information; (2) there is not enough benefit from sharing information; (3) the information might not be useful to people in other systems; and (4) the costs of integrating the information systems are not justified.
In addition, data silos are a danger to data integrity, increasing the risk that current (or more recent) data will accidentally be overwritten with outdated (or less recent) data. When two or more silos exist for the same data, their contents may differ, creating confusion about which repository holds the most legitimate or up-to-date version.
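The conflict described above can be made concrete with a small sketch. The silo names, fields, and timestamps below are hypothetical; the example simply shows how two copies of the "same" record can disagree, and how a last-updated timestamp is one (imperfect) way to decide which copy to trust.

```python
from datetime import datetime

# Hypothetical copies of the same customer record held in two silos.
crm_silo = {
    "cust-001": {"email": "ana@example.com", "updated": datetime(2023, 5, 1)},
}
billing_silo = {
    "cust-001": {"email": "ana.new@example.com", "updated": datetime(2023, 2, 10)},
}

def find_conflicts(a, b):
    """Return keys present in both silos whose contents disagree,
    mapped to whichever copy carries the most recent timestamp."""
    conflicts = {}
    for key in a.keys() & b.keys():
        if a[key]["email"] != b[key]["email"]:
            newest = a[key] if a[key]["updated"] >= b[key]["updated"] else b[key]
            conflicts[key] = newest
    return conflicts

print(find_conflicts(crm_silo, billing_silo))
```

Note that "most recent wins" is only a heuristic: if the older silo held the correct value, the reconciliation itself propagates a data bug, which is exactly the integrity risk the text describes.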
According to a recent CompTIA survey:
To fully benefit from the data science revolution, organizations must change the incentive structures that encourage managers to hoard and control information, and prevent data and information silos from impeding the quest for competitive advantage.
Regarding defects in human-built systems, the term "bug" appears to have been coined by Thomas Edison in 1876 to describe problems in his systems. A bug has been defined as "an unexpected defect, fault, flaw, or imperfection".
Like the "system" or "software" bug, the "data" bug is a defect, fault, flaw, or imperfection in data. Data bugs may be hidden and difficult to find - consider the following:
Further, humans are flawed and carry both overt and hidden biases, as well as other incentives to skew data to obtain a desired result, including:
Data science results built on data bugs can be extremely serious: the bugs are sometimes impossible or very difficult to detect, and they may trigger errors with a myriad of secondary effects, producing an illusion of reality and bad decisions.
Moreover, data bugs may remain undetected for long periods of time. Data has many secondary uses, with low barriers to sharing it, combining it with other data sources, and transforming or manipulating it.
What is urgently needed is a new "meta-data reporting system" that labels, defines, rates, and categorizes all new and transformed data (structured, raw unstructured, and semi-structured). This goes beyond traditional, simple "meta-data" definitions. Meta-data is information about data: it describes how, when, and by whom a particular set of data was collected, and how the data is formatted. Descriptive meta-data covers the data's content and the creation, validation, and transformation of the data, as well as specific instances of data application. Structural meta-data provides information about the technical design and specification of data structures.
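The distinction between descriptive and structural meta-data can be sketched as a simple record type. The field names, collector name, and schema below are hypothetical illustrations, not a proposed standard.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MetadataRecord:
    # Descriptive meta-data: who collected the data, when, and what
    # transformations it has undergone.
    dataset_name: str
    collected_by: str
    collected_at: datetime
    transformations: list = field(default_factory=list)
    # Structural meta-data: technical design and format of the data.
    data_format: str = "csv"
    schema: dict = field(default_factory=dict)

record = MetadataRecord(
    dataset_name="quarterly_sales",
    collected_by="etl-pipeline-7",  # hypothetical collector
    collected_at=datetime(2023, 1, 15),
    data_format="parquet",
    schema={"region": "string", "revenue": "float"},
)
# Each transformation is appended to the record, so the meta-data
# travels with the data as it is reshaped.
record.transformations.append("deduplicated on order_id")
```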
A new meta-data reporting system should include:
This detailed meta-data should follow the data like a "chain of data evidence" for future users of the data. It is especially useful after the data has been sliced, diced, and combined with other data sources.
Along the "chain of data evidence," future users can add reports detailing real or potential hidden data bugs. These data bug reports would include veracity reports, quality reports, defect reports, fault reports, problem reports, trouble reports, and notes on other potential data evidentiary issues.
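One minimal way to sketch such a chain is an append-only list of provenance links, where any future user can attach a data bug report to the step where the problem was introduced. The actor names, actions, and report text are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceLink:
    """One step in the chain of data evidence."""
    actor: str
    action: str
    bug_reports: list = field(default_factory=list)

@dataclass
class DataChain:
    links: list = field(default_factory=list)

    def record(self, actor, action):
        """Append a new link describing who did what to the data."""
        link = EvidenceLink(actor, action)
        self.links.append(link)
        return link

chain = DataChain()
chain.record("ingest-job", "loaded raw survey responses")
step = chain.record("analyst-42", "joined with demographics table")
# A later user flags a suspected data bug on this step of the chain.
step.bug_reports.append("veracity: age field contains negative values")
```

Because links are only ever appended, the chain preserves the full history of the data, and bug reports remain attached to the specific transformation they concern.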