Regarding defects in human-built systems, the term "bug" appears to have been coined by Thomas Edison in 1876 to describe problems in his systems. A bug has been defined as "an unexpected defect, fault, flaw, or imperfection".
Like the "system" or "software" bug, the "data" bug is a defect, fault, flaw, or imperfection in data. Data bugs may be hidden and difficult to find - considering the following:
Further, humans are flawed and have both overt and hidden biases, as well as other incentives to skew data to obtain a desired result, including:
Data science results built on data bugs may be seriously compromised. Such bugs are sometimes impossible or very difficult to detect and may trigger errors that cause a myriad of secondary effects, resulting in an illusion of reality and bad decisions.
Moreover, data bugs may remain undetected for long periods of time. Data has many secondary uses, with low barriers to sharing, to combination with other data sources, and to transformation or manipulation, so a single bug can spread well beyond the data set in which it originated.
What is urgently needed is a new "meta-data reporting system" that labels, defines, rates and categorizes all new and transformed data (structured, raw unstructured and semi-structured). This goes beyond the traditional simple "meta-data" definitions. Meta-data is information about data - describing how and when and by whom a particular set of data was collected, and how the data is formatted. Descriptive meta-data is about the data content and the creation, validation and transformation of the data - as well as specific instances of data application. Structural meta-data provides information about the technical design and specification of data structures.
A new meta-data reporting system should include:
This detailed meta-data information should follow the data like a "chain of data evidence" - for future users of the data. This is especially useful after the data is sliced, diced and combined with other data sources.
Along the "chain of data evidence", future users can add reports detailing real or potentially hidden data bugs. These data bug reports would include veracity reports, quality reports, defect reports, fault reports, problem reports, trouble reports, and reports on other potential data evidentiary issues.
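As a rough illustration of the "chain of data evidence" idea, here is a minimal Python sketch. It is not an existing system or standard: the class names (MetaDataRecord, BugReport), the field names, and the sample data sets are all hypothetical, chosen only to show how meta-data could follow data through a transformation and how later users could attach bug reports that remain visible downstream.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class BugReport:
    """A report filed by a data user (hypothetical schema)."""
    reporter: str
    kind: str          # e.g. "veracity", "quality", "defect", "fault"
    description: str


@dataclass
class MetaDataRecord:
    """One link in the chain: who produced a data set, when, and how."""
    dataset_id: str
    produced_by: str
    produced_at: datetime
    data_format: str
    transformation: str = "original collection"
    parents: List["MetaDataRecord"] = field(default_factory=list)
    bug_reports: List[BugReport] = field(default_factory=list)

    def derive(self, dataset_id: str, produced_by: str, transformation: str,
               other_parents: Optional[List["MetaDataRecord"]] = None) -> "MetaDataRecord":
        """Create the record for a transformed or combined data set.

        The new record keeps references to its sources, so the history
        is not lost when data is sliced, diced or merged.
        """
        return MetaDataRecord(
            dataset_id=dataset_id,
            produced_by=produced_by,
            produced_at=datetime.now(timezone.utc),
            data_format=self.data_format,
            transformation=transformation,
            parents=[self] + list(other_parents or []),
        )

    def all_bug_reports(self) -> List[Tuple[str, BugReport]]:
        """Collect bug reports from this record and every ancestor."""
        found: List[Tuple[str, BugReport]] = []
        seen = set()
        stack = [self]
        while stack:
            record = stack.pop()
            if id(record) in seen:   # a source may be reachable by two paths
                continue
            seen.add(id(record))
            found.extend((record.dataset_id, r) for r in record.bug_reports)
            stack.extend(record.parents)
        return found


# Illustrative, made-up data sets.
raw = MetaDataRecord("sensor-raw", "field team",
                     datetime(2014, 1, 15, tzinfo=timezone.utc), "csv")
raw.bug_reports.append(
    BugReport("analyst A", "quality", "timestamps lack a time zone"))
weather = MetaDataRecord("weather", "third party",
                         datetime(2014, 1, 20, tzinfo=timezone.utc), "json")

merged = raw.derive("sensor-plus-weather", "analyst B", "joined on date",
                    other_parents=[weather])

# A user of the merged data still sees the report filed against its source.
for dataset_id, report in merged.all_bug_reports():
    print(dataset_id, report.kind, report.description)
```

The design choice worth noting is that bug reports stay attached to the data set they were filed against and are gathered by walking the chain, rather than being copied into each derived data set. A report added to a source after the fact is then visible to everyone downstream.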
See: Gartner, "Magic Quadrant for Data Quality Tools," Ted Friedman, Andreas Bitterer. October 7, 2013.
The Internet of Things (IOT) will soon produce a massive volume and variety of data at unprecedented velocity. If "Big Data" is the product of the IOT, "Data Science" is its soul.
Let's define our terms:
Internet of Things (IOT): equipping all physical and organic things in the world with identifying intelligent devices, allowing the near real-time collecting and sharing of data between machines and humans. The IOT era has already begun, albeit in its first primitive stage.
Data Science: the analysis of data creation. May involve machine learning, algorithm design, computer science, modeling, statistics, analytics, math, artificial intelligence and business strategy.
Big Data: the collection, storage, analysis and distribution/access of large data sets. Usually includes data sets with sizes beyond the ability of standard software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
We are in the pre-industrial age of data technology and science used to process and understand data. Yet the early evidence provides hope that we can manage and extract knowledge and wisdom from this data to improve life, business and public services at many levels.
To date, the internet has mostly connected people to information, people to people, and people to business. In the near future, the internet will provide organizations with unprecedented data. The IOT will create an open, global network that connects people, data and machines.
Billions of machines, products and things from the physical and organic world will merge with the digital world, allowing near real-time connectivity and analysis. Machines and products (and every physical and organic thing) embedded with sensors and software - connected to other machines, networked systems, and to humans - allow us to cheaply and automatically collect and share data, analyze it and find valuable meaning. Machines and products in the future will have the intelligence to deliver the right information to the right people (or other intelligent machines and networks), any time, to any device. When smart machines and products can communicate, they help us and other machines understand so we can make better decisions, act fast, save time and money, and improve products and services.
The IOT, Data Science and Big Data will combine to create a revolution in the way organizations use technology and processes to collect, store, analyze and distribute any and all data required to operate optimally, improve products and services, save money and increase revenues. Simply put, welcome to the new information age, where we have the potential to radically improve human life (or create a dystopia - a subject for another time).
The IOT will produce gigantic amounts of data. Yet data alone is useless - it needs to be interpreted and turned into information. However, most information has limited value - it needs to be analyzed and turned into knowledge. Knowledge may have varying degrees of value - but it needs specialized manipulation to transform into valuable, actionable insights. Valuable, actionable knowledge has great value for specific domains and actions - yet requires sophisticated, specialized expertise to be transformed into multi-domain, cross-functional wisdom for game changing strategies and durable competitive advantage.
Big data may provide the operating system and special tools to get actionable value out of data, but the soul of the data, the knowledge and wisdom, is the bailiwick of the data scientist.