Regarding defects in human-built systems, the term "bug" appears to have been coined by Thomas Edison in 1876 to describe problems in his systems. A "bug" has been defined as "an unexpected defect, fault, flaw, or imperfection".
Like the "system" or "software" bug - the "data" bug is a defect, fault, flaw, or imperfection in data. Data bugs may be hidden and difficult to find - considering the following:
Further, humans are flawed and have both naked and hidden biases as well as other incentives to skew data to obtain a desired result, including:
Data science results from data bugs may be extremely serious - they are sometimes impossible or very difficult to detect and may trigger errors that can cause a myriad of secondary effects, resulting in an illusion of reality and bad decisions.
Moreover, data bugs may remain undetected for long periods of time. Data has many secondary uses with low barriers to sharing, combining with other data sources and transformation or manipulation.
What is urgently needed is a new "meta-data reporting system" that labels, defines, rates and categorizes all new and transformed data - structured, semi-structured, and raw unstructured. This goes beyond traditional, simple "meta-data" definitions. Meta-data is information about data - describing how, when, and by whom a particular set of data was collected, and how the data is formatted. Descriptive meta-data is about the data content and the creation, validation and transformation of the data, as well as specific instances of data application. Structural meta-data provides information about the technical design and specification of data structures.
A new meta-data reporting system should include this richer descriptive and structural information for every new and transformed data set.
This detailed meta-data information should follow the data like a "chain of data evidence" - for future users of the data. This is especially useful after the data is sliced, diced and combined with other data sources.
Along the "chain of data evidence" future users can add reports detailing real or potentially hidden data bugs. Included in the data bug reports would be veracity reports, quality reports, defect reports, fault reports, problem reports, trouble reports, and other potential data evidentiary issues.
Real-time applications have long been considered off-limits for Hadoop clusters, even though Hadoop is often considered key to open-source exploitation of really large data streams. This talk shows how Storm and Hadoop can work together so that, for a sample metrics application, latencies of less than 5 ms are typical and latencies of less than 5 seconds are achieved almost certainly, while still retaining years of data with high availability and durability. This is done using a hybrid system in which Storm and Hadoop cooperate to do something neither can do alone.
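The talk's actual architecture is not reproduced here, but the general hybrid pattern can be sketched: Hadoop periodically rebuilds a durable batch view of historical metrics, Storm maintains an in-memory view of very recent events, and a query merges the two. The function names and data layout below are hypothetical Python pseudocode of that pattern, not the speakers' implementation:

```python
# Hypothetical sketch of the hybrid pattern: batch view (Hadoop) + real-time view (Storm).
batch_view = {}      # metric -> aggregate computed by periodic Hadoop jobs over historical data
realtime_view = {}   # metric -> aggregate maintained on the fast path (e.g., a Storm topology)

def record_event(metric, value):
    """Called on the real-time path for each incoming event."""
    realtime_view[metric] = realtime_view.get(metric, 0) + value

def load_batch_results(results):
    """Called after each Hadoop batch run; replaces the historical aggregates.
    (Simplified: a real system must handle events arriving during the swap.)"""
    batch_view.clear()
    batch_view.update(results)
    realtime_view.clear()

def query(metric):
    """Low-latency read: merge the durable batch view with the fresh real-time view."""
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)
```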
One of the most popular methods or frameworks used by data scientists at the Rose Data Science Professional Practice Group is Random Forests. The Random Forests algorithm is among the best classification algorithms, able to classify large amounts of data with accuracy.
Random Forests are an ensemble learning method (also thought of as a form of nearest neighbor predictor) for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees. (Random Forests is a trademark of Leo Breiman and Adele Cutler for an ensemble of decision trees.)
Random Forests are a combination of tree predictors, where each tree depends on the values of a random vector sampled independently with the same distribution for all trees in the forest. The basic principle is that a group of "weak learners" can come together to form a "strong learner". Random Forests are a wonderful tool for making predictions, since they do not overfit as more trees are added - a consequence of the law of large numbers. Introducing the right kind of randomness makes them accurate classifiers and regressors.
Single decision trees often have high variance or high bias. Random Forests attempt to mitigate both problems by averaging, finding a natural balance between the two extremes. Because Random Forests have few parameters to tune and can be used with default parameter settings, they are a simple way to produce a reasonable model quickly and efficiently when you do not yet have one.
Random Forests are easy to learn and use for both professionals and lay people, with little research and programming required, and may be used by people without a strong statistical background. Simply put, you can make more accurate predictions while avoiding most of the basic mistakes common to other methods.
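To make the ease-of-use point concrete, here is a minimal, illustrative sketch using scikit-learn's RandomForestClassifier with its default settings on a toy data set. scikit-learn and the iris data are assumptions of this sketch, not tools named above:

```python
# Minimal sketch: a Random Forest with default settings, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier()          # default parameters, no tuning
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```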
The Random Forests algorithm was developed by Leo Breiman and Adele Cutler. Random Forests grow many classification trees. Each tree is grown as follows (a minimal code sketch follows the steps below):
1. If the number of cases in the training set is N, sample N cases at random - but with replacement - from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
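As a rough illustration of these three steps (not Breiman and Cutler's reference implementation), the sketch below grows each tree on a bootstrap sample of N cases, considers m randomly chosen variables at each split, and does no pruning. NumPy and scikit-learn are assumed, and the helper names grow_forest and forest_predict are hypothetical:

```python
# Rough illustration of the three tree-growing steps; assumes NumPy and scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, m=None, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    m = m or max(1, int(np.sqrt(M)))       # m << M, held constant while the forest grows
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)   # step 1: N cases sampled at random with replacement
        tree = DecisionTreeClassifier(
            max_features=m,                # step 2: best split among m randomly selected variables
            max_depth=None,                # step 3: grow to the largest extent possible, no pruning
        ).fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    votes = np.stack([t.predict(X) for t in forest])   # one row of predictions per tree
    # majority vote: the mode of the classes output by the individual trees
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```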
Top Benefits of Random Forests
FastRandomForest is an efficient implementation of the Random Forests classifier for the Weka environment.
Introduction to Machine Learning - Slides
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." – Benjamin Franklin
Recent revelations concerning the extent of surveillance and massive personal data collection by the United States National Security Agency (NSA) and big business have sparked fierce debate about the appropriate balance of privacy versus security in our society. On the one hand, any government's first priority is to protect citizens from harm, and this means using technology and intelligence methods (e.g., data collection and analysis) to do the job effectively and efficiently. On the other hand, most people believe in some sort of inalienable right to a certain level of privacy and protection from potential government abuse. Most think the law constrains intelligence agencies and big business from spying on citizens.
How to strike the optimal balance of privacy versus security in our society should be answered by the citizens in a democracy - not unelected judges or agencies with little to no political accountability. Most people knew some level of surveillance and intelligence collection was taking place to protect us - yet are shocked at the massive amount of private data being collected and stored and the fact that so many leading tech and communication corporations have been willing partners with government in collecting all this data. Many are disturbed by the fact that our political leaders did not disclose the pervasive level of data snooping and thus foreclosed public debate about surveillance levels folks feel comfortable with. We need to have this debate.
In theory, we could stop 99% of terrorist attacks - yet the price paid would be a significant reduction in our quality of life. How much quality of life are we willing to sacrifice for what percentage of risk protection? Who owns our personal data? What legal rights do we have to our data? What secondary uses can the government make of all this stored personal data in the future? These are issues to be debated and decided by the people.
I respectfully suggest we need to grow up as citizens (and as a world community) and figure this out so we can tell our political leaders what type of society we desire. What levels of risk are we willing to accept? How do we optimally allocate resources among various risks (e.g., car accidents, fires, climate change, health care, poverty, education, terrorism...)? Technology and broad surveillance of society for security purposes is expensive: is the current spending proportional to the risks? Is the spending rational considering competing risks and issues? Trade-offs are required - there are no easy answers - but we the people need to make these difficult decisions and not default to government and big business, which skew the system in favor of their own interests.
Technology and raw data are morally and ethically neutral: they can be used for good and bad purposes. Yes, tech design and data objectivity and quality matter - but humans decide purpose. Intelligence methods used by government can help protect us and simultaneously (or at a later time) be abused to target groups and individuals for alternative purposes, both good and bad.
This is not new - we have been down this road before. The United States has a long history of government data snooping and abuse of intelligence methods (e.g., wiretaps, opening mail, foreign manipulation and assassination). See the 1976 Church Committee, which found illegal intelligence gathering on citizens by the Central Intelligence Agency, National Security Agency and Federal Bureau of Investigation.
While most reasonable folks expect the government to use some level of surveillance to protect us, we also desire to live a reasonably good quality of life and fear that government and big business - composed of fallible humans - may at some time abuse these awesome powers if not monitored and checked. It is plausible that future governments will mine personal data to control, manipulate and abuse the citizenry. And it frightens many that big tech firms and government appear to be in bed together in creating a modern Orwellian surveillance state - without full disclosure to and approval by the people.
Simply put, it comes down to trust: do we the people trust our government and big tech and communication firms not to abuse this new extraordinary power? Historical evidence creates doubts. At this time there is no evidence of personal data abuse. Yet there is also no solid evidence that data snooping has protected us from specific harm. What are the checks and balances against abuse? Are they effective or flawed? What are the incentives?
It appears at this time only "metadata" is being collected (e.g., logs of calls, data for credit card transactions and online communications). Americans now produce about 161 exabytes of combined raw data per year and collecting, filtering, organizing, storing and analyzing this raw data is only possible using sophisticated technology and data science techniques.
The raw data sets are massive and growing exponentially - and at this time only machine learning and algorithms can understand them. The search is for trends, patterns, associations and networks. Once strange activity is identified, then humans can drill down and investigate.
Here is the data science problem: you are attempting to find a small needle in a larger and larger haystack. You will find more "statistically significant" relationships in larger data sets - and more patterns and relationships will have no meaning - creating greater opportunity to mistake noise for signal. Put another way, you will find more correlations and patterns between data - yet the number of false positives will rise significantly - more correlations without causation leading to an illusion of reality.
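As a simple illustration of this multiple-comparisons effect, the following sketch correlates purely random, unrelated variables and counts how many pairs clear the conventional 5% significance threshold; as the data set widens, spurious "findings" pile up even though every true correlation is zero. The simulation is hypothetical (it assumes NumPy and SciPy) and uses synthetic noise, not any agency's data:

```python
# Illustrative simulation: spurious "significant" correlations in wide, purely random data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_rows = 1000

for n_cols in (10, 100, 300):
    X = rng.normal(size=(n_rows, n_cols))          # independent noise: no real relationships
    false_positives = 0
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            r, p = stats.pearsonr(X[:, i], X[:, j])
            if p < 0.05:
                false_positives += 1               # "significant" by chance alone
    print(f"{n_cols} variables -> {false_positives} spurious correlations at p < 0.05")
```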
The danger is that (1) government will make bad policy decisions believing noise is signal, eroding our quality of life, and (2) as the data grows exponentially, more false positives will require more and more humans to investigate further and dig up more specific personal data on a greater and greater number of both citizens and non-citizens. The secondary uses of this particularized personal data are many, and they offer temptation for abuse.
Data science - especially machine learning, algorithms and future artificial intelligence - will play an important role in big security data analysis and in turning this massive amount of personal data into valuable, actionable information. I respectfully suggest that, as professional data scientists, we have a moral duty to make sure personal data is used responsibly and that actionable intelligence is used for proper legal purposes (e.g., to protect us from harm) and not abused against the people.
What is needed is a type of "magna carta" for government and big business to use our personal data responsibly and a Data Science Code of Professional Conduct to guide and protect data scientists when the temptation to abuse our personal data arises. There should be a legal procedure for data scientists and other technology professionals to report potential data abuses to specified government authorities or watchdog agencies to create an effective check and balance against personal data abuse.
The Internet of Things (IOT) will soon produce a massive volume and variety of data at unprecedented velocity. If "Big Data" is the product of the IOT, "Data Science" is its soul.
Let's define our terms:
Internet of Things (IOT): equipping all physical and organic things in the world with identifying, intelligent devices that allow the near real-time collection and sharing of data between machines and humans. The IOT era has already begun, albeit in its first primitive stage.
Data Science: the study of the creation, validation and transformation of data to create meaning. May involve machine learning, algorithm design, computer science, modeling, statistics, analytics, math, artificial intelligence and business strategy.
Big Data: the collection, storage, analysis and distribution/access of large data sets. Usually includes data sets with sizes beyond the ability of standard software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
We are in the pre-industrial age of data technology and science used to process and understand data. Yet the early evidence provides hope that we can manage and extract knowledge and wisdom from this data to improve life, business and public services at many levels.
To date, the internet has mostly connected people to information, people to people, and people to business. In the near future, the internet will provide organizations with unprecedented data. The IOT will create an open, global network that connects people, data and machines.
Billions of machines, products and things from the physical and organic world will merge with the digital world, allowing near real-time connectivity and analysis. Machines and products (and every physical and organic thing) embedded with sensors and software - connected to other machines, networked systems, and to humans - allow us to cheaply and automatically collect and share data, analyze it and find valuable meaning. Machines and products in the future will have the intelligence to deliver the right information to the right people (or to other intelligent machines and networks), at any time, to any device. When smart machines and products can communicate, they help us and other machines understand, so we can make better decisions, act fast, save time and money, and improve products and services.
The IOT, Data Science and Big Data will combine to create a revolution in the way organizations use technology and processes to collect, store, analyze and distribute any and all data required to operate optimally, improve products and services, save money and increase revenues. Simply put, welcome to the new information age, where we have the potential to radically improve human life (or create a dystopia - a subject for another time).
The IOT will produce gigantic amounts of data. Yet data alone is useless - it needs to be interpreted and turned into information. However, most information has limited value - it needs to be analyzed and turned into knowledge. Knowledge may have varying degrees of value - but it needs specialized manipulation to transform into valuable, actionable insights. Valuable, actionable knowledge has great value for specific domains and actions - yet requires sophisticated, specialized expertise to be transformed into multi-domain, cross-functional wisdom for game changing strategies and durable competitive advantage.
Big data may provide the operating system and special tools to get actionable value out of data, but the soul of the data, the knowledge and wisdom, is the bailiwick of the data scientist.
Revolution R Enterprise 6.1 includes two important advances in high performance predictive analytics with R: (1) big data decision trees, and (2) the ability to easily extract and perform predictive analytics on data stored in the Hadoop Distributed File System (HDFS).
Classification and regression trees are among the most frequently used algorithms for data analysis and data mining. The implementation provided in Revolution Analytics' RevoScaleR package is parallelized, scalable, distributable, and designed with big data in mind.
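The RevoScaleR code itself is not shown here. As a small, generic illustration of what classification and regression trees do (using scikit-learn and synthetic data as assumptions of this sketch, not the Revolution Analytics tooling), fitting both kinds of tree looks like this:

```python
# Generic illustration of classification and regression trees (not the RevoScaleR implementation).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: predict a discrete class label.
X_cls, y_cls = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)
print("classification accuracy (training data):", clf.score(X_cls, y_cls))

# Regression tree: predict a continuous value from a noisy sine curve (synthetic data).
rng = np.random.default_rng(0)
X_reg = rng.uniform(size=(200, 1))
y_reg = np.sin(4 * X_reg[:, 0]) + 0.1 * rng.normal(size=200)
reg = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)
print("regression R^2 (training data):", reg.score(X_reg, y_reg))
```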