Raw data sets - both large and small - are not objective: they are selected, collected, filtered, structured and analyzed by human design. What was measured, in what manner, with what devices and to what purpose? What was not measured, and why? Was only low-hanging fruit measured because the important things could not be measured? What was the quality of the data?
Humans then interpret meaning from data in different ways. Experts can be shown the same sets of data and reasonably come to different conclusions. Naked and hidden biases in selecting, collecting, structuring and analyzing data present serious risks. How we decide to slice and dice data and what elements to emphasize or ignore influences the types and quality of measurements.
For example, the Vietnam War was arguably the first war in modern history to be data-driven. Robert McNamara, the United States Secretary of Defense during the war, instituted a data-driven, analytical strategy for managing it. Prior to public service he had used data science and business analytics to successfully run the Ford Motor Company. Yet war is a different beast than selling cars, where the goal is simply to sell the most cars. War goals are often fuzzy, and selecting the right data to measure is more difficult. McNamara and his team did not adequately understand data biases and often measured the wrong things (e.g., body counts) to manage and evaluate the war. The result was disaster.
Another example is the data-driven culture and swift destruction of Enron. Enron's leadership was highly intelligent and used data-driven, analytical strategies for managing the business. This worked for a time, making Enron among the most valuable firms of its era. Yet success bred arrogance and a dubious ethical culture that often measured the wrong things (e.g., energy market models and accounting). Enron's smart guys thought they were smarter and more data-driven than the competition; only they were smart enough to build the complex models that others could not understand. Of course, the models were flawed, but over-confident belief in the data and in superior analysis created a closed-system logic - without checks and balances to disclose the biases, flaws and errors - leading to disaster.
Here are a few of the ways data scientists sometimes fool themselves:
Confirmation bias: tendency to favor data that confirms beliefs or hypotheses.
Naive Rationalism bias: thinking that the reasons for things are, by default, accessible to you.
Funding - Agency bias: intentional or unconscious skewing of data, assumptions and interpretations to favor the interests of the party that financially supports the data science.
Data selection bias: skewing the selection of data sources toward those most available, convenient and cost-effective, rather than those most valid and relevant for the specific study. Data scientists face budget, data source and time limits - and thus may introduce unconscious bias through the data sets they are able to select and those they exclude.
Cherry picking bias: pointing to individual cases or data that seem to confirm a particular position, while ignoring a significant portion of related cases or data that may contradict that position.
Cognitive bias: skewing decisions based on pre-existing cognitive and heuristic factors (e.g., intuition) rather than on data and evidence. Biases in judgment or decision-making can also result from motivation, such as when beliefs are distorted by wishful thinking. Some biases have a variety of cognitive ("cold") or motivational ("hot") explanations.
Omitted-variable bias: appears in estimates of parameters in a regression analysis when the assumed specification is incorrect, in that it omits an independent variable that should be in the model.
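Omitted-variable bias can be demonstrated with a minimal simulation. The model below is purely illustrative (the coefficients and the correlation between the variables are assumed, not from the text): when a correlated independent variable is left out, the remaining coefficient absorbs its effect and the estimate is biased.

```python
# Illustrative simulation of omitted-variable bias.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Assumed true model: y = 2*x + 3*z + noise, with z correlated with x.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 2 * x + 3 * z + rng.normal(size=n)

# Correct specification: regress y on both x and z.
X_full = np.column_stack([np.ones(n), x, z])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Misspecified: omit z. The x coefficient absorbs z's effect
# (expected value is 2 + 3 * cov(x, z) / var(x) = 4.4, not 2).
X_omit = np.column_stack([np.ones(n), x])
beta_omit, *_ = np.linalg.lstsq(X_omit, y, rcond=None)

print(f"x coefficient, full model: {beta_full[1]:.2f}")  # close to 2.0
print(f"x coefficient, z omitted:  {beta_omit[1]:.2f}")  # close to 4.4
```

The misspecified regression fits the sample perfectly well; nothing in the data alone flags that the estimate is wrong, which is what makes this bias dangerous.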
Sampling bias: systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others and resulting in a biased sample. Skewing the sampling of data sets toward the subgroups most relevant to the initial scope of the data science project makes it unlikely that you will uncover meaningful correlations that may apply to other segments.
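A small simulation makes the sampling-bias point concrete. The population below is hypothetical - two subgroups with different means, one of them easier to reach - and sampling only from the convenient subgroup shifts every estimate drawn from the sample.

```python
# Illustrative simulation of sampling bias (all values are assumed).
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: two subgroups with different means.
group_a = rng.normal(loc=50, scale=5, size=90_000)  # 90% of population
group_b = rng.normal(loc=80, scale=5, size=10_000)  # 10% of population
population = np.concatenate([group_a, group_b])

# Random sample: roughly representative of the whole population.
random_sample = rng.choice(population, size=1_000, replace=False)

# Biased sample: drawn only from the easy-to-reach subgroup A.
biased_sample = rng.choice(group_a, size=1_000, replace=False)

print(f"population mean:    {population.mean():.1f}")   # about 53
print(f"random sample mean: {random_sample.mean():.1f}")
print(f"biased sample mean: {biased_sample.mean():.1f}") # about 50
```

The biased sample looks internally consistent - low variance, clean distribution - yet systematically misses the 10% subgroup that moves the population mean.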
Data Dredging bias: using regression techniques that may find correlations in small or selective samples - but that may not be statistically significant in the wider population.
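Data dredging is easy to reproduce: test enough pure-noise variables against a pure-noise outcome and some will clear a conventional significance threshold by chance alone. The sample size and feature count below are arbitrary choices for illustration.

```python
# Illustrative simulation of data dredging: spurious "findings" in noise.
import numpy as np

rng = np.random.default_rng(2)

n_samples, n_features = 20, 200
outcome = rng.normal(size=n_samples)                  # pure noise
features = rng.normal(size=(n_samples, n_features))   # also pure noise

# Correlate every feature with the outcome.
corrs = np.array([np.corrcoef(features[:, j], outcome)[0, 1]
                  for j in range(n_features)])

# |r| > 0.444 corresponds roughly to p < 0.05 (two-tailed) at n = 20,
# so about 5% of these unrelated features will look "significant".
strong = np.abs(corrs) > 0.444

print(f"spurious 'significant' correlations: {strong.sum()} of {n_features}")
```

Any of those "hits" reported in isolation would look like a real finding; only the knowledge that 200 comparisons were run reveals them as noise.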
Projection bias: tendency to assume that most folks think just like us, though there may be no justification for it. We assume that a consensus exists on matters where there may be none, or display exaggerated confidence when predicting the winner of an election or sports match.
Modeling bias: skewing models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics of fitness - including overfitting of models to past data without regard for predictive lift and failure to score and iterate models in a timely fashion with fresh observational data.
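The overfitting half of modeling bias can be sketched in a few lines. The data-generating process below is assumed to be linear; an over-flexible polynomial fitted to a small training set achieves near-zero training error while its predictive lift on fresh observations collapses.

```python
# Illustrative sketch of overfitting: training error vs. error on fresh data.
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    # Assumed underlying process: linear with noise.
    x = rng.uniform(-1, 1, size=n)
    y = 2 * x + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = make_data(15)      # small training set
x_test, y_test = make_data(1_000)     # fresh observational data

results = {}
for degree in (1, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, "
          f"test MSE {test_mse:.3f}")
```

The degree-12 fit beats the linear model on the training data it memorized, yet loses badly on held-out data - which is why scoring models against fresh observations, not past data, is the relevant metric of fitness.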
Reporting bias: skewing availability of data, such that observations of a certain kind may be more likely to be reported and consequently used in research.
Data-snooping bias: misuse of data mining techniques.
Exclusion bias: systematic exclusion of certain things.
Ingroup bias: tendency to favor one's own group - causes us to overestimate the abilities and values of our immediate group at the expense of others we don't really know.
Observation Selection bias: data is filtered not only by study design and measurement, but by the necessary precondition that there has to be someone doing the study. In situations where the existence of the observer or the study is correlated with the data, observation selection effects occur and anthropic reasoning is required.
Agency problem bias: moral hazard and conflicts of interest may arise in any relationship where one party is expected to act in another's best interests. The problem is that the agent who is supposed to make the decisions that would best serve the principal is naturally motivated by self-interest, and the agent's own best interests may differ from the principal's. The two parties have different interests and asymmetric information (the agent having more), such that the principal cannot directly ensure that the agent is always acting in the principal's best interests - particularly when activities that are useful to the principal are costly to the agent, and when elements of what the agent does are costly for the principal to observe. Agents may hide risks and structure relationships so that when they are right they collect large benefits, and when they are wrong others pay the price.