Data science is at an early stage of development and needs canons to guide its practitioners. There is a brewing debate about the use of established scientific methods in the practice of data science. Some suggest traditional scientific methods must be used, while others assert that new scientific methods must be developed - especially considering algorithms, machine learning and future artificial intelligence. Part of that debate is whether it is necessary to form a hypothesis. I suggest the answer is that it depends.
Let us stipulate that "Data Science" means the scientific study of the creation, manipulation and transformation of data to create meaning and "Data Scientist" means a professional who uses scientific methods to liberate and create meaning from raw data.
Simply put, a hypothesis is a proposed explanation for a phenomenon and is part of the scientific method. “Scientific method” means a method of research in which a problem is identified, relevant data are gathered, a hypothesis is formulated from these data, and the hypothesis is empirically tested.
It is suggested that best-practice data science methods consist of the following steps:
(1) Careful observations of data, data sets and relationships between data.
(2) Deduction of meaning from the data and different data relationships.
(3) Formation of a hypothesis.
(4) Experimental or observational testing of the validity of the hypothesis.
To be termed scientific, a method of inquiry must be based on empirical and measurable evidence subject to specific principles of reasoning.
I suggest there is a difference between using the scientific method for hard science and using it for business or policy purposes. For a hypothesis to be a scientific hypothesis, the scientific method requires that one can test it. Scientists generally base scientific hypotheses on previous observations that cannot satisfactorily be explained with the available scientific theories.
For a hypothesis to be a business or policy hypothesis, the standards may be different. In fact, it may or may not be necessary to formulate a hypothesis for business or public policy purposes - depending on the subject matter and context. Note that even without a hypothesis, it is prudent to use standard scientific methods to measure and record any experimental or test results for optimal decision making and continuous improvement.
A data scientist working on a business or policy case may find a number of statistically significant correlations in the data without proof of causation. Sometimes this matters (for reasons that may be difficult to understand), and other times it may not. Absent causation, these correlations may or may not have value. It depends on domain and context.
For example, we were recently engaged by a large financial firm to find meaning in data to help market and sell certain financial products. In one case, we found a strong correlation between two variables, suggesting that the purchase of one product increased the purchase of another product. There was no rational explanation for this correlation and no way to prove causation.
We suggested a number of controlled experiments to test different strategies. Human purchasing behavior is tricky business, yet by running a number of experiments we found the optimal marketing and selling process that significantly increased sales. This process would not meet traditional scientific standards - yet it worked for this particular purpose. Note that we followed traditional scientific practice in designing and executing the experiments to accurately measure and record results. No hypothesis, no expectations - just pure trial and error to see what worked and what did not, and an attempt to explain why. To be fair, it may be argued that by selecting and designing the experiments in a certain manner we were in fact formulating and testing hypotheses.
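One common way to measure and record the outcome of a controlled experiment like those described above is a two-proportion z-test on conversion rates. The sketch below is a minimal version under stated assumptions: the function name `two_proportion_z_test` and all counts are hypothetical, and nothing here is taken from the actual engagement. Note that the test itself compares against a null hypothesis of "no difference", which is consistent with the caveat that designing an experiment may amount to testing a hypothesis after all.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (lift, z, two-sided p-value) for variant B versus control A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, z, p_value


# Hypothetical results: control offer vs. a new selling process (made-up counts).
lift, z, p_value = two_proportion_z_test(conv_a=120, n_a=2000, conv_b=165, n_b=2000)
print(f"lift = {lift:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

When several strategies are tried in a series of experiments, each comparison can be recorded this way, though the more comparisons are run, the more likely it is that one looks significant by chance.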
The dirty secret in business and public policy (but not hard scientific disciplines) - when dealing with unpredictable human behavior - is that running many experiments is often (but not always) superior to creating a model to test a hypothesis. All models are flawed to some degree. For example, when attempting to find one or more causal variables in a financial model, it is vital to identify why the hypothesis could be true before crunching data, because it is a generalized model. Yet you do not need to build a general model to understand human behavior and purchase patterns. Having found a strong correlation between A and B and increased sales, you can run an experiment, or better yet a series of controlled experiments, to see what works. You don't even need to know why it works or does not work - although that would be nice.
Thus, the answer to whether it is necessary to form a hypothesis depends on subject matter and context. In hard scientific disciplines (e.g., biometrics/econometrics), the answer is absolutely yes. In business and public policy, it is sometimes yes (e.g., health/legal system) and sometimes no (e.g., marketing/sales). Further complicating matters is the design and execution of algorithms and machine learning when practicing data science. I suggest data science needs to develop canons to guide us in formulating appropriate scientific methods to use in different circumstances.
Please note that I strongly support the use of data, statistical and quantitative analysis, explanatory and predictive models, and evidence-based decision making. But sometimes controlled experiments without models or hypotheses are the right way to go. Selecting the best method for the job at hand is the art of data science.