Health care is ripe for disruption. High prices, inefficiency, a lack of price transparency, dysfunctional regulations and a "one-size-fits-all" approach are a few of the many problems crying out for solutions.
Data science has the potential to lower costs, improve care and personalize medicine. Just as Google changed advertising with data science, a health care revolution will occur as new tools, techniques and data sources become available. Modern medicine focuses on the average patient and does not usually allow for differences between patients. A treatment is deemed effective or ineffective, safe or unsafe, based on double-blind studies that rarely consider the differences between patients. Data science and the proliferation of sensors generating medical data are changing this dynamic.
Data science has the potential to help us make better policy and resource decisions at lower cost and make improved medical decisions based on a patient's specific biology. We can now work on massive data sets effectively, combining data from clinical trials and direct observation by practicing physicians. When we combine data with the resources needed to work on it, we can start asking the important questions, like which treatments work, and for whom.
Data science allows for a completely different approach to treatment. Rather than a treatment that works 80% of the time, or even 100% of the time for 80% of patients, a treatment might be effective for only a small group. It might be entirely specific to the individual - the next cancer patient may have a different protein that's out of control, an entirely different genetic cause for the disease. Treatments that are specific to one patient don't exist in medicine as it's currently practiced.
Three innovations - all involving data science - will converge to improve and personalize health care:
1. Genomics
2. Body Sensors
3. Electronic Medical Records (EMR)
Our health depends on our genes and environmental factors. Recent advances in genomics allow us to determine our entire DNA sequence and understand how knowledge of our specific genome can help us better manage our health. It is also possible to measure tens of thousands of components in blood to obtain a clear molecular picture of healthy and disease states.
Next-generation genomic technologies allow data scientists to drastically increase the amount of genomic data collected on large study populations. When combined with new informatics approaches that integrate many kinds of data with genomic data in disease research, this will lead to a better understanding of the genetic bases of drug response and disease.
Cheap DNA sequencing in the doctor's office will soon be available. In conjunction with inexpensive compute power, the availability of EMR data for studying whether treatments are effective, and improved techniques for analyzing data, personalized medicine at lower cost should become a reality - provided a proper legal and regulatory scheme creates the right incentives.
Wearable body sensors are already a reality, albeit at a primitive stage. Heart rate monitors, blood monitors, body water sensors, lactic acid sensors, testosterone and estrogen sensors and other body-measuring devices will soon provide a wealth of data to help us better manage health.
The proliferation of sensors providing personal health data, together with cheap compute power to store and process it, is bringing medical science and data science together. The result should be better health care for everyone at lower cost.
Electronic Medical Records
Electronic medical records are now required by law in most developed nations. Data becomes far more powerful when you can mix data from different sources. Physician offices, hospitals and the increasing use of body sensors are creating a treasure trove of health data, allowing data scientists to slice, dice and recombine it all into new forms of health knowledge and understanding.
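As a toy illustration of mixing sources, the sketch below joins hypothetical EMR records with hypothetical sensor readings on a shared patient ID. Every field name and value here is invented; real EMR integration involves far messier data and strict privacy controls.

```python
# Toy sketch: combining EMR data with body-sensor data by patient ID.
# All records below are synthetic and purely illustrative.

emr_records = [
    {"patient_id": 1, "diagnosis": "hypertension"},
    {"patient_id": 2, "diagnosis": "diabetes"},
]

sensor_readings = [
    {"patient_id": 1, "avg_heart_rate": 88},
    {"patient_id": 2, "avg_heart_rate": 72},
]

def merge_by_patient(emr, sensors):
    """Produce one combined record per patient from both sources."""
    by_id = {r["patient_id"]: dict(r) for r in emr}
    for s in sensors:
        by_id.setdefault(s["patient_id"], {}).update(s)
    return by_id

combined = merge_by_patient(emr_records, sensor_readings)
print(combined[1])  # one enriched record mixing both sources
```

Once physician, hospital and sensor data share patient-level keys like this, questions that no single source could answer become simple queries over the combined records.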
This new information can help us avoid paying for treatments that are ineffective and help us design a system where the consumer pays only for outcomes.
Today, when physicians order a treatment, whether it's surgery or an over-the-counter medication, they are applying a "standard of care" treatment or some variation based on their own intuition - effectively hoping for the best. Modern medicine does not yet understand the relationship between treatments and outcomes. The proliferation of health data and data science will change the physician's "standard of care" and "intuition" mind-set to one of personalized care based on both evidence and intuitive experience.
Data science will allow us to predict more accurately which treatments will be effective for which patient, and which treatments won’t - improving health care at lower costs.
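The idea can be sketched in miniature: given observations of patient subgroups, treatments and outcomes, we can estimate which treatment has the highest observed success rate for each subgroup. The markers, treatment names and outcomes below are entirely synthetic, and a real analysis would need far more data and statistical care.

```python
# Toy sketch with synthetic data: which treatment works best for
# which patient subgroup? "marker" stands in for any patient-specific
# feature (e.g., a genetic variant); all values are invented.
from collections import defaultdict

# (marker, treatment, outcome) where 1 = success, 0 = no improvement
observations = [
    ("A", "drug_x", 1), ("A", "drug_x", 1), ("A", "drug_y", 0),
    ("B", "drug_x", 0), ("B", "drug_y", 1), ("B", "drug_y", 1),
]

def best_treatment_per_marker(obs):
    """Return, for each subgroup, the treatment with the highest
    observed success rate, along with that rate."""
    totals = defaultdict(lambda: [0, 0])  # (marker, treatment) -> [successes, trials]
    for marker, treatment, outcome in obs:
        totals[(marker, treatment)][0] += outcome
        totals[(marker, treatment)][1] += 1
    best = {}
    for (marker, treatment), (wins, n) in totals.items():
        rate = wins / n
        if marker not in best or rate > best[marker][1]:
            best[marker] = (treatment, rate)
    return best

print(best_treatment_per_marker(observations))
# subgroup A responds to drug_x, subgroup B to drug_y
```

Even this crude grouping shows the shift in question: not "does the treatment work on average?" but "for which patients does it work?"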
Personalized medicine (PM) customizes health care - with treatment tailored to the individual patient. PM may be defined as a comprehensive, prospective approach to preventing, diagnosing, treating and monitoring disease in ways that achieve optimal individual health care decisions.
The United States spends over $2.6 trillion on health care every year, an amount that constitutes an unsustainable fiscal burden for society. These costs include over $600 billion in unexplained variations in treatments - treatments that produce no difference in outcomes, or even make the patient's condition worse. This is unacceptable and unsustainable.
We all want a smarter, more cost-effective health care system where treatments are designed to be effective on our individual biologies; where treatments are administered effectively; where physicians and hospitals are used cost-effectively; and where we pay for outcomes, not procedures.
Data science will play a role in creating this new system by creating a better understanding of the relationship between treatments, outcomes, patients and costs.
Spark - Shark Data Analytics Stack on a Hadoop Cluster Tues. April 23 @6pm
Register Now @ http://bit.ly/11dLSn0
For folks unable to attend in person, register for the event and two hours before it begins we will email you a link to watch via live webcast.
We look forward to meeting you at this must-attend Big Data Week event on Spark, Shark and Hadoop.
University of Colorado Denver - Tuesday April 23, 2013 @ 6:00pm MST Large auditorium (170 person capacity) with 20' screen.
Location: CU Denver - North Classroom #1539 - 1200 Larimer Street Denver, CO 80217-3364 - Map: http://bit.ly/Tyznzg
6:00 - 6:15 Schmooze - Old Chicago Pizza will be served.
6:15 - 8:30 Demonstrate the Spark - Shark Data Analytics Stack on a Hadoop Cluster
8:30 - 9:30 Network at Old Chicago at 14th and Market.
Data scientists need to be able to access and analyze data quickly and easily. The difference between high-value data science and merely good data science is increasingly the ability to analyze larger amounts of data at faster speeds. Speed wins in data science: the ability to provide valuable, actionable insights to the client in a timely fashion can mean the difference between competitive advantage and little or no value added.
One flaw of Hadoop MapReduce is high latency. Given the growing volume, variety and velocity of data, organizations and data scientists require faster analytical platforms. Put simply, speed wins, and Spark gains speed through in-memory caching and optimized communication between the master and worker nodes.
The Berkeley Data Analytics Stack (BDAS) is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos.
Spark is an open source cluster computing system that makes data analytics fast. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.
Spark is a high-speed cluster computing system, compatible with Hadoop, that can outperform it by up to 100 times thanks to its ability to perform computations in memory. It is a computation engine built on top of the Hadoop Distributed File System (HDFS) that efficiently supports iterative processing (e.g., machine learning algorithms) and interactive queries.
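The core idea - load once, cache in memory, query repeatedly - can be illustrated with a plain-Python toy. This is not actual Spark code; in Spark the analogous step is persisting an RDD with cache() so that subsequent jobs hit memory rather than disk.

```python
# Toy illustration (plain Python, not Spark) of in-memory caching:
# read a dataset from disk once, then run many queries against the
# cached copy instead of re-reading the file each time.
import os
import tempfile

# Write a small synthetic "dataset" to disk.
path = os.path.join(tempfile.mkdtemp(), "values.txt")
with open(path, "w") as f:
    f.write("\n".join(str(i) for i in range(1000)))

def load_from_disk(p):
    """Disk-based approach: every query would re-read the file."""
    with open(p) as f:
        return [int(line) for line in f]

# Spark-style approach: read once, cache, query repeatedly.
cached = load_from_disk(path)                  # analogous to rdd.cache()
total = sum(cached)                            # query 1 hits memory
evens = sum(1 for v in cached if v % 2 == 0)   # query 2 hits memory

print(total, evens)
```

With disk-based MapReduce, each of those queries would pay the cost of re-reading (and re-shuffling) the data; with a cached dataset, only the first load touches disk, which is where Spark's iterative and interactive speedups come from.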
Shark is a large-scale data warehouse system that runs on top of Spark and is backward-compatible with Apache Hive, allowing users to run unmodified HiveQL queries on existing Hive warehouses. Shark can answer Hive queries up to 100 times faster when the data fits in memory, and 5-10 times faster when the data is stored on disk, without modification to the data or queries. Like Spark, Shark is open source as part of BDAS.
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.
This presentation covers the nuts and bolts of the Spark, Shark and Mesos Data Analytics Stack on a Hadoop Cluster. We will demonstrate capabilities with a data science use-case.
Michael Malak is a Data Analytics Senior Engineer at Time Warner Cable. He has been pushing computers to their limit since the 1970s. Mr. Malak earned his M.S. in Math from George Mason University. He blogs at http://www.technicaltidbit.com.
Chris Deptula is a Senior System Integration Consultant with OpenBI, responsible for data integration and implementation of Big Data systems. With over five years of experience in data integration, business intelligence and big data platforms, Chris has helped deploy multiple production Hadoop clusters. Prior to OpenBI, Chris was a consultant with FICO implementing marketing intelligence and fraud identification systems. Chris holds a degree in Computer and Information Technology from Purdue University. Follow Chris on Twitter @chrisdeptula.
Michael Walker is a managing partner at Rose Business Technologies, a professional technology services and systems integration firm. He leads the Data Science Professional Practice at Rose. Mr. Walker received his undergraduate degree from the University of Colorado and earned a doctorate from Syracuse University. He speaks and writes frequently about data science and is writing a book on Data Science Strategy for Business. Learn more about the Rose Data Science Professional Practice at http://bit.ly/10TgVHG. Follow Mike on Twitter @Ironwalker76.
Register Now @ http://bit.ly/11dLSn0
On April 10, 2013 Gregory Piatetsky-Shapiro (KDnuggets), Eric Siegel (Predictive Analytics World) and Michael Walker (Rose Business Technologies) discussed whether data science should be an independent profession with a code of professional conduct and self-regulation. See the video here.
Regulation of data science is under consideration (read here and here), and Michael Walker argued that either data science becomes a profession and regulates itself or Congress will impose draconian regulations that defeat the purpose of data science: to make life, business and government better. He has drafted a proposed "Data Science Code of Professional Conduct". See: bit.ly/YbsjXR.
The arguments in support of data science becoming a profession are as follows:
1) Data science is in a pre-industrial stage and needs to develop a "canon" (a body of principles, rules, standards, or norms) of scientific methods, principles and best practices for practitioners. Data science incorporates a number of disciplines, is wide open for innovation, and requires guidance to ensure it is used to make life, business and government better - and to prevent abuse. Ninety percent of the world's data has been produced in the past two years, and data volumes will continue to grow exponentially. How we extract meaning from all this data without creating an illusion of reality is important.
2) To protect both consumers of data science and data scientists from charlatans, illegal and unethical conduct, and data science malpractice. A Data Science Code of Professional Conduct is needed to protect individuals' privacy and clients' confidential data, prevent conflicts of interest, and ensure data scientists have a duty to the greater good of society, not just blind loyalty to the client.
3) Self-regulation versus imposed regulation. Either data science becomes a profession and regulates itself, or Congress will impose both good and bad regulations. It is better for data scientists to architect and implement a regulatory scheme than to trust Congress to enact an appropriate regulatory structure, which may defeat or limit the development of data science.
4) To create a check and balance against big government and big business using data science at the expense of the majority in society. Some argue that the internet, mobile smartphones and computers form a vast surveillance machine that big government and business use to collect information on people, further eroding civil liberties. The potential for abuse is significant, and the professionalization of data science can mitigate these harms.
Reasons to oppose data science becoming a profession include:
1) Professions tend to create artificial barriers to entry causing artificially higher prices.
2) Professions tend to be self-serving at the expense of consumers.
3) Professions - after a period of time - tend to stifle innovation to protect vested interests.
Michael Walker argued that, on balance, the equities favor data science becoming a profession. He pointed out that in many disciplines, such as medical research, economics and psychology, data manipulation is common and the scientific method has not been honored, damaging those fields' reputations and eroding society's trust. Future data scientists need to preempt this outcome by not only honoring the traditional scientific method, but by developing new data science "canons" and scientific methods to liberate meaning from data without creating an illusion of reality.
Eric Siegel is agnostic about whether data science needs to become a profession. Mr. Siegel agreed that data science can be abused, said a code of professional conduct may be useful, and stated that a certification to establish a base level of competency may be prudent. He voiced concern over the civil liberties implications of the use and potential abuse of data.
Gregory Piatetsky-Shapiro argued against data science becoming a profession. He noted that established organizations are already active in this area: the ACM (computing professionals) is considering The Pledge of the Computing Professional, which touches on many themes relevant to data science, and INFORMS offers Analytics Certification programs. He thinks these organizations will be adequate to develop data science.
Mr. Piatetsky-Shapiro asserted that while a code of professional conduct is a noble goal, it is meaningless without a central organization to promote and enforce it, and data science is currently such a diverse field that a central organization is very unlikely. Just looking at current data science-related meetings on the www.kdnuggets.com/meetings page, we see meetings sponsored by research societies like ACM, IEEE, INFORMS and SIAM, commercial companies like O'Reilly, GigaOM and IEG, big data companies like IBM, SAS and EMC, and many others. It looks very unlikely that all these diverse interests will agree to a single organization to enforce any code of conduct.
Further, a recent KDnuggets Poll (March 2013) found that a majority of data scientists voted against a pledge. Yet a majority of non-data scientists supported it, suggesting that consumers of data science would welcome a data science code of professional conduct.
Mr. Walker responded that data science is a new field that encompasses a variety of skill sets from different disciplines and desperately requires a professional body to develop canons that incorporate and blend scientific methods from a myriad of disciplines. The blend of scientific methods will create something new; relying on the methods of math, statistics, computer engineering and other fields alone is not sufficient. Data science requires its own professional canons.
Mr. Walker also asserted that, while a majority of data scientists may not currently favor a "pledge", a large majority of data science consumers would likely favor hiring a data scientist who is certified and required to honor a code of professional conduct - similar to certified public accountants, lawyers and physicians. Considering the significant damage data science malpractice can cause, Walker speculated that the market would favor certified, professionalized data scientists. Moreover, a professional code can protect data scientists from unethical and illegal client conduct.
Mr. Walker suggested that we should learn from other professions like law and medicine - adopt the good and remove the bad to mitigate the negatives of a profession. To earn and maintain trust and credibility, data science must follow traditional scientific methods, innovate new methods and follow a code of professional conduct.