Brevity is the soul of wit. Rose Business Technologies has three (3) rules of effective business communication: 1) be brief; 2) be blunt; 3) be gone. This is good counsel for data scientists.
The goal of data science is to make life, business and government better. This requires communicating data science effectively: data scientists need to learn to draw pictures and tell stories so laypeople can understand actionable insights quickly and easily and make better decisions.
Communicating insights in a timely and understandable manner to decision makers, at all levels, is a learned skill. Many data scientists feel more comfortable - due to years of academic training - communicating valuable information in long white papers filled with mathematical equations or in long memos filled with sophisticated jargon. While this may work in academia or research environments, it will not suffice in the fast-paced business and government world, where the consumers of data science must make decisions quickly.
For example, I received a call from the leadership of a large retail client complaining that a data scientist was unable to communicate results in a way that was understood by the consumers of the data science (a team of sixty plus marketing professionals). We examined the data science and found it to be excellent - full of actionable, valuable insights. Yet the data scientist attempted to communicate the results in a ten (10) page memo full of math equations. The equations were beautiful - but worthless considering that: 1) nobody understood the math; 2) nobody had the time to read a long memo - they needed to make a series of decisions in a short time; and 3) nobody could understand what the data scientist was communicating - they had no clue what insights could help them make better decisions.
Data scientists must learn how to communicate the meaning contained in data in short stories with data visualization. We solved the problem by training the data scientist to use data visualization and short storytelling. We worked with the marketing team to learn what they needed to make better decisions, how best to communicate data science results to them, and how their decision-making processes worked. No more math equations or long memos full of jargon.
The old saw that a picture is worth a thousand words is even more true in data science. Data visualization is a powerful tool to simplify complexity. Data visualization is the visual representation of data: the goal is to communicate information clearly and effectively through graphical means. The picture should provide insights by communicating key aspects in a more intuitive way. It helps if the picture is beautiful, yet data scientists should avoid the trap of creating gorgeous data visualizations that fail to communicate critical information in a fast, easy and intuitive way.
A good book to help learn how to create effective data visualizations is "Visualize This: The FlowingData Guide to Design, Visualization, and Statistics", by Nathan Yau. Although we are constantly exposed to graphics that lack context and provide little actionable insight, Yau separates the signal from the noise and explains the tools to create better data graphics. He shows how to explore data through visual metaphors that tell short stories. The book's companion site offers downloadable data files, interactive examples of how visualization works, and code samples to use as the basis for your own visual experimentation.
Short storytelling is also important. People love stories and can often grasp meaning better through storytelling than through simple fact-telling. Explaining the meaning of data with a good story helps manage its complexity. Communicating actionable, valuable insights with storytelling helps people better understand data context and complexity in a short time. An example of presenting data that tells a story is Steve Wexler's story of STD, HIV and AIDS rates in Texas and how those trends have changed over time. See: http://bit.ly/qCpGai.
In data science, pairing different data sets will often provide valuable insights. For example, integrating genetic data, body sensor data and electronic health records can tell a story that both physicians and patients understand. Another example of data storytelling, regarding health entitlement spending, is told in six (6) charts at http://bit.ly/12hh5sR.
The future of communicating data science is not in long memos full of equations and fancy jargon, or in extensive whitepapers, but in shorter storytelling format using data visualization to make data science results understood easily and quickly.
Hadoop (MapReduce, where code is turned into map and reduce jobs and Hadoop runs the jobs) is great at crunching data yet inefficient for analyzing it, because each time you add, change or manipulate data you must stream over the entire dataset.
In most organizations, data is constantly growing, changing, and being manipulated, so the time needed to analyze it increases significantly.
As a result, for processing large and diverse data sets, ad-hoc analytics, or graph data structures, there must be better alternatives to Hadoop / MapReduce.
Google (whose MapReduce and GFS papers inspired Hadoop) thought so and architected a better, faster data crunching ecosystem that includes Percolator, Dremel and Pregel. Google is one of the key innovative leaders in large-scale architecture.
Percolator is a system for incrementally processing updates to a large data set. By replacing a batch-based indexing system with one based on incremental processing using Percolator, you significantly speed up the process and reduce the time to analyze data.
Percolator’s architecture provides horizontal scalability and resilience. Percolator reduces latency (the time between page crawling and availability in the index) by a factor of 100, and it simplifies the algorithm. Its big advantage is that indexing time becomes proportional to the size of the page being indexed, no longer to the size of the whole existing index.
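The incremental-processing idea can be sketched in a few lines of Python. This is an illustrative toy, not Percolator's actual API; the class, document names and contents are invented. The point is that updating one document costs work proportional to that document, not to the whole index:

```python
# A toy inverted index updated incrementally, in the spirit of Percolator:
# reprocess only the changed document, never the whole corpus.
from collections import defaultdict

class IncrementalIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids
        self.docs = {}                     # doc id -> set of terms

    def upsert(self, doc_id, text):
        """Cost is proportional to the size of this one document."""
        new_terms = set(text.lower().split())
        old_terms = self.docs.get(doc_id, set())
        for term in old_terms - new_terms:     # terms removed from the doc
            self.postings[term].discard(doc_id)
        for term in new_terms - old_terms:     # terms added to the doc
            self.postings[term].add(doc_id)
        self.docs[doc_id] = new_terms

    def lookup(self, term):
        return sorted(self.postings[term.lower()])

idx = IncrementalIndex()
idx.upsert("page1", "big data needs incremental processing")
idx.upsert("page2", "batch processing rescans everything")
idx.upsert("page1", "big data needs fast incremental updates")  # only page1 reprocessed
print(idx.lookup("processing"))   # → ['page2']
```

A batch system would instead rebuild postings for every document on each change, which is exactly the cost Percolator eliminates.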
Dremel is for ad hoc analytics. It is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and a columnar data layout, it can run aggregation queries over trillion-row tables in seconds - about 100 times faster than MapReduce. The system scales to thousands of CPUs and petabytes of data, allowing analysts to scan petabytes in seconds to answer queries.
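The columnar-layout advantage Dremel exploits can be illustrated with a toy table (the data is invented). An aggregation query only needs to touch the columns it references:

```python
# Row-oriented storage: every whole row must be touched to sum one field.
rows = [
    {"region": "east", "product": "widget", "sales": 120},
    {"region": "west", "product": "widget", "sales": 80},
    {"region": "east", "product": "gadget", "sales": 50},
]

# Column-oriented storage: each field is a contiguous array.
columns = {
    "region": ["east", "west", "east"],
    "product": ["widget", "widget", "gadget"],
    "sales": [120, 80, 50],
}

# SELECT SUM(sales): the columnar version scans a single array and never
# reads the region or product values at all.
total_row_oriented = sum(r["sales"] for r in rows)
total_columnar = sum(columns["sales"])
assert total_row_oriented == total_columnar == 250
```

At trillion-row scale, skipping the unreferenced columns (plus compression and parallel execution trees) is what turns hours into seconds.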
Dremel's architecture is similar to Pig and Hive. Yet while Hive and Pig rely on MapReduce for query execution, Dremel uses a query execution engine based on aggregator trees.
Pregel is a system for large-scale graph processing and graph data analysis, designed to execute graph algorithms faster and with simpler code. It computes over large graphs much faster than alternatives, and its application programming interface is easy to use.
Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
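Pregel's vertex-centric, superstep-based model can be sketched on a single machine. This toy (the graph and values are invented) propagates the maximum value through a graph - a classic Pregel example - with each superstep exchanging messages and halting when nothing changes:

```python
# Minimal bulk-synchronous sketch in the spirit of Pregel: every superstep,
# each vertex sends its value to its neighbors, then adopts the largest
# value it received; the computation halts when no value changes.
def propagate_max(graph, values):
    """graph: vertex -> list of neighbor vertices; values: vertex -> int."""
    superstep = 0
    changed = True
    while changed:
        changed = False
        messages = {v: [] for v in graph}
        for v, neighbors in graph.items():        # "send" phase
            for n in neighbors:
                messages[n].append(values[v])
        for v in graph:                           # "compute" phase
            incoming = max(messages[v], default=values[v])
            if incoming > values[v]:
                values[v] = incoming
                changed = True
        superstep += 1
    return values, superstep

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
values = {"a": 3, "b": 6, "c": 1}
final, steps = propagate_max(graph, values)
print(final)   # every vertex converges to the maximum value, 6
```

In real Pregel the vertices live on thousands of machines and the synchronous superstep barrier is what makes distributed graph programs easy to reason about.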
The goal is to design and build a data warehouse / business intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem for each unique organization.
A traditional BI architecture has analytical processing first pass through a data warehouse.
In the new, modern BI architecture, data reaches users through a multiplicity of organizational data structures, each tailored to the type of content it contains and the type of user who wants to consume it.
The data revolution (big and small data sets) provides significant improvements. New tools like Hadoop allow organizations to cost-effectively consume and analyze large volumes of semi-structured data. In addition, the new architecture complements traditional top-down data delivery methods with more flexible, bottom-up approaches that promote predictive or exploratory analytics and rapid application development.
In the above diagram, the objects in blue represent traditional data architecture. Objects in pink represent the new modern BI architecture, which includes Hadoop, NoSQL databases, high-performance analytical engines (e.g. analytical appliances, MPP databases, in-memory databases), and interactive, in-memory visualization tools.
Most source data now flows through Hadoop, which primarily acts as a staging area and online archive. This is especially true for semi-structured data, such as log files and machine-generated data, but also for some structured data that cannot be cost-effectively stored and processed in SQL engines (e.g. call center records).
From Hadoop, data is fed into a data warehousing hub, which often distributes data to downstream systems, such as data marts, operational data stores, and analytical sandboxes of various types, where users can query the data using familiar SQL-based reporting and analysis tools.
Today, data scientists analyze raw data inside Hadoop by writing MapReduce programs in Java and other languages. In the future, users will be able to query and process Hadoop data using familiar SQL-based data integration and query tools.
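The shift described above - from hand-coded jobs to familiar SQL - can be illustrated with a small sketch. Here SQLite stands in for a SQL-on-Hadoop query engine, and the log table and its contents are invented:

```python
# The same aggregation written two ways: as a hand-coded pass (what a
# MapReduce job computes) and as a declarative SQL query (the interface
# business users already know).
import sqlite3

records = [("2013-01-01", "search", 3), ("2013-01-01", "click", 1),
           ("2013-01-02", "search", 5)]

# Hand-coded aggregation, MapReduce-style.
by_event = {}
for _, event, cnt in records:
    by_event[event] = by_event.get(event, 0) + cnt

# The same result via SQL (SQLite as a stand-in engine).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (day TEXT, event TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", records)
sql_result = dict(conn.execute(
    "SELECT event, SUM(cnt) FROM logs GROUP BY event"))

assert by_event == sql_result == {"search": 8, "click": 1}
```

The point is not the engine but the interface: once Hadoop data is queryable in SQL, the audience for it expands from programmers to every analyst in the organization.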
The modern BI architecture can analyze large volumes and new sources of data and is a significantly better platform for data alignment, consistency and flexible predictive analytics.
Thus, the new BI architecture provides a modern analytical ecosystem featuring both top-down and bottom-up data flows that meet all requirements for reporting and analysis.
In the top-down world, source data is processed, refined, and stamped with a predefined data structure--typically a dimensional model--and then consumed by casual users using SQL-based reporting and analysis tools. In this domain, IT developers create data and semantic models so business users can get answers to known questions and executives can track performance of predefined metrics. Here, design precedes access. The top-down world also takes great pains to align data along conformed dimensions and deliver clean, accurate data. The goal is to deliver a consistent view of the business entities so users can spend their time making decisions instead of arguing about the origins and validity of data artifacts.
Creating a uniform view of the business from heterogeneous sets of data is not easy. It takes time, money, and patience, often more than most departmental heads and business analysts are willing to tolerate. They often abandon the top-down world for the underworld of spreadmarts and data shadow systems. Using whatever tools are readily available and cheap, these data-hungry users create their own views of the business. Eventually, they spend more time collecting and integrating data than analyzing it, undermining both their productivity and a consistent view of business information.
The bottom-up world works differently. Modern BI architecture creates an analytical ecosystem that brings prodigal data users back into the fold. It allows an organization to perform true ad hoc exploration (predictive or exploratory analytics) and promotes the rapid development of analytical applications using in-memory departmental tools. In a bottom-up environment, users can't anticipate the questions they will ask on a daily or weekly basis or the data they'll need to answer those questions. Often, the data they need doesn't yet exist in the data warehouse.
The modern BI architecture creates analytical sandboxes that let power users explore corporate and local data on their own terms. These sandboxes include Hadoop, virtual partitions inside a data warehouse, and specialized analytical databases that offload data or analytical processing from the data warehouse or handle new untapped sources of data, such as Web logs or machine data. The new environment also gives department heads the ability to create and consume dashboards built with in-memory visualization tools that point both to a corporate data warehouse and other independent sources.
Combining top-down and bottom-up worlds is challenging but doable with determined commitment.
BI professionals need to guard data semantics while opening access to data.
Business users need to commit to adhering to data standards.
Further, well-designed data governance programs are an absolute requirement.
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. The term is also used to describe the exponential growth, availability and use of information, both structured and unstructured.
Technologies today not only support the collection and storage of large amounts of data, they provide the ability to understand and take advantage of its full value, which helps organizations run more efficiently and profitably. For instance, with big data and big data analytics, organizations can analyze all of their data in full context rather than relying on subsets.
Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. What is the point of collecting and storing terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data.
A number of recent technology advancements enable organizations to make the most of big data and big data analytics.
Big data technologies not only support the ability to collect large amounts of data, they provide the ability to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.
• As much as 80% of the world’s data is now in unstructured formats, created and held on the web. This data is increasingly associated with cloud-based services used outside enterprise IT. The part of Big Data expected to drive explosive growth and new value is the unstructured data arising mostly from these external sources.
• Data sets are growing at a staggering pace - expected to grow by 100% every year for at least the next 5 years.
• Most of this data is unstructured or semi-structured – generated by servers, network devices, social media, and distributed sensors.
• “Big Data” refers to such data because the volume (petabytes and exabytes), the type (semi- and unstructured, distributed), and the speed of growth (exponential) make the traditional data storage and analytics tools insufficient and cost-prohibitive.
• An entirely new set of processing and analytic systems is required for Big Data; Apache Hadoop is one Big Data processing system that has gained significant popularity and acceptance.
• According to a recent McKinsey Big Data report, Big Data can provide up to USD $300 billion annual value to the US Healthcare industry, and can increase US retail operating margins by up to 60%. It’s no surprise that Big Data analytics is quickly becoming a critical priority for large enterprises across all verticals.
Big Data Characteristics
Volume: there is a lot of data to be analyzed and/or the analysis is extremely intense; either way, a lot of hardware is needed.
Variety: the data is not organized into simple, regular patterns as in a table; rather text, images and highly varied structures—or structures unknown in advance—are typical.
Velocity: the data comes into the data management system rapidly and often requires quick analysis or decision making.
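Velocity often calls for windowed, on-the-fly analysis rather than rescanning history for every new reading. A minimal sliding-window sketch (the monitor class and sensor readings are invented for illustration):

```python
# A fixed-size sliding window keeps analysis of a fast stream cheap
# enough to run as each reading arrives.
from collections import deque

class SlidingAverage:
    def __init__(self, window):
        self.buf = deque(maxlen=window)   # old readings fall off automatically

    def add(self, value):
        self.buf.append(value)
        return sum(self.buf) / len(self.buf)

monitor = SlidingAverage(window=3)
readings = [10, 12, 11, 50, 52]
averages = [monitor.add(r) for r in readings]
print(averages[-1])   # average of the last three readings: (11 + 50 + 52) / 3
```

Each new value costs a constant amount of work regardless of how long the stream has been running, which is the property high-velocity systems need.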
• Volume, variety, velocity, and complexity of incoming data streams
• Growth of the “Internet of Things” results in an explosion of new data
• Commoditization of inexpensive terabyte-scale storage hardware is making storage less costly - so why not store it?
• Increasingly, enterprises need to store non-traditional and unstructured data in a way that is easily queried
• Desire to integrate all the data into a single source
• The power of compression
• Data comes from many different sources (enterprise apps, web, search, video, mobile, social conversations and sensors)
• All of this information has been getting increasingly difficult to store in traditional relational databases and even data warehouses
• Unstructured or semi-structured text is difficult to query - how does one query a table with a billion rows?
• Culture, skills, and business processes
• Conceptual data modeling
• Data quality management
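The compression point above is easy to demonstrate: repetitive machine-generated data compresses dramatically, which is one reason storing everything has become affordable. A quick sketch with Python's zlib (the log lines are invented):

```python
# Compress 10,000 near-identical web-server log lines and report the ratio.
import zlib

log = b"\n".join(
    b"2013-05-01 10:%02d:00 GET /index.html 200" % (i % 60)
    for i in range(10_000)
)
packed = zlib.compress(log, 9)          # maximum compression level
ratio = len(log) / len(packed)
print(f"{len(log)} bytes -> {len(packed)} bytes ({ratio:.0f}x smaller)")
```

Columnar stores and Hadoop file formats exploit exactly this redundancy, so the effective cost per stored terabyte is far below the raw hardware price.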
Emerging capabilities to process vast quantities of structured and unstructured data are bringing about changes in technology and business landscapes.
As data sets get bigger and the time allotted to their processing shrinks, look for ever more innovative technology to help organizations glean the insights they'll need to face an increasingly data-driven future.
What is Hadoop?
The most well-known technology used for Big Data is Hadoop. It was inspired by Google’s publications on MapReduce, GoogleFS and BigTable. Because Hadoop can be hosted on commodity hardware (usually Intel PCs running Linux with one or two CPUs and a few TB of HDD, without any RAID replication technology), it allows organizations to store huge quantities of data (petabytes or even more) at very low cost compared to SAN systems.
Hadoop is an open-source version of Google’s MapReduce framework. It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation: http://hadoop.apache.org/.
The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop:
Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into many smaller blocks, each of which is replicated and stored on three servers by default (the replication factor is configurable) for fault tolerance.
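The split-and-replicate behavior can be sketched as follows. The block size, server names, and round-robin placement are invented for illustration (real HDFS uses 64-128 MB blocks and rack-aware placement):

```python
# Split a byte string into fixed-size blocks and assign each block to
# three servers, mimicking HDFS's default replication factor.
import itertools

def place_blocks(data, block_size, servers, replicas=3):
    """Return a list of (block, server-list) placements."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    rotation = itertools.cycle(servers)
    placement = []
    for block in blocks:
        targets = [next(rotation) for _ in range(replicas)]
        placement.append((block, targets))
    return placement

servers = ["node1", "node2", "node3", "node4"]
plan = place_blocks(b"0123456789abcdef", block_size=4, servers=servers)
for block, targets in plan:
    print(block, "->", targets)
# Each 4-byte block ends up on 3 of the 4 servers, so any single
# server can fail without losing data.
```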
Hadoop MapReduce is a way to split every request into smaller requests that are sent to many small servers, allowing truly scalable use of CPU power.
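The map / shuffle / reduce shape is easiest to see in a single-machine word count, the canonical MapReduce example (the input documents here are invented). Hadoop runs the same three phases, but spread across many servers:

```python
# Word count in the MapReduce shape: map emits (word, 1) pairs, a shuffle
# groups the pairs by key, and reduce sums each group.
from collections import defaultdict

def map_phase(text):
    return [(word, 1) for word in text.lower().split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big insight", "data drives decisions"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])   # → 2 2
```

Because every map call and every reduce call is independent, Hadoop can assign them to different machines and scale the same program from one laptop to thousands of nodes.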
What problems can Hadoop solve?
• The Hadoop framework is used by major players including Google, Yahoo, IBM, eBay, LinkedIn and Facebook, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.
• The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
• Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built.
Big Data Market
The Big Data market is on the verge of a rapid growth spurt that will see it top the USD $50 billion mark worldwide within the next five years.
As of early 2012, the Big Data market stands at just over USD $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.
Enhancing Fraud Detection for Banks and Credit Card Companies Scenario
• Build up-to-date models from transactional data to feed real-time risk-scoring systems for fraud detection.
• Analyze volumes of data with response times that are not possible today.
• Apply analytic models to individual clients, not just client segments.
• Detect transaction fraud in progress and allow fraud models to be updated in hours rather than weeks.
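Per-client risk scoring of the kind the scenario describes can be sketched as a simple z-score test: score each new transaction against that client's own history rather than a coarse segment average. The amounts and the 3-sigma threshold are invented for illustration; production systems use far richer models:

```python
# Score a new transaction amount against one client's spending history.
import statistics

def risk_score(history, amount):
    """Z-score of a new amount relative to the client's own history."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0   # avoid division by zero
    return (amount - mean) / stdev

client_history = [25.0, 30.0, 22.0, 28.0, 35.0]   # this client's usual spend
score = risk_score(client_history, 400.0)
flagged = score > 3.0        # 3-sigma rule: flag for human review
print(round(score, 1), flagged)
```

A $400 charge is unremarkable for some clients and wildly anomalous for this one - which is exactly why scoring the individual beats scoring the segment.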
Social Media Analysis for Products, Services and Brands Scenario
• Monitor data from various sources such as blogs, boards, news feeds, tweets, and social media for information pertinent to the brand and its products, as well as competitors.
• Extract and aggregate relevant topics and relationships, discover patterns, and reveal up-and-coming topics and trends.
• Brand management for marketing campaigns; brand protection for ad placement networks.
Store Clustering Analysis in the Retail Industry Scenario
• A retailer with a large number of stores needs to understand cluster patterns of shoppers.
• Use shopping patterns across multiple characteristics such as location, income, and family size for better product placement.
• Store-specific clustering of products, and clustering specific types of products by location.
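Store clustering of this kind is commonly done with k-means. A minimal pure-Python sketch - the store data, the two features, and k=2 are all invented for illustration:

```python
# Group stores by (average basket $, median household income $k) using
# a bare-bones k-means loop.
def kmeans(points, k, iterations=20):
    centroids = points[:k]                      # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign to nearest centroid
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean_point(c) or centroids[i]   # recompute centers
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(cluster):
    if not cluster:
        return None
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

# Six hypothetical stores: three low-spend, three high-spend.
stores = [(20, 40), (22, 42), (21, 38), (60, 95), (62, 99), (58, 92)]
centroids, clusters = kmeans(stores, k=2)
print(sorted(len(c) for c in clusters))   # → [3, 3]
```

The recovered clusters then drive store-specific assortment and placement decisions, as the scenario describes.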
Healthcare and Energy Industry Scenario
IBM Stream Computing for Smarter Healthcare
IBM Watson pairs natural language processing with predictive root cause analysis.
InfoSphere Streams-based analytics can alert hospital staff to impending life-threatening infections in premature infants up to 24 hours earlier than current practices.
Vestas Wind Systems uses IBM big data analytics software and powerful IBM systems to improve wind turbine placement for optimal energy output.