Selecting the right Business Intelligence (BI) and Analytics Platform for a particular organization is challenging. Each organization - in both the public and private sectors - has a legacy IT ecosystem as well as unique competencies, skills, knowledge / business processes and people. As a result, the selection process should factor in these realities and include all stakeholders, including data scientists.
Gartner recently released the "2013 Business Intelligence and Analytics Platforms Magic Quadrant" as shown above. I suggest using Gartner's Magic Quadrant as one source of guidance - albeit with a heavy grain of salt, since there may be conflicts of interest in Gartner's ratings. It is wise to give more weight to organizational requirements and best platform fit than to vendor ratings. In addition, there is no perfect BI / Analytical Platform - each has strengths and weaknesses.
It is advisable to include data scientists in the selection process. They will provide invaluable information about the strengths and weaknesses of various BI / Analytical Platforms. Moreover, they use a variety of analytical tools in their work and know best how those tools integrate with different BI / Analytical Platforms.
Many organizations are electing to architect and build their own BI / Analytical Platforms using a mixture of open source and proprietary technologies. This has the advantage of greater technology agility and flexibility, with the potential for lower costs. The disadvantage is the greater expertise and time required to design, build and maintain the platform. Many value IT agility and flexibility for the future over other considerations. Others assign greater weight to the perceived security of a major vendor's black box.
I strongly suggest using a third-party professional - usually an independent professional IT services firm or systems integrator - to provide counsel in the selection process. They can provide an independent opinion - free of conflicts of interest - on which vendor(s) best meet the needs of the organization. They can also help design and implement an overall Information Management Plan.
It is critical to have an organizational Information Management Plan to get optimal use of the BI / Analytical Platform. This may entail re-engineering knowledge / business processes and painful change management. Yet the short-term pain may be worth the long-term reward.
Recent surveys suggest the number one investment area for both private and public organizations is the design and building of a modern data warehouse (DW) / business intelligence (BI) / data analytics architecture that provides a flexible, multi-faceted analytical ecosystem. The goal is to leverage both internal and external data to obtain valuable, actionable insights that allow the organization to make better decisions.
Unfortunately, the pace of recent DW / BI / data analytics innovation - and the number of competing themes and paths - is causing confusion. The "Big Data" and "Hadoop" hype is leading many organizations to roll out Hadoop / MapReduce systems as data dumping grounds, without a big-picture information management strategic plan or an understanding of how all the pieces of a data analytics ecosystem fit together to optimize decision-making capabilities.
This has resulted in the creation of a new word: Hadump - meaning data dumped into Hadoop with no plan. There are two schools of thought about data collection and storage strategy:
1) Start a big data analytics project with a specific use case or problem to solve
2) Start dumping data into storage now and analyze it later
We strongly suggest using both strategies: one delivers quick, short-term results, the other builds long-term value.
Consider that only about 30% of all collected data will prove valuable. The problem is that you do not know in advance which 30% that will be. Thus, it is prudent to collect and store all data: structured and unstructured, internal and external.
The cost of collecting and storing data - and of data analytics technology - has fallen significantly and will keep falling.
The cost of analyzing the data for valuable, actionable insights remains very high. While machine learning and automation will reduce this cost in the future, the formula of cheap, abundant data and expensive data science and business analytics will likely hold for some time.
Thus, start a data analytics project to solve a specific problem or to take advantage of an opportunity to demonstrate value. Yet understand the long-term value of saving any and all data for future analysis, for when a specific use case arises.
More importantly, it is crucial to spend time and resources to develop both an information management strategic plan and decision-optimizing processes. Data science knowledge and business processes detailing the collection, storage, analysis and distribution of data are the magic sauce that orchestrates the data technology ingredients.
In a traditional BI architecture, analytical processing first passes through a data warehouse.
In the new, modern BI architecture, data reaches users through a multiplicity of organizational data structures, each tailored to the type of content it contains and the type of user who wants to consume it.
The data revolution (big and small data sets) provides significant improvements. New tools like Hadoop allow organizations to cost-effectively consume and analyze large volumes of semi-structured data. In addition, the new architecture complements traditional top-down data delivery methods with more flexible, bottom-up approaches that promote predictive and exploratory analytics and rapid application development.
In the above diagram, the objects in blue represent traditional data architecture. Objects in pink represent the new modern BI architecture, which includes Hadoop, NoSQL databases, high-performance analytical engines (e.g. analytical appliances, MPP databases, in-memory databases), and interactive, in-memory visualization tools.
Most source data now flows through Hadoop, which primarily acts as a staging area and online archive. This is especially true for semi-structured data, such as log files and machine-generated data, but also for some structured data that cannot be cost-effectively stored and processed in SQL engines (e.g. call center records).
From Hadoop, data is fed into a data warehousing hub, which often distributes data to downstream systems, such as data marts, operational data stores, and analytical sandboxes of various types, where users can query the data using familiar SQL-based reporting and analysis tools.
Today, data scientists analyze raw data inside Hadoop by writing MapReduce programs in Java and other languages. In the future, users will be able to query and process Hadoop data using familiar SQL-based data integration and query tools.
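As a rough illustration of what that looks like today, the sketch below shows a minimal word-count-style job written in Python for Hadoop Streaming: one script that acts as the mapper or the reducer depending on its argument. The log format, field positions and job itself are hypothetical; the point is simply the map-and-reduce pattern applied to raw files sitting in Hadoop.

```python
#!/usr/bin/env python
# Minimal, hypothetical Hadoop Streaming job: count hits per page in raw web log lines.
# Run as the mapper with "python logcount.py map" and as the reducer with "python logcount.py reduce".
import sys


def mapper():
    # Assume each input line is a raw log record whose first whitespace-separated field is a page URL.
    for line in sys.stdin:
        fields = line.strip().split()
        if fields:
            # Emit "key<TAB>1"; Hadoop sorts these pairs by key before the reduce phase.
            print("%s\t1" % fields[0])


def reducer():
    # Input arrives sorted by key, so all counts for one page are contiguous.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key == current_key:
            count += int(value)
        else:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

On a cluster, this pair would typically be submitted through the hadoop-streaming jar with its -input, -output, -mapper and -reducer options. Note that every run streams over the full input, a limitation discussed further below.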
The modern BI architecture can analyze large volumes and new sources of data and is a significantly better platform for data alignment, consistency and flexible predictive analytics.
Thus, the new BI architecture provides a modern analytical ecosystem featuring both top-down and bottom-up data flows that meet all requirements for reporting and analysis.
After designing and building a modern data warehouse / business intelligence / data analytics ecosystem, many clients are frustrated that they are still unable to extract valuable, actionable insights from their data.
The solution is to develop data science business and knowledge processes and to engage data scientists to gain business understanding. It may be helpful to think of data scientists like lawyers: highly trained knowledge workers who can be hired full-time in-house or engaged independently on a time or project basis.
In the above image, the rectangular boxes represent an approximate data flow and circles represent process flow. The processes and data are arranged by degree of effort and degree of information structure.
Data analysis may be driven by bottom-up (from data to theory) or top-down (from theory to data) processes - or a mixture of both.
Such data science processes include:
1) Information gathering
2) Re-representation of the information in a schema that aids analysis
3) Development of insight through the manipulation of this representation
4) Creation of some knowledge product or direct action based on the insight
Data Science for Business Understanding Formula
Information > Schema > Insight > Product
The data analysis may be organized in two key loops:
1) Searching loop (seeking, extracting, filtering information)
2) Understanding loop (modeling and conceptualization from a schema that best fits the evidence)
New academic research suggests that companies using this kind of data science and business analytics to guide their decisions are more productive and have higher returns on equity than competitors that do not. As data science changes the game for virtually all industries, it will tilt the playing field, favoring some over others.
Top Benefits of Data Science for Better Business Decisions Formula
Data science in business is about having the right information and insight to create better business outcomes. Business analytics means leaders know where to find new revenue opportunities and which product or service offerings are most likely to address market requirements. It means the ability to quickly access the right data points to evaluate key performance and revenue indicators when building successful growth strategies. And it means recognizing regulatory, reputational, and operational risks before they become realities.
1) Having the knowledge you need: Data science delivers insightful information in context so decision makers have the right information where, when and how they need it.
2) Making better, faster decisions: Data science provides decision makers throughout the organization with the interactive, self-service environment needed for exploration and analysis.
3) Optimizing business performance: Data science enables decision makers to easily measure and monitor financial and operational business performance, analyze results, predict outcomes and plan for better business results.
4) Uncovering new business opportunities: Data science delivers new insights that help the organization maximize customer and product profitability, minimize customer churn, detect fraud and increase campaign effectiveness.
Hadoop (with MapReduce, where code is turned into map and reduce jobs that Hadoop runs) is great at crunching data, yet inefficient for analyzing data, because each time you add, change or manipulate data you must stream over the entire dataset.
In most organizations, data is always growing, changing and being manipulated, so the time needed to analyze it keeps increasing.
As a result, for processing large and diverse data sets, ad-hoc analytics and graph data structures, there must be better alternatives to Hadoop / MapReduce.
Google (whose MapReduce and file system papers inspired Hadoop) thought so and architected a better, faster data crunching ecosystem that includes Percolator, Dremel and Pregel. Google is one of the key innovative leaders in large-scale architecture.
Percolator is a system for incrementally processing updates to large data sets. By replacing a batch-based indexing system with one based on incremental processing using Percolator, Google significantly sped up indexing and reduced the time to analyze data.
Percolator's architecture provides horizontal scalability and resilience. It reduces latency (the time between a page being crawled and its availability in the index) by a factor of 100 and simplifies the algorithm. The big advantage of Percolator is that indexing time is now proportional to the size of the page being indexed, no longer to the size of the whole existing index.
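Percolator itself is built on Bigtable transactions and observers, which is well beyond a short example, but the core idea - doing work proportional to what changed rather than rebuilding everything - can be sketched in a few lines of Python. The corpus and inverted index below are hypothetical.

```python
# Hypothetical sketch of batch vs. incremental indexing, illustrating the Percolator idea:
# work should be proportional to the size of the change, not the size of the whole corpus.

def build_index(corpus):
    """Batch approach: rebuild the entire inverted index from every document."""
    index = {}
    for doc_id, text in corpus.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


def update_index(index, doc_id, text):
    """Incremental approach: fold a single new document into the existing index."""
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)
    return index


corpus = {"doc1": "big data analytics", "doc2": "data warehouse design"}
index = build_index(corpus)  # one expensive full pass over everything

# A new page arrives: touch only that page instead of re-scanning the whole corpus.
index = update_index(index, "doc3", "incremental data processing")
print(sorted(index["data"]))  # ['doc1', 'doc2', 'doc3']
```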
Dremel is for ad hoc analytics. It is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees with a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds - roughly 100 times faster than MapReduce - and it scales to thousands of CPUs and petabytes of data, allowing analysts to scan petabytes in seconds to answer queries.
Dremel's architecture is similar to Pig and Hive. Yet while Hive and Pig rely on MapReduce for query execution, Dremel uses a query execution engine based on aggregator trees.
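Dremel's execution trees and nested columnar format are far more sophisticated than anything that fits here, but the benefit of a columnar layout for aggregation queries can be shown with a toy example: the aggregate touches only the columns it needs, never the full rows. The table and field names below are invented.

```python
# Toy illustration of why a columnar layout helps an aggregation query such as:
#   SELECT country, SUM(revenue) FROM sales GROUP BY country
from collections import defaultdict

# Row-oriented layout: every record carries every field, so a scan reads them all.
rows = [
    {"order_id": 1, "country": "US", "product": "widget", "revenue": 120.0},
    {"order_id": 2, "country": "DE", "product": "gadget", "revenue": 80.0},
    {"order_id": 3, "country": "US", "product": "gadget", "revenue": 50.0},
]

# Column-oriented layout: each field is stored contiguously; the query reads only two columns.
columns = {
    "country": ["US", "DE", "US"],
    "revenue": [120.0, 80.0, 50.0],
}

totals = defaultdict(float)
for country, revenue in zip(columns["country"], columns["revenue"]):
    totals[country] += revenue
print(dict(totals))  # {'US': 170.0, 'DE': 80.0}
```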
Pregel is a system for large-scale graph processing and graph data analysis. Pregel is designed to execute graph algorithms quickly with simple code. It computes over large graphs much faster than alternatives, and its application programming interface is easy to use.
Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
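Google describes Pregel's API in C++, but its "think like a vertex" bulk-synchronous model is easy to sketch on one machine. The toy program below propagates the maximum value through a small graph, a classic introductory example for this model: in each superstep every active vertex reads its incoming messages, updates its value and messages its neighbors, halting when no messages remain. The graph and values are made up.

```python
# Toy single-machine sketch of the Pregel vertex-centric, superstep-based model.

graph = {  # vertex -> list of neighboring vertices (hypothetical graph)
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}
values = {"a": 3, "b": 6, "c": 2, "d": 1}  # initial per-vertex values

messages = {v: [values[v]] for v in graph}  # superstep 0: each vertex "receives" its own value
superstep = 0
while any(messages.values()):
    outgoing = {v: [] for v in graph}
    for vertex, inbox in messages.items():
        if not inbox:
            continue  # no messages: the vertex stays inactive this superstep
        new_value = max(values[vertex], max(inbox))
        if new_value > values[vertex] or superstep == 0:
            values[vertex] = new_value
            for neighbor in graph[vertex]:
                outgoing[neighbor].append(new_value)  # pass the larger value onward
    messages = outgoing
    superstep += 1

print(values)  # every vertex converges to the global maximum: {'a': 6, 'b': 6, 'c': 6, 'd': 6}
```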
Becoming a data- and evidence-driven organization provides a significant competitive advantage. Speed and accuracy of insight, delivered across any device including smart phones and tablets, mean organizations can make better, faster decisions. Organizations need to develop a “culture of analytics” that encourages an analytics-centric business model where everyone feels empowered to think like a leader.
Organizations need to be data driven with teams that can formulate questions, understand data needed to answer questions, create solutions, validate solutions, and get the insights to the right people in a way they can understand.
The organization needs processes that allow everyone to act on the resulting insights. The entire organization needs to become data driven - guided by facts, evidence, statistics and analysis. This is the secret success sauce of Google, Wal-Mart, Goldman Sachs, and others.
The challenge for the ‘data scientist’ is to make sense of the randomness - structure chaos into an intelligible pattern or ‘insights’. The challenge for organizations is to structure and establish a common business language that propagates into the data being created by the business.
Transforming into a data-driven organization - turning information into actionable insights - is a three-part strategy:
• Technology – build a modern BI architecture & analytics ecosystem with the right tools
• Processes – streamline and standardize BI processes, measurements, and reports wherever possible
• People – train staff to use BI tools and become data-driven decision makers who meet the needs of the organization
The goal of a modern BI system is to allow the organization to:
• Make confident, data-based decisions based on evidence
• Access the timely, relevant information needed to meet the requirements of all types of users
• Link strategy to execution, leveraging data from all data sources
• Get answers when and where you need them on any device, at any time
• Transform data into actionable insight for everyone
• Uncover new or hidden opportunities to increase competitiveness
• Explore data in an intuitive way, for immediate answers to questions
Three Step OPD Data Science Process:
Step 1. Organize Data.
Organizing data involves the physical storage and format of data and incorporates best practices in data management.
Step 2. Package Data.
Packaging data involves logically manipulating and joining the underlying raw data into a new representation and package.
Step 3. Deliver Data.
Delivering data involves ensuring that the message the data carries reaches those who need to hear it.
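As a minimal, hypothetical sketch of the three OPD steps (the fields, tables and report below are invented), the Python example organizes raw records into consistently typed tables, packages them by joining and aggregating into a new representation, and delivers a small summary to the people who need it.

```python
# Hypothetical sketch of the Organize / Package / Deliver (OPD) steps using pandas.
import pandas as pd


def organize(raw_orders, raw_customers):
    """Organize: put raw records into consistently named, consistently typed tables."""
    orders = pd.DataFrame(raw_orders).rename(columns={"amt": "revenue"})
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    customers = pd.DataFrame(raw_customers)
    return orders, customers


def package(orders, customers):
    """Package: join the underlying tables into a new representation built for analysis."""
    joined = orders.merge(customers, on="customer_id", how="left")
    return joined.groupby("region", as_index=False)["revenue"].sum()


def deliver(summary, path="revenue_by_region.csv"):
    """Deliver: push the packaged result to those who need to hear it (here, a CSV report)."""
    summary.to_csv(path, index=False)
    return summary


raw_orders = [
    {"customer_id": 1, "order_date": "2013-01-05", "amt": 200.0},
    {"customer_id": 2, "order_date": "2013-01-07", "amt": 150.0},
    {"customer_id": 1, "order_date": "2013-02-01", "amt": 75.0},
]
raw_customers = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": "West"},
]

orders, customers = organize(raw_orders, raw_customers)
print(deliver(package(orders, customers)))  # revenue by region: East 275.0, West 150.0
```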
In addition, at every step have answers to these questions:
What is being created?
How will it be created?
Who will be involved in creating it?
Why is it to be created?
The key is how quickly data can be turned into currency by:
• Analyzing patterns and spotting relationships / trends that enable decisions to be made faster, with more precision and confidence
• Identifying actions and information that are out of compliance with company policies, avoiding millions in fines
• Proactively reducing the amount of data you pay to review in eDiscovery ($18,750 per gigabyte) by identifying only the relevant pieces of information
• Optimizing storage by deleting or offloading non-critical assets to cheaper cloud storage, saving millions on archive solutions
We now live in a data-driven world where mastery of the technologies and processes that enable a rapid ROD (Return on Data) is the key to reducing cost, complexity and risk, and to increasing the value of your data holdings.