Dr. Alex Guazzelli's talk on Big Data, Predictive Analytics, and PMML at Intellifest 2012 in San Diego, CA.
Predictive analytics has been used for many years to learn patterns from historical data and predict future outcomes. Well-known techniques include neural networks, decision trees, and regression models. Although these techniques have been applied to a myriad of problems, the advent of big data, cost-efficient processing power, and open standards has propelled predictive analytics to new heights.
Big data involves large amounts of structured and unstructured data that are captured from people (e.g., on-line transactions, tweets, ... ) as well as sensors (e.g., GPS signals in mobile devices). With big data, companies can now start to assemble a 360 degree view of their customers and processes. Luckily, powerful and cost-efficient computing platforms such as the cloud and Hadoop are here to address the processing requirements imposed by the combination of big data and predictive analytics.
But creating predictive solutions is just part of the equation. Once built, they need to be transitioned to the operational environment where they are actually put to use. In the agile world we live in today, the Predictive Model Markup Language (PMML) delivers the necessary representational power for solutions to be quickly and easily exchanged between systems, allowing predictions to move at the speed of business.
This talk will give an overview of the colliding worlds of big data and predictive analytics. It will do that by delving into the technologies and tools available in the market today that allow us to truly benefit from the barrage of data we are gathering at an ever-increasing pace.
Data quality management includes the continuous analysis, observation, and improvement of all information in the organization. Best practice includes:
1. Data quality assessment, as a way for the practitioner to understand the scope of how poor data quality affects the ways that the business processes are intended to run, and to develop a business case for data quality management;
2. Data quality measurement, in which the data quality analysts synthesize the results of the assessment and concentrate on the data elements that are deemed critical based on the selected business users’ needs. This leads to the definition of performance metrics that feed management reporting via data quality scorecards (see the sketch after this list);
3. Integrating data quality into the application infrastructure, by way of integrating data requirements analysis across the organization and by engineering data quality into the system development life cycle;
4. Operational data quality improvement, where data stewardship procedures are used to manage identified data quality rules and conformance to acceptability thresholds;
5. Data quality incident management, which allows the data quality analysts to review the degree to which the data does or does not meet the levels of acceptability, report, log, and track issues, and document the processes for remediation and improvement.
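To make items 2 and 4 concrete, here is a minimal sketch in Python; the field names, rules, and acceptability thresholds are invented for illustration and are not taken from any particular tool. It checks a couple of data quality rules against a batch of records and reports each field's conformance rate against a threshold, the kind of metric a data quality scorecard would carry.

```python
import re

# Hypothetical data quality rules: each maps a field to a validity check.
RULES = {
    "email": lambda v: bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v or "")),
    "zip_code": lambda v: bool(re.match(r"\d{5}(-\d{4})?$", v or "")),
}

# Assumed acceptability thresholds: minimum fraction of conforming records.
THRESHOLDS = {"email": 0.98, "zip_code": 0.95}

def score(records):
    """Return per-field conformance rates for a scorecard-style report."""
    report = {}
    for field, rule in RULES.items():
        passed = sum(1 for r in records if rule(r.get(field)))
        rate = passed / len(records) if records else 0.0
        report[field] = {"rate": rate, "meets_threshold": rate >= THRESHOLDS[field]}
    return report

if __name__ == "__main__":
    sample = [
        {"email": "ana@example.com", "zip_code": "92101"},
        {"email": "bad-address", "zip_code": "921"},
        {"email": "bo@example.org", "zip_code": "10001-1234"},
    ]
    for field, result in score(sample).items():
        status = "OK" if result["meets_threshold"] else "BELOW THRESHOLD"
        print(field, f"{result['rate']:.0%}", status)
```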
Best data management practices include data governance, data quality assurance, data enhancement and the following core data services:
• Data access and integration
• Data profiling
• Identity resolution
• Record linkage and merging
• Data cleansing
• Data enhancement
• Mapping and spatial analysis
• Data inspection and monitoring
Enabling these core data services requires:
1. Shoring up the information production processes that support multiple data consumers,
2. Standardizing the tools and techniques across the enterprise,
3. Standardizing the definition and subsequent use of business rules, and
4. Standardizing spatial analysis/location intelligence services associated with operations and transactions.
Deploying standardized business information services using a core data services approach improves quality and consistency while decreasing operational costs.
A data model is a plan for building a database. To use a common analogy, the data model is equivalent to an architect's building plans.
Data modeling is a process used to define and analyze the data requirements needed to support the business processes within the scope of corresponding information systems in organizations. To be effective, a data model must be simple enough to communicate the required data structure to the end user, yet detailed enough for the database designer to use in creating the physical structure.
A data model is a conceptual representation of the data structures that are required by a database. The data structures include the data objects, the associations between data objects, and the rules which govern operations on the objects. As the name implies, the data model focuses on what data is required and how it should be organized rather than what operations will be performed on the data.
Data modeling is the formalization and documentation of existing processes and events that occur during application software design and development. Data modeling techniques and tools capture and translate complex system designs into easily understood representations of the data flows and processes, creating a blueprint for construction and/or re-engineering.
A data model can be thought of as a diagram or flowchart that illustrates the relationships between data. Although capturing all the possible relationships in a data model can be very time-intensive, it's an important step and shouldn't be rushed. Well-documented models allow stakeholders to identify errors and make changes before any programming code has been written.
Data modeling is also used as a technique for detailing business requirements for specific databases. It is sometimes called database modeling because a data model is eventually implemented in a database.
There are three different types of data models produced while progressing from requirements to the actual database to be used for the information system:
1) Conceptual data models. These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams, high-level conceptual models are often created as part of the initial requirements envisioning effort and are used to explore the static business structures and concepts of the domain. On traditional teams, conceptual data models are often created as precursors to LDMs or as alternatives to them.
2) Logical data models (LDMs). LDMs are used to explore the concepts of your problem domain and the relationships between them. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects, although they often are on traditional projects (where they rarely seem to add much value in practice).
3) Physical data models (PDMs). PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects and as a result the focus of this article is on physical modeling.
Although LDMs and PDMs sound very similar, and in fact they are, the level of detail that they model can be significantly different. This is because the goals of the two diagrams differ: you use an LDM to explore domain concepts with your stakeholders and a PDM to define your database design.
Data Modeling in the Context of Business Process Integration
Data Modeling in the Context of Database Design
Database design is defined as: "design the logical and physical structure of one or more databases to accommodate the information needs of the users in an organization for a defined set of applications". The design process roughly follows four steps:
1. planning and analysis
2. conceptual design
3. logical design
4. physical design
The data model is one part of the conceptual design process. The other, typically, is the functional model. The data model focuses on what data should be stored in the database, while the functional model deals with how the data is processed. To put this in the context of the relational database, the data model is used to design the relational tables, and the functional model is used to design the queries that access and perform operations on those tables.
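To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customer/purchase schema is invented for illustration. The CREATE TABLE statements play the role of the data model, while the SELECT query plays the role of the functional model that operates on those tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data model: what data is stored and how it is organized (tables, keys).
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ana'), (2, 'Bo')")
conn.execute("INSERT INTO purchase VALUES (1, 1, 19.99), (2, 1, 5.00), (3, 2, 42.50)")

# Functional model: how the data is processed (queries over the tables).
query = """
SELECT c.name, SUM(p.amount) AS total_spent
FROM customer c JOIN purchase p ON p.customer_id = c.customer_id
GROUP BY c.name
"""
for name, total in conn.execute(query):
    print(name, total)
```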
Data Model Components
The data model gets its inputs from the planning and analysis stage. Here the modeler, along with analysts, collects information about the requirements of the database by reviewing existing documentation and interviewing end-users.
The data model has two outputs. The first is an entity-relationship diagram, which represents the data structures in pictorial form. Because the diagram is easily learned, it is a valuable tool for communicating the model to the end user.
The second is a data document (often called a data dictionary) that describes in detail the data objects, relationships, and rules required by the database. It provides the detail required by the database developer to construct the physical database.
Hadoop (MapReduce, where code is turned into map and reduce jobs that Hadoop runs) is great at crunching data, yet it is inefficient for analyzing data because each time you add, change, or manipulate data you must stream over the entire dataset.
In most organizations data is always growing, changing, and being manipulated, so the time needed to analyze it increases significantly.
As a result, for processing large and diverse data sets, running ad hoc analytics, or working with graph data structures, there must be better alternatives to Hadoop/MapReduce.
Google, whose MapReduce work inspired Hadoop, thought so and architected a better, faster data-crunching ecosystem that includes Percolator, Dremel, and Pregel. Google remains one of the key innovators in large-scale architecture.
Percolator is a system for incrementally processing updates to a large data set. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, you significantly speed up the process and reduce the time to analyze data.
Percolator's architecture provides horizontal scalability and resilience. It reduces indexing latency (the time between a page being crawled and its availability in the index) by a factor of 100 and simplifies the algorithm. Percolator's big advantage is that indexing time is now proportional to the size of the page being indexed, not to the size of the whole existing index.
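The contrast with batch reprocessing can be shown with a toy example in plain Python; this is only a sketch of the idea, not Percolator's actual API. A batch-style inverted index must be rebuilt by streaming over the whole corpus, while an incremental update touches only the newly crawled document, so the work is proportional to the size of the change rather than to the size of the index.

```python
from collections import defaultdict

def build_index(corpus):
    """Batch style: stream over the entire corpus to (re)build the index."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def update_index(index, doc_id, text):
    """Incremental style: touch only the newly crawled document."""
    for word in text.split():
        index[word].add(doc_id)

corpus = {"d1": "big data analytics", "d2": "predictive analytics"}
index = build_index(corpus)               # cost grows with the whole corpus

update_index(index, "d3", "graph data")   # cost grows only with the new page
print(sorted(index["data"]))              # ['d1', 'd3']
```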
Dremel is for ad hoc analytics. It is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees with a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds, roughly 100 times faster than MapReduce, and it scales to thousands of CPUs and petabytes of data.
Dremel's architecture is similar to Pig and Hive. Yet while Hive and Pig rely on MapReduce for query execution, Dremel uses a query execution engine based on aggregator trees.
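A toy illustration of why a columnar layout helps, in plain Python; nothing here resembles Dremel's actual execution engine. With row-oriented storage, a single-column aggregate still scans whole records, whereas with column-oriented storage it reads only the one column it needs.

```python
# Row-oriented: every record carries all fields.
rows = [
    {"user": "ana", "country": "US", "latency_ms": 120},
    {"user": "bo",  "country": "BR", "latency_ms": 340},
    {"user": "cy",  "country": "US", "latency_ms": 90},
]

# Column-oriented: each field is stored as its own array.
columns = {
    "user":       ["ana", "bo", "cy"],
    "country":    ["US", "BR", "US"],
    "latency_ms": [120, 340, 90],
}

# Row layout: the scan touches whole records even for a single-column aggregate.
avg_row = sum(r["latency_ms"] for r in rows) / len(rows)

# Columnar layout: the scan reads only the latency column.
col = columns["latency_ms"]
avg_col = sum(col) / len(col)

print(avg_row, avg_col)  # same answer either way; only the access pattern differs
```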
Pregel is a system for large-scale graph processing and graph data analysis. Pregel is designed to execute graph algorithms faster and use simple code. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use.
Pregel is architected for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
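The vertex-centric style can be sketched on a single machine in plain Python; Pregel's real API is a C++ library and runs distributed across a cluster, so this is only an illustration of the programming model. Each vertex runs a small compute step per superstep and exchanges messages with its neighbors until nothing changes; here the vertices propagate the maximum value in the graph, the introductory example used in the Pregel paper.

```python
# Undirected toy graph: vertex -> neighbors, plus an initial value per vertex.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 3, "b": 6, "c": 2, "d": 1}

changed = True
superstep = 0
while changed:                       # run supersteps until every vertex is stable
    changed = False
    # Message-passing phase: each vertex sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, nbrs in neighbors.items():
        for n in nbrs:
            inbox[n].append(values[v])
    # Compute phase: each vertex keeps the maximum of its value and its messages.
    for v, msgs in inbox.items():
        new_value = max([values[v]] + msgs)
        if new_value != values[v]:
            values[v] = new_value
            changed = True
    superstep += 1

print(superstep, values)   # every vertex converges to 6
```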
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Data mining is the process that results in the discovery of new patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract knowledge from an existing data set and transform it into a human-understandable structure for further use.
Data mining involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of found structures, visualization, and online updating.
Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
• Operational or transactional data, such as sales, cost, inventory, payroll, and accounting data
• Non-operational data, such as industry sales, forecasts, and macroeconomic data
• Metadata: data about the data itself, such as logical database design or data dictionary definitions
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis.
Dramatic technological advances are making this vision a reality for many companies. And, equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.
What can data mining do?
Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data.
With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.
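As a toy illustration of this kind of analysis, here is a plain-Python sketch with invented baskets; real retailers would use far more sophisticated association rule mining. Counting which items co-occur in the same transactions suggests which promotion to send to a customer based on what they already buy.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets, one list of items per transaction.
baskets = [
    ["diapers", "beer", "chips"],
    ["diapers", "beer"],
    ["bread", "milk"],
    ["diapers", "milk"],
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

def suggest(item):
    """Suggest the item most often bought together with `item`."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return related.most_common(1)[0][0] if related else None

print(suggest("diapers"))   # 'beer' in this toy data
```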
For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.
WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.
The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.
By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.
How does data mining work?
While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
• Classes: stored data is used to locate data in predetermined groups.
• Clusters: data items are grouped according to logical relationships or consumer preferences.
• Associations: data can be mined to identify associations between items or events.
• Sequential patterns: data is mined to anticipate behavior patterns and trends.
Data mining consists of five major elements:
1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data with application software.
5. Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
• Artificial neural networks: non-linear predictive models that learn through training.
• Genetic algorithms: optimization techniques based on the concepts of genetic combination, mutation, and natural selection.
• Decision trees: tree-shaped structures that represent sets of decisions and generate rules for classifying a dataset.
• Nearest neighbor method: classifies each record based on the records most similar to it in a historical data set.
• Rule induction: the extraction of useful if-then rules from data based on statistical significance.
• Data visualization: the visual interpretation of complex relationships in multidimensional data.
What technological infrastructure is required?
Today, data mining applications are available on all size systems for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical technological drivers:
• Size of the database: the more data being processed and maintained, the more powerful the system required.
• Query complexity: the more complex the queries and the greater the number of queries, the more powerful the system required.
Relational database storage and management technology is adequate for many data mining applications of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.
CRISP-DM is a widely accepted methodology for data mining projects. The steps in the process are:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
Hadoop was developed to enable applications to work with thousands of computationally independent computers and petabytes of data. Hadoop is a popular open source project that not only incorporates an implementation of the MapReduce programming model, but also includes other subprojects supporting reliable and scalable distributed computing, such as HDFS (a distributed file system) and Pig (a high-level data flow language for parallel computing), along with others. See http://hadoop.apache.org.
The Hadoop stack includes more than a dozen components, or subprojects, that are complex to deploy and manage. Installation, configuration, and production deployment at scale are genuinely hard.
The main components include:
• HDFS, the Hadoop Distributed File System, for scalable, fault-tolerant storage
• MapReduce, the distributed processing framework
• Higher-level subprojects such as Pig, Hive, HBase, and ZooKeeper
The range of applications that use Hadoop shows the versatility of the MapReduce approach, and reviewing them reveals some of the typical characteristics of problems suited to it.
Some good examples that display some or all of these characteristics include:
• Applications that boil lots of data down into ordered or aggregated results – sorting, word and phrase counts, building inverted indices mapping phrases to documents, phrase searching among large document corpuses.
• Batch analyses fast enough to satisfy the needs of operational and reporting applications, such as web traffic statistics or product recommendation analysis.
• Iterative analysis using data mining and machine learning algorithms, such as association rule analysis or k-means clustering, link analysis, classification, Naïve Bayes analysis.
• Statistical analysis and reduction, such as web log analysis, or data profiling
• Behavioral analyses such as click stream analysis, discovering content-distribution networks, viewing behavior of video audiences.
• Transformations and enhancements, such as auto-tagging social media, ETL processing, data standardization.
MapReduce is a programming model introduced and described by researchers at Google for parallel computation involving large data sets that are distributed across clusters of many processors. In contrast to the explicitly parallel programming models typically used with imperative languages such as Java and C++, the MapReduce programming model is reminiscent of functional languages such as Lisp and APL in its reliance on two basic operational steps:
• Map, which describes the computation or analysis to be applied to a set of input key/value pairs to produce a set of intermediate key/value pairs, and
• Reduce, in which the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results.
Conceptually, the computations applied during the Map phase to each input key/value pair are inherently independent, which means that both the data and the computations can be distributed across multiple storage and processing units and automatically parallelized.
A Common Example
The ability to scale based on automatic parallelization can be demonstrated using a common MapReduce example that counts the number of occurrences of each word in a collection of many documents. Looking at the problem provides a hierarchical view:
• The total number of occurrences of each word in the entire collection is equal to the sum of the occurrences of each word in each document;
• The total number of occurrences of each word in each document can be computed as the sum of the occurrences of each word in each paragraph;
• The total number of occurrences of each word in each paragraph can be computed as the sum of the occurrences of each word in each sentence.
This apparent recursion provides the context for both our Map function, which instructs each processing node to map each word to its count, and the Reduce function, which collects the word count pairs and sums together the counts for each particular word. The runtime system is responsible for distributing the input to the processing nodes, initiating the Map phase, coordinating the communication of the intermediate results, initiating the Reduce phase, and then collecting the final results.
While we can speculate on the level of granularity for computation (document vs. paragraph vs. sentence), ultimately we can leave it to the runtime system to determine the best distribution of data and allocation of computation to reduce the execution time. In fact, the value of a programming model such as MapReduce is that its simplicity essentially allows the programmer to describe the expected results of each computational phase while relying on the compiler and runtime system for optimal parallelization and fault tolerance.
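Here is a minimal single-process sketch of the word count example in Python; a real job would be written against Hadoop or a similar framework, and the shuffle step that the runtime normally performs is simulated here with an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts emitted for each word."""
    return word, sum(counts)

documents = {
    "doc1": "big data needs big ideas",
    "doc2": "data beats ideas",
}

# Shuffle: group intermediate values by key (handled by the runtime in Hadoop).
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_fn(doc_id, text):
        grouped[word].append(count)

results = dict(reduce_fn(word, counts) for word, counts in grouped.items())
print(results)   # {'big': 2, 'data': 2, 'needs': 1, 'ideas': 2, 'beats': 1}
```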
The Economist Intelligence Unit surveyed over 600 business leaders across the globe and across industry sectors about the use of Big Data in their organizations. The research confirms a growing appetite for data and data-driven decisions; organizations that harness them correctly stay ahead of the game.
The report provides insight into their use of Big Data today and in the future, and highlights the advantages seen and the specific challenges Big Data poses for business leaders' decision-making.
• 75% of respondents believe their organizations to be data-driven.
• 9 out of 10 say decisions made in the past three years would have been better if they’d had all the relevant information.
• 42% say that unstructured content is too difficult to interpret.
• 85% say the issue is not about volume but the ability to analyze and act on the data in real time.
• More than half (54%) of respondents cite access to talent as a key impediment to making the most of Big Data, followed by the barrier of organizational silos (51%).
• Other impediments to effective decision-making are lack of time to interpret data sets (46%) and difficulty managing unstructured data (39%).
• 71% say they struggle with data inaccuracies on a daily basis.
• 62% say there is an issue with data automation, and not all operational decisions have been automated yet.
• Half will increase their investments in Big Data analysis over the next three years.
The report reveals that nine out of ten business leaders believe data is now the fourth factor of production, as fundamental to business as land, labor, and capital. The study, which surveyed more than 600 C-level executives and senior management and IT leaders worldwide, indicates that the use of Big Data has improved businesses' performance, on average, by 26 percent and that the impact will grow to 41 percent over the next three years. The majority of companies (58 percent) claim they will make a bigger investment in Big Data over the next three years.
Approximately two-thirds of the 168 North American (NA) executives surveyed believe Big Data will be a significant issue over the next five years, and one that needs to be addressed so the organization can make informed decisions. They consider their companies 'data-driven,' reporting that the collection and analysis of data underpins their firm's business strategy and day-to-day decision-making.
Fifty-five percent are already making management decisions based on "hard analytic information." Additionally, 44 percent indicated that the increasing volume of data collected by their organization (from both internal and external sources) has slowed down decision-making, but the vast majority (84 percent) feel the larger issue is being able to analyze and act on it in real-time.
The exploitation of Big Data is fueling a major change in the quality of business decision-making, requiring organizations to adopt new and more effective methods to obtain the most meaningful, value-generating results from their data. Organizations that do so will be able to monitor customer behaviors and market conditions with greater certainty, and react with speed and effectiveness to differentiate themselves from their competition.