The Internet of Things (IOT) will soon produce a massive volume and variety of data at unprecedented velocity. If "Big Data" is the product of the IOT, "Data Science" is its soul.

Let's define our terms:

Internet of Things (IOT): equipping all physical and organic things in the world with identifying intelligent devices allowing the near real-time collecting and sharing of data between machines and humans. The IOT era has already begun, albeit in its first primitive stage.

Data Science: the analysis of data creation. May involve machine learning, algorithm design, computer science, modeling, statistics, analytics, math, artificial intelligence and business strategy.

Big Data: the collection, storage, analysis and distribution/access of large data sets. Usually includes data sets with sizes beyond the ability of standard software tools to capture, curate, manage, and process the data within a tolerable elapsed time. 

We are in the pre-industrial age of data technology and science used to process and understand data. Yet the early evidence provides hope that we can manage and extract knowledge and wisdom from this data to improve life, business and public services at many levels. 

To date, the internet has mostly connected people to information, people to people, and people to business. In the near future, the internet will provide organizations with unprecedented data. The IOT will create an open, global network that connects people, data and machines. 

Billions of machines, products and things from the physical and organic world will merge with the digital world allowing near real-time connectivity and analysis. Machines and products (and every physical and organic thing) embedded with sensors and software - connected to other machines, networked systems, and to humans - allow us to cheaply and automatically collect and share data, analyze it and find valuable meaning. Machines and products in the future will have the intelligence to deliver the right information to the right people (or other intelligent machines and networks), any time, to any device. When smart machines and products can communicate, they help us and other machines understand so we can make better decisions, act fast, save time and money, and improve products and services.

The IOT, Data Science and Big Data will combine to create a revolution in the way organizations use technology and processes to collect, store, analyze and distribute any and all data required to operate optimally, improve products and services, save money and increase revenues. Simply put, welcome to the new information age, where we have the potential to radically improve human life (or create a dystopia - a subject for another time).

The IOT will produce gigantic amounts of data. Yet data alone is useless - it needs to be interpreted and turned into information. However, most information has limited value - it needs to be analyzed and turned into knowledge. Knowledge may have varying degrees of value - but it needs specialized manipulation to transform into valuable, actionable insights. Valuable, actionable knowledge has great value for specific domains and actions - yet requires sophisticated, specialized expertise to be transformed into multi-domain, cross-functional wisdom for game changing strategies and durable competitive advantage.

Big data may provide the operating system and special tools to get actionable value out of data, but the soul of the data, the knowledge and wisdom, is the bailiwick of the data scientist.

See: http://bit.ly/10TgVHG

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data mining is the process that results in the discovery of new patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract knowledge from an existing data set and transform it into a human-understandable structure for further use.

Data mining involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of found structures, visualization, and online updating.

Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost. 

Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

  • operational or transactional data, such as sales, cost, inventory, payroll, and accounting
  • nonoperational data, such as industry sales, forecast data, and macroeconomic data
  • metadata - data about the data itself, such as logical database design or data dictionary definitions

The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.

Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
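As a rough illustration of turning summary information into knowledge about promotions, the sketch below compares average units sold in promoted and regular weeks for each item. The records, the item names and the `promo_lift` helper are all hypothetical; real data would also need to account for seasonality and price changes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical weekly sales records: (item, on_promotion, units_sold).
SALES = [
    ("cola", False, 100), ("cola", True, 180), ("cola", False, 110), ("cola", True, 200),
    ("rice", False, 80), ("rice", True, 88), ("rice", False, 82), ("rice", True, 90),
]

def promo_lift(records):
    """Ratio of average promoted units to average regular units, per item."""
    units = defaultdict(lambda: {True: [], False: []})
    for item, promoted, sold in records:
        units[item][promoted].append(sold)
    # Only items observed both with and without a promotion can be compared.
    return {
        item: mean(by_flag[True]) / mean(by_flag[False])
        for item, by_flag in units.items()
        if by_flag[True] and by_flag[False]
    }

lifts = promo_lift(SALES)
for item, lift in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(f"{item}: promoted weeks sell {lift:.2f}x the regular volume")
```

Items with the highest ratio are the ones that appear most susceptible to promotional efforts in this toy data.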

Data Warehouses

Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis.

Dramatic technological advances are making this vision a reality for many companies. And, equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining. 

What can data mining do?

Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.

For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.

By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.
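One simple way to express why such a pattern is "interesting" is to ask how likely it would be if the shooter were merely performing at the team's average rate. The sketch below is not Advanced Scout's actual method; it is a minimal binomial check that assumes each shot is independent, using the 4-for-4 and 49.30% figures from the example above.

```python
from math import comb

def tail_probability(made, attempted, p):
    """P(at least `made` successes in `attempted` trials) under a binomial model."""
    return sum(
        comb(attempted, k) * p**k * (1 - p) ** (attempted - k)
        for k in range(made, attempted + 1)
    )

def describe(made, attempted, baseline):
    """Summarize how far an observed shooting rate sits from the baseline rate."""
    observed = made / attempted
    chance = tail_probability(made, attempted, baseline)
    return observed, observed - baseline, chance

# Figures from the text: 4 of 4 jump shots made, team average of 49.30%.
observed, gap, chance = describe(4, 4, 0.493)
print(f"observed rate: {observed:.0%}, gap over baseline: {gap:.1%}")
print(f"chance of this at the baseline rate: {chance:.1%}")
```

With only four attempts the evidence is thin, which is why a coach would still want to review the video clips before drawing conclusions.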

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

  • Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
  • Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
  • Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
  • Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
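The association case can be made concrete with the two measures usually quoted for such rules: support (how often the items appear together) and confidence (how often the second item appears given the first). The sketch below uses invented baskets and a hypothetical `pair_rules` helper; it only considers pairs of items, not larger item sets.

```python
from itertools import combinations
from collections import Counter

# Hypothetical market baskets; each set is one customer's transaction.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Confidence of a -> b is P(b | a); emit the rule in both directions.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

for antecedent, consequent, support, confidence in pair_rules(BASKETS):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, beer and diapers appear together in three of five baskets, so the rule in either direction has support 0.60 and confidence 0.75.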

Data mining consists of five major elements:

  • Extract, transform, and load transaction data onto the data warehouse system.
  • Store and manage the data in a multidimensional database system.
  • Provide data access to business analysts and information technology professionals.
  • Analyze the data by application software.
  • Present the data in a useful format, such as a graph or table.
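The five elements above can be sketched end to end on a very small scale. This is only an illustration: an in-memory SQLite database stands in for the data warehouse (it is relational, not multidimensional), and the CSV content, table name and function names are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw point-of-sale export read by the "extract" step.
RAW_CSV = """store,product,qty,unit_price
north,backpack,2,49.90
north,tent,1,199.00
south,backpack,1,49.90
south,tent,,199.00
"""

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop rows with a missing quantity and compute revenue per row."""
    clean = []
    for row in rows:
        if not row["qty"]:
            continue
        qty = int(row["qty"])
        clean.append((row["store"], row["product"], qty, qty * float(row["unit_price"])))
    return clean

def load(conn, rows):
    conn.execute("CREATE TABLE sales (store TEXT, product TEXT, qty INTEGER, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

def revenue_by_product(conn):
    """Analysis step: total quantity and revenue per product."""
    query = (
        "SELECT product, SUM(qty), ROUND(SUM(revenue), 2) "
        "FROM sales GROUP BY product ORDER BY product"
    )
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
# Presentation step: a plain-text table.
for product, qty, revenue in revenue_by_product(conn):
    print(f"{product:<10}{qty:>4}{revenue:>10.2f}")
```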

Different levels of analysis are available:

  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
  • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
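Of the techniques listed, the nearest neighbor method is the easiest to show in a few lines. The sketch below is a minimal version that uses Euclidean distance and a majority vote; the customer records, labels and the `knn_classify` name are hypothetical, and a real application would normally scale the features first so that one of them does not dominate the distance.

```python
import math
from collections import Counter

def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest labelled records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort historical records by Euclidean distance to the query point.
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical records: (monthly visits, average spend), outcome label.
HISTORY = [
    ((1, 20.0), "lapsed"),
    ((2, 25.0), "lapsed"),
    ((2, 18.0), "lapsed"),
    ((8, 60.0), "loyal"),
    ((9, 75.0), "loyal"),
    ((10, 66.0), "loyal"),
]

print(knn_classify(HISTORY, (7, 58.0)))
print(knn_classify(HISTORY, (1, 22.0), k=1))
```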

What technological infrastructure is required?

Today, data mining applications are available on systems of all sizes for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical technological drivers:

  • Size of the database: the more data being processed and maintained, the more powerful the system required.
  • Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.

CRISP-DM is a widely accepted methodology for data mining projects. The steps in the process are:

  1. Business Understanding: Understand the project objectives and requirements from a business perspective, and then convert this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

  2. Data Understanding: Start by collecting data, then get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses about hidden information.

  3. Data Preparation: Includes all activities required to construct the final data set (data that will be fed into the modeling tool) from the initial raw data. Tasks include table, case, and attribute selection as well as transformation and cleaning of data for modeling tools.

  4. Modeling: Select and apply a variety of modeling techniques, and calibrate tool parameters to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

  5. Evaluation: Thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. Determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results is reached.

  6. Deployment: Organize and present the results of data mining. Deployment can be as simple as generating a report or as complex as implementing a repeatable data mining process.

Data mining is iterative. A data mining process continues after a solution is deployed. The lessons learned during the process can trigger new business questions. Changing data can require new models. Subsequent data mining processes benefit from the experiences of previous ones.
 
 

Gartner's Yvonne Genovese reviews the popular term "Big Data" and why IT Leaders should act now.

Pattern-Based Strategy: Getting Value from Big Data

"Big data" refers to the growth in the volume of data in organizations. Understanding how to use Pattern-Based Strategy to seek, model and adapt to patterns contained in big data will be a critical IT and business skill.
 
 
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured.  

Technologies today not only support the collection and storage of large amounts of data, they provide the ability to understand and take advantage of its full value, which helps organizations run more efficiently and profitably. For instance, with big data and big data analytics, it is possible to:

  • Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
  • Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
  • Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
  • Quickly identify customers who matter the most.
  • Generate retail coupons at the point of sale based on the customer's current and past purchases, ensuring a higher redemption rate.
  • Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
  • Analyze data from social media to detect new market trends and changes in demand.
  • Use clickstream analysis and data mining to detect fraudulent behavior.
  • Determine root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.

Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. What is the point of collecting and storing terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data.

A number of recent technology advancements are enabling organizations to make the most of big data and big data analytics:

  • Cheap, abundant storage and server processing capacity.
  • Faster processors.
  • Affordable open-source, distributed big data platforms, such as Hadoop.
  • New storage and processing technologies designed specifically for large data volumes, including unstructured data.
  • Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
  • Cloud computing and other flexible resource allocation arrangements.

Big data technologies not only support the ability to collect large amounts of data, they provide the ability to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.

• As much as 80% of the world’s data is now in unstructured formats, much of it created and held on the web. This data is increasingly associated with cloud-based services used outside enterprise IT. The part of Big Data expected to grow explosively and create new value is this unstructured data, which arises mostly from these external sources.

• Data sets are growing at a staggering pace: they are expected to grow by 100% every year for at least the next 5 years.

• Most of this data is unstructured or semi-structured – generated by servers, network devices, social media, and distributed sensors. 

• “Big Data” refers to such data because the volume (petabytes and exabytes), the type (semi- and unstructured, distributed), and the speed of growth (exponential) make the traditional data storage and analytics tools insufficient and cost-prohibitive. 

• An entirely new set of processing and analytic systems is required for Big Data, with Apache Hadoop being one example of a Big Data processing system that has gained significant popularity and acceptance.

• According to a recent McKinsey Big Data report, Big Data can provide up to USD $300 billion annual value to the US Healthcare industry, and can increase US retail operating margins by up to 60%. It’s no surprise that Big Data analytics is quickly becoming a critical priority for large enterprises across all verticals.

Big data characteristics

Volume: there is a lot of data to be analyzed and/or the analysis is extremely intense; either way, a lot of hardware is needed.

Variety: the data is not organized into simple, regular patterns as in a table; rather text, images and highly varied structures—or structures unknown in advance—are typical.

Velocity: the data comes into the data management system rapidly and often requires quick analysis or decision making.

Drivers 

Volume, variety, velocity, and complexity of incoming data streams

Growth of “Internet of Things” results in explosion of new data 

Commoditization of inexpensive terabyte-scale storage hardware is making storage less costly, so why not store it?

Increasingly, enterprises need to store non-traditional and unstructured data in a way that is easily queried

Desire to integrate all the data into a single source

The power of Compression

Challenges

Data comes from many different sources (enterprise apps, web, search, video, mobile, social conversations and sensors) 

All of this information has been getting increasingly difficult to store in traditional relational databases and even data warehouses

Unstructured or semi-structured text is difficult to query. How does one query a table with a billion rows?

Culture, skills, and business processes

Conceptual Data Modeling

Data Quality Management

Implications

Emerging capabilities to process vast quantities of structured and unstructured data are bringing about changes in technology and business landscapes.

As data sets get bigger and the time allotted to their processing shrinks, look for ever more innovative technology to help organizations glean the insights they'll need to face an increasingly data-driven future.

What is Hadoop?

The best-known technology used for Big Data is Hadoop. It was inspired by Google's publications on MapReduce, the Google File System (GFS) and Bigtable. Because Hadoop can be hosted on commodity hardware (usually Intel PCs running Linux with one or two CPUs and a few TB of hard disk storage, without any RAID replication technology), it allows organizations to store huge quantities of data (petabytes or even more) at very low cost compared to SAN systems.

Hadoop is an open-source implementation of the MapReduce framework described by Google. It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation: http://hadoop.apache.org/

The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop:

Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into blocks, and each block is replicated and stored on several servers (three by default, which can be customized) for fault tolerance.

Hadoop MapReduce is a way to split every request into smaller requests which are sent to many small servers, allowing a truly scalable use of CPU power.
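The MapReduce idea can be illustrated without Hadoop itself. The sketch below is plain Python, not the Hadoop API: a thread pool stands in for the many small servers, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are chosen for this example only. It counts words across chunks of text, the customary first MapReduce exercise.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of input, as a mapper would."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)

def word_count(chunks, workers=3):
    # Each chunk is mapped independently, so the work can be spread out.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

CHUNKS = ["big data needs big tools", "data beats opinion", "Big questions need data"]
print(word_count(CHUNKS))
```

Because no mapper depends on another mapper's chunk, adding servers adds capacity, which is the scalability property described above.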

What problems can Hadoop solve?

• The Hadoop framework is used by major players including Google, Yahoo, IBM, eBay, LinkedIn and Facebook, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux but Hadoop can also work with BSD and OS X.

• The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. 

• Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. 

Big Data Market

The Big Data market is on the verge of a rapid growth spurt that will see it top the USD $50 billion mark worldwide within the next five years.

As of early 2012, the Big Data market stands at just over USD $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.
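The two figures quoted here are consistent with each other, which a quick compound-growth calculation shows. The sketch below uses the rounded values from the text (USD 5 billion, 58%, five years); the function names are just for illustration.

```python
def project(value, annual_growth, years):
    """Compound `value` at `annual_growth` (0.58 means 58%) for `years` years."""
    return value * (1 + annual_growth) ** years

def cagr(start, end, years):
    """Compound annual growth rate implied by growing from `start` to `end`."""
    return (end / start) ** (1 / years) - 1

# Figures from the text: about USD 5 billion in early 2012, 58% CAGR to 2017.
projected = project(5.0, 0.58, 5)
print(f"Projected 2017 market: USD {projected:.1f} billion")
print(f"CAGR needed to reach USD 50 billion: {cagr(5.0, 50.0, 5):.1%}")
```

Starting from exactly USD 5 billion, 58% a year gives a little over USD 49 billion, so the "just over USD $5 billion" starting point is what pushes the projection past the USD 50 billion mark.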

Vertical Perspective

Enhancing Fraud Detection for Banks and Credit Card Companies

Scenario

• Build up-to-date models from transactional data to feed real-time risk-scoring systems for fraud detection.

Requirements

• Analyze volumes of data with response times that are not possible today.

• Apply analytic models to individual client, not just client segment. 

Benefits

• Detect transaction fraud in progress and allow fraud models to be updated in hours rather than weeks.
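To show what scoring the individual client rather than the client segment can mean, the sketch below compares a new transaction with that client's own history using a z-score. It is a deliberately simple stand-in for a real fraud model; the client names, amounts, threshold and function names are all hypothetical.

```python
from statistics import mean, stdev

def risk_score(history, amount):
    """Z-score of a new amount against one client's own past transactions."""
    if len(history) < 2:
        return 0.0  # not enough history to estimate spread
    spread = stdev(history)
    if spread == 0:
        return 0.0 if amount == history[0] else float("inf")
    return (amount - mean(history)) / spread

def flag(histories, client, amount, threshold=3.0):
    """Return True when the amount is unusually high for this specific client."""
    return risk_score(histories.get(client, []), amount) > threshold

# Hypothetical per-client transaction amounts.
HISTORIES = {
    "alice": [20.0, 25.0, 22.0, 30.0, 24.0],
    "bob": [900.0, 1200.0, 1000.0, 1100.0, 950.0],
}

print(flag(HISTORIES, "alice", 1000.0))  # far above alice's usual spending
print(flag(HISTORIES, "bob", 1000.0))    # ordinary for bob
```

The same amount is flagged for one client and not the other, which a single segment-wide threshold could not express.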

Social Media Analysis for Products, Services and Brands

Scenario

• Monitor data from various sources such as blogs, boards, news feeds, tweets, and social media for information pertinent to brand and products, as well as competitors.

Requirement

• Extract and aggregate relevant topics and relationships, discover patterns, and reveal up-and-coming topics and trends.

Benefits

• Brand Management for marketing campaigns, Brand protection for ad placement networks.

Store Clustering Analysis in the Retail Industry

Scenario

• A retailer with a large number of stores needs to understand cluster patterns of shoppers.

Requirement

• Use shopping patterns across multiple characteristics, such as location, income, and family size, for better product placement.

Age Range
Education
Income
Children
Assets
Urbanicity 

Benefits

• Store specific clustering of products, clustering specific types of products by locations.
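A store clustering of this kind is often done with k-means. The sketch below is a bare-bones version on two shopper characteristics per store; the store names, the figures, and the choice of starting centroids are invented for illustration, and a production analysis would scale the features and use many more attributes.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical stores: (median shopper age, median household income in thousands).
STORES = {
    "downtown": (29, 48), "campus": (24, 30), "suburb_a": (45, 95),
    "suburb_b": (50, 110), "midtown": (31, 52), "lakeside": (48, 102),
}

centroids, labels = kmeans(list(STORES.values()), [STORES["downtown"], STORES["suburb_a"]])
for name, label in zip(STORES, labels):
    print(f"{name}: cluster {label}")
```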

Healthcare and Energy Industry

Scenario

IBM Stream Computing for Smarter Healthcare

IBM Watson pairs natural language processing with predictive root cause analysis.

InfoSphere Streams based analytics can alert hospital staff of impending life threatening infections in premature infants up to 24 hours earlier than current practices.

Vestas Wind Systems uses IBM big data analytics software and powerful IBM systems to improve wind turbine placement for optimal energy output.