Intel CIO Kim Stevenson discusses how Intel IT is looking to leverage predictive analytics to deal with the sea of data out there, and how this is already creating new opportunities for the organization.
In one example of a successful big data implementation at Intel, Stevenson discussed a pilot program that the company ran that identified customers that were more likely to purchase than others based on the heaps of information generated at Intel.
“We looked at that and we examined how our sales coverage model was against those customers, and we took our inside sales force and made calls based on what the predictive analytics said who the customers more likely to purchase were,” explained Stevenson. “In a short amount of time, we were able to cover customers that weren’t previously covered and generate millions of dollars in incremental revenue.”
Another example Stevenson gave was a retrospective analysis of a failed program that she said cost the silicon giant $700 million when the dust settled. Using the massive amounts of manufacturing data available to them during the die process, Stevenson says they are now able to see problems sooner by using big data analytics to assist in the debugging process.
When asked what advice she would give to others, Stevenson said that she advises fellow CIOs to partner with their business units to identify where the hidden potential is and let that become the guiding light in terms of what problems are focused on – and then stay focused.
“There’s a lot of questions you can answer about any given business, but if you stay focused on a small set of business problems, then you’ll create some early wins and you’re able to grow based on your successful track record.”
Stevenson says Intel has set rules for how predictive analytics projects operate within the company: small teams of roughly five people, working on problems that can be solved within six months. Ultimately, says Stevenson, these projects are tied to ROI, for which Intel set a target of $10 million for its first deployment. “That helps us narrow and prioritize the problems to higher value problems for the company.”
Stevenson also advises that enterprises start to build the skills that they need. “You need data scientists, visualization experts, data curators, and those types of skills – they’re rare today,” she comments. “It’s harder to learn the business knowledge that is needed to make the data into valuable information than to learn the IT technical skills,” she says in advising people to grow the skills internally. “It will be a focused, diligent progression of taking the people that understand your business process today and complementing them with the technical skills required in building big data management systems or predictive analytic models.”
Python is an increasingly popular object-oriented, interpreted and interactive programming language used for heavy-duty data analysis. Python is designed for ease of use, speed and readability, and is well suited to data-intensive applications. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming styles. It features a fully dynamic type system and automatic memory management, similar to that of Scheme, Ruby, Perl and Tcl.
You can create customized data tools using Python that can handle large data sets efficiently - it lets you work more quickly and integrate your systems more effectively. You can get more done in less time using Python for manipulating, processing, cleaning, and crunching data.
Python allows an organization to build a framework that makes it easy to collect data from a myriad of data sources and model it. Instead of spending time writing database connector code, you can use a simple configuration and quickly get off the ground. Python also allows an organization to move code from development to production more quickly: the same code created as a prototype can easily be promoted into production.
If you like the R language, Python libraries such as SciPy, IPython and Pandas provide much of the mathematical functionality typically found in R. While R offers more packages and visualization capabilities at this time, Python is catching up.
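As a minimal illustration of that overlap, the sketch below uses NumPy and SciPy for a few tasks an R user might reach for (summary statistics, a two-sample t-test, a simple regression); the sample values are invented purely for the example.

```python
import numpy as np
from scipy import stats

# Two hypothetical samples, e.g. conversion rates from an A/B test (made-up values)
control = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.12])
variant = np.array([0.16, 0.15, 0.18, 0.17, 0.14, 0.19])

# Summary statistics, roughly what summary() and sd() give you in R
print("variant mean:", variant.mean(), "std:", variant.std(ddof=1))

# Two-sample t-test, the Python counterpart of R's t.test()
t_stat, p_value = stats.ttest_ind(control, variant)
print("t:", t_stat, "p:", p_value)

# Simple linear regression, similar in spirit to R's lm()
slope, intercept, r_value, p, stderr = stats.linregress(control, variant)
print("slope:", slope, "intercept:", intercept, "r:", r_value)
```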
Simply, Python is easy to learn, platform neutral and cheap. Python is a tool to build other tools with, including data analysis tools. It draws on a broad mix of programming paradigms, styles and languages. Python runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual machines.
Python is free to use, even for commercial products, because of its OSI-approved open source license. See: http://www.python.org/psf/license/
Pandas is a Python package for doing data transformation and statistical analysis. Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools. See: http://pandas.pydata.org/
While R is the most widely-used open source environment for statistical modeling and graphics, Pandas adopts some of the best concepts of R, like the foundational data.frame. Pandas has been described as "R data.frame on steroids". Pandas seeks to remedy some frustrations common to R users:
1. R offers only basic data alignment and indexing functionality, leaving much of the work to the user. Pandas makes it easy and intuitive to work with messy, irregularly indexed data - like time series data. Pandas also provides rich tools, like hierarchical indexing, not found in R (see the sketch after this list);
2. R is not well-suited to general purpose programming and system development. Pandas enables you to do large-scale data processing seamlessly when developing your production applications;
3. Hybrid systems connecting R to a low-productivity systems language like Java, C++, or C# suffer from significantly reduced agility and maintainability, and you’re still stuck developing the system components in a low-productivity language;
4. The "copyleft" GPL license of R can create concerns for commercial software vendors who want to distribute R with their software under another license. Python and Pandas use more permissive licenses.
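To make points 1 and 2 concrete, here is a minimal Pandas sketch of automatic index alignment on irregular time series and of hierarchical (MultiIndex) indexing; the dates, prices and sales figures are all made up for illustration.

```python
import pandas as pd

# Two irregularly sampled time series (made-up prices) observed on different dates
a = pd.Series([101.2, 102.5, 101.9],
              index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-07"]))
b = pd.Series([99.8, 100.4],
              index=pd.to_datetime(["2013-01-03", "2013-01-07"]))

# Arithmetic aligns on the index automatically; dates missing from either side become NaN
spread = a - b
print(spread)

# Hierarchical (MultiIndex) indexing: rows labeled by region and city
df = pd.DataFrame(
    {"sales": [250, 180, 320, 90]},
    index=pd.MultiIndex.from_tuples(
        [("West", "Portland"), ("West", "Seattle"),
         ("East", "Boston"), ("East", "Augusta")],
        names=["region", "city"]),
)
print(df.loc["West"])                     # select an entire index level
print(df.groupby(level="region").sum())  # aggregate by level
```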
Top Python Advantages
- Instant feedback from the interactive interpreter.
- Non-intrusive: You think about the problem, not the tool you are working with. After you learn Python, it gets out of the way.
- Libraries: Whatever you want to do, somebody has written code to help you get there.
- Community: The community is a great source of examples and ideas.
- The philosophy of one-best-way means that Python programmers all tend to do things in sort of the same way. This is a big advantage because it makes it easy to read other people's code - a great way to learn.
Top Python Disadvantages
- No single source of truth / best-practices: It can be hard to learn what is the best library for a particular job. The large number of packages relevant to a particular task can make it difficult to find the one best suited to your exact needs.
- Documentation is substandard: The Python official documentation is seldom the best way to learn a new library. The informal Python community provides the most useful examples. Yet sorting out the wheat from the chaff can be hit-or-miss.
- Concurrency: Python was designed without concurrency in mind and it shows; CPython's Global Interpreter Lock makes true multi-threaded parallelism difficult for CPU-bound work.
No doubt the amount of data your company collects is growing. But what's the point of amassing all that information if you can't use it to drive your business forward? Smart businesses are giving people throughout their organizations access to deeper intelligence by marrying their big data and business intelligence efforts into a big data solution. The result is better decisions based on meaningful insights company wide. What's your strategy for big data analytics?
NoSQL & Non-Relational Databases
Relational databases have been the de facto technology for storing and querying data for 40 years. What is driving the recent innovation in databases? This talk will touch on the history of databases, why RDBMS have been so successful, and why we are seeing the rise of NoSQL databases. Next we will examine the different categories of NoSQL databases and technology. The presentation will finish with a specific introduction to MongoDB, its design principles, and what it looks like to code against.
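To give a flavor of what coding against MongoDB looks like, here is a minimal sketch using the pymongo driver; it assumes a local mongod instance running on the default port, and the database, collection and field names are invented for illustration.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (assumes mongod is running on the default port)
client = MongoClient("mongodb://localhost:27017/")
db = client["demo"]
events = db["events"]

# Documents are schemaless, JSON-like dicts; fields can vary from document to document
events.insert_one({"type": "login", "user": "alice", "ip": "10.0.0.5"})
events.insert_one({"type": "purchase", "user": "bob", "amount": 42.50,
                   "items": ["widget", "gadget"]})

# Query by field value and project only the fields we care about
for doc in events.find({"type": "purchase"}, {"user": 1, "amount": 1, "_id": 0}):
    print(doc)

# Secondary index to keep queries on 'user' fast as the collection grows
events.create_index("user")
```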
Will LaForest heads up the Federal practice for 10gen, the MongoDB company. Will is focused on evangelizing the benefits of MongoDB, NoSQL, and open source software (OSS) in solving Big Data challenges in the Federal government. He believes that software in the Big Data space must scale not only from a technical perspective but also from a cost perspective. He has spent 7 years in the NoSQL space focused on the Federal government, most recently as Principal Technologist at MarkLogic. His technical career spans diverse areas from data warehousing, to machine learning, to building statistical visualization software for SPSS, but began with code slinging at DARPA. He holds degrees in Mathematics and Physics from the University of Virginia.
Monte Carlo Simulation Methods in Energy Risk Management
Monte Carlo methods are stochastic, or probabilistic, modeling techniques - meaning they are based on the use of random numbers and probability statistics to investigate problems.
They are used to model phenomena with significant uncertainty in inputs, such as the calculation of risk in business. When Monte Carlo simulations have been applied in space exploration and oil exploration, their predictions of failures, cost overruns and schedule overruns are routinely better than human intuition or alternative "soft" methods.
For energy companies, understanding the impact of commodity price movements on the value of a portfolio is critical for hedging, risk management and planning purposes. For example, consider a gas-fired power plant which buys natural gas from a spot market, converts it into electricity, and sells that electricity into a deregulated power spot market.
The generator is exposed to fluctuations in the price it must pay to purchase natural gas and the price it will receive for the sale of power. In order to reduce risk, a power plant operator may choose to buy in advance the natural gas it anticipates it will need, and to sell in advance the power it anticipates it will generate -- that is, to contract in advance for the forward purchase of gas and the sale of power at a future delivery period, for a fixed price today. This practice, known as hedging, attempts to remove the uncertainty in future cash flows from the power plant.
The decision on how much and how often to hedge will, in general, require sophisticated analytical methods. One popular method, Monte Carlo simulation, attempts to simulate future states of the world to understand the impact on cash flows.
In this talk, we discuss Monte Carlo methods for energy risk applications. We review one popular approach, which uses a set of linked simulation models to capture the fundamental physical drivers of electricity price formation, and calibrates them to match current prices being quoted in the financial markets. Monte Carlo simulations of weather, load and prices can then be used to value a portfolio of generation assets and trades, and to support hedging and risk management decisions.
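The calibrated fundamental models described above are well beyond a short sketch, but a toy example conveys the mechanics: simulate correlated random price paths for gas and power, compute the plant's margin on each path, and look at the resulting distribution of cash flows. Every number below (prices, volatilities, correlation, heat rate) is an invented assumption, and simple geometric Brownian motion stands in for the real price models.

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Made-up market assumptions (illustrative only) ---
n_paths   = 10_000        # number of simulated futures
n_days    = 30            # horizon in trading days
gas_0     = 3.50          # $/MMBtu spot price of natural gas
power_0   = 45.0          # $/MWh spot price of electricity
gas_vol   = 0.35          # annualized volatility of gas
power_vol = 0.55          # annualized volatility of power
heat_rate = 7.5           # MMBtu of gas burned per MWh generated
corr      = 0.6           # assumed gas/power correlation
dt        = 1 / 252

# Correlated standard normal shocks via a Cholesky factor of the correlation matrix
chol = np.linalg.cholesky(np.array([[1.0, corr], [corr, 1.0]]))
z = rng.standard_normal((n_paths, n_days, 2)) @ chol.T

# Zero-drift geometric Brownian motion paths for gas and power prices
gas_paths = gas_0 * np.exp(np.cumsum(
    -0.5 * gas_vol**2 * dt + gas_vol * np.sqrt(dt) * z[:, :, 0], axis=1))
power_paths = power_0 * np.exp(np.cumsum(
    -0.5 * power_vol**2 * dt + power_vol * np.sqrt(dt) * z[:, :, 1], axis=1))

# Daily margin per MWh: sell power, buy the gas needed to generate it;
# assume the plant only runs on days when the margin is positive
margin = power_paths - heat_rate * gas_paths
cashflow = margin.clip(min=0).sum(axis=1)

print("expected total margin over the horizon:", cashflow.mean())
print("5% worst-case margin:", np.percentile(cashflow, 5))
```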
Scotty Nelson is a Senior Energy Analyst at Ascend Analytics, where he deploys analytic software solutions to help companies understand and manage risk in the energy markets.
The recent TDWI Keynote by Evan Levy focused on traditional versus new strategies for managing data. The traditional way is based on an organization creating, storing, analyzing and distributing data internally. Most modern (last 10 years) data warehouse / business intelligence platforms are designed on this model and can manage and track where data is created and consumed.
The new strategy for managing data includes: both internal and external data; structured and unstructured data; external analytical applications; external data providers; and both internal and external data scientists. As a result, in addition to managing and tracking where data is created and consumed, an organization needs fast and easy access to big volumes of data and needs to know how it moves, transforms and migrates. Modern data warehouse / business intelligence platforms are unable to do this job.
The past 10 to 15 years have seen a shift from custom-built to packaged applications to automate knowledge / business processes. The design flaw is that custom code and middleware are required to move all this data between the packaged systems. The brain damage, money and time spent on data migration solutions - in addition to the human capital needed to clean the data - is huge and wasteful. Current ETL tools are primitive, and while they save time and reduce custom coding, they are not a long-term solution. Moreover, this design will not work with the new volume, variety and velocity of large internal and external data sets.
Levy offers a partial solution: the data supply chain. The data supply chain concept was pioneered by Walmart years ago and seeks to broaden the traditional corporate information life cycle to include the numerous data sourcing, provisioning and logistical activities required to manage data. Walmart understood the design flaw in having a separate custom distribution system. The solution was a standard distribution system where standardization occurs at the source.
Simply, the data supply chain is all about standardization of data. Focus on designing and building one standardized data supply chain instead of custom distribution systems for each business application. Eliminate middleware, ETL and writing massive amounts of custom code to standardize, clean and integrate data.
Yet standardizing data at the source is only part of the solution. The other part is Master Data Management (MDM).
MDM standardizes data enabling better data governance to capture and enforce clean and reliable data for optimal data science and business analytics. Standardized values and definitions allow uniform understanding of data stored in various data warehouses so users can find and access the data they need easily and fast.
MDM comprises a set of processes and tools that defines and manages data. Quality of data shapes decision making, and MDM helps leverage trusted information to make better decisions, increase profitability and reduce risk.
Master data is reference data about: people (customers, employees, suppliers), things (products, assets, ledgers) and places (countries, cities, locations). The applications and technologies used to create and maintain master data are part of a MDM system. Virtual master data management (Virtual MDM) utilizes data virtualization and a persistent metadata server to implement a multi-level automated MDM hierarchy.
MDM helps organizations handle four key issues:
One of the main objectives of an MDM system is to publish an integrated, accurate and consistent set of master data for use by other applications and users. This integrated set of master data is called the master data system of record (SOR). The SOR is the gold copy for any given piece of master data, and is the single place in an organization that master data is guaranteed to be accurate and up to date.
Although an MDM system publishes the master data SOR for use by the rest of the IT environment, it is not necessarily the system where master data is created and maintained. The system responsible for maintaining any given piece of master data is called the system of entry (SOE). In most organizations today, master data is maintained by multiple SOEs.
Customer data is an example. A company may, for example, have customer master data that is maintained by multiple Web store fronts, by the retail organization, and by the shipping and billing systems. Creating a single SOR for customer data in such an environment is a complex task.
The long term goal of MDM is to solve this problem by creating an MDM system that is not only the SOR for any given type of master data, but also the SOE as well. In other words, standardize data at the source.
MDM then can be defined as a set of policies, procedures, applications and technologies for harmonizing and managing the system of record and systems of entry for the data and metadata associated with the key business entities of an organization.
The Data Supply Chain: A Different Approach to Managing Your Company's Data
There's no argument that data is a corporate asset, but there's often disagreement on how to manage and address everyone's needs. Traditional data strategies assume that data is created, distributed, and consumed within a company's four walls. Today, companies are moving toward external applications and information providers to support their growing business demands. It's no longer sufficient to manage and track where data is created and consumed—we must also know how it moves and migrates.
In this keynote, Evan Levy will introduce the concept of the data supply chain, a new approach to managing a company's data assets. This approach expands the traditional corporate information life cycle to include the numerous data sourcing, provisioning, and logistical activities that are required to successfully manage a company's data.
The goal of Data Analytics (big and small) is to get actionable insights resulting in smarter decisions and better business outcomes. How you architect business technologies and design data analytics processes to get valuable, actionable insights varies.
It is critical to design and build a data warehouse / business intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse datasets.
There are three types of data analysis:
- Descriptive (business intelligence and data mining)
- Predictive (forecasting)
- Prescriptive (optimization and simulation)
Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable future outcome of an event or a likelihood of a situation occurring.
Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events.
In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions.
Three basic cornerstones of predictive analytics are:
- Predictive Modeling
- Decision Analysis and Optimization
- Transaction Profiling
An example of using predictive analytics is optimizing customer relationship management systems. They can help an organization analyze all of its customer data, exposing patterns that predict customer behavior.
Another example: for an organization that offers multiple products, predictive analytics can help analyze customers’ spending, usage and other behavior, leading to efficient cross-selling of additional products to current customers. This directly leads to higher profitability per customer and stronger customer relationships.
An organization must invest in a team of experts (data scientists) and create statistical algorithms for finding and accessing relevant data. The data analytics team works with business leaders to design a strategy for using predictive information.
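As a toy illustration, in the spirit of the Intel sales example above, the sketch below trains a propensity-to-purchase model with scikit-learn and ranks customers for follow-up; the features, the synthetic labels, and the choice of library are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Invented customer features: prior orders, days since last contact, recent web visits
n = 1_000
X = np.column_stack([
    rng.poisson(3, n),            # prior orders
    rng.integers(1, 365, n),      # days since last contact
    rng.poisson(10, n),           # recent web visits
])
# Synthetic label: customers with more orders and visits are likelier to buy
y = (X[:, 0] + 0.3 * X[:, 2] - 0.01 * X[:, 1] + rng.normal(0, 2, n) > 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a simple propensity model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score every customer and hand the highest-propensity ones to inside sales
scores = model.predict_proba(X_test)[:, 1]
top_prospects = np.argsort(scores)[::-1][:10]
print("test accuracy:", model.score(X_test, y_test))
print("top prospect indices:", top_prospects)
```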
Descriptive analytics looks at data and analyzes past events for insight as to how to approach the future. Descriptive analytics looks at past performance and understands that performance by mining historical data to look for the reasons behind past success or failure. Almost all management reporting such as sales, marketing, operations, and finance, uses this type of post-mortem analysis.
Descriptive models quantify relationships in data in a way that is often used to classify customers or prospects into groups. Unlike predictive models that focus on predicting a single customer behavior (such as credit risk), descriptive models identify many different relationships between customers or products. Descriptive models do not rank-order customers by their likelihood of taking a particular action the way predictive models do.
Descriptive models can be used, for example, to categorize customers by their product preferences and life stage. Descriptive modeling tools can be utilized to develop further models that can simulate a large number of individualized agents and make predictions.
For example, descriptive analytics examines historical electricity usage data to help plan power needs and allow electric companies to set optimal prices.
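A minimal sketch of that kind of descriptive analysis, using Pandas on invented hourly usage data, might look like this:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)

# Made-up hourly electricity usage (MWh) for one year, with a daily cycle plus noise
hours = pd.date_range("2013-01-01", periods=24 * 365, freq="H")
usage = pd.Series(
    50 + 20 * np.sin(np.arange(len(hours)) * 2 * np.pi / 24) + rng.normal(0, 5, len(hours)),
    index=hours, name="mwh")

# Classic descriptive questions: what happened, and when?
monthly_totals = usage.resample("M").sum()                 # total usage by month
hourly_profile = usage.groupby(usage.index.hour).mean()    # average daily load shape
peak_hour = hourly_profile.idxmax()

print(monthly_totals.head())
print("average peak hour of day:", peak_hour)
```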
Prescriptive analytics automatically synthesizes big data, mathematical sciences, business rules, and machine learning to make predictions and then suggests decision options to take advantage of the predictions.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen.
Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk and illustrate the implication of each decision option. In practice, prescriptive analytics can continually and automatically process new data to improve prediction accuracy and provide better decision options.
Prescriptive analytics synergistically combines data, business rules, and mathematical models. The data inputs to prescriptive analytics may come from multiple sources, internal (inside the organization) and external (social media, et al.). The data may also be structured, which includes numerical and categorical data, as well as unstructured data, such as text, images, audio, and video data, including big data. Business rules define the business process and include constraints, preferences, policies, best practices, and boundaries. Mathematical models are techniques derived from mathematical sciences and related disciplines including applied statistics, machine learning, operations research, and natural language processing.
For example, prescriptive analytics can benefit healthcare strategic planning by leveraging operational and usage data combined with data on external factors such as economic conditions, population demographic trends and population health trends. This enables more accurate planning for future capital investments, such as new facilities and equipment, and a better understanding of the trade-offs between adding beds and expanding an existing facility versus building a new one.
Another example is energy and utilities. Natural gas prices fluctuate dramatically depending upon supply, demand, econometrics, geo-politics, and weather conditions. Gas producers, transmission (pipeline) companies and utility firms have a keen interest in more accurately predicting gas prices so that they can lock in favorable terms while hedging downside risk. Prescriptive analytics can accurately predict prices by modeling internal and external variables simultaneously and also provide decision options and show the impact of each decision option.
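As a toy illustration of the "decision options" side of that example, the sketch below uses a simple linear program (via SciPy) to decide how much gas to buy forward versus leave to the spot market, given a predicted spot price and a business-rule constraint; all prices, volumes and rules are invented assumptions.

```python
from scipy.optimize import linprog

# --- Illustrative numbers only ---
demand     = 1_000     # MMBtu of gas needed next month
fwd_price  = 3.60      # $/MMBtu locked in today
spot_price = 3.40      # expected $/MMBtu if we wait (output of a predictive model)
min_hedged = 0.5       # business rule: hedge at least 50% of demand

# Decision variables: x[0] = MMBtu bought forward, x[1] = MMBtu bought on spot
c = [fwd_price, spot_price]          # minimize total expected cost

A_eq = [[1, 1]]                      # forward + spot purchases must cover demand
b_eq = [demand]

A_ub = [[-1, 0]]                     # -x_forward <= -min_hedged * demand  (hedge floor)
b_ub = [-min_hedged * demand]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None), (0, None)])
forward_mmbtu, spot_mmbtu = res.x
print(f"buy forward: {forward_mmbtu:.0f} MMBtu, leave to spot: {spot_mmbtu:.0f} MMBtu")
print(f"expected cost: ${res.fun:,.0f}")
```

Rerunning the optimization with different predicted prices or hedge-floor rules is one way to show a decision maker the implications of each option, which is the essence of the prescriptive approach described above.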