Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured.
Technologies today not only support the collection and storage of large amounts of data, they provide the ability to understand and take advantage of its full value, which helps organizations run more efficiently and profitably. For instance, with big data and big data analytics, it is possible to:
- Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
- Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
- Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
- Quickly identify customers who matter the most.
- Generate retail coupons at the point of sale based on the customer's current and past purchases, ensuring a higher redemption rate.
- Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
- Analyze data from social media to detect new market trends and changes in demand.
- Use clickstream analysis and data mining to detect fraudulent behavior.
- Determine root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.
Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. What is the point of collecting and storing terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data.
A number of recent technology advancements are enabling organizations to make the most of big data and big data analytics:
- Cheap, abundant storage and server processing capacity.
- Faster processors.
- Affordable open-source platforms for distributed storage and processing, such as Hadoop.
- New storage and processing technologies designed specifically for large data volumes, including unstructured data.
- Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
- Cloud computing and other flexible resource allocation arrangements.
Big data technologies not only support the ability to collect large amounts of data, they provide the ability to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.
• As much as 80% of the world's data is now in unstructured formats, much of it created and held on the web. This data is increasingly associated with cloud-based services used outside enterprise IT. The part of Big Data tied to the expected explosive growth and creation of new value is the unstructured data arising mostly from these external sources.
• Data sets are growing at a staggering pace: they are expected to grow by 100% (that is, to double) every year for at least the next 5 years.
• Most of this data is unstructured or semi-structured – generated by servers, network devices, social media, and distributed sensors.
• “Big Data” refers to such data because the volume (petabytes and exabytes), the type (semi- and unstructured, distributed), and the speed of growth (exponential) make the traditional data storage and analytics tools insufficient and cost-prohibitive.
• An entirely new set of processing and analytic systems is required for Big Data, with Apache Hadoop being one example of a Big Data processing system that has gained significant popularity and acceptance.
• According to a recent McKinsey Big Data report, Big Data can provide up to USD 300 billion in annual value to the US healthcare industry, and can increase US retail operating margins by up to 60%. It's no surprise that Big Data analytics is quickly becoming a critical priority for large enterprises across all verticals.
Big data characteristics
Volume: there is a lot of data to be analyzed and/or the analysis is extremely intense; either way, a lot of hardware is needed.
Variety: the data is not organized into simple, regular patterns as in a table; rather text, images and highly varied structures—or structures unknown in advance—are typical.
Velocity: the data comes into the data management system rapidly and often requires quick analysis or decision making.
Drivers
• Volume, variety, velocity, and complexity of incoming data streams
• Growth of the "Internet of Things" results in an explosion of new data
• Commoditization of inexpensive terabyte-scale storage hardware is making storage less costly, so why not store it?
• Enterprises increasingly need to store non-traditional and unstructured data in a way that is easily queried
• Desire to integrate all the data into a single source
• The power of compression
Challenges
• Data comes from many different sources (enterprise apps, web, search, video, mobile, social conversations and sensors)
• All of this information has become increasingly difficult to store in traditional relational databases and even data warehouses
• Unstructured or semi-structured text is difficult to query. How does one query a table with a billion rows?
• Culture, skills, and business processes
• Conceptual data modeling
• Data quality management
Implications
Emerging capabilities to process vast quantities of structured and unstructured data are bringing about changes in technology and business landscapes.
As data sets get bigger and the time allotted to their processing shrinks, look for ever more innovative technology to help organizations glean the insights they'll need to face an increasingly data-driven future.
What is Hadoop?
The best-known technology used for Big Data is Hadoop. It was inspired by Google's publications on MapReduce, the Google File System and Bigtable. Because Hadoop can be hosted on commodity hardware (usually Intel PCs running Linux, with one or two CPUs and a few TB of hard disk storage, without any RAID replication technology), it allows organizations to store huge quantities of data (petabytes or even more) at very low cost compared to SAN systems.
Hadoop is an open-source implementation of the MapReduce model described by Google. It is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is an Apache project sponsored by the Apache Software Foundation:
http://hadoop.apache.org/.
The Hadoop “brand” contains many different tools. Two of them are core parts of Hadoop:
Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into blocks, and each block is replicated and stored on several servers (3 by default; this can be customized) for fault tolerance.
Hadoop MapReduce is a way to split every job into smaller tasks that are sent to many small servers, allowing a truly scalable use of CPU power.
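The two ideas above can be illustrated with a small, single-process sketch. It is not Hadoop code and does not use any Hadoop API: the function names, the round-robin replica placement, the tiny block size and the node names are all assumptions made for illustration. It only shows the shape of the approach: input is cut into blocks, each block is assigned to several servers, and a word count runs as a map step per block followed by a shuffle and a reduce.

```python
from collections import defaultdict


def split_into_blocks(text, block_size):
    """Cut the input into blocks of `block_size` lines (toy stand-in for HDFS blocks)."""
    lines = text.splitlines()
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def place_replicas(num_blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers (hypothetical round-robin policy)."""
    if replication > len(servers):
        raise ValueError("not enough servers for the requested replication")
    return {
        block_id: [servers[(block_id + r) % len(servers)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across every map output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text, block_size=2):
    blocks = split_into_blocks(text, block_size)
    # In a real cluster each map call could run on a server holding that block.
    mapped = [map_phase(block) for block in blocks]
    return reduce_phase(shuffle(mapped))


sample = "big data is big\nhadoop stores big data\nhadoop processes data"
counts = word_count(sample)
placement = place_replicas(num_blocks=2, servers=["node1", "node2", "node3", "node4"])
print(counts)
print(placement)
```

Because every block is processed independently in the map step, adding servers lets more blocks be processed at the same time, and the replicas mean a block is still readable if one server fails.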
What problems can Hadoop solve?
• The Hadoop framework is used by major players including Google, Yahoo, IBM, eBay, LinkedIn and Facebook, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux but Hadoop can also work with BSD and OS X.
• The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
• Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by a platform modeled on the one Google built.
Big Data Market
The Big Data market is on the verge of a rapid growth spurt that will see it top the USD 50 billion mark worldwide within the next five years.
As of early 2012, the Big Data market stands at just over USD 5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged compound annual growth rate (CAGR) of 58% between now and 2017.
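The relationship between the figures quoted above can be checked with a short calculation. This is a sketch under stated assumptions: the starting value is rounded to exactly USD 5 billion (the text says "just over"), the period is taken as the five years from 2012 to 2017, and the function names are illustrative.

```python
def project(value, rate, years):
    """Compound `value` at a constant annual growth `rate` for `years` years."""
    return value * (1 + rate) ** years


def cagr(start, end, years):
    """Constant annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# Assumed inputs, in USD billions: about 5 in 2012, 58% CAGR, 5 years.
projected_2017 = project(5.0, 0.58, 5)
# Growth rate implied by going from 5 to 50 over the same period.
implied_rate = cagr(5.0, 50.0, 5)

print(f"Projected 2017 market: about USD {projected_2017:.1f} billion")
print(f"CAGR implied by 5 -> 50 in 5 years: {implied_rate:.1%}")
```

Starting from exactly USD 5 billion, a 58% rate lands slightly below USD 50 billion; a starting point just over 5 billion is what carries the projection past that mark.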
Vertical Perspective
Enhancing Fraud Detection for Banks and Credit Card Companies
Scenario
• Build up-to-date models from transactional data to feed real-time risk-scoring systems for fraud detection.
Requirements
• Analyze volumes of data with response times that are not possible today.
• Apply analytic models to individual clients, not just client segments.
Benefits
• Detect transaction fraud in progress, and allow fraud models to be updated in hours rather than weeks.
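The difference between scoring an individual client and scoring a segment can be sketched with a deliberately simple model. Everything here is an assumption made for illustration: the z-score rule, the threshold of 3 standard deviations, the function names and the sample amounts are hypothetical and do not describe any bank's actual fraud model.

```python
from statistics import mean, pstdev


def client_risk_score(history, amount):
    """How many standard deviations `amount` lies from this client's own average."""
    if len(history) < 2:
        return 0.0  # too little history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if amount == mu else float("inf")
    return abs(amount - mu) / sigma


def flag_transaction(history, amount, threshold=3.0):
    """Flag the transaction when it deviates strongly from the client's own pattern."""
    return client_risk_score(history, amount) > threshold


# Hypothetical purchase histories for two clients in the same segment.
frequent_small = [20.0, 25.0, 22.0, 18.0, 24.0]
big_spender = [900.0, 1200.0, 1100.0, 1000.0, 950.0]

# The same 1000.00 purchase is unusual for one client and routine for the other.
print(flag_transaction(frequent_small, 1000.0))
print(flag_transaction(big_spender, 1000.0))
```

A segment-level rule would treat both purchases alike; keeping a model per client is what multiplies the data volume and drives the response-time requirement listed above.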
Social Media Analysis for Products, Services and Brands
Scenario
• Monitor data from various sources such as blogs, boards, news feeds, tweets, and social media for information pertinent to brand and products, as well as competitors.
Requirement
• Extract and aggregate relevant topics and relationships, discover patterns, and reveal up-and-coming topics and trends.
Benefits
• Brand management for marketing campaigns; brand protection for ad placement networks.
Store Clustering Analysis in the Retail Industry
Scenario
• A retailer with a large number of stores needs to understand the cluster patterns of its shoppers.
Requirement
• Use shopping patterns across multiple characteristics, such as location, income and family size, for better product placement. Characteristics listed: age range, education, income, children, assets, urbanicity.
Benefits
• Store-specific clustering of products; clustering of specific types of products by location.
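As a rough illustration of the store clustering scenario, the sketch below groups stores by two shopper characteristics using k-means. The text does not name a clustering algorithm, so k-means is a choice made here for illustration. The store names, the two features (income and family size, already scaled to the 0 to 1 range) and the deterministic initialization are likewise assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids so runs are repeatable."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical stores: (scaled shopper income, scaled family size).
store_names = ["A", "B", "C", "D", "E"]
store_features = [(0.90, 0.20), (0.10, 0.80), (0.85, 0.25), (0.15, 0.90), (0.80, 0.30)]

assignments, centroids = kmeans(store_features, k=2)
for name, cluster in zip(store_names, assignments):
    print(f"store {name} -> cluster {cluster}")
```

Stores that land in the same cluster serve similar shoppers, so they are candidates for the same product placement. In practice more characteristics would be used, and features on different scales would need normalizing first.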
Healthcare and Energy Industry
Scenario
IBM Stream Computing for Smarter Healthcare
IBM Watson pairs natural language processing with predictive root cause analysis.
InfoSphere Streams-based analytics can alert hospital staff to impending life-threatening infections in premature infants up to 24 hours earlier than current practices.
Vestas Wind Systems uses IBM big data analytics software and powerful IBM systems to improve wind turbine placement for optimal energy output.