No doubt the amount of data your company collects is growing. But what's the point of amassing all that information if you can't use it to drive your business forward? Smart businesses are giving people throughout their organizations access to deeper intelligence by marrying their big data and business intelligence efforts into a big data solution. The result is better decisions based on meaningful insights company-wide. What's your strategy for big data analytics?
The Berkeley Data Analytics Stack (BDAS) is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos.
One flaw of Hadoop MapReduce is high latency. Given the growing volume, variety and velocity of data, organizations and data scientists need faster analytical platforms. Put simply, speed matters, and Spark gains it by caching data in memory and by optimizing communication between the master and the worker nodes.
Spark is an open source cluster computing system that makes data analytics fast. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce. Download for a test drive: http://spark-project.org/downloads
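The load-once, query-repeatedly idea can be illustrated with a small single-machine sketch. This is not Spark's API: the `CachedDataset` class, the `demo` function and the `events.log` file are hypothetical names invented for this illustration, and the example only shows why repeated queries get cheaper once the data is held in memory.

```python
import tempfile
from pathlib import Path


class CachedDataset:
    """Toy stand-in for an in-memory dataset (hypothetical, not Spark's API)."""

    def __init__(self, path):
        self.path = Path(path)
        self.disk_reads = 0   # how many times the file was actually read
        self._rows = None     # in-memory copy, filled on first access

    def rows(self):
        # Read from disk only on the first call; later calls reuse memory.
        if self._rows is None:
            self.disk_reads += 1
            self._rows = self.path.read_text().splitlines()
        return self._rows

    def count_matching(self, word):
        # A "query": count the lines that contain the given word.
        return sum(1 for line in self.rows() if word in line)


def demo():
    # Hypothetical log file, created only for this illustration.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "events.log"
        log.write_text("ERROR disk full\nINFO started\nERROR timeout\n")
        data = CachedDataset(log)
        errors = data.count_matching("ERROR")  # first query reads the file
        infos = data.count_matching("INFO")    # second query reuses memory
        return errors, infos, data.disk_reads
```

Calling `demo()` runs two queries against the same data while touching the disk once. A disk-based system that rereads its input for every job would pay that read cost on each query.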
Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100 times thanks to its ability to perform computations in memory. It is a computation engine that runs on top of the Hadoop Distributed File System (HDFS) and efficiently supports iterative processing (e.g., ML algorithms) and interactive queries. Spark provides an easy-to-program interface that is available in Java, Python, and Scala. Spark Streaming is a component of Spark that provides highly scalable, fault-tolerant stream processing. With this functionality, Spark provides integrated support for all major computation models: batch, interactive, and streaming.
Shark is a large-scale data warehouse system that runs on top of Spark and is backward-compatible with Apache Hive, allowing users to run unmodified Hive queries on existing Hive warehouses. In effect, it is a port of Hive onto Spark. Shark can answer HiveQL queries up to 100 times faster than Hive when the data fits in memory, and 5 to 10 times faster when the data is stored on disk, without modification to the data or the queries. It is open source as part of BDAS.
The Shark data analysis warehouse system:
By using Spark as the execution engine and employing novel and traditional database techniques, Shark bridges the gap between MapReduce and MPP databases.
Take Shark for a drive on EC2 (takes 10 minutes to spin up a cluster): http://shark.cs.berkeley.edu
Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications such as Hadoop, MPI, Hypertable, and Spark. As a result, Mesos allows users to easily build complex pipelines involving algorithms implemented in various frameworks.
The amount of data your organization produces, collects, stores, analyzes and distributes is growing. But what's the point of amassing all that information if you can't use it to improve production, increase revenues and reduce costs?
Big Data Platforms as a Service (PaaS) lets an organization use a service provider's compute power and analytical tools, store as much data as needed, and pay only for the resources used. Data should be protected with multiple layers of security, replicated across multiple data centers and easily exported.
The real value of Big Data Platforms as a Service is the ability to quickly scale up big data projects without the upfront CapEx required for an on-premises deployment. Additionally, organizations can scale down fast and pay only for the storage and compute resources they use.
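The trade-off between upfront CapEx and pay-per-use pricing can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and every figure passed to them is an invented example, not a real provider's pricing.

```python
def on_premises_cost(upfront_capex, monthly_opex, months):
    # Cumulative cost of owning the cluster: one-time purchase plus running costs.
    return upfront_capex + monthly_opex * months


def paas_cost(hourly_rate, hours_per_month, months):
    # Cumulative pay-per-use cost: only the hours actually consumed are billed.
    return hourly_rate * hours_per_month * months


def break_even_month(upfront_capex, monthly_opex, hourly_rate,
                     hours_per_month, max_months=120):
    """Return the first month in which cumulative PaaS spend reaches the
    on-premises cost, or None if that never happens within max_months."""
    for month in range(1, max_months + 1):
        paas = paas_cost(hourly_rate, hours_per_month, month)
        owned = on_premises_cost(upfront_capex, monthly_opex, month)
        if paas >= owned:
            return month
    return None


# Hypothetical figures: 100,000 upfront, 2,000 per month to operate,
# versus 10 per cluster-hour on a PaaS.
heavy_use = break_even_month(100_000, 2_000, 10, 500)  # 500 hours per month
light_use = break_even_month(100_000, 2_000, 10, 100)  # 100 hours per month
```

With these made-up numbers, heavy use catches up with the on-premises cost in month 34, while light use never does within the ten-year horizon. That is the case for starting on a PaaS while workloads are still small or uncertain.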
Big Data PaaS offerings reduce the need for organizations to hire and/or train big data staff, a challenging task given the current shortage of skilled big data practitioners. Instead, the service providers are responsible for deploying, managing and scaling installations.
Big Data PaaS may be a good starting point for organizations that want to tap into the power of big data analytics but are not yet prepared to commit to a full-scale, production-level deployment.
We are in the pre-industrial age of data analytical platforms and there are many different types of Big Data PaaS offerings. The following is a partial list:
Choosing a Big Data Technology Stack for Digital Marketing
Different techniques for evaluating, planning or designing a technology stack for digital marketing.
Knowing what your users are doing on your site in real time, and matching what they do with more targeted information, translates into a better conversion rate and higher user satisfaction, which means more revenue in the end. That has been one of the primary drivers for moving many batch-oriented analytics systems to real time. Facebook's real-time analytics system is a good reference in that regard.
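To make the contrast with batch processing concrete, here is a minimal single-process sketch of a real-time counter that tracks page views over a sliding time window. It is not Facebook's design or any product's API: the `SlidingWindowCounter` class is a hypothetical name, and the sketch assumes events arrive in non-decreasing timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts page views seen during the last `window_seconds` (hypothetical sketch)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, page) pairs in arrival order
        self.counts = Counter()  # current views per page inside the window

    def record(self, timestamp, page):
        # Update the counts as each event arrives instead of waiting for a batch job.
        self.events.append((timestamp, page))
        self.counts[page] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window and undo their counts.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_page = self.events.popleft()
            self.counts[old_page] -= 1
            if self.counts[old_page] == 0:
                del self.counts[old_page]

    def top(self, now, n=1):
        # Most viewed pages in the window ending at `now`.
        self._expire(now)
        return self.counts.most_common(n)


counter = SlidingWindowCounter(window_seconds=60)
counter.record(0, "/home")
counter.record(10, "/pricing")
counter.record(20, "/pricing")
top_at_30 = counter.top(now=30)  # all three events are still inside the window
top_at_75 = counter.top(now=75)  # only the event at t=20 remains
```

A production system would distribute this state across many nodes and persist it, but the core idea is the same: the answer is kept up to date as events arrive, so it can be read at any moment.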
In the first part of the session we will learn from Facebook's experience and the reasons they chose HBase over Cassandra. In the second part we will learn how to build our own real-time analytics system, achieve better performance, and make deployment and scaling significantly simpler, using commonly used Java APIs and frameworks such as JPA and Spring on top of a scale-out architecture based on Cassandra and GigaSpaces.