Comments

06/12/2012 17:44

There is an explosion of data sources, an explosion of data volumes and variety, and an increasing demand for timeliness in development, execution, and analysis.

Data should be stored in the technology best suited for storage and accessed through data virtualization.

Offloading data analytics from traditional data warehouse environments to analytic platforms is logical and, all things being equal, a best practice.

Reply
John Berry
06/12/2012 20:17

How do you deliver big-data BI in a timely fashion to those who need it to make critical strategic and operational decisions?

Reply
06/12/2012 20:42

In a world where we create over two quintillion bytes of data every day, global leaders in academia, industry, and government are grappling with the problem of how to organize, store, evaluate, share, and protect this vast amount of information.

To address the questions surrounding this powerful and growing field of data discovery, Mary Ann Liebert, Inc., publishers announces the launch of Big Data, a highly innovative, peer-reviewed journal that will provide a unique forum for world-class research exploring the challenges and opportunities in collecting, analyzing, and disseminating vast amounts of data, including big data analytics.

"There is significant need for a journal on big data that will enable discussions, exchange of important ideas, and facilitate debate through a multimedia journal platform," says Mary Ann Liebert, President of Mary Ann Liebert, Inc., publishers. "We need to harness the vast opportunities lying within big data to gain knowledge that will potentially solve many of the problems we face as a global society. This journal has this mandate."

A multidisciplinary editorial team of opinion leaders is gathering to build this new forum for the big data community, including:

Executive Editor Eugene Kolker, PhD, Chief Data Officer, Seattle Children's Hospital;
Geoffrey Charles Fox, PhD, Associate Dean for Graduate Studies & Research, Professor of Computer Science and Informatics, Indiana University, Bloomington;
Sorin Istrail, PhD, Julie Nguyen Brown Professor of Computational and Mathematical Sciences, Professor of Computer Science, Brown University;
Folker Meyer, Computational Biologist, Argonne National Laboratory, Argonne, IL, Senior Fellow at the Computation Institute at the University of Chicago, Chicago, IL and Associate Division Director of the Institute of Genomics and Systems Biology, Chicago, IL; and
Rick Stevens, PhD, Associate Laboratory Director for Computing, Environment, and Life Sciences at Argonne National Laboratory, Chicago, IL, Professor of Computer Science, University of Chicago, Senior Fellow of the University of Chicago & Argonne National Laboratory Computation Institute (CI), co-Director of the Argonne Futures Lab.

Big Data journal will facilitate and support the efforts of researchers, analysts, statisticians, business leaders, and policymakers to improve operations, profitability, and communications within their organizations. Spanning a broad array of disciplines focusing on novel big data technologies, policies, and innovations, the Journal will bring together the community to address the challenges and discover new breakthroughs and trends living within this information. The Journal will be published in print and online.

Gregory PS, Editor: I have spoken to the publisher, and they are looking for editorial board members. If interested, contact them.

Mary Ann Liebert, Inc., publishers is a privately held, fully integrated media company known for establishing authoritative medical and biomedical peer-reviewed journals, including Journal of Computational Biology, OMICS, Disruptive Science and Technology, and Population Health Management.

Reply
06/12/2012 20:56

Big Data represents a fundamental shift in business decision making. Getting it in a timely fashion to those who need it to make critical strategic and operational decisions is a major project for the organization, one that may require professional counsel and services.

Organizations are used to analyzing internal data – sales, shipments, inventory.

Now they are increasingly analyzing external data too, gaining new insights into customers, markets, supply chains and operations.

Reply
06/12/2012 21:17

It's critical to have a clear understanding of an organization's data management requirements and a well-defined strategy before venturing too far down the big data analytics path.

Yet making a big investment to attack the big data problem without first figuring out how and where it can really add value to the business is one of the most serious missteps.

Reply
06/13/2012 07:54

Rather than starting with tech, I suggest starting from a business perspective: have the CIO, data scientists, and business people talk through what the business objectives are and what value can be derived, then work backwards from there.

Reply
06/13/2012 08:02

Define exactly what data is available and map out how the organization can best leverage those resources.

CIOs and data warehouse practitioners need to examine what data is being retained, aggregated and utilized and compare that with what data is being thrown away.

It's also critical to consider external data sources that are currently not being tapped but could be a compelling addition.

Reply
06/13/2012 08:23

Even if you don't know what you're going to use the data for yet, start capturing the information so you have a deep history of information to draw on later.

Reply
06/13/2012 08:27

When analyzing big data sets it makes sense to define small, high-value opportunities and use those as a starting point. Define in meticulous detail your business objectives.

As you expand data sources and create the analytical models that will uncover patterns, be vigilant about homing in on those patterns that are most important to stated business objectives.

Reply
06/13/2012 08:35

A good strategy is to pick very targeted spots and build out the data analytics program over time.

The targeted spots should be those most important to meeting critical business goals.

Reply
06/13/2012 08:41

Start with small bites of data -- taking individual flows and migrating them into different systems.

If you're working with any kind of scale, you can't roll something like this out overnight.

Reply
06/13/2012 10:11

You need to consider scale as a primary factor when mapping out a big data analytics roadmap.

Consider what the ramp-up will look like: how much data will you be storing six months, one year, or three years from now? How many more servers will you need to handle it? Is the software up to the task?

Think about how much it is going to grow and how popular the solution might be once it's rolled into production.

How many different ways will people work with the data? Does one data scientist exploring the data equal 10, 20, or 100 regular business users?

Think about how many people will use the solution in three months, six months, one year, and three years.
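
The ramp-up questions in this comment can be turned into rough arithmetic. The following is a minimal sketch, assuming steady compound monthly growth; the function name `project_capacity` and every figure in it (starting volume, growth rate, capacity per server) are hypothetical placeholders, not measurements.

```python
import math


def project_capacity(current_tb, monthly_growth, months, tb_per_server):
    """Project data volume and server count after `months` of compound growth."""
    volume_tb = current_tb * (1 + monthly_growth) ** months
    servers = math.ceil(volume_tb / tb_per_server)
    return volume_tb, servers


# All figures below are hypothetical and only illustrate the calculation.
for months in (6, 12, 36):
    volume, servers = project_capacity(
        current_tb=50, monthly_growth=0.05, months=months, tb_per_server=20
    )
    print(f"{months:>2} months: {volume:8.1f} TB on {servers} servers")
```

Replacing the placeholder inputs with your own measured growth rate is the point of the exercise; the same loop can be repeated for user counts.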

Reply
06/13/2012 11:02

We are leveraging open source Hadoop and a number of new analytic tools, but we do not have the analytics talent to make full use of them.

Reply
Lars Sorenson
06/13/2012 11:19

The biggest challenges related to big data analytics boil down to a simple one-two punch:

The technology is still fairly raw and user-unfriendly, and there aren't enough skilled experts to go around.

Reply
06/13/2012 12:02

The most important issue is being able to analyze and act on data in real-time.

Reply
06/13/2012 12:16

The Forrester Wave™: Predictive Analytics And Data Mining Solutions, Q1 2010 - Sponsored Whitepaper

http://www.jazdlifesciences.com/pharmatech/research/Portrait-Software.htm?contentSetId=78417&supplierId=30014357

In Forrester's 53-criteria evaluation of predictive analytics and data mining (PA/DM) vendors, we found that SAS Institute, SPSS (evaluated separately from new parent IBM's other PA/DM offerings), KXEN, Oracle, Portrait Software, and IBM (pre-SPSS acquisition PA/DM offerings) head the pack with mature, sophisticated, scalable, flexible, and robust solutions. SAS leads, providing the most feature-rich solution portfolio and, through its recent expansion of enterprise data warehouse (EDW) vendor partnerships, taking the industry lead in promoting in-database analytics as an emerging best practice for high-performance analytics deployments. SPSS is rapidly integrating its already strong PA/DM solution portfolio with new parent IBM's extensive data management solution family. KXEN stands out for its focus on content analytics, sentiment analysis, and social network analysis. Oracle distinguishes itself through the depth of its PA/DM tool's integration into its enterprise database and application portfolio. Portrait provides a comprehensive set of customer analytics offerings that integrate with its core PA/DM tool. TIBCO Software, FICO, and Angoss Software are Strong Performers in the PA/DM market.

Reply
06/13/2012 12:26

Oracle Data Mining 11g Release 2
Competing on In-Database Analytics

http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/twp-data-mining-11gr2-160025.pdf

Oracle Data Mining provides powerful data mining functionality within the Oracle Database. It enables you to discover new insights hidden in your data and to leverage your investment in Oracle Database technology.

With Oracle Data Mining, you can build and apply predictive models that help you target your best customers, develop detailed customer profiles, and find and prevent fraud.

Oracle Data Mining, a component of the Oracle Advanced Analytics Option, helps your company better compete on analytics.

Reply
Kevin Hanson
06/13/2012 13:13

Our IT team is used to working with relational database management systems, which have a different model of storing and processing data, one that is not optimal for data analytics.

Reply
06/13/2012 13:21

Data mining 5 step process:

• Sample the data by creating a target data set large enough to contain the significant information, yet small enough to process.

• Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.

• Modify the data by creating, selecting and transforming the variables to focus the model selection process.

• Model the data by using analytical tools to search for a combination of data that reliably predicts a desired outcome.

Reply
06/13/2012 13:25

Data mining 5 step process:

Step 5:

• Assess the data and models by evaluating the usefulness and reliability of the findings from the data mining process.
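
The five steps above (sample, explore, modify, model, assess) can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the customer data is randomly generated, the "model" is a deliberately simple threshold rule, and all names (`population`, `transform`, `predict`) are hypothetical.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of (monthly_spend, responded_to_offer) records.
population = []
for _ in range(10_000):
    responded = rng.random() < 0.3
    spend = rng.gauss(120 if responded else 80, 25)
    population.append((spend, responded))

# 1. Sample: small enough to process, large enough to carry the signal.
sample = rng.sample(population, 1_000)
train, holdout = sample[:700], sample[700:]

# 2. Explore: compare the two groups to look for a relationship.
mean_yes = statistics.mean(s for s, r in train if r)
mean_no = statistics.mean(s for s, r in train if not r)

# 3. Modify: transform the variable (here, standardize the spend).
mu = statistics.mean(s for s, _ in train)
sigma = statistics.stdev(s for s, _ in train)


def transform(spend):
    return (spend - mu) / sigma


# 4. Model: a threshold halfway between the two group means.
threshold = transform((mean_yes + mean_no) / 2)


def predict(spend):
    return transform(spend) > threshold


# 5. Assess: measure accuracy on records the model has not seen.
accuracy = sum(predict(s) == r for s, r in holdout) / len(holdout)
print(f"responders avg {mean_yes:.1f}, others avg {mean_no:.1f}")
print(f"holdout accuracy: {accuracy:.2f}")
```

A real project would swap the threshold rule for a proper modeling tool; the shape of the loop (and the held-out assessment at the end) is what carries over.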

Reply
06/13/2012 13:36

While data management professionals have a well-defined set of expertise around managing and organizing highly structured data and modeling and creating reports in SQL, those conventional skill sets don't translate well to the unstructured, flat-file world of big data, where command lines and NoSQL technologies are the core building blocks of most of the emerging platforms.

The following fast and scalable databases should be considered as an alternative to SQL for the right project:

Cassandra
CouchDB
MongoDB
Redis
Riak
Neo4J
FlockDB

See: http://www.rosebt.com/1/post/2011/07/new-databases-to-consider.html

Reply
06/13/2012 13:43

In a recent survey, more than half (54 percent) of business leaders cited access to talent as a key impediment to making the most of data, followed by the barrier of organizational silos (51 percent).

Reply
06/13/2012 13:52

Predictive analytics and data mining activities:

• Discover patterns, trends and relationships represented in data.

• Develop models to understand and describe characteristics and activity based on these patterns.

• Use those insights to help evaluate future options and make fact-based decisions.

• Deploy scores and results for timely, appropriate action.

• Manage the life cycle of models and monitor their performance to avoid decay.

Reply
06/13/2012 14:05

Data mining is an iterative process of selecting, exploring and modeling large amounts of data to identify meaningful, logical patterns and relationships among key variables.

Data mining is used to uncover trends, predict future events and assess the merits of various courses of action.

Data mining has been used for many years by businesses, scientists and governments to transform data into business intelligence. It can be applied to a variety of business issues in any industry – from customer segmentation and targeting to fraud detection and credit risk scoring to identifying adverse drug effects during clinical trials.

Many organizations use data mining techniques to segment customers by behavior, demographics or attitudes – to understand what products or services each segment would want or need in the future.

Once you have properly identified the segments, you can create response models to predict which customers are likely to respond. You can further complement the customer acquisition model with a credit scoring model to find out which of those customers are a good credit risk and worth the investment to acquire or retain.

Data mining outperforms rules-based systems for detecting fraud, even as fraudsters become more sophisticated in their tactics. Models can be built to cross-reference data from a variety of sources, correlating nonobvious variables with known fraudulent traits to identify new patterns of fraud.

For its potential to yield predictive insights from masses of diverse data points, data mining has proven to be an invaluable component of an enterprise intelligence framework.

Reply
06/13/2012 14:34

New NoSQL Databases Review

The following fast and scalable databases should be considered as an alternative to SQL for the right project:

Cassandra
CouchDB
MongoDB
Redis
Riak
Neo4J
FlockDB

Some of these new databases are quite sophisticated, while others are deliberately bare bones. But they are fast and scalable. All of the packages are relatively stable and useful -- for the right projects. But none of them are as feature-rich or sophisticated as the best commercial SQL tools.

These new databases differ from traditional relational databases, which follow ACID rules and have converged on a set of features and a standard language. They all appear to store basic pairs of keys and values, but they're tuned for different use cases. The major variations aren't in the format of the data but in how often it's replicated, cached, and sharded.

The advantage comes when a project's specific needs fit the abilities of one of the new databases. If they line up well, the performance boosts can be incredible because the project developers aren't striving to build one Dreadnought to solve every problem.

The downside is a lack of interchangeability at this time, though that may change in the future. Switching is more difficult than it is with the SQL databases. There's no standard query language, nor is there a vast array of abstraction layers like JDBC.

Summary:

Cassandra

Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache where it's found plenty of support in many communities. It's not just for Facebook any longer. Many of the committing programmers come from other companies, and the project chair works at DataStax.com, a company devoted to providing commercial support for Cassandra.

The heritage of the Cassandra project is obvious because it's a good tool for tracking lots of data, such as status updates at Facebook. The tool helps create a network of computers that all carry the same data. Each machine is meant to be equal to the others, and all of them should end up being consistent once the data propagates around the P2P network of nodes, though it's not guaranteed.

The key phrase is "eventual consistency," not "perfect consistency." If you've watched your status updates disappear and reappear on Facebook, you'll understand what this means.

The tool runs in Java as a separate process waiting for interaction. There's already a collection of higher-level libraries for Java, Python, Ruby, and PHP, as well as some of the other languages.

Using Cassandra seems relatively simple, but I still found myself getting hung up on several barriers, such as defining a keyspace (which acts as a namespace but for the columns). Getting up to speed takes more than a few minutes because there are more than just the basic routines for storing collections of values. Cassandra is happy with a sparse matrix where each row stores only a few standard columns, and it builds the indices with this in mind.

Much of the complexity in the API is devoted to controlling just how quickly the cluster of nodes moves toward consistency. You can specify the speed of synchronization for columns and collections of values called supercolumns.

Getting everything running is now fairly well documented, but getting it running quickly requires a fair amount of both hardware and operating system tuning. The biggest bottleneck is the commit log. Optimizing the way that this is written to disk is the most important part of improving writes. Speeding up the extraction of data involves paying attention to the pattern of reads. Did your old, fancy database do this for you fairly automatically? Ah, don't complain. It's fun to think about the hardware and how it affects your software.

CouchDB

CouchDB stores documents, each of which is made up of a set of pairs that link key with a value. The most radical change is in the query. Instead of some basic query structure that's pretty similar to SQL, CouchDB searches for documents with two functions to map and reduce the data. One formats the document, and the other makes a decision about what to include.

I'm guessing that a solid Oracle jockey with a good knowledge of stored procedures does pretty much the same thing. Nevertheless, the map and reduce structure should be eye-opening for the basic programmer. Suddenly a client-side AJAX developer can write a fairly complicated search procedure that can encode some sophisticated logic.
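
The map-and-reduce query style described above can be illustrated with a small in-memory analogue. This is a sketch in plain Python, not CouchDB's API (CouchDB views are normally written in JavaScript); the documents and the names `map_order`, `reduce_sum`, and `run_view` are hypothetical. In this sketch the map function decides which documents to include and emits key-value pairs, and the reduce function combines the values emitted for each key.

```python
from collections import defaultdict

# Hypothetical documents: each is a set of key-value pairs.
documents = [
    {"_id": "1", "type": "order", "customer": "alice", "total": 30},
    {"_id": "2", "type": "order", "customer": "bob", "total": 15},
    {"_id": "3", "type": "note", "text": "call back"},
    {"_id": "4", "type": "order", "customer": "alice", "total": 20},
]


def map_order(doc):
    """Emit (customer, total) only for documents that are orders."""
    if doc.get("type") == "order":
        yield doc["customer"], doc["total"]


def reduce_sum(values):
    """Combine all values emitted for one key."""
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    """Apply map to every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


totals = run_view(documents, map_order, reduce_sum)
print(totals)
```

The query logic lives entirely in the two small functions, which is why a developer with no SQL background can still express fairly sophisticated searches.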

The core of CouchDB is written in Erlang, but the API and interface is all JavaScript or JSON. You won't need to worry about this detail. The JavaScript API only enhances CouchDB's appeal for the average Web developer who can store documents and even entire websites inside the database itself.

There's a burgeoning community growing around CouchDB. All of the major languages now have client libraries that simplify the interaction with the database and make it possible to store your data. They don't always expose all of t

Reply
06/13/2012 14:37

MongoDB

MongoDB is just one of the examples of how JavaScript is taking over the world. The program takes data formatted as JavaScript objects (a format known as JSON) and stores them away. Queries are basic JavaScript functions. It's not much different from using the console of your browser.

Well, that's simplifying things a bit. The big difference is that MongoDB will create indices for the columns of your database and return queries faster when the indices are correctly constructed. That's part of your job, by the way. You want to anticipate which indices your users will need.

You don't need to speak the subset of JavaScript for this language because there's a big collection of libraries and drivers written for all of the major languages and many of the minor ones.
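
Why a well-chosen index makes queries faster can be shown with a conceptual stand-in. The sketch below uses a plain Python dictionary as the "index"; it is not MongoDB's API or its actual index structure, and the documents and the names `scan` and `build_index` are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents, written as Python dicts standing in for JSON objects.
documents = [
    {"name": "ann", "city": "Oslo"},
    {"name": "bob", "city": "Lima"},
    {"name": "cy", "city": "Oslo"},
]


def scan(docs, field, value):
    """Without an index: every document has to be examined."""
    return [doc for doc in docs if doc.get(field) == value]


def build_index(docs, field):
    """Map each value of `field` to the documents that carry it."""
    index = defaultdict(list)
    for doc in docs:
        if field in doc:
            index[doc[field]].append(doc)
    return index


# Built once, ahead of time, for a field you expect users to query.
city_index = build_index(documents, "city")

# With the index: one dictionary lookup instead of a full scan.
in_oslo = city_index.get("Oslo", [])
print([doc["name"] for doc in in_oslo])
```

The index only helps queries on the field it was built for, which is why anticipating the queries is part of the job.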

These libraries are extensive, and some of the major languages have extra layers that wrap and unwrap objects when storing and retrieving them.

There's also a fair number of extra tools for working with the database. PHPMoAdmin, a cousin of the MySQL tool PHPMyAdmin, is just one of almost a dozen tools for admins. The proliferation of these tools is gradually erasing one of the standard reasons for sticking with a classic database. As I found more of them, I noticed that everything was more comfortable.

Reply
06/13/2012 14:38

Redis

Like CouchDB and MongoDB, Redis stores documents or rows made up of key-value pairs. Unlike the rest of the NoSQL world, it stores more than just strings or numbers in the value. It will also include sorted and unsorted sets of strings as a value linked to a key, a feature that lets it offer some sophisticated set operations to the user. There's no need for the client to download data to compute the intersection when Redis can do it at the server.

This approach leads to some simple structures without much coding. Luke Melia tracked the visitors on his website by building a new set every minute. The union of the last five sets defined those who were "online" at that moment. The intersection of this union with a friends list produced the list of online friends. These sorts of set operations have many applications, and the Redis crowd is discovering just how powerful they can be.

Redis is also known for keeping the data in memory and only writing out the list of changes every once and a bit. Some don't even call it a database, preferring instead to focus on the positive by labeling it a powerful in-memory cache that also writes to disk.
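
The online-friends technique in this comment comes down to one union and one intersection. The sketch below uses Python's built-in sets as a stand-in for the server-side operations (Redis exposes them as the SUNION and SINTER commands); the visitor data and the function name `online_friends` are hypothetical.

```python
# Hypothetical per-minute visitor sets, oldest first.
visitors_by_minute = [
    {"ann", "bob"},
    {"bob", "cy"},
    {"dee"},
    {"ann", "eve"},
    {"fay"},
    {"gus", "bob"},
]
friends_of_ann = {"bob", "dee", "zed", "cy"}


def online_friends(minute_sets, friends, window=5):
    """Union the most recent `window` sets, then intersect with the friends list."""
    recent = minute_sets[-window:]
    online = set().union(*recent)
    return online & friends


result = online_friends(visitors_by_minute, friends_of_ann)
print(sorted(result))  # ['bob', 'cy', 'dee']
```

In Redis the same two steps run on the server, so the client never has to download the per-minute sets.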

Traditional databases are slower because they wait until the disk gets the information before signaling that everything is OK. Redis waits only until the data is in memory, something that's obviously faster but potentially dangerous if the power fades at the wrong moment.

The project leaders are still exploring how to expand the project, an intriguing decision because there's more than one official version of Redis from the main team. There's even one official build of Redis that comes with a Lua interpreter and a disclaimer saying that "there is no guarantee that scripting works correctly or that it will be merged into future versions of Redis!" Projects like these are never boring.

Redis providers are starting to appear. OpenRedis promises it's "launching soon." Meanwhile, Redis Straight Up charges just $19 per month, plus all of the costs from Amazon's cloud. The service handles the configuration and passes the costs on to you.

Reply
06/13/2012 14:39

Riak

Riak is one of the more sophisticated data stores. It offers most of the features found in others, then adds more control over duplication. Although the basic structure stores pairs of keys and values, the options for retrieving them and guaranteeing their consistency are quite rich.

The write operations, for instance, can include a parameter that asks Riak to confirm when the data has been propagated successfully to any number of the machines in the cluster. If you don't want to trust just one machine, you can ask it to wait until 2, 3, or 54 machines have written the data before sending the acknowledgment. This is why the team likes to toss around its slogan: "Eventual consistency is no excuse for losing data."

The data itself is not just written to disk. Well, that is one of the options, but it's not the main one. Riak uses a pluggable storage engine (Bitcask by default) that writes the data to disk in its own internal format.
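
The write-confirmation parameter described above can be illustrated with a toy simulation. This is not Riak's client API; the `Replica` class, the `quorum_write` function, and the parameter name `w` are hypothetical names chosen for the sketch, which only shows the idea of acknowledging a write once a chosen number of machines have stored it.

```python
class Replica:
    """A toy storage node that may be unavailable."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.store = {}

    def write(self, key, value):
        """Store the value and report success, or report failure if down."""
        if not self.healthy:
            return False
        self.store[key] = value
        return True


def quorum_write(replicas, key, value, w):
    """Acknowledge the write only if at least `w` replicas confirm it."""
    acks = sum(replica.write(key, value) for replica in replicas)
    return acks >= w


cluster = [Replica("a"), Replica("b"), Replica("c", healthy=False)]

# Two healthy replicas confirm, so a requirement of two is met.
print(quorum_write(cluster, "user:1", "v1", w=2))

# A requirement of three cannot be met while one replica is down.
# The value still lands on the healthy replicas; the caller just
# does not get the acknowledgment it asked for.
print(quorum_write(cluster, "user:1", "v2", w=3))
```

Raising the required count trades write latency and availability for a stronger guarantee that the data survives the loss of a machine.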

There are several other options, including a version of InnoDB for those who are nostalgic for MySQL. You can get all of the belts and suspenders with the clustering power of Riak.

When it comes time to fetch the data, Riak offers to eliminate any of the errors that might appear. If two nodes end up with different versions of an object, Riak can either choose the youngest update or return both of the objects and leave the decision up to your client code. This is a very useful option for detecting potential errors in the data.

There are a large number of query options. The basic architecture is map and reduce, but there is also the chance to write the functions in either Erlang or JavaScript.

The project is shepherded by Basho, a company that provides both open source and enterprise versions of Riak. The open source version appears quite feature-rich. The main differences in the enterprise version are a slicker Web-based administration tool and the availability of high-speed, internode communication across data centers. And only the enterprise version can use SNMP.

Reply
06/13/2012 14:40

Neo4J

If there's one application that's most different in this collection, it's Neo4J, a tool optimized to store graphs instead of data. The Neo4J folks use the word "graph" like a computer scientist to mean a network of nodes and connections. Neo4J lets you fill up the data store with nodes and then add links between the nodes that mean things. Social networking applications are its strength.

The code base comes with a number of common graph algorithms already implemented. If you want to find the shortest path between two people -- which you might for a site like LinkedIn -- then the algorithms are waiting for you.

Neo4J is pretty new, and the developers are still uncovering better algorithms. In one recent version, they bragged about a new caching strategy: searching algorithms will run much faster because Neo4J is now caching the node information.
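
To make the shortest-path idea concrete, here is a breadth-first search over a small in-memory graph. It is a generic textbook algorithm written in plain Python, not Neo4J's implementation or API; the people in `graph` and the function name `shortest_path` are hypothetical.

```python
from collections import deque

# Hypothetical social graph: each person maps to their direct connections.
graph = {
    "ann": ["bob", "cy"],
    "bob": ["ann", "dee"],
    "cy": ["ann", "dee"],
    "dee": ["bob", "cy", "eve"],
    "eve": ["dee"],
}


def shortest_path(graph, start, goal):
    """Return one shortest chain of connections, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the chain of predecessors back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


path = shortest_path(graph, "ann", "eve")
print(path)  # ['ann', 'bob', 'dee', 'eve']
```

A graph database keeps the nodes and links in a form where this kind of traversal is cheap, so the application does not have to load the whole network into memory as this sketch does.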

They've also added a new query language with pattern matching that looks a bit like XSL. You can search a graph until you identify nodes with the right type of data. It is a new syntax to learn.

The Neo4J project is backed by Neo Technology, which offers commercial versions of the database with more sophisticated monitoring, fail-over, and backup features.

Reply
06/13/2012 14:41

FlockDB

If someone out there is writing code, someone else out there is complaining that the code is too complicated. It should be no surprise that some people think Neo4J is too intricate and sophisticated for what needs to be done. We know that Neo4J has truly arrived because the FlockDB fans are clucking about how FlockDB is simpler and faster.

FlockDB is a core part of the Twitter infrastructure. It was released by Twitter more than a year ago as an open source project under the Apache license. If you want to build your own Twitter, you can also download Gizzard, a tool for sharding data across multiple instances of Flock. Both tools are ready and waiting to run in a JVM.

Although many of us would call FlockDB a graph database because it stores relationships between nodes, some think that the term should apply only to sophisticated tools like Neo4J. Did someone start following someone else? Well, you can link up Flock's nodes with data such as the time that the relationship began. That part is like Neo4J. Where Flock differs is how deeply you can query this data.

FlockDB takes a pair of nodes and gives you the data about the connection. Everything else is up to you. Neo4J not only enables all types of graph-walking algorithms, but it provides them as services. FlockDB uses the word "non-goal" for these multihop queries, meaning that the developers have no interest in supporting them.

The code is pretty new, and it doesn't seem to be attracting the same kind of widespread attention as some of the other projects. All of the recent commits have come from Twitter employees, and I wasn't able to find anyone offering FlockDB hosting as a service. FlockDB still seems to be mainly a Twitter project.

Reply
06/13/2012 14:48

How do you choose a NoSQL database?

The following fast and scalable databases should be considered as an alternative to SQL for the right project:

Cassandra
CouchDB
MongoDB
Redis
Riak
Neo4J
FlockDB

All are fast and scalable.

You should select the NoSQL database that is best suited to your specific project.

There's no easy answer. Most shops would be happy with any of them, even if they select the worst one for their needs.

Choosing the best, though, is a bit harder because a good developer will want to balance the strength of the project, the availability of commercial support, and the quality of the documentation with the quality of the code.

The greatest divergence is in the extras. All of them will store piles of keys with their values, but the real question is how well they split the load across servers and how well they propagate changes across them.

Then there's the question of hosting. The idea of a cloud service that will do all of the maintenance for you is seductive.

The stakes are higher because switching is more difficult than it is with the SQL databases. There's no standard query language in this world, nor is there a vast array of abstraction layers like the JDBC.

These NoSQL databases have the power to lock you in. That's the price for all of the fun and features.

Reply
06/13/2012 14:56

Target data is generally divided into two sets, the training set and the test set.

The training set is used to train the data mining algorithm(s).

The test set is used to verify the accuracy of any patterns found.
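
A minimal sketch of that split, assuming a simple random shuffle is appropriate for the data; the function name `train_test_split`, the 25 percent test fraction, and the fixed seed are illustrative choices, not something prescribed here.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical target data: 100 record identifiers.
records = list(range(100))
train_set, test_set = train_test_split(records)
print(len(train_set), len(test_set))  # 75 25
```

Keeping the test records out of training is what makes the later accuracy check meaningful: patterns that only fit the training set will not verify on the held-out records.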

Reply
12/06/2012 01:05



The data mining process involves sorting through vast amounts of data to acquire pertinent information. Data mining is usually undertaken by professional financial and business analysts. Outsourcing enables a company to shift its focus to core business operations and improve its overall productivity, so it can be a wise choice for any business. Outsourcing therefore helps businesses manage data effectively.

Reply


