High Performance Computing (HPC) plus data science allows public and private organizations to extract actionable, valuable intelligence from massive volumes of data and to use predictive and prescriptive analytics to make better decisions and create game-changing strategies. The integration of computing resources, software, networking, data storage, information management, and data scientists applying machine learning algorithms is the secret sauce for achieving the fundamental goal of creating durable competitive advantage.
HPC has evolved in the past decade to provide "supercomputing" capabilities at significantly lower cost. Modern HPC uses parallel processing techniques to solve complex computational problems, and HPC technology focuses on developing the parallel algorithms, systems, and administrative techniques that make large-scale computation practical.
HPC enables data scientists to address challenges that were previously unmanageable. It expands modeling and simulation capabilities, including advanced data science techniques such as random forests, Monte Carlo simulations, Bayesian probability, regression, naive Bayes, k-nearest neighbors, neural networks, and decision trees.
Additionally, HPC allows an organization to conduct controlled experiments in a timely manner, as well as research questions that would be too costly or time-consuming to study experimentally. With HPC you can build mathematical models and run numerical simulations to gain understanding of systems that are difficult to observe directly.
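As a small illustration of the kind of embarrassingly parallel numerical simulation that HPC accelerates, the sketch below estimates pi with a Monte Carlo method spread across worker processes; the process count and sample sizes are arbitrary choices made only for this example, and on a real cluster the same idea would typically be expressed with MPI or Spark.

    # Monte Carlo estimate of pi, split across worker processes.
    # A minimal sketch of an embarrassingly parallel simulation.
    import random
    from multiprocessing import Pool

    def count_hits(n_samples):
        """Count random points that fall inside the unit quarter circle."""
        hits = 0
        for _ in range(n_samples):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:
                hits += 1
        return hits

    if __name__ == "__main__":
        n_workers, samples_per_worker = 8, 1_000_000   # arbitrary example sizes
        with Pool(n_workers) as pool:
            hits = sum(pool.map(count_hits, [samples_per_worker] * n_workers))
        print("Estimated pi:", 4.0 * hits / (n_workers * samples_per_worker))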
HPC technology today is implemented in multidisciplinary areas including:
• Finance and trading
• Oil and gas industry
• Electronic design automation
• Media and entertainment
• Geographical data
• Climate research
In the near future both public and private organizations in many domains will use HPC plus data science to boost strategic thinking, improve operations and innovate to create better services and products.
Introduction to Machine Learning - Slides
Ameet Talwalkar and Evan Sparks present their work on the MLbase project, a distributed machine learning platform being built on top of Apache Spark. This presentation was given on August 6, 2013. See: http://mlbase.org/
In this talk we describe our efforts, as part of the MLbase project, to develop a distributed Machine Learning platform on top of Spark. In particular, we present the details of two core components of MLbase, namely MLlib and MLI, which are scheduled for open-source release this summer. MLlib provides a standard Spark library of scalable algorithms for common learning settings such as classification, regression, collaborative filtering and clustering. MLI is a machine learning API that facilitates the development of new ML algorithms and feature extraction methods. As part of our release, we include a library written against the MLI containing standard and experimental ML algorithms, optimization primitives and feature extraction methods.
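To give a concrete feel for the style of API MLlib exposes, here is a minimal sketch (not taken from the MLbase release) that trains a logistic regression classifier through Spark's Python bindings; the input file "data.txt" and its label-then-features layout are assumptions made for the example.

    # Minimal sketch: training a classifier with Spark's MLlib Python API.
    # "data.txt" and its comma-separated "label,feature1,feature2,..." layout
    # are placeholder assumptions for this example.
    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName="mllib-sketch")

    def parse_line(line):
        values = [float(v) for v in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    points = sc.textFile("data.txt").map(parse_line).cache()
    model = LogisticRegressionWithSGD.train(points, iterations=100)
    print(model.predict([0.5, 1.2, -0.3]))   # features must match the training layout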
Natural language processing (NLP) involves machine learning, artificial intelligence, algorithms, and linguistics related to interactions between computers and human languages. One important goal of NLP is to design and build software that understands and analyzes human languages in order to simplify and optimize human-computer communication.
NLP algorithms are usually based on probability theory and machine learning grounded in statistical inference, automatically learning rules through analysis of real-world usage. The field includes word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering, and it requires both syntactic and semantic analysis at various levels.
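As a brief illustration of tokenization plus text classification for sentiment analysis, here is a minimal sketch using scikit-learn (an assumed library choice, not one named here); the tiny training set is invented purely for illustration.

    # Sketch of sentiment classification: tokenize text into word counts,
    # then fit a naive Bayes classifier. The training data is made up.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = ["great product, loved it", "terrible, waste of money",
                   "works exactly as described", "broke after one day"]
    train_labels = ["pos", "neg", "pos", "neg"]

    vectorizer = CountVectorizer()               # word tokenization + counts
    X_train = vectorizer.fit_transform(train_texts)
    classifier = MultinomialNB().fit(X_train, train_labels)

    X_new = vectorizer.transform(["loved the build quality"])
    print(classifier.predict(X_new))             # expected: ['pos']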
NLP applications today include spelling and grammar correction in word processors, machine translation, sentiment analysis, and email spam detection. NLP plus data science now allows us to design and implement better automatic question-answering systems and to detect and predict human opinions about products or services.
Examples of NLP algorithms include n-gram language modeling, naive Bayes and maximum entropy classifiers, sequence models such as hidden Markov models, probabilistic dependency and constituency parsing, and vector-space models of meaning.
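To make the first of these concrete, the following sketch builds a bigram language model with add-one smoothing over a toy corpus; the corpus and the smoothing choice are illustrative assumptions.

    # Sketch of a bigram language model with add-one (Laplace) smoothing.
    # The toy corpus is made up purely for illustration.
    from collections import Counter

    corpus = ["the cat sat on the mat", "the dog sat on the rug"]
    unigram_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        words = sentence.split()
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    vocab_size = len(unigram_counts)

    def bigram_prob(prev_word, word):
        """P(word | prev_word) with add-one smoothing over the vocabulary."""
        return (bigram_counts[(prev_word, word)] + 1) / \
               (unigram_counts[prev_word] + vocab_size)

    print(bigram_prob("the", "cat"))   # seen bigram: relatively high probability
    print(bigram_prob("cat", "dog"))   # unseen bigram: small smoothed probability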
Google has open-sourced word2vec, a tool for computing continuous distributed representations of words. It provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words; these representations can subsequently be used in many natural language processing applications and for further research.
Download the code: svn checkout http://word2vec.googlecode.com/svn/trunk/
Run 'make' to compile the word2vec tool
Run the demo scripts: ./demo-word.sh and ./demo-phrases.sh
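Once the demo scripts have produced a vectors file, one common way to explore the resulting representations from Python is the gensim library; this is an assumption on my part (gensim is not part of the C tool), and "vectors.bin" is assumed to be the demo's output file name.

    # Sketch: exploring word2vec vectors from Python via gensim (assumed setup).
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    print(vectors.most_similar("france", topn=5))           # nearest neighbors
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))   # analogy, roughly "queen"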
Data Science Group Event: University of Colorado Denver - Tuesday May 21, 2013
RECOMMENDATION ENGINES - ABSTRACT
Recommendation Engines (REs) are software tools and techniques that provide item suggestions to a user. The massive growth and variety of available information can overwhelm users, leading to poor decisions: while choice is good, more choice is not always better. In recent years REs have proved to be a valuable means of coping with this information overload.
In their simplest form, personalized recommendations are offered as ranked lists of items. In performing this ranking, REs try to predict which products or services are most suitable for a user, based on that user's preferences and constraints. To complete this computational task, REs collect preferences from users, either explicitly expressed (e.g., as ratings for products) or inferred from user actions. For instance, a RE may treat navigation to a particular product page as an implicit sign of preference for the items shown on that page.
Amazon's RE, for example, relies on a well-known approach (collaborative filtering) that suggests products to you based on your viewing history, your purchase history, and the related products that other customers bought.
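As a rough illustration of the idea, the sketch below implements a tiny item-based collaborative filter over a made-up user-by-item ratings matrix; it is not Amazon's formula, just a minimal example of scoring unseen items by their similarity to items a user already rated highly.

    # Item-based collaborative filtering on a toy ratings matrix.
    # Rows are users, columns are items; the numbers are invented.
    import numpy as np

    ratings = np.array([[5, 0, 0, 1],
                        [4, 5, 1, 0],
                        [1, 0, 5, 4],
                        [0, 1, 4, 5]], dtype=float)

    # Cosine similarity between item columns.
    norms = np.linalg.norm(ratings, axis=0)
    item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

    def recommend(user_index, top_n=2):
        """Score items by similarity to items the user already rated."""
        scores = item_sim @ ratings[user_index]
        scores[ratings[user_index] > 0] = -np.inf   # hide already-rated items
        return np.argsort(scores)[::-1][:top_n]

    print(recommend(0))   # suggested item indices for the first user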
Tom Rampley is a data scientist with a background in finance and psychology. He received his MBA from Indiana University’s Kelley School of Business in 2012, with concentrations in finance and business analytics. Since graduation, he has been working within the Viewer Measurement group at Dish Network LLC on customer segmentation models, the development of recommendation engines, and the implementation of big data IT platforms. He prefers R to SAS, Python to any other scripting language, and while trained as a frequentist currently considers himself Bayes-curious. Outside of work he is married with no kids (yet), a lifelong martial artist, and endlessly nostalgic for the days when he played lead guitar in his grad school rock band. This is his first Data Science meetup presentation.
ACCUMULO - SQRRL NOSQL DATABASE - ABSTRACT
Apache Accumulo is an open-source, highly secure NoSQL database created in 2008 by the National Security Agency. It integrates easily with Hadoop, can securely and cost-effectively handle massive amounts of structured and unstructured data at scale, and enables users to move beyond traditional batch processing to a wide variety of real-time analyses. Accumulo is a sorted, distributed key/value store based on Google's BigTable design, built on top of Hadoop, ZooKeeper, and Thrift. Written in Java, Accumulo provides cell-level access labels and a server-side programming mechanism.
Accumulo offers "Cell-Level Security" by extending the BigTable data model with a new key element called "Column Visibility". This element stores a logical combination of security labels that must be satisfied at query time for the key and value to be returned as part of a user request. This allows data with varying security requirements to be stored in the same table, while users see only those keys and values for which they are authorized.
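As a conceptual illustration only (this is not the Accumulo API), the sketch below shows how a visibility expression such as "admin&audit" might be checked against a user's authorizations at query time; real Accumulo labels support richer boolean expressions than the flat "&"/"|" forms handled here.

    # Conceptual sketch of cell-level visibility filtering (not the Accumulo API).
    # Only flat "a&b" and "a|b" expressions are handled, to keep the idea clear.
    def visible(column_visibility, user_authorizations):
        """Return True if the user's authorizations satisfy the label."""
        if "&" in column_visibility:
            return all(label in user_authorizations
                       for label in column_visibility.split("&"))
        return any(label in user_authorizations
                   for label in column_visibility.split("|"))

    cells = [("patient-123", "admin&audit", "sensitive record"),
             ("patient-123", "public", "public record")]
    user_auths = {"public", "audit"}

    # Only cells whose visibility expression is satisfied are returned.
    print([value for row, vis, value in cells if visible(vis, user_auths)])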
Sqrrl Enterprise, developed by Sqrrl Data, is an operational data store for large amounts of structured and unstructured data. It is the only NoSQL solution that scales elastically to tens of petabytes of data and offers fine-grained security controls. Sqrrl Enterprise enables development of real-time applications on top of Big Data. Sqrrl uses HDFS for storage, Accumulo for security and speed of access, and a Thrift API for interactivity, and it works with MapReduce, visualizations, third-party software, and existing schemas and databases.
This presentation reviews Accumulo and Sqrrl Enterprise.
John Dougherty is CIO of Viriton, a consulting and systems integration organization. He is the organizer of Big Data for Business, helping apply Big Data concepts to C-suite perspectives. He began applying technology-driven strategies in the early nineties and has continued to incorporate blue-blood technologies into forward-thinking solutions.