There has been much discussion and debate about the definition of data science and the rare new breed of sexy bird called the data scientist. The Data Science Association defines "Data Science" as the scientific study of the creation, validation and transformation of data to create meaning, and the "Data Scientist" as a professional who uses scientific methods to liberate and create meaning from raw data.
While these definitions may appear overbroad, think about the definitions of a lawyer or physician. A lawyer is a legal professional who can help prevent or solve legal issues and a physician is a health professional who can help prevent or cure health issues. Like the professionalization of law and medicine in the past hundred years, data science is at the very beginning of becoming a profession - with competency standards and a Data Science Code of Professional Conduct.
This means that data science will evolve into a profession where data scientists specialize in different areas - like lawyers and physicians. When you need to hire a lawyer you usually consider the specialty the lawyer practices. If you have a tax problem you hire a tax lawyer, not a divorce lawyer. If you have a heart problem you see a cardiologist, not a gynecologist.
The simple truth is that data science is a vast and complicated field and - like law and medicine - much too big and complex for one person to master in one lifetime. My colleague Gary Mazzaferro has been exploring the concepts and ideas surrounding data science, framing its definitions as formalizations aligned with knowledge economies and knowledge/science/technology maturity models. Gary has (to date) defined the following data science specializations and types of data scientists:
Data Science: A field of systematic interdisciplinary study to elucidate relationships across and within Formal, Social, Natural, and Special Sciences phenomena through the application of scientific methods. Interdisciplinary areas include analytical processes, mathematics, probability and statistics, logic, modeling, machine learning, algorithms, communications, traditional sciences, business, public policy and philosophy.
Blue Sky Data Science: A purely curiosity-driven, exploratory branch of Data Science oriented toward developing and establishing understanding of relationships across and within phenomena, with no focus on specific goals or immediate applications.
Basic Data Science: A branch of Data Science research focused on clearly defined goals and oriented toward developing and establishing understanding of relationships across and within phenomena.
Applied Data Science: A branch of Data Science oriented toward the development of practical applications, technologies and other interventions, including engineering practices. Applied Data Science bridges the gap between Basic Data Science and the engineering domains to provide predictable, usable tools to industries, including standard methods and practices.
Data Science Practice: The regular performance of Applied Data Science activities and methods for private and public organizations. Practitioners may work externally or internally, and a practice may require additional disciplines based on the needs of the organization, including domain expertise and communication skills to support presentation and reporting activities.
Data Scientist: A person who studies or has expert knowledge of the interdisciplinary field of Data Science.
Blue Sky Data Scientist: A person who studies or conducts research in the branch of Blue Sky Data Science.
Basic Data Scientist: A person who studies, conducts research, or has expert knowledge in the branch of Basic Data Science.
Applied Data Scientist: A person who studies or conducts research in the branch of Applied Data Science.
Note that this is a preliminary list and is not complete. The profession of data science will evolve to create many specializations. After all, it took law and medicine over one hundred years to evolve as professions with different specialties.
One of the most popular methods or frameworks used by data scientists at the Rose Data Science Professional Practice Group is Random Forests. The Random Forests algorithm is among the best classification algorithms available - able to classify large amounts of data with accuracy.
Random Forests is an ensemble learning method (which can also be thought of as a form of nearest-neighbor predictor) for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees. (Random Forests is a trademark of Leo Breiman and Adele Cutler for an ensemble of decision trees.)
Random Forests are a combination of tree predictors where each tree depends on the values of a random vector sampled independently, with the same distribution for all trees in the forest. The basic principle is that a group of "weak learners" can come together to form a "strong learner". Random Forests are a wonderful tool for making predictions because, by the law of large numbers, they do not overfit. Introducing the right kind of randomness makes them accurate classifiers and regressors.
Single decision trees often have high variance or high bias. Random Forests attempt to mitigate both problems by averaging many trees to find a natural balance between the two extremes. Because Random Forests have few parameters to tune and work well with default settings, they are a simple way to produce a reasonable model quickly, even without a carefully specified statistical model.
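To see why averaging helps, consider a toy simulation (a sketch for intuition only, not part of Breiman and Cutler's presentation): each "weak learner" below is an independent noisy estimate of the same quantity, and averaging 100 of them cuts the variance by a factor of 100.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate 1,000 ensembles, each averaging 100 independent, high-variance
    # "weak learners" that all estimate the same true value of 1.0.
    weak = rng.normal(loc=1.0, scale=2.0, size=(1000, 100))

    print("variance of a single weak learner:", weak[:, 0].var())        # ~4.0
    print("variance of the averaged ensemble:", weak.mean(axis=1).var()) # ~0.04

    # Real trees are correlated; Random Forests injects randomness (bootstrap
    # samples, random feature subsets) precisely to decorrelate them.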
Random Forests are easy to learn and use for both professionals and lay people, with little research and programming required, and may be used by people without a strong statistical background. Simply put, you can safely make more accurate predictions while avoiding most of the basic mistakes common to other methods.
The Random Forests algorithm was developed by Leo Breiman and Adele Cutler. Random Forests grows many classification trees; each tree is grown as follows (a code sketch follows the steps):
1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
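These steps map directly onto common open-source implementations. Here is a minimal sketch using scikit-learn's RandomForestClassifier (an implementation choice of mine for illustration, not Breiman and Cutler's original code; the iris dataset is a stand-in): bootstrap sampling corresponds to bootstrap=True, the per-node random subset of m variables to max_features, and the unpruned trees to max_depth=None.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=100,    # number of trees grown
        max_features="sqrt", # m: variables tried at each split (m << M)
        bootstrap=True,      # step 1: sample N cases with replacement
        max_depth=None,      # step 3: grow each tree fully, no pruning
        random_state=0,
    )
    forest.fit(X_train, y_train)

    # Each test case receives the majority vote (mode) of the trees' classes.
    print(forest.score(X_test, y_test))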
Top Benefits of Random Forests
FastRandomForest is an efficient implementation of the Random Forests classifier for the Weka environment.
High Performance Computing (HPC) plus data science allows public and private organizations to get actionable, valuable intelligence from massive volumes of data and to use predictive and prescriptive analytics to make better decisions and create game-changing strategies. The integration of computing resources, software, networking, data storage, information management, and data scientists using machine learning and algorithms is the secret sauce for achieving the fundamental goal of creating durable competitive advantage.
HPC has evolved in the past decade to provide "supercomputing" capabilities at significantly lower cost. Modern HPC uses parallel processing techniques to solve complex computational problems. HPC technology focuses on developing parallel processing algorithms and systems, incorporating both administrative and parallel computational techniques.
HPC enables data scientists to address challenges that were unmanageable in the past. HPC expands modeling and simulation capabilities, including the use of advanced data science techniques like Random Forests, Monte Carlo simulations, Bayesian probability, regression, Naive Bayes, k-nearest neighbors, neural networks, decision trees and others.
Additionally, HPC allows an organization to conduct controlled experiments in a timely manner and to research questions that are too costly or time consuming to study experimentally. With HPC you can build mathematical models and run numerical simulations to gain understanding that direct observation alone cannot provide.
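As a small illustration of this parallel style of computation (a sketch, not tied to any particular HPC system), the following estimates pi with a Monte Carlo simulation split across CPU cores using Python's multiprocessing module:

    import random
    from multiprocessing import Pool

    def count_hits(n):
        """Count random points in the unit square that land inside the quarter circle."""
        hits = 0
        for _ in range(n):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:
                hits += 1
        return hits

    if __name__ == "__main__":
        n_workers, n_per_worker = 4, 1_000_000
        with Pool(n_workers) as pool:
            # Each worker runs its share of the simulation in parallel.
            hits = sum(pool.map(count_hits, [n_per_worker] * n_workers))
        print(4.0 * hits / (n_workers * n_per_worker))  # converges to pi

On a real cluster the same divide-and-combine pattern is spread across many machines rather than the cores of one.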
HPC technology today is implemented in multidisciplinary areas including:
• Finance and trading
• Oil and gas industry
• Electronic design automation
• Media and entertainment
• Geographical data
• Climate research
In the near future both public and private organizations in many domains will use HPC plus data science to boost strategic thinking, improve operations and innovate to create better services and products.
Natural language processing (NLP) involves machine learning, artificial intelligence, algorithms and linguistics related to interactions between computers and human languages. One important goal of NLP is to design and build software that will understand and analyze human languages to simplify and optimize human-computer communication.
NLP algorithms are usually based on probability theory and machine learning grounded in statistical inference, automatically learning rules through analysis of real-world usage. NLP includes word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction and question answering, and it requires both syntactic and semantic analysis at various levels.
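For instance, statistical text classification can be sketched in a few lines with scikit-learn (my choice of library; the tiny training set below is invented purely for illustration): tokenize the documents into word counts, then fit a Naive Bayes classifier.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Invented toy training set: 1 = positive sentiment, 0 = negative.
    docs = ["great product, works well", "terrible, broke after a day",
            "works exactly as described", "awful quality, do not buy"]
    labels = [1, 0, 1, 0]

    # Pipeline: tokenize and count words, then classify with Naive Bayes.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(docs, labels)
    print(model.predict(["this works great"]))  # -> [1] (positive)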
NLP applications today include spelling and grammar correction in word processors, machine translation, sentiment analysis and email spam detection. NLP plus data science now allows us to design and implement better automatic question-answering systems and to detect and predict human opinions about products or services.
Examples of NLP algorithms include n-gram language modeling, Naive Bayes and maximum-entropy (MaxEnt) classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.
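To make the first item concrete, a bare-bones bigram (n=2) language model just counts adjacent word pairs and turns the counts into conditional probabilities (a sketch only; real systems add smoothing for unseen pairs):

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count how often each word follows each other word.
    bigrams = defaultdict(Counter)
    for prev, word in zip(corpus, corpus[1:]):
        bigrams[prev][word] += 1

    def p(word, prev):
        """P(word | prev) from raw bigram counts (no smoothing)."""
        total = sum(bigrams[prev].values())
        return bigrams[prev][word] / total if total else 0.0

    # "the" is followed by cat, mat, dog and rug once each, so P(cat | the) = 0.25.
    print(p("cat", "the"))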
Google has open sourced word2vec, a tool for computing continuous distributed representations of words that provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can subsequently be used in many natural language processing applications and for further research.
1. Download the code: svn checkout http://word2vec.googlecode.com/svn/trunk/
2. Run 'make' to compile the word2vec tool.
3. Run the demo scripts: ./demo-word.sh and ./demo-phrases.sh
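Once trained, the vectors can be loaded and compared from Python. A hedged sketch, assuming the tool was run with -binary 0 so the vectors land in a plain-text file (the path "vectors.txt" and the example words are my assumptions, not fixed by the demo scripts):

    import numpy as np

    def load_vectors(path):
        """Parse word2vec's text output: a header line, then 'word v1 v2 ...' per line."""
        vectors = {}
        with open(path) as f:
            next(f)  # skip the "vocab_size dimensions" header line
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.array(parts[1:], dtype=float)
        return vectors

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    vecs = load_vectors("vectors.txt")
    print(cosine(vecs["paris"], vecs["france"]))  # related words score near 1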
Data Science Group Event: University of Colorado Denver - Tuesday, May 21, 2013
RECOMMENDATION ENGINES - ABSTRACT
Recommendation Engines (REs) are software tools and techniques that provide item suggestions to a user. The massive growth and variety of information available can often overwhelm users, leading to poor decisions. While choice is good, more choice is not always better. REs have proved in recent years to be a valuable means of coping with the information overload problem.
In their simplest form, personalized recommendations are offered as ranked lists of items. In performing this ranking, REs try to predict the most suitable products or services for a user, based on that user's preferences and constraints. To complete this computational task, REs collect preferences from users, which are either explicitly expressed (e.g., as ratings for products) or inferred by interpreting user actions. For instance, an RE may treat navigation to a particular product page as an implicit sign of preference for the items shown on that page.
Amazon's RE, for example, relies on a basic technique (collaborative filtering) that suggests products based on your viewing history, your purchase history and the related products other customers bought.
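A minimal sketch of the item-based flavor of collaborative filtering (toy ratings invented for illustration, not Amazon's actual system): represent each item by the vector of ratings users gave it, and recommend the unrated item whose rating vector is most similar to the items the target user already liked.

    import numpy as np

    # Rows = users, columns = items; 0 means "not rated". Invented toy data.
    ratings = np.array([
        [5, 4, 0, 0],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
        [0, 1, 4, 5],
    ], dtype=float)

    def item_similarity(r):
        """Cosine similarity between item (column) rating vectors."""
        norms = np.linalg.norm(r, axis=0)
        return (r.T @ r) / np.outer(norms, norms)

    sim = item_similarity(ratings)
    user = ratings[0]              # user 0 liked items 0 and 1
    scores = sim @ user            # similarity-weighted score per item
    scores[user > 0] = -np.inf     # don't re-recommend items already rated
    print(int(np.argmax(scores)))  # -> 2: the best unrated item for user 0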
Tom Rampley is a data scientist with a background in finance and psychology. He received his MBA from Indiana University’s Kelley School of Business in 2012, with concentrations in finance and business analytics. Since graduation, he has been working within the Viewer Measurement group at Dish Network LLC on customer segmentation models, the development of recommendation engines, and the implementation of big data IT platforms. He prefers R to SAS, Python to any other scripting language, and while trained as a frequentist currently considers himself Bayes-curious. Outside of work he is married with no kids (yet), a lifelong martial artist, and endlessly nostalgic for the days when he played lead guitar in his grad school rock band. This is his first Data Science meetup presentation.
ACCUMULO - SQRRL NOSQL DATABASE - ABSTRACT
Apache Accumulo is an open-source, highly secure NoSQL database created in 2008 by the National Security Agency. It integrates easily with Hadoop, can securely and cost-effectively handle massive amounts of structured and unstructured data at scale, and enables users to move beyond traditional batch processing to conduct a wide variety of real-time analyses. Accumulo is a sorted, distributed key/value store based on Google's BigTable design, built on top of Hadoop, ZooKeeper and Thrift. Written in Java, Accumulo offers cell-level access labels and a server-side programming mechanism.
Accumulo offers "Cell-Level Security" - extending the BigTable data model by adding a new element to the key called "Column Visibility". This element stores a logical combination of security labels that must be satisfied at query time in order for the key and value to be returned as part of a user request. This allows data of varying security requirements to be stored in the same table, and allows users to see only those keys and values for which they are authorized.
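To make the idea concrete, here is a deliberately simplified sketch of how such a visibility check behaves (illustrative Python only, not Accumulo's API; Accumulo's real grammar also supports parentheses and requires them when mixing operators): a label expression over & (AND) and | (OR) is evaluated against the set of authorizations a user presents at query time.

    def visible(expression, authorizations):
        """Toy check of a flat visibility expression against a user's label set.
        Splitting on '|' first, then '&', gives AND higher precedence - a
        simplification of Accumulo's column-visibility grammar."""
        return any(
            all(label in authorizations for label in term.split("&"))
            for term in expression.split("|")
        )

    # Each cell carries a visibility expression; the user presents a label set.
    user_auths = {"analyst", "us"}
    print(visible("analyst&us", user_auths))        # True: user holds both labels
    print(visible("admin|analyst&us", user_auths))  # True via the second term
    print(visible("admin&audit", user_auths))       # False: the cell stays hidden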
Sqrrl Enterprise, developed by Sqrrl Data, is an operational data store for large amounts of structured and unstructured data. It is the only NoSQL solution that scales elastically to tens of petabytes of data while providing fine-grained security controls. Sqrrl Enterprise enables the development of real-time applications on top of Big Data. Sqrrl uses HDFS for storage, Accumulo for security and speed of access, and the Thrift API for interactivity; it works with MapReduce, visualizations, third-party software, and existing schema-explored databases.
This presentation reviews Accumulo and Sqrrl Enterprise.
John Dougherty is CIO of Viriton, a consulting and systems integration organization. He is the organizer of Big Data for Business, helping apply Big Data concepts to C-suite perspectives. He began utilizing applied strategies, using technology, in the early nineties, and has continued to incorporate blue-blood technologies into forward-thinking solutions.