1. MapReduce--A programming model created (and still used) by Google in the early 2000s to process massive amounts of data.
Its name comes from the map and reduce functions common in programming, although they serve different purposes here than in their traditional definitions. The concept matters because MapReduce technologies distribute data storage and processing across many machines, increasing the speed and reliability of working with large data sets. A popular free implementation is Apache Hadoop. A minimal sketch of the model appears below.
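To make the model concrete, here is a single-machine sketch of MapReduce-style word counting in Python. A real framework such as Hadoop runs the map and reduce phases in parallel across many machines; the function names below are illustrative, not part of any framework's API.

    from collections import defaultdict

    # Map phase: emit a (key, value) pair for every word in a document.
    def map_phase(document):
        for word in document.split():
            yield (word.lower(), 1)

    # Reduce phase: combine all values emitted for the same key.
    def reduce_phase(word, counts):
        return (word, sum(counts))

    documents = ["the quick brown fox", "the lazy dog", "the fox"]

    # Shuffle step: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    results = [reduce_phase(word, counts) for word, counts in grouped.items()]
    print(sorted(results))  # [('brown', 1), ('dog', 1), ('fox', 2), ...]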
2. NoSQL Databases--Document-oriented databases that use a key/value interface rather than SQL (they do not use a relational database management system, or RDBMS) to classify and organize data; created to manage volumes of data that lack a fixed schema.
NoSQL gained popularity as major companies adopted it to cope with volumes of data that traditional RDBMS solutions could not handle. NoSQL databases perform quickly and efficiently because each captured record is stored under a single identifying key, which lets the system absorb a high volume of transactions; a toy example of this key/value pattern follows.
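As an illustration of that key/value access pattern (the class and method names here are hypothetical, not any particular product's API), a toy document store might look like this:

    import json

    class DocumentStore:
        def __init__(self):
            self._data = {}  # key -> serialized document

        def put(self, key, document):
            # No fixed schema: each document can carry different fields.
            self._data[key] = json.dumps(document)

        def get(self, key):
            return json.loads(self._data[key])

    store = DocumentStore()
    store.put("user:1001", {"name": "Ada", "tags": ["admin"]})
    store.put("user:1002", {"name": "Grace", "last_login": "2012-06-01"})
    print(store.get("user:1001")["name"])  # Ada

Every read and write goes through a single key, which is what lets such stores handle many transactions without the overhead of relational joins.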
3. Storage--The technologies that hold the distributed data.
Data can be stored in data centers, which may draw on any number of cloud technologies.
4. Servers--Technologies available for renting computing power on remote machines; a program that "serves" the requests of other programs.
In big data, servers offer support for data storage and management.
5. Processing--The action of extracting valuable information from large datasets.
Processing allows the user to sort through massive amounts of data and produce information for analysis. During processing, data can be sorted and grouped algorithmically, but it is important to understand the limits of what such rules can do without human evaluation; the small grouping example below illustrates the mechanics.
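For example, the following sketch sorts and groups invented sales records by region; the grouping key is chosen by the programmer, and the algorithm attaches no meaning to the groups it produces:

    from itertools import groupby

    records = [
        {"region": "east", "sales": 120},
        {"region": "west", "sales": 90},
        {"region": "east", "sales": 75},
    ]

    # groupby only merges adjacent records, so sort on the key first.
    records.sort(key=lambda r: r["region"])
    for region, group in groupby(records, key=lambda r: r["region"]):
        print(region, sum(r["sales"] for r in group))  # east 195 / west 90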
6. Natural Language Processing--Extracting information from human-created text.
This type of processing sorts through data that humans create directly as text, rather than data generated by their "actions." For instance, if you are analyzing Twitter data from the previous six months, you might be looking for keywords and sentiment, which would require Natural Language Processing (a crude keyword-based sketch appears below).
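As a deliberately crude illustration, the sketch below scores sentiment by counting keywords. Real Natural Language Processing relies on trained models (for example, via libraries such as NLTK or spaCy); the word lists here are invented:

    POSITIVE = {"great", "love", "excellent"}
    NEGATIVE = {"awful", "hate", "broken"}

    def sentiment_score(text):
        # Lowercase, split, and strip trailing punctuation before matching.
        words = [w.strip(".,!?") for w in text.lower().split()]
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    tweets = ["I love this phone, great battery",
              "Screen is broken, awful support"]
    for tweet in tweets:
        print(sentiment_score(tweet), tweet)  # 2 ... / -2 ...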
7. Visualization--Presenting meaningful data in graphical form.
As data is collected, stored, and analyzed, it needs to be presented in a way that can be understood and digested. Programs that analyze big data can sometimes interpret the data and render it as a visual display for easier consumption and/or to show results, as in the bar-chart sketch below.
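As a minimal sketch, the snippet below renders invented category totals as a bar chart using matplotlib, one common Python plotting library; any charting tool would serve the same purpose:

    import matplotlib.pyplot as plt

    regions = ["east", "west", "north"]
    totals = [195, 90, 130]  # sample data, not real measurements

    plt.bar(regions, totals)
    plt.title("Sales by region (sample data)")
    plt.xlabel("Region")
    plt.ylabel("Total sales")
    plt.show()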
8. Acquisition--Techniques for cleaning up messy public data sources.
As data is collected, it is not always in its purest or most usable form. Various tools help turn such raw data into something that can be processed; a small cleanup example follows.
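Here is a small sketch of this kind of cleanup, applied to invented rows from a hypothetical public dataset: whitespace is trimmed, casing is normalized, and formatted or missing numbers are coerced.

    raw_rows = [
        {"city": " Boston ", "population": "685,094"},
        {"city": "Austin", "population": ""},  # missing value
        {"city": "DENVER", "population": "715,522"},
    ]

    def clean(row):
        population = row["population"].replace(",", "")
        return {
            "city": row["city"].strip().title(),
            "population": int(population) if population else None,
        }

    print([clean(r) for r in raw_rows])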
9. Serialization--Converting a data structure or object state into a format that can be stored.
Serialization occurs after the data is collected and while it is being processed. As the data is sorted and pushed between systems, it may need to be stored; at those steps it must be serialized, and the format chosen depends on the languages and APIs involved. A minimal JSON example is shown below.
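A minimal example using JSON, one widely used serialization format; the record itself is invented:

    import json

    record = {"id": 42, "events": ["click", "purchase"], "total": 19.99}

    serialized = json.dumps(record)    # object -> storable/transmittable text
    restored = json.loads(serialized)  # text -> object
    assert restored == record
    print(serialized)

Other formats (and other language/API combinations) trade off readability, size, and speed differently, which is why the choice of serialization depends on the systems involved.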
10. CPU – Central processing unit; the hardware within a computer that performs the basic operations of the system (comparable to the "brain").
In big data, CPUs are referenced in terms of the computing power needed to crunch the data.
11. Hadoop – Open-source implementation of MapReduce (an Apache Software Foundation project).
Hadoop was developed to enable applications to work with thousands of computationally independent machines and petabytes of data.
12. Petabytes, exabytes, zettabytes – Units of information used to measure data amounts.
Petabyte = one quadrillion (short scale) bytes, or 10^15 bytes (1,024 terabytes under the binary convention)
Exabyte = one quintillion (short scale) bytes, or 10^18 bytes
Zettabyte = one sextillion (short scale; one long-scale trilliard) bytes, or 10^21 bytes
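For reference, these decimal (short-scale) sizes expressed as powers of ten, with a quick conversion between them:

    PETABYTE = 10**15   # one quadrillion bytes
    EXABYTE = 10**18    # one quintillion bytes
    ZETTABYTE = 10**21  # one sextillion bytes

    print(ZETTABYTE // PETABYTE)  # 1000000 petabytes in a zettabyte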