Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache, where it has found plenty of support in many communities. It's not just for Facebook any longer. Many of the committing programmers come from other companies, and the project chair works at DataStax, a company devoted to providing commercial support for Cassandra.

The heritage of the Cassandra project is obvious because it's a good tool for tracking lots of data, such as status updates at Facebook. The tool helps create a network of computers that all carry the same data. Each machine is meant to be equal to the others, and all of them should end up being consistent once the data propagates around the P2P network of nodes, though it's not guaranteed.
The key phrase is "eventual consistency," not "perfect consistency." If you've watched your status updates disappear and reappear on Facebook, you'll understand what this means.

The tool runs in Java as a separate process waiting for interaction. There's already a collection of higher-level libraries for Java, Python, Ruby, and PHP, as well as some of the other languages.

Using Cassandra seems relatively simple, but I still found myself getting hung up on several barriers, such as defining a keyspace (which acts as a namespace, but for the columns). Getting up to speed takes more than a few minutes because there is more to learn than just the basic routines for storing collections of values. Cassandra is happy with a sparse matrix where each row stores only a few standard columns, and it builds the indices with this in mind.
Much of the complexity in the API is devoted to controlling just how quickly the cluster of nodes moves toward consistency. You can specify the speed of synchronization for columns and collections of values called supercolumns.

Getting everything running is now fairly well documented, but getting it running quickly requires a fair amount of both hardware and operating system tuning. The biggest bottleneck is the commit log. Optimizing the way that this is written to disk is the most important part of improving writes. Speeding up the extraction of data involves paying attention to the pattern of reads. Did your old, fancy database do this for you fairly automatically? Ah, don't complain. It's fun to think about the hardware and how it affects your software.
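Much of that consistency tuning is exposed directly in the client APIs. As a rough sketch using the modern DataStax Python driver (the keyspace, table, and column names here are invented), each statement can carry its own consistency level, which decides how many replicas must answer before the operation counts as done:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Connect to a local node; "updates" is an assumed, pre-created keyspace.
    session = Cluster(["127.0.0.1"]).connect("updates")

    # QUORUM: a majority of replicas must acknowledge the write.
    write = SimpleStatement(
        "INSERT INTO status (user, posted, body) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write, ("alice", "2011-04-01", "What a day!"))

    # ONE: read from a single replica -- fast, but possibly stale.
    read = SimpleStatement(
        "SELECT body FROM status WHERE user = %s",
        consistency_level=ConsistencyLevel.ONE)
    for row in session.execute(read, ("alice",)):
        print(row.body)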
CouchDB stores documents, each of which is made up of a set of pairs that link a key with a value. The most radical change is in the query. Instead of some basic query structure that's pretty similar to SQL, CouchDB searches for documents with two functions that map and reduce the data. One formats the document, and the other makes a decision about what to include.

I'm guessing that a solid Oracle jockey with a good knowledge of stored procedures does pretty much the same thing. Nevertheless, the map and reduce structure should be eye-opening for the basic programmer. Suddenly a client-side AJAX developer can write a fairly complicated search procedure that can encode some sophisticated logic.
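To see what those two functions look like in practice, here is a sketch that installs and queries a view over CouchDB's plain HTTP interface (it assumes an unsecured CouchDB on localhost:5984 and an invented "articles" database). The map function emits one pair per document; the built-in _sum reduce tallies them into per-author counts:

    import requests

    db = "http://localhost:5984/articles"

    # The view: map emits (author, 1) for every document with an author
    # field; the built-in "_sum" reduce adds the 1s up per author.
    design = {
        "views": {
            "by_author": {
                "map": "function(doc) { if (doc.author) emit(doc.author, 1); }",
                "reduce": "_sum",
            }
        }
    }

    requests.put(db)                                  # create the database
    requests.put(db + "/_design/stats", json=design)  # install the view

    # Query the view: one row per author, with the document count.
    print(requests.get(db + "/_design/stats/_view/by_author",
                       params={"group": "true"}).json())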
These libraries are extensive, and some of the major languages have extra layers that wrap and unwrap objects when storing and retrieving them.

There's also a fair number of extra tools for working with the database. PHPMoAdmin, a cousin of the MySQL tool PHPMyAdmin, is just one of almost a dozen tools for admins. The proliferation of these tools is gradually erasing one of the standard reasons for sticking with a classic database. The more of them I found, the more comfortable everything felt.
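At the code level, the libraries keep that wrapping thin. A minimal PyMongo sketch (the database and collection names are made up): a plain dictionary goes in, and a plain dictionary comes back out.

    from pymongo import MongoClient

    # Connect to a local MongoDB; "blog" and "posts" are invented names.
    posts = MongoClient().blog.posts

    # A plain dictionary goes in...
    posts.insert_one({"author": "ann", "title": "Hello", "tags": ["intro"]})

    # ...and a plain dictionary comes back out, no ORM ceremony required.
    print(posts.find_one({"author": "ann"}))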
Like CouchDB and MongoDB, Redis stores documents or rows made up of key-value pairs. Unlike much of the rest of the NoSQL world, it stores more than just strings or numbers in the value. It will also accept sorted and unsorted sets of strings as a value linked to a key, a feature that lets it offer some sophisticated set operations to the user. There's no need for the client to download data to compute the intersection when Redis can do it at the server.

This approach leads to some simple structures without much coding. Luke Melia tracked the visitors on his website by building a new set every minute. The union of the last five sets defined those who were "online" at that moment. The intersection of this union with a friends list produced the list of online friends. These sorts of set operations have many applications, and the Redis crowd is discovering just how powerful they can be.

Redis is also known for keeping the data in memory and only writing out the list of changes every so often. Some don't even call it a database, preferring instead to focus on the positive by labeling it a powerful in-memory cache that also writes to disk.
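Melia's trick takes only a handful of commands. Here's a sketch with the redis-py client (the key names are invented, and a real deployment would rotate one set per minute):

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost:6379

    # One set of visitor names per minute of site activity.
    r.sadd("online:12:00", "alice", "bob")
    r.sadd("online:12:01", "bob", "carol")

    # Union the recent minutes into one "currently online" set,
    # computed entirely on the server; nothing ships to the client.
    r.sunionstore("online:now", ["online:12:00", "online:12:01"])

    # Intersect with a friends list to get dave's online friends.
    print(r.sinter("online:now", "friends:dave"))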
Traditional databases are slower because they wait until the disk gets the information before signaling that everything is OK. Redis waits only until the data is in memory, something that's obviously faster but potentially dangerous if the power fades at the wrong moment.

The project leaders are still exploring how to expand the project, an intriguing decision because there's more than one official version of Redis from the main team. There's even one official build of Redis that comes with a Lua interpreter and a disclaimer saying that "there is no guarantee that scripting works correctly or that it will be merged into future versions of Redis!" Projects like these are never boring.

Redis providers are starting to appear. OpenRedis promises it's "launching soon." Meanwhile, Redis Straight Up charges just $19 per month, plus all of the costs from Amazon's cloud. The service handles the configuration and passes the costs on to you.
Riak is one of the more sophisticated data stores. It offers most of the features found in the others, then adds more control over duplication. Although the basic structure stores pairs of keys and values, the options for retrieving them and guaranteeing their consistency are quite rich.

The write operations, for instance, can include a parameter that asks Riak to confirm when the data has been propagated successfully to any number of the machines in the cluster. If you don't want to trust just one machine, you can ask it to wait until 2, 3, or more machines have written the data before sending the acknowledgment. This is why the team likes to toss around its slogan: "Eventual consistency is no excuse for losing data."

The data itself is not just written to disk. Well, that is one of the options, but it's not the main one. Riak uses a pluggable storage engine (Bitcask by default) that writes the data to disk in its own internal format.
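Here's what that knob looks like through Riak's plain HTTP interface (the host, bucket, and key are invented); the w query parameter tells Riak how many replicas must confirm the write before it acknowledges:

    import requests

    # Store a value and wait until 3 replicas have written it (w=3).
    resp = requests.put(
        "http://localhost:8098/buckets/users/keys/alice",
        params={"w": 3},
        data='{"name": "Alice"}',
        headers={"Content-Type": "application/json"})
    print(resp.status_code)  # 204 means the write met its quorum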
If there's one application that's most different in this collection, it's Neo4J, a tool optimized to store graphs instead of data. The Neo4J folks use the word "graph" like a computer scientist to mean a network of nodes and connections. Neo4J lets you fill up the data store with nodes and then add links between the nodes that mean things. Social networking applications are its strength.

The code base comes with a number of common graph algorithms already implemented. If you want to find the shortest path between two people -- which you might for a site like LinkedIn -- then the algorithms are waiting for you.

Neo4J is pretty new, and the developers are still uncovering better algorithms. In one recent version, they bragged about a new caching strategy: searching algorithms will run much faster because Neo4J is now caching the node information.
They've also added a new query language with pattern matching that looks a bit like XSL. You can search a graph until you identify nodes with the right type of data. It is a new syntax to learn.

The Neo4J project is backed by Neo Technology, which offers commercial versions of the database with more sophisticated monitoring, fail-over, and backup features.
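For a taste of the pattern matching, here is a sketch in today's Cypher syntax through the official Python driver (the node labels, property names, and credentials are all assumptions):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Find the shortest chain of KNOWS relationships between two people,
    # the kind of query a LinkedIn-style site would run constantly.
    query = """
    MATCH (a:Person {name: $a}), (b:Person {name: $b}),
          p = shortestPath((a)-[:KNOWS*]-(b))
    RETURN [n IN nodes(p) | n.name] AS chain
    """

    with driver.session() as session:
        for record in session.run(query, a="Alice", b="Bob"):
            print(record["chain"])  # e.g. ['Alice', 'Carol', 'Bob']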
If someone out there is writing code, someone else out there is complaining that the code is too complicated. It should be no surprise that some people think Neo4J is too intricate and sophisticated for what needs to be done. We know that Neo4J has truly arrived because the FlockDB fans are clucking about how FlockDB is simpler and faster.

FlockDB is a core part of the Twitter infrastructure. It was released by Twitter more than a year ago as an open source project under the Apache license. If you want to build your own Twitter, you can also download Gizzard, a tool for sharding data across multiple instances of Flock. Both tools are ready and waiting to run in a JVM.

Although many of us would call FlockDB a graph database because it stores relationships between nodes, some think that the term should apply only to sophisticated tools like Neo4J. Did someone start following someone else? Well, you can link up Flock's nodes with data such as the time that the relationship began. That part is like Neo4J. Where Flock differs is how deeply you can query this data.
FlockDB takes a pair of nodes and gives you the data about the connection. Everything else is up to you. Neo4J not only enables all types of graph-walking algorithms, but it provides them as services. FlockDB uses the word "non-goal" for these multihop queries, meaning that the developers have no interest in supporting them.

The code is pretty new, and it doesn't seem to be attracting the same kind of widespread attention as some of the other projects. All of the recent commits have come from Twitter employees, and I wasn't able to find anyone offering FlockDB hosting as a service. FlockDB still seems to be mainly a Twitter project.
How do you choose?
There's no easy answer. Most shops would be happy with any of them, even if they select the worst one for their needs. Choosing the best, though, is a bit harder because a good developer will want to balance the strength of the project, the availability of commercial support, and the quality of the documentation with the quality of the code.

The greatest divergence is in the extras. All of them will store piles of keys with their values, but the real question is how well they split the load across servers and how well they propagate changes across them.
Then there's the question of hosting. The idea of a cloud service that will do all of the maintenance for you is seductive.

The stakes are higher because switching is more difficult than it is with the SQL databases. There's no standard query language in this world, nor is there a vast array of abstraction layers like JDBC.
These NoSQL databases have the power to lock you in. That's the price for all of the fun and features.
NoSQL databases provide mechanisms for storing and retrieving data with looser consistency models than traditional SQL relational databases. They are useful for managing unstructured and semi-structured data.
Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. NoSQL databases are often highly optimized key–value stores intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and throughput.
NoSQL databases are finding significant and growing industry use in big data and real-time web applications. NoSQL systems are also referred to as "Not only SQL" to emphasize that they do in fact allow SQL-like query languages to be used.
Most common NoSQL classifications:

- Key-value stores (for example, Redis and Riak)
- Document stores (for example, CouchDB and MongoDB)
- Wide-column stores (for example, Cassandra and Accumulo)
- Graph databases (for example, Neo4J and FlockDB)
See Row vs Columnar vs NoSQL Databases.
For a more complete list of the nearly 150 different NoSQL databases, with more standard classification, see http://nosql-database.org.
Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget?
Robin Bloor, Ph.D., Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc. present the findings of their three-month research project focused on the evolution of database technology. They offer practical advice for the best way to approach the evaluation, procurement and use of today's database management systems. Bloor and Madsen clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.
A major concern for organizations building big data analytical ecosystems is data security. One flaw of Hadoop/MapReduce and many NoSQL databases is weak security.
Apache Accumulo is an open source, highly secure NoSQL database created in 2008 by the National Security Agency. It integrates easily with Hadoop, can securely and cost-effectively handle massive amounts of structured and unstructured data at scale, and enables users to move beyond traditional batch processing and conduct a wide variety of real-time analyses. Accumulo is a sorted, distributed key/value store based on Google's BigTable design. It is a system built on top of Hadoop, ZooKeeper, and Thrift. Written in Java, Accumulo offers cell-level access labels and server-side programming mechanisms.
Accumulo offers "cell-level security," extending the BigTable data model by adding a new element to the key called "Column Visibility." This element stores a logical combination of security labels that must be satisfied at query time in order for the key and value to be returned as part of a user request. This allows data of varying security requirements to be stored in the same table, and allows users to see only those keys and values for which they are authorized.
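To make that concrete, here is a toy Python evaluator for visibility expressions. It is purely illustrative -- Accumulo's real parser lives in Java, and this simplified grammar is an assumption -- but it shows how a label expression stored with a cell is checked against a user's authorizations at query time:

    import re

    def visible(expr, auths):
        """Toy evaluation of an expression like "(secret&finance)|admin"
        against a set of authorization strings."""
        tokens = re.findall(r"\w+|[&|()]", expr)
        pos = 0

        def expression():
            nonlocal pos
            vals, ops = [term()], []
            while pos < len(tokens) and tokens[pos] in "&|":
                ops.append(tokens[pos])
                pos += 1
                vals.append(term())
            if len(set(ops)) > 1:   # like Accumulo, demand parentheses
                raise ValueError("mixed & and | need parentheses")
            if not ops:
                return vals[0]
            return all(vals) if ops[0] == "&" else any(vals)

        def term():
            nonlocal pos
            if tokens[pos] == "(":
                pos += 1
                val = expression()
                pos += 1            # consume the closing ")"
                return val
            val = tokens[pos] in auths
            pos += 1
            return val

        return expression()

    # The cell comes back only if the user holds (secret AND finance) OR admin.
    print(visible("(secret&finance)|admin", {"secret", "finance"}))  # True
    print(visible("(secret&finance)|admin", {"secret"}))             # False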
Sqrrl Enterprise, developed by Sqrrl Data, is an operational data store for large amounts of structured and unstructured data. The company bills it as the only NoSQL solution that scales elastically to tens of petabytes of data while offering fine-grained security controls. Sqrrl Enterprise enables development of real-time applications on top of Big Data. Sqrrl uses HDFS for storage, Accumulo for security and speed of access, and a Thrift API for interactivity, and it works with map/reduce, visualizations, third-party software, and existing databases.
NoSQL & Non-Relational Databases
Relational databases have been the de facto technology for storing and querying data for 40 years. What is driving the recent innovation in databases? This talk will touch on the history of databases, why RDBMS have been so successful, and why we are seeing the rise of NoSQL databases. Next we will examine the different categories of NoSQL databases and technology. The presentation will finish with a specific introduction to MongoDB, its design principles, and what it looks like to code against.
Will LaForest heads up the Federal practice for 10gen, the MongoDB company. Will is focused on evangelizing the benefits of MongoDB, NoSQL, and open source software (OSS) in solving Big Data challenges in the Federal government. He believes that software in the Big Data space must scale not only from a technical perspective but also from a cost perspective. He has spent 7 years in the NoSQL space focused on the Federal government, most recently as Principal Technologist at MarkLogic. His technical career spans diverse areas from data warehousing, to machine learning, to building statistical visualization software for SPSS, but it began with code slinging at DARPA. He holds degrees in Mathematics and Physics from the University of Virginia.
Monte Carlo Simulation Methods in Energy Risk Management
Monte Carlo methods are stochastic techniques, a form of probabilistic modeling, meaning they are based on the use of random numbers and probability statistics to investigate problems.
They are used to model phenomena with significant uncertainty in inputs, such as the calculation of risk in business. Monte Carlo simulations have been applied in space exploration and oil exploration, where their predictions of failures, cost overruns, and schedule overruns are routinely better than human intuition or alternative "soft" methods.
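A minimal sketch of the idea in Python (the task costs and the budget are invented numbers): draw the uncertain inputs at random many times, run the model on each draw, and read the risk straight off the distribution of outcomes.

    import random

    # Three project tasks with uncertain costs, each modeled as a
    # triangular distribution (low, most likely, high), in $1,000s.
    tasks = [(40, 50, 90), (20, 30, 60), (10, 15, 40)]
    budget = 120
    trials = 100_000

    overruns = 0
    for _ in range(trials):
        total = sum(random.triangular(lo, hi, mode) for lo, mode, hi in tasks)
        if total > budget:
            overruns += 1

    # The fraction of simulated futures that blow the budget is the
    # Monte Carlo estimate of the overrun probability.
    print(f"P(cost > {budget}) ~ {overruns / trials:.2%}")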
For energy companies, understanding the impact of commodity price movements on the value of a portfolio is critical for hedging, risk management and planning purposes. For example, consider a gas-fired power plant which buys natural gas from a spot market, converts it into electricity, and sells that electricity into a deregulated power spot market.
The generator is exposed to fluctuations in the price it must pay to purchase natural gas and the price it will receive for the sale of power. In order to reduce risks, a power plant operator may choose to buy in advance the natural gas that it anticipates it will need, and to sell in advance the power it anticipates it will generate -- that is, to contract in advance for the forward purchase of gas and the sale of power at a future delivery period, for a fixed price today. This practice, known as hedging, attempts to remove the uncertainty in future cash flows from the power plant.
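A worked example with invented numbers shows the mechanics. Suppose the plant burns 8 MMBtu of gas to produce 1 MWh of electricity. If the operator buys gas forward at $4 per MMBtu and sells power forward at $45 per MWh for the same delivery month, the margin is locked in at $45 - 8 x $4 = $13 per MWh, regardless of where spot prices end up.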
The decision on how much and how often to hedge will, in general, require sophisticated analytical methods. One popular method, Monte Carlo simulation, attempts to simulate future states of the world to understand the impact on cash flows.
In this talk, we discuss Monte Carlo methods for energy risk applications. We review one popular approach, which uses a set of linked simulation models to capture the fundamental physical drivers of electricity price formation, and calibrates them to match current prices being quoted in the financial markets. Monte Carlo simulations of weather, load and prices can then be used to value a portfolio of generation assets and trades, and to support hedging and risk management decisions.
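The fundamental models described above are well beyond a snippet, but a stripped-down sketch shows the shape of the computation. All of the parameters here are invented, and a real model would simulate weather and load as the drivers rather than drawing prices directly:

    import random

    # Toy Monte Carlo for one month of a gas-fired plant's cash flow.
    heat_rate = 8.0                   # MMBtu of gas per MWh generated
    fwd_power, fwd_gas = 45.0, 4.0    # forward prices: $/MWh, $/MMBtu
    vol_power, vol_gas = 0.30, 0.20   # monthly price volatilities
    capacity_mwh = 50_000
    trials = 100_000

    cash_flows = []
    for _ in range(trials):
        # Draw spot prices lognormally around the forwards (uncorrelated
        # here for simplicity; real gas and power prices move together).
        power = fwd_power * random.lognormvariate(0, vol_power)
        gas = fwd_gas * random.lognormvariate(0, vol_gas)
        # Run the plant only when the spark spread is positive.
        margin = max(power - heat_rate * gas, 0.0)
        cash_flows.append(margin * capacity_mwh)

    mean = sum(cash_flows) / trials
    worst = sorted(cash_flows)[int(0.05 * trials)]   # 5th percentile
    print(f"expected cash flow: ${mean:,.0f}")
    print(f"5% worst case:      ${worst:,.0f}")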
Scotty Nelson is a Senior Energy Analyst at Ascend Analytics, where he deploys analytic software solutions to help companies understand and manage risk in the energy markets.