The last few years have brought a wealth of new data technologies organized around horizontal scalability. LinkedIn has built out an ecosystem of infrastructure to support products that use data in innovative ways and, in doing so, place significant demands on that infrastructure.
LinkedIn uses a mixture of Apache projects such as Hadoop, ZooKeeper, Pig, and Avro, as well as open source projects of our own creation such as Voldemort, Kafka, and Azkaban.
Hadoop is the key ingredient for offline computation, but building an agile system for offline computing requires much more than a Hadoop cluster alone: workflows have to be scheduled and managed (the role Azkaban plays), and data has to flow in and out of the cluster.
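To make the offline layer concrete, here is a minimal sketch of a Hadoop MapReduce job in Java using the org.apache.hadoop.mapreduce API. The log format, field layout, and the PageViewCount name are illustrative assumptions, not LinkedIn's actual schema.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts page views per member ID from tab-separated event logs (hypothetical format). */
public class PageViewCount {

  public static class ViewMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text memberId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed log format: <memberId>\t<pageUrl>\t<timestamp>
      String[] fields = value.toString().split("\t");
      if (fields.length >= 1 && !fields[0].isEmpty()) {
        memberId.set(fields[0]);
        context.write(memberId, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "page-view-count");
    job.setJarByClass(PageViewCount.class);
    job.setMapperClass(ViewMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is exactly the kind of unit a workflow manager such as Azkaban schedules, ordering it against the jobs that produce its inputs and consume its outputs.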
Stream processing is an underutilized model that enables real-time data processing. Kafka is LinkedIn's open source distributed messaging system; it supports map/reduce-like processing without the high-latency turnaround of Hadoop jobs.
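As a sketch of the streaming model, the example below uses the Kafka Java client as it exists in the open source project today (a later API than the original release). The broker address, topic name, consumer group, and event fields are illustrative assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Publishes a page-view event to a Kafka topic, then consumes the topic as a stream. */
public class PageViewStream {

  public static void main(String[] args) {
    // Producer: append an event to the (hypothetical) "page-views" topic.
    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "localhost:9092");
    producerProps.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    producerProps.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      producer.send(new ProducerRecord<>("page-views", "member-42", "/profile"));
    }

    // Consumer: process events as they arrive rather than waiting for a batch run.
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "localhost:9092");
    consumerProps.put("group.id", "page-view-counter");
    consumerProps.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    consumerProps.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
      consumer.subscribe(Collections.singletonList("page-views"));
      // Loops until the process is stopped; each poll returns newly arrived events.
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          System.out.printf("member=%s page=%s%n", record.key(), record.value());
        }
      }
    }
  }
}
```

The contrast with the batch job above is latency: events are processed moments after they are produced, instead of whenever the next Hadoop run completes.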
Finally, live serving and data deployment are the last mile of analytical data processing: getting terabytes of derived data delivered and available for low-latency serving is what actually puts results in front of users.
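To illustrate the serving side, here is a minimal sketch of a lookup against Voldemort's Java client; the bootstrap URL, store name, and key are hypothetical. The pattern is that the store's contents are computed offline (for example, by a Hadoop job) and bulk-loaded, while reads are served live.

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

/** Low-latency lookup against a Voldemort store populated by an offline job. */
public class ServingLookup {

  public static void main(String[] args) {
    // Bootstrap URL and store name are illustrative placeholders.
    StoreClientFactory factory = new SocketStoreClientFactory(
        new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
    StoreClient<String, String> client = factory.getStoreClient("member-recommendations");

    // The read path never touches Hadoop; it hits the live store directly.
    Versioned<String> value = client.get("member-42");
    if (value != null) {
      System.out.println("recommendations: " + value.getValue());
    }
    factory.close();
  }
}
```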