Hadoop is an open source, Java-based software platform (with some native components written in C) that supports the processing of very large datasets, up to petabytes in size, distributed across many servers (often thousands of them) connected as a cluster or grid. Officially called Apache Hadoop because it is administered by the Apache Software Foundation, it achieved release 1.0 status on December 27, 2011, after six years of real-world development.
Hadoop is based on MapReduce, a data processing technology developed by Google (which uses it to index the web). It supports both structured and unstructured data, and one common initial application was the processing of website logs.
Financial markets applications span everything from structured trade and quote data, and reference data for securities processing, to unstructured news stories and Twitter feeds, as well as email and instant messenger conversations between investment advisors/brokers and their clients.
While Hadoop is often mentioned as a rival to traditional relational databases, it is increasingly used in a complementary way, to parallelise the ETL (Extract, Transform, Load) phase of data management.
Under the hood, Hadoop consists of two main components: (1) the Hadoop Distributed File System (HDFS), a scalable, distributed storage mechanism that runs on top of each host operating system's own file system; and (2) MapReduce, which processes highly distributed data by splitting each job into many tasks and running them in parallel across many servers.
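To make the MapReduce idea concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce steps, applied to the website-log use case mentioned earlier. It does not use Hadoop or its APIs: the sample log lines, the function names and the use of threads to stand in for servers are all assumptions made purely for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample input: each inner list stands in for one split of a
# web-server log that would live on a different server in a real cluster.
SPLITS = [
    ["GET /index.html 200", "GET /missing 404"],
    ["GET /index.html 200", "POST /login 500"],
    ["GET /about.html 200", "GET /missing 404"],
]


def map_split(lines):
    """Map task: emit a (status_code, 1) pair for every log line in a split."""
    return [(line.split()[-1], 1) for line in lines]


def shuffle(mapped_outputs):
    """Group the values emitted by all map tasks by their key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce task: combine all values for one key into a single total."""
    return key, sum(values)


def run_job(splits):
    # Map tasks are independent of each other, so they can run in parallel.
    # Threads are only a stand-in here for the servers of a cluster.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_split, splits))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job(SPLITS))  # {'200': 3, '404': 2, '500': 1}
```

The point of the pattern is that each map task only ever sees its own split, so the work can be spread over as many machines as there are splits; only the grouped intermediate results need to be brought together for the reduce step.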
Yahoo has been a major backer of, and contributor to, the Hadoop project, and Facebook is a big user of it for user data analysis. Other high-profile users include eBay, LinkedIn, the New York Times and Twitter.
As well as being available from the source, at http://hadoop.apache.org/, Hadoop comes in a number of commercial distributions from the likes of Cloudera, Hortonworks (spun out of Yahoo) and MapR Technologies. As is common in the open source world, some versions are available for free, while others, with more features, scalability and support, come at a price.
A number of vendors also bundle Hadoop into packaged hardware/software offerings, sometimes referred to as appliances. Storage giant EMC last year introduced its EMC Greenplum HD Data Computing Appliance to add unstructured data support (leveraging MapR's distro), while Oracle began shipments of its Big Data Appliance earlier this month, the result of an alliance with Cloudera. Dell, NetApp and SGI also package their iron with Cloudera's offering.
It's also possible to run Hadoop in the cloud - IBM offers its Hadoop-based InfoSphere BigInsights on a cloud basis, and Amazon Web Services has its Elastic MapReduce offering. Microsoft is also in preview mode with Hadoop on its Azure cloud.
Other companies are focusing on boosting the performance of Hadoop to make it more suitable for real-time applications. Hadapt's Adaptive Analytics Platform is in its early access phase, while SunGard has been prototyping predictive trading systems as part of its Raptor project.
