What is Hadoop?

April 29, 2013 / Technology

Hadoop is an open-source platform developed especially for processing and analyzing large volumes of data, whether structured or unstructured. The project is maintained by the Apache Software Foundation and has the support of several companies, such as Yahoo!, Facebook, Google and IBM.

It can be said that the Hadoop project began in 2002. Several months into that work, Google released the Google File System paper, in October 2003, and the MapReduce paper, in December 2004.

The Google File System (GFS) is a file system specially designed to handle distributed processing and, as you would expect from a company like Google, large volumes of data (in magnitudes of terabytes or even petabytes).

MapReduce is a programming model that distributes the processing to be performed among multiple computers, helping Google's search engine to become faster without depending on expensive, powerful servers.
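To make the model more concrete, the sketch below shows the classic word-count job written against Hadoop's Java MapReduce API: the map phase emits a (word, 1) pair for every word it finds, and the reduce phase sums those pairs per word. It is only an illustrative sketch, not code from any of the projects mentioned here; the input and output paths are passed as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: for each line of input, emit (word, 1) for every word.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce phase: sum all the 1s emitted for the same word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A job like this is typically packaged as a JAR and submitted to the cluster with the hadoop jar command; Hadoop then takes care of splitting the input, running map tasks close to the data and collecting the reduced results.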

In a nutshell, a file system is a set of instructions that determines how data should be stored, accessed, copied, altered, named, removed and so on.

In 2004, an open-source implementation of GFS was incorporated into Nutch, a search engine project for the Web. Nutch was facing problems of scale, since it could not handle a large volume of pages, and this variation of GFS, named Nutch Distributed Filesystem (NDFS), emerged as a solution. The following year, Nutch also had an implementation of MapReduce.

In fact, Nutch was part of a larger project called Lucene, a high-performance, full-featured text search engine library written entirely in Java for indexing pages. Those responsible for this work soon saw that what they had on hand could also be used in applications other than Web search. This realization led to the creation of another project that encompasses characteristics of Nutch and Lucene: Hadoop, whose file system implementation was named Hadoop Distributed File System (HDFS).

Hadoop is considered a suitable solution for Big Data for several reasons:

  • It is an open source project, as already mentioned, which allows it to be modified for customization purposes and keeps it open to constant improvement thanks to its collaboration network. Because of these characteristics, various derivative or complementary projects were, and still are, being created;
  • It provides economy, since it does not require the payment of licenses and supports conventional hardware, allowing projects to be built with considerably cheaper machines;
  • Hadoop comes, by default, with fault-tolerance capabilities such as data replication (the sketch after this list shows how a file's replication factor can be inspected and changed);
  • Hadoop is scalable: if a larger amount of data needs to be processed, it is possible to add computers without performing complex reconfigurations of the system.
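
As a concrete illustration of the replication point above, the sketch below uses the HDFS Java FileSystem API to read and change the replication factor of a single file. It is only a minimal sketch under assumed values: the NameNode address, the file path and the new replication factor are placeholders for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumption: address of the cluster's NameNode; adjust to your environment.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Assumption: a file that already exists in HDFS.
            Path file = new Path("/data/example.txt");

            // Each block of the file is stored on several DataNodes (3 by default),
            // so the loss of a single machine does not lose the data.
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication factor: " + current);

            // Ask HDFS to keep 5 copies of every block of this file instead.
            fs.setReplication(file, (short) 5);

            fs.close();
        }
    }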

It is clear that Hadoop can be used in conjunction with NoSQL databases. The Apache Foundation itself has a solution that is a kind of sub-project of Hadoop: the HBase database, which works on top of HDFS. Many clients of cloud hosting providers also make use of Hadoop for their projects.
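
As a rough idea of how an application talks to HBase, the sketch below writes and reads a single cell using the HBase Java client API. The table name, column family, row key and values are assumptions made only for this example, and the table is assumed to already exist in the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath to locate the cluster.
            Configuration conf = HBaseConfiguration.create();

            // Assumption: a table named "users" with a column family "info" already exists.
            HTable table = new HTable(conf, "users");

            // Write one cell: row "row1", column info:name, value "Alice".
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));

            table.close();
        }
    }

The HTable class shown here belongs to the older HBase client API that was current around the time this article was written; newer versions of HBase expose the same operations through Connection and Table interfaces instead.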

The name Hadoop has an unusual origin: it comes from a stuffed elephant that belonged to the son of Doug Cutting, one of the leading names behind the project.

Hadoop is the most prominent option, but not the only one. It is possible to find other solutions that are compatible with NoSQL or that are based on Massively Parallel Processing (MPP).
