6.4. Do I need a Hadoop HDFS drive?



The advantages of HDFS are:

It’s a really cheap storage solution compared to a traditional database system because:

o It only uses commodity-grade hardware (i.e. low-cost PCs), which is much cheaper than the dedicated/specialized hardware (such as InfiniBand network cards) found in Teradata, Exadata, etc. databases.

o It's open-source, free software: there are no licensing costs.

It's an easily extensible solution: if you need more storage, simply add more servers (these servers are named "data nodes" in Hadoop terminology).

An HDFS drive can store files that are larger than one physical drive. For example, an HDFS drive can store a 2TB file even though it is composed only of 1TB physical hard drives. However, the Parquet file format (which is commonly used in Hadoop) does not support files of that size.
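The reason a file can exceed the size of a single physical drive is that HDFS splits each file into fixed-size blocks and spreads them over the data nodes. The following is a minimal sketch of that arithmetic, not of HDFS itself: the 128 MiB block size is the usual default (`dfs.blocksize`) and is configurable, and the function names are hypothetical helpers introduced for illustration.

```python
# Sketch only: block size and drive sizes are illustrative assumptions.
BLOCK_SIZE = 128 * 1024**2   # 128 MiB, the usual HDFS default (dfs.blocksize)
TB = 1000**4                 # 1 TB as sold by disk vendors

def hdfs_block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file is split into (ceiling division)."""
    return -(-file_size // block_size)

def min_drives_needed(file_size, drive_size, replication=1):
    """Fewest physical drives whose combined capacity holds all replicas."""
    return -(-(file_size * replication) // drive_size)

blocks = hdfs_block_count(2 * TB)
drives_no_replica = min_drives_needed(2 * TB, 1 * TB)
drives_3_replicas = min_drives_needed(2 * TB, 1 * TB, replication=3)

print(f"A 2TB file is split into {blocks} blocks")
print(f"1TB drives needed without replication: {drives_no_replica}")
print(f"1TB drives needed with 3 replicas:     {drives_3_replicas}")
```

Because each block is small compared to a drive, no single drive ever has to hold the whole file.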


The disadvantages of HDFS are:


It’s a very expensive storage solution compared to a traditional SSD/NAS/SAN/RAID6 drive.

It’s more expensive because you need to hire specialized staff for maintenance and support of all the tools inside the Hadoop ecosystem (including, of course, your HDFS drive).

The tools in the Hadoop ecosystem are known for their high maintenance and support costs: you'll need specialized staff who are able to keep your Hadoop environment "up and running". Luckily, the HDFS drive is among the easiest Hadoop components to maintain (but competent staff is still required).

It's inefficient in terms of storage consumption. Let's take a simple example: assume that you want your storage system to stay operational (without any data loss) even if 2 physical disks fail catastrophically (i.e. the storage system must be resilient to 2 failures). Under this condition, to store a 2GB file, we'll have:

o in a RAID6 drive: 3GB of disk space is used (assuming, for example, a 6-disk array: 4 data disks + 2 parity disks).

o in an HDFS drive: 6GB of disk space is used (3 replicas of the file, which is the default HDFS replication factor).

This means that the storage cost is (at least) two times higher for HDFS than for RAID6 (because you need to buy twice as many physical disks to get the same capacity and the same resilience).
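The figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the RAID6 array is assumed to have 6 disks (4 for data, 2 for parity), HDFS is assumed to use plain replication with 3 copies, and the function names are hypothetical.

```python
def raid6_raw_gb(file_gb, total_disks):
    """Raw disk space used by RAID6: two disks' worth of capacity holds parity."""
    if total_disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    data_disks = total_disks - 2
    return file_gb * total_disks / data_disks

def hdfs_raw_gb(file_gb, replication=3):
    """Raw disk space used by HDFS with plain replication.

    Surviving 2 disk failures requires 3 copies of every block.
    """
    return file_gb * replication

raid6 = raid6_raw_gb(2, total_disks=6)   # assumed 6-disk array
hdfs = hdfs_raw_gb(2, replication=3)

print(f"RAID6: {raid6} GB of raw disk space")
print(f"HDFS:  {hdfs} GB of raw disk space")
print(f"HDFS uses {hdfs / raid6} times more disk space")
```

The exact ratio depends on the size of the RAID6 array: the more data disks share the two parity disks, the lower the RAID6 overhead becomes.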

Data access is quite slow, especially compared to a local SSD/RAID6 drive (which delivers between 500 MByte/sec and 2000 MByte/sec). With HDFS, you can expect a read speed between 5 MByte/sec and 50 MByte/sec (it mainly depends on your network cards).

This is extremely slow compared to an SSD/RAID6 drive.
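To make the difference concrete, the sketch below converts the throughput ranges quoted above into the time needed to read a 2GB file. It only does the division; it assumes a sustained sequential read at the quoted speed, and the function name is a hypothetical helper.

```python
def read_time_seconds(file_gb, throughput_mbyte_per_sec):
    """Time to read a file sequentially at a constant throughput (1 GB = 1000 MB)."""
    return file_gb * 1000 / throughput_mbyte_per_sec

FILE_GB = 2
speeds = [
    ("HDFS, low end", 5),
    ("HDFS, high end", 50),
    ("SSD/RAID6, low end", 500),
    ("SSD/RAID6, high end", 2000),
]

for label, speed in speeds:
    seconds = read_time_seconds(FILE_GB, speed)
    print(f"{label:20s} {speed:5d} MByte/sec -> {seconds:6.1f} s")
```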


The two major drawbacks of an HDFS drive are (1) its high price (compared to a local SSD/RAID6 drive) and (2) its low I/O speed (again, compared to an SSD/RAID6 drive). Despite these two major flaws, the HDFS drive is one of the few solutions that offer (theoretically) unlimited data capacity (just add more nodes to get more capacity). It can thus make sense to use an HDFS drive if you need to store really large volumes of cold data that do not fit on an SSD/RAID6 drive (despite the very efficient data compression algorithms used in the .gel_anatella and .cgel_anatella files used in Anatella). That said, situations where the use of an HDFS drive is truly justified by technical constraints almost never occur (at least, not when you are using the highly compressed file formats available in Anatella).



For example: the raw CDRs (Call Data Records) from VIVO in Brazil (a telecom operator with more than 85 million customers) are processed every day in a few hours without any HDFS drive: one NAS (equipped with a RAID6 drive) and 3 laptops with Anatella&TIMi are enough to compute as many KPIs or predictive models as you want.


So, if you are a telecom operator with fewer than 85 million customers, there are no technical constraints that should push you to use a Hadoop HDFS drive (at least, not when you are using Anatella&TIMi).