DATA LAKE

Why is the “data lake” approach so popular these days?
Why is the “data warehouse” approach on the decline?
The answer is simple: Speed, size, agility and resilience.

Speed

You need to be able to read and process the dataset files stored inside your data lake at the highest possible speed. This is why all TIMi customers store the data in their data lake as .gel_anatella and .cgel_anatella files. Indeed, on most common infrastructure, one PC can read these files at a speed above 1000MB/sec (data stream after decompression).

If you are already using a Hadoop system, your data lake will most certainly be composed of .parquet files, which is the fastest file format available on Hadoop. Because Anatella is coded 100% in C and assembler, it can natively read and write .parquet files at a very high speed (around 200MB/sec), much higher than any other tool inside the Hadoop ecosystem (around 50MB/sec). This makes Anatella the ideal solution for the analysis of the data stored inside your Hadoop data lake.

If you are already using a SAS system, your data lake will most certainly be composed of .sas7bdat files, which is the fastest file format available on SAS. Anatella can natively read and write .sas7bdat files at a very high speed (i.e. most of the time, at a higher speed than SAS itself). This makes Anatella the ideal solution for the analysis of the data stored inside your SAS data lake.

Processing-Time

If you are using Anatella, your data lake directory can be stored on a “standard drive” or within HDFS: the only difference is speed. But speed is the main reason behind the construction and adoption of a data lake, so it’s very important. For example, I have seen companies that were really happy to switch to a data lake because they could barely update their data warehouse each day: the daily “update” of their data warehouse took 23 hours, which is dangerously close to 24 hours! After switching to Anatella, the “update” time was reduced to 20 minutes.

On a standard NAS or SAN, you can read files at about 500 MB/sec. On the other hand, HDFS drives are limited to speeds of 5 to 50MB/sec. So, we advise all our customers to store their datasets on a NAS or a SAN and to use HDFS only for long-term cold (arctic) storage. Basically, you can replace your old “tapes” with HDFS, but don’t expect much more from HDFS.
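If you want to check what your own storage delivers before deciding where to put your datasets, a rough measurement is easy to script. The sketch below is a minimal illustration in Python, not a TIMi tool: the function name `measure_read_throughput` and the 8MB block size are assumptions made for this example.

```python
import time


def measure_read_throughput(path, block_size=8 * 1024 * 1024):
    """Read the file at `path` sequentially and time it.

    Returns (bytes_read, megabytes_per_second), using decimal megabytes.
    """
    bytes_read = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break  # end of file
            bytes_read += len(block)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    mb_per_sec = (bytes_read / 1_000_000) / max(elapsed, 1e-9)
    return bytes_read, mb_per_sec
```

Run it on a file that is larger than the RAM of the machine (or that has not been read recently). Otherwise, the operating system cache will answer instead of the drive and the measured speed will be unrealistically high.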

Size

A data lake is simply a large directory that can contain as many dataset files as you want. In contrast, a data warehouse is stored inside an RDBMS (i.e. inside Oracle, Teradata, SQLServer, etc.) and this directly imposes a (financial) limit on the quantity of data that can be stored inside a data warehouse. Indeed, the more data you store inside a database, the higher the cost of the database.

Storage cost

When using a distributed storage system such as HDFS, a large quantity (i.e. exactly two thirds) of your storage space is “lost” to provide resilience against hardware failure. This means that each time you want to add 1TB of storage inside HDFS, you actually need to purchase 3TB of physical space. This has a cost. When you add this additional cost (66% of your storage bill) to the hosting, maintenance and support costs of a large Hadoop cluster, you quickly realize that Hadoop is really expensive. Most of the companies that we see adopting HDFS storage have annual invoices from their cloud service provider that range from half a million to several million euros.
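The arithmetic behind these figures can be written down in a few lines. The sketch below assumes a replication factor of 3, which is what the "3TB for 1TB" figure above implies. The function names are chosen for this example only.

```python
def raw_capacity_needed(usable_tb, replication_factor=3):
    """Physical space to purchase for `usable_tb` of usable storage."""
    return usable_tb * replication_factor


def overhead_share(replication_factor=3):
    """Fraction of the purchased space that only holds redundant copies."""
    return (replication_factor - 1) / replication_factor


print(raw_capacity_needed(1))       # 3 (TB of physical space for 1TB of data)
print(f"{overhead_share():.0%}")    # 67%, i.e. two thirds
```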

Data compression

Even if your data lake is hosted on a simple NAS or HDFS drive, you can still run out of disk space. This situation will most likely never happen if you are using .gel_anatella or .cgel_anatella files because these files are strongly compressed (compared to a standard RDBMS, the data size is usually divided by 100).

Other analytic systems such as SAS or Hadoop also offer to *optionally* store their datasets in a compressed file format. But, in practice, the penalty in terms of reading and writing speed is so large that we know of no company using SAS or Hadoop that actually compresses its datasets. The lack of compression directly translates into a huge waste of disk space.

This makes Anatella a cost-efficient solution for high-speed storage of large quantities of data.

Agility

You should be able to run any kind of analytics on the data stored inside your data lake.

BI tools

You should be able to run BI tools such as Kibella, Tableau, QlikView, etc. to rapidly create any kind of (web) report. Anatella contains little boxes that export, at high speed (from 9 to 90 times faster than an RDBMS), the datasets stored in your data lake to the formats used by the most common BI tools (i.e. to .hyper files for Tableau or to .qvd files for QlikView). This makes Anatella the best solution for data preparation for BI tools.

Advanced analytics

Most of the time, before engaging in any kind of advanced analytics work, you must first export your data out of the RDBMS to the TIMi/R/Python/SAS environment. This export procedure is very slow (the speed is around 10 to 20 MB/sec).

With TIMi and Anatella, things are different. Anatella integrates natively with all versions of R, Python, SAS, etc. Furthermore, you can export the datasets stored inside your data lake at a very high speed if you need to. The speed is only limited by your drive: i.e. 2000MB/sec for a standard NVMe SSD drive.
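To get a feeling for what these speeds mean in practice, here is a small back-of-the-envelope calculation in Python. The 1TB dataset size is an illustrative assumption (in decimal units), and the function name is made up for this example. The speeds are the ones quoted above.

```python
def transfer_time_seconds(size_mb, speed_mb_per_sec):
    """Time needed to move `size_mb` megabytes at a constant speed."""
    return size_mb / speed_mb_per_sec


ONE_TB_IN_MB = 1_000_000  # assumption: 1TB = 1,000,000MB (decimal units)

scenarios = [
    ("RDBMS export at 10MB/sec", 10),
    ("RDBMS export at 20MB/sec", 20),
    ("NVMe SSD drive at 2000MB/sec", 2000),
]
for label, speed in scenarios:
    hours = transfer_time_seconds(ONE_TB_IN_MB, speed) / 3600
    print(f"{label}: {hours:.2f} hours")
```

Under these assumptions, the export out of the RDBMS takes between roughly 14 and 28 hours, while the same volume leaves an NVMe drive in a little over 8 minutes.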

No code

Using Anatella, you can analyze datasets of any size, whatever the volume or the complexity of the task, almost without typing a single line of code. This makes the analysis of the data stored inside your data lake much more efficient and reliable.

Typically, one analyst with Anatella processes up to 20 times more work than with other kinds of technologies (Python, R, Hadoop, SQL).

Resilience

Resilience comes in three forms:

Resilience against user errors

In a typical data lake configuration, there are always a bunch of automated scripts running during the night to prepare a set of curated and cleaned datasets that all the analysts will use during the day to run their analyses. You absolutely want to prevent these “cleaned datasets” from being tampered with. Nobody should have the right to modify (and thus “break”) these “reference” datasets. This is easy to enforce if you use .gel_anatella or .cgel_anatella files for your data lake. Since .gel_anatella and .cgel_anatella files are read-only, nobody can “break” them.
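If part of your data lake still uses other file formats, a similar protection can be approximated at the file-system level by having the nightly script remove the write permission once a reference dataset is ready. The sketch below is a generic Python illustration and not the mechanism used by .gel_anatella files. The function names are assumptions made for this example.

```python
import os
import stat

# Write permission bits for owner, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    """Remove the write permission for everybody on `path`."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current & ~WRITE_BITS)


def is_read_only(path):
    """Return True when no write permission bit is set on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & WRITE_BITS) == 0
```

Keep in mind that this is weaker than a file format that is read-only by design: an administrator, or the owner of the file, can always restore the write permission.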

Resilience against transportation fault

In a world dominated by the “Cloud”, your dataset files go through many “wires” before arriving at their final destination to be analyzed. Although internet communication is more and more reliable, there will come a point when, during the data transfer, a single byte of data gets corrupted inside a large 1TB dataset file. This happens all the time when working on large datasets in large data lakes.

So, you waited for a few hours (sometimes even a few days) to get your dataset file and one byte is corrupted. What happens now? If your data lake is based on .parquet files (i.e. it’s a Hadoop system) or .sas7bdat files (i.e. it’s a SAS system), you are pretty much “out of luck”: you’ll be forced to restart your data transfer and pray that it won’t fail this time (spoiler: it will fail again).

This is where .gel_anatella files are different: they are resilient. If one part of the file is corrupted, you can still read and use the 99.99% of the file that is still valid (if you want to).
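The general idea behind this kind of resilience can be illustrated with per-chunk checksums: if each chunk of a file carries its own checksum, a corrupted byte only invalidates one chunk instead of the whole file. The Python sketch below shows this generic technique. It is not a description of the internal layout of .gel_anatella files, and the function names and the 1024-byte chunk size are assumptions made for this example.

```python
import hashlib


def chunk_digests(data, chunk_size):
    """Return one SHA-256 digest per `chunk_size` slice of `data`."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data, expected_digests, chunk_size):
    """Return the indexes of the chunks whose digest no longer matches."""
    actual = chunk_digests(data, chunk_size)
    return [
        index
        for index, (got, expected) in enumerate(zip(actual, expected_digests))
        if got != expected
    ]


original = bytes(range(256)) * 40          # 10,240 bytes, i.e. 10 chunks
digests = chunk_digests(original, 1024)    # computed before the transfer

damaged = bytearray(original)
damaged[5000] ^= 0xFF                      # one byte corrupted "in transit"

print(find_corrupt_chunks(bytes(damaged), digests, 1024))  # [4]
```

Only chunk 4 has to be discarded (or transferred again). The 9 other chunks are still known to be valid and remain usable.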

Resilience against crash

Sometimes computers crash. What matters is that your data is still there after the crash. This is why we advise our customers to store their data on a NAS/SAN configured with RAID6 storage. With RAID6 storage, the capacity of two physical drives (i.e. one third of the drives inside a six-drive SAN/NAS) is used to be able to reconstruct your data in case of hard-drive failure. With RAID6 storage, you are resilient against 2 simultaneous failures: i.e. you can have 2 drives that are broken at the same time and still have access to all your data without any loss.

An alternative to RAID6 is HDFS. When using HDFS storage, two thirds of the physical drives inside your storage are used to be able to reconstruct your data in case of failure. With HDFS storage, you have the same resilience as with RAID6 (i.e. you can have 2 simultaneous failures without losing any data), but the “lost” space is double the space lost when using RAID6.
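The comparison between the two approaches can be checked with a short calculation. RAID6 reserves the capacity of two drives whatever the size of the array, so the "one third" figure corresponds to a six-drive array, and larger arrays lose proportionally less. The Python sketch below assumes an HDFS replication factor of 3. The function names are chosen for this example only.

```python
def raid6_usable_fraction(num_drives):
    """Usable share of a RAID6 array (two drives' worth of parity)."""
    if num_drives < 4:
        raise ValueError("RAID6 needs at least 4 drives")
    return (num_drives - 2) / num_drives


def replication_usable_fraction(replication_factor=3):
    """Usable share of a storage system keeping full copies of the data."""
    return 1 / replication_factor


for drives in (6, 12):
    print(f"RAID6 with {drives} drives: "
          f"{raid6_usable_fraction(drives):.3f} usable")
print(f"HDFS with 3 copies: {replication_usable_fraction():.3f} usable")
```

With six drives, RAID6 leaves two thirds of the space usable against one third for HDFS, for the same tolerance to 2 simultaneous failures.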