Data Integration: ETL tools

The main objective of an “ETL tool” is to gather the content of the various databases and operational systems across your organization and move/copy/transfer all this data into a centralized database. This “centralized database” is often called a “data warehouse”. Alternatively, ETL tools are also used to perform complex data transformations that are difficult to express using simple SQL statements.
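
To make this concrete, here is a minimal sketch of the extract-transform-load pattern in plain JavaScript (the three callbacks are hypothetical stand-ins, not the API of any particular ETL tool):

    // Minimal sketch of what an ETL pipeline does. The three callbacks are
    // hypothetical stand-ins; real tools use dedicated connectors instead.
    function etl(extract, transform, load) {
      var rows = extract();            // E: read from an operational system
      var out  = rows.map(transform);  // T: reshape / clean each record
      load(out);                       // L: write into the data warehouse
    }

    // Example with in-memory stand-ins for the source and the warehouse:
    var warehouse = [];
    etl(function ()     { return [{ id: 1, name: " Ada " }]; },
        function (r)    { return { id: r.id, name: r.name.trim() }; },
        function (rows) { warehouse = warehouse.concat(rows); });
    console.log(warehouse);   // → [ { id: 1, name: "Ada" } ]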

The main differences between ETL tools lie in the following areas:

Speed & Scalability

ETL tools typically transfer (or process) huge amounts of data between different servers. Let’s assume that, each day, you need to process 100 TB of data but, due to the sheer volume, this requires 28 hours of computing time. You see the problem: you need 28 hours of computing time every day, and a day only has 24. This is why, most of the time, scalability goes hand in hand with speed. The slowest ETLs are Talend, Pentaho and Rapid Insight (this last one is very, very slow). Talend has some “memory issues” that prevent it from processing large volumes of data. The technology behind Anatella is optimized C++; it is thus very fast.
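
To make the arithmetic explicit, here is a back-of-the-envelope calculation in plain JavaScript, using the figures from the example above:

    // 100 TB per day, processed in 28 h of computing time.
    var dataPerDayTB = 100;     // daily volume to process
    var computeHours = 28;      // time needed to process one day of data

    var actualThroughput   = dataPerDayTB / computeHours;  // ≈ 3.57 TB/h
    var requiredThroughput = dataPerDayTB / 24;            // ≈ 4.17 TB/h

    // Each day you fall (28 − 24) = 4 hours behind, so the unprocessed
    // backlog grows by roughly 4 h × 3.57 TB/h ≈ 14.3 TB every day.
    var dailyBacklogTB = (computeHours - 24) * actualThroughput;
    console.log(dailyBacklogTB.toFixed(1) + " TB of backlog per day");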

To achieve very high throughput, one solution is to use several computers “in parallel”, but then you need to hire more IT staff to manage your whole “fleet of computers”. To reduce IT staff costs, you can host your “fleet of computers” at a remote location. Some companies, like Amazon (“Elastic Compute Cloud”), Microsoft (“Windows Azure”) or OVH, offer such specialized “computing servers”. This solution is very often referred to as “in the cloud” because the servers are usually located far away, “in the cloud”. This looks like a good solution but there are some strong drawbacks to this approach:

  • The rental fee for the servers is usually prohibitive.
  • Your precious customer data is now hosted “outside your walls”. This can be a major security issue.

Scalability is not only about the raw volume of data that you can process. All processed data ends up in (large) tables. When you do “advanced predictive analytics” (as we do with TIMi), you usually end up with tables of 500 to 50,000 columns. No ETL tool other than Anatella is able to handle such a large number of columns.

Versatility

Thanks to its advanced scripting language (based on JavaScript), Anatella is extremely flexible: it can adapt to almost any situation. With Anatella, you’ll always be able to create the data transformations that you need. The scripting language integrated inside Anatella is somewhat similar to the one used inside a SAS DATA step, so you can easily leverage your SAS Base skills.
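
As a rough illustration, here is what a simple row-level transformation looks like in plain JavaScript, in the spirit of a SAS DATA step (the function and field names are hypothetical and do not reflect the actual Anatella API):

    // Derive new columns from each incoming row, DATA-step style.
    // (Hypothetical names; not the actual Anatella scripting API.)
    function transformRow(row) {
      return {
        fullName: row.firstName + " " + row.lastName,
        ageBand:  (row.age < 30) ? "young" :
                  (row.age < 60) ? "adult" : "senior"
      };
    }

    var rows = [{ firstName: "Ada", lastName: "Lovelace", age: 36 }];
    console.log(rows.map(transformRow));
    // → [ { fullName: "Ada Lovelace", ageBand: "adult" } ]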

Cost of ownership

Most ETL tools carry an extremely high purchase price.

The cost of ownership must also include the cost of the staff required to operate the ETL. For some solutions, this cost is very high because you need to hire expensive, highly trained consultants to use the software (these consultants must know the particular “scripting language” of the application). In contrast, nearly anybody can use Anatella’s user-friendly, GUI-based interface.

Slow ETLs also tend to become more expensive because you must hire more IT staff to manage your whole “fleet of computers” (or, alternatively, receive large bills from the company that hosts your servers “in the cloud”). It should be noted that Anatella (i.e. our ETL tool) is so fast that, 99.9% of the time, you only need one server to do all the computations.

User Interface

There are basically two families of ETL tools:

  • Script-based ETL tools. Examples: SAS Base, Scriptella, …
  • Graph-based ETL tools. Examples: Kettle, CloverETL, Talend, …

The “war” among ETL practitioners over which ETL family (script-based or graph-based) is best is now over: nowadays, the only commercial ETL tool still based on scripting technology is SAS.

Our ETL tool (Anatella) offers both a Script-based interface AND a Graph-Based interface.

Although script-based ETL tools are outdated and practically extinct, they still offer some advantages: they are usually more versatile, giving the end user greater flexibility when defining new transformations. On the other hand, it is typically very difficult to develop new data transformations using the few script-based ETL tools that remain commercially available. The main “barrier” with a script-based ETL tool is that you have to learn yet another new language, one possibly completely different from any “common language” known to most programmers.

Anatella is one of the very few ETL tools that use a “hybrid technology”:

  • Simple transformations are described using “little boxes” (this is the most intuitive way to represent a data transformation and the “de facto” standard for all modern ETL tools).
  • Complex transformations are programmed using a scripting language based on JavaScript (standard ECMA-262), which is simple, complete and extremely versatile. JavaScript is one of the most widely used programming languages in the industry (it runs inside every webpage!). You can leverage your existing JavaScript skills to become an ETL expert almost instantly! (See the sketch just after this list.)
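
For example, here is the kind of stateful, row-order-dependent logic that is awkward to draw with boxes or to write as a single SQL statement, expressed in plain ECMA-262 JavaScript (the function and column names are hypothetical, not the actual Anatella scripting API):

    // Running per-customer total: state is carried across rows.
    // (Hypothetical names; not the actual Anatella scripting API.)
    function runningTotals(rows) {
      var totals = {};                 // state kept between rows
      return rows.map(function (row) {
        totals[row.customerId] = (totals[row.customerId] || 0) + row.amount;
        return { customerId:   row.customerId,
                 amount:       row.amount,
                 runningTotal: totals[row.customerId] };
      });
    }

    console.log(runningTotals([
      { customerId: "C1", amount: 10 },   // → runningTotal: 10
      { customerId: "C1", amount:  5 },   // → runningTotal: 15
      { customerId: "C2", amount:  7 }    // → runningTotal: 7
    ]));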

Differences in results between these tools

None. Given enough time and enough human resources, you can achieve the same result with all the above tools: nearly all of them are capable of delivering the final “end-result” (e.g. the final “customer profiling table”). The only exceptions that I can think of might be:

  1. When you are manipulating tables with a very large number of columns (e.g. above 2,000), only Anatella can do it properly.
  2. When there are tight time constraints, you need a fast ETL. Even in such a delicate situation, though, there are still ways to work around the inefficiency of the ETL.

If you apply the same set of transformations to the same source data, you’ll obtain the same results whatever the ETL tool used. This seems logical. The same is not true of advanced analytics software: when working on the same dataset and using the same modeling methodology, you’ll still obtain different predictive models, with different ROI, depending on the modeling tool used. This difference can be very significant: in most cases, TIMi gives 5 to 15% more ROI.

Although the differences between ETL tools are practically nonexistent in terms of end-result, you can still observe large differences when it comes to speed, versatility, cost of ownership, quality of the interface (“user-friendliness”), energy efficiency, etc.