ETL benchmark: processing time on 1 billion rows
Data preparation solutions differ greatly in processing speed. According to recent benchmarks by the researchers from IntoTheMind, processing times vary by a factor of up to 145 depending on the tool and the data format used.
In the daily life of many enterprises, a large share of data preparation work is still carried out on flat files extracted from information systems. However, handling large files can quickly make data preparation laborious and very expensive in terms of cloud processing costs. If you choose a “no code” ETL solution, you should therefore choose one that is fast, especially if you work in the cloud and use it often.
Test methodology
For this test, the researchers from IntoTheMind used a 43.6 GB CSV file with 1.039 billion rows and 9 columns. The data processing test consisted of three steps:
- Opening the CSV file
- Sorting in descending order on the first column
- Performing a “group by” on the values in the 7th column
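To make the three steps concrete, here is a minimal Python sketch of the same operations on a tiny in-memory sample. It is not the benchmark itself: the sample data, the integer type of the first column, the row count used as the "group by" aggregate, and the helper name `run_steps` are all assumptions, since the original file's schema and the exact aggregate are not described.

```python
import csv
import io
from collections import Counter

# Hypothetical 9-column sample; the real benchmark file's schema is not given.
SAMPLE = """\
3,a,b,c,d,e,red,x,y
1,a,b,c,d,e,blue,x,y
2,a,b,c,d,e,red,x,y
"""

def run_steps(handle):
    # Step 1: open and parse the CSV input.
    rows = list(csv.reader(handle))
    # Step 2: sort in descending order on the first column
    # (assumed here to hold integers).
    rows.sort(key=lambda r: int(r[0]), reverse=True)
    # Step 3: "group by" the 7th column (index 6); counting rows per
    # value is an assumed aggregate.
    counts = Counter(r[6] for r in rows)
    return rows, counts

rows, counts = run_steps(io.StringIO(SAMPLE))
print([r[0] for r in rows])  # ['3', '2', '1']
print(dict(counts))          # {'red': 2, 'blue': 1}
```

This version loads every row into memory, which is fine for a toy sample but is exactly the part that becomes costly on a file with more than a billion rows.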
Four well-known ETL tools were tested:
- Talend Open Studio v7.3.1
- Anatella v2.35
- Tableau Prep 2020.2.1
- Alteryx 2020.1
The tests were carried out on a desktop machine equipped with 96 GB of RAM and a 7th-generation i7 processor, with the data stored on a Western Digital 6 TB HDD running at 7200 rpm. The same test was then repeated with an SSD instead of the HDD. Each test was run three times, and the lowest of the three processing times was kept.
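The "best of three" timing rule can be sketched as follows. This is only an illustration of the principle, not the harness used by the researchers; `best_of` and `workload` are hypothetical names, and the workload is a stand-in for a real ETL run.

```python
import time

def best_of(func, runs=3):
    """Run func `runs` times and return the lowest elapsed time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

def workload():
    # Hypothetical stand-in for one full run of the ETL pipeline.
    sorted(range(100_000), reverse=True)

print(f"best of 3: {best_of(workload):.4f} s")
```

Keeping the minimum rather than the mean reduces the influence of one-off disturbances such as background processes or a cold disk cache.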
Results: Anatella in pole position
The slowest solution is Talend Open Studio v7.3.1, which takes almost 4 hours (3 h 52 min) to process the data. The fastest is Anatella v2.35, which takes just 96 seconds to process the same dataset. On this simple benchmark, Anatella is roughly 145 times faster than Talend Open Studio. A more complex benchmark would show an even greater difference in favor of Anatella.
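The factor of 145 follows directly from the two published timings. The short check below assumes that "3:52" means exactly 3 hours and 52 minutes; the published figures are rounded, so the result is approximate.

```python
from datetime import timedelta

# Assumed reading of the published timings.
talend = timedelta(hours=3, minutes=52)  # Talend Open Studio v7.3.1
anatella = timedelta(seconds=96)         # Anatella v2.35

# Dividing one timedelta by another yields a plain float ratio.
speedup = talend / anatella

print(talend.total_seconds())  # 13920.0
print(round(speedup, 1))       # 145.0
```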
Another finding of the researchers from IntoTheMind is that using an SSD instead of an HDD does not always result in a performance gain. However, using a proprietary format (a feature only available in Anatella and Alteryx) can significantly improve data processing time.