5.2.7. Anatella “Columnar Gel” file reader (column-based storage)


 
Icon: columnarRead

 
Function: readColumnarGel
 

Property window:

 

[Screenshots of the property window]

 

 

 

Short description:

 

Reads a table from a “.cgel_anatella” file.

 

Long Description:

 

Reads a table from a “.cgel_anatella” file (and from the associated “column set” data files “*.NNN.cs_anatella”). See section 5.1.1. for more information on how to specify the filename of the “.cgel_anatella” file (i.e. you can use relative paths, wildcards, and Javascript to specify the filename). You can also connect a table containing (many) filenames to the input pin of the ColumnarGelFile Reader.
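Because one table is spread over a “.cgel_anatella” file and several “column set” files, it can help to list them together, for example before copying a table to another folder. The following is a minimal sketch, not part of Anatella: it assumes that “NNN” stands for a run of digits, and the file names used in the example (“sales…”) as well as the helper name are hypothetical.

```python
import re

# Assumption: "NNN" in "*.NNN.cs_anatella" is a run of digits.
COLUMN_SET = re.compile(r"\.(\d+)\.cs_anatella$")

def split_columnar_gel_names(names):
    """Separate main ".cgel_anatella" files from numbered column-set files."""
    main_files = [n for n in names if n.endswith(".cgel_anatella")]
    numbered = []
    for name in names:
        match = COLUMN_SET.search(name)
        if match:
            numbered.append((int(match.group(1)), name))
    # Column-set files are returned in increasing order of their number.
    return main_files, [name for _, name in sorted(numbered)]

# Hypothetical folder content, for illustration only.
folder = ["sales.cgel_anatella", "sales.002.cs_anatella",
          "sales.001.cs_anatella", "notes.txt"]
main_files, column_sets = split_columnar_gel_names(folder)
```

The sketch only inspects file names; it makes no claim about what the part of the name before “.NNN” looks like.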
 

 


You can drag&drop a “.cgel_anatella” file from an MS File Explorer window into an Anatella graph window: this directly creates the corresponding ColumnarGelFile Reader Action inside the Anatella graph.

 

 
 
Anatella possesses two highly efficient proprietary file formats that allow you to handle any “Big Data” problem with ease. These two file formats are:
 

“.gel_anatella” files: Optimized for speed and for low RAM consumption. Ideal when processing all the columns and all the rows inside a table. Since the “.gel_anatella” files have a relatively low RAM consumption, you can open thousands of them simultaneously (for example, when using the mergeSortInput Action: see section 5.2.15.).

“.cgel_anatella” files: Optimized for the best speed and the minimum quantity of I/O transfer. To minimize the number of bytes extracted from the hard drive, you can configure the ColumnarGelFile Reader to read only a (small) subset of the columns and a (small) subset of the rows: the smaller the subset, the higher the processing speed.
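The difference between the two layouts can be illustrated with a toy model. This sketch does not reproduce the actual “.gel_anatella” or “.cgel_anatella” formats: it simply stores the same hypothetical table once row by row and once column by column, and counts how many bytes must be transferred to obtain a single column.

```python
import struct

# Toy table (hypothetical data): 1000 rows, 4 columns of 8-byte floats.
N_ROWS, N_COLS = 1000, 4
ROW_SIZE = 8 * N_COLS
table = [[float(r * N_COLS + c) for c in range(N_COLS)] for r in range(N_ROWS)]

# Row-based layout: the values of one row are stored next to each other.
row_store = b"".join(struct.pack(f"<{N_COLS}d", *row) for row in table)

# Column-based layout: each column is stored as its own contiguous block.
col_store = [struct.pack(f"<{N_ROWS}d", *(row[c] for row in table))
             for c in range(N_COLS)]

def read_column_from_rows(col):
    """Return (values, bytes_transferred); the whole row stream is scanned."""
    values = [struct.unpack_from(f"<{N_COLS}d", row_store, r * ROW_SIZE)[col]
              for r in range(N_ROWS)]
    return values, len(row_store)

def read_column_from_columns(col):
    """Return (values, bytes_transferred); only one column block is read."""
    block = col_store[col]
    return list(struct.unpack(f"<{N_ROWS}d", block)), len(block)

values_a, bytes_row_based = read_column_from_rows(2)
values_b, bytes_col_based = read_column_from_columns(2)
```

In this toy model both functions return the same values, but the column-based layout transfers a quarter of the bytes because only one of the four columns is touched.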

 

 
The Columnar Gel files have the same set of great features as the simpler “.gel_anatella” files. More precisely:
 

The Columnar Gel files contain the same meta-data as a simple “.gel_anatella” file (as a reminder, these meta-data are: the column names, the column types (Key, Float or Unknown/String), the sorting flags and the “complete” flag), plus some additional meta-data that make it possible to extract only a subset of the columns and a subset of the rows from the hard drive (to reduce the required I/O and gain speed).
 

All the data inside the files are compressed. In contrast to the simpler “.gel_anatella” file (which uses only one generic data compression algorithm), the “.cgel_anatella” columnar gel file uses different compression algorithms for the different data types, achieving a (slightly) better compression.
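To see why a type-aware scheme can beat a single generic algorithm, consider the following sketch. It is an illustration only: the algorithms actually used inside the “.cgel_anatella” files are not described here, and the delta encoding, the use of zlib and the sample key column are all assumptions chosen for the example.

```python
import struct
import zlib

# Hypothetical sorted integer key column.
keys = list(range(100_000, 110_000))

def compress_generic(values):
    """One generic algorithm: zlib over the plain-text representation."""
    return zlib.compress("\n".join(map(str, values)).encode("ascii"))

def compress_int_column(values):
    """Type-aware variant: delta-encode the integers, then apply zlib."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return zlib.compress(struct.pack(f"<{len(deltas)}q", *deltas))

def decompress_int_column(blob):
    """Invert compress_int_column by summing the deltas back up."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values

generic_size = len(compress_generic(keys))
typed_size = len(compress_int_column(keys))
```

For this column the deltas are almost all equal to 1, so the type-aware variant produces a much smaller output than the generic one while remaining lossless.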
 

All I/O algorithms are asynchronous (i.e. non-blocking):

o Inside the ColumnarGelWriter Action, an asynchronous (i.e. non-blocking) I/O algorithm creates the “.cgel_anatella” files and the “.cs_anatella” files. Furthermore, you can decide to use many threads/CPUs to create the files, to increase the writing speed even further.

o Inside the ColumnarGelFile Reader, an asynchronous (i.e. non-blocking) I/O algorithm reads the “.cgel_anatella” files and the “.cs_anatella” files.

See section 5.2.6.2. for more information about asynchronous (and synchronous) I/O algorithms.

Asynchronous I/O algorithms allow very fast reading speeds.
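The general idea behind non-blocking reads can be sketched as follows. This is not Anatella's implementation: it is a minimal read-ahead loop using a Python background thread, and the function name, the block size and the queue depth are assumptions made for the illustration.

```python
import io
import queue
import threading

def read_blocks_async(stream, block_size=4096, depth=4):
    """Yield blocks from `stream` while a background thread reads ahead."""
    blocks = queue.Queue(maxsize=depth)  # bounded: caps the read-ahead memory

    def reader():
        while True:
            block = stream.read(block_size)
            blocks.put(block)  # an empty block signals the end of the stream
            if not block:
                return

    threading.Thread(target=reader, daemon=True).start()
    while True:
        block = blocks.get()
        if not block:
            return
        yield block  # the caller processes this block while reading continues

# Usage on an in-memory stream standing in for a data file.
data = bytes(range(256)) * 100
received = b"".join(read_blocks_async(io.BytesIO(data), block_size=1000))
```

While the caller works on one block, the background thread is already fetching the next ones, so computation and I/O overlap instead of waiting for each other.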
 

It’s possible to read “incomplete” columnar gel files: See section 5.2.6.1. for more information about this subject.
 

It’s possible to read “corrupted” columnar gel files: See section 5.2.6.3. for more information about this subject.

 

 
As you can see, the “.cgel_anatella” Columnar Gel files seem to improve on the simpler “.gel_anatella” files in all aspects. However, the “.gel_anatella” files still have the “upper hand” in the following situations:
 

When the number of columns is large (>300), the RAM consumption required to read and write columnar gel files might be prohibitive. This means that, for most predictive datamining tasks (which require a large number of columns), you’ll still use the simpler, row-based “.gel_anatella” files. Classical Business-Intelligence tasks usually use only a small number of columns (out of many), and thus the “.cgel_anatella” Columnar Gel files are usually better for them.

When you need to read many data tables simultaneously (i.e. when the number of simultaneously opened data files is above 40: for example, when using the mergeSortInput Action), it’s better to use the simple “.gel_anatella” files (rather than the “.cgel_anatella” columnar gel files) because the simple “.gel_anatella” files require a lot less RAM to operate.
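The two rules of thumb above can be condensed into a small helper. This is only a sketch of the guidance given in this section, not an Anatella feature: the function name is hypothetical, and the thresholds (300 columns, 40 simultaneously opened files) are the indicative values quoted above.

```python
# Indicative thresholds quoted in this section.
MAX_COLUMNS_FOR_COLUMNAR = 300
MAX_OPEN_FILES_FOR_COLUMNAR = 40

def suggest_file_format(n_columns, n_open_files=1):
    """Suggest a file extension based on the rules of thumb in this section."""
    if n_columns > MAX_COLUMNS_FOR_COLUMNAR:
        return ".gel_anatella"   # very wide table: columnar RAM cost too high
    if n_open_files > MAX_OPEN_FILES_FOR_COLUMNAR:
        return ".gel_anatella"   # many open files: row-based needs less RAM
    return ".cgel_anatella"      # otherwise columnar I/O savings usually win

wide_table = suggest_file_format(n_columns=500)
bi_report = suggest_file_format(n_columns=25)
big_merge = suggest_file_format(n_columns=25, n_open_files=1000)
```

The result is a starting point only: the actual RAM consumption depends on your data, so the thresholds may need adjusting.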

 

A complete explanation of the proper usage of all the parameters of the ColumnarGelFile Reader is given in section 5.26.3. about the ColumnarGelWriter Action.