5.5.2.3. Sorting big data


When sorting big data, you need to set the “Memory Buffer Size” parameter of the Sort Action properly. This parameter defines:
 

the size of the (uncompressed) tapes. The tape files on the hard drive are typically much smaller because they are stored in compressed format.
 

the size of the memory buffer that is used to create each tape. This means that a value of “1000 MB” consumes 1000 MB of RAM.

 

 
If you increase this parameter:
 

you’ll consume more RAM.
 

you’ll have a smaller number of tape files.
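
The relationship between the buffer size and the number of tapes can be illustrated with a minimal Python sketch. It is not Anatella's implementation: it assumes a generic external sort in which each buffer-full of rows is sorted in memory to form one tape, and the tapes are then merged into one sorted result. The names `make_tapes`, `merge_tapes` and `buffer_rows` are hypothetical, and the buffer is counted in rows rather than in MB to keep the example small.

```python
import heapq

def make_tapes(rows, buffer_rows):
    """Split rows into sorted runs ("tapes") of at most buffer_rows rows each.

    Assumption: one tape is produced per buffer-full of input rows.
    """
    tapes = []
    for start in range(0, len(rows), buffer_rows):
        tapes.append(sorted(rows[start:start + buffer_rows]))
    return tapes

def merge_tapes(tapes):
    """Merge all sorted tapes into one sorted list (k-way merge)."""
    return list(heapq.merge(*tapes))

rows = [27, 3, 14, 8, 19, 1, 22, 5, 11, 30, 7, 16]

small = make_tapes(rows, buffer_rows=3)  # small buffer: 4 tapes
large = make_tapes(rows, buffer_rows=6)  # larger buffer: 2 tapes

print(len(small), len(large))
print(merge_tapes(small) == merge_tapes(large) == sorted(rows))
```

Doubling the buffer halves the number of tapes that the merge step has to read from at the same time, while the sorted result stays the same.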

 

 
What happens when you want to sort a 4TB table with the default value of “Memory Buffer Size”=100MB? To compute the sort:
 

1.Anatella creates 4TB/100MB=40,000 tape files (as a reminder: 4TB = 4,000GB = 4,000,000MB).
 

2.Anatella uses the “Merge Sort” algorithm to merge all 40,000 tapes into one sorted table.
 

This means that Anatella needs to read from 40,000 files simultaneously. This will not work, for two reasons:
 

a.In Windows 7, the maximum number of simultaneously open files is around 20,000 (and it’s even lower on older Windows versions). Furthermore, reading from many different files simultaneously strongly degrades performance (because the hard drive heads must constantly physically “jump” from one file to another on the surface of the disk).
 

b.Opening one .gel_anatella file consumes about 10MB of RAM. Thus, to open the 40,000 tape files, Anatella needs 40,000x10MB=400GB of RAM. This amount of RAM is typically not available on standard systems, and Anatella refuses to sort the data.

 

 
To sort a 4TB table, you should set “Memory Buffer Size”=10GB (i.e. you need to have 10GB of RAM). This will create 4TB/10GB=400 tape files. Thereafter, to “Merge Sort” these 400 tape files, Anatella will only need 400x10MB=4GB of RAM (which is ok).
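
The arithmetic for both settings can be checked with a short Python sketch. It uses only the figures quoted in this section (decimal units, and about 10MB of RAM per open tape file). The function `sort_plan` and the constant names are hypothetical, and rounding the tape count up is an assumption for tables that do not divide evenly.

```python
import math

MB_PER_GB = 1_000
MB_PER_TB = 1_000_000
RAM_PER_OPEN_TAPE_MB = 10  # approximate figure quoted in this section

def sort_plan(table_mb, buffer_mb):
    """Return (number of tapes, RAM in MB needed to open all tapes for the merge)."""
    tapes = math.ceil(table_mb / buffer_mb)  # assumption: one tape per buffer-full
    merge_ram_mb = tapes * RAM_PER_OPEN_TAPE_MB
    return tapes, merge_ram_mb

table_mb = 4 * MB_PER_TB
for buffer_mb in (100, 10 * MB_PER_GB):
    tapes, merge_ram_mb = sort_plan(table_mb, buffer_mb)
    print(f"buffer={buffer_mb} MB -> {tapes} tapes, "
          f"about {merge_ram_mb / MB_PER_GB:g} GB of RAM to merge")
```

With a 100MB buffer this gives 40,000 tapes and about 400GB for the merge; with a 10GB buffer it gives 400 tapes and about 4GB.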

 

 
To summarize:
 

To obtain the best sorting speed, you should minimize the number of tapes (i.e. you should make the “Memory Buffer Size” parameter as large as possible).
 

To sort large tables, you must set the “Memory Buffer Size” parameter properly.

You need a large amount of RAM when sorting large tables. In general, the more RAM you give to the Sort Action, the faster the sort will be.

 
If you set the “Memory Buffer Size” parameter to 10GB (as in the example above), the RAM consumption of the Sort Action will also be around 10GB. This is not negligible, and it might lead to serious difficulties if you are using the multithreading capabilities of Anatella. For more information on this subject, see section 5.3.2.7.