1. Introduction


 

Welcome to the Anatella Quick User’s Guide. Anatella is part of the “TIMi Suite™”, an integrated set of tools designed to solve Advanced Analytics, Big Data and Business Intelligence problems. “TIMi” stands for “The Intelligent Mining Machine”.

 

Anatella is a data manipulation tool, also known as an “ETL tool” (“ETL” stands for “Extract, Transform & Load”), designed for analytical and “predictive datamining” tasks. The name “Anatella” is a contraction of “Analytical ETL”: Anatella is the first ETL tool of its kind, the only one designed specifically for analytical tasks on large data volumes. It offers features that are unique and extremely valuable in this field (e.g. meta-data-free data transformations, the ability to handle tables with more than 50000 columns, etc.).

 

Before any data mining activity, before building any predictive model, the data miner’s first task is to put all the available data into the proper format (when using TIMi or Stardust, this usually means obtaining one single, very large dataset). Once you have that dataset, you can analyze it with TIMi and/or Stardust.

 

1.1. General Data Manipulation Features

 

 
Anatella offers you the following capabilities:
 

 

Anatella is 100% Unicode compliant and will accept any character set (Chinese, Cyrillic, Japanese, etc.) without losing any information.
 

 

Classical ETL features:
 

Joining tables,
 

Filtering columns & rows out of tables,
 

Sorting, 
 

Format conversions (CSV, SQL, ...),
 

Automatic Scoring (using the TIMi predictive models)
 

Automatic segmentation (using the Stardust segmentation models)
 

Derivation of new columns for predictive modeling: 
 

Automatic generation of hundreds of thousands of new, “derived” columns;
 

A full scripting language based on JavaScript (standard ECMA-262) that allows you to express the most complex transformations, validations, aggregations and derivations. The small Anatella-specific extensions to the “standard” JavaScript language are easy to use. Furthermore, thanks to these extensions, the JavaScript code becomes similar to (but more versatile than) a “SAS datastep”, so that you can even leverage your SAS skills.
 

Complete Meta-Data extraction & management.
 

 

Inside Anatella, most of the data transformation operators are "meta-data free". This means that you do not need to define any meta-data to use 99% of the transformations available in Anatella.

 
In this regard, Anatella is very much like MS-Excel: inside Excel, you don’t need to specify “by hand” the data type of each of your columns or cells. Excel automatically finds the right data types, so that the formulas inside your sheets work without you spending time defining any meta-data. The same principle applies to Anatella.

 

The "meta-data free" functionality of Anatella makes it completely different from all the other ETL tools currently on the market. It also greatly simplifies the usage of Anatella: using Anatella is only slightly more complex than using Excel. This means that business users without technical training are usually able to use Anatella without too many headaches.

 

The "meta-data free" functionality is also important because, in predictive datamining, it is very common to manipulate tables with tens of thousands of columns, and it is impossible to specify "by hand" the meta-data of all these columns (as required by nearly all other ETL software).

 
 

Anatella features two highly-optimized data file formats that have the extensions “.gel_anatella” (row-based file format) and “.cgel_anatella” (column-based file format). These two file formats are primarily optimized for speed.

 

One “.gel_anatella” file (or one “.cgel_anatella” file) contains one data table (and all its meta-data).

 

Usually, you can read a “.gel_anatella” file at a throughput of 70 MB/sec (on common hardware). The data inside a simple “.gel_anatella” file is compressed by a factor of around 4. This means that reading a table out of a “.gel_anatella” file at a speed of 70 MB/sec (compressed) is actually equivalent to reading the same table at a speed of 280 MB/sec (uncompressed).
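The arithmetic behind this equivalence can be written as a one-line function. This is only an illustrative sketch in plain JavaScript: the function name is hypothetical and is not part of the Anatella scripting extensions.

```javascript
// Effective (uncompressed) throughput = compressed read speed x compression factor.
// "effectiveThroughput" is a hypothetical helper, used for illustration only.
function effectiveThroughput(compressedMBPerSec, compressionFactor) {
  return compressedMBPerSec * compressionFactor;
}

// Figures quoted in the text: 70 MB/sec read speed, compression factor of about 4.
console.log(effectiveThroughput(70, 4) + " MB/sec (uncompressed)"); // 280 MB/sec (uncompressed)
```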

 

A very common situation in business intelligence is computing aggregates based on a subset of the columns available in the data file. Aggregates are typically computed using only about 4 columns out of the many columns stored in the data file. The columnar file format “.cgel_anatella” lets you extract from the hard drive just the bytes that make up the 4 columns required to compute the aggregation, and avoid reading and decompressing the bytes that belong to all the other columns. In practice, this means that, for most business-intelligence tasks, “.cgel_anatella” files are usually five to a hundred times faster than the simpler “.gel_anatella” files, at the cost of slightly higher RAM consumption.
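To see where the speed-up comes from, the sketch below compares how much data must be read from disk with a row-based layout and with a column-based layout. It is a simplified model in plain JavaScript, not the actual file-format code: the table (100 columns of 10 MB each), the column names and the function name are all assumptions made for illustration, and it assumes that read time is proportional to the number of bytes read.

```javascript
// Hypothetical table: 100 columns, each taking 10 MB on disk once compressed.
const columnSizesMB = {};
for (let i = 0; i < 100; i++) columnSizesMB["col" + i] = 10;

// Amount of data (in MB) that must be read to access the needed columns.
function megabytesRead(columnSizes, neededColumns, layout) {
  let total = 0;
  if (layout === "row") {
    // Row-based layout: every row holds every column, so everything is read.
    for (const size of Object.values(columnSizes)) total += size;
  } else {
    // Column-based layout: only the needed columns are read.
    for (const name of neededColumns) total += columnSizes[name];
  }
  return total;
}

const needed = ["col0", "col1", "col2", "col3"]; // the 4 columns used by the aggregate
const rowMB = megabytesRead(columnSizesMB, needed, "row");
const columnMB = megabytesRead(columnSizesMB, needed, "column");

console.log(rowMB);            // 1000
console.log(columnMB);         // 40
console.log(rowMB / columnMB); // 25
```

In this toy case the columnar layout reads 25 times less data, which falls within the range quoted above. The real gain depends on how many columns the table has and on how well each column compresses.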

 

This performance places Anatella ahead of most competing offerings (as evidenced by its score on the TPC-H benchmark).

 
 

Anatella can read compressed dataset files in text format (CSV) in "native mode". The supported compression formats are RAR (unique!), ZIP, GZ, Z and LZO.

 

This functionality reduces the need for hard-drive space on your server.

 

Let’s give an example: the open-source “census-income” database stored in a .RAR compressed text file weighs 4.04 MB. The same database in an uncompressed text file weighs 96 MB (let’s say around 100 MB). The same database in a classical SAS .sas7bdat dataset file weighs around 250 MB. These numbers are quite common and perfectly illustrate why an ETL tool should be able to process compressed data streams (even more so when dealing with “Big Data”).
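The size ratios implied by these three figures can be checked with a few lines of plain JavaScript. The function name is hypothetical; only the file sizes come from the example above.

```javascript
// File sizes (in MB) quoted in the "census-income" example.
const rarSizeMB = 4.04;
const textSizeMB = 96;
const sasSizeMB = 250;

// How many times larger a file is than the RAR-compressed text file.
function sizeRatio(sizeMB, referenceMB) {
  return sizeMB / referenceMB;
}

console.log(sizeRatio(textSizeMB, rarSizeMB).toFixed(1)); // 23.8
console.log(sizeRatio(sasSizeMB, rarSizeMB).toFixed(1));  // 61.9
```

In other words, in this example the uncompressed text file needs about 24 times more disk space than the RAR file, and the SAS dataset about 62 times more.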

 
 

Anatella natively reads SAS dataset files (.sas7bdat files), SPSS dataset files (.sav and .por files) and Stata dataset files (.dta files).

 
 

Anatella is heavily multi-threaded. This means that a single data transformation running in Anatella can exploit all the CPUs inside your server to decrease the computation time. Classical ETL tools are only able to run different data transformations on different CPUs, but not one particular data transformation on many CPUs. The multi-threading capabilities of Anatella usually divide the computing time of a data transformation graph by a factor between 4 and 10.

 
 

Anatella offers you direct access to all the “classical” relational databases via ODBC & OLEDB connectors (Oracle, SQL Server, MySQL, Teradata, ...).

 
 

Anatella provides some basic OLAP reporting functionalities through the “Microsoft Office Data Injection operator”. This operator allows you to automatically inject, “in batch”, data extracted from the Anatella graph into any chart contained in any Microsoft Office document. For example, you can obtain, in a few mouse clicks, an automatic daily update of all the charts of your preferred PowerPoint presentation. Anatella can generate all types of MS Office charts: pie chart, 3D surface chart, bar chart, doughnut chart, bubble chart, etc.

 

 

1.2. Development Functionalities

 

 
The following capabilities are useful for development:
 

The Anatella integrated development environment (IDE), which is used to create new data-manipulation scripts, is extremely simple, intuitive & versatile. This environment is based on a unique hybrid technology:
 

Simple transformations are described using "little boxes" (the most intuitive way to represent a data transformation, and a de facto standard for all modern ETL tools).
 

Complex transformations are programmed using a scripting language based on JavaScript (standard ECMA-262), which is simple, complete and extremely versatile. JavaScript is one of the most widely used programming languages in the industry (see appendices C, E and F). You can leverage your existing JavaScript skills to become an ETL expert instantaneously!

 

Anatella is the only ETL tool on the market that gives you direct access to a complete & powerful “debugger”, with an interface similar to the famous MS Visual Studio debugger, to debug the scripts written in JavaScript/ECMA-262: you can add "break points" to your code, add "watches" on variables, see the "stack", etc. This feature adds a lot of flexibility and control to the ETL process.

 

 
One major advantage of Anatella over any other ETL is that Anatella is easily extensible: you can easily add new, customized data-transformation-operators. These new operators can be developed in:
 
 

JavaScript, R or Python.

Anatella contains an automatic versioning tool that allows you to manage the different JavaScript/R/Python codes that you have developed or downloaded. Anatella also offers you direct access to a JavaScript “debugger” to easily debug all your transformations.
 

 

C/C++, for the most extreme speed & performances.

A Software Development Kit (SDK) to create new Actions coded in C++ inside Anatella is available for TIMi partners and clients. The SDK also allows you to extend the Anatella JavaScript engine by adding new JavaScript functions.

 

This SDK was used to create all the actions inside Anatella: it offers the ultimate flexibility and speed. 

 

Deploying your new C++ Actions is easy: you only need to copy your “AnatellaPlugin*.dll” file next to the “anatella.exe” file, and Anatella will automatically load your extensions at each startup.

 

1.3. Predictive Modeling Functionalities

 

 
The following capabilities are useful for predictive modeling:
 

 

Anatella provides advanced text-mining capabilities. Using the Anatella text-mining operators, you can:
 

Automatically correct spelling mistakes (in your “address” fields, for example…)
 

Translate text from one language to another
 

Do “fuzzy matching” (based on multilingual sound encoding), for example to join 2 tables
 

Classify texts (in combination with TIMi). These operators apply the classical “bag-of-words” technique to produce, starting from raw, unstructured text data, many new columns and new variables that are directly exploitable inside TIMi or Stardust for predictive analytics. You can easily enrich your datasets with unstructured data to obtain the highest predictive modeling accuracy. The Anatella predictive-text-mining functionalities are unique.
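As an illustration of the “bag-of-words” idea mentioned above, the following minimal sketch turns a few raw texts into one count column per word. It is written in plain JavaScript and does not use or reproduce the Anatella text-mining operators: the function names, the tokenization rule and the sample texts are assumptions made for this example.

```javascript
// Split a text into lowercase words (letters and digits from any alphabet).
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Build one row of word counts per text: each distinct word becomes a column.
function bagOfWords(texts) {
  const docs = texts.map(tokenize);
  const vocabulary = [...new Set(docs.flat())].sort();
  const rows = docs.map((tokens) => {
    const counts = Object.fromEntries(vocabulary.map((word) => [word, 0]));
    for (const token of tokens) counts[token] += 1;
    return counts;
  });
  return { vocabulary, rows };
}

// Hypothetical customer comments.
const result = bagOfWords(["Slow delivery, slow refund", "Fast delivery"]);

console.log(result.vocabulary); // [ 'delivery', 'fast', 'refund', 'slow' ]
console.log(result.rows[0]);    // { delivery: 1, fast: 0, refund: 1, slow: 2 }
console.log(result.rows[1]);    // { delivery: 1, fast: 1, refund: 0, slow: 0 }
```

Each new column (“delivery”, “fast”, ...) is an ordinary numeric variable, which is what makes the unstructured text usable by a predictive model.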

Graph mining or Social Network Analysis (SNA): This set of operators is mainly useful for telecommunication companies and banks, to create churn predictive models, cross-selling predictive models, up-selling predictive models, to estimate the share-of-wallet, etc. The objective of these operators is to extract valuable social-metrics out of the “phone-communication-network”. Typically, the “phone-communication-network” is defined in this way:
 

each individual is a node.
 

an “arc” of the network between two individuals A and B represents the relation “A called B”.

 
The social-metrics that can be extracted from the network include: the best-connected individual, the individual who plays the most important role in a group, the groups of friends, the proximity to a churner, and the number of churners in the “neighborhood” of an individual. These metrics can improve the accuracy of your predictive models.
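Two of these metrics are simple enough to sketch in a few lines: the number of contacts of each individual (a basic measure of how well connected they are) and the number of churners among those contacts. The code below is plain JavaScript and only an illustration, not the Anatella SNA operators. The call list, the churner list and the function names are hypothetical, and arcs are treated as undirected (“A called B” makes A and B neighbors of each other).

```javascript
// Build, for each individual, the set of people they exchanged a call with.
function buildNeighbors(calls) {
  const neighbors = new Map();
  const link = (a, b) => {
    if (!neighbors.has(a)) neighbors.set(a, new Set());
    neighbors.get(a).add(b);
  };
  for (const [caller, callee] of calls) {
    link(caller, callee);
    link(callee, caller);
  }
  return neighbors;
}

// For each individual: number of contacts and number of churners among them.
function socialMetrics(calls, churners) {
  const churned = new Set(churners);
  const metrics = {};
  for (const [person, contacts] of buildNeighbors(calls)) {
    let churnersNearby = 0;
    for (const contact of contacts) {
      if (churned.has(contact)) churnersNearby += 1;
    }
    metrics[person] = { degree: contacts.size, churnersNearby };
  }
  return metrics;
}

// Hypothetical network: each pair means "first called second"; D has churned.
const calls = [["A", "B"], ["A", "C"], ["B", "C"], ["C", "D"]];
const metrics = socialMetrics(calls, ["D"]);

console.log(metrics.C); // { degree: 3, churnersNearby: 1 }
console.log(metrics.A); // { degree: 2, churnersNearby: 0 }
```

Here C is the best-connected individual and the only one with a churner in their neighborhood. Both values can be added as new columns to a modeling dataset.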

 
 

Operational Research (OR) optimization toolbox: In particular, Anatella integrates:
 

An efficient multi-threaded LP/IP solver that allows you to solve large-scale optimization problems. The LP solver handles millions of constraints and several thousands of variables.
 

An efficient solver for the GAP (Generalized Assignment Problem). A typical GAP problem is: which product should I offer to which customer, given the following business constraints: the stock of each product is limited; each selected customer receives a folder with exactly N offers (no fewer and no more than N); the margin on each product is different; etc. The GAP solver included in Anatella handles campaigns with several million customers and thousands of products.

 
The Anatella Optimization plugin is typically used for operations research (OR), sales & profit optimization, stock optimization, etc.
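To make the assignment problem described above more concrete, here is a small greedy heuristic in plain JavaScript. It is not the algorithm used by the Anatella GAP solver and it does not guarantee an optimal solution; it only illustrates the constraints (limited stock, exactly N offers per selected customer). The function name, the expected-profit table and the stock levels are hypothetical.

```javascript
// Greedy sketch: consider (customer, product) pairs from most to least profitable
// and accept a pair if the product is still in stock and the customer has room.
function greedyAssign(profit, stock, offersPerCustomer) {
  const remaining = { ...stock };
  const offers = {};
  const pairs = [];
  for (const customer of Object.keys(profit)) {
    offers[customer] = [];
    for (const product of Object.keys(profit[customer])) {
      pairs.push({ customer, product, value: profit[customer][product] });
    }
  }
  pairs.sort((a, b) => b.value - a.value);

  for (const { customer, product } of pairs) {
    if (offers[customer].length < offersPerCustomer && remaining[product] > 0) {
      offers[customer].push(product);
      remaining[product] -= 1;
    }
  }

  // Keep only customers who received exactly N offers.
  // Limitation: stock taken by dropped customers is not given back.
  const selected = {};
  for (const [customer, list] of Object.entries(offers)) {
    if (list.length === offersPerCustomer) selected[customer] = list;
  }
  return selected;
}

// Hypothetical expected profit of offering each product to each customer.
const profit = {
  c1: { P1: 50, P2: 10, P3: 12 },
  c2: { P1: 40, P2: 30, P3: 3 },
};
const stock = { P1: 1, P2: 2, P3: 2 }; // units available per product

const assignment = greedyAssign(profit, stock, 2);
console.log(assignment); // { c1: [ 'P1', 'P3' ], c2: [ 'P2', 'P3' ] }
```

Because only one unit of P1 is in stock, it goes to c1 (where it is most profitable), and c2 receives its next-best products. A real campaign with millions of customers needs a dedicated solver, since a greedy pass like this one can get stuck in poor solutions.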
 

 

Modeling Factory for large-scale Predictive Datamining projects. Some of our customers need to rebuild from scratch several thousands of predictive models every day or week. This can easily be accomplished using TIMi (as the analytical engine) and using Anatella to supervise & manage the whole procedure in a 100% automated way. The multithreading capabilities of Anatella allow you to exploit all the CPUs in your server to deliver very high computing power.