4. The TIMi Modeler Modeling Process
As mentioned earlier, the “Census-Income” dataset originates from the U.S. Census Bureau (it is an open dataset, freely available online). Each line represents a person. The target column is a binary variable that is true (1) when a person’s income level is above $50K and false (0) otherwise. The objective of our modeling exercise is to identify which US residents have “Target=TRUE”, using all the available information.
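The target definition above can be sketched in a few lines. This is a hypothetical illustration only: the column names (“income”, “age”, “education”) and values are invented for the example and are not the actual schema of the Census-Income dataset.

```python
# Hypothetical rows; the field names are assumptions for illustration,
# not the real Census-Income columns.
people = [
    {"age": 39, "education": "Bachelors", "income": 62000},
    {"age": 25, "education": "HS-grad",   "income": 31000},
    {"age": 51, "education": "Masters",   "income": 87000},
]

# The binary target is 1 when income exceeds $50K, 0 otherwise.
for p in people:
    p["target"] = 1 if p["income"] > 50_000 else 0

print([p["target"] for p in people])  # → [1, 0, 1]
```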
With TIMi modeler, the process to create a new predictive model consists of three steps:
1-Step 1: Where is my dataset?
We need to set the location of the dataset to analyze. Typically, our dataset will be inside the “Central Dataset Repository” directory. This dataset can be a text file, a .gel_anatella file, the result of a query on a SQL database, or any file format we mentioned above. The information about the location of your dataset is stored inside a “.DSourceXML” file.
At the end of this first step, TIMi Modeler attempts to guess the type of each column/variable in the dataset, based on a set of heuristics. From these guesses, it produces a “.TypeXML” file (see the next step to learn more about this file).
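To give an intuition of what such type-guessing heuristics look like, here is a deliberately simplified sketch. It is not TIMi Modeler’s actual algorithm; the three categories returned mirror the “binary”, “value” and “nominal” types described in the next step.

```python
def guess_column_type(values):
    """Very rough sketch of a type-guessing heuristic (NOT TIMi's actual one)."""
    distinct = set(values)
    # Columns containing only 0/1 flags are treated as binary.
    if distinct <= {0, 1, "0", "1", True, False}:
        return "binary"
    try:
        [float(v) for v in values]
        return "value"    # every entry parses as a number
    except (TypeError, ValueError):
        return "nominal"  # anything else is treated as a category

print(guess_column_type([0, 1, 1, 0]))        # → binary
print(guess_column_type([39, 25, 51]))        # → value
print(guess_column_type(["US", "FR", "US"]))  # → nominal
```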
2-Step 2: What are the types of the columns inside the dataset?
We now need to set the type of the columns. There are basically five types of columns:
a. Value type. Examples are: Age, Size, Cost, Price, …
b. Nominal type. Examples are: CarLabel, Region, Sex, …
c. Binary type. Examples are: isMale, isForeigner, …
d. Target type. What is the “Target Column”?
e. Key type. What is the “Primary Key” column?
The information about the type of the columns is stored inside a “.TypeXML” file. This file is the end result of the previous step (step 1). You (normally) have to carefully check it and make the necessary changes if some column types were not guessed properly by Modeler during step 1.
At the end of this step, TIMi Modeler generates several reports about the data quality of the dataset, some statistics about the content of every column, and some charts. It also generates a “.CfgXML” file (see the next step to know more about this file).
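The per-column statistics in such a report typically include things like the number of missing values, the number of distinct values, and the range of numeric columns. The toy function below is a sketch of that idea, not TIMi Modeler’s actual profiling logic.

```python
def column_stats(values):
    """Toy version of the per-column statistics a profiling report might contain.
    This is an illustrative sketch, not TIMi Modeler's actual computation."""
    present = [v for v in values if v is not None]
    stats = {
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
    }
    # Only numeric columns get a min/max range.
    if present and all(isinstance(v, (int, float)) for v in present):
        stats["min"] = min(present)
        stats["max"] = max(present)
    return stats

print(column_stats([39, 25, None, 51, 25]))
# → {'n_missing': 1, 'n_distinct': 3, 'min': 25, 'max': 51}
```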
3-Step 3: Who are my targets?
We now need to select which lines and columns of the dataset to analyze. Most of the time, no sub-selection is needed, since you will analyze all the lines and all the columns of your dataset; in that case you don’t have to provide anything here (you can leave the default values “as is”). Your selection is saved inside a “.CfgXML” file.
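For readers who want a concrete picture of what a line-and-column sub-selection means, here is a small sketch. The row data and column names are invented for the example; TIMi Modeler itself handles this through the “.CfgXML” file, not through code.

```python
# Hypothetical dataset rows; field names are assumptions for illustration.
rows = [
    {"age": 39, "sex": "M", "region": "West",  "target": 1},
    {"age": 25, "sex": "F", "region": "South", "target": 0},
    {"age": 51, "sex": "F", "region": "West",  "target": 1},
]

# Sub-selection: keep only the lines from one region,
# and keep only some of the columns for the analysis.
keep_columns = {"age", "sex", "target"}
selection = [
    {k: v for k, v in r.items() if k in keep_columns}
    for r in rows
    if r["region"] == "West"
]
print(selection)
# → [{'age': 39, 'sex': 'M', 'target': 1}, {'age': 51, 'sex': 'F', 'target': 1}]
```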
At the end of this step, Modeler generates an “analyst” report that explains how we can identify the “Targets” (in our example: how to recognize somebody that has an income level above $50K). This “analyst” report contains information about the exact profile of a US resident with an income level above $50K.
TIMi Modeler also generates a “predictive model”. This model uses the information contained inside the columns of your dataset to guess whether a person is “inside the target” or not. Typically, Modeler constructs a “predictive model” that uses no more than 15 to 25 columns of the dataset. Why such a small number of columns? Because, usually, the columns ignored by TIMi Modeler simply do not contain any additional relevant information compared to the columns already used by the model.
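The idea of dropping columns that carry no additional information can be illustrated with a deliberately simple greedy sketch: a column that exactly duplicates an already-kept column is skipped. Real redundancy detection is statistical and far more subtle than this; exact duplication is just the easiest case to show.

```python
def drop_redundant(columns):
    """Greedy sketch: skip any column whose values exactly duplicate an
    already-kept column. Illustrative only -- NOT TIMi Modeler's algorithm."""
    kept = {}
    for name, values in columns.items():
        if not any(values == v for v in kept.values()):
            kept[name] = values
    return list(kept)

# Hypothetical columns; "age_copy" carries no information beyond "age".
cols = {
    "age":       [39, 25, 51],
    "age_copy":  [39, 25, 51],
    "education": ["BA", "HS", "MS"],
}
print(drop_redundant(cols))  # → ['age', 'education']
```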
At this stage, the model-creation process is complete. However, several additional steps must be taken to fully deploy the model and put it into production. These steps are usually completed using Anatella. A TIMi predictive model can be:
•…used directly inside Anatella (this is the easiest for the deployment of your models and also the recommended way to put your models in production).
•…used directly on the command-line (in a batch script)
•…exported:
o…to “SAS base” code
o…to simple “SQL” code that runs in almost any database engine
o…to VBA code (to run your models inside an Excel sheet)
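To make the idea of exporting a model to SQL more concrete, here is a sketch that renders a tiny linear scoring model as a portable SQL expression. The coefficients, column names, and table name are all invented for the example; TIMi’s actual export produces different, production-grade code.

```python
# Hypothetical linear scoring model; coefficients and names are invented.
coefficients = {"age": 0.02, "hours_per_week": 0.01}
intercept = -1.5

# Render the model as a SQL expression that any database engine can run.
terms = " + ".join(f"{w} * {col}" for col, w in coefficients.items())
sql = f"SELECT *, {intercept} + {terms} AS score FROM census_income;"
print(sql)
# → SELECT *, -1.5 + 0.02 * age + 0.01 * hours_per_week AS score FROM census_income;
```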