1. Introduction

 

Welcome to the Quick User’s Guide to TIMi Modeler.

 

This document guides you through the process of creating and interpreting a predictive model with TIMi Modeler. It is designed for beginners: although the vocabulary may sometimes be a little technical, you don’t need to be a specialist with 10 years of training in intricate mathematical abstractions to understand it.

 

Discovering new insights about your customers should be fun and easy, and the TIMi Suite has been designed with this goal in mind. The TIMi Suite lets you easily explore terabytes of data to extract useful knowledge. A whole new world is hidden inside your databases, waiting to be discovered, and you can now easily explore it with TIMi Modeler.

 

The “TIMi Suite” consists of four tools: Anatella (TIMi’s ETL), Stardust (TIMi’s Data Visualization and Segmentation tool), Kibella (unlimited BI) and TIMi Modeler, the fastest automated predictive modeling tool currently available. This document focuses only on TIMi Modeler; please refer to the appropriate tutorial for the other tools.

 

Throughout this document, we will analyze together a dataset named “census-income” (in statistics, data tables are usually called “datasets”). This dataset contains information about the financial characteristics of residents of the United States of America. Here is an extract of this dataset:

 

| key | Is taxable income amount above $50K ? | age | education | marital stat | race | sex | country of birth | weeks worked in year |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 73 | High school graduate | Widowed | White | F | USA | 0 |
| 2 | 0 | 58 | Some college but no degree | Divorced | White | M | USA | 52 |
| 3 | 0 | 18 | 10th grade | Never married | Asian | F | Vietnam | 0 |
| 4 | 0 | 9 | Children | Never married | White | F | USA | 0 |
| 5 | 0 | 10 | Children | Never married | White | F | USA | 0 |
| 6 | 0 | 48 | Some college but no degree | Married-civilian | Indian | F | USA | 52 |
| 7 | 0 | 42 | Bachelors degree(BA AB BS) | Married-civilian | White | M | USA | 52 |
| 8 | 1 | 28 | High school graduate | Never married | White | F | USA | 30 |
| 9 | 0 | 47 | Some college but no degree | Married-civilian | White | F | USA | 52 |
| 10 | 0 | 34 | Some college but no degree | Married-civilian | White | M | USA | 52 |
| 11 | 0 | 8 | Children | Never married | White | F | USA | 0 |
| 13 | 0 | 51 | Some college but no degree | Married-civilian | White | M | USA | 52 |
| 14 | 1 | 46 | High school graduate | Divorced | White | F | Columbia | 52 |
| 15 | 0 | 26 | Bachelors degree(BA AB BS) | Never married | White | F | USA | 52 |
| 16 | 0 | 13 | Children | Never married | Black | F | USA | 0 |
| 17 | 0 | 47 | Bachelors degree(BA AB BS) | Never married | White | F | USA | 52 |
| 18 | 0 | 39 | 10th grade | Married-civilian | White | F | Mexico | 0 |
| 19 | 0 | 16 | 10th grade | Never married | White | F | USA | 0 |
| 20 | 0 | 35 | High school graduate | Married-civilian | White | M | USA | 49 |

 
Table 1: Data Structure

 
 

During this tutorial, we will explore the relationship between the column “Is taxable income amount above $50K ?” and all the other columns of the dataset (age, education level, race, …). The column “Is taxable income amount above $50K ?” is the “column to explain” in our dataset. In technical terms, the “column to explain” is called the “Target Column”.

 

As is often the case in machine learning problems, only a small percentage of the population belongs to the target group. Each individual (each record, or each observation) within this group is called “a target”. In terms of data, the “Targets” (i.e. all the people with a taxable income amount above $50K) are identified by a value of ‘1’ in the “Target Column”.
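
To make the notion of “Targets” concrete, the short Python sketch below counts them in the rows shown in Table 1. It is only an illustration and is not part of TIMi Modeler: the names `sample` and `target_summary` are hypothetical, and the (key, target) pairs are simply copied by hand from the table.

```python
# (key, target) pairs copied from Table 1. The target is 1 when the
# taxable income amount is above $50K, and 0 otherwise.
sample = [
    (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0), (8, 1),
    (9, 0), (10, 0), (11, 0), (13, 0), (14, 1), (15, 0), (16, 0),
    (17, 0), (18, 0), (19, 0), (20, 0),
]


def target_summary(rows):
    """Return (number of targets, number of records, share of targets)."""
    total = len(rows)
    targets = sum(1 for _key, target in rows if target == 1)
    share = targets / total if total else 0.0
    return targets, total, share


targets, total, share = target_summary(sample)
print(f"{targets} targets out of {total} records ({share:.1%})")
```

On this small extract, only the records with key 8 and key 14 are targets, which matches the remark that targets are usually a small fraction of the population.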

 

Within the “census-income” dataset, the “Target Column” contains only two different values: 0 or 1. This is called a “Binary Target”. Note that TIMi Modeler can analyze datasets with three types of targets:
 

“Binary Targets”,

“Continuous Targets” and

“Multi-class Targets”.

 

In this document, we will focus primarily on the census-income dataset, which contains a “Binary Target”. However, in section 8 we will extend the notions learned on a “Binary Target” problem to a “Continuous Target” problem. In that section, we will predict the weight of a person using only various body circumference measurements.

 

The “census-income” dataset contains another special column: the “primary key” column, or primary key variable. This “primary key” contains a unique value for each line of the dataset, which allows us to uniquely identify each record (i.e. each row or observation). The concept of “primary key” is well known in the database world: should you require additional information on the topic, we recommend reading any introductory book on data management and databases, or simply asking someone in your IT department. The “primary key column” in our dataset is named “key”, and we recommend (although it’s not mandatory) using this name so that automatic type recognition identifies the column.
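
The uniqueness requirement on a primary key can be checked before loading a dataset. The following Python sketch shows one possible way to do so, under the assumption that the “key” column has already been read into a plain list; the function name `duplicate_keys` is hypothetical and is not a TIMi Modeler feature.

```python
from collections import Counter


def duplicate_keys(keys):
    """Return the sorted key values that appear more than once."""
    counts = Counter(keys)
    return sorted(k for k, n in counts.items() if n > 1)


# Values of the "key" column as they appear in Table 1.
keys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20]

# An empty list means every record is uniquely identified.
print(duplicate_keys(keys))

# Appending a repeated value (8) makes the check report it.
print(duplicate_keys(keys + [8]))
```

A column is only usable as a primary key when the first call returns an empty list.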

 

Modeler can process datasets stored in many formats: Anatella Gel files (.gel_anatella files), relational databases (such as Oracle, Teradata, Microsoft SQL Server, Informix, MySQL, ...), or simple “flat files” (text files). The preferred storage format for Modeler is the .gel_anatella file, which offers good compression and the fastest reading speed (.csv files compressed in RAR are also a good choice).