Appendix A: The compressed-CSV-file format

The compressed-CSV-file format is one of the most efficient ways to store a dataset that will be used inside TIMi Modeler. The abbreviation CSV means “comma-separated values”. TIMi Modeler natively reads datasets compressed in RAR, ZIP, and GZ. The preferred compression format is RAR.

 

An example: The following table
 

key | Is taxable income amount above 50K ? | age | education                  | marital stat     | race   | sex | country of birth | weeks worked in year
----+--------------------------------------+-----+----------------------------+------------------+--------+-----+------------------+---------------------
1   | 0                                    | 73  | High school graduate       | Widowed          | White  | F   | USA              | 0
2   | 0                                    | 58  | Some college but no degree | Divorced         | White  | M   | USA              | 52
3   | 0                                    | 18  | 10th grade                 | Never married    | Asian  | F   | Vietnam          | 0
4   | 0                                    | 9   | Children                   | Never married    | White  | F   | USA              | 0
5   | 0                                    | 10  | Children                   | Never married    | White  | F   | USA              | 0
6   | 0                                    | 48  | Some college but no degree | Married-civilian | Indian | F   | USA              | 52
7   | 0                                    | 42  | Bachelors degree(BA AB BS) | Married-civilian | White  | M   | USA              | 52
8   | 1                                    | 28  | High school graduate       | Never married    | White  | F   | USA              | 30
9   | 0                                    | 47  | Some college but no degree | Married-civilian | White  | F   | USA              | 52
10  | 0                                    | 34  | Some college but no degree | Married-civilian | White  | M   | USA              | 52
11  | 0                                    | 8   | Children                   | Never married    | White  | F   | USA              | 0
13  | 0                                    | 51  | Some college but no degree | Married-civilian | White  | M   | USA              | 52
14  | 1                                    | 46  | High school graduate       | Divorced         | White  | F   | Columbia         | 52
15  | 0                                    | 26  | Bachelors degree(BA AB BS) | Never married    | White  | F   | USA              | 52
16  | 0                                    | 13  | Children                   | Never married    | Black  | F   | USA              | 0
17  | 0                                    | 47  | Bachelors degree(BA AB BS) | Never married    | White  | F   | USA              | 52
18  | 0                                    | 39  | 10th grade                 | Married-civilian | White  | F   | Mexico           | 0
19  | 0                                    | 16  | 10th grade                 | Never married    | White  | F   | USA              | 0
20  | 0                                    | 35  | High school graduate       | Married-civilian | White  | M   | USA              | 49

 

 
… is equivalent to a “.csv” file containing:

 

key#taxable income amount#age#education#marital stat#race#sex#country of birth#weeks worked in year
1#0#73#High school graduate#Widowed#White#F#USA#0
2#0#58#Some college but no degree#Divorced#White#M#USA#52
3#0#18#10th grade#Never married#Asian#F#Vietnam#0
4#0#9#Children#Never married#White#F#USA#0
5#0#10#Children#Never married#White#F#USA#0
6#0#48#Some college but no degree#Married-civilian #Indian#F#USA#52
7#0#42#Bachelors degree(BA AB BS)#Married-civilian #White#M#USA#52
8#1#28#High school graduate#Never married#White#F#USA#30
9#0#47#Some college but no degree#Married-civilian #White#F#USA#52
10#0#34#Some college but no degree#Married-civilian #White#M#USA#52
11#0#8#Children#Never married#White#F#USA#0
13#0#51#Some college but no degree#Married-civilian #White#M#USA#52
14#1#46#High school graduate#Divorced#White#F#Columbia#52
15#0#26#Bachelors degree(BA AB BS)#Never married#White#F#USA#52
16#0#13#Children#Never married#Black#F#USA#0
17#0#47#Bachelors degree(BA AB BS)#Never married#White#F#USA#52
18#0#39#10th grade#Married-civilian #White#F#Mexico#0
19#0#16#10th grade#Never married#White#F#USA#0
20#0#35#High school graduate#Married-civilian #White#M#USA#49

 

 
The first line contains the names of the columns. Within each line, every field is separated from the next one by a separator character. In the example above, the separator character is ‘#’. In a classic “.csv” file, the separator character is the comma (‘,’).
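
As an illustration of the layout just described, the following minimal sketch parses a few lines of the example with Python's standard csv module. It is not TIMi Modeler's own reader; the names SAMPLE and read_rows are hypothetical and only serve this example. The first line is treated as the column names and ‘#’ as the separator character.

```python
import csv
import io

# A header line and two data lines copied from the example above.
# SAMPLE and read_rows are hypothetical names used only for this sketch.
SAMPLE = (
    "key#taxable income amount#age#education#marital stat#race#sex"
    "#country of birth#weeks worked in year\n"
    "1#0#73#High school graduate#Widowed#White#F#USA#0\n"
    "8#1#28#High school graduate#Never married#White#F#USA#30\n"
)


def read_rows(text, separator="#"):
    """Parse separator-delimited text; the first line supplies the column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return list(reader)


rows = read_rows(SAMPLE)
print(rows[0]["education"])             # High school graduate
print(rows[1]["weeks worked in year"])  # 30
```

Passing separator="," to read_rows would read a classic comma-separated file in the same way. Note that every value is returned as text; converting columns such as age to numbers is left to the caller.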

 

After RAR compression, the size of the file “census-income.csv” is reduced to 4.1 MB. This is a size reduction of more than 95% (dataset files are usually very easy to compress). TIMi Modeler is able to work directly with datasets in their compressed form (in other words, you never have to decompress your datasets yourself). The savings in hard drive space are substantial. If your dataset files are stored on a remote network drive, the compression mechanism of TIMi Modeler also substantially reduces the network bandwidth used when manipulating your datasets.
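
To make the idea of working with a compressed dataset more concrete, here is a minimal sketch that writes a small ‘#’-separated dataset in GZ form and reads it back without ever creating an uncompressed file. It uses only Python's standard gzip and csv modules and says nothing about how TIMi Modeler works internally. GZ is used because the standard library supports it (RAR is not covered by it). The function names and the tiny synthetic dataset are assumptions made for this example, so the sizes it reports are not comparable to the census-income figures above.

```python
import csv
import gzip
import io


def write_gz_csv(header, rows, separator="#"):
    """Return the gzip-compressed bytes of a separator-delimited CSV (hypothetical helper)."""
    buffer = io.BytesIO()
    with gzip.open(buffer, "wt", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter=separator, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
    return buffer.getvalue()


def read_gz_csv(data, separator="#"):
    """Read the rows straight from the compressed bytes (hypothetical helper)."""
    with gzip.open(io.BytesIO(data), "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=separator))


# Synthetic, highly repetitive data: an assumption made for the demonstration.
header = ["key", "age", "country of birth"]
rows = [[str(i), "34", "USA"] for i in range(1, 1001)]

# The same content as plain, uncompressed text, for a size comparison.
raw_bytes = ("\n".join("#".join(r) for r in [header] + rows) + "\n").encode("utf-8")
compressed = write_gz_csv(header, rows)
restored = read_gz_csv(compressed)

saving = 100.0 * (1 - len(compressed) / len(raw_bytes))
print(f"raw: {len(raw_bytes)} bytes, compressed: {len(compressed)} bytes")
print(f"size reduction: {saving:.1f}%")
```

With a real dataset stored on disk, the same two calls to gzip.open can be given a file name such as "dataset.csv.gz" (a hypothetical name) instead of an in-memory buffer.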

 

See the document “DataPreparation_churn.doc” for more information about constructing a good dataset.