Appendix A: The compressed-CSV-file format

The compressed-CSV-file format is one of the most efficient ways to store a dataset that will be used inside TIMi Modeler. The abbreviation CSV means “comma-separated values”. TIMi Modeler natively reads datasets compressed in RAR, ZIP, and GZ. The preferred compression format is RAR.

An example: The following table

key	Is taxable income amount above 50K ?	age	education	marital stat	race	sex	country of birth	weeks worked in year
1	0	73	High school graduate	Widowed	White	F	USA	0
2	0	58	Some college but no degree	Divorced	White	M	USA	52
3	0	18	10th grade	Never married	Asian	F	Vietnam	0
4	0	9	Children	Never married	White	F	USA	0
5	0	10	Children	Never married	White	F	USA	0
6	0	48	Some college but no degree	Married-civilian	Indian	F	USA	52
7	0	42	Bachelors degree(BA AB BS)	Married-civilian	White	M	USA	52
8	1	28	High school graduate	Never married	White	F	USA	30
9	0	47	Some college but no degree	Married-civilian	White	F	USA	52
10	0	34	Some college but no degree	Married-civilian	White	M	USA	52
11	0	8	Children	Never married	White	F	USA	0
13	0	51	Some college but no degree	Married-civilian	White	M	USA	52
14	1	46	High school graduate	Divorced	White	F	Columbia	52
15	0	26	Bachelors degree(BA AB BS)	Never married	White	F	USA	52
16	0	13	Children	Never married	Black	F	USA	0
17	0	47	Bachelors degree(BA AB BS)	Never married	White	F	USA	52
18	0	39	10th grade	Married-civilian	White	F	Mexico	0
19	0	16	10th grade	Never married	White	F	USA	0
20	0	35	High school graduate	Married-civilian	White	M	USA	49

… is equivalent to a “.csv” file containing:

key#taxable income amount#age#education#marital stat#race#sex#country of birth#weeks worked in year
1#0#73#High school graduate#Widowed#White#F#USA#0
2#0#58#Some college but no degree#Divorced#White#M#USA#52
3#0#18#10th grade#Never married#Asian#F#Vietnam#0
4#0#9#Children#Never married#White#F#USA#0
5#0#10#Children#Never married#White#F#USA#0
6#0#48#Some college but no degree#Married-civilian#Indian#F#USA#52
7#0#42#Bachelors degree(BA AB BS)#Married-civilian#White#M#USA#52
8#1#28#High school graduate#Never married#White#F#USA#30
9#0#47#Some college but no degree#Married-civilian#White#F#USA#52
10#0#34#Some college but no degree#Married-civilian#White#M#USA#52
11#0#8#Children#Never married#White#F#USA#0
13#0#51#Some college but no degree#Married-civilian#White#M#USA#52
14#1#46#High school graduate#Divorced#White#F#Columbia#52
15#0#26#Bachelors degree(BA AB BS)#Never married#White#F#USA#52
16#0#13#Children#Never married#Black#F#USA#0
17#0#47#Bachelors degree(BA AB BS)#Never married#White#F#USA#52
18#0#39#10th grade#Married-civilian#White#F#Mexico#0
19#0#16#10th grade#Never married#White#F#USA#0
20#0#35#High school graduate#Married-civilian#White#M#USA#49

The first line contains the names of the columns. Each following line contains one row of the table. Within a line, each field is separated from the next field by a separator character. In the example above, the separator character is ‘#’. In a classical “.csv” file, the separator character is the comma (‘,’).
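To make the layout concrete, here is a minimal Python sketch that parses a few rows of the example with the standard "csv" module. It only illustrates the file layout and is not how TIMi Modeler itself reads the file. The names SAMPLE and read_rows are hypothetical, chosen for this illustration.

```python
import csv
import io

# A few rows copied from the example above; '#' is the separator character.
SAMPLE = """key#taxable income amount#age#education#marital stat#race#sex#country of birth#weeks worked in year
1#0#73#High school graduate#Widowed#White#F#USA#0
2#0#58#Some college but no degree#Divorced#White#M#USA#52
3#0#18#10th grade#Never married#Asian#F#Vietnam#0
"""


def read_rows(text, separator="#"):
    """Parse the text; the first line supplies the column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return list(reader)


rows = read_rows(SAMPLE)
print(rows[0]["education"])         # High school graduate
print(rows[2]["country of birth"])  # Vietnam
```

Passing separator="," instead would read a classical comma-separated file in the same way.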

After RAR compression, the size of the file “census-income.csv” is reduced to 4.1 MB. This is a compression ratio of more than 95% (dataset files are usually very easy to compress). TIMi Modeler is able to work directly with datasets in their compressed form (in other words, you never have to decompress your datasets yourself). The savings in hard drive space are substantial. If your dataset files are stored on a remote network drive, the compression mechanism of TIMi Modeler also substantially reduces the network bandwidth used when manipulating your datasets.
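As an illustration of producing and reading a compressed dataset file, the following Python sketch writes a ‘#’-separated file in GZ format (one of the formats listed above) and reads it back without creating an uncompressed copy on disk. GZ is used here only because the Python standard library has no RAR support. The functions write_gz_csv, read_gz_csv and space_saving, the file name “example.csv.gz” and the generated rows are all hypothetical and exist only for this sketch; it does not describe the internal mechanism of TIMi Modeler.

```python
import csv
import gzip
import os
import tempfile


def write_gz_csv(path, header, rows, separator="#"):
    """Write a header line and data rows to a gzip-compressed text file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=separator)
        writer.writerow(header)
        writer.writerows(rows)


def read_gz_csv(path, separator="#"):
    """Yield rows directly from the compressed file, decompressing on the fly."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        yield from csv.reader(f, delimiter=separator)


def space_saving(original_size, compressed_size):
    """Fraction of space saved by compression (0.95 means 95%)."""
    return 1.0 - compressed_size / original_size


# Synthetic rows, generated only to have something to compress.
header = ["key", "age", "sex"]
rows = [[str(i), str(20 + i % 50), "F" if i % 2 else "M"] for i in range(1, 1001)]

path = os.path.join(tempfile.mkdtemp(), "example.csv.gz")
write_gz_csv(path, header, rows)
loaded = list(read_gz_csv(path))
print(len(loaded), "lines read back from", os.path.getsize(path), "compressed bytes")
```

The space_saving helper expresses the kind of figure quoted above, under the assumption that “compression ratio” there means the fraction of space saved.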

See the document “DataPreparation_churn.doc” for more information about the construction of a good dataset.