Appendix A: The compressed-CSV-file format
The compressed-CSV-file format is one of the most efficient ways to store a dataset that will be used inside TIMi Modeler. The abbreviation CSV means “comma-separated values”. TIMi Modeler natively reads datasets compressed in the RAR, ZIP, and GZ formats. The preferred compression format is RAR.
An example: The following table
| key | Is taxable income amount above 50K ? | age | education | marital stat | race | sex | country of birth | weeks worked in year |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 73 | High school graduate | Widowed | White | F | USA | 0 |
| 2 | 0 | 58 | Some college but no degree | Divorced | White | M | USA | 52 |
| 3 | 0 | 18 | 10th grade | Never married | Asian | F | Vietnam | 0 |
| 4 | 0 | 9 | Children | Never married | White | F | USA | 0 |
| 5 | 0 | 10 | Children | Never married | White | F | USA | 0 |
| 6 | 0 | 48 | Some college but no degree | Married-civilian | Indian | F | USA | 52 |
| 7 | 0 | 42 | Bachelors degree(BA AB BS) | Married-civilian | White | M | USA | 52 |
| 8 | 1 | 28 | High school graduate | Never married | White | F | USA | 30 |
| 9 | 0 | 47 | Some college but no degree | Married-civilian | White | F | USA | 52 |
| 10 | 0 | 34 | Some college but no degree | Married-civilian | White | M | USA | 52 |
| 11 | 0 | 8 | Children | Never married | White | F | USA | 0 |
| 13 | 0 | 51 | Some college but no degree | Married-civilian | White | M | USA | 52 |
| 14 | 1 | 46 | High school graduate | Divorced | White | F | Columbia | 52 |
| 15 | 0 | 26 | Bachelors degree(BA AB BS) | Never married | White | F | USA | 52 |
| 16 | 0 | 13 | Children | Never married | Black | F | USA | 0 |
| 17 | 0 | 47 | Bachelors degree(BA AB BS) | Never married | White | F | USA | 52 |
| 18 | 0 | 39 | 10th grade | Married-civilian | White | F | Mexico | 0 |
| 19 | 0 | 16 | 10th grade | Never married | White | F | USA | 0 |
| 20 | 0 | 35 | High school graduate | Married-civilian | White | M | USA | 49 |
… is equivalent to a “.csv” file containing:
key#taxable income amount#age#education#marital stat#race#sex#country of birth#weeks worked in year
1#0#73#High school graduate#Widowed#White#F#USA#0
2#0#58#Some college but no degree#Divorced#White#M#USA#52
3#0#18#10th grade#Never married#Asian#F#Vietnam#0
4#0#9#Children#Never married#White#F#USA#0
5#0#10#Children#Never married#White#F#USA#0
6#0#48#Some college but no degree#Married-civilian #Indian#F#USA#52
7#0#42#Bachelors degree(BA AB BS)#Married-civilian #White#M#USA#52
8#1#28#High school graduate#Never married#White#F#USA#30
9#0#47#Some college but no degree#Married-civilian #White#F#USA#52
10#0#34#Some college but no degree#Married-civilian #White#M#USA#52
11#0#8#Children#Never married#White#F#USA#0
13#0#51#Some college but no degree#Married-civilian #White#M#USA#52
14#1#46#High school graduate#Divorced#White#F#Columbia#52
15#0#26#Bachelors degree(BA AB BS)#Never married#White#F#USA#52
16#0#13#Children#Never married#Black#F#USA#0
17#0#47#Bachelors degree(BA AB BS)#Never married#White#F#USA#52
18#0#39#10th grade#Married-civilian #White#F#Mexico#0
19#0#16#10th grade#Never married#White#F#USA#0
20#0#35#High school graduate#Married-civilian #White#M#USA#49
The first line contains the names of the columns. Within each line, every field is separated from the next one by a separator character. In the example above, the separator character is ‘#’. In a classical “.csv” file, the separator character is the comma (‘,’).
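The layout described above (a header line, then one record per line, with fields split on the separator character) can be illustrated with a minimal Python sketch. This is not how TIMi Modeler parses files internally; the function name `read_rows` and the in-memory `SAMPLE` text are assumptions made for the illustration, and only the first three data lines of the example are used.

```python
import csv
import io

# A few lines copied from the example above, held in memory for this sketch.
SAMPLE = """key#taxable income amount#age#education#marital stat#race#sex#country of birth#weeks worked in year
1#0#73#High school graduate#Widowed#White#F#USA#0
2#0#58#Some college but no degree#Divorced#White#M#USA#52
3#0#18#10th grade#Never married#Asian#F#Vietnam#0
"""


def read_rows(text, separator="#"):
    """Return (column_names, rows); each row is a dict keyed by column name.

    Hypothetical helper: the first line is treated as the column names and
    every following line as one record, split on the separator character.
    """
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    column_names = next(reader)  # first line: names of the columns
    rows = [dict(zip(column_names, fields)) for fields in reader]
    return column_names, rows


column_names, rows = read_rows(SAMPLE)
print(len(column_names))            # number of columns found in the header
print(rows[2]["country of birth"])  # field of the third record, looked up by name
```

Passing `separator=","` instead would read a classical comma-separated file in the same way.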
After RAR compression, the size of the file “census-income.csv” is reduced to 4.1 MB. This is a size reduction of more than 95% (dataset files are usually very easy to compress). TIMi Modeler is able to work directly with datasets in their compressed form (in other words, you never have to decompress your datasets yourself). The savings in hard drive space are substantial. If your dataset files are stored on a remote network drive, the compression mechanism of TIMi Modeler also substantially reduces the network bandwidth used when manipulating your datasets.
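The idea of working on a dataset directly in its compressed form can be sketched in Python. The sketch uses GZ rather than RAR, only because GZ support ships with the Python standard library; it says nothing about TIMi Modeler's own implementation. The helper names `to_csv_bytes` and `iter_rows_from_gz`, as well as the generated records, are assumptions made for the illustration and are not taken from “census-income.csv”.

```python
import csv
import gzip
import io


def to_csv_bytes(header, records, separator="#"):
    """Serialize a header and records into separator-delimited CSV bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=separator)
    writer.writerow(header)
    writer.writerows(records)
    return buffer.getvalue().encode("utf-8")


def iter_rows_from_gz(compressed, separator="#"):
    """Yield rows straight from GZ-compressed CSV bytes.

    The data is decompressed on the fly while it is being read; no
    uncompressed copy of the file is ever written to disk.
    """
    with gzip.open(io.BytesIO(compressed), "rt", newline="", encoding="utf-8") as handle:
        yield from csv.reader(handle, delimiter=separator)


# Made-up, highly repetitive records (an assumption for this sketch).
header = ["key", "age", "education"]
records = [[i, 30 + i % 40, "Some college but no degree"] for i in range(1, 1001)]

raw = to_csv_bytes(header, records)
compressed = gzip.compress(raw)
rows = list(iter_rows_from_gz(compressed))

print(len(raw), len(compressed))  # sizes in bytes before and after compression
print(len(rows) - 1)              # number of data rows read back
```

Because the rows are streamed out of the compressed data, only the compressed bytes would need to travel over a network drive, which is the bandwidth saving described above.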
See the document “DataPreparation_churn.doc” for more information about the construction of a good dataset.