<< Click to Display Table of Contents >> Navigation: »No topics above this level« 3. Directory Structure |
When working with TIMi Modeler, we suggest to adopt the following directory structure:
•A data lake: One “Central Dataset Repository” directory
The “dataset” directory is a central repository that will contain all the datasets that you will analyze. Dataset files are usually very large files (several gigabytes) and it’s better to avoid duplication of these files in various directories on your system. To prevent duplication, we use a central repository that contains the “one and only” copy of our dataset files.
If you are working on a distributed file system, you’ll most certainly have a “central-network-drive” shared by all the analysts. I suggest you to create the dataset repository on the shared drive to prevent duplication of the data. TIMi Modeler has been designed to minimize the load on the network and it will read your data files at the lowest possible frequency. Typically, once the first “read” is completed, the work happens locally which reduces the load from the central server.
Note that you can easily change this directory by simply selecting another one in the standard Mode Interface.
•One “working directory” for each different analysis
The first thing to do when starting a new analysis with TIMi Modeler is to create a working directory that will contain all the reports and, in general, all the results of the data mining process. The TIMi Suite installation software automatically creates the directory “{My_documents}\TIMi\ INCOME”, which will be used as “working directory” during the course of this tutorial.
By default, the “Central Dataset Repository” is the directory:
● C:\soft\TIMiPortable\AnatellaDemo\datasets |
when using the standard TIMi installer. |
● {My_documents}\TIMi\DATASETS |
when using the Wizard-based off-line installer. |
Here is a screenshot of our default demo “Central Dataset” Repository:
(It contains our demo-dataset: “Census-Income.rar”)
As you see in the sceenshot above, our “Central Dataset Repository” directory contains several “.rar” files. Each “.rar” file actually contains one compressed Text/CSV files (see appendix A about compressed CSV files). TIMi is able to read natively compressed CSV files: The compression formats that are supported are “.rar”, “.zip”, “.gz” (the compressed archive must always contain one unique Text file). When TIMi uses these file formats, it does NOT decompress the files on the Hard Drive: TIMi decompresses the dataset “on-the-fly” in central core memory, thus reducing:
-the load on the hard drive
-the hard-drive consumption required to do the analysis.
The “Census-Income” dataset has been prepared for you by an expert data miner so that it’s directly “ready to use” for predictive analytics purpose.
You can use Anatella (included inside the “TIMi suite”) to easily build such dataset. Technically, all we need is the creation of a target column, although many transformations are often required to generate high quality models. Please refer to the following document to have more information about the (many) data preparation steps and how to construct a good Dataset:
http://download.timi.eu/docs/DataPreparation_Churn.pdf
http://download.timi.eu/docs/DataPreparation_PropensityToBuy.pdf