Data Warehousing

A “Data warehouse” is simply a centralized place where you can find all the data of your company. Data Warehouse software are very often simple database: Oracle, Teradata, SQL Server, MySQL,…

You usually use a “Data warehouse” in combination with a “OLAP tool” or a “reporting tool”. Based on the content of the “Data warehouse”, the reporting tool produces “standard reports” like:

  • Evolution of your sales over the years, month by month
  • Evolution of the number of Acquired customers, month by month
  • Evolution of the number of “lost customers”, month by month

These classical questions (or requests) directly translates to SQL queries that are executed by the Data Warehouse system. The OLAP/Reporting tools only displays in a fancy way, the results computed by the Data Warehouse. Computing those results usually take a large amount of computing-time (during which the Data Warehouse system might become irresponsive). So, if you hired good consultants to create your Data Warehouse, they normally created inside your database a “small, special table” that already stores all the answers to these “tactical” questions (so that there is no need to re-compute all time the same answers to your different “tactical” questions: you only need to consult the “special table” that already contains the answer). This “special table” is re-computed & updated every week or so…

There are two major technologies when it comes to database

  1. Non-clustered databases(i.e. the old, standard approach).
    These databases are running on one unique “big” server.
    Pro: Lower acquisition & maintenance price.
    Con: Limited storage (i.e. a few Terabyte at max.) and limited Computing power (i.e. complex sql queries
    runs for a very long time and sometime do not execute at all).
  2. Distributed (or clustered) databases
    A clustered database is a database that is running simultaneously on many different servers connected together with high-speed data cables (e.g. infiny band).
    Pro: Potentiallly unimited storage and unlimited Computing power (i.e. If you need more computing power or
    more storage-space, you simply add new servers inside the cluster).
    Con: Higher acquisition & maintenance price.

The differences between all these database systems are:

Speed, Scalability, Versatility, Cost of Ownership

All major database systems are equally able to store vast amounts of information. For Non-clustered databases, the storage space is limited to a few Terabytes. A complete analysis of the advantages of each database system is out of the scope of this simple document.

Some database systems are more focused on “data warehouse” usage than others. Typically, in a “data warehouse”:

  • You only have infrequent “big updates”.
  • A large quantity of the submitted (SQL) queries are heavy “tactical queries” that require a large amount of computing power to execute (because they process a large amount of data).

Most of the time, the users are only performing queries. So, the databases that are optimized for “data warehouse purposes” have very fast queries. Since Distributed (or clustered) databases have more computing power, they are usually better suited for data warehousing purposes (because they can easily run the required tatical queries).

For more information on this very vast subject, see here:

Differences in results between tools

The differences in result between all the Non-clustered databases are:

  • None: given enough time and enough human resources, you can achieve similar results (similar ROI) with all theNon-clustered databases.

The differences in result between all the Distributed (or clustered) databases are:

  • None: given enough time and enough human resources, you can achieve similar results (similar ROI) with all theDistributed (or clustered) databases.

If you apply on the same tables the same set of transformations, you’ll obtain the same results whatever the datawarehouse tool used. This seems logical. The same is not true about Advanced Analytic Software: When working on the same dataset and using the same modeling methodology, you’ll still obtain different predictive models, with different ROI, depending of the modeling tool used. This difference can be very significative: On most cases, TIMi give 5 to 15% more ROI.

Although the differences between the different database tools are inexistant in terms of end-result, you can still observe large differences when it comes to: speed, cost of ownership, quality of the interface (“user-friendliness”), energy efficiency, etc.